Tutorial CWB
by Marco Baroni
Extracting Frequency Lists
Corpora used in this tutorial:
BNCV4 → The British National Corpus
REPUBBLICA2 → la Repubblica/SSLMIT Corpus
Basics
The most efficient way to extract frequency data from a CWB-encoded corpus is via the command line tool cwb-scan-corpus (recall that CWB is a toolkit).
As with any command line utility, you can use tab completion when typing the name of this command. Thus, if you type cwb-sc (or any longer prefix of the name) and then press the tab key, the terminal will complete the command name for you.
Remember that you can always press the up-arrow key to recall a previous command.
To stop a command while it is running, press the ctrl and c keys together.
To remove a file:
rm filename
and then type y when the system asks for confirmation. This is important because your account settings block file overwriting: if you create a file by mistake, you cannot create another file with the same name until you remove the old one.
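The overwrite protection described above behaves like the shell's noclobber option. Whether your account uses exactly this mechanism is an assumption, but the following sketch shows the effect in a scratch directory (the directory and the file contents are made up):

```shell
# Work in a scratch directory so that no real files are touched.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

set -o noclobber                 # assumed setting: ">" refuses to overwrite

echo "first attempt" > unigram.fq.txt

# A second redirection to the same name fails and leaves the file intact.
if (echo "second attempt" > unigram.fq.txt) 2>/dev/null; then
    overwrite_status="overwritten"
else
    overwrite_status="blocked"
fi
echo "second redirection: $overwrite_status"

rm -f unigram.fq.txt             # -f skips the confirmation prompt, if any
echo "retry" > unigram.fq.txt    # works again once the old file is gone
retry_content=$(cat unigram.fq.txt)
```

Once the old file has been removed, the same output name can be reused.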
Basic syntax
cwb-scan-corpus CORPUS query > output.file
The query can pertain to arbitrary combinations of positional and structural attributes.
As a simple example, you can collect lemma frequencies as follows:
cwb-scan-corpus BNCV4 lemma > unigram.fq.notsorted.txt
Notice that the output is not ordered by frequency. You can process the output data with your favourite program (e.g., a spreadsheet tool), or (better) directly on the command line, e.g.:
cwb-scan-corpus BNCV4 lemma | sort -nrk1 > unigram.fq.txt
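To see what sort -nrk1 does before running it on the full corpus, you can feed it a few made-up lines. This sketch assumes that cwb-scan-corpus writes one item per line, with the frequency in the first column, followed by a tab and the item. The sample counts are invented.

```shell
# Invented sample in the assumed output layout: frequency, TAB, lemma.
sample=$(printf '12\tcorpus\n350\tthe\n7\tlemma\n98\tbe\n')

# -n: compare numerically, -r: largest first, -k1: the key starts at column 1.
sorted=$(printf '%s\n' "$sample" | sort -nrk1)
printf '%s\n' "$sorted"

# Pick out the most frequent item, e.g. as a quick sanity check.
top=$(printf '%s\n' "$sorted" | head -n 1 | cut -f2)
echo "most frequent: $top"
```

Without -n, sort would compare the counts as strings, so that 98 would be ranked above 350.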
If you are interested in a sequence of tokens, you have to specify, for each query element, the position of the token it belongs to in the sequence (counting from 0). Thus, to look for bigram wordforms you can do:
cwb-scan-corpus BNCV4 word+0 word+1 | sort -nrk1 > bigram.fq.txt
The bigram query will be rather slow: press ctrl+c to interrupt it.
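Bigram lists also grow very large, and many entries occur only once or twice. One way to keep the result manageable is to filter the list by a minimum frequency with awk, as sketched here. The sample lines and the threshold of 5 are invented, and the three-column layout (frequency, first word, second word, separated by tabs) is an assumption about the output format.

```shell
# Invented sample in the assumed layout: frequency, TAB, word, TAB, word.
sample=$(printf '2\tgreen\tidea\n41\tof\tthe\n5\tprime\tminister\n1\tsleep\tfuriously\n')

min_freq=5   # hypothetical threshold; choose whatever suits your data

# Keep only lines whose first column is at least min_freq, then sort them.
frequent=$(printf '%s\n' "$sample" \
    | awk -F'\t' -v min="$min_freq" '$1 >= min' \
    | sort -nrk1)
printf '%s\n' "$frequent"

kept=$(printf '%s\n' "$frequent" | wc -l | tr -d ' ')
echo "bigrams kept: $kept"
```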
Exploiting morpho-syntactic information
The BNC tagset: http://sslmit.unibo.it/~baroni/collocazioni/bnctagset.txt
The la Repubblica tagset: http://sslmit.unibo.it/~baroni/collocazioni/itwac.tagset.txt
POS tags are often used to specify “constraints”: query elements prefixed by a question mark state conditions that the matches must meet, but their values are not part of the output.
For example, the following command extracts all noun-noun lemma bigrams from the BNC (more on the “.*” syntax later):
cwb-scan-corpus BNCV4 ?pos+0=/NN.*/ lemma+0 ?pos+1=/NN.*/ lemma+1 | sort -nrk1 > nn.fq.list.txt
The modifiers of compounds headed by “donna” in la Repubblica:
cwb-scan-corpus REPUBBLICA2 ?lemma+0=/donna/ ?pos+1=/NOUN/ lemma+1 | sort -nrk1 > donnaN.txt
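To make the difference between constraints and output keys concrete, the following sketch imitates the noun-noun query on a tiny hand-made token table with awk. It only illustrates what such a query counts, not how cwb-scan-corpus works internally. The table (one token per line: word, POS tag, lemma) is invented.

```shell
# Invented mini-corpus, one token per line: word TAB pos TAB lemma.
tokens=$(printf '%s\t%s\t%s\n' \
    the AT0 the  health NN1 health  services NN2 service  are VBB be \
    the AT0 the  health NN1 health  service NN1 service  works VVZ work)

nn_bigrams=$(printf '%s\n' "$tokens" | awk -F'\t' '
    # The POS tests play the role of ?pos+0=/NN.*/ and ?pos+1=/NN.*/:
    # they decide whether a pair is counted, but are not printed.
    prev_pos ~ /^NN.*$/ && $2 ~ /^NN.*$/ { count[prev_lemma "\t" $3]++ }
    # Remember the current token: it is position +0 for the next line.
    { prev_pos = $2; prev_lemma = $3 }
    # Only the lemmas (the keys without a question mark) are output.
    END { for (pair in count) print count[pair] "\t" pair }
' | sort -nrk1)
printf '%s\n' "$nn_bigrams"
```

Because the output key is the lemma rather than the word form, "health services" and "health service" are counted together.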
Regular expressions
A fragment of the rich regular expression syntax supported by CWB:
. | Any character |
* | 0 or more of the preceding character |
+ | 1 or more of the preceding character |
? | 0 or 1 of the preceding character |
[aeiou] | Character must be from this set |
[a-zA-Z] | Same, with range notation |
[^aeiouàèéìòóù] | Character must NOT be from list |
What can the last regular expression above be used for?
Nominal lemmas ending in -ment in the BNC:
cwb-scan-corpus BNCV4 lemma=/.*ment/ ?pos=/NN.*/ | sort -nrk1 > ment.fq.txt
What does the "NN.*" syntax mean, and why do I use it?
Nominal lemmas ending in -ment, with at least two vowels occurring before ment:
cwb-scan-corpus BNCV4 lemma=/.*[aeiou].*[aeiou].*ment/ ?pos=/NN.*/ | sort -nrk1 > ment.2vowels.fq.txt
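A pattern can be tried out on a handful of words before it is run on the whole corpus. The sketch below uses grep for this. In CWB a pattern has to match the entire attribute value, which is what grep's -x option enforces; -E selects extended regular expressions. grep's regex dialect is not identical to CWB's, but the operators listed in the table are assumed to behave the same way in both. The word list is invented.

```shell
lemmas=$(printf '%s\n' government moment payment agreement mental element)

# Pattern 1: anything ending in "ment" ("mental" is rejected by -x).
ends_in_ment=$(printf '%s\n' "$lemmas" | grep -Ex '.*ment')

# Pattern 2: the same, with at least two vowels somewhere before "ment".
two_vowels=$(printf '%s\n' "$lemmas" | grep -Ex '.*[aeiou].*[aeiou].*ment')

printf 'ends in -ment:\n%s\n\n' "$ends_in_ment"
printf 'two vowels before -ment:\n%s\n' "$two_vowels"
```

Words such as "moment" and "payment" pass the first pattern but not the second, since only one character from the vowel set precedes "ment".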
Look for all the nouns (lemmas) ending in izzazione in la Repubblica.
Now, look for (and count) lemmas with the following characteristics:
* begin with de
* have one optional dash after de
* end in izzare
* have at least one vowel between de and izzare
An example of the usage of the negation symbol: looking for candidate N+N compounds in the BNC.
What does the following do?
cwb-scan-corpus BNCV4 ?pos+0=/[^N].*/ ?pos+1=/NN.*/ lemma+1 ?pos+2=/NN.*/ lemma+2 ?pos+3=/[^N].*/ | sort -nrk1 > nn.fq.txt
Structural attributes
BNC
Some of the BNC structural attributes, with their values:
text_domain: S_Demog_AB, S_Demog_C1, S_Demog_C2, S_Demog_DE, S_Demog_Unclassified, S_cg_business, S_cg_education, S_cg_leisure, S_cg_public_instit, W_app_science, W_arts, W_belief_thought, W_commerce, W_imaginative, W_leisure, W_nat_science, W_soc_science, W_world_affairs
text_mode: S, W
text_author_sex: —, Female, Male, Mixed, Unknown
text_interaction_type: —, Dialogue, Monologue
Repubblica
Some of the la Repubblica structural attributes with their values:
article_author: author name
article_gen: news and commento
article_top: chiesa, cronaca, cultura, economia, meteo, politica, scienze, scuola, società, sport, NOCAT
article_year: 1985-2000
Frequency of parts of speech in spoken and written English:
cwb-scan-corpus BNCV4 pos ?text_mode=/W/ > pos_dist.written.txt
cwb-scan-corpus BNCV4 pos ?text_mode=/S/ > pos_dist.spoken.txt
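Because the written part of the BNC is much larger than the spoken part, raw counts from the two lists are not directly comparable. A possible next step, sketched here, is to convert each list to percentages with awk. The counts below are invented, and the two-column layout (frequency, tab, POS tag) is an assumption about the output format.

```shell
# Invented counts standing in for one of the POS frequency lists.
pos_counts=$(printf '50\tNN1\n30\tAT0\n20\tVVD\n')

percentages=$(printf '%s\n' "$pos_counts" | awk -F'\t' '
    # The total is only known at the end, so the lines are stored first.
    { freq[NR] = $1; tag[NR] = $2; total += $1 }
    END {
        for (i = 1; i <= NR; i++)
            printf "%s\t%.1f\n", tag[i], 100 * freq[i] / total
    }
')
printf '%s\n' "$percentages"
```

In real use, the printf that creates the sample would be replaced by cat applied to the file produced by the query.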
Repeat the de...izzare query in la Repubblica, this time looking at occurrences in sports articles only.
Now, look for the distribution of de…izzare verbs across years (i.e., the output file should contain frequencies for pairs of de…izzare verb + year). In this case, it is probably more sensible to sort by the year column, rather than by frequency.
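Sorting on a column other than the first requires telling sort which field to use. The sketch below assumes an output with three tab-separated columns (frequency, lemma, year) in that order, which in practice depends on the order of the keys in your query. The lemmas verb_a and verb_b are placeholders and the counts are invented.

```shell
# Invented sample: frequency TAB lemma TAB year.
rows=$(printf '3\tverb_a\t1992\n9\tverb_b\t1985\n5\tverb_a\t1985\n1\tverb_b\t1992\n')

tab=$(printf '\t')   # a literal tab character, used as the field separator

# Primary key: column 3 (year), ascending and numeric.
# Secondary key: column 1 (frequency), numeric and descending.
by_year=$(printf '%s\n' "$rows" | sort -t "$tab" -k3,3n -k1,1nr)
printf '%s\n' "$by_year"
```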
Running cwb-scan-corpus in the background
A cwb-scan-corpus query can be quite time-consuming. Fortunately, you can use the standard Unix syntax “nohup … &” to let the program run in the background. When the query is a pipeline, wrap it in sh -c so that nohup covers sort as well as cwb-scan-corpus.
For example:
nohup sh -c 'cwb-scan-corpus BNCV4 word+0 word+1 | sort -nrk1 > bigram.fq.txt' &
Now, check that the program is active with:
ps x
At this point, you can quit your ssh session (which, for most of you, will mean closing PuTTY) and check whether the output file is ready when you get home (again, you can use “ps x” to see whether the query is still running)!
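The background pattern can be tried out safely with stand-in commands before committing to a long query. nohup protects only the command it is given directly, so the sketch below hands a whole pipeline to it through sh -c. The scratch directory, the file name and the sleep that imitates a slow query are all made up. In real use the quoted string would contain the cwb-scan-corpus pipeline.

```shell
scratch=$(mktemp -d)

# Stand-in for a slow corpus query. The whole pipeline runs inside one
# sh -c, so nohup shields every stage, including sort and the redirection.
nohup sh -c "sleep 1; printf '2\tb\n9\ta\n' | sort -nrk1 > '$scratch/out.txt'" \
    > /dev/null 2>&1 &
job_pid=$!
echo "started background job $job_pid"

# wait is used here only so that the sketch can show the result;
# in real use you would log out and look at the file later.
wait "$job_pid"
result=$(cat "$scratch/out.txt")
printf '%s\n' "$result"
```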
For more information on the Unix command line, please take a look at:
http://sslmit.unibo.it/~baroni/compling04f/compling_materials.html
and in particular at the handouts Unix per Linguisti and Unix for Linguists Quick Reference, and at the Unix for Poets tutorial by Ken Church.