Tutorial CWB

by Marco Baroni

Corpora used in this tutorial:

BNCV4 → The British National Corpus
REPUBBLICA2 → la Repubblica/SSLMIT Corpus

The most efficient way to extract frequency data from a CWB-encoded corpus is via the command line tool cwb-scan-corpus (recall that CWB is a toolkit).

As with any command line utility, you can use tab completion when typing the name of this command. Thus, if you type cwb-sc (or any longer prefix of the name) and then press the tab key, the terminal will complete the command name for you.

Remember that you can always use the arrow pointing upwards to recall a previous command.

To stop a command while it is running, press the ctrl and c keys together.

To remove a file:

rm filename

Then type y when the system asks for confirmation. This matters because your account settings block file overwriting: if you create a file by mistake, you cannot create another file with the same name until you remove the old one.
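The overwrite block is presumably the shell's noclobber option; this is an assumption, since the tutorial does not name the setting. A minimal sketch of how that option behaves, using a hypothetical scratch file in a temporary directory:

```shell
# Assumption: the accounts use the shell's noclobber option.
tmpdir=$(mktemp -d)                 # scratch directory, so nothing real is touched
set -o noclobber                    # refuse to overwrite existing files with >

echo first > "$tmpdir/out.txt"      # works: the file does not exist yet

# A second redirection to the same name is refused while the file exists.
if (echo second > "$tmpdir/out.txt") 2>/dev/null; then
    overwrite=allowed
else
    overwrite=blocked
fi
echo "second write: $overwrite"

rm -f "$tmpdir/out.txt"             # remove the old file first...
echo third > "$tmpdir/out.txt"      # ...and the name can be reused
content=$(cat "$tmpdir/out.txt")
echo "file now contains: $content"

set +o noclobber                    # demo only: skip this line to keep the protection
rm -rf "$tmpdir"
```

If a redirection fails with a message about an existing file, removing the old file and rerunning the command is usually all that is needed.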

Basic syntax

cwb-scan-corpus CORPUS query > output.file

The query can pertain to arbitrary combinations of positional and structural attributes.

As a simple example, you can collect lemma frequencies as follows:

cwb-scan-corpus BNCV4 lemma > unigram.fq.notsorted.txt

Notice that output is not ordered by frequency. You can manage the output data with your favourite program (e.g., a spreadsheet tool), or (better) directly on the command line, e.g.:

cwb-scan-corpus BNCV4 lemma | sort -nrk1 > unigram.fq.txt
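To see what sort -nrk1 does, here is a small sketch on made-up data. It assumes, as the sort command above implies, that each output line starts with the frequency, followed by a tab and the item; the three lines are hypothetical, not real BNC counts.

```shell
# Hypothetical frequency list: count, tab, item (not real corpus data).
sample=$(printf '12\tcat\n305\tthe\n47\tdog\n')

# -n: compare numerically, -r: reverse (largest first), -k1: key starts at column 1
sorted=$(printf '%s\n' "$sample" | sort -nrk1)
printf '%s\n' "$sorted"

# The most frequent items are now at the top, so head shows the top of the list.
top=$(printf '%s\n' "$sorted" | head -n 1)
```

Without -n, sort would compare the counts as text, and 47 would be placed before 305.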

If you are interested in a sequence of tokens, you have to specify, for each query element, the position of the token it belongs to in the sequence (counting from 0). Thus, to look for bigram wordforms you can do:

cwb-scan-corpus BNCV4 word+0 word+1 | sort -nrk1 > bigram.fq.txt
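The idea of counting a token together with its right neighbour can be imitated with standard Unix tools on a toy token list. This sketch does not use CWB at all; the five tokens are hypothetical and stand in for a corpus with one token per line.

```shell
# Toy "corpus": one token per line (hypothetical data).
tokens=$(printf '%s\n' the cat saw the cat)

# Pair each token (offset +1) with the one before it (offset +0),
# then count identical pairs and sort by frequency.
bigrams=$(printf '%s\n' "$tokens" \
    | awk 'NR > 1 { print prev "\t" $0 } { prev = $0 }' \
    | sort | uniq -c | sort -nrk1)
printf '%s\n' "$bigrams"
```

The pair "the cat" occurs twice and ends up on the first line; the other two pairs occur once each.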

The previous command will be rather slow: press ctrl+c to interrupt it.

The BNC tagset: http://sslmit.unibo.it/~baroni/collocazioni/bnctagset.txt

The la Repubblica tagset: http://sslmit.unibo.it/~baroni/collocazioni/itwac.tagset.txt

POS tags are often used to specify “constraints”, i.e., query elements prefixed by a question mark. A constraint is a condition that the matches must meet, but its value will not be part of the output.

For example, in the following way we extract all bigram noun-noun lemma sequences from the BNC (more on the “.*” syntax later):

cwb-scan-corpus BNCV4 ?pos+0=/NN.*/ lemma+0 ?pos+1=/NN.*/ lemma+1 | sort -nrk1 > nn.fq.list.txt
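The difference between a constraint and an output element can be illustrated on a toy tagged text with awk. This is only an imitation of the logic, not of CWB itself: the lemma/tag pairs are hypothetical, and the awk pattern ^NN stands in for the constraint /NN.*/.

```shell
# Toy tagged text: lemma, tab, POS tag (hypothetical data).
corpus=$(printf 'the\tAT0\ncar\tNN1\npark\tNN1\nbe\tVBZ\nfull\tAJ0\ncar\tNN1\npark\tNN1\nattendant\tNN1\n')

# The POS tags are only tested (like ?pos+0 and ?pos+1);
# the two lemmas are what gets printed and counted (like lemma+0 and lemma+1).
nn=$(printf '%s\n' "$corpus" \
    | awk -F'\t' 'NR > 1 && ppos ~ /^NN/ && $2 ~ /^NN/ { print plemma "\t" $1 }
                  { plemma = $1; ppos = $2 }' \
    | sort | uniq -c | sort -nrk1)
printf '%s\n' "$nn"
```

The result lists "car park" with count 2 and "park attendant" with count 1; the tags themselves never appear in the output.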

The modifiers of compounds headed by “donna” in la Repubblica:

cwb-scan-corpus REPUBBLICA2 ?lemma+0=/donna/ ?pos+1=/NOUN/ lemma+1 | sort -nrk1 > donnaN.txt

A fragment of the rich regular expression syntax supported by CWB:

.	Any character
*	0 or more of the preceding character
+	1 or more of the preceding character
?	0 or 1 of preceding character
[aeiou]	Character must be from this set
[a-zA-Z]	Same, with range notation
[^aeiouàèéìòóù]	Character must NOT be from list

What can the last regular expression above be used for?
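Some of these operators can be tried out without a corpus. The sketch below uses grep -E with the -x flag, which requires the whole line to match; this assumes that a CWB pattern must likewise cover the whole attribute value, which is how the patterns in this tutorial are written (e.g. NN.*). grep's syntax is only an approximation of CWB's, and check is a hypothetical helper name.

```shell
# check PATTERN STRING: prints yes if the whole STRING matches PATTERN, else no.
check() {
    if printf '%s\n' "$2" | LC_ALL=C grep -Eqx "$1"; then
        echo yes
    else
        echo no
    fi
}

r1=$(check 'colou?r' color)     # ? makes the preceding u optional
r2=$(check 'go+d' gd)           # + needs at least one o
r3=$(check 'b[aeiou]t' bit)     # exactly one character from the set
r4=$(check 'b.*t' bt)           # .* may match nothing at all
r5=$(check 'b.t' boot)          # . matches exactly one character
echo "$r1 $r2 $r3 $r4 $r5"
```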

Nominal lemmas ending in -ment in the BNC:

cwb-scan-corpus BNCV4 lemma=/.*ment/ ?pos=/NN.*/ | sort -nrk1 > ment.fq.txt

What does the "NN.*" syntax mean, and why do I use it?

Nominal lemmas ending in -ment, with at least two vowels occurring before ment:

cwb-scan-corpus BNCV4 lemma=/.*[aeiou].*[aeiou].*ment/ ?pos=/NN.*/ | sort -nrk1 > ment.2vowels.fq.txt
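Before running a long query, a pattern like this can be checked on a handful of words. A sketch with grep -Ex (whole-line matching, assumed here to behave like CWB's matching of the whole lemma); the word list is hypothetical test data.

```shell
# A few test words, one per line (hypothetical data).
words=$(printf '%s\n' government development element moment cement mental)

# Any word ending in -ment.
ends_in_ment=$(printf '%s\n' "$words" | grep -Ex '.*ment')

# At least two vowels somewhere before the final -ment.
two_vowels=$(printf '%s\n' "$words" | grep -Ex '.*[aeiou].*[aeiou].*ment')
printf '%s\n' "$two_vowels"
```

Here "moment" and "cement" pass the first pattern but not the second, because only one vowel precedes the final -ment; "mental" passes neither, because it does not end in -ment.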

Look for all the nouns (lemmas) ending in izzazione in la Repubblica.

Now, look for (and count) lemmas with the following characteristics:

  * begin with de
  * have one optional dash after de
  * end in izzare
  * have at least one vowel between de and izzare

An example of the usage of the negation symbol: looking for candidate N+N compounds in the BNC.

What does the following do?

cwb-scan-corpus BNCV4 ?pos+0=/[^N].*/ ?pos+1=/NN.*/ lemma+1 ?pos+2=/NN.*/ lemma+2 ?pos+3=/[^N].*/ | sort -nrk1 > nn.fq.txt

BNC

Some of the BNC structural attributes, with their values:

text_domain: S_Demog_AB, S_Demog_C1, S_Demog_C2, S_Demog_DE, S_Demog_Unclassified, S_cg_business, S_cg_education, S_cg_leisure, S_cg_public_instit, W_app_science, W_arts, W_belief_thought, W_commerce, W_imaginative, W_leisure, W_nat_science, W_soc_science, W_world_affairs

text_mode: S, W

text_author_sex: —, Female, Male, Mixed, Unknown

text_interaction_type: —, Dialogue, Monologue

Repubblica

Some of the la Repubblica structural attributes with their values:

article_author: author name

article_gen: news, commento

article_top: chiesa, cronaca, cultura, economia, meteo, politica, scienze, scuola, società, sport, NOCAT

article_year: 1985-2000

Frequency of parts of speech in spoken and written English:

cwb-scan-corpus BNCV4 pos ?text_mode=/W/ > pos_dist.written.txt

cwb-scan-corpus BNCV4 pos ?text_mode=/S/ > pos_dist.spoken.txt
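The written part of the BNC is much larger than the spoken part, so the raw counts in the two files are not directly comparable. One possible way to compare them is to turn each list into percentages. The sketch below does this with awk on made-up numbers; it assumes lines of the form count, tab, tag, and the counts shown are hypothetical.

```shell
# Hypothetical POS counts in the assumed format: count, tab, tag.
written=$(printf '600\tNN1\n300\tAJ0\n100\tVVD\n')

# Sum all counts, then print each tag with its share of the total in percent.
rel=$(printf '%s\n' "$written" \
    | LC_ALL=C awk -F'\t' '{ n[$2] = $1; total += $1 }
          END { for (t in n) printf "%s\t%.1f\n", t, 100 * n[t] / total }' \
    | sort)
printf '%s\n' "$rel"
```

Running the same command on both files gives two percentage lists that can be read side by side.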

Repeat the de...izzare query in la Repubblica, this time looking at occurrences in sports articles only.

Now, look for the distribution of de…izzare verbs across years (i.e., the output file should contain frequencies for pairs of de…izzare verb + year). In this case, it is probably more sensible to sort by the year column, rather than by frequency.
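Sorting by a column other than the first needs an explicit field separator and key. A sketch on made-up lines, assuming the output has the frequency in column 1, the verb in column 2 and the year in column 3 (the actual column order depends on the order of the elements in your query):

```shell
# Hypothetical output lines: count, tab, verb, tab, year.
rows=$(printf '3\tdepenalizzare\t1999\n7\tdemoralizzare\t1987\n2\tdepenalizzare\t1987\n')

tab=$(printf '\t')    # a literal tab, to use as field separator

# Sort by year (column 3, ascending); within a year, by count (column 1, descending).
by_year=$(printf '%s\n' "$rows" | sort -t "$tab" -k3,3n -k1,1nr)
printf '%s\n' "$by_year"
```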

A cwb-scan-corpus query can be quite time consuming. Fortunately, you can use the standard Unix syntax “nohup … &” to let the program run in the background.

For example:

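nohup sh -c 'cwb-scan-corpus BNCV4 word+0 word+1 | sort -nrk1 > bigram.fq.txt' &

The whole pipeline is wrapped in sh -c because nohup protects only the command it directly precedes; without the wrapper, the sort step could still be terminated when you log out.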

Now, check that the program is active with:

ps x

At this point, you can quit your ssh session (which, for most of you, will mean closing PuTTY) and check whether the output file is ready when you get home (again, you can use “ps x” to see whether the query has finished).
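The background mechanism can be tried out safely with a short stand-in job before using it on a real query. In this sketch a one-second sleep plays the role of the slow query; the file names are hypothetical and live in a temporary directory.

```shell
tmpdir=$(mktemp -d)
outfile="$tmpdir/result.txt"

# Stand-in for a slow query: wait a second, then write one line of output.
nohup sh -c 'sleep 1; echo finished' > "$outfile" 2>/dev/null &
pid=$!                          # process ID of the background job

# kill -0 sends no signal; it only tests whether the process still exists.
if kill -0 "$pid" 2>/dev/null; then
    echo "job $pid is still running"
fi

wait "$pid"                     # only for this demo: block until the job ends
result=$(cat "$outfile")
echo "output file contains: $result"
rm -rf "$tmpdir"
```

With a real query you would not call wait; you would log out and look at the output file later.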

For more information on the Unix command line, please take a look at:

http://sslmit.unibo.it/~baroni/compling04f/compling_materials.html

In particular, see the handouts Unix per Linguisti and Unix for Linguists Quick Reference, and the Unix for Poets tutorial by Ken Church.