by Marco Baroni
Invoke CQP (press enter/return key after this and all other commands):
cqp -e
or:
cqp -eC
Exit from CQP:
exit
(If you did that, please start CQP again!)
List the available corpora:
show corpora;
While in CQP, keep in mind that some things work as they do in a Unix terminal: in particular, you can recall previous commands with the up arrow, and you navigate the kwic results with more/less-like keys (space to move to the next page, q to quit, etc.)
Select corpus (remember the semi-colon at the end of each command), e.g.:
BNCV4; REPUBBLICA; etc.
(Notice that tab completion works for corpus names.)
Let's stick to the BNC, for now.
A quick way to know how many tokens there are in a corpus:
info;
Simple kwic:
"food"; "food" %c; "good" "food";
If you have problems seeing accented characters (as in vowels with umlaut in German or with accents in Italian and Spanish), try:
set Pager more;
You can move through the kwic results like with a standard Unix pager: space to see next page, b to go back one page, q to exit kwic display.
Whenever q does not work, use ctrl+c to interrupt any command.
To see the number of hits returned by your last query:
size Last;
If you have too many results, it is a good idea to look at a random sample. First, “save” the query into a variable:
A = "often";
Then, “reduce” A to the desired number of randomly selected contexts, e.g.:
reduce A to 20;
Finally, take a look at these contexts:
cat A;
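Two refinements that may be handy here, sketched on the assumption that your CQP version behaves as described in the standard CQP tutorial: reduce overwrites A, so you may want to work on a copy if the full result is still needed, and you can set the random seed if the sample has to be reproducible. The name B and the seed 42 are arbitrary choices for illustration.

```
A = "often";
B = A;
randomize 42;
reduce B to 20;
size A;
size B;
cat B;
```

Here size A should still report all hits, while size B reports 20. A percentage, as in reduce B to 10%;, should also be accepted in place of an absolute number.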
Change context size:
set Context 60;
set Context 5 words;
set Context s;
set Context 3 s;
set Context default;
Other visualization options:
show +pos;
show +lemma;
show -pos -lemma;
show -cpos;
set PrintStructures text_domain;
set PrintStructures "";
Doing queries using morphosyntactic annotation (if you've been experimenting with show and set, now it's a good moment to go back to a normal-looking kwic-display):
[word = "obsessive"] [pos = "NN.*"];
[word = "obsessive" %c] [pos = "NN.*"];
[word = "cause"];
[lemma = "cause"];
[lemma = "cause" & pos = "VV.*"];
Practice time:
- look for candidate N+N compounds in Italian with "donna" as head (at the lemma level)
- select a random sample of 100 hits
No need to try the following now, but here is how you can extract a frequency list for a collocate extracted from a “flexible” context (from the BNC, in this specific case):
[lemma = "cause" & pos = "VV.*"] [pos = "AT0"]? [pos = "AJ.*"]* [pos = "NN.*"];
count by lemma on matchend;
This is something you cannot do with cwb-scan-corpus.
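A similar frequency list can probably also be obtained with the group command, which tabulates the values of an attribute at a given anchor point. A minimal sketch, assuming the query result is first stored under the arbitrary name C:

```
C = [lemma = "cause" & pos = "VV.*"] [pos = "AT0"]? [pos = "AJ.*"]* [pos = "NN.*"];
group C matchend lemma;
```

The output should be a table of the noun lemmas found in the last position of the match, each with its frequency.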
Nouns ending in -izzazione (in la Repubblica):
A = [lemma = ".*izzazione" & pos = "NOUN"];
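The leading .* matters because CQP regular expressions have to match the whole attribute value, not just a part of it. A few variations on the query above may illustrate this (the prefix "organ" is an arbitrary example, not something taken from the corpus documentation):

```
[lemma = "izzazione"];
[lemma = ".*izzazione" & pos = "NOUN"];
[lemma = "organ.*" & pos = "NOUN"];
```

The first query only matches tokens whose lemma is exactly "izzazione" (most likely none), the second matches any noun lemma with that ending, and the third matches noun lemmas that begin with "organ".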
The lemma “opportunist” used by women and men in the BNC:
[lemma="opportunist"] :: match.text_author_sex="Female";
[lemma="opportunist"] :: match.text_author_sex="Male";
Save the results to an output file:
cat Last > "myconc.txt";
or, if you saved results in a variable:
cat A > "myconc.txt";
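Note that > overwrites an existing file. To add lines to a file instead, CQP should also accept the shell-style >> operator; a small sketch, reusing the file name from above:

```
cat A >> "myconc.txt";
```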
Practice:
- save 100 random concordance lines for the donna+NOUN pattern in an external text file, with extended context (e.g., a 3 sentence window)
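One possible way to tackle the donna+NOUN practice tasks, offered as a sketch rather than a model answer. It assumes that the la Repubblica corpus is available under the name REPUBBLICA and that nouns carry the NOUN tag used above; the variable name D and the file name are arbitrary.

```
REPUBBLICA;
D = [lemma = "donna"] [pos = "NOUN"];
size D;
reduce D to 100;
set Context 3 s;
cat D > "donna-noun.txt";
set Context default;
```

The last command simply restores the default context size, so that later kwic displays look normal again.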
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPTutorial/html
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPTutorial/cqp-tutorial.pdf
Serge Sharoff's Internet Corpora: http://corpus.leeds.ac.uk/internet.html
CucWeb: http://ramsesii.upf.es/cgi-bin/cucweb/search-form.pl?lang=en_US
SSLMITDev: http://sslmitdev-online.sslmit.unibo.it/corpora/corpora.php
The Word Sketch Engine uses a syntax that is almost identical to that of CQP!