====== Looking at KWIC lines with CQP ======
by Marco Baroni
===== The First Session =====
Invoke CQP with command-line editing enabled (press the enter/return key after this and all other commands):
cqp -e
or, to also switch on colour highlighting:
cqp -eC
Exit from CQP:
exit
(If you did that, please start CQP again!)
List the available corpora:
show corpora;
While in CQP, keep in mind that some things work as they do on the Unix terminal: in particular, you can recall previous commands with the up arrow key, and you navigate the kwic results with more/less-like keys (space to move to the next page, q to quit, etc.).
Select a corpus (remember the semicolon at the end of each command), e.g.:
BNCV4;
REPUBBLICA;
etc.
(Notice that tab completion works for corpus names.)
Let's stick to the BNC, for now.
A quick way to find out how many tokens the current corpus contains:
info;
Simple kwic (the %c flag makes the match case-insensitive):
"food";
"food" %c;
"good" "food";
If you have problems seeing accented characters (as in vowels with umlaut in German or with accents in Italian and Spanish), try:
set Pager more;
You can move through the kwic results as with a standard Unix pager: **space** to see the next page, **b** to go back one page, **q** to exit the kwic display.
Whenever **q** does not work, use **ctrl+c** to interrupt any command.
To see the number of hits returned by your last query:
size Last;
If you have too many results, it is a good idea to take a look at a random sample. First, "save" the query result into a variable:
A = "often";
Then, "reduce" A to the desired number of randomly selected contexts, e.g.:
reduce A to 20;
Finally, take a look at these contexts:
cat A;
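Keep in mind that reduce overwrites A with the sample, and that a different sample is drawn every time. A minimal sketch of a repeatable variant that also keeps the full result, assuming your CQP version provides the randomize command (the names Often and Sample and the seed 42 are arbitrary choices for illustration):

```cqp
# fix the random seed so that the same sample is drawn again next time
randomize 42;
Often = "often";
# work on a copy, so that the full result stays available
Sample = Often;
reduce Sample to 20;
size Often;
size Sample;
cat Sample;
```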
Change context size:
set Context 60;
set Context 5 words;
set Context s;
set Context 3 s;
set Context default;
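Left and right context can also be set separately. A small sketch, assuming the LeftContext and RightContext options are available in your CQP version (the sizes chosen here are arbitrary):

```cqp
# 30 characters to the left, one full sentence to the right
set LeftContext 30;
set RightContext 1 s;
"food";
# back to the standard kwic window
set Context default;
```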
Other visualization options:
show +pos;
show +lemma;
show -pos -lemma;
show -cpos;
set PrintStructures text_domain;
set PrintStructures "";
===== Exploiting morpho-syntactic annotation =====
Queries can also exploit the morphosyntactic annotation (if you've been experimenting with show and set, now is a good moment to go back to a normal-looking kwic display):
[word = "obsessive"] [pos = "NN.*"];
[word = "obsessive" %c] [pos = "NN.*"];
[word = "cause"];
[lemma = "cause"];
[lemma = "cause" & pos = "VV.*"];
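Conditions inside the square brackets can also be negated or combined with alternatives. A few hedged variations on the queries above, assuming the same BNC tagset, where noun tags start with NN (the lemma "effect" is only an illustrative choice):

```cqp
# "cause" used as a noun
[lemma = "cause" & pos = "NN.*"];
# "cause" in any use other than a lexical verb
[lemma = "cause" & pos != "VV.*"];
# either of two lemmas, via regular-expression alternation
[lemma = "cause|effect" & pos = "NN.*"];
```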
Practice time:
- look for candidate N+N compounds in Italian with "donna" as head (at the lemma level)
- select a random sample of 100 hits
No need to try the following now, but here is how you can extract a frequency list for a collocate found in a "flexible" context, i.e. one with optional and repeated tokens (from the BNC, in this specific case):
[lemma = "cause" & pos = "VV.*"][pos="AT0"]?[pos="AJ.*"]*[pos="NN.*"];
count by lemma on matchend;
This is something you **cannot** do with cwb-scan-corpus.
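With optional and repeated tokens, a match can in principle run across a sentence boundary. A possible refinement, assuming the corpus is annotated with an s region for sentences (as the set Context s command used earlier suggests), is to add a within constraint:

```cqp
# same pattern, but the whole match has to fall inside one sentence
[lemma = "cause" & pos = "VV.*"] [pos = "AT0"]? [pos = "AJ.*"]* [pos = "NN.*"] within s;
count by lemma on matchend;
```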
===== Regular Expressions =====
Nouns ending in -izzazione (in the la Repubblica corpus):
A = [lemma = ".*izzazione" & pos = "NOUN"];
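CQP regular expressions have to match the whole attribute value, which is why the pattern starts with .* (a bare "izzazione" would only match that exact string). A sketch of the full sequence, assuming the la Repubblica corpus is installed as REPUBBLICA and tags nouns as NOUN, as in the examples on this page:

```cqp
# switch to the Italian corpus first
REPUBBLICA;
A = [lemma = ".*izzazione" & pos = "NOUN"];
# number of hits
size A;
# frequency list of the matching lemmas
count A by lemma;
```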
===== Structural attributes =====
The lemma "opportunist" as used by female and male authors in the BNC:
[lemma="opportunist"] :: match.text_author_sex="Female";
[lemma="opportunist"] :: match.text_author_sex="Male";
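To compare the two groups directly, you can store each result in a variable and ask for its size. This is only a sketch (the names F and M are arbitrary), and the raw counts should be read with care, since the two groups of authors do not necessarily contribute the same amount of text to the corpus:

```cqp
F = [lemma = "opportunist"] :: match.text_author_sex = "Female";
M = [lemma = "opportunist"] :: match.text_author_sex = "Male";
# number of hits in texts by female and by male authors
size F;
size M;
```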
===== Saving results =====
Save the results to an output file:
cat Last > "myconc.txt";
or, if you saved results in a variable:
cat A > "myconc.txt";
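Two further redirection forms may be handy. This is a sketch that assumes your CQP build supports appending with >> and piping to a shell command, and that gzip is installed (the file names are placeholders):

```cqp
# append to an existing file instead of overwriting it
cat A >> "myconc.txt";
# send the output through an external program
cat A > "| gzip > myconc.txt.gz";
```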
Practice:
- save 100 random concordance lines for the donna+NOUN pattern in an external text file, with extended context (e.g., a 3-sentence window)
===== Useful links =====
==== Stefan Evert's CQP tutorial ====
[[http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPTutorial/html]] \\
[[http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPTutorial/cqp-tutorial.pdf]]
==== Some Web-based interfaces ====
Serge Sharoff's Internet Corpora: [[http://corpus.leeds.ac.uk/internet.html]]
CucWeb: [[http://ramsesii.upf.es/cgi-bin/cucweb/search-form.pl?lang=en_US]]
SSLMITDev: [[http://sslmitdev-online.sslmit.unibo.it/corpora/corpora.php]]
The Word Sketch Engine uses a syntax that is almost identical to that of CQP!