Looking at KWIC lines with CQP

by Marco Baroni

Invoke CQP (press enter/return key after this and all other commands):

cqp -e

or:

cqp -eC

Exit from CQP:

exit

(If you did that, start CQP again before going on!)

List the available corpora:

show corpora;

While in CQP, keep in mind that some things work as they do in a Unix terminal. In particular, you can recall previous commands with the up arrow key, and you navigate the kwic results as in more/less (space to move to the next page, q to quit, etc.).

Select a corpus (remember the semicolon at the end of each command), e.g.:

BNCV4;
REPUBBLICA;
etc.

(Notice that tab completion works for corpus names.)

Let's stick to the BNC, for now.

A quick way to find out how many tokens there are in a corpus:

info;

Simple kwic:

"food";
"food" %c;
"good" "food";

If you have problems seeing accented characters (as in vowels with umlaut in German or with accents in Italian and Spanish), try:

set Pager more;

You can move through the kwic results as with a standard Unix pager: space to see the next page, b to go back one page, q to exit the kwic display.

If q does not work, press ctrl+c to interrupt the running command.

To see how many hits your last query returned:

size Last;

If you have too many results, it is a good idea to look at a random sample. First, “save” the query results into a variable:

A = "often";

Then, “reduce” A to the desired number of randomly selected contexts, e.g.:

reduce A to 20;

Finally, take a look at these contexts:

cat A;
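
Keep in mind that reduce changes A itself: after the command, A contains only the sampled hits. The following sketch shows the effect, and also a percentage-based sample; it assumes the BNC is the active corpus, and the seed value 42 is an arbitrary choice made only so that the same sample comes back when the commands are repeated:

A = "often";
size A;
randomize 42;
reduce A to 10%;
size A;
cat A;

The second size command should report roughly a tenth of the first figure.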

Change the context size (a bare number counts characters; you can also ask for words or sentences):

set Context 60;
set Context 5 words;
set Context s;
set Context 3 s;
set Context default;

Other visualization options:

show +pos;
show +lemma;
show -pos -lemma;
show -cpos;
set PrintStructures text_domain;
set PrintStructures "";

Queries using morphosyntactic annotation (if you've been experimenting with show and set, now is a good moment to go back to a normal-looking kwic display):

[word = "obsessive"] [pos = "NN.*"]; 
[word = "obsessive" %c] [pos = "NN.*"];
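
The attribute values in these queries are regular expressions that must match the whole value, which is why "NN.*" covers all the noun tags of the BNC tagset (NN0, NN1, NN2). Two illustrative variations, assuming the BNC is still the active corpus (the specific words and tags are only examples):

[word = "obsess.*" %c] [pos = "NN2"];
[pos = "AJ.*"] [lemma = "food"];

The first looks for any word form beginning with obsess, in any capitalization, followed by a plural common noun; the second looks for any adjective followed by a form of the lemma food.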

[word = "cause"];
[lemma = "cause"];
[lemma = "cause" & pos = "VV.*"];

Practice time:
- look for candidate N+N compounds in Italian with "donna" as head (at the lemma level)
- select a random sample of 100 hits
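
One possible way to approach this exercise is sketched here. It assumes that the Italian corpus is REPUBBLICA, that nouns are tagged NOUN there (as in the izzazione query further down), and that the head comes first in the compound, so "donna" is the first of the two nouns:

REPUBBLICA;
A = [lemma = "donna"] [pos = "NOUN"];
size A;
reduce A to 100;
cat A;

Not every hit will be a real compound, so the sample still needs to be checked by eye.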

No need to try the following now, but here is how you can extract a frequency list of the collocates found in a “flexible” context (from the BNC, in this specific case):

[lemma = "cause" & pos = "VV.*"][pos="AT0"]?[pos="AJ.*"]*[pos="NN.*"];
count by lemma on matchend;

This is something you cannot do with cwb-scan-corpus.
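
To read the flexible query: it asks for a verbal form of "cause", optionally followed by an article (AT0), then any number of adjectives, then a noun. Since the noun is always the last token of the match, counting on matchend gives the frequency of each noun lemma. The same steps can be sketched with a named query, so that the hits remain available for later inspection (the variable name C is an arbitrary choice, and the BNC is assumed to be the active corpus):

BNCV4;
C = [lemma = "cause" & pos = "VV.*"] [pos = "AT0"]? [pos = "AJ.*"]* [pos = "NN.*"];
size C;
count C by lemma on matchend;
cat C;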

Nouns ending in -izzazione (in la Repubblica):

A = [lemma = ".*izzazione" & pos = "NOUN"];

The lemma “opportunist” as used by female and male authors in the BNC:

[lemma="opportunist"] :: match.text_author_sex="Female";
[lemma="opportunist"] :: match.text_author_sex="Male";
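
To compare the two groups by number of hits rather than by reading the concordances, the same queries can be stored and measured. This is only a sketch, with arbitrary variable names F and M, and with the BNC assumed to be the active corpus:

F = [lemma = "opportunist"] :: match.text_author_sex = "Female";
M = [lemma = "opportunist"] :: match.text_author_sex = "Male";
size F;
size M;

The raw counts are not directly comparable unless you also take into account how much text by female and by male authors the corpus contains.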

Save the results to an output file:

cat Last > "myconc.txt";

or, if you saved results in a variable:

cat A > "myconc.txt";

Practice:
- save 100 random concordance lines for the donna+NOUN pattern in an external text file, with extended context (e.g., a 3 sentence window)
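
A possible solution sketch, under the same assumptions as the earlier exercise (REPUBBLICA as the Italian corpus, NOUN as its noun tag); the output file name is made up for illustration:

REPUBBLICA;
A = [lemma = "donna"] [pos = "NOUN"];
reduce A to 100;
set Context 3 s;
cat A > "donna_sample.txt";
set Context default;

The context size is set before the cat command because the saved file is expected to use the same display settings as the on-screen kwic; the last line restores the default context afterwards.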

Serge Sharoff's Internet Corpora: http://corpus.leeds.ac.uk/internet.html

CucWeb: http://ramsesii.upf.es/cgi-bin/cucweb/search-form.pl?lang=en_US

SSLMITDev: http://sslmitdev-online.sslmit.unibo.it/corpora/corpora.php

The Word Sketch Engine uses a query syntax that is almost identical to CQP's!

  • tutorial_cqp.txt
  • Last modified: 2008/04/24 12:43
  • by eros