====== Looking at KWIC lines with CQP ======

by Marco Baroni

===== The First Session =====

Invoke CQP (press the enter/return key after this and all other commands):

  cqp -e

or:

  cqp -eC

Exit from CQP:

  exit

(If you did that, please start CQP again!)

List the available corpora:

  show corpora;

While in CQP, keep in mind that some things work as they do in the Unix terminal: in particular, you can recall previous commands with the up arrow, and you navigate the KWIC results with more/less-like keys (space to move to the next page, q to quit, etc.).

Select a corpus (remember the semicolon at the end of each command), e.g.:

  BNCV4;
  REPUBBLICA;

etc. (Notice that tab completion works for corpus names.)

Let's stick to the BNC for now.

A quick way to find out how many tokens there are in a corpus:

  info;

Simple KWIC queries:

  "food";
  "food" %c;
  "good" "food";

If you have problems seeing accented characters (such as vowels with umlauts in German or with accents in Italian and Spanish), try:

  set Pager more;

You can move through the KWIC results as with a standard Unix pager: **space** to see the next page, **b** to go back one page, **q** to exit the KWIC display. Whenever **q** does not work, use **ctrl+c** to interrupt any command.

To see the frequency of occurrence (number of hits) of your last query:

  size Last;

If you have too many results, it is a good idea to take a look at a random sample...
First, "save" the query into a variable:

  A = "often";

Then "reduce" A to the desired number of randomly selected contexts, e.g.:

  reduce A to 20;

Finally, take a look at these contexts:

  cat A;

Change the context size:

  set Context 60;
  set Context 5 words;
  set Context s;
  set Context 3 s;
  set Context default;

Other visualization options:

  show +pos;
  show +lemma;
  show -pos -lemma;
  show -cpos;
  set PrintStructures text_domain;
  set PrintStructures "";

===== Exploiting morpho-syntactic annotation =====

Queries using morphosyntactic annotation (if you've been experimenting with show and set, now is a good moment to go back to a normal-looking KWIC display):

  [word = "obsessive"] [pos = "NN.*"];
  [word = "obsessive" %c] [pos = "NN.*"];
  [word = "cause"];
  [lemma = "cause"];
  [lemma = "cause" & pos = "VV.*"];

Practice time:

  - look for candidate N+N compounds in Italian with "donna" as head (at the lemma level)
  - select a random sample of 100 hits

No need to try the following now, but here is how you can extract a frequency list for a collocate taken from a "flexible" context (from the BNC, in this specific case):

  [lemma = "cause" & pos = "VV.*"][pos="AT0"]?[pos="AJ.*"]*[pos="NN.*"];
  count by lemma on matchend;

This is something you **cannot** do with cwb-scan-corpus.
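
As a reading aid, here is the same flexible-context query again, stored in a named query and annotated step by step. Treat it as a sketch rather than a recipe: the name Causes is arbitrary (it is not part of the original example), the tags are the BNC tags used above, and the comment lines assume that your version of CQP ignores text after #. Passing the query name to count is also an assumption; if it is rejected, fall back to the plain "count by lemma on matchend;" form right after running the query.

  # Assumption: lines starting with # are treated as comments by CQP.
  # Pattern: verb "cause" + optional article + zero or more adjectives + noun.
  Causes = [lemma = "cause" & pos = "VV.*"] [pos = "AT0"]? [pos = "AJ.*"]* [pos = "NN.*"];
  # Number of hits for the pattern.
  size Causes;
  # Frequency list over the lemma of the last token of each hit (the noun).
  count Causes by lemma on matchend;

Because the article and the adjectives are optional, the noun sits at a different distance from the verb in different hits; anchoring the count on matchend is what keeps the frequency list on the noun regardless of how long each match is.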
===== Regular Expressions =====

Nouns ending in izzazione (in la Repubblica):

  A = [lemma = ".*izzazione" & pos = "NOUN"];

===== Structural attributes =====

The lemma "opportunist" used by women and men in the BNC:

  [lemma="opportunist"] :: match.text_author_sex="Female";
  [lemma="opportunist"] :: match.text_author_sex="Male";

===== Saving results =====

Save the results to an output file:

  cat Last > "myconc.txt";

or, if you saved results in a variable:

  cat A > "myconc.txt";

Practice:

  - save 100 random concordance lines for the donna+NOUN pattern in an external text file, with extended context (e.g., a 3 sentence window)

===== Useful links =====

==== Stefan Evert's CQP tutorial ====

[[http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPTutorial/html]] \\
[[http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPTutorial/cqp-tutorial.pdf]]

==== Some Web-based interfaces ====

Serge Sharoff's Internet Corpora: [[http://corpus.leeds.ac.uk/internet.html]]

CucWeb: [[http://ramsesii.upf.es/cgi-bin/cucweb/search-form.pl?lang=en_US]]

SSLMITDev: [[http://sslmitdev-online.sslmit.unibo.it/corpora/corpora.php]]

The Word Sketch Engine uses a syntax that is almost identical to the one of CQP!