CQP installation and early operation notes for Mac OS X
by Emiliano Guevara (emiliano@lingue.unibo.it)
CWB beta version (pre-3.0), binaries for Mac OS X
Archive used | cwb-2.2.b91-powerpc-darwin.tar.gz |
System | Mac OS X 10.4.8 |
Machine Model | MacBookPro2,2 |
Processor Name | Intel Core 2 Duo |
Processor Speed | 2.16 GHz |
Number Of Processors | 1 |
Total Number Of Cores | 2 |
L2 Cache (per processor) | 4 MB |
Memory | 1 GB |
Bus Speed | 667 MHz |
Installation
To install, copy the appropriate archive for Mac OS X into your Desktop folder. Expand the archive (double-click it in the Finder, or use the commands gunzip and tar on it).
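On the command line, the expansion step could look like the sketch below. The function name expand_cwb_archive is a hypothetical helper and not part of CWB; it only wraps the gunzip and tar commands mentioned above.

```shell
# Hypothetical helper (not part of CWB): unpack a .tar.gz archive
# into a target directory using gunzip and tar.
expand_cwb_archive() {
    archive=$1   # path to the downloaded .tar.gz file
    target=$2    # directory to unpack into, e.g. "$HOME/Desktop"
    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi
    mkdir -p "$target" || return 1
    # Decompress to stdout and let tar extract inside the target directory.
    gunzip -c "$archive" | (cd "$target" && tar -xf -)
}

# Example call (archive name taken from the table at the top of these notes):
# expand_cwb_archive "$HOME/Desktop/cwb-2.2.b91-powerpc-darwin.tar.gz" "$HOME/Desktop"
```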
You will then have a new directory on your Desktop named cwb-<version-number>. If you browse that folder, you will see that it contains a number of subdirectories, each containing further subdirectories or files:
cwb-<version-number>/
    bin/
        cwb-align-encode
        cqpcl
        cqp
        cwb-compress-rdx
        cwb-align-show
        cwb-s-decode
        cwb-atoi
        cwb-align
        cwb-decode
        cwb-makeall
        cwb-lexdecode
        cwb-describe-corpus
        cwb-encode
        cwb-huffcode
        cwb-itoa
        cwb-s-encode
        cwb-scan-corpus
    include/
        cwb/
            cl.h
    lib/
        libcl.a
    man/
        man1/
            cqp.1
            cwb-encode.1
Using Terminal.app, move all these files into the corresponding directories on your system, i.e. under /usr/local/ (make sure you do not replace any existing directories or files, just add to them. WARNING: if you are not familiar with the UNIX environment in your Mac OS X system, do not try this!):
/
    usr/
        local/
            bin/
            include/
            lib/
            man/
                man1/
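One way to add the files without replacing anything that already exists is sketched below. The function install_cwb_files is a hypothetical helper, not part of CWB; it copies rather than moves, and it skips any file that is already present at the destination. Writing into /usr/local may require administrator rights on your system.

```shell
# Hypothetical helper (not part of CWB): copy the unpacked CWB files into
# a prefix such as /usr/local, never overwriting files that already exist.
install_cwb_files() {
    src=$1      # the unpacked cwb-<version-number> directory
    prefix=$2   # installation prefix, e.g. /usr/local
    for sub in bin include/cwb lib man/man1; do
        mkdir -p "$prefix/$sub" || return 1
        for f in "$src/$sub"/*; do
            # Skip subdirectories and unmatched glob patterns.
            [ -f "$f" ] || continue
            dest="$prefix/$sub/$(basename "$f")"
            if [ -e "$dest" ]; then
                echo "skipping existing file: $dest" >&2
            else
                cp -p "$f" "$dest" || return 1
            fi
        done
    done
    return 0
}

# Example call (the source path is a placeholder for your unpacked directory):
# install_cwb_files "$HOME/Desktop/cwb-<version-number>" /usr/local
```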
Now you will be able to run all of CWB's commands in your terminal and read the man pages for cqp and cwb-encode.
To find a corpus, CQP uses the environment variable $CORPUS_REGISTRY. This has to point to a directory registry/ where the corpora on your system are defined.
In theory, registry/ could be located anywhere, but in my experience it is better to create the following directory tree in / (root):
/
    corpora/
        c1/
            registry/
Then you must set your environment, using one of the following commands, depending on your shell:
1) csh or tcsh:
setenv CORPUS_REGISTRY "/corpora/c1/registry"
2) sh or bash:
export CORPUS_REGISTRY="/corpora/c1/registry"
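For sh or bash, creating the directory tree and setting the variable can be combined, as in this sketch. The function setup_registry is a hypothetical helper; creating directories directly under / may require administrator rights, and the variable only lasts for the current shell session unless you also add the export line to your shell's startup file.

```shell
# Hypothetical helper (not part of CWB): create <root>/c1/registry and
# point CORPUS_REGISTRY at it for the current shell session.
setup_registry() {
    root=$1   # e.g. /corpora
    mkdir -p "$root/c1/registry" || return 1
    CORPUS_REGISTRY="$root/c1/registry"
    export CORPUS_REGISTRY
}

# Example call, matching the layout used in these notes:
# setup_registry /corpora
# echo "$CORPUS_REGISTRY"
```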
Installing a corpus
If you receive a corpus that is already encoded with cwb-encode (like the demo corpora), you will most probably have an archive containing a data/ directory and a registry/ directory.
Rename data/ to something more interesting (“dickens”, “german-law”, whatever makes you happy).
Move the renamed data/ directory into /corpora.
Move the content of registry/ into /corpora/c1/registry (just one text file, containing the information that cqp needs to use the corpus).
Let's say you are installing the DICKENS demo corpus. You should now have the following situation in your /corpora directory:
/
    corpora/
        dickens/ (old "data/")
            book_num.avs
            book_num.avx
            book_num.rng
            book.rng
            chapter_num.avs
            [...]
        c1/
            registry/
                dickens
Now browse into the /corpora/c1/registry directory and open the “dickens” file you just moved into it. The file's contents will include the following lines (this is only an excerpt; the file contains much more):
#
# CWB registry entry for corpus DICKENS
#
NAME "IMS Corpus Workbench Demo Corpus (Novels by Charles Dickens)"
ID dickens
HOME data
Do not touch anything, except the line defining “HOME”.
Replace data with the path to the new directory containing the data in /corpora.
In our case, we replace data with /corpora/dickens/.
Save the file.
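If you prefer not to edit the registry file by hand, the same change can be made with sed, as in the sketch below. The function set_registry_home is a hypothetical helper, not part of CWB; it assumes the file contains a line beginning with "HOME " and that the new path contains no "|" or "&" characters.

```shell
# Hypothetical helper (not part of CWB): rewrite the HOME line of a
# registry file, leaving every other line unchanged.
set_registry_home() {
    regfile=$1   # e.g. /corpora/c1/registry/dickens
    newhome=$2   # e.g. /corpora/dickens/
    tmpfile="$regfile.tmp.$$"
    # Only lines starting with "HOME " are replaced; the result is written
    # to a temporary file first so a failed sed does not damage the original.
    sed "s|^HOME .*|HOME $newhome|" "$regfile" > "$tmpfile" && mv "$tmpfile" "$regfile"
}

# Example call, matching the DICKENS example:
# set_registry_home /corpora/c1/registry/dickens /corpora/dickens/
```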
If you go back to the terminal, you will now be able to type the command cqp and use the installed corpus:
$ cqp -e
[no corpus]> show corpora;
System corpora:
D: DICKENS
[no corpus]> DICKENS;
DICKENS> "charles";
0 matches.
DICKENS> "house";
3445: shutting up the counting- <house> arrived . With an ill-wi
3810: there when it was a young <house> , playing at hide-and-se
3890: black old gateway of the <house> , that it seemed as if t
4369: und resounded through the <house> like thunder . Every roo
5087: so did every bell in the <house> . This might have lasted
[...]
Encoding a corpus
If you have installed CWB as indicated above, encoding a new corpus is very straightforward. Go on and read the “Corpus Encoding Tutorial”.
Preparation
Make sure your corpus is formatted one token per line, as indicated in the “Corpus Encoding Tutorial”, optionally with additional columns for positional attributes (POS, LEMMA, etc.).
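To make the expected layout concrete, the sketch below writes a tiny sample file. It assumes the format described in the tutorial: tab-separated columns (here word form, POS and lemma) and structural tags on lines of their own. The function name write_sample_corpus, the sentence and the tags are made up for illustration.

```shell
# Hypothetical helper: write a small illustrative input file with one
# token per line and tab-separated columns for word form, POS and lemma.
write_sample_corpus() {
    out=$1
    {
        # Structural tags each go on a line of their own.
        printf '<text>\n<s>\n'
        # printf reuses the format for every group of three arguments,
        # so this prints five token lines.
        printf '%s\t%s\t%s\n' The DT the house NN house was VBD be old JJ old . SENT .
        printf '</s>\n</text>\n'
    } > "$out"
}

# Example call:
# write_sample_corpus sample.txt
```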
If your corpus consists of more than one file, it is advisable to put all the files together in just one gzipped file, e.g. using something like:
gzip -c *.txt > newcorpus.gz
Import "newcorpus"
Create a directory /corpora/newcorpus (substitute “newcorpus” with your corpus's name).
Browse to the directory where your corpus files are stored.
$ cd /path/to/newcorpus/files
Issue the cwb-encode command, remembering that your encoded data will “live” in /corpora/newcorpus, and that the registry file for newcorpus will have to be saved as /corpora/c1/registry/newcorpus.
You will also have to define the -P and -S flags according to the characteristics of newcorpus. We are using a simple example:
$ cwb-encode -d /corpora/newcorpus -f newcorpus.gz -R /corpora/c1/registry/newcorpus -P pos -P lemma -S text -S file -S s -S p
Any problems will usually be reported in your terminal. If no critical errors appear, go on.
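A quick check of the two paths used in the command above can save a failed run. The sketch below is a hypothetical helper (check_encode_paths is not part of CWB) that only tests whether the data directory and the registry directory exist.

```shell
# Hypothetical helper (not part of CWB): verify that the data directory
# and the directory of the registry file exist before encoding.
check_encode_paths() {
    datadir=$1   # e.g. /corpora/newcorpus
    regfile=$2   # e.g. /corpora/c1/registry/newcorpus
    status=0
    if [ ! -d "$datadir" ]; then
        echo "missing data directory: $datadir" >&2
        status=1
    fi
    if [ ! -d "$(dirname "$regfile")" ]; then
        echo "missing registry directory: $(dirname "$regfile")" >&2
        status=1
    fi
    return $status
}

# Example call, using the paths from the cwb-encode example:
# check_encode_paths /corpora/newcorpus /corpora/c1/registry/newcorpus
```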
Index "newcorpus"
If you've gotten this far, then you're almost done. You are just missing the indexes that cqp needs to use the imported data.
You will need to issue just one command: cwb-makeall -V NEWCORPUS.
Beware: type “newcorpus” in uppercase; that is how CWB likes it. I got errors when typing it in lowercase.
$ cwb-makeall -V NEWCORPUS
=== Makeall: processing corpus NEWCORPUS ===
Registry directory: /corpora/c1/registry
ATTRIBUTE word
+ creating LEXSRT ... OK
- lexicon OK
+ creating FREQS ... OK
- frequencies OK
- token stream OK
+ creating REVCIDX ... OK
+ creating REVCORP ... OK
? validating REVCORP ... OK
- index OK
ATTRIBUTE pos
+ creating LEXSRT ... OK
- lexicon OK
+ creating FREQS ... OK
- frequencies OK
- token stream OK
+ creating REVCIDX ... OK
+ creating REVCORP ... OK
? validating REVCORP ... OK
- index OK
ATTRIBUTE lemma
+ creating LEXSRT ... OK
- lexicon OK
+ creating FREQS ... OK
- frequencies OK
- token stream OK
+ creating REVCIDX ... OK
+ creating REVCORP ... OK
? validating REVCORP ... OK
- index OK
========================================
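Since the corpus name has to be given in uppercase, a small helper can derive it from the lowercase name used for the directory and registry file. The function corpus_id below is hypothetical and not part of CWB; it only converts its argument with tr.

```shell
# Hypothetical helper (not part of CWB): print the uppercase form of a
# corpus name, as expected on the cwb-makeall command line.
corpus_id() {
    printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]'
}

# Example call, equivalent to the cwb-makeall command shown above:
# cwb-makeall -V "$(corpus_id newcorpus)"
```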
That's it! NEWCORPUS will now be available in cqp:
$ cqp -e
[no corpus]> show corpora;
System corpora:
D: DICKENS
N: NEWCORPUS