CWB beta version (pre-3.0), binaries for Mac OS X
Installation and early operation notes for Mac OS X.
Emiliano Guevara, 08/12/2006.
Contact: emiliano@lingue.unibo.it
- Archive used: cwb-2.2.b91-powerpc-darwin.tar.gz
- System: Mac OS X 10.4.8
- Machine and processor: MacBook Pro, 2.16 GHz Intel Core 2 Duo
1. CWB Installation
To install, copy the appropriate archive for Mac OS X into your Desktop folder. Expand the archive (double-click it in the Finder, or use the commands gunzip and tar on it).
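On the command line, the unpacking step might look like the sketch below. The function name unpack_cwb is made up for illustration; the archive name in the usage comment is the one listed at the top of these notes.

```shell
# Hypothetical helper: unpack a gzipped CWB tarball into the current directory.
unpack_cwb() {
    # gunzip -c writes the decompressed tar stream to stdout;
    # tar -xf - reads that stream from stdin and extracts it.
    gunzip -c "$1" | tar -xf -
}

# Usage (run from the Desktop folder):
#   cd ~/Desktop && unpack_cwb cwb-2.2.b91-powerpc-darwin.tar.gz
```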
You will then have a new directory on your Desktop named cwb-<version number>. If you browse that folder you will see that it contains a number of subdirectories, each containing further subdirectories or binary files:
cwb-<version-number>/
  bin/
    cqp
    cqpcl
    cwb-align
    cwb-align-encode
    cwb-align-show
    cwb-atoi
    cwb-compress-rdx
    cwb-decode
    cwb-describe-corpus
    cwb-encode
    cwb-huffcode
    cwb-itoa
    cwb-lexdecode
    cwb-makeall
    cwb-s-decode
    cwb-s-encode
    cwb-scan-corpus
  include/
    cwb/
      cl.h
  lib/
    libcl.a
  man/
    man1/
      cqp.1
      cwb-encode.1
Using Terminal.app, move all these files into the corresponding directories on your system, i.e. under /usr/local/ (make sure you don't replace any existing directories or files, just add to them) 1).
/
  usr/
    local/
      bin/
      include/
      lib/
      man/
        man1/
You have to move the contents of cwb-<version number>/bin into /usr/local/bin, then the contents of cwb-<version number>/include into /usr/local/include, and so on.
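A cautious way to do this from the terminal is sketched below. It copies rather than moves, and it skips any file that already exists at the destination, in line with the warning above. The function name install_cwb and its two arguments are my own invention; writing to /usr/local may also require running it with sudo.

```shell
# Hypothetical helper: copy bin/, include/, lib/ and man/ from an unpacked
# CWB directory ($1) into a prefix ($2, default /usr/local) without
# overwriting anything that is already there.
install_cwb() {
    src="$1"
    prefix="${2:-/usr/local}"
    for sub in bin include lib man; do
        [ -d "$src/$sub" ] || continue
        # List every file below the subdirectory, relative to it.
        (cd "$src/$sub" && find . -type f) | while IFS= read -r f; do
            f=${f#./}
            dest="$prefix/$sub/$f"
            if [ -e "$dest" ]; then
                echo "skipping existing file: $dest" >&2
            else
                mkdir -p "$(dirname "$dest")"
                cp "$src/$sub/$f" "$dest"
            fi
        done
    done
}

# Usage:
#   install_cwb ~/Desktop/cwb-<version number>
```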
Now you will be able to type all of CWB's commands in your terminal (they are the binary files you moved into /usr/local/bin), and you will also be able to see the man pages for cqp and cwb-encode.
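To confirm that the shell really finds the new commands, a quick check along these lines may help. The function name and the short list of tools it looks for are only an illustration; extend the list as needed.

```shell
# Hypothetical helper: print the name of each listed CWB tool that is
# NOT found on the current PATH (prints nothing if all are found).
missing_cwb_tools() {
    for tool in cqp cwb-encode cwb-makeall; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage:
#   missing_cwb_tools
```

If a name is printed, check that /usr/local/bin is part of your PATH.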
To find a corpus, CQP uses the environment variable $CORPUS_REGISTRY. This has to point to a directory registry/ where the corpora on your system are defined.
In theory, registry/ could be located anywhere, but in my experience it is better to create the following directory tree in / (root):
/
  corpora/
    c1/
      registry/
Then you must set your environment, using one of the following commands 2) :
- if you use the TCSH shell:
setenv CORPUS_REGISTRY "/corpora/c1/registry"
- if you use the BASH shell:
export CORPUS_REGISTRY="/corpora/c1/registry"
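For sh/bash users, creating the tree and setting the variable can be combined as in the sketch below. The function name setup_registry and its optional prefix argument are assumptions of mine; the prefix exists only so the sketch can be tried in a scratch directory first.

```shell
# Hypothetical helper: create the registry tree and point CQP at it.
# With no argument it creates /corpora/c1/registry itself (this may need
# sudo); with an argument it creates the same tree below that prefix.
setup_registry() {
    prefix="${1:-}"
    mkdir -p "$prefix/corpora/c1/registry" || return 1
    CORPUS_REGISTRY="$prefix/corpora/c1/registry"
    export CORPUS_REGISTRY
}

# Usage:
#   setup_registry
#   echo "$CORPUS_REGISTRY"
```

The variable only lasts for the current terminal session. To make it permanent, bash users would typically add the export line to ~/.bash_profile and tcsh users the setenv line to ~/.tcshrc.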
2. Installing an encoded/indexed corpus
If you receive a corpus that is already encoded with cwb-encode (like the demo corpora), you will most probably receive an archive containing a data/ directory and a registry/ directory.
Rename data/ to something more interesting (“dickens”, “german-law”, whatever makes you happy).
Move the renamed data/ directory into /corpora.
Move the contents of registry/ into /corpora/c1/registry (just one text file, containing the information that cqp needs to use the corpus).
Let's say you are installing the DICKENS demo corpus. Your /corpora directory should now look like this:
/
  corpora/
    dickens/    (old "data/")
      book_num.avs
      book_num.avx
      book_num.rng
      book.rng
      chapter_num.avs
      [...]
    c1/
      registry/
        dickens
Now browse into the /corpora/c1/registry directory and open the “dickens” file you just moved into it. The file's contents will include the following lines (this is only an excerpt; the actual file contains much more):
#
# CWB registry entry for corpus DICKENS
#
NAME "IMS Corpus Workbench Demo Corpus (Novels by Charles Dickens)"
ID dickens
HOME data
Do not touch anything except the line defining “HOME”. Replace data with the path to the new directory containing the data in /corpora. In our case, we replace data with /corpora/dickens/. Save the file.
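The three steps (rename data/, move the registry file, fix the HOME line) can also be scripted. The sketch below is only an illustration under a few assumptions: the function name install_corpus and its arguments are made up, the unpacked archive is assumed to contain exactly the data/ and registry/ directories described above, and the HOME line is assumed to start with the word HOME followed by whitespace.

```shell
# Hypothetical helper.
#   $1: directory of the unpacked corpus archive (contains data/ and registry/)
#   $2: new name for the data directory, e.g. dickens
#   $3: corpora root (default /corpora)
install_corpus() {
    src="$1"
    name="$2"
    root="${3:-/corpora}"
    mkdir -p "$root/c1/registry" || return 1
    if [ -e "$root/$name" ]; then
        echo "refusing to overwrite $root/$name" >&2
        return 1
    fi
    # Rename data/ and move it under the corpora root in one step.
    mv "$src/data" "$root/$name" || return 1
    # Copy each registry file, rewriting only its HOME line.
    for entry in "$src"/registry/*; do
        [ -f "$entry" ] || continue
        sed "s|^HOME[[:space:]].*|HOME $root/$name/|" "$entry" \
            > "$root/c1/registry/$(basename "$entry")"
    done
}

# Usage:
#   install_corpus ~/Desktop/dickens-archive dickens
```

The original registry file is left in place, so the result can be compared with it before anything is deleted.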
If you go back to the terminal, you will now be able to type the command cqp and use the installed corpus:
$ cqp -e
[no corpus]> show corpora;
System corpora:
D: DICKENS
[no corpus]> DICKENS;
DICKENS> "charles";
0 matches.
DICKENS> "house";
3445: shutting up the counting- <house> arrived . With an ill-wi
3810: there when it was a young <house> , playing at hide-and-se
3890: black old gateway of the <house> , that it seemed as if t
4369: und resounded through the <house> like thunder . Every roo
5087: so did every bell in the <house> . This might have lasted
[...]
3. Encoding a corpus
If you have installed CWB as indicated above, encoding a new corpus is very straightforward. Go on and read the Corpus Encoding Tutorial.
3.1. Preparation
Make sure your corpus is formatted one token per line, as indicated in the Corpus Encoding Tutorial, optionally with additional columns for positional attributes (POS, LEMMA, etc.).
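A rough sanity check of the input format is sketched below. It rests on assumptions that should be checked against the tutorial: columns are separated by tabs, and lines starting with "<" are structural tags (such as <s> or </s>) rather than tokens. The function name check_columns is made up.

```shell
# Hypothetical helper: report every token line in file $1 that does not
# have exactly $2 tab-separated columns. Returns non-zero if any is found.
check_columns() {
    awk -F '\t' -v n="$2" '
        BEGIN { bad = 0 }
        /^</ { next }    # assumed to be a structural tag line, not a token
        NF != n {
            printf "line %d: expected %d columns, found %d\n", NR, n, NF
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Usage (word, pos and lemma columns):
#   check_columns mycorpus.txt 3
```

Note that a token consisting of a literal "<" would be skipped by this check, so treat its verdict as a hint rather than a guarantee.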
If your corpus consists of more than one file, it is advisable to put all the files together in just one gzipped archive (e.g. using something like gzip -c *.txt > newcorpus.gz).
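The gzip command above compresses each file in turn into the same output, and decompressing the result yields the files concatenated in order. A small sketch to build the archive and count its lines afterwards (both function names are made up):

```shell
# Hypothetical helpers.
# bundle_corpus DIR OUT: gzip all .txt files in DIR into the single file OUT.
bundle_corpus() {
    gzip -c "$1"/*.txt > "$2"
}

# archive_lines FILE: print the number of lines in the decompressed archive.
archive_lines() {
    gunzip -c "$1" | wc -l | tr -d ' '
}

# Usage:
#   bundle_corpus . newcorpus.gz
#   archive_lines newcorpus.gz
```

The line count should equal the sum of the line counts of the original files.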
3.2. Import "newcorpus"
Create a directory /corpora/newcorpus (substitute newcorpus with your corpus's name).
Browse to the directory where your corpus is stored.
$ cd /path/to/newcorpus/files
Issue the cwb-encode command, remembering that your encoded data will “live” in /corpora/newcorpus, and that the registry file for newcorpus will have to be saved under /corpora/c1/registry.
You will also have to define the -P and -S flags according to the characteristics of newcorpus. We are using a simple example:
$ cwb-encode -d /corpora/newcorpus -f newcorpus.gz -R /corpora/c1/registry/newcorpus -P pos -P lemma -S text -S file -S s -S p
Your terminal will probably tell you if there have been any problems. If nothing critical happens, go on.
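Since the data directory, the archive name and the registry file all follow from the corpus name, the command line can be generated from it. The sketch below only prints the command so that it can be reviewed before running. The function name encode_command is made up, and the -P and -S flags are simply the ones from the example above; adjust them to your own corpus.

```shell
# Hypothetical helper: print the cwb-encode command for corpus $1,
# using corpora root $2 (default /corpora).
encode_command() {
    name="$1"
    root="${2:-/corpora}"
    printf 'cwb-encode -d %s -f %s.gz -R %s -P pos -P lemma -S text -S file -S s -S p\n' \
        "$root/$name" "$name" "$root/c1/registry/$name"
}

# Usage (from the directory that holds newcorpus.gz):
#   encode_command newcorpus
#   mkdir -p /corpora/newcorpus && eval "$(encode_command newcorpus)"
```

The eval form is only safe for corpus names without spaces or shell metacharacters.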
3.3. Index "newcorpus"
If you have gotten this far, then you're almost done.
All that is missing are the indexes that cqp needs in order to use the imported data. You will need to issue just one command: cwb-makeall -V NEWCORPUS.
Beware: type the corpus name in uppercase (NEWCORPUS). I got errors when I typed it in lowercase.
$ cwb-makeall -V NEWCORPUS
=== Makeall: processing corpus NEWCORPUS ===
Registry directory: /corpora/c1/registry
ATTRIBUTE word
 + creating LEXSRT ... OK
 - lexicon OK
 + creating FREQS ... OK
 - frequencies OK
 - token stream OK
 + creating REVCIDX ... OK
 + creating REVCORP ... OK
 ? validating REVCORP ... OK
 - index OK
ATTRIBUTE pos
 + creating LEXSRT ... OK
 - lexicon OK
 + creating FREQS ... OK
 - frequencies OK
 - token stream OK
 + creating REVCIDX ... OK
 + creating REVCORP ... OK
 ? validating REVCORP ... OK
 - index OK
ATTRIBUTE lemma
 + creating LEXSRT ... OK
 - lexicon OK
 + creating FREQS ... OK
 - frequencies OK
 - token stream OK
 + creating REVCIDX ... OK
 + creating REVCORP ... OK
 ? validating REVCORP ... OK
 - index OK
========================================
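In a script, the uppercase form can be derived from the lowercase name so that it cannot be mistyped. A minimal sketch (the function name corpus_id is made up):

```shell
# Hypothetical helper: print the uppercase form of a corpus name.
corpus_id() {
    printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]'
}

# Usage:
#   cwb-makeall -V "$(corpus_id newcorpus)"
```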
That's it! NEWCORPUS will now be available in cqp:
$ cqp -e
[no corpus]> show corpora;
System corpora:
D: DICKENS
N: NEWCORPUS
If you put your registry/ directory elsewhere, everything will work smoothly until you try to use cwb-encode with a new corpus. At that point, cqp will tell you that /corpora/c1/registry is needed; this is probably a bug. This is what happened to me after putting my registry in ~/corpora/registry and trying to encode a corpus with cwb-encode:
$ cqp -e
Warning: Couldn't open directory /corpora/c1/registry (continuing)
After that message, I had to move everything to / (root). If you follow the instructions above, you shouldn't have these problems.
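To catch this situation before encoding, a small guard like the following could be run first. It only compares the variable against the location recommended in these notes; the function name check_registry is made up.

```shell
# Hypothetical helper: succeed only if CORPUS_REGISTRY points to the
# location recommended in these notes; otherwise print a warning.
check_registry() {
    expected="/corpora/c1/registry"
    if [ "${CORPUS_REGISTRY:-}" = "$expected" ]; then
        return 0
    fi
    echo "warning: CORPUS_REGISTRY is '${CORPUS_REGISTRY:-}', expected $expected" >&2
    return 1
}

# Usage:
#   check_registry && echo "registry location looks fine"
```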