CQP installation and early operation notes for Mac OS X

by Emiliano Guevara (emiliano@lingue.unibo.it)

Archive used	cwb-2.2.b91-powerpc-darwin.tar.gz
System	Mac OS X 10.4.8
Machine Model	MacBookPro2,2
Processor Name	Intel Core 2 Duo
Processor Speed	2.16 GHz
Number Of Processors	1
Total Number Of Cores	2
L2 Cache (per processor)	4 MB
Memory	1 GB
Bus Speed	667 MHz

To install, copy the appropriate archive for Mac OS X into your Desktop folder. Expand the archive (double-click it in the Finder, or use the commands gunzip and tar on it).
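For the command-line route, the gunzip and tar steps can be combined in one tar call. The following is a minimal sketch, not part of the original notes; the function name unpack_cwb and the paths in the commented call are only illustrative.

```shell
# Unpack a CWB binary archive into a destination directory.
# $1 = path to the .tar.gz archive, $2 = destination directory
unpack_cwb() {
    if [ ! -f "$1" ]; then
        echo "archive not found: $1" >&2
        return 1
    fi
    mkdir -p "$2" || return 1
    # -x extract, -z filter through gzip, -f read from the named file,
    # -C change to the destination directory before extracting
    tar -xzf "$1" -C "$2"
}

# Hypothetical call, using the archive name listed at the top of these notes:
# unpack_cwb "$HOME/Desktop/cwb-2.2.b91-powerpc-darwin.tar.gz" "$HOME/Desktop"
```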

You will then have a new directory on your Desktop named cwb-<version number>. If you browse that folder, you will see that it contains a number of subdirectories holding the binaries, a header file, a library and the man pages:

cwb-<version-number>/
  bin/
    cwb-align-encode
    cqpcl
    cqp
    cwb-compress-rdx
    cwb-align-show
    cwb-s-decode
    cwb-atoi
    cwb-align
    cwb-decode
    cwb-makeall
    cwb-lexdecode
    cwb-describe-corpus
    cwb-encode
    cwb-huffcode
    cwb-itoa
    cwb-s-encode
    cwb-scan-corpus
  include/
    cwb/
      cl.h
  lib/
    libcl.a
  man/
    man1/
      cqp.1
      cwb-encode.1

Using Terminal.app, move all these files into the corresponding directories under /usr/local/ on your system. Add the files to the existing directories; do not replace any existing directories or files. WARNING: if you are not familiar with the UNIX environment of your Mac OS X system, do not try this!

/
  usr/
    local/
      bin/
      include/
      lib/
      man/
        man1/

Now you will be able to run all of CWB's commands in your terminal and to read the man pages for cqp and cwb-encode.
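A quick way to confirm that the shell can find the tools is to look each one up on the PATH. This is a small sketch; missing_tools is a hypothetical helper, and the tool names passed to it are taken from the bin/ listing above.

```shell
# Print the name of every listed command that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools cqp cwb-encode cwb-makeall cwb-describe-corpus)
if [ -z "$missing" ]; then
    echo "all CWB tools found"
else
    echo "not on PATH:" $missing
fi
```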

To find a corpus, CQP uses an environment variable $CORPUS_REGISTRY. This has to point to a directory registry/ where the corpora on your system are defined. In theory, registry/ could be located anywhere, but in my experience it is better to create the following directory tree in / (root):

/
  corpora/
    c1/
      registry/

Then you must set your environment, using one of the following commands: ¹⁾

if you use the TCSH shell:

  setenv CORPUS_REGISTRY "/corpora/c1/registry"

if you use the BASH shell:

  export CORPUS_REGISTRY="/corpora/c1/registry"
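For bash users, creating the directory tree and setting the variable can be done in one step. The sketch below is only an illustration: setup_registry is a made-up name, and the corpora root is passed as an argument so that the function can be tried on a scratch directory first.

```shell
# Create the registry directory below a corpora root ($1) and point
# CORPUS_REGISTRY at it for the current shell session.
setup_registry() {
    registry="$1/c1/registry"
    mkdir -p "$registry" || return 1
    CORPUS_REGISTRY=$registry
    export CORPUS_REGISTRY
}

# Hypothetical call for the layout described above (creating /corpora in the
# root directory normally needs administrator rights):
# setup_registry /corpora
```

The setting only lasts for the current session. To keep it, the export line can be added to ~/.bash_profile (bash), or the setenv line to ~/.tcshrc (tcsh).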

If you receive a corpus that is already encoded with cwb-encode (like the demo corpora), you will most probably have an archive containing a data/ directory and a registry/ directory.

Rename data/ to something more interesting (“dickens”, “german-law”, whatever makes you happy).
Move the renamed data/ directory into /corpora.
Move the content of registry/ into /corpora/c1/registry (just one text file, containing the information that cqp needs to use the corpus).
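The three steps above could be scripted roughly as follows. This is a sketch under assumptions: place_corpus is a hypothetical helper, and it refuses to run if a directory with the chosen name already exists.

```shell
# $1 = unpacked corpus directory (contains data/ and registry/)
# $2 = new name for the data directory, $3 = corpora root
place_corpus() {
    if [ -e "$3/$2" ]; then
        echo "refusing to overwrite: $3/$2" >&2
        return 1
    fi
    mkdir -p "$3/c1/registry" || return 1
    # Rename data/ and move it below the corpora root in one step.
    mv "$1/data" "$3/$2" || return 1
    # Move the registry file(s) next to the other registry entries.
    mv "$1"/registry/* "$3/c1/registry/"
}

# Hypothetical call (the source path is a placeholder):
# place_corpus /path/to/unpacked/corpus dickens /corpora
```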

Let's say you are installing the DICKENS demo corpus. Let's say you now have the following situation in your /corpora directory:

/
  corpora/
    dickens/                 (old "data/")
      book_num.avs
      book_num.avx
      book_num.rng
      book.rng
      chapter_num.avs
      [...]
    c1/
      registry/
        dickens

Now browse into the /corpora/c1/registry directory and open the “dickens” file you just moved into it. The file's contents will include the following lines (just an example, it includes much more…):

#
# CWB registry entry for corpus DICKENS
#

NAME "IMS Corpus Workbench Demo Corpus (Novels by Charles Dickens)"
ID dickens
HOME data

Do not touch anything, except the line defining “HOME”.

Replace data with the path to the new directory containing the data in /corpora.
In our case, we replace data with /corpora/dickens/.
Save the file.
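If you prefer not to open an editor, the same change can be made with sed. A minimal sketch, assuming the registry file contains a line beginning with "HOME " as in the excerpt above; set_registry_home is a made-up name, and the data path must not contain the characters | or &.

```shell
# Rewrite the HOME line of a registry file ($1) to point at a data directory ($2).
set_registry_home() {
    tmp="$1.tmp"
    # '|' is the sed delimiter here, so slashes in the path need no escaping.
    # The result goes to a temporary file that then replaces the original.
    sed "s|^HOME .*|HOME $2|" "$1" > "$tmp" && mv "$tmp" "$1"
}

# Hypothetical call for the DICKENS example:
# set_registry_home /corpora/c1/registry/dickens /corpora/dickens/
```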

If you go back to the terminal, you will now be able to type the command cqp and use the installed corpus:

$ cqp -e
[no corpus]> show corpora;
System corpora:
 D: DICKENS     
[no corpus]> DICKENS;
DICKENS> "charles";
0 matches.
DICKENS> "house";  
     3445: shutting up the counting- <house> arrived . With an ill-wi
     3810: there when it was a young <house> , playing at hide-and-se
     3890:  black old gateway of the <house> , that it seemed as if t
     4369: und resounded through the <house> like thunder . Every roo
     5087:  so did every bell in the <house> . This might have lasted
     [...]

If you have installed CWB as indicated above, encoding a new corpus is very straightforward. Go on and read the “Corpus Encoding Tutorial”.

Make sure your corpus is formatted one token per line, as indicated in the “Corpus Encoding Tutorial”, optionally with additional columns for positional attributes (POS, LEMMA, etc.).
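A rough consistency check of the input can catch malformed lines before encoding. The sketch below assumes tab-separated columns, treats every line starting with "<" as an XML tag, and ignores blank lines; check_columns is a hypothetical helper, not a CWB tool.

```shell
# Report token lines in a one-token-per-line file ($1) that do not have
# exactly $2 tab-separated columns. Returns non-zero if any were found.
check_columns() {
    awk -F '\t' -v n="$2" '
        /^</ || /^$/ { next }   # skip XML tags and blank lines
        NF != n {
            printf "line %d: %d column(s), expected %d\n", NR, NF, n
            bad = 1
        }
        END { exit (bad + 0) }
    ' "$1"
}

# Hypothetical call for a file with word, POS and lemma columns:
# check_columns mytext.txt 3
```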

If your corpus consists of more than one file, it is advisable that you put all the files together in just one gzipped archive, e.g. using something like:

  gzip -c *.txt > newcorpus.gz
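The gzip command above writes the compressed files one after another into a single file, and decompressing that file yields their contents in the same order. A small sanity check, sketched here with a made-up helper name gz_lines, is to compare line counts before and after. Note that if an input file does not end with a newline, its last line will run into the first line of the next file.

```shell
# Print the number of lines obtained by decompressing a gzip file ($1).
gz_lines() {
    gzip -dc "$1" | wc -l | tr -d ' '
}

# Hypothetical check: the two numbers should match.
# cat *.txt | wc -l
# gz_lines newcorpus.gz
```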

Create a directory /corpora/newcorpus (substitute “newcorpus” with your corpus's name…).

Browse to the directory where your corpus-files are stored.

$ cd /path/to/newcorpus/files

Issue the cwb-encode command, remembering that your encoded data will “live” in /corpora/newcorpus, and that the registry file for newcorpus will have to be saved as /corpora/c1/registry/newcorpus.

You will also have to define the -P and -S flags according to the characteristics of newcorpus. We are using a simple example:

$ cwb-encode -d /corpora/newcorpus -f newcorpus.gz -R /corpora/c1/registry/newcorpus -P pos -P lemma -S text -S file -S s -S p
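The directory creation and the cwb-encode call can be wrapped in one function. This is a sketch, not a definitive recipe: encode_corpus and the CWB_ENCODE variable are hypothetical, the -P and -S flags are simply those of the example above, and they must be adapted to your own corpus.

```shell
# Encode a gzipped one-token-per-line file with the flags from the example.
# $1 = corpus name (lowercase), $2 = input .gz file, $3 = corpora root
# CWB_ENCODE is a hypothetical override used only by this sketch; set it to
# "echo" to print the arguments instead of running cwb-encode.
encode_corpus() {
    mkdir -p "$3/$1" "$3/c1/registry" || return 1
    "${CWB_ENCODE:-cwb-encode}" -d "$3/$1" -f "$2" -R "$3/c1/registry/$1" \
        -P pos -P lemma -S text -S file -S s -S p
}

# Hypothetical call for the example corpus:
# encode_corpus newcorpus newcorpus.gz /corpora
```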

Your terminal will probably tell you if there have been any problems. If nothing critical happens, go on.

If you've gotten this far, then you're almost done.

You are just missing the indexes for cqp to be able to use the imported data.

You will need to issue just one command: cwb-makeall -V NEWCORPUS.

Beware: type the corpus name in uppercase (NEWCORPUS); that is how cwb likes it. I got errors when typing it in lowercase.

$ cwb-makeall -V NEWCORPUS
 
=== Makeall: processing corpus NEWCORPUS ===
Registry directory: /corpora/c1/registry
ATTRIBUTE word
 + creating LEXSRT ... OK
 - lexicon      OK
 + creating FREQS ... OK
 - frequencies  OK
 - token stream OK
 + creating REVCIDX ... OK
 + creating REVCORP ... OK
 ? validating REVCORP ... OK
 - index        OK
ATTRIBUTE pos
 + creating LEXSRT ... OK
 - lexicon      OK
 + creating FREQS ... OK
 - frequencies  OK
 - token stream OK
 + creating REVCIDX ... OK
 + creating REVCORP ... OK
 ? validating REVCORP ... OK
 - index        OK
ATTRIBUTE lemma
 + creating LEXSRT ... OK
 - lexicon      OK
 + creating FREQS ... OK
 - frequencies  OK
 - token stream OK
 + creating REVCIDX ... OK
 + creating REVCORP ... OK
 ? validating REVCORP ... OK
 - index        OK
========================================
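In a script, the uppercase form of the corpus name can be derived from the lowercase one instead of being typed twice. A minimal sketch; corpus_id_upper is a made-up helper.

```shell
# Convert a corpus name to the uppercase form used on the cwb-makeall
# command line.
corpus_id_upper() {
    printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]'
}

corpus_id_upper newcorpus   # prints NEWCORPUS

# Hypothetical use together with the command shown above:
# cwb-makeall -V "$(corpus_id_upper newcorpus)"
```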

That's it! Now NEWCORPUS will be available in cqp:

$ cqp -e
[no corpus]> show corpora;
System corpora:
 D: DICKENS     
 N: NEWCORPUS

¹⁾

If you try putting your registry/ directory elsewhere, it will work smoothly until you try to use cwb-encode with a new corpus. At that point, cqp will tell you that /corpora/c1/registry is needed; this is probably a bug. This is what happened to me after putting my registry in ~/corpora/registry and trying to encode a corpus with cwb-encode:

   $ cqp -e
   Warning: Couldn't open directory /corpora/c1/registry (continuing)

After that message, I had to move everything to / (root). If you follow the instructions above, you shouldn't have these problems.
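When a warning like this appears, it can help to check first where the variable actually points. The following sketch only inspects the environment and does not explain or fix the behaviour described above; check_registry is a hypothetical helper.

```shell
# Report whether CORPUS_REGISTRY is set and names an existing directory,
# and list the registry entries found there.
check_registry() {
    if [ -z "${CORPUS_REGISTRY:-}" ]; then
        echo "CORPUS_REGISTRY is not set"
        return 1
    fi
    if [ ! -d "$CORPUS_REGISTRY" ]; then
        echo "not a directory: $CORPUS_REGISTRY"
        return 1
    fi
    echo "registry: $CORPUS_REGISTRY"
    ls "$CORPUS_REGISTRY"
}

check_registry || echo "fix CORPUS_REGISTRY before running cqp"
```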
