CWB beta version (pre-3.0), binaries for Mac OS X

Installation and early operation notes for Mac OS X.

Emiliano Guevara, 08/12/2006.

Contact: emiliano@lingue.unibo.it

Archive used: cwb-2.2.b91-powerpc-darwin.tar.gz

System: Mac OS X 10.4.8

Machine and processor: MacBook Pro, 2.16 GHz Intel Core 2 Duo

To install, copy the appropriate archive for Mac OS X into your Desktop folder. Expand the archive (double-click it in the Finder, or use the gunzip and tar commands on it).
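
On the command line, the expansion step could look like the sketch below. The function name expand_cwb_archive is made up for this example; it only wraps the gunzip and tar commands mentioned above and unpacks the archive next to itself.

```shell
# Sketch: unpack a .tar.gz archive into the directory that contains it.
# expand_cwb_archive is a hypothetical helper name, not part of CWB.
expand_cwb_archive() {
    archive="$1"
    dir=$(dirname "$archive")
    file=$(basename "$archive")
    # gunzip -c writes the decompressed tar stream to stdout;
    # tar -xf - reads that stream from stdin and extracts it.
    ( cd "$dir" && gunzip -c "$file" | tar -xf - )
}

# Example call (archive name as used in these notes):
# expand_cwb_archive "$HOME/Desktop/cwb-2.2.b91-powerpc-darwin.tar.gz"
```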

You will then have a new directory on your Desktop named cwb-<version number>. If you browse that folder you will see that it contains a number of subdirectories holding the binaries, the header file, the library and the man pages:

  cwb-<version-number>/
    bin/
      cwb-align-encode
      cqpcl
      cqp
      cwb-compress-rdx
      cwb-align-show
      cwb-s-decode
      cwb-atoi
      cwb-align
      cwb-decode
      cwb-makeall
      cwb-lexdecode
      cwb-describe-corpus
      cwb-encode
      cwb-huffcode
      cwb-itoa
      cwb-s-encode
      cwb-scan-corpus
    include/
      cwb/
        cl.h
    lib/
      libcl.a
    man/
      man1/
        cqp.1
        cwb-encode.1

Using Terminal.app, move all these files into the corresponding directories under /usr/local/ on your system (make sure you don't replace any existing directories or files, just add the new ones)¹⁾.

  /
    usr/
      local/
        bin/
        include/
        lib/
        man/
          man1/

You have to move the contents of cwb-<version number>/bin into /usr/local/bin, then the contents of cwb-<version number>/include into /usr/local/include, and so on.
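
The copy step can be scripted so that nothing already present under /usr/local gets overwritten. The following is only a sketch: install_cwb_tree is a hypothetical helper name, and it copies the files instead of moving them so that the unpacked folder stays intact.

```shell
# Sketch: copy bin/, include/, lib/ and man/ from the unpacked CWB folder
# into a prefix such as /usr/local, skipping files that already exist there.
# install_cwb_tree is a hypothetical helper name, not part of CWB.
install_cwb_tree() {
    src="$1"      # e.g. ~/Desktop/cwb-<version number>
    prefix="$2"   # e.g. /usr/local
    # List all files relative to the source folder, then copy them one by one.
    ( cd "$src" && find bin include lib man -type f ) | while IFS= read -r f; do
        if [ -e "$prefix/$f" ]; then
            echo "skipping existing $prefix/$f" >&2
        else
            mkdir -p "$prefix/$(dirname "$f")"
            cp -p "$src/$f" "$prefix/$f"   # -p keeps the executable bits
        fi
    done
}

# Example call:
# install_cwb_tree "$HOME/Desktop/cwb-<version number>" /usr/local
```

Writing into /usr/local normally requires administrator rights, so in practice the calls inside the function may need to be run through sudo.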

Now you will be able to type all of CWB's commands on your terminal (they are the binary files you moved into /usr/local/bin), and you will also be able to see the man pages for cqp and cwb-encode.
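
A quick way to confirm that the shell really finds the new commands is to ask it where they are. The sketch below uses the standard command -v builtin; missing_commands is a hypothetical helper name.

```shell
# Sketch: print the names of the given commands that are NOT found on the PATH.
# missing_commands is a hypothetical helper name, not part of CWB.
missing_commands() {
    for cmd in "$@"; do
        # command -v succeeds only if the shell can resolve the command.
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Example: check a few of the binaries listed above.
# missing_commands cqp cwb-encode cwb-makeall cwb-decode
```

If the call prints nothing, all the commands were found. If it prints names, /usr/local/bin is probably missing from your PATH or the files were not copied.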

To find a corpus, CQP uses an environment variable $CORPUS_REGISTRY. This has to point to a directory registry/ where the corpora on your system are defined.

In theory, registry/ could be located anywhere, but in my experience it is better to create the following directory tree in / (root):

  /
    corpora/
      c1/
        registry/

Then you must set your environment, using one of the following commands²⁾:

if you use the TCSH shell:

  setenv CORPUS_REGISTRY "/corpora/c1/registry"

if you use the BASH shell:

  export CORPUS_REGISTRY="/corpora/c1/registry"
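
For BASH users, creating the directory tree and setting the variable can be combined as in this sketch. setup_registry is a hypothetical helper name, and the root directory is passed as an argument so that the function can be tried out somewhere other than /corpora first.

```shell
# Sketch: create <root>/c1/registry and point CORPUS_REGISTRY at it.
# setup_registry is a hypothetical helper name; <root> is /corpora in these notes.
setup_registry() {
    root="$1"
    mkdir -p "$root/c1/registry" || return 1
    CORPUS_REGISTRY="$root/c1/registry"
    export CORPUS_REGISTRY
}

# Example call (creating a directory directly under / normally needs
# administrator rights):
# setup_registry /corpora
```

The variable set this way only lasts for the current shell session. To keep it, the export line (or the setenv line for TCSH) is typically added to the shell's startup file, such as ~/.bash_profile or ~/.tcshrc.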

If you receive a corpus that is already encoded with cwb-encode (like the demo corpora), you will most probably receive an archive containing a data/ directory and a registry/ directory.

Rename data/ to something more interesting (“dickens”, “german-law”, whatever makes you happy).

Move the renamed data/ directory into /corpora.

Move the contents of registry/ into /corpora/c1/registry (just one text file, containing the information that cqp needs to use the corpus).
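
The rename and the two moves above can be written as one small function. This is a sketch under the assumption that the archive has already been unpacked into a folder containing data/ and registry/; install_encoded_corpus is a hypothetical helper name.

```shell
# Sketch: install an unpacked, pre-encoded corpus.
# install_encoded_corpus is a hypothetical helper name, not part of CWB.
#   $1 = folder holding the unpacked data/ and registry/ directories
#   $2 = new name for the data directory (e.g. dickens)
#   $3 = corpora root (/corpora in these notes)
install_encoded_corpus() {
    unpacked="$1"; name="$2"; root="$3"
    if [ -e "$root/$name" ]; then
        echo "refusing to overwrite $root/$name" >&2
        return 1
    fi
    mkdir -p "$root/c1/registry" || return 1
    # Rename data/ and move it under the corpora root in one step.
    mv "$unpacked/data" "$root/$name" || return 1
    # Move the registry file(s) next to the other registry entries.
    mv "$unpacked"/registry/* "$root/c1/registry/"
}

# Example call:
# install_encoded_corpus "$HOME/Desktop/DemoCorpus" dickens /corpora
```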

Let's say you are installing the DICKENS demo corpus. Your /corpora directory should now look like this:

  /
    corpora/
      dickens/                 (old "data/")
        book_num.avs
        book_num.avx
        book_num.rng
        book.rng
        chapter_num.avs
        [...]
      c1/
        registry/
          dickens

Now browse into the /corpora/c1/registry directory and open the “dickens” file you just moved into it. The file's contents will include the following lines (just an example, it includes much more…):

  #
  # CWB registry entry for corpus DICKENS
  #
  
    NAME "IMS Corpus Workbench Demo Corpus (Novels by Charles Dickens)"
    ID dickens
    HOME data

Do not touch anything, except the line defining “HOME”. Replace data with the path to the new directory containing the data in /corpora. In our case, we replace data with /corpora/dickens/. Save the file.
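
If you prefer not to open an editor, the same change can be made with sed. This is a sketch: set_registry_home is a hypothetical helper name, and it assumes that the new path contains none of the characters |, & or \, which have a special meaning in the sed expression used here.

```shell
# Sketch: rewrite only the HOME line of a registry file.
# set_registry_home is a hypothetical helper name, not part of CWB.
set_registry_home() {
    regfile="$1"; newhome="$2"
    # Keep any leading whitespace and replace everything after HOME.
    # The result goes to a temporary file first, because sed's in-place
    # option differs between the BSD and GNU versions.
    sed "s|^\([[:space:]]*\)HOME[[:space:]].*|\1HOME $newhome|" "$regfile" > "$regfile.tmp" &&
        mv "$regfile.tmp" "$regfile"
}

# Example call:
# set_registry_home /corpora/c1/registry/dickens /corpora/dickens/
```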

If you go back to the terminal, you will now be able to type the command cqp and use the installed corpus:

   $ cqp -e
     [no corpus]> show corpora;
       System corpora:
         D: DICKENS     
     [no corpus]> DICKENS;
     DICKENS> "charles";
       0 matches.
     DICKENS> "house";  
       3445: shutting up the counting- <house> arrived . With an ill-wi
       3810: there when it was a young <house> , playing at hide-and-se
       3890:  black old gateway of the <house> , that it seemed as if t
       4369: und resounded through the <house> like thunder . Every roo
       5087:  so did every bell in the <house> . This might have lasted
       [...]

If you have installed CWB as indicated above, encoding a new corpus is very straightforward. Start by reading the Corpus Encoding Tutorial.

Make sure your corpus is formatted one token per line, as indicated in the Corpus Encoding Tutorial, optionally with additional columns for positional attributes (POS, LEMMA, etc.).
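
Before encoding, it can help to check that every token line has the same number of columns. The sketch below assumes that the columns are separated by tabs and that lines starting with "<" are XML tags for structural attributes; check the tutorial if your files are laid out differently. check_columns is a hypothetical helper name.

```shell
# Sketch: report lines whose number of tab-separated columns is not the
# expected one. Lines starting with "<" are assumed to be XML tags and
# are skipped. check_columns is a hypothetical helper, not part of CWB.
check_columns() {
    expected="$1"; input="$2"
    awk -F '\t' -v n="$expected" '
        /^</    { next }
        NF != n { printf "line %d: %d column(s)\n", NR, NF; bad = 1 }
        END     { exit bad + 0 }
    ' "$input"
}

# Example: word, POS and lemma make three columns.
# check_columns 3 mytext.txt
```

The function prints nothing and returns success when all token lines have the expected number of columns.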

If your corpus consists of more than one file, it is advisable to put all the files together in a single gzipped file (e.g. using something like gzip -c *.txt > newcorpus.gz).

Create a directory /corpora/newcorpus (replace newcorpus with your corpus's name…).

Browse to the directory where your corpus is stored.

  $ cd /path/to/newcorpus/files

Issue the cwb-encode command, remembering that your encoded data will “live” in /corpora/newcorpus, and that the registry file for newcorpus will have to be saved under /corpora/c1/registry.

You will also have to define the -P and -S flags according to the characteristics of newcorpus. We are using a simple example:

  $ cwb-encode -d /corpora/newcorpus -f newcorpus.gz -R /corpora/c1/registry/newcorpus -P pos -P lemma -S text -S file -S s -S p
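
Because the data directory and the registry directory both have to be in place before cwb-encode runs, a small preparation step can save a failed run. The sketch below only creates the data directory and prints the command line used in this example; it does not run cwb-encode. prepare_encode is a hypothetical helper name, and the -P and -S flags are simply the ones from the example above.

```shell
# Sketch: make sure the directories exist, then print the cwb-encode call.
# prepare_encode is a hypothetical helper name, not part of CWB.
#   $1 = corpus name (e.g. newcorpus)
#   $2 = corpora root (/corpora in these notes)
prepare_encode() {
    name="$1"; root="$2"
    if [ ! -d "$root/c1/registry" ]; then
        echo "registry directory $root/c1/registry is missing" >&2
        return 1
    fi
    mkdir -p "$root/$name" || return 1
    # Same flags as in the example above; adapt -P and -S to your corpus.
    echo "cwb-encode -d $root/$name -f $name.gz -R $root/c1/registry/$name -P pos -P lemma -S text -S file -S s -S p"
}

# Example call:
# prepare_encode newcorpus /corpora
```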

Your terminal will probably tell you if there have been any problems. If nothing critical happens, go on.

If you have gotten this far, then you're almost done.

All that is missing are the indexes that cqp needs in order to use the imported data. You will need to issue just one command: cwb-makeall -V NEWCORPUS.

Beware: type “newcorpus” in uppercase… I got errors when typing it in lowercase.

  $ cwb-makeall -V NEWCORPUS
  === Makeall: processing corpus NEWCORPUS ===
  Registry directory: /corpora/c1/registry
  ATTRIBUTE word
    + creating LEXSRT ... OK
    - lexicon      OK
    + creating FREQS ... OK
    - frequencies  OK
    - token stream OK
    + creating REVCIDX ... OK
    + creating REVCORP ... OK
    ? validating REVCORP ... OK
    - index        OK
  ATTRIBUTE pos
    + creating LEXSRT ... OK
    - lexicon      OK
    + creating FREQS ... OK
    - frequencies  OK
    - token stream OK
    + creating REVCIDX ... OK
    + creating REVCORP ... OK
    ? validating REVCORP ... OK
    - index        OK
  ATTRIBUTE lemma
    + creating LEXSRT ... OK
    - lexicon      OK
    + creating FREQS ... OK
    - frequencies  OK
    - token stream OK
    + creating REVCIDX ... OK
    + creating REVCORP ... OK
    ? validating REVCORP ... OK
    - index        OK
  ========================================
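
If you script this step, the uppercase form of the corpus name can be derived with tr instead of being typed by hand. This is a minimal sketch; corpus_id is a hypothetical helper name.

```shell
# Sketch: turn a lowercase corpus name into the uppercase form
# that cwb-makeall expected in the example above.
# corpus_id is a hypothetical helper name, not part of CWB.
corpus_id() {
    printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]'
}

# Example call:
# cwb-makeall -V "$(corpus_id newcorpus)"
```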

That's it! NEWCORPUS is now available in cqp:

  $ cqp -e
    [no corpus]> show corpora;
      System corpora:
        D: DICKENS     
        N: NEWCORPUS

¹⁾

WARNING: if you are not familiar with the UNIX environment in your Mac OS X system, do not try this!!!

²⁾

If you put your registry/ directory elsewhere, everything works smoothly until you use cwb-encode with a new corpus. At that point cqp tells you that /corpora/c1/registry is needed, which is probably a bug. This is what happened to me after putting my registry in ~/corpora/registry and encoding a corpus with cwb-encode:

  $ cqp -e
  Warning:
    Couldn't open directory /corpora/c1/registry (continuing)

After that message, I had to move everything to / (root). If you follow the instructions above, you shouldn't have these problems.
