======CWB beta version (pre-3.0), binaries for Mac OS X======

Installation and early operation notes for Mac OS X.

Emiliano Guevara, 08/12/2006. Contact: emiliano@lingue.unibo.it

  * **Archive used**: cwb-2.2.b91-powerpc-darwin.tar.gz
  * **System**: Mac OS X 10.4.8
  * **Machine and processor**: MacBook Pro, 2.16 GHz Intel Core 2 Duo

=====1. CWB Installation=====

To install, copy the appropriate archive for Mac OS X into your Desktop folder. Expand the archive (double-click it in the Finder, or use the commands ''gunzip'' and ''tar'' on it). You will then have a new directory on your Desktop named ''cwb-<version number>''. If you browse that folder you will see that it contains a number of subdirectories, each holding further subdirectories and binary files:

  cwb-<version-number>/
    bin/
      cwb-align-encode
      cqpcl
      cqp
      cwb-compress-rdx
      cwb-align-show
      cwb-s-decode
      cwb-atoi
      cwb-align
      cwb-decode
      cwb-makeall
      cwb-lexdecode
      cwb-describe-corpus
      cwb-encode
      cwb-huffcode
      cwb-itoa
      cwb-s-encode
      cwb-scan-corpus
    include/
      cwb/
        cl.h
    lib/
      libcl.a
    man/
      man1/
        cqp.1
        cwb-encode.1

Using //Terminal.app//, move all these files into the corresponding directories in your system, i.e. into ''/usr/local/'' (make sure you don't replace any existing directories or files, just add to them)((**WARNING**: if you are not familiar with the UNIX environment in your Mac OS X system, do not try this!)).

  /
    usr/
      local/
        bin/
        include/
        lib/
        man/
          man1/

You have to move the contents of ''cwb-<version number>/bin'' into ''/usr/local/bin'', then the contents of ''cwb-<version number>/include'' into ''/usr/local/include'', and so on. Once this is done, you can type all of CWB's commands in your terminal (they are the binary files you moved into ''/usr/local/bin''), and you can also read the man pages for ''cqp'' and ''cwb-encode''.
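The copy step described above can be sketched as a small shell function. This is only an illustration under assumptions: the function name ''install_cwb'' and its two arguments (the unpacked release directory and the install prefix) are made up for this example and are not part of CWB. It copies rather than moves, and it skips any file that already exists at the destination, which is one way to honour the "just add, don't replace" advice.

```shell
#!/bin/sh
# Sketch: copy an unpacked CWB binary release into an install prefix
# without replacing files that are already there.
# install_cwb is a hypothetical helper name, not a CWB command.

install_cwb() {
    src=$1      # unpacked release directory (the cwb-<version number> folder)
    prefix=$2   # install prefix, /usr/local in the notes above

    for sub in bin include lib man; do
        # Skip subdirectories that the release does not contain.
        [ -d "$src/$sub" ] || continue

        # List every regular file below the subdirectory, relative to $src.
        (cd "$src" && find "$sub" -type f) | while IFS= read -r rel; do
            dest="$prefix/$rel"
            if [ -e "$dest" ]; then
                # Never overwrite something that is already installed.
                echo "skip (exists): $dest"
            else
                mkdir -p "$(dirname "$dest")"
                cp -p "$src/$rel" "$dest"
                echo "installed: $dest"
            fi
        done
    done
}
```

Called with the unpacked ''cwb-<version number>'' directory as first argument and ''/usr/local'' as second, it prints one line per file. Writing into ''/usr/local'' usually requires administrator rights.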
To find a corpus, CQP uses the environment variable ''$CORPUS_REGISTRY''. It has to point to a directory ''registry/'' where the corpora on your system are defined. In theory, ''registry/'' could be located anywhere, but in my experience it is better to create the following directory tree in ''/'' (root):

  /
    corpora/
      c1/
        registry/

Then you must set your environment, using one of the following commands((If you try putting your ''registry/'' directory elsewhere, it will work smoothly until you try to use ''cwb-encode'' with a new corpus. At that point, ''cqp'' will tell you that ''/corpora/c1/registry'' is needed; this is probably a bug. This is what happened to me after putting my registry in ''~/corpora/registry'' and trying to encode a corpus with ''cwb-encode'': running ''$ cqp -e'' printed ''Warning: Couldn't open directory /corpora/c1/registry (continuing)''. After that message, I had to move everything to ''/'' (root). If you follow the instructions above, you shouldn't have these problems.)):

  * if you use the TCSH shell:

  setenv CORPUS_REGISTRY "/corpora/c1/registry"

  * if you use the BASH shell:

  export CORPUS_REGISTRY="/corpora/c1/registry"

=====2. Installing an encoded/indexed corpus=====

If you receive a corpus that is already encoded with ''cwb-encode'' (like the demo corpora), you will most probably receive an archive containing a ''data/'' directory and a ''registry/'' directory. Rename ''data/'' to something more interesting ("dickens", "german-law", whatever makes you happy). Move the renamed ''data/'' directory into ''/corpora''. Move the contents of ''registry/'' into ''/corpora/c1/registry'' (just one text file, containing the information that ''cqp'' needs to use the corpus).

Let's say you are installing the DICKENS demo corpus. You now have the following situation in your ''/corpora'' directory:

  /
    corpora/
      dickens/    (old "data/")
        book_num.avs
        book_num.avx
        book_num.rng
        book.rng
        chapter_num.avs
        [...]
      c1/
        registry/
          dickens

Now browse into the ''/corpora/c1/registry'' directory and open the "dickens" file you just moved into it. The file's contents will include the following lines (just an excerpt; the file contains much more):

  #
  # CWB registry entry for corpus DICKENS
  #
  NAME "IMS Corpus Workbench Demo Corpus (Novels by Charles Dickens)"
  ID dickens
  HOME data

Do not touch anything except the line defining ''HOME''. Replace ''data'' with the path to the new directory containing the data in ''/corpora''. In our case, we replace ''data'' with ''/corpora/dickens/''. Save the file.

If you go back to the terminal, you will now be able to type the command ''cqp'' and use the installed corpus:

  $ cqp -e
  [no corpus]> show corpora;
  System corpora:
  D: DICKENS
  [no corpus]> DICKENS;
  DICKENS> "charles";
  0 matches.
  DICKENS> "house";
  3445: shutting up the counting- <house> arrived . With an ill-wi
  3810: there when it was a young <house> , playing at hide-and-se
  3890: black old gateway of the <house> , that it seemed as if t
  4369: und resounded through the <house> like thunder . Every roo
  5087: so did every bell in the <house> . This might have lasted
  [...]

=====3. Encoding a corpus=====

If you have installed CWB as indicated above, encoding a new corpus is straightforward. Go on and read the //Corpus Encoding Tutorial//.

====3.1. Preparation====

Make sure your corpus is formatted one token per line, as indicated in the //Corpus Encoding Tutorial//, optionally with additional columns for positional attributes (POS, LEMMA, etc.). If your corpus consists of more than one file, it is advisable to put all the files together in a single gzipped file (e.g. using something like ''gzip -c *.txt > newcorpus.gz'').

====3.2. Import "newcorpus"====

Create a directory ''/corpora/newcorpus'' (substitute //newcorpus// with your corpus's name). Browse to the directory where your corpus is stored.
  $ cd /path/to/newcorpus/files

Issue the ''cwb-encode'' command, remembering that your encoded data will "live" in ''/corpora/newcorpus'', and that the registry file for newcorpus has to be saved under ''/corpora/c1/registry''. You will also have to set the ''-P'' and ''-S'' flags according to the characteristics of //newcorpus//. We are using a simple example:

  $ cwb-encode -d /corpora/newcorpus -f newcorpus.gz -R /corpora/c1/registry/newcorpus -P pos -P lemma -S text -S file -S s -S p

Your terminal will probably tell you if there have been any problems. If nothing critical happens, go on.

====3.3. Index "newcorpus"====

If you have gotten this far, you're almost done. All that is missing are the indexes that ''cqp'' needs in order to use the imported data. You need to issue just one command: ''cwb-makeall -V NEWCORPUS''. **Beware**: type "newcorpus" in uppercase; I got errors when typing it in lowercase.

  $ cwb-makeall -V NEWCORPUS
  === Makeall: processing corpus NEWCORPUS ===
  Registry directory: /corpora/c1/registry
  ATTRIBUTE word
  + creating LEXSRT ... OK
  - lexicon OK
  + creating FREQS ... OK
  - frequencies OK
  - token stream OK
  + creating REVCIDX ... OK
  + creating REVCORP ... OK
  ? validating REVCORP ... OK
  - index OK
  ATTRIBUTE pos
  + creating LEXSRT ... OK
  - lexicon OK
  + creating FREQS ... OK
  - frequencies OK
  - token stream OK
  + creating REVCIDX ... OK
  + creating REVCORP ... OK
  ? validating REVCORP ... OK
  - index OK
  ATTRIBUTE lemma
  + creating LEXSRT ... OK
  - lexicon OK
  + creating FREQS ... OK
  - frequencies OK
  - token stream OK
  + creating REVCIDX ... OK
  + creating REVCORP ... OK
  ? validating REVCORP ... OK
  - index OK
  ========================================

That's it! Now ''NEWCORPUS'' will be available in cqp's interface:

  $ cqp -e
  [no corpus]> show corpora;
  System corpora:
  D: DICKENS
  N: NEWCORPUS

Last modified: 2006/12/08 22:03 by emiliano
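The encoding and indexing steps from sections 3.2 and 3.3 can be chained in one small script. The following is a sketch under assumptions, not an official CWB tool: the names ''encode_and_index'', ''run'' and ''DRY_RUN'' are invented for this example, and the ''-P''/''-S'' attributes are simply the ones from the example above and must be adapted to your corpus. It only uses the ''cwb-encode'' and ''cwb-makeall'' options shown in these notes, and it assumes ''$CORPUS_REGISTRY'' is already set as described in section 1.

```shell
#!/bin/sh
# Sketch: create the data directory, encode the corpus, then build the indexes.
# encode_and_index, run and DRY_RUN are hypothetical names for this example.

# Print the command instead of running it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

encode_and_index() {
    name=$1                              # lowercase corpus name, e.g. newcorpus
    input=$2                             # one-token-per-line input, e.g. newcorpus.gz
    root=${3:-/corpora}                  # parent of the data directories
    registry=${4:-/corpora/c1/registry}  # registry directory

    # cwb-makeall is given the corpus name in uppercase (see section 3.3).
    upper=$(printf '%s' "$name" | tr '[:lower:]' '[:upper:]')

    # Stop at the first step that fails.
    run mkdir -p "$root/$name" &&
    run cwb-encode -d "$root/$name" -f "$input" \
        -R "$registry/$name" \
        -P pos -P lemma -S text -S file -S s -S p &&
    run cwb-makeall -V "$upper"
}
```

Setting ''DRY_RUN=1'' before calling ''encode_and_index newcorpus newcorpus.gz'' prints the three commands without executing them, which is a cheap way to check the paths before touching ''/corpora''.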