This page was adapted from CorpusCatcher's README file.
CorpusCatcher is a corpus collection toolset created to facilitate the creation of next-generation spell-checkers by Translate.org.za.
It was written in Python and can therefore easily be used, in part or in whole, in other Python projects. It was originally written to simplify the use of BootCaT (http://sslmit.unibo.it/~baroni/tools_and_resources.html), but has since grown to replace the BootCaT parts it used with Python ports.
These tools are simple command-line tools written in Python, so all that is needed for installation is to extract all the files in the distribution archive into a directory.
If you have EasyInstall installed, you can install mechanize with the following command (Windows or Linux):
You can download the latest version of mechanize from here. To install, extract the archive and run the following from the command-line in the directory where the archive was extracted (Windows or Linux):
python setup.py install
You can download the latest version of pYsearch from here. To install, extract the archive and run the following from the command-line in the directory where the archive was extracted (Windows or Linux):
python setup.py build
python setup.py install
In this example I will demonstrate the usage of these tools by creating a language corpus for Zulu (http://en.wikipedia.org/wiki/Zulu#Language).
$ cd corpuscatcher
$ cat zulu/seeds.txt
ukuthi ukuba futhi noma kodwa kuhle kahle manje kanye
$ python corpus_collect.py -o zulu zulu/seeds.txt
... (A lot of output about the state of the tool) ...
You should notice that corpus_collect.py downloaded a number of HTML files into the <output dir>/data directory and that each HTML file has a corresponding text file (xxx.html means that there is also an xxx.txt).
$ python clean_corpus.py -l zulu/data/*.txt > zulu/corpus.txt
And that's it! You now have a language corpus for Zulu in zulu/corpus.txt based on the seeds you specified in the second step. Exploring the zulu directory, you will find information from the different stages of the collection process:
$ ls zulu/
corpus.txt   # The corpus file created in the previous command
data/        # The directory containing HTML and text data
seeds.txt    # A file containing the initial seeds
tuples.txt   # The search tuples passed to Yahoo! to find the URLs
urls.txt     # The downloaded URLs
See below for more information on the capabilities of corpus_collect.py and clean_corpus.py.
This section will describe the theoretical steps of the corpus collection process and explain what each step entails.
The whole process of corpus collection and text extraction is split up into six steps, with each step's output being the next step's input. These steps (and the script or person responsible for each) are:
Seeds, in this context, are words used to create search queries that will yield a high percentage of pages in the target language.
This step generates n-tuples from the seeds that will be used as search queries in the next step. The length of the tuple can be specified with the -n option of corpus_collect.py (default 3) and the number of tuples to generate with the -l option (default 10). Seeds are chosen at random to create a unique tuple of the specified length.
Once the tuples are generated, they are saved to a file tuples.txt in the output directory.
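The tuple generation described above can be sketched as follows. This is only an illustration, not the code in corpus_collect.py: the function name generate_tuples and its parameters are hypothetical, and it assumes that two tuples containing the same seeds in a different order count as the same tuple.

```python
import math
import random


def generate_tuples(seeds, num_elements=3, num_tuples=10, rng=None):
    """Return up to num_tuples unique tuples of num_elements distinct seeds.

    Hypothetical helper: the defaults mirror the documented -n and -l defaults.
    """
    rng = rng or random.Random()
    seeds = sorted(set(seeds))
    if num_elements > len(seeds):
        raise ValueError("not enough seeds for the requested tuple length")
    # Never ask for more tuples than there are distinct combinations.
    target = min(num_tuples, math.comb(len(seeds), num_elements))
    seen = set()
    while len(seen) < target:
        # Sorting the sample makes the tuple order-independent (an assumption).
        seen.add(tuple(sorted(rng.sample(seeds, num_elements))))
    return sorted(seen)


if __name__ == "__main__":
    zulu_seeds = "ukuthi ukuba futhi noma kodwa kuhle kahle manje kanye".split()
    for query in generate_tuples(zulu_seeds):
        print(" ".join(query))
```

With nine seeds and tuples of three there are 84 possible combinations, so asking for the default ten tuples leaves plenty of room for variety between runs.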
Each tuple from the previous step is used as a Yahoo! search query and the first 10 results (default) are saved to be downloaded in the next step. The number of results to find can be changed with corpus_collect.py's -u option.
Each URL found in the previous step is downloaded, provided that it is a text file (its mimetype starts with “text/”) and that it hasn't been downloaded previously. The downloaded file is saved as output dir/data/hash.html, where output dir is the directory specified with corpus_collect.py's -o option and hash is the MD5 sum of the URL from which the page was downloaded. The original URL is also stored in the downloaded HTML file as an XML comment on the first line.
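A minimal sketch of the naming scheme and checks described above. The helper names are hypothetical, and two details are assumptions rather than facts about corpus_collect.py: that the URL is encoded as UTF-8 before hashing, and the exact layout of the comment line.

```python
import hashlib
import os


def is_text_mimetype(content_type):
    """True when a Content-Type value starts with "text/"."""
    return content_type.strip().lower().startswith("text/")


def page_path(output_dir, url):
    """Build <output dir>/data/<MD5 of the URL>.html."""
    # Assumption: the URL is hashed as UTF-8 bytes.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return os.path.join(output_dir, "data", digest + ".html")


def with_url_comment(url, html):
    """Put the source URL on the first line as a comment (format assumed)."""
    return "<!-- %s -->\n%s" % (url, html)


if __name__ == "__main__":
    url = "http://example.com/"
    if is_text_mimetype("text/html; charset=utf-8"):
        print(page_path("zulu", url))
        print(with_url_comment(url, "<html></html>"))
```

Because the file name is derived from the URL alone, checking whether that file already exists is one simple way to detect a page that has been downloaded before.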
Crawling is turned off by default, but can be enabled by specifying a crawl depth (-d option) greater than 0. So giving corpus_collect.py “-d 1” on the command-line will cause it to download the links found in the previous step as well as all links on those pages, but nothing further.
Crawling only occurs on the site where the original link was found, unless the -S option is specified.
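To make the effect of the crawl depth and the site restriction concrete, here is a small sketch that works on an in-memory link graph instead of the network. The names crawl, same_site and get_links are hypothetical, and comparing host names is only one possible definition of "the same site".

```python
from urllib.parse import urlparse


def same_site(url_a, url_b):
    """Assumption: two URLs are on the same site when their hosts match."""
    return urlparse(url_a).netloc.lower() == urlparse(url_b).netloc.lower()


def crawl(start_url, get_links, depth=0, site_only=True):
    """Return the URLs to download, starting with start_url.

    get_links(url) must return the links found on that page.
    """
    seen = [start_url]
    frontier = [start_url]
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            for link in get_links(url):
                if link in seen:
                    continue  # never download the same URL twice
                if site_only and not same_site(start_url, link):
                    continue  # stay on the site of the original page
                seen.append(link)
                next_frontier.append(link)
        frontier = next_frontier
    return seen


if __name__ == "__main__":
    links = {
        "http://a.example/": ["http://a.example/1", "http://b.example/x"],
        "http://a.example/1": ["http://a.example/2"],
    }
    for depth in (0, 1, 2):
        print(depth, crawl("http://a.example/", lambda u: links.get(u, []), depth))
```

A depth of 0 returns only the original page, which matches crawling being off by default.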
All downloaded HTML pages are now put through the html2text() function from h2t.py and the text files are stored in output dir/data with the same filename as the HTML page, only with a .txt extension. The URL on the first line of the source HTML page is preserved as the first line in the text file.
Note that the width of each HTML page is set in h2t.py (MAXCOL) with a default of 10000. This is to allow each paragraph in the HTML source to be treated as a unit in the next step.
At this point you should have a number of HTML pages and their text equivalents in output dir/data. clean_corpus.py can now take these (or any other collection of) text files and perform some cleaning and filtering operations. It will automatically remove any punctuation or formatting characters and can optionally test whether each word is in a “good word”- or “bad word” list. Each line (paragraph) is tested for validity based on the number of “good”-, “bad”- and “unsure” words found therein. If it is not valid, it is completely discarded from the rest of the process and hence also from the output.
A “bad word” list might be a list of English words, so as to filter out any non-content parts of the text (such as navigation links and toolbars in the original HTML). A “good word” list might be used to enable a more informed decision about the inclusion of a line/paragraph during the test for validity. See the Determining Paragraph Validity section for more information on specifying your own line validity test.
The output of this step is the cleaned and filtered input text in one of two forms: either a sorted list of words with duplicates removed, or the text in the original format as it was found in the source file.
This section explains all options for corpus_collect.py and clean_corpus.py and what they are to be used for. Not all possible combinations of options have been tested, so incompatible or conflicting options may have unforeseen effects.
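The word classification at the heart of this step might look roughly like the sketch below. It is not taken from clean_corpus.py: count_words is a hypothetical name, only ASCII punctuation is stripped, and a word found in both lists is counted as "good" (the text does not say how that case is handled).

```python
import string

# Translation table that deletes ASCII punctuation (an assumption; the real
# tool may strip a different set of characters).
_PUNCTUATION = str.maketrans("", "", string.punctuation)


def count_words(line, good=frozenset(), bad=frozenset()):
    """Return (good, bad, unsure) word counts for one line (paragraph)."""
    n_good = n_bad = n_unsure = 0
    for word in line.translate(_PUNCTUATION).split():
        if word in good:
            n_good += 1
        elif word in bad:
            n_bad += 1
        else:
            n_unsure += 1
    return n_good, n_bad, n_unsure


if __name__ == "__main__":
    print(count_words("Hello, ukuthi click here!",
                      good={"ukuthi"}, bad={"click", "here"}))
```

The three counts are exactly what the validity test receives, so a line made up mostly of navigation words such as "click" and "here" can be rejected as a whole.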
$ python corpus_collect.py --help
Usage: corpus_collect.py [<options>] [<seedfile>]

Options:
  -h, --help            show this help message and exit
  -q, --quiet           Suppress output (quiet mode).
  -o OUTPUTDIR, --output-dir=OUTPUTDIR
                        Output directory to use.
  -n NUMELEMENTS, --num-elements=NUMELEMENTS
                        The number of seeds elements per tuple.
  -l TUPLELENGTH, --tuple-list-length=TUPLELENGTH
                        The number of tuples to create from seeds.
  -u URLS, --urls-per-tuple=URLS
                        The number of search results to retrieve per tuple.
  -d CRAWLDEPTH, --crawl-depth=CRAWLDEPTH
                        Follow this many levels of links from each URL.
  -S, --no-site-only    When following the links on a page, do not stay on
                        the original site.
  -t TFILE, --tuple-file=TFILE
                        Do not calculate tuples, use tuples from TFILE.
  -U UFILE, --url-file=UFILE
                        Do not search for URLs, use URLs from UFILE.
  -p DIR, --page-dir=DIR
                        Do not download pages, use pages in directory DIR.
  --skip-urls           Skip URL collection. Implied by -U.
  --skip-download       Skip downloading of URLs. Implied by -p.
  --skip-convert        Skip convertion of HTML to text.
The -h and -q options are self-explanatory.
The -o option is used to specify a “working directory” where all operations should take place. This is useful for keeping separate corpora for different languages or based on different seeds. After going through all the stages that corpus_collect.py supports, the output directory should contain the files and directory as shown in the Example section.
Options -n and -l specify the number of seeds per search tuple and the number of search tuples to generate, respectively, with defaults of 3 and 10. The -u option specifies how many results to get from Yahoo! for each search tuple.
Options -d and -S are related to web-crawling while downloading found pages. -d specifies the crawl depth (default 0). A crawl depth of 1 would mean that the original page and all the pages linked from it will be downloaded. A crawl depth of 2 would mean that all pages linked from the pages found at crawl depth 1 are also downloaded. And so on. By default the crawling will be restricted to the site of the original page. This behaviour can be turned off with the -S option. WARNING: Turning site-specific crawling off can have a severe impact on the quality of your corpus and can easily lead to a lot of wasted bandwidth. You have been warned.
Options -t, -U and -p are to enable the continuation (or redo) of a previous collection process without redoing the work already done.
-t will read the search tuples to be used in step 3 from the specified file (TFILE). This means that the tuple generation step is skipped. This is useful for using predefined searches to find your corpora. As an example you could add site:.za to each tuple to only search for pages ending in .za (South African pages).
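As a sketch of how such a predefined tuple file could be prepared, the helper below appends a search operator to every tuple. It assumes that a tuple file holds one tuple per line, which the text does not state, and restrict_tuples is a hypothetical name.

```python
def restrict_tuples(lines, operator="site:.za"):
    """Append a search operator to every non-empty tuple line."""
    return ["%s %s" % (line.strip(), operator)
            for line in lines if line.strip()]


if __name__ == "__main__":
    tuples = ["ukuthi futhi noma\n", "\n", "kodwa kahle manje\n"]
    for query in restrict_tuples(tuples):
        print(query)
```

The resulting lines could be written to a new file and passed to corpus_collect.py with -t.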
-U will read the URLs to download from the specified file (UFILE) instead of querying Yahoo! for them. The tuple generation and search steps are thus skipped.
-p will search the specified directory (DIR) for HTML pages and convert them to text. All other steps are thus skipped.
Options --skip-urls, --skip-download and --skip-convert are used to explicitly specify steps to skip (-t, -U and -p implicitly skip steps). Where -t, -U and -p will generally be used to skip beginning steps, the --skip-* options are meant to be used to skip finishing steps. For example:
$ python corpus_collect.py -t tuples.txt --skip-convert seeds.txt
This command will skip search tuple generation (-t option), then search for and download pages, but not convert the downloaded pages to text (--skip-convert). Adding --skip-download will result in only step 3 (finding URLs) being executed.
$ python clean_corpus.py --help
Usage: clean_corpus.py [<options>] <file1> [<file2> ...]

Options:
  -h, --help            show this help message and exit
  -b BADFILE, --bad-file=BADFILE
                        File containing words considered bad (not in the
                        target language).
  -g GOODFILE, --good-file=GOODFILE
                        File containig words considered good (in the target
                        language).
  -m, --mark-bad        Mark any "bad words" found, don't remove them.
  -l, --list            Print output as a list of words.
The -h option is self-explanatory.
-b and -g are used to specify a “bad words”- and “good words”- file respectively. Each file is split up into words (see the Python documentation for str.split() for more information) and used to count a word from a source paragraph as “good” (word is in GOODFILE), “bad” (word is in BADFILE) or “unsure” (the word is not in either GOODFILE or BADFILE). These counts are used by a function that determines a paragraph's validity. See section Determining Paragraph Validity for more information.
-m will cause “bad words” found to be marked, with two leading and two trailing underscores, instead of being removed from the output. This option will be useless without specifying a “bad words” file (-b).
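The marking described for -m amounts to something like the following sketch. The function name mark_bad_words is hypothetical, and the line is assumed to have had its punctuation removed already.

```python
def mark_bad_words(line, bad):
    """Wrap every "bad" word in two leading and two trailing underscores."""
    return " ".join("__%s__" % word if word in bad else word
                    for word in line.split())


if __name__ == "__main__":
    print(mark_bad_words("ukuthi click here", {"click", "here"}))
```

Marking instead of removing is handy while tuning a "bad words" file, because it shows exactly which words in the output were matched.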
-l causes the output of clean_corpus.py to be a sorted list of unique words in the input (one word per line). Note that the test for uniqueness is case sensitive, so “word”, “Word” and “WoRd” are all different words. Filtering out duplicates and sorting is deferred until all specified input files have been read. The result is thus a sorted list of unique words over all the input files. Output using this option is suitable for direct use in Spelt (http://translate.sourceforge.net/wiki/spelt/index).
Input files specified (<file1> [<file2> …]) should be a list of files and directories to clean. For each directory specified, all text files contained therein will be regarded as input. The most convenient way to specify the input files (in the context of use with corpus_collect.py) is output_dir/data/*.txt, where output_dir is the “output directory” specified with corpus_collect.py's -o option. This does not work as expected on the Windows command line, because the wildcard is not expanded by the shell as it is in Bash (for instance).
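The list output can be pictured with this short sketch. The name unique_sorted_words is hypothetical, and Python's default string ordering is used here, which may differ from the ordering the tool itself produces.

```python
def unique_sorted_words(lines):
    """Collect the words of all lines, drop duplicates, then sort."""
    words = set()
    for line in lines:
        words.update(line.split())
    # Sorting happens once, after every line has been read.
    return sorted(words)


if __name__ == "__main__":
    for word in unique_sorted_words(["word Word", "WoRd word"]):
        print(word)
```

Since no case folding is applied, "word", "Word" and "WoRd" each keep their own entry in the list.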
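Where the shell leaves wildcards unexpanded, the expansion can be done in Python before the file names are handed on. The sketch below is a hypothetical helper, not part of clean_corpus.py, and it assumes that "text files" in a directory means files ending in .txt.

```python
import glob
import os


def expand_inputs(args):
    """Turn files, directories and wildcard patterns into a list of files."""
    files = []
    for arg in args:
        if os.path.isdir(arg):
            # Assumption: the text files in a directory are its *.txt files.
            files.extend(sorted(glob.glob(os.path.join(arg, "*.txt"))))
        else:
            matches = sorted(glob.glob(arg))
            # Keep an argument that matches nothing, so that a later step can
            # report the missing file.
            files.extend(matches or [arg])
    return files


if __name__ == "__main__":
    import sys
    print("\n".join(expand_inputs(sys.argv[1:])))
```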
This section describes more advanced topics, most of which will be regarding guided (and encouraged) changes to the source code for specific actions.
corpus_collect.py provides the facilities to apply your own defined filters to HTML text after download and text after conversion from HTML. These filters are defined in the functions filterhtml() and filtertext() in corpus_collect.py.
The filterhtml() function receives the HTML source of a downloaded file as parameter input and the return-value of the function replaces the original HTML source. This happens after the page has been downloaded, but before it is saved to disk.
Similarly, filtertext() receives the text converted from HTML as parameter input and its return-value replaces the text before it is written to disk.
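As an illustration of what such filters might do, the sketch below removes script and style elements from the HTML and drops empty lines from the converted text. Only the function names and the parameter name input come from the description above; the bodies are examples and not the defaults shipped with corpus_collect.py.

```python
import re

# Matches a whole <script>...</script> or <style>...</style> element.
_SCRIPT_OR_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>",
                              re.IGNORECASE | re.DOTALL)


def filterhtml(input):
    """Example filter: strip script and style elements from the HTML source."""
    # The parameter is called "input" to match the description, even though
    # it shadows the builtin of the same name.
    return _SCRIPT_OR_STYLE.sub("", input)


def filtertext(input):
    """Example filter: strip trailing whitespace and drop empty lines."""
    lines = (line.rstrip() for line in input.splitlines())
    return "\n".join(line for line in lines if line) + "\n"


if __name__ == "__main__":
    print(filterhtml("<p>a</p><script>x()</script><p>c</p>"))
    print(filtertext("line one  \n\n\nline two\n"), end="")
```

A regular expression is a blunt tool for HTML, so treat the first filter as a starting point rather than a robust sanitiser.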
These functions are free to do anything, just don't come crying when it doesn't do what you want it to do; it's your code. :)
As explained in earlier sections, each line in a text file (assumed to represent a paragraph) is tested for validity by some arbitrary function of the number of “good”-, “bad”- and “unsure”- words in that line. This arbitrary function is implemented in the line_is_valid() function in clean_corpus.py.
This function is required to take 3 arguments (“good”, “bad” and “unsure”) representing the number of words in each “class”. Its return value is used as a boolean value indicating whether to accept the line (True) or discard it (False). At the time of writing, the default is the simple expression return bad < good + unsure, which means that a line is accepted as long as it contains fewer “bad” words than “good” and “unsure” words combined. This is a very uneducated guess and you are encouraged to replace this with a more useful function.
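For comparison, here is the default rule next to one possible replacement. The stricter variant, including its name line_is_valid_strict and the 0.25 threshold, is purely hypothetical; to try it you would use its body as the new line_is_valid() in clean_corpus.py.

```python
def line_is_valid(good, bad, unsure):
    """The default rule quoted above."""
    return bad < good + unsure


def line_is_valid_strict(good, bad, unsure, max_bad_ratio=0.25):
    """Hypothetical alternative: cap the share of "bad" words in a line."""
    total = good + bad + unsure
    if total == 0:
        return False  # an empty line carries no content
    return bad / total <= max_bad_ratio


if __name__ == "__main__":
    for counts in [(1, 1, 1), (3, 1, 0), (1, 2, 1)]:
        print(counts, line_is_valid(*counts), line_is_valid_strict(*counts))
```

The extra parameter has a default value, so the function can still be called with the three required arguments.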
If, for some reason, you would like to use your Yahoo! AppID when using the Yahoo! API, you can change the default “00000000” in the parameter list of the collect_urls_from_yahoo() function (in corpus_collect.py) to your AppID key.