Translate Toolkit & Pootle

Tools to help you make your software local

====== CorpusCatcher - README ======

This page was adapted from CorpusCatcher's README file.

CorpusCatcher is a corpus collection toolset created to facilitate the creation of next-generation spell-checkers by Translate.org.za.

It was written in Python and can therefore easily be used, in part or in whole, in other Python projects. It was originally written to simplify the use of BootCaT (http://sslmit.unibo.it/~baroni/tools_and_resources.html), but has grown to replace the BootCaT parts it used with Python ports.


These tools are simple command-line tools written in Python, so all that is needed for installation is to extract all the files in the distribution archive into a directory. The tools depend on the following:


  1. Python >= 2.4
  2. mechanize module (only tested with version 0.1.7b)
  3. pYsearch module (only tested with version 3.0)

==== Installing Mechanize ====

=== Installing with EasyInstall ===

If you have EasyInstall installed, you can install mechanize with the following command (Windows or Linux):

<code>easy_install mechanize</code>
=== Installing from source ===
You can download the latest version of mechanize from [[http://wwwsearch.sourceforge.net/mechanize/#source|here]]. To install, extract the archive and run the following from the command-line in the directory where the archive was extracted (Windows or Linux): <code>python setup.py install</code>
==== Installing pYsearch ====
You can download the latest version of pYsearch from [[http://sourceforge.net/project/showfiles.php?group_id=134651|here]]. To install, extract the archive and run the following from the command-line in the directory where the archive was extracted (Windows or Linux):
<code>python setup.py build</code> and then <code>python setup.py install</code>
===== Example =====
In this example I will demonstrate the usage of these tools by creating a language corpus for Zulu (http://en.wikipedia.org/wiki/Zulu#Language).
  * Change to the CorpusCatcher directory.
    <code bash>$ cd corpuscatcher</code>
  * Select seeds and put them in a file (one word or term per line).
    <code bash>$ cat zulu/seeds.txt</code>
  * Tell //corpus_collect.py// to get us some corpus data to work with.
<code>$ python corpus_collect.py -o zulu zulu/seeds.txt
(A lot of output about the state of the tool)</code>
You should notice that //corpus_collect.py// downloaded a bunch of HTML files into the //<output dir>/data// directory and that each HTML file has a corresponding text file (//xxx//.html means that there is also a //xxx//.txt).
  * Now that we have extracted text from a bunch of HTML files, we can proceed to pool these corpora together and clean them.
<code bash>$ python clean_corpus.py -l zulu/data/*.txt > zulu/corpus.txt</code>
And that's it! You now have a language corpus for Zulu in //zulu/corpus.txt// based on the seeds you specified in the second step. Exploring the //zulu// directory, you will find information from the different stages of the collection process:
<code bash>$ ls zulu/
corpus.txt # The corpus file created in the previous command
data/      # The directory containing HTML and text data
seeds.txt  # A file containing the initial seeds
tuples.txt # The search tuples passed to Yahoo! to find the URLs
urls.txt   # The downloaded URLs</code>
See below for more information on the capabilities of //corpus_collect.py// and //clean_corpus.py//.
===== Process Description =====
This section describes the theoretical steps of the corpus collection process and explains what each step entails.
The whole process of corpus collection and text extraction is split up into six steps, with each step's output being the next step's input. These steps (and the script or person responsible for each) are:
  - Seed selection (You)
  - Search tuple generation (//corpus_collect.py//)
  - Finding URLs for each search tuple from Yahoo! (//corpus_collect.py//)
  - Downloading and crawling the found URLs (//corpus_collect.py//)
  - Extracting text from downloaded HTML documents (//h2t.py// via //corpus_collect.py//)
  - Filtering the text to obtain a word list (//clean_corpus.py//)
==== Seed Selection ====
Seeds, in this context, are words used to create search queries that will yield a high percentage of pages in the target language.
==== Search Tuple Generation ====
This step generates n-tuples from the seeds that will be used as search queries in the next step. The length of the tuple can be specified with the -n option of //corpus_collect.py// (default 3) and the number of tuples to generate with the -l option (default 10). Seeds are chosen at random to create a unique tuple of the specified length.
Once the tuples are generated, they are saved to a file //tuples.txt// in the output directory.
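The tuple generation described in this step could be sketched roughly as follows. This is only an illustration in modern Python, not the code from //corpus_collect.py//: the function name ''generate_tuples'' and its parameters are hypothetical, and treating tuples as unordered (so that the same seeds in a different order count as a duplicate) is an assumption.

```python
import math
import random


def generate_tuples(seeds, num_elements=3, list_length=10, rng=None):
    """Return up to list_length unique tuples of num_elements distinct seeds.

    Hypothetical helper: mirrors the -n (tuple length) and -l (number of
    tuples) options, but is not taken from corpus_collect.py.
    """
    rng = rng or random.Random()
    unique_seeds = sorted(set(seeds))
    if len(unique_seeds) < num_elements:
        raise ValueError("not enough seeds for the requested tuple length")
    # Never ask for more tuples than there are distinct combinations.
    wanted = min(list_length, math.comb(len(unique_seeds), num_elements))
    tuples = set()
    while len(tuples) < wanted:
        # Sorting makes the same seeds in another order count as a duplicate.
        tuples.add(tuple(sorted(rng.sample(unique_seeds, num_elements))))
    return sorted(tuples)


if __name__ == "__main__":
    for search_tuple in generate_tuples(["a", "b", "c", "d", "e"], 3, 4):
        print(" ".join(search_tuple))
```

Each resulting tuple would then be written to //tuples.txt//, one per line, for use as a search query.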
==== Finding URLs ====
Each tuple from the previous step is used as a Yahoo! search query and the first 10 results (default) are saved to be downloaded in the next step. The number of results to find can be changed with //corpus_collect.py//'s -u option.
==== Downloading and Crawling ====
Each URL found in the previous step is downloaded after checking that it is a text file (its mimetype starts with "text/") and that it hasn't been downloaded previously. The downloaded file is saved as //output dir///data///hash//.html, where //output dir// is the directory specified with //corpus_collect.py//'s -o option and //hash// is the MD5 sum of the URL from which the page was downloaded. The original URL is also stored in the downloaded HTML file as an XML comment on the first line.
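As a rough illustration of the naming scheme and checks described in this step, the sketch below builds the file name from the MD5 sum of the URL and applies the "text/" mimetype test. The helper names are hypothetical and the exact layout of the URL comment is an assumption; the real logic lives in //corpus_collect.py//.

```python
import hashlib
import os


def page_filename(output_dir, url):
    """Build <output dir>/data/<hash>.html, where <hash> is the MD5 sum of the URL."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return os.path.join(output_dir, "data", digest + ".html")


def should_download(url, mimetype, seen_urls):
    """Accept only text documents that have not been downloaded before."""
    return mimetype.startswith("text/") and url not in seen_urls


def with_url_comment(url, html):
    """Store the original URL as a comment on the first line (format assumed)."""
    return "<!-- %s -->\n%s" % (url, html)


if __name__ == "__main__":
    url = "http://example.com/page"
    if should_download(url, "text/html", seen_urls=set()):
        print(page_filename("zulu", url))
```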
Crawling is turned off by default, but can be enabled by specifying a crawl depth (-d option) greater than 0. So giving //corpus_collect.py// "-d 1" on the command-line will cause it to download the links found in the previous step as well as all links on those pages, but nothing further.
Crawling only occurs on the site where the original link was found, unless the -S option is given.
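The effect of the crawl depth and the site restriction can be pictured with a small breadth-first sketch. It is not the crawler from //corpus_collect.py//: the ''crawl'' function and the ''get_links'' callback (which stands in for downloading a page and extracting its links) are hypothetical, and comparing host names is only one possible way to decide what "the same site" means.

```python
from urllib.parse import urlparse


def crawl(start_urls, get_links, depth=0, site_only=True):
    """Return every URL reachable within `depth` link levels of the start URLs.

    get_links(url) is a caller-supplied function returning the links on a page.
    """
    seen = set(start_urls)
    frontier = list(start_urls)
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            for link in get_links(url):
                same_site = urlparse(link).netloc == urlparse(url).netloc
                if link in seen or (site_only and not same_site):
                    continue
                seen.add(link)
                next_frontier.append(link)
        frontier = next_frontier
    return seen


if __name__ == "__main__":
    # A tiny in-memory "web" used instead of real downloads.
    web = {
        "http://a.example/": ["http://a.example/1", "http://b.example/"],
        "http://a.example/1": ["http://a.example/2"],
    }
    print(sorted(crawl(["http://a.example/"], lambda u: web.get(u, []), depth=1)))
```

With a depth of 0 only the start URLs are returned, which corresponds to crawling being turned off.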
==== Extracting Text ====
All downloaded HTML pages are now put through the //html2text()// function from //h2t.py// and the text files are stored in //output dir///data with the same filename as the HTML page, only with a .txt extension. The URL on the first line of the source HTML page is preserved as the first line in the text file.
Note that the width of each HTML page is set in //h2t.py// (//MAXCOL//) with a default of 10000. This is to allow each paragraph in the HTML source to be treated as a unit in the next step.
==== Filtering Extracted Text ====
At this point you should have a bunch of HTML pages and their text equivalents in //output dir///data. //clean_corpus.py// can now take these (or any other collection of) text files and perform some cleaning and filtering operations. It will automatically remove any punctuation or formatting characters and can optionally test whether each word is in a "good word" or "bad word" list. Each line (paragraph) is tested for validity based on the number of "good", "bad" and "unsure" words found therein. If it is not valid, it is completely discarded from the rest of the process and hence also from the output.
A "bad word" list might be a list of English words, so as to filter out any non-content parts of the text (such as navigation links and toolbars in the original HTML). A "good word" list might be used to enable a more informed decision about the inclusion of a line/paragraph during the test for validity. See the [[#determining_paragraph_validity|Determining Paragraph Validity]] section for more information on specifying your own line validity test.
The output of this step is the cleaned and filtered input text in one of two forms: either a sorted list of words with duplicates removed, or the original format as it was found in the source file.
===== Command-line Options =====
This section explains all options for //corpus_collect.py// and //clean_corpus.py// and what they are to be used for. Not all possible combinations of options have been tested, so incompatible or conflicting options may have unforeseen effects.
==== corpus_collect.py ====
<code>$ python corpus_collect.py --help
Usage: corpus_collect.py [<options>] [<seedfile>]
  -h, --help            show this help message and exit
  -q, --quiet           Suppress output (quiet mode).
  -o OUTPUTDIR, --output-dir=OUTPUTDIR
                        Output directory to use.
  -n NUMELEMENTS, --num-elements=NUMELEMENTS
                        The number of seeds elements per tuple.
  -l TUPLELENGTH, --tuple-list-length=TUPLELENGTH
                        The number of tuples to create from seeds.
  -u URLS, --urls-per-tuple=URLS
                        The number of search results to retrieve per tuple.
  -d CRAWLDEPTH, --crawl-depth=CRAWLDEPTH
                        Follow this many levels of links from each URL.
  -S, --no-site-only    When following the links on a page, do not stay
                        on the original site.
  -t TFILE, --tuple-file=TFILE
                        Do not calculate tuples, use tuples from TFILE.
  -U UFILE, --url-file=UFILE
                        Do not search for URLs, use URLs from UFILE.
  -p DIR, --page-dir=DIR
                        Do not download pages, use pages in directory
  --skip-urls           Skip URL collection. Implied by -U.
  --skip-download       Skip downloading of URLs. Implied by -p.
  --skip-convert        Skip conversion of HTML to text.</code>
The -h and -q options are self-explanatory.
The -o option is used to specify a "working directory" where all operations should take place. This is useful for keeping separate corpora for different languages or based on different seeds. After going through all the stages that //corpus_collect.py// supports, the output directory should contain the files and directory shown in the [[#example|Example]] section.
Options -n and -l specify the number of seeds per search tuple and the number of search tuples to generate, respectively, with defaults of 3 and 10. The -u option specifies how many results to get from Yahoo! for each search tuple.
Options -d and -S are related to web-crawling while downloading found pages. -d specifies the crawl depth (default 0). A crawl depth of 1 would mean that the original page and all the pages linked from it will be downloaded. A crawl depth of 2 would mean that all pages linked from the pages found at crawl depth 1 are also downloaded, and so on. By default the crawling will be restricted to the site of the original page. This behaviour can be turned off with the -S option. **__WARNING__**: Turning site-specific crawling off can have a severe impact on the quality of your corpus and can easily lead to a lot of wasted bandwidth. You have been warned.
Options -t, -U and -p enable the continuation (or redoing) of a previous collection process without redoing the work already done.
-t will read the search tuples to be used in step 3 from the specified file (TFILE). This means that the tuple generation step is skipped. This is useful for using predefined searches to find your corpora. As an example, you could add a site restriction to each tuple to only search for pages ending in .za (South African pages).
-U will read the URLs to download from the specified file (UFILE) instead of querying Yahoo! for them. The tuple generation and search steps are thus skipped.
-p will search the specified directory (DIR) for HTML pages and convert them to text. All other steps are thus skipped.
Options --skip-urls, --skip-download and --skip-convert are used to explicitly specify steps to skip (-t, -U and -p implicitly skip steps). Where -t, -U and -p will generally be used to skip beginning steps, the --skip-* options are meant to be used to skip finishing steps. For example: <code>$ python corpus_collect.py -t tuples.txt --skip-convert seeds.txt</code>

This command will skip search tuple generation (-t option), search for and download pages, but not convert the downloaded pages to text (--skip-convert). Adding --skip-download will result in only step 3 (finding URLs) being executed.

==== clean_corpus.py ====

<code>$ python clean_corpus.py --help
Usage: clean_corpus.py [<options>] <file1> [<file2> …]
  -h, --help            show this help message and exit
  -b BADFILE, --bad-file=BADFILE
                        File containing words considered bad (not in the
                        target language).
  -g GOODFILE, --good-file=GOODFILE
                        File containing words considered good (in the
                        target language).
  -m, --mark-bad        Mark any "bad words" found, don't remove them.
  -l, --list            Print output as a list of words.</code>

The -h option is self-explanatory.

-b and -g are used to specify a “bad words” file and a “good words” file respectively. Each file is split up into words (see the Python documentation for str.split() for more information) and used to count a word from a source paragraph as “good” (the word is in GOODFILE), “bad” (the word is in BADFILE) or “unsure” (the word is in neither GOODFILE nor BADFILE). These counts are used by a function that determines a paragraph's validity. See the section Determining Paragraph Validity for more information.
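The counting described here might look something like the sketch below. The helper names ''load_words'' and ''count_words'' are hypothetical, and counting a word that appears in both lists as “good” is an assumption; clean_corpus.py may resolve that case differently.

```python
def load_words(file_contents):
    """Split the contents of a word file into a set, using str.split()."""
    return set(file_contents.split())


def count_words(paragraph, good_words=frozenset(), bad_words=frozenset()):
    """Return the (good, bad, unsure) word counts for one paragraph."""
    good = bad = unsure = 0
    for word in paragraph.split():
        if word in good_words:
            good += 1
        elif word in bad_words:
            bad += 1
        else:
            # The word is in neither list.
            unsure += 1
    return good, bad, unsure


if __name__ == "__main__":
    good_words = load_words("sawubona\nunjani\n")
    bad_words = load_words("the home click\n")
    print(count_words("sawubona the mhlaba", good_words, bad_words))
```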

-m will cause “bad words” that are found to be marked with two leading and two trailing underscores, instead of being removed from the output. This option is useless without specifying a “bad words” file (-b).
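To illustrate the difference between marking and removing, here is a minimal sketch. The function ''mark_bad_words'' is hypothetical and only mimics the behaviour described for -m; it is not taken from clean_corpus.py.

```python
def mark_bad_words(paragraph, bad_words, mark=True):
    """Mark "bad" words as __word__ when mark is True, else drop them."""
    kept = []
    for word in paragraph.split():
        if word in bad_words:
            if mark:
                kept.append("__%s__" % word)
            # Without marking, the bad word is simply left out.
        else:
            kept.append(word)
    return " ".join(kept)


if __name__ == "__main__":
    bad = {"home", "click"}
    print(mark_bad_words("home sawubona click", bad))
    print(mark_bad_words("home sawubona click", bad, mark=False))
```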

-l causes the output of clean_corpus.py to be a sorted list of unique words in the input (one word per line). Note that the test for uniqueness is case sensitive, so “word”, “Word” and “WoRd” are all different words. Filtering out duplicates and sorting is deferred until all specified input files have been read. The result is thus a sorted list of unique words over all the input files. Output produced with this option is suitable for direct use in Spelt.
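A case-sensitive, sorted list of unique words over several inputs can be built with a set, as in this small sketch. The function name ''word_list'' is hypothetical, and plain code-point ordering is assumed for the sort.

```python
def word_list(paragraphs):
    """Return the sorted, case-sensitive list of unique words in all paragraphs."""
    words = set()
    for paragraph in paragraphs:
        words.update(paragraph.split())
    # Sorting happens once, after every input has been read.
    return sorted(words)


if __name__ == "__main__":
    for word in word_list(["word Word", "WoRd word"]):
        print(word)
```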

Input files specified (<file1> [<file2> …]) should be a list of files and directories to clean. For each directory specified, all text files contained therein will be regarded as input. The most convenient way to specify the input files (in the context of use with corpus_collect.py) is output_dir/data/*.txt where output_dir is the “output directory” specified with corpus_collect.py's -o option. Note that this does not work as expected on the Windows command-line, because wildcards are not expanded by the shell there as they are in Bash (for instance).
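One way to get the same behaviour on every platform is to expand wildcards inside the script with Python's glob module. The sketch below shows the idea; ''expand_inputs'' is a hypothetical helper and is not part of clean_corpus.py.

```python
import glob
import os


def expand_inputs(arguments):
    """Expand wildcards and directories into a sorted list of input files."""
    files = []
    for argument in arguments:
        # glob.glob() returns [] when nothing matches, so keep the literal name.
        for path in glob.glob(argument) or [argument]:
            if os.path.isdir(path):
                # For a directory, take all text files contained in it.
                files.extend(glob.glob(os.path.join(path, "*.txt")))
            else:
                files.append(path)
    return sorted(set(files))


if __name__ == "__main__":
    import sys
    for name in expand_inputs(sys.argv[1:]):
        print(name)
```

On Windows the pattern would then be passed through unchanged by the shell and expanded by the script itself.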

===== Advanced Topics =====

This section describes more advanced topics, most of which concern guided (and encouraged) changes to the source code for specific actions.

==== Additional Filtering ====

corpus_collect.py provides the facilities to apply your own filters to HTML text after download and to plain text after conversion from HTML. These filters are defined in the functions filterhtml() and filtertext() in corpus_collect.py.

The filterhtml() function receives the HTML source of a downloaded file as its input parameter, and the function's return value replaces the original HTML source. This happens after the page has been downloaded, but before it is saved to disk.

Similarly, filtertext() receives the text converted from HTML as its input parameter, and its return value replaces the text before it is written to disk.

These functions are free to do anything, just don't come crying when it doesn't do what you want it to do; it's your code. :)
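As an example of what such filters could do, the sketch below strips script and style blocks from the HTML and drops blank lines from the converted text. Only the function names filterhtml() and filtertext() come from this page; the parameter names and the filter bodies are illustrative assumptions, so check the actual signatures in corpus_collect.py before adapting them.

```python
import re

# Matches a whole <script>...</script> or <style>...</style> block.
_SCRIPT_OR_STYLE = re.compile(
    r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)


def filterhtml(html):
    """Return the HTML source with script and style blocks removed."""
    return _SCRIPT_OR_STYLE.sub("", html)


def filtertext(text):
    """Return the converted text with blank lines removed."""
    return "\n".join(line for line in text.splitlines() if line.strip())


if __name__ == "__main__":
    print(filterhtml("<p>a</p><script>x()</script><p>c</p>"))
    print(filtertext("first\n\n   \nsecond"))
```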

==== Determining Paragraph Validity ====

As explained in the Filtering Extracted Text and clean_corpus.py sections, each line in a text file (assumed to represent a paragraph) is tested for validity by some arbitrary function of the number of “good”, “bad” and “unsure” words in that line. This function is implemented as line_is_valid() in clean_corpus.py.

This function is required to take 3 arguments (“good”, “bad” and “unsure”) representing the number of words in each “class”. Its return value is used as a boolean value of whether to accept the line (True) or discard it (False). At the time of writing, the default is the simple expression return bad < good + unsure, which means that any line is accepted as long as it contains fewer “bad” words than “good” and “unsure” words combined. This is a very uneducated guess and you are encouraged to replace this with a more useful function.
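For comparison, the sketch below places the default expression quoted above next to one possible stricter replacement. The second function, with its ''max_bad_ratio'' and ''min_words'' thresholds, is purely a hypothetical example and not a recommendation from the authors. Since the validity function is required to take exactly 3 arguments, the extra parameters have defaults.

```python
def line_is_valid(good, bad, unsure):
    """The default test quoted above: reject only when "bad" words dominate."""
    return bad < good + unsure


def line_is_valid_strict(good, bad, unsure, max_bad_ratio=0.25, min_words=4):
    """Hypothetical alternative: reject short lines and lines with many bad words."""
    total = good + bad + unsure
    if total < min_words:
        return False
    return bad / total <= max_bad_ratio


if __name__ == "__main__":
    for counts in [(3, 1, 0), (1, 2, 1), (1, 1, 1)]:
        print(counts, line_is_valid(*counts), line_is_valid_strict(*counts))
```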

==== Using Your Yahoo! AppID Key ====

If, for some reason, you would like to use your Yahoo! AppID when using the Yahoo! API, you can change the default “00000000” in the parameter list of the collect_urls_from_yahoo() function (in corpus_collect.py) to your AppID key.