Translate Toolkit & Pootle

Tools to help you make your software local


CorpusCatcher

Introduction

CorpusCatcher is a corpus collection toolset. It can help you build language- or topic-specific corpora from publicly available web resources. Such corpora are useful for many purposes, especially as source data for building spell checkers.

If you are interested in CorpusCatcher, or are working on spell checkers, you might also be interested in Spelt.

Download

Releases can be downloaded from here and sources from here.

Documentation

  • README - Contains all you need to know about using the CorpusCatcher tools.
  • API - Not yet done.

Development

Subversion: https://translate.svn.sourceforge.net/svnroot/translate/src/trunk/corpuscatcher

See the Advanced Topics section in the README for some notes on the code.

TODO

  • Decode each downloaded page using the encoding given in its HTTP header
  • Use the MIME type from the HTTP header to handle each file appropriately
    • Support .doc and .pdf
  • Improve filtering (filter out numbers, repeated punctuation marks)
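
As a rough illustration of two of the items above (decoding a page by the charset in its HTTP header, and filtering out numbers and repeated punctuation), here is a minimal Python sketch. It is not taken from the CorpusCatcher code base: the function names, the UTF-8 fallback, and the exact definitions of "number" and "repeated punctuation" are all assumptions made for this example.

```python
import re
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header value.

    Hypothetical helper: falls back to `default` when no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


def decode_page(raw_bytes, content_type, default="utf-8"):
    """Decode page bytes using the encoding declared in the HTTP header."""
    charset = charset_from_content_type(content_type, default)
    try:
        return raw_bytes.decode(charset, errors="replace")
    except LookupError:
        # The header named an encoding Python does not know; use the fallback.
        return raw_bytes.decode(default, errors="replace")


# Assumption: a "number" is a token made only of digits, optionally signed
# and with "." or "," separators (for example 42, -7, 3,5 or 1.000.000).
_NUMBER = re.compile(r"^[+-]?\d+(?:[.,]\d+)*$")

# Assumption: "repeated punctuation" is the same non-word, non-space
# character occurring two or more times in a row (for example "!!!").
_REPEATED_PUNCT = re.compile(r"([^\w\s])\1+")


def filter_text(text):
    """Collapse repeated punctuation marks and drop purely numeric tokens."""
    collapsed = _REPEATED_PUNCT.sub(r"\1", text)
    tokens = [tok for tok in collapsed.split() if not _NUMBER.match(tok)]
    return " ".join(tokens)


if __name__ == "__main__":
    page = decode_page(b"caf\xe9 !!! 42", "text/html; charset=ISO-8859-1")
    print(filter_text(page))
```

Note that filter_text also normalizes whitespace, since it splits the text into tokens and rejoins them with single spaces. A real filter would probably need language-specific rules, for instance to keep ellipses or tokens that mix digits and letters.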