Translate Toolkit & Pootle

Tools to help you make your software local

poterminology

poterminology takes Gettext PO/POT files and extracts potential terminology.

This is useful as a first step before translating a new project (or an existing project into a new target language), as it allows you to define key terminology for consistency in translations. The resulting terminology PO files can be used by Pootle to provide suggestions while translating.

Generally, all the input files should have the same source language, and either be POT files (with no translations) or PO files with translations to the same target language.

The more separate PO files you use to generate terminology, the better your results will be, but poterminology can be used with just a single input file.

New in v1.2

Usage

poterminology [options] <input> <terminology>

Where:

<input>        translations to be examined for terminology
<terminology>  extracted potential terminology

Options:

--version                              show program's version number and exit
-h, --help                             show this help message and exit
--manpage                              output a manpage based on the help
--progress=PROGRESS                    show progress as: dots, none, bar, names, verbose
--errorlevel=ERRORLEVEL                show errorlevel as: none, message, exception, traceback
-i INPUT, --input=INPUT                read from INPUT in pot, po formats
-x EXCLUDE, --exclude=EXCLUDE          exclude names matching EXCLUDE from input paths
-o OUTPUT, --output=OUTPUT             write to OUTPUT in po, pot formats
-u UPDATEFILE, --update=UPDATEFILE     update terminology in UPDATEFILE
--psyco=MODE                           use psyco to speed up the operation, modes: none, full, profile
-S STOPFILE, --stopword-list=STOPFILE  read stopword (term exclusion) list from STOPFILE (default site-packages/translate/share/stoplist-en)
-F, --fold-titlecase                   fold “Title Case” to lowercase (default)
-C, --preserve-case                    preserve all uppercase/lowercase
-I, --ignore-case                      make all terms lowercase
--accelerator=ACCELERATORS             ignores the given accelerator characters when matching (accelerator characters probably require quoting)
-t LENGTH, --term-words=LENGTH         generate terms of up to LENGTH words (default 3)
--inputs-needed=MIN                    omit terms appearing in less than MIN input files (default 2, or 1 if only one input file)
--fullmsg-needed=MIN                   omit full message terms appearing in less than MIN different messages (default 1)
--substr-needed=MIN                    omit substring-only terms appearing in less than MIN different messages (default 2)
--locs-needed=MIN                      omit terms appearing in less than MIN different original program locations (default 2)
--sort=ORDER                           output sort order(s): frequency, dictionary, length (default is all orders in the above priority)
--source-language=LANG                 the source language code (default 'en')
-v, --invert                           invert the source and target languages for terminology

Examples

You want to generate a terminology file for Pootle that will be used to provide suggestions for translating Pootle itself:

poterminology Pootle/po/pootle/templates/*.pot .

This results in a ./pootle-terminology.pot output file with 23 terms (from “file” to “does not exist”), without any translations.

The default output file can be added to a Pootle project to provide terminology matching suggestions for that project; alternatively, a special Terminology project can be used, and it will provide terminology suggestions for all projects that do not have a pootle-terminology.po file.

Generating a terminology file containing automatically extracted translations is possible as well, by using PO files with translations for the input files:

poterminology Pootle/po/pootle/fi/*.po --output fi/pootle-terminology.po \
  --sort dictionary

Using PO files with Finnish translations, you get an output file that contains the same 23 terms, with translations of eight terms; one (“login”) is fuzzy due to slightly different translations in jToolkit and Pootle. The file is sorted in alphabetical order (by source term, not translated term), which can be useful when comparing different terminology files.

Even though there is no translation of Pootle into Kinyarwanda, you can use the Gnome UI terminology PO file as a source for translations. In order to extract only the terms common to jToolkit and Pootle, this command includes the POT output from the first step above (which is redundant) and requires terms to appear in three different input sources:

poterminology Pootle/po/pootle/templates/*.pot pootle-terminology.pot \
  Pootle/po/terminology/rw/gnome/rw.po --inputs-needed=3 -o terminology/rw.po

Of the 23 terms, 16 have Kinyarwanda translations extracted from the Gnome UI terminology.

For a language like Spanish, with both Pootle translations and Gnome terminology available, 18 translations (2 fuzzy) are generated by the following commands, which initialize the terminology file from the POT output from the first step, and then use --update to specify that the glossary-es.po file is to be used both for input and output:

cp pootle-terminology.pot glossary-es.po;
poterminology --inputs=3 --update glossary-es.po \
  Pootle/po/pootle/es/*.po Pootle/po/terminology/es/gnome/es.po

Reduced terminology glossaries

If you want to generate a terminology file containing only single words, not phrases, you can use -t/--term-words to control this. If your input files are very large and/or you have a lot of input files, and you are finding that poterminology is taking too much time and memory to run, reducing the phrase size from the default value of 3 can be helpful.

For example, running poterminology on the subversion trunk with the default phrase size can take quite some time and may not even complete on a small-memory system, but with --term-words=1 the initial number of terms is reduced by half, and the thresholding process can complete:

poterminology --progress=none -t 1 translate
1297 terms from 64039 units in 216 files
254 terms after thresholding
254 terms after subphrase reduction

The first line of output indicates the number of input files and translation units (messages), with the number of unique terms present after removing C and Python format specifiers (e.g. %d), XML/HTML <elements> and &entities; and performing stoplist elimination.

The second line gives the number of terms remaining after applying threshold filtering (discussed in more detail below) to eliminate terms that are not sufficiently “common” in the input files.

The third line gives the number of terms remaining after eliminating subphrases that did not occur independently. In this case, since the term-words limit is 1, there are no subphrases and so the number is the same as on the second line.

However, in the first example above (generating terminology for Pootle itself), the term “not exist” passes the stoplist and threshold filters, but all occurrences of this term also contained the term “does not exist” which also passes the stoplist and threshold filters. Given this duplication, the shorter phrase is eliminated in favor of the longer one, resulting in 23 terms (out of 25 that pass the threshold filters).
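
The subphrase reduction described above can be illustrated with a short Python sketch. This is not poterminology's actual implementation: it assumes a hypothetical layout in which each candidate term maps to the set of message ids where it occurs, and it drops a shorter term whenever a longer term contains it and covers all of its occurrences. The function names and the sample data are made up for the illustration.

```python
def contains_phrase(longer, shorter):
    """Return True if the words of `shorter` appear contiguously in `longer`."""
    long_words, short_words = longer.split(), shorter.split()
    n = len(short_words)
    return len(long_words) > n and any(
        long_words[i:i + n] == short_words
        for i in range(len(long_words) - n + 1)
    )


def reduce_subphrases(occurrences):
    """Drop terms that never occur outside a longer candidate term.

    `occurrences` maps each term to the set of message ids containing it
    (a hypothetical data layout chosen for this illustration).
    """
    kept = {}
    for term, messages in occurrences.items():
        absorbed = any(
            contains_phrase(other, term) and messages <= other_messages
            for other, other_messages in occurrences.items()
        )
        if not absorbed:
            kept[term] = messages
    return kept


# Made-up sample: "not exist" only ever appears inside "does not exist".
terms = {
    "file": {1, 2, 5},
    "not exist": {3, 4},
    "does not exist": {3, 4},
}
print(sorted(reduce_subphrases(terms)))
```

If “not exist” also appeared in a message without “does not exist”, its message set would no longer be covered by the longer term and both terms would be kept.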

Reducing output terminology with thresholding options

Depending on the size and number of the source files, and the desired scope of the output terminology file, there are several thresholding filters that can be adjusted to allow fewer or more terms in the output file. We have seen above how one (--inputs-needed) can be used to require that terms be present in multiple input files, but there are also other thresholds that can be adjusted to control the size of the output terminology file.

  • --inputs-needed

This is the most flexible and powerful thresholding control. The default value is 2, unless only one input file (not counting an --update argument) is provided, in which case the threshold is 1 to avoid filtering out all terms and generating an empty output terminology file.

By copying input files and providing them multiple times as inputs, you can even achieve “weighted” thresholding, so that, for example, all terms in one original input file will pass thresholding, while other files may be filtered. A simple version of this technique was used above to incorporate translations from the Gnome terminology PO files without having it affect the terms that passed the threshold filters.

  • --locs-needed

Rather than requiring that a term appear in multiple input PO or POT files, this requires that it have been present in multiple source code files, as evidenced by location comments in the PO/POT sources.

This threshold can be helpful in eliminating over-specialized terminology that you don't want when multiple PO/POT files are generated from the same sources (via included header or library files).

Note that some PO/POT files have function names rather than source file names in the location comments; in this case the threshold will be on multiple functions, which may need to be set higher to be effective.

Not all PO/POT files contain proper location comments. If your input files don't have (good) location comments and the output terminology file is reduced to zero or very few entries by thresholding, you may need to override the default value for this threshold and set it to 0, which disables this check.

The setting of the --locs-needed option has another effect, which is that location comments in the output terminology file will be limited to twice that number; a location comment indicating the number of additional locations not specified will be added instead of the omitted locations.

  • --fullmsg-needed
  • --substr-needed

These two thresholds specify the number of different translation units (messages) in which a term must appear; they both work in the same way, but the first one applies to terms which appear as complete translation units in one or more of the source files (full message terms), and the second one to all other terms (substring terms). Note that translations are extracted only for full message terms; poterminology cannot identify the corresponding substring in a translation.

If you are working with a single input file without useful location comments, increasing these thresholds may be the only way to effectively reduce the output terminology. Generally, you should increase the --substr-needed threshold first, as the full message terms are more likely to be useful terminology.
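
To make the interaction of the four thresholds concrete, here is a minimal Python sketch of the filtering logic as described in this section. It is an illustration under stated assumptions, not the tool's real code: the `TermStats` record, the function names, and the sample data are all hypothetical, and only the default values are taken from the option list above.

```python
from dataclasses import dataclass, field


@dataclass
class TermStats:
    """Hypothetical per-term bookkeeping used only for this illustration."""
    inputs: set = field(default_factory=set)     # input files containing the term
    locations: set = field(default_factory=set)  # source locations from location comments
    messages: set = field(default_factory=set)   # ids of messages containing the term
    is_fullmsg: bool = False                     # term is a complete message somewhere


def default_inputs_needed(input_file_count):
    """Default is 2, or 1 when only one input file is given."""
    return 1 if input_file_count == 1 else 2


def passes_thresholds(stats, inputs_needed=2, locs_needed=2,
                      fullmsg_needed=1, substr_needed=2):
    """Return True if a term survives all four threshold filters."""
    if len(stats.inputs) < inputs_needed:
        return False
    # With locs_needed=0 this comparison can never fail, so the check is off.
    if len(stats.locations) < locs_needed:
        return False
    # Full message terms and substring-only terms use separate thresholds.
    needed = fullmsg_needed if stats.is_fullmsg else substr_needed
    return len(stats.messages) >= needed


# Made-up sample data.
candidates = {
    "file": TermStats({"a.pot", "b.pot"}, {"x.py:1", "y.py:9"}, {1, 2, 3}, True),
    "upload": TermStats({"a.pot"}, {"x.py:4", "x.py:8"}, {4, 5}, False),
    "does not exist": TermStats({"a.pot", "b.pot"}, {"x.py:2", "y.py:3"}, {6}, False),
}
kept = sorted(t for t, s in candidates.items() if passes_thresholds(s))
print(kept)
```

In this sample, “upload” fails the --inputs-needed default because it appears in only one input file, and “does not exist” fails --substr-needed because it is a substring-only term seen in a single message.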

Stop word files

Much of the power of poterminology in generating useful terminology files is due to the default stop word file that it uses. This file contains words and regular expressions that poterminology will ignore when generating terms, so that the output terminology doesn't have tons of useless entries like “the 16” or “Z”.

In most cases, the default stop word list will work well, but you may want to replace it with your own version, or possibly just supplement or override certain entries. The default poterminology stopword file contains comments that describe the syntax and operation of these files.

If you want to completely replace the stopword list (for example, if your source language is French rather than English) you could do it with a command like this:

poterminology --stopword-list=stoplist-fr logiciel/ -o glossaire.po

If you merely want to modify the standard stopword list with your own additions and overrides, you must explicitly specify the default list first:

poterminology -S /usr/lib/python2.5/site-packages/translate/share/stoplist-en \
  -S my-stoplist po/ -o terminology.po

You can use poterminology --help to see the default stopword list pathname, which may differ from the one shown above.

Note that if you are using multiple stopword list files, as in the above, they will all be subject to the same case mapping (fold “Title Case” to lower case by default) - if you specify a different case mapping in the second file it will override the mapping for all the stopword list files.

Issues

When using poterminology on Windows systems, file globbing for input is not supported (unless you have a version of Python built with cygwin, which is not common). On Windows, a command like “poterminology -o test.po podir/*.po” will fail with an error “No such file or directory: 'podir\\*.po'” instead of expanding the podir/*.po glob expression. (This problem affects all Translate Toolkit command-line tools, not just poterminology.) You can work around this problem by making sure that the directory does not contain any files (or subdirectories) that you do not want to use for input, and just giving the directory name as the argument, e.g. “poterminology -o test.po podir” for the case above.
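
Another possible workaround, offered here as an assumption rather than a documented feature, is a small Python wrapper that expands the glob expression itself before the tool is invoked. The helper names `expand_inputs` and `build_command` are hypothetical, and the sketch only builds and prints the argument list; it assumes the poterminology command is available on your PATH when you actually run it.

```python
import glob


def expand_inputs(patterns):
    """Expand glob patterns ourselves, since the Windows shell does not."""
    paths = []
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        # Keep the argument unchanged if it matches nothing (e.g. a plain name).
        paths.extend(matches or [pattern])
    return paths


def build_command(output, patterns):
    """Build the argument list for a poterminology run."""
    return ["poterminology", "-o", output] + expand_inputs(patterns)


# Equivalent of: poterminology -o test.po podir/*.po
print(" ".join(build_command("test.po", ["podir/*.po"])))
```

The resulting list can be passed to `subprocess.call` from the standard library to run the tool with the expanded file names.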

When using terminology files generated by poterminology as input, a plethora of translator comments marked with (poterminology) may be generated, with the number of these increasing on each iteration. You may wish to run pocommentclean (or a slightly modified version of it which only removes (poterminology) comments) on the input and/or output files, especially since translator comments are displayed as tooltips by Pootle (thankfully, they are truncated at a few dozen characters).
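
As a rough idea of what such a modified cleanup could look like, the following Python sketch removes only the translator comment lines that mention (poterminology). It is not pocommentclean itself: the function name is hypothetical, the sample PO entry is invented, and the sketch assumes the marker appears literally in lines beginning with “# ” (the PO translator comment prefix).

```python
def strip_poterminology_comments(po_text):
    """Remove translator comments that mention "(poterminology)".

    Translator comments are lines starting with "# "; other comment kinds
    ("#:" locations, "#," flags, "#." extracted comments) are left alone.
    """
    kept = []
    for line in po_text.splitlines(keepends=True):
        is_translator_comment = line.startswith("# ")
        if is_translator_comment and "(poterminology)" in line:
            continue
        kept.append(line)
    return "".join(kept)


# Invented sample entry, for illustration only.
sample = (
    "# (poterminology) pootle/fi/pootle.po\n"
    "# checked by a translator\n"
    "#: pootle.py:12\n"
    'msgid "file"\n'
    'msgstr "tiedosto"\n'
)
print(strip_poterminology_comments(sample), end="")
```

Other translator comments and all location comments pass through unchanged, so the cleaned file remains usable as input for a later run.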

Currently, any translation items using plural forms will be entirely ignored for terminology extraction. The singular form for the item should be used, but this is not yet implemented (it is tracked as bug 532).

Default threshold settings may eliminate all output terms; in this case, poterminology should suggest threshold option settings that would allow output to be generated (this enhancement is tracked as “bug” 582).

While poterminology ignores XML/HTML entities and elements and %-style format strings (for C and Python), it does not ignore all types of “variables” that may occur, particularly in OpenOffice.org, Mozilla, or Gnome localization files. These other types should be ignored as well (this enhancement is tracked as “bug” 598).
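
The distinction can be sketched with a few regular expressions in Python. The patterns below are illustrative assumptions only, and poterminology's real expressions may differ; the two sample strings, including the `$FILENAME$` and `{folder}` variable styles, are made up to show markup that such patterns would and would not catch.

```python
import re

# Illustrative patterns only; poterminology's real expressions may differ.
FORMAT_SPEC = re.compile(r"%(?:\([^)]*\))?[-+#0]*\d*(?:\.\d+)?[a-zA-Z%]")
XML_ELEMENT = re.compile(r"</?[A-Za-z][^>]*>")
XML_ENTITY = re.compile(r"&[#A-Za-z0-9]+;")


def strip_markup(text):
    """Replace format specifiers, XML/HTML elements and entities with spaces."""
    for pattern in (FORMAT_SPEC, XML_ELEMENT, XML_ENTITY):
        text = pattern.sub(" ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(text.split())


# %-style specifiers, elements and entities are removed ...
print(strip_markup("Copied %d files to <b>%(dir)s</b> &amp; exited"))
# ... but other variable styles pass through untouched.
print(strip_markup("Delete $FILENAME$ from {folder}?"))
```

In the second string the variables survive the cleanup, so they could end up inside extracted terms, which is the limitation this issue describes.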

Terms containing only words that are ignored individually, but not excluded from phrases (e.g. “you are you”) may be generated by poterminology, but aren't generally useful. Adding a new threshold option --nonstop-needed could allow these to be suppressed.

Pootle ignores parenthetical comments in source text when performing terminology matching; this allows for terms like “scan (verb)” and “scan (noun)” to both be provided as suggestions for a message containing “scan.” poterminology does not provide any special handling for these, but it could use them to provide better handling of different translations for a single term. This would be an improvement over the current approach, which marks the term fuzzy and includes all variants, with location information in {} braces in the automatically extracted translation.

Currently, message context information (PO msgctxt) is not used in any way; this could provide an additional source of information for distinguishing variants of the same term.

A single execution of poterminology can only perform automatic translation extraction for a single target language - having the ability to handle all target languages in one run would allow a single command to generate all terminology for an entire project. Additionally, this could provide even more information for identifying variant terms by comparing the number of target languages that have variant translations.