Translation Memory database

This page contains notes about implementing a Translation Memory (TM) database driven by data warehousing principles for quick and effective TM lookup.

A Translation Memory database was implemented and is already in use by tmserver, and therefore also by Virtaal. It does not use all of these principles and ideas, so this page mostly collects ideas for future possibilities. Consult the source code and/or the developers for details about the TM database or about using the TM server.

Principles

  • The design shouldn't be tied to a specific database engine
  • The database should follow data warehousing principles
    • Optimised for speed
    • Not transactional
    • Database is disposable and can be rebuilt from existing data

Other implementations

Some other implementations to look at, either to borrow ideas from or to adopt outright.

Warehouse design ideas

We don't need to store all matches in the actual warehouse, since we are mostly only interested in matches above 80%; storing a 15% match, for instance, would be of no interest. We might still want to keep that data in a separate store that is never queried directly, but that can be used when the information is updated later.

The actual source text could be used in a dimension table with various permutations of the text, e.g. lowercase, uppercase, accelerator stripped, placeable stripped.
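
A minimal sketch of how such permutations could be generated before being written to the dimension table. The accelerator markers and placeable patterns below are illustrative assumptions; real code would take them from the file format being processed.

  import re

  # Assumed conventions for illustration only: "&" or "_" as accelerator
  # markers, and printf-style specifiers, XML tags and {names} as placeables.
  ACCELERATORS = re.compile(r"[&_](?=\w)")
  PLACEABLES = re.compile(r"%\d*\$?[sd]|<[^>]+>|\{\w+\}")

  def permutations(source):
      """Return the permutations of a source string kept in the dimension table."""
      return {
          "original": source,
          "lowercase": source.lower(),
          "uppercase": source.upper(),
          "accelerator_stripped": ACCELERATORS.sub("", source),
          "placeable_stripped": PLACEABLES.sub("", source).strip(),
      }

  # Example: permutations("&Save as %s") gives "Save as %s" for
  # accelerator_stripped and "&Save as" for placeable_stripped.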

The data table would then simply store the matches.
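
A sketch of what this could look like as a small star schema in SQLite; the table and column names are hypothetical, and only matches above the 80% threshold mentioned earlier are written to the fact table.

  import sqlite3

  conn = sqlite3.connect("tm_warehouse.db")

  # Dimension table: one row per source string with its permutations.
  conn.execute("""
      CREATE TABLE IF NOT EXISTS source_dim (
          source_id INTEGER PRIMARY KEY,
          original TEXT,
          lowercase TEXT,
          accelerator_stripped TEXT,
          placeable_stripped TEXT
      )""")

  # Fact table: pre-calculated matches between source strings.
  conn.execute("""
      CREATE TABLE IF NOT EXISTS match_fact (
          source_id INTEGER REFERENCES source_dim(source_id),
          matched_id INTEGER REFERENCES source_dim(source_id),
          match_percent REAL,
          match_range TEXT
      )""")

  def store_match(source_id, matched_id, percent):
      # Matches below the 80% threshold never reach the warehouse.
      if percent < 0.80:
          return
      lower = int(percent * 10) * 10
      conn.execute("INSERT INTO match_fact VALUES (?, ?, ?, ?)",
                   (source_id, matched_id, percent, "%d-%d" % (lower, lower + 10)))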

Other dimension possibilities

  • Match percent: actual (0.955), range (80-90)
  • Domain: KDE, GNOME, Windows, etc.
  • Application: browser, word processor, help
  • Translator:
  • Date:
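
Continuing the hypothetical sqlite3 sketch above, these dimensions are what make warehouse-style slicing possible. Only the match-percent range is shown here; domain, application, translator and date would simply be further keys on the fact table, joined in the same way.

  # All matches in the 80-90% range, best first, joined back to the
  # source text they point at.
  rows = conn.execute("""
      SELECT s.original, m.match_percent
        FROM match_fact m
        JOIN source_dim s ON s.source_id = m.matched_id
       WHERE m.match_range = '80-90'
       ORDER BY m.match_percent DESC
  """).fetchall()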

Pre-calculation principle

The idea of using the warehouse is that we only check distances on source text. This source text is mostly unchanging (not entirely true, as it does change slowly). Thus it is possible to gather all source text and pre-calculate the distances. As target text is created, it can be linked to source text through the pre-calculated distances.
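
A rough sketch of that pre-calculation step, using difflib from the standard library as a stand-in for whatever distance measure the TM actually uses; only pairs that clear the threshold would end up in the warehouse.

  import difflib
  from itertools import combinations

  def similarity(a, b):
      # Stand-in measure; the real TM may use Levenshtein or something similar.
      return difflib.SequenceMatcher(None, a, b).ratio()

  def precalculate(sources, threshold=0.80):
      """Yield (source_a, source_b, score) for every pair above the threshold."""
      for a, b in combinations(sources, 2):
          score = similarity(a, b)
          if score >= threshold:
              yield a, b, score

  # Target text translated later is simply attached to its source string;
  # the distances between source strings never need to be recalculated.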

This allows the database to be disposable. The principle we would like to keep in mind is that you should be able to rebuild the database of TM matches from existing data. This data could be existing translations, other TM stores (e.g. TMX files) and compiled translations (.mo files).
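
A sketch of rebuilding the warehouse from such data using the Translate Toolkit's storage factory (which can open PO, TMX, MO and other formats); the store_match() call refers to the hypothetical helper sketched earlier.

  from translate.storage import factory

  def harvest(filenames):
      """Collect (source, target) pairs from existing translation files."""
      for filename in filenames:
          store = factory.getobject(filename)
          for unit in store.units:
              if unit.isheader() or not unit.istranslated():
                  continue
              yield unit.source, unit.target

  # Each harvested pair would be given a row in the dimension table and,
  # once its distances are calculated, recorded via store_match().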

The assumption that we have a finite amount of source text ignores the issue that the translations residing on a computer are often related to currently released software, while the translator will be working on software still to be released. Thus their source text is most likely not in the TM. This raises the possibility that in much translation work we will still have to deal with source text that is not in the data warehouse.

Missing source text

Some ideas for dealing with missing source text:

  • Use text matching add-ons for the underlying database to find the best possible matches and then calculate the required distances (this does, however, give up the advantage of being database agnostic).
  • Find the region in which the string appears by using the LIKE operator and then perform the matching in the data warehouse.
  • Try a series of fallbacks to match lowercase, without placeables, etc. (a rough sketch follows below).
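
A rough sketch of the fallback series, reusing the hypothetical permutations() helper and source_dim table from the sketches above: the same permutation is applied to the incoming string and looked up against the matching column, moving to a looser form each time nothing is found.

  def lookup_with_fallbacks(conn, source):
      """Return the source_id for the closest stored form of the text, or None."""
      forms = permutations(source)
      for column in ("original", "lowercase",
                     "accelerator_stripped", "placeable_stripped"):
          row = conn.execute(
              "SELECT source_id FROM source_dim WHERE %s = ?" % column,
              (forms[column],)).fetchone()
          if row:
              return row[0]
      return None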

Things to explore

Rapid processing of new entries by not matching everything

Is it possible to add new database entries without matching them against every single existing entry? If you had 10,000 entries and wanted to add a new one, would it be possible to find a close match and only compare the string against the surrounding matches to satisfy the distance calculations, or must you match against all entries?
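
The notes leave this as an open question. One existing answer, not part of the original notes but worth mentioning, is a metric tree such as a BK-tree: because edit distance obeys the triangle inequality, a new entry only has to be compared against the branches whose stored distances are compatible with the match threshold, not against every entry.

  def edit_distance(a, b):
      # Plain Levenshtein distance.
      previous = list(range(len(b) + 1))
      for i, ca in enumerate(a, 1):
          current = [i]
          for j, cb in enumerate(b, 1):
              current.append(min(previous[j] + 1,                # deletion
                                 current[j - 1] + 1,             # insertion
                                 previous[j - 1] + (ca != cb)))  # substitution
          previous = current
      return previous[-1]

  class BKTree:
      """Metric tree over edit distance: adding or querying an entry only
      visits branches allowed by the triangle inequality."""

      def __init__(self):
          self.root = None  # (term, {distance: child})

      def add(self, term):
          if self.root is None:
              self.root = (term, {})
              return
          node = self.root
          while True:
              d = edit_distance(term, node[0])
              child = node[1].get(d)
              if child is None:
                  node[1][d] = (term, {})
                  return
              node = child

      def search(self, term, max_distance):
          if self.root is None:
              return []
          results, stack = [], [self.root]
          while stack:
              node = stack.pop()
              d = edit_distance(term, node[0])
              if d <= max_distance:
                  results.append((d, node[0]))
              # Only children in [d - max_distance, d + max_distance] can
              # possibly match, so the rest of the tree is never visited.
              for child_d, child in node[1].items():
                  if d - max_distance <= child_d <= d + max_distance:
                      stack.append(child)
          return results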

Full text indexing

This is the approach currently used in tmserver. When full text indexing is available in the underlying database, it is used to speed up the initial step of finding possible suggestions. Post-processing then ensures that the proper scoring is still used to select and rank the final set of suggestions.
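
A minimal sketch of that two-step shape, assuming an SQLite build with the FTS5 module enabled (tmserver itself targets fts3, and its real scoring differs from the difflib stand-in used here): the full text index narrows the field cheaply, and the proper similarity score then selects and ranks the survivors.

  import difflib
  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE VIRTUAL TABLE source_fts USING fts5(source_text)")
  conn.executemany("INSERT INTO source_fts VALUES (?)",
                   [("Open the file",), ("Close the file",), ("Save document",)])

  def suggest(query, limit=5, cutoff=0.80):
      # Step 1: let the full text index find rough candidates quickly.
      words = [w for w in query.split() if w.isalnum()]
      if not words:
          return []
      candidates = conn.execute(
          "SELECT source_text FROM source_fts WHERE source_fts MATCH ?",
          (" OR ".join(words),))
      # Step 2: post-process with the real score to select and rank.
      scored = [(difflib.SequenceMatcher(None, query, text).ratio(), text)
                for (text,) in candidates]
      return sorted((s for s in scored if s[0] >= cutoff), reverse=True)[:limit]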

There are some shortcomings to this approach, mostly imposed by the database's full text indexing. For SQLite, the fts3 module is optional, relatively new, and does not support stemming for anything but English (using the Porter stemmer). In a default install of MySQL, words shorter than four characters are not indexed. This setting can be changed, but it is a server-wide setting with obvious implications.

Full text indexing obviously has serious limitations for several languages. Since we are mostly concerned with English (the most common source language used in our tools), this is not terrible, but it is definitely not ideal.

Friedel has a conceptual model to work around the limitations of full text indexing explained above. Nothing has been implemented yet; contact him if you are interested.