This page contains notes about implementing a Translation Memory (TM) database driven by data warehousing principles for quick and effective TM lookup.
There are also other systems worth checking, either to borrow ideas from or to adopt outright.
We don't need to store all matches in the actual warehouse, since we are mostly only interested in matches above 80%; storing a 15% match, for instance, would be of no interest. We might still want to keep that data in a separate store that is not accessed during lookup but can be used if the information is updated later.
Source text could be stored in a dimension table together with various permutations of the text, e.g. lowercase, uppercase, accelerator stripped and placeable stripped.
The data table would then simply store the matches.
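As a rough sketch (not tmserver's actual schema or API), the permutations for such a dimension table could be generated like this, assuming `&`/`_` accelerator markers and printf-style `%s`/`%d` placeables; all names are illustrative:

```python
import re

def permutations(source):
    """Return the normalised variants of a source string that a
    dimension table could store alongside the original."""
    return {
        "original": source,
        "lowercase": source.lower(),
        "uppercase": source.upper(),
        # Strip & and _ accelerator markers, as in GUI labels like "&File".
        "accelerator_stripped": source.replace("&", "").replace("_", ""),
        # Drop printf-style placeables such as %s or %d.
        "placeable_stripped": re.sub(r"%[sd]", "", source),
    }
```

Matching could then be tried against each variant in turn, so that "&Save" in a new file still finds a stored "Save".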
The idea of using the warehouse is that we only check distances on source text. This source text is mostly unchanging (not quite true, as it slowly drifts). Thus it is possible to gather all source text and pre-calculate distances. As target text is created it can be linked to source text via the pre-calculated distances.
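A minimal illustration of the pre-calculation step, assuming plain Levenshtein edit distance is the measure (the actual scorer used by a TM engine may differ):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Pre-calculate pairwise distances over the (mostly static) source text;
# targets added later only need to be linked to their source string.
sources = ["Open file", "Open a file", "Close file"]
distances = {(a, b): levenshtein(a, b)
             for a in sources for b in sources if a < b}
```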
This allows the database to be disposable. This is a principle we would like to keep in mind: you should be able to rebuild the database of TM matches from existing data. That data could be existing translations, other TM stores (e.g. TMX) and compiled translations (.mo files).
The assumption that we have a finite amount of source text ignores the fact that translations residing on a computer usually relate to currently released software, while the translator will be working on software yet to be released. Their source text is therefore most likely not in the TM. This means that in much translation work we will still have to deal with source text that is not in the data warehouse.
Some ideas for dealing with missing source text:
Is it possible to add new database entries without matching them against every single existing entry? If you had 10,000 entries and wanted to add a new one, could you find one close match and then compare the string only against the entries surrounding it to satisfy the distance calculations, or must you match against all entries?
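One possible answer, sketched here under the assumption that the distance is a true metric such as Levenshtein: the triangle inequality `d(x, y) >= |d(x, p) - d(y, p)|` lets you prune, using only pre-calculated distances to a pivot entry, every stored string that provably cannot be close to the new one, so only a small neighbourhood needs exact comparison. All names are illustrative, not tmserver code:

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance (a metric, so the triangle
    inequality holds)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def close_matches(new, pivot, pivot_distances, threshold):
    """pivot_distances maps each stored entry to its pre-calculated
    distance from the pivot; only unpruned entries are compared exactly."""
    p = levenshtein(new, pivot)
    matches = []
    for entry, dp in pivot_distances.items():
        if abs(dp - p) > threshold:
            continue  # cannot be within threshold of `new`: no comparison needed
        d = levenshtein(new, entry)
        if d <= threshold:
            matches.append((entry, d))
    return matches
```

With several pivots (as in BK-trees or vantage-point trees) the pruning becomes much more aggressive, which is the usual way this idea scales to large TMs.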
This is the approach currently used in tmserver. When full text indexing is available in the underlying database, it is used to speed up the initial step of finding possible suggestions. Post-processing then ensures that proper scoring is still used to select and rank the final set of suggestions.
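The two-phase lookup described above might look roughly like this, with `difflib` standing in for whatever scorer tmserver actually uses and the 80% cut-off mentioned earlier:

```python
import difflib

def rank_suggestions(query, candidates, min_similarity=0.8):
    """Re-rank candidates from a cheap full-text search with an exact
    similarity score, keeping only matches above the cut-off."""
    scored = []
    for source, target in candidates:
        score = difflib.SequenceMatcher(None, query, source).ratio()
        if score >= min_similarity:
            scored.append((score, source, target))
    scored.sort(reverse=True)
    return scored
```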
There are some shortcomings to this approach, mostly imposed by the database's full text indexing. For sqlite, the fts3 module is optional, relatively new, and supports stemming only for English (using the Porter stemmer). In a default MySQL install, words shorter than four characters are not indexed; this can be changed, but it is a server-wide setting with obvious implications.
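For reference, the server-wide MySQL setting in question is, to the best of my knowledge, `ft_min_word_len` (for the MyISAM full-text engine); changing it requires a server restart and a rebuild of the affected FULLTEXT indexes, and affects every database on the server:

```ini
# my.cnf -- server-wide; restart MySQL and rebuild FULLTEXT indexes afterwards.
[mysqld]
ft_min_word_len = 2
```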
Full text indexing clearly has serious limitations for several languages. Since we are mostly concerned with English (the most common source language used with our tools), this is not terrible, but it is definitely not ideal.
Friedel has a conceptual model to work around the limitations of full text indexing explained above. Nothing has been implemented yet; contact him if you are interested.