This page is outdated and was only ever a draft planning concept. Don't take anything here as describing any released version of Pootle.

Pootle Metadata Storage

This is an attempt to crystallize some of the discussions we have had recently on improving the Pootle architecture. If you have a different idea for how things should work, the mailing list is the right place to discuss it, but clarifications are welcome on this page.

These are some of the interacting issues we need to consider in parallel:

  1. Scaling to multiple processes (e.g. when running under Apache) requires better locking of our file interaction, changes etc
  2. Scaling to larger numbers of files (e.g. 180000 for Debian) will probably require faster statistics generation etc (in general, translation metadata)
  3. Moving to generic API for translation storage (the base classes we use for PO, XLIFF, etc, etc) requires reworking our storage interaction

Locking and Base Classes are already being worked on in the Pootle-locking-branch. In addition, because of Base Classes, we have been factoring out the statistics generation, which was horribly intertwined with PO file interaction.

It may be helpful to review the terms used in the base classes; they are outlined under terminology.

Current status

Information Stored

Metadata here includes the following information that we currently store/use and that is either not stored in the PO file or is summary information:

  1. Counts of translation units (messages) in translation stores including
    • Number of strings in a translation store, and number of translated/untranslated strings
    • Number of words in original and translation of each string
    • Which strings in a file (referenced by position in the file) are translated/untranslated
    • Which strings in a file have suggestions waiting for processing
  2. Results of passing strings through checks
    • Which strings in each file have failed each of the checks (including fuzzy, untranslated, as well as all the punctuation checks etc)
  3. Assignment information
    • strings within a file can be assigned individually or in groups (e.g. the whole file) to translators for a particular purpose (e.g. translate, review, etc)
  4. Quick Statistics for a translation project (which is a set of translation stores translating a project into a language)
    • List of files
    • the number of words and strings and the number of translated words and strings for each file
  5. Goal information for a translation project
    • A number of goals can be defined for a project
    • each goal has a list of files or directories (implying all files within that directory) categorised in that goal
    • each goal has a list of users assigned to that goal
  6. Rights for a translation project
    • There are default rights, rights for a 'nobody' user (not logged in), and rights that can be assigned to specific users
    • These rights currently include view, suggest, translate, review, download archive, compile to mo, assign strings/goals, and administrate
  7. Users for Pootle
    • Authentication info: username, email address, hash of password, activation status
    • Site-wide rights (project administrator)
    • user preferences - selected projects and languages (for shortcuts)

Storage Formats

This information is currently stored in text files.

  • Counts and checks are stored in a text file called xxx.po.stats. This file also contains a timestamp for the po file and suggestions file it depends on (from when the stats were last updated) etc
  • Assignments are stored in another text file called xxx.po.assigns
  • Quick Statistics are stored in a translation project stats file in CSV format - pootle-$project-$language.stats
  • Goals and Rights are stored in a project prefs file (also a text file) - pootle-$project-$language.prefs
  • Users are stored in a project-wide users prefs file (also a text file) - users.prefs
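
The timestamp dependency described for xxx.po.stats can be sketched roughly as below. This is only an illustration, not Pootle's actual code: the helper name stats_are_stale and the idea of passing the recorded timestamps in as plain numbers are assumptions made for the example.

```python
import os


def stats_are_stale(po_path, recorded_po_mtime,
                    pending_path=None, recorded_pending_mtime=None):
    """Return True if the stored stats no longer match the files on disk.

    Hypothetical helper: the recorded_* values stand for the timestamps
    that were saved when the stats were last generated.
    """
    # The translation file itself changed since the stats were written.
    if os.path.getmtime(po_path) != recorded_po_mtime:
        return True
    # A suggestions file exists: compare it with its recorded timestamp.
    if pending_path is not None and os.path.exists(pending_path):
        return os.path.getmtime(pending_path) != recorded_pending_mtime
    # No suggestions file now: stale only if one was recorded earlier.
    return recorded_pending_mtime is not None
```

Every process that touches a file has to repeat a check like this, which is part of why the text-file approach becomes awkward once several processes are involved.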

This is all far too messy and we need to clean it up properly.

Other Data

Other data that is not stored in the actual translation file (but isn't strictly metadata):


Suggestions

  • These are suggestions that are waiting to be accepted / rejected
  • currently stored in a po file alongside the original po file called xxx.po.pending
  • for synchronization it is important that pending changes include the original source string, the original target translation as well as the new target translation. Otherwise we cannot pick up conflicts
  • Currently we only store the original source and new target, but this is really a topic for a separate page.
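
To illustrate why the original target matters for picking up conflicts, here is a minimal sketch. The PendingChange class, the check_pending function and the four result labels are all hypothetical names invented for this example; they do not describe the existing xxx.po.pending format.

```python
from dataclasses import dataclass


@dataclass
class PendingChange:
    source: str           # original source string
    original_target: str  # translation the suggestion was based on
    new_target: str       # suggested translation


def check_pending(change, current_source, current_target):
    """Classify a pending change against the unit as it is in the store now."""
    if change.source != current_source:
        return "source-changed"   # the suggestion refers to an older source
    if current_target == change.new_target:
        return "already-applied"  # nothing left to do
    if current_target != change.original_target:
        return "conflict"         # someone else changed the translation
    return "clean"                # safe to accept as is
```

Without original_target, the "conflict" and "clean" cases cannot be told apart, which is the limitation of storing only the original source and the new target.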

Text Indexes

  • We currently index all the strings and translations in a Lucene text index (if PyLucene enabled / available)
  • This really helps for fast text searching; Lucene is world class in this regard
  • Indexes are stored in one Lucene index per translation project.

Plan for Relational Database

This is a proposal to move all of the above metadata (Counts, Checks, Quick Statistics, Assignment, Goals, Rights and User information) to a backend relational database. This move would also give us an opportunity to clean up exactly what metadata we need to store and how it interacts with changes and locking.
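
As a rough idea of what this could look like for the counts and checks, here is a small sketch using the sqlite3 module from the Python standard library. The table and column names (store, unit_stats, check_failure) and the helper functions are assumptions for illustration only; no schema has been agreed on.

```python
import sqlite3

# Hypothetical schema: one row per translation store, per unit, and per
# failed check. Units are referenced by their position in the file.
SCHEMA = """
CREATE TABLE store (
    id    INTEGER PRIMARY KEY,
    path  TEXT NOT NULL UNIQUE,
    mtime REAL NOT NULL
);
CREATE TABLE unit_stats (
    store_id     INTEGER NOT NULL REFERENCES store(id),
    position     INTEGER NOT NULL,
    source_words INTEGER NOT NULL,
    target_words INTEGER NOT NULL,
    translated   INTEGER NOT NULL,
    PRIMARY KEY (store_id, position)
);
CREATE TABLE check_failure (
    store_id   INTEGER NOT NULL REFERENCES store(id),
    position   INTEGER NOT NULL,
    check_name TEXT NOT NULL,
    PRIMARY KEY (store_id, position, check_name)
);
"""


def create_schema(conn):
    conn.executescript(SCHEMA)


def quick_stats(conn, store_id):
    """Derive the per-file quick statistics from the per-unit rows."""
    row = conn.execute(
        "SELECT COUNT(*), "
        "       COALESCE(SUM(translated), 0), "
        "       COALESCE(SUM(source_words), 0), "
        "       COALESCE(SUM(CASE WHEN translated THEN source_words ELSE 0 END), 0) "
        "FROM unit_stats WHERE store_id = ?",
        (store_id,),
    ).fetchone()
    return {
        "strings": row[0],
        "translated_strings": row[1],
        "words": row[2],
        "translated_words": row[3],
    }
```

With a layout along these lines, the quick statistics become a query over the per-unit counts rather than a separately maintained file.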

Contentious Issues

Discussions that this would raise:

  • Discussions about how we connect to the database, which databases we support, etc, etc
  • Discussions about whether we should store all the translations in the database as well rather than the current file-based system

These are the easiest things for people to suggest, without getting into the nitty-gritty of solving problems. Discussion of these should take place separately from this discussion and planning. Reasons for this:

  • Database Support is really up to the developers who actually implement this, although it is important that the right choice is made and we could give criteria here
  • Storing all translations in the database basically amounts to a redesign of Pootle, if proposed as the only way of storing translations. There are also complex issues that relate to synchronising with version control, allowing download/upload of files etc. And it is a great advantage of Pootle that you can currently just run it on a bunch of translation files. As has been pointed out on the Debian list, it makes more sense to later consider implementing TranslationStores that use the database as a backend, if anything.

We can only handle so much change at one time and we already have 3 or 4 major changes going on, so let's make sure some of our current improvements land before we take up too much time discussing the above.

Other options considered

Other options we looked at for how to store metadata:

  • As they currently are (text files) - too clumsy, difficult to extend, complex to handle locking etc etc
  • In XML files - more extensible but otherwise all the above problems
  • Within the translation files - creates problems with working with upstream versions of files, etc
  • In a simpler non-relational database like bsddb (included in Python) - not much advantage over a relational database except inclusion in Python, probably less scalable
  • Try and store within Lucene Text Index - not really designed for this purpose, makes Lucene a hard requirement
  • An Object DB / Python Persistence engine - not as standard, not necessarily as open to other tools

In the end, this is exactly the kind of thing relational databases were designed for, so it seems a clear choice.

Issues for Design

  • Extensibility (e.g. storing statistics on different things)
  • Portability
    • It would be nice for people to be able to use something like SQLite for small installations, and more robust client-server databases for bigger installations
    • We have the Python DB-API which helps here, and jToolkit (the web framework) also includes a suitable database portability layer
  • Locking etc
    • databases can help a lot here
    • we still need to make sure the metadata gets updated properly when changes are made to files simultaneously in different processes etc
  • It helps to separate out which data is basically summary info that can be regenerated and which is only stored in the database
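
As a sketch of how the database can help with the locking and regeneration points above, the following replaces the regenerable summary for one store inside a single transaction, using only cursor, commit and rollback from the Python DB-API. The store_summary table and both function names are hypothetical. The "?" placeholders are the style sqlite3 uses; other DB-API drivers use a different parameter style, so a portability layer would have to account for that.

```python
import sqlite3


def create_tables(conn):
    # Hypothetical table holding summary info that can always be regenerated
    # from the translation files themselves.
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS store_summary ("
        " path TEXT PRIMARY KEY,"
        " strings INTEGER NOT NULL,"
        " translated INTEGER NOT NULL)"
    )
    conn.commit()
    cur.close()


def replace_store_summary(conn, store_path, strings, translated):
    """Replace the summary row for one store as a single transaction."""
    cur = conn.cursor()
    try:
        cur.execute("DELETE FROM store_summary WHERE path = ?", (store_path,))
        cur.execute(
            "INSERT INTO store_summary (path, strings, translated) VALUES (?, ?, ?)",
            (store_path, strings, translated),
        )
        conn.commit()
    except Exception:
        # Leave the previous summary in place if anything went wrong.
        conn.rollback()
        raise
    finally:
        cur.close()
```

Other processes then see either the old summary or the new one, never a half-written one. Data that exists only in the database (assignments, goals, rights, users) cannot be rebuilt this way and needs more careful handling.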