Translate Toolkit & Pootle

Tools to help you make your software local

Debian and Wordforge

This section documents the team effort to create an internationalization infrastructure for the Debian project, which implements Pootle.

This project is important for a number of reasons, not least of which is the need to explore the issues involved in creating a cohesive i18n process.

Background

At Debconf6 in Mexico, a number of events, discussions and decisions occurred to found this effort.

Here is Christian Perrier's summary of that information.

1. Summary

The work on i18n at Debconf6 has been particularly interesting and productive.

The main topic is of course, currently, discussion on i18n infrastructure, both summarizing existing features (most of them being summarized in the paper I published along with Javier Fernandez Sanguino) and future features.

The two main aims were pursued in two BOF sessions, and the i18n talk by Javier and myself concluded the week.

Many other informal discussions also happened, with several people involved, among whom I'll cite:

  • Javier Fernandez Sanguino
  • Otavio Salvador
  • Michael Bramer
  • Nicolas François
  • Javier Sola
  • Gerfried Fuchs
  • Margarita Manterola
  • Raphaël Hertzog
  • Frans Pop

Discussions from the mailing list:

  • Denis Barbier
  • Gintautas Miliauskas aka “Gintas” (prospective GSoC student)

During our first session and the initial discussions, the main point was establishing ideas about the needs of the infrastructure: what its targets (a.k.a. users) were, and what features might be needed by each of these targets.

The second session happened after some informal work between contributors, and was aimed at being a summary of the ideas that were floating around.

All this led to the following conclusions:

Infrastructure targets

We identified the following targets, or categories of users:

  • Administrators
    • They are in charge of managing the system and the users
  • Translators
    • They work on translations of original strings coming from “upstream”
  • Reviewers
    • They check the work from the translators and certify that it meets the standards of each translation team
  • Maintainers
    • Either originating inside Debian (the package maintainers) or outside (upstream software authors and maintainers), they are the source of translatable material and the destination of translations.
  • Visitors
    • They are occasional visitors to the web site, or potential users such as governmental institutions or non-government organizations (NGO).
  • Team coordinators
    • They are in charge of coordinating the work of translation for that language or project.

Needs of each user category

  • Administrators
    • add/manage projects
    • add/manage languages
    • delegate to backup admins, but also delegate tasks to translation-team coordinators.
  • Translators
    • get information about needs and priorities
    • “book” a translation. Reservation should be valid only for a certain amount of time. After that (the calculation can be automated), the translation is released (posted by the robot in d-l10n-foo).
    • get material (web, mail, SVN…) or work online
    • derive translations (translate from languages other than English, e.g. Spanish, Russian, French, Afrikaans)
    • choose their preferred format (XLIFF, PO…)
    • license translations (???)
    • ability to merge reviews and ACK proposals one by one
    • avoid collisions with files translated in other projects
    • need to enforce the concept of “owner” of a translation
    • optionally more than one owner in some projects
    • glossary (able to propose several translations)
  • Reviewer
    • get work assigned, on request
    • express “Intent to review” (released after a set amount of time)
    • do work in public
    • see what other reviewers have proposed
  • Maintainers
    • modify the source location
    • send the material
    • ask for updates during release cycle by raising priorities
    • get the updated material
    • be notified of updates (opt-in). Options:
      • every commit
      • when “Ready”
  • Visitors
    • learn about the system (stats..) (references to i18n)
    • propose changes in translations
  • Team coordinator
    • can be per project and per language
    • get status and stats about their field of expertise
    • add projects
    • manage assignments (in addition to automated unassignments)
    • setup and use different processes from team to team (number of reviews, etc.)
    • group translations
      • set some goals, and see/show if the goal is achieved

These design goals make it all the more important that the project be as modular as possible. The core of the project should be a backend into which all other modules plug.
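The timed “booking” described in the translator needs above could be sketched as follows. This is a minimal illustration only; the Booking class, its field names, and the 14-day period are assumptions, not part of any specification:

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the booking rule: a translator reserves a file,
# and the reservation lapses automatically after a set period so the file
# can be released back to the pool (e.g. posted by a robot to d-l10n-foo).
BOOKING_PERIOD = timedelta(days=14)  # assumed value, not specified in the text

class Booking:
    def __init__(self, translator, booked_at):
        self.translator = translator
        self.booked_at = booked_at

    def is_expired(self, now):
        return now - self.booked_at > BOOKING_PERIOD

def release_expired(bookings, now):
    """Return only the bookings still active; expired ones are dropped."""
    return {f: b for f, b in bookings.items() if not b.is_expired(now)}

now = datetime(2006, 6, 1)
bookings = {
    "po/fr.po": Booking("alice", datetime(2006, 5, 30)),  # 2 days old
    "po/de.po": Booking("bob", datetime(2006, 5, 1)),     # 31 days old
}
active = release_expired(bookings, now)  # de.po is released, fr.po stays
```

The automated release is what makes bookings safe: no translation can stay locked indefinitely by an inactive translator.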

Going with WordForge

The presentation by Javier Sola about the Pootle project finally convinced us that working with the members of the WordForge project is certainly the way to go. Their project is aimed mostly at these goals, and at even more that we had not yet completely formalized.

That project doesn't suffer from the concerns we have with the Rosetta project from Ubuntu/Canonical. We have indeed since confirmed that WordForge also works on an open standard for communication between localisation projects, and that this could be used to communicate between the Debian infrastructure and that of Ubuntu.

These early feature specifications will certainly have to be enhanced and completed in the future, but they will probably already allow Debian to merge them into the Pootle/WordForge specifications, which are still a work in progress.

We will bring the WordForge specifications into line with ours (this needs agreement on how to “mark” Debian additions/needs).

Google Summer of Code project

The challenges with the GSoC project are multiple:

  • complete the project preparation in a very short time (less than 1 week)
  • give a precise goal to the student
  • make it reasonable to achieve in the 3-month time frame

The first idea to come out was requesting work on some “bounties” of the WordForge project. However, some of us would prefer the specifications to be ready before we go down that path. Given that they obviously won't be complete, we finally decided to launch a fairly conservative project, just to ensure that the work done still has some benefit for Debian without compromising the future.

As a consequence, it was decided that the requested work will be separating the frontend from the backend in Pootle, which will allow future work to concentrate on the backend used by all future software in the framework.

Some ideas about modules

This part is more prospective and will need some reorganization. It's more intended to be a summary of ideas that have been floating in the mailing list.

Import modules
  • Import from po-debconf (needs standardization action to provide up-to-date POT; at the minimum, reproduce the current layout)
  • Import from programs' PO (ditto)
  • Import from man pages converted with po4a (standard layout mandatory)
  • Documentation (action from maintainers == opt-in)
  • Web site (no action, source under control)

Translation modules
  • translation teams define their own processes from a set of standardized actions
    • TTD (Translation To Do)
    • TTU (Translation To Update)
    • RFR (Request For Review)
    • Reviewed (with counter) == LCFC (Last Chance For Comments)
    • Pushed to maintainer via Debian BTS
    • Pushed to maintainer via another BTS, email, SVN
    • Ready for Use

  • Different processes for different types of translations
  • Branching translations with merge features (manually or automatically for stable/testing/unstable)

Export modules

  • Individual PO files
  • Set of PO files as a tarball
  • Individual XLIFF files
  • Online work

Interface modules

  • Web interface
  • Mail interface

Communication modules

  • Other Pootle servers
  • Rosetta servers
  • TP server (?)

Future plans for Debian i18n contributors

Reviving the DDTP

While the work on the new infrastructure advances, Grisu (Michael Bramer) will stabilize the current DDTP code to allow some maintenance of existing translations of package descriptions. Indeed, most of this work has already been done.

In parallel, and because APT 0.6 now includes support for translated package descriptions, the use of these will be promoted. Temporarily, the Translations files will be hosted on another server, namely ddtp.debian.net.

Some discussion has to happen with the ftpmasters team to decide whether the use of Translations files on the FTP servers is considered suitable, and when their inclusion can happen. Of course, for this to happen, the DDTP must have a working maintenance system, so that package maintainers can be sure that the bug reports they might receive from the DDTP will be maintained.

The plan here is to temporarily use http://ddtp.debian.net as the demo case of what can be done with “Translations-*” files and the new APT. We should (re-)advertise this, have it used for a few weeks, and then discuss with the ftpmasters about integrating it into the main repositories, alongside the Packages files.

Extremadura 2006

A meeting will be organized in Extremadura, from Thursday September 7th to Sunday September 10th.

This meeting will use the specifications previously finalized by the Debian i18n community to allow contributors to start building a consistent backend to the future Pootle system, benefitting from the work of Gintautas to separate both.

Inviting the Pootle developers to the meeting is considered highly desirable. Hopefully, in the meantime, enough will have been achieved, especially by Gintas during his work.

The goal could be setting up the first Debian Pootle server.

For all information about Extremadura sessions, the #extremadura-2006 channel can be used on freenode (irc.debian.org).

Further information

You can see the slides used during the 2nd BOF of Debconf about i18n infrastructure, plus the summary of the current ideas about the infrastructure. N.B.: this material is provided for interest; it is still being developed.

Christian and Javier gave a talk on the i18n infrastructure which may be seen in this video.

The Debian wiki hosts i18n pages on Ideas for the Debian i18n Framework, The i18n Infrastructure, l10n Coordination and L10n Workflow.

Work started

At that point (20th May, 2006), Gintautas said:

“if my application is accepted, we should have something running in a few weeks (my first milestone is on June 20th).”

and Alberto Escudero said:

“I am working on infrastructure issues related to Wordforge with Javier and other people in the Wordforge gang. Lately I am defining a mechanism to connect the Pootle backend to an OpenOffice.org build infrastructure (XML-RPC, btw).”

BOF 2

On 22nd May 2006, Nicolas François posted this summary of the second BOF on the i18n infrastructure.

  • Another kind of user: institutions (they should fit in the Visitor category)
  • 2 kinds of team-coordinators: per project/per language
  • How to provide support to the end users: language packs
  • The final users' need to report bugs
  • Administrators' tasks
    • manage authorizations and ownership
    • manage users
    • remove inactive users
    • some tasks could be delegated
  • Translators need priorities
  • Several owners or a group of owners for a translation (e.g. Dutch D-I)
  • Need for glossary/glossary enforcement
    • associate multiple translations for a word, display them to the translator
    • when an interface is used, it could look like a 3-paned window:

| English English English | Glossary     |
| English English         | Word1        |
| Translation Translation | Translation1 |
| Translation             | Translation2 |
  • The maintainers need notification (but send it only if they want it)
  • When should the notification be sent?
    • every commit
    • when the translator says “My translation is OK”
  • Maintainers need to increase the priority (I'm going to release), but also to decrease the priority (“I'm currently working on it”)
  • Some translation teams may need a stable version: use of branches

(for all teams?, only at some point in time?)

  • If there are branches, we need a “merge” feature
  • Statistics for testing are good, even if we cannot translate testing.
  • Reviewer needs: to be done
  • Team coordinator may need to set some goals, and see/show if each goal is achieved
  • This may require meta-projects (d-i level 1, …)
  • Users should be able to propose some changes
  • The architecture should support any kind of document (not only PO)

[note from another BOF, I don't know if this will be the case in the first implementation]

  • the release coordinator must be contacted to get some help from the FTP master to get the DDTP translation in the official archive. It should be added to their release goal. Some meta-data may be needed for each string (e.g. string size constraints).
  • The Wiki has to be used for the specs
    • describe goals and details
  • Multiple instances of the infrastructure (e.g. an instance for a language team): need for communication between these instances. This can be important for countries where international communication is expensive.
  • Offline tools are also very important
  • Offline and Online reviews
  • Rosetta has/will have an XML-RPC interface
  • Licence of the translations? This is an issue. A thread has to be started on debian-legal.

Some notes taken during Javier's presentation about Wordforge

  • Glossary enforcement
  • a problem in PO: the reviewers do not know which messages need to be reviewed
    • in the PO processes, we send patches for the review of big POs
    • this is possible in XLIFF
  • TM (Translation Memories) need small sentences.
  • There is a sentence separator in the XLIFF strings to help reuse strings
  • The XLIFF specification is stable. [Note: IIRC, the current spec is 1.1, a 2.0 will come with little change.]
  • An XLIFF file contains many types of content (the translatable strings, context, a glossary, a TM, results of tests).
    • XLIFF is(/will be) used as the backend of Wordforge
    • XLIFF is very suitable for offline translation (the users will have the glossary terms they need)
  • Pootle will be distributed (communication between multiple instances). (A server could receive POT, and send PO to the other instances.)

Speed and capacity

There has been considerable discussion on the most effective structure for the different components of the i18n infrastructure. Estimates of the number of files and users active on the proposed server at any one time are difficult to make at that stage, but the aim is to provide for different situations, and for growth.

gettext-0.15

Gettext 0.15 was released in August, 2006, containing, among other very useful improvements, a msgmerge capable of handling much bigger files much faster, and the long-awaited msgctxt (context) facility. Future planning for Pootle and the i18n interface needs to take this increased gettext capability into account.
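For illustration, msgctxt lets two occurrences of the same msgid carry different translations; the entries below are invented examples, not taken from any Debian package:

```po
#: src/menu.c:42
msgctxt "menu item"
msgid "Open"
msgstr "Ouvrir"

#: src/state.c:17
msgctxt "file state"
msgid "Open"
msgstr "Ouvert"
```

Before msgctxt, such collisions had to be worked around with ad-hoc disambiguating prefixes inside the msgid itself.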

scale

On 9th June 2006, Zejn Gasper said:

Not only databases, but also file systems can scale; e.g. you can set up an NFS cluster, so flat files aren't necessarily a bottleneck. Of course, only Pootle is given access to the files.

XML files do not need to be parsed every time, because Python has pickle, which can store a Python object in binary format in a file. This may be a rather risky, hackish trick, but it could probably greatly accelerate XML handling. Of course, the plain XML also gets stored alongside the pickled version. Also, since Pootle is converting to Kid templates, I'm not afraid it would be slow, because Kid uses the ElementTree XML parsing library, which has a C implementation. Pickle could also be used for RPC.
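The pickle trick described here can be sketched as follows; the file names and XML structure are invented for the example, and this illustrates the idea rather than Pootle's actual code:

```python
import os
import pickle
import tempfile
import xml.etree.ElementTree as ET

# Parse the XML once, extract the data we need, and cache the result
# with pickle so that later reads skip the XML parser entirely.
XML = """<file>
  <unit id="1"><source>Hello</source><target>Bonjour</target></unit>
  <unit id="2"><source>World</source><target>Monde</target></unit>
</file>"""

def parse_units(xml_text):
    """Slow path: a real XML parse, returning {id: (source, target)}."""
    root = ET.fromstring(xml_text)
    return {u.get("id"): (u.findtext("source"), u.findtext("target"))
            for u in root.iter("unit")}

cache_path = os.path.join(tempfile.mkdtemp(), "units.pickle")

units = parse_units(XML)
with open(cache_path, "wb") as f:
    pickle.dump(units, f)          # store the binary cache

with open(cache_path, "rb") as f:
    cached = pickle.load(f)        # fast path: no XML parsing at all
```

As Zejn notes, the plain XML would still be kept as the authoritative copy, with the pickle acting purely as a cache.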

There are some proven solutions which should be used, e.g. memcached. It can be used as a distributed cache spread across a number of servers' RAM, and has been in production use for a long time. The current cache system is less flexible and written in Python, while memcached is a dæmon written in C. The Python API binding is available, but not yet packaged in Debian.

Fuzzy matching should be implemented separately. This is usually a CPU-intensive task, so it could be handy to have “fuzzy servers” separate from the main Pootle server, which then remains responsive at all times. This would also enable having e.g. one fuzzy server for Debian translations, one for GNOME, one for KDE… if needed, of course. Also, if any group wants for some reason to disable fuzzy matching, implementing it separately makes perfect sense.

Indexing would be split; statistics, translation states and similar metadata would go into a relational database for faster access. But for the PO/XLIFF files, I think I'd somehow rather go with (pickled) flat-files.

So there we'd have indexing, caching, storage backend, fuzzy matching, middleware (this being the part of Pootle handling mails and similar) and web frontend. The Pootle server would glue this together in a simple-to-use package.

communication

Zejn also said:

I think XMPP (Jabber) is a good candidate for RPC because:

  1. The protocol is already XML-based, and XML can be nested.
  2. It would allow triggered updates, where the master server would notify slave servers of a new or fuzzy string needing translation, so there would be less need for polling the master server to stay updated. Of course, slave servers would check for new strings, e.g. daily and at Pootle restarts, but updates would still propagate much, much faster this way.
  3. Jabber could be integrated into the web interface in many ways, e.g. instead of a standard username/password, the Pootle server would require your Jabber username, and would only log you in if your account is reachable (this is a subset of the Jabber account states). Using a Jabber account as a login means fellow translators know where to reach you, and since you are signed in, they can quickly intervene if you start messing around. Using Jabber for authentication over HTTP is also a SoC project.
  4. Encryption, compression and many similar features are already implemented: all the Pootle project needs to do is to set up a Jabber server.

I was mostly inspired by a blogpost seen on planetkde.org, but have been thinking about this earlier.

workflow

On 10th June 2006, Javier wrote:

I think that we can actually separate process information from workflow.

XLIFF files allow storage of <phase> elements, which can be “translation”, “authoritative review” (the reviewer can make changes in the target), “reviewer recommendations” (the reviewer adds comments in a file, but does not change the <target>), “approval” or any other that we define.

For example:

<phase phase-name="xxx" process-name="translation"
       date="2006-01-25T21:06:00Z"
       contact-name="Alberto Martinez"
       contact-email="alberto@martinez.com">
</phase>

These phases are elements that can be used in a given workflow. If your workflow is translation → review → approval, then Pootle will enforce that a review phase is mandatory after a translation phase, followed by an approval phase. You could also require only a single translation stage, or a workflow in which a file can be reviewed by N reviewer-recommenders (who mainly add comments to the file), recording N phases in the Pootle file. Interestingly, with this method, reviewers can act at the same time, and their information and phase are added to the file when they upload; the file is then sent back to the translator, who in this case can act as “approver” (his/her translation editor must be able to display reviewer comments when in approver mode).

If we can define the rôles, then we do not need to consider workflows in the encoding of the files, only in the process tool (Pootle).

Workflow can be defined for each project (a project being the set of files of a piece of software that will be translated to a given language, such as Gaim 1.5 for French). It can be inherited by default (from the language level), to simplify configuration, but remain changeable.

If this is combined with allowing commit access (upstream) to whichever rôle the team decides (translator, approver), it should provide enough flexibility for any type of workflow. We will have to test a number of them to ensure that this assumption is correct.

Every string in the file can be associated with a specific phase. Some strings in the same file could be untranslated, some translated but not reviewed, some reviewed. There are loopholes in the XLIFF definition, but they can be overcome with implementation specifications.
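The phase-ordering rule described here can be sketched as follows; the may_record helper and the per-project WORKFLOW list are hypothetical illustrations, not Pootle's real API:

```python
# A team's workflow is an ordered list of phase names, matching the
# process-name values seen in the <phase> example above.
WORKFLOW = ["translation", "review", "approval"]  # per-project setting

def may_record(phases_so_far, new_phase, workflow=WORKFLOW):
    """Allow `new_phase` only when it is the next step in the workflow.

    `phases_so_far` is the list of phase names already recorded in the
    file, in order; the tool refuses any phase recorded out of order.
    """
    next_index = len(phases_so_far)
    if next_index >= len(workflow):
        return False  # the workflow has already been completed
    return workflow[next_index] == new_phase
```

For example, may_record([], "review") is rejected because a review phase cannot come before a translation phase; the N-reviewer case Javier describes would need a richer rule that accepts repeated review phases.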

The diagrams begin

On 9th June, 2006, during the “DB or file-based storage” discussion, Otavio posted:

I like the idea that an RDBMS is the way to store the data, and that the “final” format (PO or whatever) should be built just as a product of it. I also agree with the idea of having a “server layer” between the client and the backend, because it makes it trivial for users to use the backend without much hassle, and without needing to handle locks and other internal details.

One thing that came to my mind just now is that we might have the following design:

With that in mind, I realise that it might be easier to get collaboration from the Pootle people, since they can help us design the layers below their compatibility layer, and at the same time improve their system more easily, without needing to be careful not to break our system. So with that, we are both basically free and able to help each other.

The only drawback that I can see is that we won't contribute much code back to Pootle directly. Basically, this makes the structure more “backend agnostic”, and we then plug our backend in there. That, IMHO, is a good consequence, since they'll be free to choose and develop other backends without having to commit only to us.

Javier answered:

I would like to propose a course of action.

  1. Gintautas works on the API (which has to include the separation of front-end and back-end), and on the XML-RPC. This will really clear up things for further work, and will facilitate the specification of a DB. This work is done in a separate branch, as required by the project.
  2. The WordForge present staff will integrate these changes in Head as soon as they are clear, so that Gintautas' branch and Head stay as close as possible, and integration is not a problem later.
  3. Independent of the work that is continuing on WordForge (front-end or file-based back-end), WordForge will be happy to fund somebody to work on designing and implementing an RDB back-end. This can be somebody working in parallel during the coming months, or Gintautas after he finishes (if he has time before going back to school). Then we just use — or continue working on — the best back-end that we have. We would work together with this person to ensure that all the data that we need in the project is present in the DB (even if most of it should already be defined by the API).

I somehow have a feeling that the optimal result will be something between file-based and DB-based, but again, I could be completely wrong. We will not really know what it is until we try.

Regardless of whether the best storage is files or a DB, the standard formats are precisely designed so that they will last for a long time. XLIFF is an OASIS standard that will evolve, and will not be replaced in the coming decades. Just as with Unicode, the standards are there to last… and, even if change occurs, you would still need to write converters. On the contrary, when these standards advance (within backwards compatibility), you will have to modify the database to handle new types of information, but you would not need to modify a file-based system.

I agree about the idea to have a “server layer” between the client and backend because that makes it trivial for the user to use the backend without much hassle, and without the need to control locks etc.

I would see a separation between the Pootle web-based file server and the Pootle editor, maybe putting them on top of the API, even if this might not be the best place for the editor. Would this break the idea that you expressed with the server layer above?

Our idea — as I mentioned — is still that we work together on the different options. We would not like to see an “our back-end” and a “your back-end”. The advantage of working together is that there are more people thinking, and the good news is that we have sufficient funding to look at both options, and then pick the one that we all agree is the best, and continue working in that direction.

Packaged for Debian

On the 15th July, 2006, Christian Perrier announced that he had created some very early Debian packages for python-jtoolkit and Pootle.

It turned out that Nicolas François had also produced Debian packages for python-jtoolkit and Pootle, which were uploaded by Otavio Salvador later that same day. Fast work, Debian! :)

New life for the DDTP

On the 4th August, Christian Perrier commented on the resuscitation of the DDTP (a web interface to translate Debian package descriptions):

It's good that the DDTP is now getting results.

However, the current translation process (the mail interface) is a bit different from the usual way translation teams are working. For instance, this explains why the French effort is currently mostly stopped for the DDTP.

What would be good now is getting an interface with the PO format so that translators can work on PO files for DDTP translations.

Before we get the nice infrastructure (probably based on Pootle) which we will discuss in Extremadura, it would be good if we could come up with a crude way to get PO files for the DDTP.

That requires writing something that will convert the Debian descriptions and their translations to PO format. IIRC, Otavio did write some stuff already, 1 or 3 years ago.

Maybe a po4a module would be possible… this should be checked with the po4a package maintainers: Nicolas François, Thomas Huriaux, Martin Quinson et al.

Note that Pootle integrates the po4a package.

Michael Bramer commented:

For the DDTP in the new i18n system we need:

  1. some importer to get package descriptions from the Packages files on the FTP server
  2. do the work in the new i18n system (po files, translation, review, etc.)
  3. some exporter to generate the Translation files for the ftp master

He later added:

For Pootle, we need a Packages2po and a po2Translations script for the daily work.

We need a Translation2po script to include the translated Packages in the Pootle system. But I can write some lines to generate the po files with the translated descriptions from the database too.
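A minimal sketch of what the proposed Packages2po step could do follows. The script name comes from the discussion above, but the function, the field layout, and the escaping shown here are assumptions; a real converter must handle the full Packages format:

```python
# Hypothetical core of a "Packages2po" converter: turn one package
# description (short line plus indented long lines, as in a Packages
# file) into a PO entry whose msgid is the untranslated description.
def description_to_po(package, description):
    """Emit one PO entry, recording the package name as a comment."""
    lines = description.splitlines()
    # PO strings are single-line; join the description lines with a
    # literal \n escape, as gettext expects.
    msgid = "\\n".join(lines)
    escaped = msgid.replace('"', '\\"')
    return (f"#. Description of package {package}\n"
            f'msgid "{escaped}"\n'
            'msgstr ""\n')

entry = description_to_po(
    "hello", "friendly greeter\n A program that greets you.")
```

The matching po2Translations step would do the reverse: collect the filled-in msgstr values and write them out in the Translations file layout.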

Thomas Huriaux reminded:

Please don't forget that intltool-debian scripts are already mainly doing this job. It's not a hack to use them: even if the main frontends (debconf-updatepo and po2debconf) have a name which is debconf-related, these tools are generic and can be used in this situation.

From the 6th August, Martijn van Oosterhout started to add functionality to the current DDTS:

I've written a sort-of frontend for the DDTS. It's a standalone system; it doesn't have anything to do with the DDTS directly, except that it knows how to drive the system via email and process the responses.

It's a web-interface-based system. What happens is that it tries to keep a configurable number of untranslated descriptions available. A user can click on one and provide the translation. After that it goes to the review list. From there, any number of people can review it. Once it has passed a configurable number of reviews (which can be zero), it will be sent off to the DDTS.

Features include: you can configure a list of words that, if they appear in the untranslated text, will appear underlined; when you hover your mouse over one of them, it provides a translation. This is good for words like script, command, executable, caller, and library, which tend to get varying translations depending on who's translating.

On the 17th, Jens added some documentation for the DDTP temporary system.

The DDTS temporary system has served two useful purposes: it has made it possible for some languages to update their package descriptions for the upcoming Debian Etch release, and its development and testing have brought up a number of process issues which will be addressed by the new i18n infrastructure. Although the Wordforge team has been dealing with these issue for some time in producing Pootle and the Translate Toolkit, the future users of the Deb18n infrastructure (team leaders, project leaders and translators) have had an opportunity to test a web interface, and to consider how it would work best for their needs. After all, the more we put into designing this system, the more we will all get out of it. ;)