Translate Toolkit & Pootle

Tools to help you make your software local

Google Summer of Code 2008 - Ideas

The following are project ideas for Google Summer of Code 2008. The project has applied to be accepted as a mentoring organisation, and these are the ideas that we have gathered together.

Our main aim has been to identify work that can be completed in 3 months, that is useful to us as a project, and that is also challenging and interesting.

Our software includes:

  • Translate Toolkit - this is a set of tools and libraries that converts files into translatable formats and allows those formats to be managed and manipulated. The other tools use it as a library for the same purpose.
  • Pootle - this is an online translation and translation management tool used by many projects including OpenOffice.org and Creative Commons.

Generic skill requirements:

  • Python - you must be able to code in Python; experience in another OO language will help but might make the work harder for you
  • Experience in computational linguistics is useful in some projects but most do not need any specific language requirement
  • Experience in localisation is helpful as you can then understand the needs of a localiser.

Included in each project are:

  • A grading
  • A description of the task
  • Where to poke in the code
  • Further reading
  • Possible mentors

If you want to discuss any of these projects, try us on IRC (#pootle on freenode.net) or mail the Translate Toolkit development mailing list.

If you want to apply as a student, you may also want to check out the official student application guide from Google.

Workflow in XLIFF

Grade: Hard

Description: The XLIFF standard is an XML-based standard for localisation. It can store various state information and can be adapted to manage a translation workflow.

By workflow we mean the simple process that moves from untranslated → translated → reviewed → approved. There are also processes for updating existing translations. These can be more complex: a review may be 'authoritative', meaning the reviewer can make changes, or 'non-authoritative', meaning the reviewer's comments are simply suggestions to the translator, who then decides whether she wishes to act on them.
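As a rough illustration of the simple workflow described above, the four states can be modelled as an ordered list in which a unit may only move one step forward at a time. The names State, next_state and can_move are hypothetical and do not exist in the Translate Toolkit; this is a minimal sketch, not a proposed design.

```python
from enum import Enum


class State(Enum):
    """The four workflow states named in the description (hypothetical names)."""
    UNTRANSLATED = "untranslated"
    TRANSLATED = "translated"
    REVIEWED = "reviewed"
    APPROVED = "approved"


# The order in which a unit moves through the simple workflow.
ORDER = [State.UNTRANSLATED, State.TRANSLATED, State.REVIEWED, State.APPROVED]


def next_state(state):
    """Return the state that follows `state`, or None if it is the last one."""
    index = ORDER.index(state)
    return ORDER[index + 1] if index + 1 < len(ORDER) else None


def can_move(current, target):
    """Assumption: only a single forward step is allowed in this workflow."""
    return next_state(current) is target
```

A real implementation would also need backward moves (for example, a reviewer rejecting a translation) and a mapping onto whatever state information XLIFF stores; both are left out here.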

This work would involve defining levels of workflow, finding a suitable toolkit or implementing the workflow classes, implementing the workflow, and creating or adapting tools to manage the workflow.

What this is not is a workflow engine. Our goal is not to make a workflow editor, no matter how much people beg, but to create a set of standard workflows that meet the needs of current translations. If you do this well it will be possible for anyone to adapt this in the code and in time we should cover all conceivable workflows that could be created without burdening this with an editing tool.

Your main aim is to stay focused on the basics of the workflow and deliver a solution that implements workflow.

You might want to consider how notification (email, Jabber, RSS) forms part of the workflow, but this is not a crucial component.

Poke the code: Not much code to poke, I'm afraid. phase is useful for understanding some of the tools used to manage the process.

Further reading:

Possible mentors:

Segmentation Tool

Grade: Medium

Description: Segmentation is the process of taking a block of text and breaking it into sentence segments. While this initially looks simple, you will run into problems when the text contains abbreviations such as "e.g.", and you will have to build rules and lists of words after which the text must not be split.

The main advantage of segmenting is that it allows us to match existing translations at a sentence level. Thus in a block of text you might have 3 sentences, 1 of which matches 100% while the others match less well and need to be reviewed. If you had not segmented you would probably not have matched anything.
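To make the abbreviation problem concrete, here is a minimal sketch of a rule-based segmenter. It treats sentence punctuation followed by whitespace as a candidate break and skips the break when the preceding word is in an exception list. The ABBREVIATIONS set and the segment function are assumptions made for illustration; this is not the toolkit's existing segmentation code, nor how ICU or SRX rules work.

```python
import re

# Hypothetical exception list: words that end in a full stop
# but do not end a sentence.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Dr.", "Mr.", "Mrs."}


def segment(text):
    """Split text into sentences, skipping breaks after known abbreviations."""
    segments = []
    start = 0
    # A candidate break is sentence punctuation followed by whitespace.
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.end()].strip()
        last_word = candidate.split()[-1]
        if last_word in ABBREVIATIONS:
            continue  # not a real sentence end; keep scanning
        segments.append(candidate)
        start = match.end()
    # Whatever follows the last break is the final segment.
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments
```

For example, segment("Use a stemmer, e.g. Snowball. It finds roots.") keeps "e.g. Snowball." inside the first sentence. An exception list like this is language specific, which is one reason an exchangeable rule format is attractive.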

Another advantage of segmentation is that it allows us to recover old translations. If someone translated some text but didn't keep the Translation Memory (the 1-to-1 map of translations) then we can use segmentation to break both source and target text into segments and try to align the texts.

Your main tasks in this project will be to:

  • Integrate ICU into the toolkit to allow us to use their segmentation rules (or find some similar established segmentation software, or expand the existing segmentation software in the toolkit)
  • Implement the SRX standard that allows segmentation rules to be specified in XML.
  • Create an alignment tool that will allow two pieces of text to be segmented and aligned. The tool should allow people to merge items that the tool segmented, move them up or down and generally edit the text. The output will be in TMX format so that the text can be used as a Translation Memory.

An interesting aspect which you might want to include is some idea of automatic alignment that will try to guess which pieces of text should belong together.

Poke the code:

Further reading:

Possible mentors:

Glossary Tool

Grade: Medium

Description: In translation it is important to have glossaries as these guide existing and new translators to use the correct words. Glossaries are like dictionaries but usually very focused on a specific domain and don't need all the detail you would find in a traditional dictionary. Usually they contain only the Source and Target words. They might contain a definition or an indication of the part of speech.

The TBX format is a format for TermBase Exchange, i.e. to allow glossaries to be exchanged. The Translate Toolkit currently has very basic support for this format. Full or better support would allow much more detail and important information to be stored and shared.

Glossaries are immediately useful but they would be more useful if your translation tool was able to warn you when a word that should have been used has not been used. In order to do this it needs a stemmer so that it can find the root of the English text before looking up your words and phrases in the glossary.
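The kind of check described above can be sketched as follows. Note that naive_stem is only a crude stand-in for a real stemmer such as Snowball, and that missing_terms and the dictionary-shaped glossary are hypothetical names chosen for this illustration.

```python
import re


def naive_stem(word):
    """Very crude English suffix stripping; a stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def missing_terms(source, target, glossary):
    """Return (term, translation) pairs where the source text uses a glossary
    term (compared by stem) but the target text lacks its translation."""
    source_stems = {naive_stem(w) for w in re.findall(r"\w+", source)}
    problems = []
    for term, translation in glossary.items():
        used = naive_stem(term) in source_stems
        # Plain substring test on the target side; no target-language stemming.
        present = translation.lower() in target.lower()
        if used and not present:
            problems.append((term, translation))
    return problems
```

With the glossary {"file": "fichier"}, the source "Delete the selected files" and a target that never mentions "fichier" would be flagged. This sketch only handles single-word terms; phrases would need the stems to be matched in sequence.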

For this project a student would have to:

  • Develop a glossary extraction tool
  • Implement the majority of the TBX specification
  • Implement a stemmer e.g. snowball stemmer
  • Using the stemmer create a terminology matching checker
  • Create a basic terminology editor

These should be integrated with or extend existing tools: Translate Toolkit, Pootle or our offline editor.

For those who want a greater challenge, you should add a glossary populator. This would take a blank glossary and try to populate it based on existing translations. The results would be checked by a human, but the population stage should save quite a bit of time.

Poke the code:

Further reading:

Possible mentors:

Format expansion

Grade: Easy to Medium dependent on scope

Description: One of the primary roles of the Translate Toolkit is to convert formats that you want to translate into translatable formats. Thus we can convert MediaWiki text into Gettext PO. We do this so that translators can use one set of tools instead of having to learn a new tool for every new format. In the same way that coders use one tool, we think localisers deserve the same respect.

The toolkit already supports a large number of formats, but most of these are focused on localisation, not content translation.

Your primary work would be to flesh out the current format support to allow many more formats to be supported. This is done by implementing the format on our base classes and then creating a conversion tool that converts the format to a translatable format.

The formats that we would want supported are the following:

  • PDF - so that the almost 300 million PDF documents might become translatable
  • RC files - these are used by WINE and ReactOS, we'd like to localise those applications. It would also make many Windows applications localisable.
  • TTX - A proprietary format used by the proprietary Trados translation tools. Supporting this format would allow 1) tools using the Translate Toolkit to be used as commercial localisation tools, 2) allow translators using Trados to translate free software as we'll be able to convert to their format.
  • Qt .ts (expand to support v1.1) and Qt .qm (allow correct compilation of these formats) - this would allow all Qt-related software to be translated and resources to be compiled.
  • Others - the formats page lists many others that we might want to support

Implementing conversion of the properties, PO and HTML formats to XLIFF according to the XLIFF representation guides would help push this to a hard project. Defining representation guides for the more important formats listed here or currently handled by the toolkit would also turn this into a hard project.

Poke the code:

  • base.py - the base class for translation storage formats
  • convert.py - the conversion framework
  • po2xliff - a bilingual converter
  • ini2po - a monolingual converter

Further reading:

  • Not much on the formats as many of these formats are undocumented

Possible mentors:

Pootle OpenID and other data sharing and logging

Grade: Hard

Description: Implement OpenID authentication on Pootle (Pootle is a web-based translation tool built on the toolkit). This will allow authentication against any Pootle or OpenID server, allowing people to log in easily. The vision of Pootle was never to be centralised but to allow various Pootle servers to be created for various needs, thus many people would need to access multiple Pootle servers. Using OpenID they can suddenly access all of these servers.

OpenID also allows the exchange of various pieces of personal data including email, name, etc. These should also be shared so that the user only needs to maintain one set of personal details.

All of this would be implemented in the Translate Toolkit so that it can be shared by Pootle and by other users of the toolkit.

Pootle currently does not implement a very good system for tracking progress or individual translation work. Using SIMILE Timeplot would allow people to see very interesting aspects of the translation progress.

Pootle also does not do a very good job of notifying people about changes in state that need their attention: new files for translation, a newly registered user, a suggested translation. These notifications need to be either emails or Jabber messages that a Pootle user can click on and respond to.

For this project the student would need to implement:

  • OpenID with data sharing
  • Better statistics gathering and rendering using Timeplot
  • Notification of events

Poke the code:

  • Check out the pootle code

Further reading:

Possible mentors:

  • Lars Kruse (sumpfralle) - I could assist or mentor (if necessary) a student

Placeable and Translation Memory Server

Grade: Medium

Description: Any previous translation is termed your translation memory (TM). Translators use/leverage these to give fuzzy matches. This ensures that they save time and also that they consistently translate across their translation tasks.
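A fuzzy match can be illustrated with Python's standard difflib module: each stored source string is scored against the new source string, and anything above a threshold is offered to the translator. The fuzzy_matches function, the dictionary-shaped memory and the 0.75 threshold are assumptions made for this sketch and are not the toolkit's matching code.

```python
import difflib


def fuzzy_matches(source, memory, threshold=0.75):
    """Return (score, stored_source, stored_target) tuples, best match first.

    `memory` maps previously translated source strings to their translations.
    """
    results = []
    for stored_source, stored_target in memory.items():
        # ratio() is 1.0 for identical strings and falls towards 0.0
        # as the two strings share less text.
        score = difflib.SequenceMatcher(None, source, stored_source).ratio()
        if score >= threshold:
            results.append((score, stored_source, stored_target))
    return sorted(results, reverse=True)
```

Given a memory containing "Open the file", the new string "Open the files" scores just below 1.0 and is returned as a fuzzy match, while an unrelated string falls below the threshold and is dropped.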

Placeables are pieces of 'text' that can be replaced in a TM without really altering the context of the text. A placeable would be any of the following:

  • Accelerator keys: e.g. & in KDE, _ in Gnome, ~ in OpenOffice.org
  • Variables: %s, $1, $var$, etc
  • Tags: <b>, <tag>, etc
  • Numbers: 1,000.00

When using a TM from Gnome on KDE you want it to be able to recognise and alter the accelerator key, thus ensuring a 100% match. For variables and tags you want the matcher to be able to discard and alter these intelligently. For numbers you want to be able to alter a number, and even the formatting of a number, as needed.
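As a sketch of what recognising and altering placeables could look like, the code below finds the placeable kinds listed above with deliberately narrow regular expressions and swaps one accelerator convention for another. The pattern, the ACCELERATORS table and both function names are hypothetical; a real framework would let each format define its own placeables.

```python
import re

# Accelerator markers as listed above.
ACCELERATORS = {"kde": "&", "gnome": "_", "openoffice": "~"}

# Hypothetical, deliberately narrow patterns for the other placeable kinds.
PLACEABLE_RE = re.compile(
    r"%[sd]"            # printf-style variables such as %s
    r"|\$\w+\$?"        # $1 or $var$
    r"|</?\w+>"         # simple tags such as <b> or </b>
    r"|\d[\d,.]*\d|\d"  # numbers such as 1,000.00
)


def find_placeables(text):
    """Return the placeables found in text, in order of appearance."""
    return PLACEABLE_RE.findall(text)


def convert_accelerator(text, source, target):
    """Swap the accelerator marker, e.g. '_Open' (Gnome) to '&Open' (KDE)."""
    marker = re.escape(ACCELERATORS[source])
    # Only replace the first marker that directly precedes a letter or digit.
    return re.sub(marker + r"(?=[A-Za-z0-9])", ACCELERATORS[target], text, count=1)
```

For instance, convert_accelerator("_Open", "gnome", "kde") gives "&Open", so a Gnome TM entry can match the KDE string exactly. The accelerator rule is naive: an underscore inside an identifier would also be treated as a marker.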

Your work in this project would see you doing the following:

  • Creating a placeables framework such that any format can define these placeables
  • Getting features upstreamed to, for instance, Gettext to make your work easier.
  • Building an XML-RPC TM server that would store TMs, allow people to submit TMs, allow people to query and would do transformations on the fly
  • Make alterations to our TMX support as needed.
  • Implement full identification of placeable variables for Gettext PO and adapt po2xliff to include placeables in the resultant XLIFF file.

Poke the code:

Further reading:

  • PO XLIFF representation guide

Possible mentors:

Translation diff'ing tool

Grade: Medium

Description: While creating a diffing tool might seem rather redundant, and in fact easy, this project involves a few more things.

Normal diffing tools are in most cases useless for checking changes in translated files. Very often the context diff is cut off so that you can't see the full context. Slight changes in layout are shown as diffs even though the actual Source or Target text remains unchanged. Changes in the header are marked as differences when they are not really that important. All of these issues lead to noisy diffs that mask the actual content that should be examined.

Another useful area to see diffs is the new Gettext feature that allows previous messages to be stored. In this case as you have the current and previous translation you would be able to see a diff of these two. The same applies to alt-trans items in XLIFF. You can compare various fuzzy matches to see exactly how the source text of the suggested fuzzy match differs from your source text.

In Pootle we use the Python diff module quite effectively to show differences between suggested translations and the current translation. With character-level diffs and the correct way to represent them, these become very powerful.
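A character-level diff of two translations can be sketched with the standard difflib module. The char_diff function and its [-...-] / {+...+} markers are assumptions made for illustration; this is not Pootle's own diff code or its way of rendering changes.

```python
import difflib


def char_diff(old, new):
    """Mark deletions as [-...-] and insertions as {+...+}."""
    matcher = difflib.SequenceMatcher(None, old, new)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append(old[i1:i2])
        if tag in ("delete", "replace"):
            parts.append("[-" + old[i1:i2] + "-]")
        if tag in ("insert", "replace"):
            parts.append("{+" + new[j1:j2] + "+}")
    return "".join(parts)
```

Because the comparison works on the text of a single unit rather than on file lines, changes in layout or in the header never show up, which is the noise problem a translation-aware diff tool has to solve.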

In this project the student would be required to:

  • Pull all diff related code from Pootle into the Translate Toolkit
  • Create a PO diff tool to correctly create non-noisy diffs. Extend this to cover XLIFF
  • Add previous translation ability to Gettext PO support in the Toolkit
  • Implement a method to view the diffs in Pootle and in the offline editor
  • Optional: Implement the same for all alt-trans tags in XLIFF.

Poke the code:

Further reading:

Possible mentors:

Complete porting of Python PO parser to libgettextpo

Grade: Hard

Description: The Translate Toolkit has two parsers for PO files. The first is written in Python and is called pypo; the second, which we call cpo, uses libgettextpo, the C library from the Gettext package.

The majority of the work to get cpo working is in place in the Translate Toolkit, but the hard parts are not yet complete. Currently you can run almost all of the Translate Toolkit commands and they will work, but we are not releasing memory correctly, so we cannot use cpo in Pootle.

There are also a number of features within cpo that are not yet implemented in pypo. These include previous messages and some new header functionality.

Your task would be to complete the porting to cpo. You will need to manage the correct releasing of memory within Python. We will test your implementation on the Pootle translation server. We believe that this will both reduce memory usage and improve speed. If successful, the OpenOffice.org Pootle server will be your first grateful user.

Your other tasks in terms of fleshing out the PO coverage will include implementing these features in both cpo and pypo and ensuring that pot2po works correctly with the new features. You will also need to examine po2xliff and implement the conversion of msgctxt and previous messages to XLIFF.

Poke the code:

Further reading:

Possible mentors:

Pootle using mod_python and file locking

Grade: Hard

Description: Pootle is a file based online translation tool. We don't use a database backend. This has many benefits but some performance disadvantages. The primary problem is that we can only run a single instance of Pootle and cannot rely on Apache with mod_python to allow the server to run several instances of Pootle.

The main issue behind this problem is that we do not fully implement locking of files within the server. Thus your first task will be to implement locking of files for reading and writing. We expect some performance impact, so you might want to address that after implementing the locking.

Once locking is in place we can move on to working with Pootle on mod_python. Pootle uses a web toolkit called jToolkit that can already work with mod_python. Your task will be to set up and document how to make Pootle work with jToolkit on mod_python. Any bugs that occur and need correcting will be yours.
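One simple way to picture file locking is a lock file created next to the translation file: whoever manages to create it holds the lock, and everyone else waits. The sketch below uses only the Python standard library; the locked helper and the ".lock" naming are assumptions made for illustration and say nothing about how Pootle should finally implement locking.

```python
import contextlib
import os
import time


@contextlib.contextmanager
def locked(path, timeout=5.0, poll=0.05):
    """Hold an exclusive lock on `path` using a sibling '.lock' file."""
    lock_path = path + ".lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError("could not lock " + path)
            time.sleep(poll)
    try:
        yield
    finally:
        # Release the lock so that waiting processes can proceed.
        os.close(fd)
        os.remove(lock_path)
```

A caller would wrap each read or write of a file in "with locked(filename):". This sketch is exclusive only, so it does not let several readers share a file, and it does not clean up stale locks left behind by a crashed process; both would matter for the performance work mentioned above.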

Lastly you will need to check and test that Translate Toolkit command line tools can be used on the Pootle installation. In other words these tools must also understand locking. You could easily address this by putting the locking logic in the toolkit itself which probably makes the most sense.

While not essential you may also want to ensure that Pootle can reload its configuration while running. Currently changes while Pootle is running are lost.

At the end of this you will have greatly improved the usefulness of Pootle for large installations as you will have removed the potential for blocking in long-running tasks. You will also have made it possible to safely use all the command line tools on a running instance of Pootle.

Poke the code:

Further reading:

Possible mentors:

Migrate Pootle off jToolkit

Grade: Hard

Description: Pootle makes use of the jToolkit web framework. This has served us well, but jToolkit is no longer maintained and we feel that we would be better served by moving to a newer, better maintained framework.

There has been some work on porting Pootle to the Django framework, which you may wish to continue. We are open to discussion as to which framework to target, although we are currently quite interested in TurboGears.

Your first task will be to put Pootle on a diet. This will mean moving as much functionality as possible out of jToolkit and out of Pootle and into the Translate Toolkit. This will ensure that good functionality that we will wish to reuse in offline tools is preserved. It also ensures that Pootle becomes smaller and hopefully easier to migrate.

Next will be the task of migration. We would like whoever takes on this role to approach this as an iterative process. We don't want you to rewrite Pootle; we want you to migrate it to the new platform. Once that is done, features can be improved or added and performance re-examined.

Poke the code:

  • the current (incomplete) migration of Pootle in the subversion branch django-migration

Further reading:

Possible mentors:

  • Lars Kruse (sumpfralle) - I could assist or mentor (if necessary) a student

General Improvements (Feature additions) to Pootle

Grade: Medium

Description: While working with Pootle at OLPC, we have come across a number of feature requests, most (if not all) of which can be implemented within the GSoC timeframe. The highest-priority ones are:

  • Ability for the translation admin to merge the translations with the latest POT for her project.
  • Ability for the Pootle admin to easily set permissions for languages on a global basis (i.e. give user foo admin rights for all Spanish translation projects)
  • Support for validation of translated strings on submission (equivalent of msgfmt --check, but for individual strings)
  • Integration with http://open-tran.eu/ (use the XML RPC interface of the site to generate suggestions)
  • Ability for language administrator to get in touch with members of the translation team
  • Ability for translators to use an intermediate language (e.g. an Aymara translator can use Spanish to understand the English msgids better) while translating (partially implemented, needs some more work).
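To give an idea of what validating a single string on submission might involve, the sketch below compares the printf-style variables in a msgid with those in its translation. It covers only this one kind of check and is not a reimplementation of msgfmt --check; the check_format function and its pattern are hypothetical.

```python
import re
from collections import Counter

# Narrow pattern: %s, %d, %i, %f and named forms such as %(name)s.
PRINTF_RE = re.compile(r"%(?:\(\w+\))?[sdif]")


def check_format(msgid, msgstr):
    """Return a list of problems; an empty list means the string passes."""
    if not msgstr:
        return []  # an untranslated string is not an error
    expected = Counter(PRINTF_RE.findall(msgid))
    found = Counter(PRINTF_RE.findall(msgstr))
    problems = []
    # Counter subtraction keeps only the positive differences.
    for spec in sorted((expected - found).keys()):
        problems.append("missing " + spec)
    for spec in sorted((found - expected).keys()):
        problems.append("unexpected " + spec)
    return problems
```

A submission such as check_format("%d files in %s", "%d fichiers") would report the missing %s before the translation is saved, so the translator can fix it straight away.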

Poke the code:

Further reading:

  • Other reading on the topic that would help

Possible mentors:

  • Sayamindu Dasgupta

Project Ideas Template

Grade: Easy, Medium, Hard

Description:

Poke the code:

Further reading:

  • Other reading on the topic that would help

Possible mentors: