Translate Toolkit & Pootle

Tools to help you make your software local

Google Summer of Code 2010 - Ideas

The following are project ideas that we have gathered for Google Summer of Code 2010.

Our main aim has been to identify work that can be completed in 3 months, that is useful to the project, and that is also challenging and interesting.

Quick links:

  • Pootle - our web based system for translation management and online translation. Used by OpenOffice.org, Mozilla, OLPC (Sugarlabs), Debian, LXDE and many others.
  • Virtaal - our powerful standalone tool for computer aided translation, including powerful features such as translation memory, machine translation and terminology assistance.
  • The Translate Toolkit - the technology platform for Pootle, Virtaal and many other localisation tools. It contains our core technologies for format support, natural language technology, and many useful tools and converters powering teams for Mozilla, OpenOffice.org and many others.

Generic skill requirements:

  • Python. Experience in another OO language will help, but not knowing Python might make the work harder for you.
  • Experience in computational linguistics is useful in some projects, but most do not need any specific language requirement.
  • Experience in localisation is helpful to understand the needs of a localiser.

Included in each project are:

  • The estimated difficulty
  • A description of the task
  • Where to start looking in the code
  • Further reading
  • Possible mentors

If you want to discuss any of these projects then try us on IRC freenode.net #pootle or mail the Translate development mailing list.

If you want to apply as a student, you may also want to check out the official student application guide from Google.

Improve Terminology Extraction

Difficulty: Medium

Skills: Python

Description: The Translate Toolkit has a terminology extraction tool, poterminology, used to generate glossaries from translation files (it is integrated in Pootle). The algorithm is mostly based on word frequency (but does some fancy stop word magic) and is best suited to software localization (for instance, by default it counts the number of occurrences of a phrase in the source code using location comments in PO files).

The project would involve improving poterminology to use NLP techniques, for example part-of-speech tagging to reject phrases that are unlikely to be good terms, better segmentation and stemming before counting word/phrase frequency, etc.
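As a rough illustration of the frequency-based approach with stemming applied before counting, here is a minimal sketch. It is not poterminology's actual algorithm: the stop word list, the `naive_stem` helper and the `candidate_terms` function are all hypothetical names invented for this example, and the suffix stripping is far cruder than a real stemmer.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real one would be per-language and much larger.
STOP_WORDS = {"the", "a", "an", "to", "of", "in", "is", "and", "for", "this"}

def naive_stem(word):
    """Very crude English suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def candidate_terms(strings, max_words=2):
    """Count stemmed word n-grams that neither start nor end with a stop word."""
    counts = Counter()
    for text in strings:
        words = re.findall(r"[a-z]+", text.lower())
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                gram = words[i : i + n]
                if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                    continue
                counts[" ".join(naive_stem(w) for w in gram)] += 1
    return counts

terms = candidate_terms(["Open the file", "Save files to disk", "Closing a file"])
print(terms.most_common(3))
```

Because "file" and "files" are stemmed to the same form, they are counted together. A part-of-speech filter could be added at the point where stop words are rejected.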

Reference code:

Further reading:

Possible mentors: Alaa, Friedel, Wynand

Terminology Compliance Quality Check

Difficulty: Hard

Skills: Python

Description: The Translate Toolkit implements a set of translation quality checks. These are used by the pofilter command and are integrated in Pootle.

One of the most difficult tasks when reviewing the work of multiple translators is ensuring that they make use of a unified terminology list to ensure consistency and to avoid confusing the users. This project would involve implementing a new quality check that detects whether the terms in a terminology file have been used consistently.

A simple keyword search would lead to too many false positives, so stemming and other NLP preprocessing should be used to improve the test results.

Optional Task: develop a different fuzzy matching algorithm for languages that lack a usable stemmer.
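One possible starting point for the optional task is approximate string matching instead of stemming. The sketch below uses Python's standard difflib module. The function names, the glossary format (a dict of source term to target term) and the 0.8 cutoff are assumptions made for illustration, not part of the existing pofilter checks.

```python
import difflib
import re

def contains_term(text, term, cutoff=0.8):
    """Return True if some window of words in `text` resembles `term` closely enough."""
    words = re.findall(r"\w+", text.lower())
    size = len(term.split())
    wanted = term.lower()
    for i in range(len(words) - size + 1):
        window = " ".join(words[i : i + size])
        if difflib.SequenceMatcher(None, window, wanted).ratio() >= cutoff:
            return True
    return False

def check_terminology(source, target, glossary):
    """List glossary entries whose source term occurs but whose translation is missing."""
    failures = []
    for src_term, tgt_term in glossary.items():
        if contains_term(source, src_term) and not contains_term(target, tgt_term):
            failures.append((src_term, tgt_term))
    return failures

glossary = {"file": "fichier"}
print(check_terminology("Delete the files", "Supprimer les documents", glossary))
```

Here "files" still matches the term "file", and "fichiers" would match "fichier", without any language-specific stemmer. The cutoff trades false positives against missed inflections and would need tuning per language.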

Reference code:

Further reading:

Possible mentors: Alaa, Friedel, Wynand

Improve Translation Memory Fuzzy Matching

Difficulty: Hard

Skills: Python

Description: The Translate Toolkit includes tmserver, a lightweight translation memory web service. Virtaal uses it to display suggestions for translations based on previous translations. tmserver relies on a sqlite3 database and makes use of full text indexing to further narrow down the relevant strings. It then picks the most appropriate matches by using Levenshtein distances.

This approach works well for very close matches where the distance threshold is quite high. If the acceptable distance threshold is lowered the quality of matches decreases dramatically.

This project would involve exploring multiple ideas for improving matches, for example:

  • a different distance algorithm that gives lesser weight to case or punctuation changes.
  • less penalty for word reordering.
  • using toolkit placeables support to lessen impact of variables, xml tags, numbers, dates, urls etc.
  • using stemmers to develop a more linguistically oriented distance function
  • integrating a more advanced full text indexer
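To make the first idea concrete, the sketch below shows a plain Levenshtein distance and one possible way to give case and punctuation changes less weight. It is an illustration only, not tmserver's code: `relaxed_similarity` and its `surface_weight` parameter are hypothetical, and the normalisation only handles ASCII punctuation.

```python
import string

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    """Similarity as a percentage, 100 meaning identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)

_PUNCT = str.maketrans("", "", string.punctuation)

def normalise(text):
    """Drop case and ASCII punctuation so they do not count as edits."""
    return " ".join(text.lower().translate(_PUNCT).split())

def relaxed_similarity(a, b, surface_weight=0.2):
    """Blend strict and normalised similarity so surface changes weigh less."""
    strict = similarity(a, b)
    relaxed = similarity(normalise(a), normalise(b))
    return surface_weight * strict + (1 - surface_weight) * relaxed

print(similarity("Save file...", "save file"))
print(relaxed_similarity("Save file...", "save file"))
```

With these definitions the pair "Save file..." and "save file" scores noticeably higher under the relaxed measure than under the strict one, since only the case and punctuation differ.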

Reference code:

Further reading:

Possible mentors: Alaa, Friedel, Wynand

Advanced spell checking in Virtaal

Difficulty: Medium

Skills: Python, GTK+

Description: Virtaal already provides spell checking by means of Gtkspell. Gtkspell is a bit limited in terms of support for Windows and OSX, and is not really extensible. We would like to provide richer functionality to users, for example spell checking only translatable text and ignoring accelerators.

Your task would be to implement a GUI for spell checking similar to Gtkspell with the same level of functionality as a start (using enchant, supporting the personal word list, providing suggestions on right click). Then we need to add support for ignoring accelerators, and to define regions to be spell checked.

A successful candidate will probably look into the API in the Translate Toolkit for dealing with placeables to ensure that only translatable text is passed to the spell checker.
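The idea of passing only translatable text to the spell checker can be sketched with plain regular expressions. This example does not use the Translate Toolkit placeables API. The `SKIP` patterns, the `&` accelerator marker and the `words_to_check` function are assumptions chosen for illustration.

```python
import re

# Assumed patterns for text that should never reach the spell checker:
# printf-style variables, XML/HTML tags and URLs.
SKIP = re.compile(r"%\d*\$?[sd]|</?\w+[^>]*>|https?://\S+")

def words_to_check(text, accelerator="&"):
    """Return (offset, word) pairs for the words worth spell checking."""
    # Blank out placeables with spaces so offsets of the remaining words stay valid.
    masked = SKIP.sub(lambda m: " " * len(m.group()), text)
    # A word is a run of letters, each optionally preceded by the accelerator marker.
    word = re.compile("(?:" + re.escape(accelerator) + r"?[^\W\d_])+")
    return [(m.start(), m.group().replace(accelerator, ""))
            for m in word.finditer(masked)]

print(words_to_check("&Open %s in <b>edi&tor</b>"))
# [(0, 'Open'), (9, 'in'), (15, 'editor')]
```

The offsets refer to positions in the original string, so a GUI could use them to underline a misspelled word even when it contains an accelerator marker.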

A magnificent success would be integration with the MS Office spell checker over COM (we have python code to do that) and/or integration with the platform spell checker on OSX (Enchant has some initial support for this without build scripts).

Reference code:

Further reading:

Possible mentors: Walter

Improve Pootle's Version Control Support

Difficulty: Medium

Skills: Some Python, some Django. Used more than one SCM/VCS tool before.

Description: Pootle can update translation files and commit new translations using various version control systems (SVN, CVS, Git, etc.). This allows translation coordinators to submit work without having to deal with the complexity of version control, and makes it easier to delegate commit rights to translators without worrying about them touching code or other files. However, version control support has to be set up from the command line, which is a difficult and obscure process.

The project will involve implementing a way to set up version control directly from the web interface using VCS checkout URIs. This should include both anonymous read-only checkouts and authenticated ones where commits are allowed.

Optional Tasks:

  • Visually identify files that changed and need to be committed.
  • Highlight new uncommitted translations
  • Support multiple branches/versions

Reference code:

Further reading:

Possible mentors: Alaa, Friedel

Placeables Support in Pootle

Difficulty: Easy

Skills: Little Python, JavaScript, some jQuery

Description: Placeables are special parts of text that should be copied unchanged when translating. The Translate Toolkit has support for two kinds of placeables. Explicit placeables are encoded as such in the translation file (XLIFF placeables). Discovered placeables are things like XML tags, emails, URLs, numbers, filenames and variables: patterns of text that are found via regular expressions and are unlikely to change on translation.

Pootle lacks any support for placeables. Complex XML/HTML tags, URLs and variables have to be typed manually, which is error prone and might require a keyboard layout switch (slowing down translators). Pootle also offers no way of handling XLIFF placeables (we depend on XLIFF placeables for translating ODF documents).

The project will involve using the Toolkit's placeables support to highlight placeables in the source text (original text), and using JavaScript to insert a placeable's text when it is clicked.

XLIFF placeables cannot be inserted as-is in the text area (they need to be interpreted as XML, not as inline text), and they tend to be ugly, so there is no need to display them in full. They should instead be displayed using some graphics, with a textual token (like 1) inserted in their place and replaced with the proper tags on insert.
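The token substitution described above could work roughly as in the following sketch. The `{1}` token format, the `TAG` pattern and both function names are hypothetical. A real implementation would build on the Toolkit's placeables support and XML parsing rather than a regular expression.

```python
import re

# Naive stand-in for recognising inline XLIFF elements in a string.
TAG = re.compile(r"<[^>]+>")

def tokenise(text):
    """Replace each inline tag with a short token such as {1}; return text and mapping."""
    mapping = {}

    def replace(match):
        token = "{%d}" % (len(mapping) + 1)
        mapping[token] = match.group()
        return token

    return TAG.sub(replace, text), mapping

def restore(text, mapping):
    """Put the original tags back in place of their tokens."""
    for token, tag in mapping.items():
        text = text.replace(token, tag)
    return text

shown, mapping = tokenise('Press <x id="1"/> or <x id="2"/>')
print(shown)                                   # Press {1} or {2}
print(restore("{2} of {1} druk", mapping))     # <x id="2"/> of <x id="1"/> druk
```

The translator only ever sees and moves the short tokens, and the tags are restored when the translation is saved, even if their order changes.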

Optional Task: select and insert placeables using keyboard shortcuts.

Virtaal has very rich placeables support. Play with it to get a sense of what Pootle needs.

Reference code:

Further reading:

Possible mentors: Alaa, Friedel, Julen

Improve Presentation of Quality Checks Errors

Difficulty: Easy

Skills: Python, some Django

Description: Pootle's automated translation quality review is one of its most powerful features. Built on Translate Toolkit's filters it allows translators to step through strings that fail a number of quality checks.

The project will involve redesigning the UI and workflow of quality checks review to introduce a number of improvements:

  • Display quality checks when viewing or editing translations, not just in review mode.
  • Some quality check failures are specific to certain parts of the text (missing xml/html tags for instance). These should be highlighted in the source (original) text and also in the target (translated) text when possible.
  • Automated quality checks are most of the time just a guess. A difference in punctuation between the source and target might be a mistake or a deliberate choice of the translator. Only a human reviewer can tell. But Pootle lacks a way of indicating false positives which makes it difficult to estimate progress in translation review.

Reference code:

Further reading:

Possible mentors: Alaa, Friedel

Translation Memory in Pootle

Difficulty: Medium

Skills: Some Python, little Django, JavaScript, jQuery

Description: Translation Memory is one of the most popular features of CAT tools, but at the moment Pootle's support for TM is quite primitive. Compare with Virtaal, which can get suggestions for translations from a variety of local and remote sources and presents them in an intuitive interactive widget.

The project would involve implementing at least one of these TM sources:

  • directly from Pootle's database
  • Translate Toolkit's TMServer
  • OpenTran.eu

Implement an interactive jQuery based widget for displaying TM suggestions and inserting the selected suggestion.

The TM widget should order suggestions based on how similar the original text is to the text currently being translated. For Virtaal we use Levenshtein distance to measure match quality. Some quality measure will have to be implemented in JavaScript.

If all three sources are implemented and there is time left, you can implement Machine Translation support (Apertium, Google, etc.).

Reference code:

Further reading:

Possible mentors: Alaa, Friedel, Julen

Remote Terminology for Pootle

Difficulty: Easy

Skills: JavaScript, jQuery, little Python, little Django

Description: Pootle has a popular Terminology feature where translations for specific keywords are suggested based on either a site-wide terminology glossary or a project-specific one, but it lacks support for remote terminology.

The Project would involve writing JavaScript to query one or more remote terminology sources (OpenTran.eu is our favorite) and interactively display results.

The project would also involve a redesign of the current terminology suggestions UI, since it cannot fit the large number of suggestions remote glossaries tend to return.

Note: This is a small task and is better paired with another one.

Reference code:

Further reading:

Possible mentors: Alaa, Friedel, Julen

Rich Editing Widget

Difficulty: Hard

Skills: Advanced JavaScript, advanced jQuery, some Python

Description: Pootle's translation form is large and somewhat complex due to the many features it supports. As more features are added (review some of the project ideas above) it might get too cluttered.

One way to avoid the complexity is by implementing a richer text editing widget in JavaScript (think tinymce), that is able to incorporate many of the features through context menus, toolbars and keyboard shortcuts.

Play with Virtaal's editing interface for inspiration on what a simple but powerful translation widget looks like.

Further reading:

Possible mentors: Alaa, Friedel, Julen

Translation Goals and Assignments for Pootle

Difficulty: Medium

Skills: some Python, some Django

Description: Pootle versions prior to 2.0 supported user-defined translation goals, in which certain files could be grouped as a single goal to break down translation work, and files could be assigned to specific users for translation. This feature was lost when Pootle was ported to Django and needs to be reimplemented.

The Project would involve:

  • designing and implementing the database models/relations for specifying file level goals and assignments
  • exploring the possibility of unit level goals and assignments
  • implementing the UI/views for specifying goals, assigning work to users, tracking progress on goals and on users' assigned work, and collecting statistics about users' work.

Possible mentors: Alaa, Friedel, Dwayne

Wiki Style Revisions

Difficulty: Medium

Skills: Python, Django

Description: The classical software localization workflow assumes that only one or two translators will work on a file or set of strings and then maybe one or two more will review their work. But as Pootle is increasingly being used as a social translation tool, a single file or even a single unit might be translated and retranslated by many. It is difficult for reviewers to keep track of activity with such large teams.

Pootle collects useful statistics about user contributions, but these statistics only measure quantity of work, not quality. It is difficult with large groups to build a reputation as a skilled or dedicated translator based on these stats only.

We can take inspiration from wikis here and offer full revision history and more granular tracking of user contributions.

The Project would involve:

  • Designing models for translation revisions
  • UI for displaying translation history with diffs. While we take inspiration from wikis we hope the interface can be much simpler than what a typical wiki offers.
  • Revert action?
  • Views for browsing user activity
  • Calculate a score (karma) for users based on quality of contributions (could be measured through quality checks, how many contributions end up being final revisions, and how large the average diff between their contribution and the final revision is)
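One of the karma measures mentioned in the list, the average difference between a user's contributions and the final revision, might be approximated as in the sketch below. The data layout (a list of `(user, text)` revisions, oldest first) and the function names are assumptions for illustration. No such model exists in Pootle yet.

```python
import difflib

def closeness(contribution, final):
    """Ratio between 0 and 1 of how much of the contribution survives in the final text."""
    return difflib.SequenceMatcher(None, contribution, final).ratio()

def karma(history):
    """Average closeness to the final revision, per user.

    `history` is a list of (user, text) revisions, oldest first;
    the last entry is treated as the final revision.
    """
    final = history[-1][1]
    scores = {}
    for user, text in history:
        scores.setdefault(user, []).append(closeness(text, final))
    return {user: sum(values) / len(values) for user, values in scores.items()}

history = [("ann", "Ope file"), ("bob", "Open the file")]
print(karma(history))
```

In this example the author of the final revision scores 1.0 and the earlier contributor scores lower. A real score would probably combine this with quality check results and the number of accepted revisions.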

Reference code:

Possible mentors: Alaa, Friedel, Julen

External APIs for Pootle

Difficulty: Easy

Skills: Some JavaScript, little Django, little Python

Description: No web app worth its bit size can live without APIs these days. But apart from buzzword compliance we see two areas where Pootle can benefit from external APIs:

Social Translation

This would involve implementing pure JSON views of some basic stats. Then building JavaScript widgets/badges based on this data to embed in blogs and project websites.

  • Badges (to be inserted in blogs) for displaying User stats/ranking/karma can boost interest and fun factor for translators when they
  • Badges for displaying Project translation progress/statistics might inspire more users to participate in translation, per language widgets might boost competition between language groups.
  • Other badges and widgets?
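A pure JSON view of basic statistics could return a document like the one built below. This is a framework-independent sketch: the field names and the `(target, is_fuzzy)` unit representation are hypothetical, and in Pootle the result would be returned from a Django view rather than printed.

```python
import json

def stats_as_json(units):
    """Summarise a list of (target, is_fuzzy) pairs as a JSON document."""
    total = len(units)
    fuzzy = sum(1 for target, is_fuzzy in units if is_fuzzy)
    translated = sum(1 for target, is_fuzzy in units if target and not is_fuzzy)
    stats = {
        "total": total,
        "translated": translated,
        "fuzzy": fuzzy,
        "untranslated": total - translated - fuzzy,
        "percent_translated": round(100 * translated / total) if total else 0,
    }
    return json.dumps(stats, sort_keys=True)

units = [("Fichier", False), ("", False), ("Enregistrer", True), ("Ouvrir", False)]
print(stats_as_json(units))
```

A JavaScript badge embedded in a blog would then only need to fetch this document and render the numbers.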

Integration with desktop translation tools like Virtaal:

This would involve designing a RESTful API that consumes and produces JSON for file level tasks. While full Virtaal integration would be most welcome ;-) it is not required for this task to be completed, simple example CLI tools are sufficient.

  • Browse available files on a remote Pootle server, displaying last updated and translation progress.
  • Download file from a remote Pootle server, upload back to server on save.
  • Import whole project?
  • Other offline tasks?

Possible mentors: Alaa, Friedel, Julen

Improved interactivity for Pootle

Difficulty: Medium

Skills: Some Python, some Django, some JavaScript, jQuery

Description: Pootle has proved to be a simple system popular with many small language teams and for several projects. Online translation has unfortunately always been a bit slow due to network latencies, especially from countries with lesser Internet connectivity. The addition of some clever AJAX to some pages will help make Pootle feel much more interactive, and might even lessen the load on the server a bit.

A start to this project could be to provide Pootle statistics in JSON form so that AJAX code (and other clients) can obtain them from Pootle easily. Pages showing statistics could then test whether stats are available when the page is built. If not, we would rather put in an AJAX call to fetch them later, while ensuring that the page is still sent to the user quite quickly.

The main part of the project will be to allow continuous editing on the translate page of Pootle with AJAX queries helping to keep the data available by sending submissions asynchronously and prefetching data necessary for translating the next units. A proper implementation will have to support all features of the translate page, including terminology and translation memory, translator and developer notes, suggestions, etc.

Reference code:

Further reading:

Possible mentors: Alaa, Friedel, Julen

Segmentation for Virtaal

Difficulty: Hard

Skills: Python, GTK+

Description: Segmentation is the process of taking a block of text and breaking it into segments, such as sentences. While initially this looks simple, you might find problems as soon as you start using non-trivial text. Abbreviations in English could confuse a simple method, for example.
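The abbreviation problem can be seen in a minimal sketch of a sentence segmenter. This is not the posegment implementation. The abbreviation list and the `segment` function are assumptions for illustration, and the approach only suits languages that end sentences with ".", "!" or "?".

```python
import re

# Small illustrative list; real segmentation rules are language specific.
ABBREVIATIONS = {"e.g.", "i.e.", "mr.", "dr.", "etc."}

def segment(text):
    """Split text after sentence-ending punctuation, unless it ends an abbreviation."""
    segments = []
    start = 0
    for match in re.finditer(r"[.!?]+\s+", text):
        candidate = text[start : match.end()].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # not a real sentence boundary, keep scanning
        segments.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

print(segment("Dr. Smith opened the file. It was empty! Save it?"))
# ['Dr. Smith opened the file.', 'It was empty!', 'Save it?']
```

Without the abbreviation check, "Dr." would be split off as a sentence of its own. A sentence that genuinely ends in "etc." would be wrongly joined to the next one, which is the kind of case a user needs to be able to correct by hand.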

The main advantage of segmenting is that it allows us to use translation memory at a sentence level. Thus in a block of text you might have 3 sentences, 1 of which will match 100% while the others might match less and need to be reviewed. If you had not segmented, you would probably not have matched anything.

The Translate Toolkit already has a simple tool for sentence segmentation, called posegment. This will give you some idea of where to start to do the segmentation in different languages. For Virtaal, you would have to use this information to indicate the current segment in the current string and allow a user to interact with it (for example with Ctrl+down and TM lookup).

Your main tasks in this project will be to:

  • Provide a GUI to display the currently active segment
  • Enable some current string level actions to work on a segment level instead and define the user interaction for these cases (like copying source to target, TM lookup and reuse).
  • Allow the user to correct the segmentation where the automatic method went wrong by altering the bounds of the segment as detected by the automatic method.

Reference code:

Further reading:

Possible mentors: Walter

Segmentation Support for the Translate Toolkit

Difficulty: Easy

Skills: Python

Description: Implementation of segmentation support using different standards in the Translate Toolkit for use in the other tools. This project includes:

  • Add proper support for the <seg> tag in TMX and the <sub> tag in XLIFF.
  • Implement the SRX standard that allows segmentation rules to be specified in XML.
  • Use PyICU in the toolkit to allow us to use their segmentation rules (or find some similar established segmentation software, or expand the existing segmentation software in the toolkit)
  • Extend the current TM server and/or API to be more aware of segment issues, and probably to store strings segmented and unsegmented.

Reference code:

Further reading:

Possible mentors: Dwayne

Workflow

Difficulty: Medium

Skills: Python

Description: The XLIFF standard is an XML based standard for localisation. It can store various state information and can be adapted to manage a translation workflow. Furthermore XLIFF can contain suggestions in <alt-trans> tags that could be reviewed in an editor and removed as the unit is updated.

By workflow we mean the simple process that moves from untranslated → translated → reviewed → approved. There are also processes for updating existing translations. These can be more complex where the review is 'authoritative' (the reviewer can make changes) vs. 'non-authoritative' (these are simply suggestions to the translator who then decides if she wishes to fix them).

This work would involve defining the possible states for XLIFF and other storage formats and defining an API that will make it easier for our tools to access and manipulate these in useful workflows for translators. Whereas currently most of our tools enforce a fuzzy/not fuzzy way of thinking about units, we should now have a list of states that are applicable to the format being used.
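The simple linear workflow described earlier could be modelled as a small list of states, as in the sketch below. The `State` names follow the workflow in the text, but the API shape is hypothetical, and the rule that any unit may be reset to untranslated (for example when the source text changes) is an assumption.

```python
from enum import Enum

class State(Enum):
    UNTRANSLATED = 0
    TRANSLATED = 1
    REVIEWED = 2
    APPROVED = 3

def next_state(state):
    """Advance one step along the linear workflow; APPROVED is final."""
    if state is State.APPROVED:
        return state
    return State(state.value + 1)

def can_move(current, new):
    """Allow a single step forward, or a reset to UNTRANSLATED (an assumption)."""
    if new is current:
        return False
    return new is next_state(current) or new is State.UNTRANSLATED

print(next_state(State.TRANSLATED))                 # State.REVIEWED
print(can_move(State.TRANSLATED, State.APPROVED))   # False
```

A format with fewer states, such as PO with its fuzzy flag, would expose a shorter list through the same kind of interface.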

This is not a workflow engine. Our goal is not to make a workflow editor, but to create a set of standard workflows that meet the needs of current translations and exposes the inherent workflow of the file format.

Your main aim is to stay focused on the basics of unit states for the major formats (PO, XLIFF, TS) and deliver a solution that allows basic interaction with it in Virtaal and/or Pootle.

Furthermore you can look at helping users of Virtaal and/or Pootle to deal with suggestions in <alt-trans> tags by cleaning them, or removing them as they are used.

Reference code:

  • phase - is useful to understand some tools used to manage process

Further reading:

Possible mentors: Dwayne

General Improvements (Feature additions) to Pootle

Difficulty: Medium

Skills: Python, Django

Description: While working with Pootle at OLPC, we have come across a number of feature requests, most (if not all) of which can be implemented within the GSoC timeframe. The highest-priority ones are:

  • Support for validation of translated strings on submission (equivalent of msgfmt --check, but for individual strings),
  • Ability for language administrator to get in touch with members of the translation team
  • what else?
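Validation of an individual string on submission could start with a comparison of printf-style format directives between source and translation, which is one of the things msgfmt --check reports for C format strings. The sketch below is a simplified stand-in, not a reimplementation of msgfmt. The `DIRECTIVE` pattern covers only common conversions, and the function name is hypothetical.

```python
import re
from collections import Counter

# Common printf-style directives, with optional position, flags, width and precision.
DIRECTIVE = re.compile(r"%(?:\d+\$)?[-+ #0]*\d*(?:\.\d+)?[diouxXeEfgGcs]")

def format_mismatch(source, target):
    """Return True if source and target do not use the same set of directives."""
    # Remove escaped percent signs first so "%%" is never read as a directive.
    src = Counter(DIRECTIVE.findall(source.replace("%%", "")))
    tgt = Counter(DIRECTIVE.findall(target.replace("%%", "")))
    return src != tgt

print(format_mismatch("%d files deleted", "%d fichiers"))   # False
print(format_mismatch("%s: %d", "%d"))                      # True
```

Comparing counts rather than order lets a translation reorder its directives. A stricter check would also verify that reordered directives use positional forms such as %1$s.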

Reference code:

Further reading:

  • Other reading on the topic that would help

Possible mentors: Sayamindu Dasgupta

Localization of Pootle & Virtaal for Haitian Creole

Difficulty: Medium

Description:

This isn't exactly a “code” project as such, but I think this would be important to do this year. It was very nice to see the machine translation support added for Kreyòl in Virtaal, but the language is not even added to pootle.locamotion.org, as it hasn't ever been considered as a localization target. Machine translation has many limitations - and especially if translations are being done by non-native speakers, it is important to bring in native speakers (who may not have much command of French, let alone English) to improve and correct the inevitable mistakes. Adding the language to the Pootle instance is easily done - more ambitious is the necessary outreach to find native speakers who can perform the localization. That has always been the hard part, but if there is some money available from Google for a student to do this work, that might make the difference in jump-starting this project.

Reference code:

(the status of the latter project is a bit unclear - see related google groups below - although they are also interested in GSOC)

Further reading:

Possible mentors:

Alex Dupuy

Video/Audio Transcription/Proofreading support for Pootle

Difficulty: Hard

Description: Pootle does not yet support transcribing online video and audio of the kind found on YouTube, Vimeo, etc. Currently there are some proprietary online apps that can do this (dotsub.com, subtitle-horse.org, etc.), but nothing open source yet.

A start to this project would be to provide the core support in Pootle for creating a transcription from a video streaming URL. The transcription page should have an embedded Flash player (the JW FLV Player, with its Copyleft license, is recommended) so the user can adjust the timing of the subtitles.

In addition, Pootle should distinguish between transcribers, proofreaders and translators, and give them different permissions. The interface should be able to play the audio/video from the time position of the current phrase.

Extra features:

  • Support for configuring a speech recognition tool or API to automate an initial machine transcription.
  • Support for configuring a multilingual speech synthesis tool (e.g. Festival) to provide audio of the translated subtitles.
  • Modifying the source code of the JW player to provide fast access to the subtitle list for each video (e.g. SubPly player).

Reference code:

Further reading:

Possible mentors:

Project Ideas Template

Difficulty: Easy, Medium, Hard

Description:

Reference code:

Further reading:

  • Other reading on the topic that would help

Possible mentors: