
Mini Projects

Do you feel like taking a sizeable bite into useful functionality for WordForge?

These mini projects are mostly standalone: they do not affect other parts of Pootle or the Translate Toolkit, but they are very useful tools for WordForge. If you would like to hack on one, please contact the team and we will assign the project to you.

Micro-projects

Want something smaller? Try looking at our various TODO lists:

Not all of these are small, and some are invasive, but many just need someone with the will and a bit of time. If you need help choosing something, please ask on the mailing lists or on #pootle.

NOTE: please don't list micro-projects here, rather point to them from here.

Projects

Segmentation

(Note that initial work on this has started. See posegment.)

Both XLIFF and TMX can make use of a standard for segmentation. What is segmentation? It is how you break up paragraphs of text into sentences. The rules differ between languages, and the standard developed by LISA allows you to define segmentation rules for text.

The advantage of segmentation is that it creates better reuse. A paragraph in one text might contain a sentence that exactly matches a sentence in another piece of text, but if we cannot get down to the sentence level we will not see the match. So segmentation makes our Translation Memory much more usable. And if it is part of XLIFF, it makes matching during the actual translation even more useful.

Your job would be to implement the segmentation standard and integrate that into XLIFF and TMX as needed.
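
As a rough illustration of rule-based segmentation, the sketch below applies an ordered list of break and no-break rules, each made of a "before" and an "after" regular expression. The rule set, the function name segment and the sample sentence are all assumptions made for this example; it is not an implementation of the LISA standard and is not taken from posegment.

```python
import re

# Hypothetical rules, checked in order: (pattern before the position,
# pattern after the position, whether to break there). The first rule
# that matches at a position decides the outcome.
RULES = [
    (r"\b(?:Mr|Mrs|Dr|e\.g|i\.e)\.", r"\s", False),  # no break after abbreviations
    (r"[.?!]+", r"\s+", True),                        # break after sentence punctuation
]


def segment(text, rules=RULES):
    """Split text into sentences using ordered break/no-break rules."""
    compiled = [
        (re.compile(r"(?:%s)\Z" % before), re.compile(after), is_break)
        for before, after, is_break in rules
    ]
    segments = []
    start = 0
    for pos in range(1, len(text)):
        for before, after, is_break in compiled:
            # "before" must match text ending at pos, "after" text starting at pos.
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule wins
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


print(segment("Dr. Smith arrived. He was late! Why?"))
```

Because the rules are plain data, a different language would only need a different rule list, which is the property that makes a shared rules format attractive.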

Alignment Tool

You have two pieces of text: the original and the translated text. But you do not have them combined in a bilingual translation file. Perhaps this is old work you inherited from someone else, or you have found a source of good translations and you want to be able to use them in your Translation Memory. You might only have the latest source document and an old translation, so you don't expect them to align completely.

In this case we need an alignment tool. The tool should be able to read the files using our base classes and present the texts side by side, hopefully using the segmentation rules to make good guesses. The role of the user is to validate the alignment and to adjust it if needed.

The end result should be that every text item has been either aligned or rejected.

The program can then output a new bilingual translation file, e.g. XLIFF or PO, or a TMX Translation Memory file.
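
One way the tool could make its initial guesses is sketched below: a small dynamic-programming aligner that pairs segments of similar length and leaves the rest unpaired for the user to review. The cost model (difference in character length, plus a fixed skip_cost) and the function name align are assumptions chosen for brevity, not a description of any existing tool.

```python
def align(source, target, skip_cost=10):
    """Pair source and target segments; unmatched items are paired with None."""
    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    step = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            options = []
            if i and j:
                # Pairing is cheap when the two segments have similar lengths.
                diff = abs(len(source[i - 1]) - len(target[j - 1]))
                options.append((cost[i - 1][j - 1] + diff, "pair"))
            if i:
                options.append((cost[i - 1][j] + skip_cost, "skip_source"))
            if j:
                options.append((cost[i][j - 1] + skip_cost, "skip_target"))
            if options:
                cost[i][j], step[i][j] = min(options)
    # Walk back from the end to recover the cheapest sequence of decisions.
    pairs = []
    i, j = n, m
    while i or j:
        move = step[i][j]
        if move == "pair":
            pairs.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif move == "skip_source":
            pairs.append((source[i - 1], None))
            i -= 1
        else:
            pairs.append((None, target[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


source = ["Hello world.", "This sentence was added later.", "Goodbye."]
target = ["Bonjour le monde.", "Au revoir."]
for src, tgt in align(source, target):
    print(src, "<->", tgt)
```

Items paired with None are the ones the user would be asked to align by hand or reject.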

Glossary Extractor

(note: some initial work has been done on a tool as described here. See poglossary.)

When you translate, you should start with a glossary of terms. Most glossary words occur frequently in a body of text. But you might also have frequently occurring phrases that you would want to translate differently from the single words. The glossaries are then used by translators and reviewers to check translations and to ensure consistency.

The glossary extractor tool would look at a number of source files and extract candidate words and phrases. The user would be able to set the frequency threshold (i.e. how many times a term must occur before we extract it), a list of stop words, the maximum phrase length, etc.

The user should be able to eliminate words, check context in the originating text, pull online definitions and link them to the glossary entry, or add their own clarification notes (this might be the role of a separate glossary editor).

The output would be a TBX file or another file that can be imported into an application to populate the translations of the terms.
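
The extraction step could look something like the following sketch, which counts words and short phrases and keeps those above a frequency threshold. The function name extract_terms, its parameter names and the default stop-word list are assumptions for illustration; this is not how poglossary works, and it ignores sentence boundaries for simplicity.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; a real tool would load one per language.
DEFAULT_STOP_WORDS = frozenset({"the", "a", "an", "of", "to", "and", "is"})


def extract_terms(texts, min_freq=2, max_len=3, stop_words=DEFAULT_STOP_WORDS):
    """Return (term, count) candidates, most frequent first."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[^\W\d_]+", text.lower())
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                # Skip phrases that start or end with a stop word.
                if gram[0] in stop_words or gram[-1] in stop_words:
                    continue
                counts[" ".join(gram)] += 1
    return [(term, c) for term, c in counts.most_common() if c >= min_freq]


texts = ["Open the file menu. The file menu is open.", "Save the file."]
for term, count in extract_terms(texts):
    print(count, term)
```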

Glossary Populator

Or glossary guesser. Use statistical techniques to take an empty glossary file and, using your existing translations, try to guess how each glossary word has been translated.

The simple case is where the word appears as a single-word entry in the translations. The harder case is where the word occurs within a sentence or paragraph.
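
For the harder case, one simple statistical technique is co-occurrence scoring: look at every translation whose source contains the term and pick the target word that appears with it most consistently. The sketch below uses the Dice coefficient for that score; the function name guess_translation, the whitespace tokenisation and the sample data are assumptions made for this example.

```python
from collections import Counter


def guess_translation(term, pairs):
    """Guess the target word that best co-occurs with a source term.

    pairs is a list of (source_text, target_text) tuples taken from
    existing translations. Returns None if the term never occurs.
    """
    term = term.lower()
    with_term = Counter()  # target words seen alongside the term
    overall = Counter()    # target words seen anywhere
    n_term = 0
    for src, tgt in pairs:
        tgt_words = set(tgt.lower().split())
        overall.update(tgt_words)
        if term in src.lower().split():
            n_term += 1
            with_term.update(tgt_words)
    if not n_term:
        return None
    # Dice coefficient: 2 * co-occurrences / (term count + word count).
    return max(with_term, key=lambda w: 2 * with_term[w] / (n_term + overall[w]))


pairs = [
    ("open file", "ouvrir fichier"),
    ("save file", "enregistrer fichier"),
    ("open page", "ouvrir page"),
]
print(guess_translation("file", pairs))
```

A guess like this is only a candidate, so a translator would still need to confirm it before it goes into the glossary.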

Converters

The toolkit provides a framework that allows you to define a storage format (e.g. Gettext PO, .properties, etc.) and lets a converter migrate translations between those formats and the base formats (XLIFF, PO). The following are formats that would be useful to add to the Translate Toolkit. They are in no particular order, but we have limited them to the ones that we regard as most useful:

  • OpenDocument Format (ODF) - Document format used by OpenOffice.org and friends
  • PDF
  • Microsoft Word
  • man pages (covered by po4a which can integrate into Pootle but useful to have a native format)
  • DocBook (xml2pot does this but useful to have a native format)
  • Graphics (.png, .gif, etc) investigate localisation of meta data
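
To give a feel for what a converter involves, here is a deliberately simplified, standalone sketch that reads key=value lines in the style of a .properties file and writes PO-style template entries. It does not use the Translate Toolkit's storage classes, and the function names are hypothetical; it also ignores line continuations, ":" separators and escape sequences that a full parser for that format would have to handle.

```python
def parse_properties(text):
    """Return (key, value) pairs from simple key=value lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue  # blank line or comment
        key, sep, value = line.partition("=")
        if sep:
            entries.append((key.strip(), value.strip()))
    return entries


def po_quote(string):
    """Quote a string for use in a PO file."""
    return '"' + string.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_pot(entries):
    """Render entries as untranslated PO units, keeping the key as a location."""
    blocks = []
    for key, value in entries:
        blocks.append('#: %s\nmsgid %s\nmsgstr ""\n' % (key, po_quote(value)))
    return "\n".join(blocks)


sample = '# demo\ngreeting = Hello\nquit=Quit "now"\n'
print(to_pot(parse_properties(sample)))
```

A converter for any of the formats listed above follows the same shape: parse the native file into units, then hand those units to a base format.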

Content Management System Integration

Take your favourite CMS and integrate localisation of the content (not the UI) by using Pootle.

XLIFF in Gettext

Gettext is the home of PO format. It would be good if the Gettext tools could also do XLIFF. These are areas that need to be modified to allow full use of XLIFF. They are in the order of most important to least important.

  • msgfmt - Extend the PO compiler msgfmt to compile XLIFF files into MO files. As a first step there is a Python compiler in the Translate Toolkit that could be extended
  • xgettext & msgmerge - Allow us to extract and create XLIFF template files and merge them with existing files
  • intltool - Although not pure Gettext, these tools and methods need to be adapted to allow GNOME projects to use the Gettext tools for XLIFF or PO
  • Align the new msgctxt facility in gettext-0.15 with the context specifiers in XLIFF.
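
The first step of compiling XLIFF is collecting the source and target strings that would go into the MO file. The sketch below does only that part, using Python's standard XML parser; it assumes the XLIFF 1.2 namespace, the function name xliff_catalog is made up for this example, and it neither writes MO files nor reflects the existing Python compiler in the Translate Toolkit.

```python
import xml.etree.ElementTree as ET

# Assumption: the input uses the XLIFF 1.2 namespace.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def xliff_catalog(xliff_text):
    """Return a {source: target} dict of the translated units."""
    root = ET.fromstring(xliff_text)
    catalog = {}
    for unit in root.iterfind(".//x:trans-unit", XLIFF_NS):
        source = unit.find("x:source", XLIFF_NS)
        target = unit.find("x:target", XLIFF_NS)
        # Untranslated units are left out, as they would be in an MO file.
        if source is None or target is None or not target.text:
            continue
        catalog[source.text or ""] = target.text
    return catalog


SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="demo.po" source-language="en" target-language="fr" datatype="po">
    <body>
      <trans-unit id="1"><source>File</source><target>Fichier</target></trans-unit>
      <trans-unit id="2"><source>Edit</source><target/></trans-unit>
    </body>
  </file>
</xliff>"""

print(xliff_catalog(SAMPLE))
```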

Content Negotiation (Interface)

Most wikis, CMSs and general websites DO NOT do proper content negotiation. In this mini-project we are not concerned with the actual content but simply with the interface. It would be nice if, for instance, MediaWiki's interface defaulted to the user's preferred language when they view the site. Most of these systems allow people to specify their language when they sign in, but that is not enough.

This project would look at a few things, such as:

  • Create a library or function for PHP and PHP development frameworks that makes it easy to correctly set the language based on the user's preferred language setting, while also allowing cookie-based, dropdown-selected languages to work in conjunction with it.
  • Implement this in some key Wikis: MediaWiki, DokuWiki, Tikiwiki, etc
  • Document clearly how to use this
  • Do similar things for frameworks written in other languages: Python, etc.
  • Audit free software implementations of content negotiation for correctness: Konqueror, Firefox, Galeon, etc
  • Write a generic algorithm for the docs which shows how you should implement the negotiation to work correctly with Apache and others
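
The negotiation itself can be sketched in a few lines. The example below is written in Python rather than PHP, and the function names, the cookie_lang parameter and the fallback to "en" are assumptions for illustration. It lets an explicit cookie choice win, then walks the browser's Accept-Language preferences in order of their q values, falling back from a regional tag such as af-ZA to its primary language.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, best first."""
    prefs = []
    for index, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        quality = 1.0  # default when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:  # q=0 means "not acceptable"
            prefs.append((quality, index, tag))
    # Highest quality first; header order breaks ties.
    prefs.sort(key=lambda p: (-p[0], p[1]))
    return [tag for _, _, tag in prefs]


def choose_language(accept_header, supported, cookie_lang=None, default="en"):
    """Pick the interface language for a request."""
    by_lower = {lang.lower(): lang for lang in supported}
    # An explicit choice stored in a cookie overrides the browser setting.
    if cookie_lang and cookie_lang.lower() in by_lower:
        return by_lower[cookie_lang.lower()]
    for tag in parse_accept_language(accept_header or ""):
        if tag in by_lower:
            return by_lower[tag]
        primary = tag.split("-")[0]
        if primary in by_lower:
            return by_lower[primary]
    return default


print(choose_language("af-ZA,af;q=0.9,en;q=0.5", ["en", "af", "fr"]))
```

Keeping the cookie check separate from the header parsing means a language dropdown and the browser preference can coexist without either one hiding the other.
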
developers/mini_projects.txt · Last modified: 2008/03/17 10:02 by dwaynebailey