Glossaries are one of the best aids a free software translator has, and perhaps one of the least used. Glossaries improve translation consistency, reduce translation errors, and so on. But they have a lot of problems:
But all the problems above can be solved by solving these two key problems:
In a few words, if translators can use CAT tools that fully support the TBX format, and can use a terminology discussion system that produces TBX suitable for those CAT tools, the problem is solved. If CAT tools have good TBX support, we can start using the TBX format. If we can use TBX, we can build a tool for discussing terminology (a TMS) that outputs TBX suitable for CAT tools, and we can stop using mailing lists or wikis for discussing terminology. Having both CAT tools with good TBX support and a TMS that outputs TBX boosts TBX usage…
Improving TBX support in CAT tools goes beyond the scope of this paper, so we won't say much about it. Some tools like Lokalize, OmegaT, Pootle or Virtaal already have basic TBX support, but others like Gtranslator or Poedit don't. Some tools have interesting features, such as terminology extraction from a translation compendium (Pootle), guessing the location of the TBX glossary for the current language and downloading it from the Internet (Virtaal), and TBX file editing (better implemented in Lokalize than in Virtaal, since it opens a completely separate window, which does not lead translators to treat terminology like regular translations).
Within the Translate Toolkit ecosystem, we first have to improve the TBX representation in the Translate Toolkit storage library. After that we can tackle the two problems above. For a complete description of the TBX requirements, see http://translate.sourceforge.net/wiki/toolkit/tbx#tbx_requirements_by_galician_translation_team_proxecto_trasno (example code included).
Terminology discussion software (a TMS) can be useful for:
Now we can proceed with the description of the TMS (terminology management system). Since this system is meant to replace the current terminology discussion systems, it should include all the features of those systems. The data model explained in http://translate.sourceforge.net/wiki/toolkit/tbx#tbx_requirements_by_galician_translation_team_proxecto_trasno already covers most of the data that the TMS has to handle. We only have to make some additions to it: for example, users, discussions per language and per concept, and so on. Let's go through the complete list of data that it should be able to handle.
The terminology management system (TMS) needs to handle several glossaries. Each glossary can have several concepts, and each concept can have several definitions (only one definition per language in a given concept) and several translations (several translations per language in a given concept). Concepts also have associated links to further information (several links per language in a given concept). Several languages also need to be defined. Now that we have a list of all the needed entities, let's go through the attributes of each of those entities:
Each glossary has a name and a description.
Each concept has a unique id and a subject field (which is another concept in the same glossary). It can have several concepts that people may wish to see (let's call them related concepts), and it can also have a parent concept (broader concept).
Each link has a type (image, Wikipedia page…), the address of the link, and a short description.
Each definition has a definition text.
We want to save the ISO 639 code of each language.
Each translation has a translation text, a unique id, the part of speech, the grammatical gender (if applicable), the grammatical number (if applicable), a field that indicates whether the translation is an abbreviation or an acronym, an explanatory note, examples of use (created by the people who make the terminology), links to examples of real use (a corpus or translation database), and a field that indicates whether the translation is complete or still incomplete (completion status). We also need to save the translation's administrative status (whether it is a recommended translation, a not recommended (deprecated) one, or a forbidden one) and the reason why the translation has its current administrative status (a simple text string), which only applies when the administrative status is other than “recommended”.
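The entities and attributes listed so far could be sketched as plain Python dataclasses. This is only an illustration: every class and field name below is an assumption made for the example, not something taken from the TBX requirements page or from existing code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Link:
    type: str            # e.g. "image" or "Wikipedia page"
    address: str
    description: str = ""


@dataclass
class Translation:
    id: str
    text: str
    part_of_speech: str = ""
    gender: Optional[str] = None         # only if applicable
    number: Optional[str] = None         # only if applicable
    term_type: Optional[str] = None      # "abbreviation", "acronym" or None
    note: str = ""
    usage_examples: List[str] = field(default_factory=list)
    real_use_links: List[str] = field(default_factory=list)
    complete: bool = False
    admin_status: str = "recommended"    # or "deprecated" / "forbidden"
    admin_status_reason: str = ""

    def __post_init__(self):
        # The reason only applies when the status is not "recommended".
        if self.admin_status == "recommended" and self.admin_status_reason:
            raise ValueError("a recommended translation takes no status reason")


@dataclass
class Concept:
    id: str
    subject_field: Optional[str] = None  # id of another concept in the glossary
    broader: Optional[str] = None        # id of the parent concept
    related: List[str] = field(default_factory=list)
    # The three mappings below are keyed by ISO 639 language code.
    definitions: Dict[str, str] = field(default_factory=dict)
    translations: Dict[str, List[Translation]] = field(default_factory=dict)
    links: Dict[str, List[Link]] = field(default_factory=dict)


@dataclass
class Glossary:
    name: str
    description: str = ""
    concepts: Dict[str, Concept] = field(default_factory=dict)
```

Keeping definitions in a mapping from language code to text gives the "only one definition per language" rule for free, while translations and links map each language code to a list.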
Now we have to add some new things.
For each language we want to save its language name in the language itself, and a language description.
We want to handle users. For each user we want a login name, a password, and the interface language (we aren't going to save a list of languages in which the user can participate).
We want to save the role of each user in each glossary. The list of roles can be: administrator (can do everything; this role is not tied to a particular glossary), anonymous (like nobody in Pootle), and user without rights in a particular glossary (can view but not change anything). We also need at least one role for people who own a glossary and can therefore edit or delete data in it (concepts, translations, definitions…). In the Galician team we discussed this issue and concluded that there should be two different roles: one for people who can add data to the glossary, and another that allows setting relations between concepts (broader concept, related concepts). The first role is the “terminologist” and the second one is the “lexicographer”. It is important to note that the “lexicographer” can do everything the “terminologist” can do, but can also set the relations, own glossaries, delete any kind of info within a glossary they own, etc. For free software terminology it may not be important to have the “terminologist” role, but it can be important for more serious environments. I will put a permissions table below. Since it is difficult to differentiate between the roles, we want to save their names and descriptions.
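The role rules described in this paragraph can be sketched as a small permission check. The action names and the exact mapping are assumptions made for illustration; this is not the permissions table mentioned in the text, only one possible reading of the prose.

```python
# Hypothetical action names; the text does not define them.
VIEW = {"view"}
ADD = VIEW | {"add", "edit"}
FULL = ADD | {"set_relations", "delete", "own_glossary"}

# Per-glossary roles. The administrator is handled separately because
# that role is not tied to a particular glossary.
PERMISSIONS = {
    "anonymous": VIEW,
    "user": VIEW,
    "terminologist": ADD,
    "lexicographer": FULL,
}


def can(role, action, owns_glossary=False):
    """Return True if `role` may perform `action` in a glossary."""
    if role == "administrator":
        return True
    if action == "delete":
        # A lexicographer may only delete within a glossary they own.
        return role == "lexicographer" and owns_glossary
    return action in PERMISSIONS.get(role, set())
```

For example, `can("terminologist", "add")` is true, while `can("terminologist", "set_relations")` is false.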
We want to save the discussions for a given concept (only one discussion per language). A discussion is a list of messages like in a forum (each discussion is a thread). Each message has a concept, a language, the user who wrote it, a date, a message text and the id of its parent message in the discussion.
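A minimal sketch of such a message record, and of how the parent id can be used to rebuild the thread for one concept and language, might look like this. The names `Message` and `discussion` are hypothetical, and ordering replies oldest first is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class Message:
    id: int
    concept_id: str
    language: str
    user: str
    date: datetime
    text: str
    parent_id: Optional[int] = None   # None for a top-level message


def discussion(messages: List[Message], concept_id: str,
               language: str) -> List[Tuple[int, Message]]:
    """Return (depth, message) pairs for one concept and language,
    in thread order, with older messages first at each level."""
    children = {}
    selected = [m for m in messages
                if m.concept_id == concept_id and m.language == language]
    for m in sorted(selected, key=lambda m: m.date):
        children.setdefault(m.parent_id, []).append(m)

    ordered = []

    def walk(parent_id, depth):
        for m in children.get(parent_id, []):
            ordered.append((depth, m))
            walk(m.id, depth + 1)

    walk(None, 0)
    return ordered
```

The depth value can be used by the interface to indent replies under the message they answer.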
We also want to have a history for each of the entities handled in the TMS: concepts, glossaries (for example creation, modification, users added), definitions, translations, languages, and external links (e.g. to Wikipedia). This is meant to provide a change list like the one in wikis, or something similar. For each change entry, for all the entities listed, we want to save its date, the user responsible for the change, and a description. In the Galician team we talked about also saving the previous state, so that it can be recovered after vandalism or mistaken changes.
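One way to picture the change history with a saved previous state is the in-memory sketch below. The `Store` and `Change` names are hypothetical, and a real TMS would keep this in the database; the point is only that each change entry carries enough to undo it.

```python
import copy
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]   # (entity kind, entity id), e.g. ("translation", "t1")


@dataclass
class Change:
    key: Key
    user: str
    date: datetime
    description: str
    previous_state: Optional[dict]   # None when the entity did not exist yet


class Store:
    def __init__(self):
        self.data: Dict[Key, dict] = {}
        self.history: List[Change] = []

    def _log(self, key, user, description, previous):
        now = datetime.now(timezone.utc)
        self.history.append(Change(key, user, now, description, previous))

    def save(self, key, state, user, description):
        """Store a new state and log the one it replaces."""
        self._log(key, user, description, copy.deepcopy(self.data.get(key)))
        self.data[key] = copy.deepcopy(state)

    def revert(self, key, user):
        """Undo the most recent change to `key`, logging the revert too."""
        last = next((c for c in reversed(self.history) if c.key == key), None)
        if last is None:
            raise KeyError(key)
        if last.previous_state is None:
            # The last change created the entity, so undoing removes it.
            self._log(key, user, "revert: " + last.description,
                      self.data.pop(key))
        else:
            self.save(key, last.previous_state, user,
                      "revert: " + last.description)
```

Because a revert is logged like any other change, the change list stays complete and a mistaken revert can itself be undone.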
Since some teams discuss terminology on mailing lists, they may want the new system to send them the discussion messages by mail, perhaps to keep archiving them as before. One good approach is to send all new discussion messages for a given language to the mailing list that the team previously used for the discussion. The other approach is to send the messages to everyone registered in the TMS for a given language, but not all team members may be registered in the TMS. A small problem with the mailing list approach is that users may want to reply to a message received via the mailing list; such a reply won't reach the TMS and therefore won't be stored with the other messages in the discussion. This is a minor problem. To save the mailing list address, we can add it as another language attribute.
Now let's go over some ideas for the interface.
One good idea is to have a breadcrumb trail at the top of each page so that users can easily tell where they are. On the main page we can put a “last changes” section, a search form, and a list of glossaries with a link beside each one for exporting the glossary as TBX. Of course we also need links to a “custom export” page, “advanced search”, the “admin panel” and so on. These last links could appear on all public pages.
If you click on a glossary, you go to a page that shows its contributors, its owner, its name, its description, the number of concepts and translations, and maybe other information. It is very important to place here a list of all the glossary's concepts so that they can be visited. If the “broader concept” is set, the list could be a tree of concepts to represent that relationship.
If you click on a concept, you should see the concept in your interface language. This language determines which definition is shown (a list of all languages with a definition is always available so that the user can change the language in which the concept definition is viewed). The interface language is also important for showing other info like the “broader concept” or the “related concepts”, since it is easier to understand which concepts are related to a given one by seeing one of their translations. The translation shown for the broader concept or the related concepts should be the first recommended one for the language in which we are currently viewing the concept (by default the user interface language).
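The rule for picking the label of a broader or related concept can be sketched as a short function. The dictionary keys and the fallback value are assumptions; the text does not say what to show when a concept has no recommended translation in the viewing language.

```python
def display_name(translations, language, fallback):
    """Return the first recommended translation in `language`.

    `translations` is a list of dicts with the hypothetical keys
    "language", "text" and "admin_status", in stored order.
    `fallback` (for example the concept id) is returned when the
    concept has no recommended translation in that language.
    """
    for translation in translations:
        if (translation["language"] == language
                and translation["admin_status"] == "recommended"):
            return translation["text"]
    return fallback
```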
Next on the concept page comes the list of all the external links, with their addresses and the language of each resource. After that comes the list of all the concept's translations with their translation text, language, examples of use, real use links, part of speech, grammatical gender, grammatical number, completion status, administrative status, administrative status reason, etc.
Last comes the discussion for the current concept in the language in which we are currently viewing it (by default the user interface language). A list of all languages that have a discussion should be provided so that the user can view the discussion in another language. It is useful to allow hiding or showing the discussion, because sometimes it has a lot of messages and the user may only want to see the data and not the discussion that led to it. Each discussion message should show its author, text and date, and messages should be ordered with the most recent at the bottom.
If the user has rights to add new info or modify existing info, then (s)he should see links to add or edit content in the relevant section of the concept page. If the user has rights to delete content, or to add or modify the “broader concept” or “related concepts” info, then (s)he should see links for performing those actions in the relevant sections of the concept page.
Another important thing is the change history for each element on the concept page. Changes to the “broader concept” or “related concepts” count as changes to the concept. There is also a page for definition changes, another for external link changes and another for translation changes. We are thinking of giving the history pages a wiki-like format for listing the changes, and we have even considered allowing previous versions to be recovered. Recovery can only be done by a privileged user (maybe the lexicographer). Links to all the history pages should be included in the relevant sections, preferably beside the element whose history they provide.
When you remove any info, you should first be asked whether you are sure you want to remove it. This confirmation page should have some visual indication that this is a risky operation (you can lose data), maybe by showing the page in red. The page should show at least some of the info you are about to remove, and the remove button should be at the bottom to force the user to see all the info that is going to be removed. Remember that only users with admin or lexicographer privileges can remove glossaries or info within a glossary. Info other than glossaries (see “auxiliary management” below) can only be removed by admin users.
When you remove something like a translation, all its information should be shown, including all its real use links and examples of use, of which a single translation can have several. This can be a lot of information, so in extreme cases it could be useful to limit the info shown in order not to overload the database engine. The same applies to concepts (translations with all their information, definitions, broader concept, related concepts…). With other elements like definitions or auxiliary management info there is no problem, because they carry little information. In theory, we do not expect many removals.
For editing we have two options: inline editing on the concept page, or a separate private section for editing (like the one Django provides). I think inline editing is faster, but it can lead to confusion, so I prefer the second option.
Let's talk about managing the list of languages, the list of parts of speech, the list of grammatical genders, the list of grammatical numbers, and the list of administrative status reasons. We can call this “auxiliary management”, since this data is secondary to the terminology data handled by the TMS. We are not including the list of roles here, since it will be hardcoded in the app. The administrative statuses will also be hardcoded. Glossaries could perhaps be put here too, but remember that each glossary creator can modify their own glossary. Languages are easy; nothing unusual there. All changes in auxiliary management can only be made by the TMS admin.
The problems start with the parts of speech, since they may differ from one language to another, and we don't want to show the user a list of several hundred parts of speech when (s)he is entering one. If we can make a list of parts of speech for each language, then when the user fills in the language the part of speech list can be reloaded with the values for that language. This problem is not yet resolved for other languages.
We may have the same problem with grammatical gender. The ones I know are feminine, masculine and neuter, but some languages may use other options…
With grammatical numbers we face a known problem. In gettext we sometimes have to include the plural forms line in the file header. There is a list of plural forms for each language at http://translate.sourceforge.net/wiki/l10n/pluralforms that shows the number of plural forms, but not their names, which is what TBX uses. For languages like Chinese, with only one plural form, there is no problem, since simply not specifying anything solves it. For languages like Spanish or French, with the same plural forms as English, there is no problem either. The problem is with languages like Gaelic, with five plural forms. In any case, if the grammatical number is not specified, we assume that it is “singular”.
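The two rules in this paragraph can be sketched as follows: reading the number of plural forms from a gettext `Plural-Forms` header value, and falling back to “singular” when no grammatical number is given. The function names are hypothetical, and the sketch deliberately does not try to name the forms, since that is the open problem described above.

```python
import re


def plural_form_count(plural_forms_header):
    """Return nplurals from a gettext Plural-Forms header value,
    or None when the header does not declare it."""
    match = re.search(r"nplurals\s*=\s*(\d+)", plural_forms_header)
    return int(match.group(1)) if match else None


def effective_number(number=None):
    """Apply the default: an unspecified grammatical number is singular."""
    return number if number else "singular"
```

For instance, `plural_form_count("nplurals=2; plural=(n != 1);")` yields the count declared in the header, which tells the TMS how many forms a language has but not what to call them.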
The administrative status reasons may differ from one language to another. In Galician we have at least six or seven reasons: lusism, anglicism, galicism… This problem is not yet resolved for other languages.
The system should be able to search for a word or phrase in all the translations and provide a list of results, maybe in a Google-like form (see Glósima for a working implementation). It would be interesting to offer the option to search only translations in a given language, to expand the search to other elements like definitions, or to restrict the search to a subset of glossaries instead of searching all of them. The search results should show both exact and fuzzy matches, but it could be useful to separate them (first the exact ones and then the fuzzy ones).
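A minimal sketch of a search that separates exact from fuzzy matches, using the similarity ratio from Python's standard `difflib` module, is shown below. The function name, the dictionary keys and the 0.75 cutoff are assumptions chosen for the example; this is not how Glósima or any existing tool implements it.

```python
import difflib


def search(query, translations, language=None, cutoff=0.75):
    """Return (exact, fuzzy) lists of matching translations.

    `translations` is a list of dicts with the hypothetical keys
    "language" and "text". Fuzzy results are sorted best match first.
    """
    wanted = query.casefold()
    exact, scored = [], []
    for translation in translations:
        if language is not None and translation["language"] != language:
            continue
        text = translation["text"].casefold()
        if text == wanted:
            exact.append(translation)
            continue
        ratio = difflib.SequenceMatcher(None, wanted, text).ratio()
        if ratio >= cutoff:
            scored.append((ratio, translation))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return exact, [translation for _, translation in scored]
```

Returning two lists lets the results page print the exact matches first and the fuzzy ones after them. Comparing every stored translation against the query is fine for a sketch but would need an index for large glossaries.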
The whole system is useless to us if we cannot export the results to a useful terminology interchange format (TBX), or even to another format like HTML or PDF. Remember that the two keys of this new tool are the integrated discussion system and the TBX exporting capabilities.
A basic implementation is to allow exporting each glossary to TBX (a link beside its name in the glossary list). Beyond that, it would be very useful to allow exporting a single TBX file compiled from several glossaries. We may call this “custom export”. Other useful options for custom export are selecting the languages to export, and choosing whether to export incomplete translations, deprecated translations or forbidden translations. Since CAT tools are very unlikely to support things like external links, examples of use, links to real use examples, related concepts or the broader concept, it could be useful to allow leaving this information out, producing more compact TBX files that CAT tools can load in less time and that are easier to exchange.
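The custom export options described above can be sketched with the standard `xml.etree.ElementTree` module. The output is a simplified TBX-like skeleton (`martif`, `termEntry`, `langSet`, `tig`, `term`) without the header a valid TBX file requires, so it illustrates the filtering rather than a conforming exporter. The function name and the dictionary keys are assumptions.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def export_tbx(concepts, languages=None, include_incomplete=False,
               include_deprecated=False, include_forbidden=False):
    """Build a compact TBX-like document from a list of concept dicts.

    Each concept has an "id" and a list of "translations"; each
    translation has "language", "text", "complete" and "admin_status".
    """
    martif = ET.Element("martif", {"type": "TBX", XML_LANG: "en"})
    body = ET.SubElement(ET.SubElement(martif, "text"), "body")
    for concept in concepts:
        by_language = {}
        for t in concept["translations"]:
            if languages is not None and t["language"] not in languages:
                continue
            if not t["complete"] and not include_incomplete:
                continue
            if t["admin_status"] == "deprecated" and not include_deprecated:
                continue
            if t["admin_status"] == "forbidden" and not include_forbidden:
                continue
            by_language.setdefault(t["language"], []).append(t)
        if not by_language:
            continue   # nothing left to export for this concept
        entry = ET.SubElement(body, "termEntry", {"id": concept["id"]})
        for language in sorted(by_language):
            lang_set = ET.SubElement(entry, "langSet", {XML_LANG: language})
            for t in by_language[language]:
                tig = ET.SubElement(lang_set, "tig")
                ET.SubElement(tig, "term").text = t["text"]
    return ET.tostring(martif, encoding="unicode")
```

Only the term text is written, which matches the idea of leaving out links, examples and concept relations to keep the file small for CAT tools.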
Since autoterm needs a reference glossary per language, the TMS can provide a virtual glossary that is created in real time by putting together some or all of the glossaries for a given language (http://translate.sourceforge.net/wiki/virtaal/autoterm). If we do that, we have to save some kind of configuration for this particular case of custom export, since the consumer is a CAT tool and not a person, so it won't fill in a form to customize the export. That configuration can be per language.
Maybe the terminology extraction feature included in Pootle would be better placed in the TMS. Since I don't know much about it, I can't provide ideas right now.
Due to the Django bug http://code.djangoproject.com/ticket/373, this app can't be written using Django without some hacks. I made an entity-relationship diagram for the TMS database, and when I converted it to the relational model, the definition entity became a table with a primary key composed of two foreign keys, one pointing to the concept table and the other pointing to the language table. I don't know how to represent this using Django. I hope you do.
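To make the problem concrete, the sketch below uses Python's standard `sqlite3` module (not Django) to create the definition table both ways: with the composite primary key from the diagram, and with a surrogate key plus a uniqueness constraint over the same two columns. The second layout is a commonly used workaround, and it is the kind of constraint Django's `unique_together` model option expresses. Table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE concept (id TEXT PRIMARY KEY);
CREATE TABLE language (code TEXT PRIMARY KEY);

-- Layout from the diagram: the key is the pair of foreign keys.
CREATE TABLE definition (
    concept_id    TEXT NOT NULL REFERENCES concept(id),
    language_code TEXT NOT NULL REFERENCES language(code),
    text          TEXT NOT NULL,
    PRIMARY KEY (concept_id, language_code)
);

-- Workaround layout: surrogate key plus a uniqueness constraint.
CREATE TABLE definition_surrogate (
    id            INTEGER PRIMARY KEY,
    concept_id    TEXT NOT NULL REFERENCES concept(id),
    language_code TEXT NOT NULL REFERENCES language(code),
    text          TEXT NOT NULL,
    UNIQUE (concept_id, language_code)
);
""")


def add_definition(table, concept_id, language_code, text):
    """Insert a definition; `table` must be one of the two tables above."""
    if table not in ("definition", "definition_surrogate"):
        raise ValueError(table)
    conn.execute(
        f"INSERT INTO {table} (concept_id, language_code, text) "
        "VALUES (?, ?, ?)",
        (concept_id, language_code, text),
    )
```

Both layouts enforce "only one definition per language in a given concept": a second insert for the same concept and language raises `sqlite3.IntegrityError` in either table.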