Translate Toolkit & Pootle

Tools to help you make your software local


Process information in XLIFF files

XLIFF is a standard file format specifically developed by the OASIS consortium to store translation and localisation information. Among many other bits of information, the XLIFF format permits storing information about the phases that the file has gone through. These phases are stored at the beginning of the file (in the <header>), and could be maintained throughout the whole life of the file.

Information stored in an XLIFF file might refer to the whole file (and in this case it will be placed in the <header> of the file) or to one of the translations, in which case it will be included inside the corresponding <trans-unit>.

Structure inside the XLIFF file (what the standard says)

The XLIFF specifications permit the definition of a <phase-group> element inside the <header>. This element might contain as many <phase> elements as needed. In particular, the specifications say the following:

The information must be placed in a <phase-group> element inside the <header>. Inside the <phase-group> each bit of information must be included in a <phase> element.

The <phase> contains metadata about the tasks performed in a particular process. The optional phase-name attribute uniquely identifies the phase for reference within the file. The required process-name attribute identifies the kind of process the phase corresponds to; e.g. “proofreading”. The optional company-name attribute identifies the company performing the task. The optional tool-id attribute references the <tool> used in performing the task. The optional date attribute provides a timestamp indicating when the task was performed. The optional job-id attribute allows an ID to be assigned to the job. The optional contact-name, contact-email, and contact-phone attributes all refer to the person performing the task.

Required attributes:

phase-name, process-name. [ERRATA: phase-name is considered optional in the text above, but mandatory here]

Optional attributes:

company-name, tool, tool-id, date, job-id, contact-name, contact-email, contact-phone.

Contents:

Zero, one or more <note> elements.

Structure:

<xliff>1
|
+- <file>+
   |
   +--- <header>?
   |    |
   |    +--- <skl>?
   |    |    |
   |    |    +--- (<internal-file> | <external-file>)1
   |    |
   |    +--- <phase-group>?
   |    |    |
   |    |    +--- <phase>+
   |    |         |
   |    |         +--- <note>*

(legend: 1 = one
       + = one or more
       ? = zero or one
       * = zero, one or more)

The standard does not specify in which order the phases must be stored in the phase group. For performance purposes, it would be most interesting to ensure that the phases are in chronological order. Inverse chronological order (current phase is first in the list) would improve performance even further.
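The ordering idea above can be sketched as follows. This is only an illustration using Python's standard `xml.etree.ElementTree`, not Translate Toolkit code; the phase names, the helper names `phase_date` and `sort_newest_first`, and the omission of the XLIFF namespace are all assumptions made for brevity.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical header fragment; the XLIFF namespace is left out for brevity.
HEADER = """
<header>
  <phase-group>
    <phase phase-name="translation1" process-name="translation" date="2006-01-25T21:06:00Z"/>
    <phase phase-name="review1" process-name="review" date="2006-02-03T09:30:00Z"/>
    <phase phase-name="upgrade1" process-name="upgrade" date="2006-01-10T12:00:00Z"/>
  </phase-group>
</header>
"""

def phase_date(phase):
    """Parse the date attribute (UTC timestamps ending in 'Z' are assumed)."""
    return datetime.strptime(phase.get("date"), "%Y-%m-%dT%H:%M:%SZ")

def sort_newest_first(header):
    """Reorder the <phase> children of <phase-group> in place, newest first."""
    group = header.find("phase-group")
    phases = sorted(group.findall("phase"), key=phase_date, reverse=True)
    for phase in list(group):
        group.remove(phase)
    group.extend(phases)
    return group

header = ET.fromstring(HEADER)
group = sort_newest_first(header)
# With inverse chronological order the current phase is simply the first child.
current_phase = group[0].get("phase-name")
print(current_phase)
```

With the phases stored this way, a tool can find the current phase without scanning the whole group.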

===== Information and phases that we have and might consider including =====

==== Template XLIFF file creation ====

This action is not a proper phase, and it is not necessary to create a <phase> element for this action, as most of the necessary information can be coded in the <file> element.

The <file> element includes a date attribute that is defined as “the creation date of this XLIFF file”. It is unclear if this date refers to the date in which the template was created or the date in which this specific instance of the file was created. XLIFF files are created as empty templates, which are later translated or filled with prior translations, or the result of merging two files. In either case, the important date is the date in which the data was extracted from the original data in the source language. Intermediary processes should attempt to keep this data, so that it can be fed into the XLIFF file. A possible extension of the XLIFF standard would be to have a separate date in which the present instance of the file (specifically for one language) was created.

The file creation date (and maybe tools) is useful to ensure that upgrades are always done from older files (older versions of the original software) to newer ones.

The <file> element also has other attributes that permit defining the piece of software and the exact version that the XLIFF file belongs to, ensuring that different versions of the project are not confused by the translator.

It would be very interesting to have a new attribute that indicates the path of this specific file inside the project tree (in case the project has several files stored in several directories). This information helps translation management tools know exactly where the file belongs.

The <file> element does have one attribute to indicate which <tool> was used to create the XLIFF file.

When the file is created, it is also interesting to count messages and include the count information in the file (in the <count-group>).

If the file is created from a PO file, the standard mandates that the header of the PO file be stored as the first translation unit, with an empty <source> string and the contents of the complete PO header as the <target> string.

==== Instantiation and initialisation (upgrade from an old version) ====

An XLIFF file that is ready for localisation is either:

  • A new XLIFF file for which the target language is defined, or:
  • An XLIFF file in which the already-translated contents of older versions of the same file (or of translation memory) for a specific language have been integrated.

In the second case, if the file is the update of an older version of the same file, there is information that must be kept. In principle, all the phases of the old file should be integrated in the updated version, so that information is maintained.

The one piece of information that is new is the version of the file from which the content was taken (prior to the one of the present file). This information might be useful for knowing what the history of the information in a file is, and specifically from where the content was brought in. Nevertheless, if a phase is kept for each upgrade (which for some projects might take place two or three times a week), then the number of phases generated might be too large. The alternatives to including this repetitive phase are either adding this information somewhere else in the <header> of the file or storing information only when the last phase of the file is not an upgrade phase.

If such phase is used, it should include:

  • Date of upgrade.
  • Version of the file (project name) from which info was taken to initialise the file.
  • Name and version of the tool that was used to do the upgrade.

Again, this whole set of information does not influence the process, but it is interesting to know when the origin of the information needs to be tracked. The existence of the phase itself shows that a step was taken, and that the old project was probably eliminated from the system.

All the upgrade, translation, review and approval phases of the old project are copied into this file, then this phase is added. Glossary and TM inclusion phases are not kept, as this information will not be copied from the file when upgrading.

The upgrade phase will look like this:

 <phase 
  phase-name="xxx999" 
  process-name="upgrade" 
  date = "2006-01-25T21:06:00Z" 
  x-prior-project="OpenOffice 2.0.2"
  tool="Translate Toolkit pot2po 0.9">
 </phase>
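One of the alternatives mentioned above, recording an upgrade phase only when the last phase of the file is not already an upgrade, could be sketched like this. The function name `record_upgrade` is hypothetical, the fragment is not namespaced, and the sketch assumes the phases are stored in chronological order so that the last child is the most recent one.

```python
import xml.etree.ElementTree as ET

def record_upgrade(group, phase_name, date, prior_project, tool):
    """Append an upgrade <phase> unless the last phase is already an upgrade.

    Returns True when a phase was added, False when it was skipped.
    """
    phases = group.findall("phase")
    if phases and phases[-1].get("process-name") == "upgrade":
        return False
    ET.SubElement(group, "phase", {
        "phase-name": phase_name,
        "process-name": "upgrade",
        "date": date,
        "x-prior-project": prior_project,  # non-standard attribute used in this document
        "tool": tool,
    })
    return True

group = ET.fromstring(
    '<phase-group>'
    '<phase phase-name="xxx234" process-name="translation" date="2006-01-20T10:00:00Z"/>'
    '</phase-group>'
)
first = record_upgrade(group, "xxx999", "2006-01-25T21:06:00Z",
                       "OpenOffice 2.0.2", "Translate Toolkit pot2po 0.9")
second = record_upgrade(group, "xxx1000", "2006-01-28T21:06:00Z",
                        "OpenOffice 2.0.2", "Translate Toolkit pot2po 0.9")
print(first, second, len(group))
```

The second call is skipped because the file already ends with an upgrade phase, which keeps frequent upgrades from flooding the phase group.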

==== Inclusion of Glossary info in XLIFF phase ====

The XLIFF file format allows the inclusion of a glossary inside its structures. The glossary is included in the <header> of the file, and can be either an embedded glossary inside the file or a reference to an external glossary.

Including a glossary phase has one major interest: keeping track of the date in which the glossary was embedded. When the process of embedding is an asynchronous process that might be repeated if there is an update of the glossary, it is important to keep track of either the version of the glossary or its last date of modification. This will indicate if the process must be repeated or not.

If the last phase available is the inclusion or update of a glossary, and a new revision is necessary, then the prior phase will be erased, and a new one annotated after the revision of the glossary information in the file. It does not make sense to record how many times the glossary was updated between other processes, as only the last update matters.

The phase will look like

<phase 
  phase-name="xxx213" 
  process-name="glossary inclusion" 
  date = "2006-01-25T21:06:00Z">
 </phase>
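The replacement rule for glossary phases can be illustrated with a short sketch. The helper name `record_glossary_inclusion` and the phase names are made up for the example, and the phases are assumed to be stored in chronological order.

```python
import xml.etree.ElementTree as ET

def record_glossary_inclusion(group, phase_name, date):
    """Add a glossary inclusion <phase>, replacing the last phase if it is one too."""
    phases = group.findall("phase")
    if phases and phases[-1].get("process-name") == "glossary inclusion":
        # Only the latest glossary update is of interest, so drop the previous one.
        group.remove(phases[-1])
    return ET.SubElement(group, "phase", {
        "phase-name": phase_name,
        "process-name": "glossary inclusion",
        "date": date,
    })

group = ET.fromstring(
    '<phase-group>'
    '<phase phase-name="translation1" process-name="translation" date="2006-01-20T10:00:00Z"/>'
    '<phase phase-name="glossary1" process-name="glossary inclusion" date="2006-01-25T21:06:00Z"/>'
    '</phase-group>'
)
record_glossary_inclusion(group, "glossary2", "2006-02-01T08:00:00Z")
names = [p.get("phase-name") for p in group.findall("phase")]
print(names)
```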

==== Inclusion of Translation Memory (TM) information in an XLIFF phase ====

Unlike in the case of glossaries, TM is not included in the <header> of the file, but inside each specific translation unit (<trans-unit>). Translation memory information can be placed inside the <target> element of the translation unit, if it is a perfect match, or as an alternative translation unit <alt-trans>.

As far as the <header> is concerned, we are in a similar case as with glossaries. We indicate in a phase the last time at which TM was analysed to include information in the file. We only update TM content in the file if the date of the TM is later than the time stamp in a prior TM phase.

The phase will look like:

<phase 
  phase-name="xxx213" 
  process-name="TM inclusion" 
  date = "2006-01-25T21:06:00Z">
 </phase>
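The date comparison that decides whether TM content should be refreshed might look like the sketch below. The name `tm_needs_refresh` is hypothetical, and the TM's modification date is assumed to be available as a timestamp in the same format as the phase date.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

def parse_date(value):
    """Parse a UTC timestamp such as 2006-01-25T21:06:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")

def tm_needs_refresh(group, tm_modified):
    """True when the TM changed after the latest "TM inclusion" phase,
    or when the file has no such phase yet."""
    dates = [parse_date(p.get("date")) for p in group.findall("phase")
             if p.get("process-name") == "TM inclusion"]
    return not dates or parse_date(tm_modified) > max(dates)

group = ET.fromstring(
    '<phase-group>'
    '<phase phase-name="xxx213" process-name="TM inclusion" date="2006-01-25T21:06:00Z"/>'
    '</phase-group>'
)
older_tm = tm_needs_refresh(group, "2006-01-20T00:00:00Z")
newer_tm = tm_needs_refresh(group, "2006-02-01T00:00:00Z")
print(older_tm, newer_tm)
```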

For each match that is considered of interest for the translator, but is not an exact match, the TM inclusion process should create an <alt-trans> unit with as much info as is available in TM, plus the information generated by the TM analysis engine (for example quality of match).

<alt-trans xml:lang="fr" match-quality="92">
  <source xml:lang="en-US">Knights of the round table, taste how good the wine is</source>
  <target xml:lang="fr" state="needs-review-l10n"  phase-name="xxx321">
    Chevaliers de la table ronde, goutez mois si le vin est bon. </target>   
  <context-group>
    <context context-type="x-openoffice">
      avmedia/source/framework.po-mediacontrol.src#AVMEDIA_STR_ENDLESS.string.text
    </context> 
  </context-group>
</alt-trans>

Note that the <target> of the <alt-trans> element makes reference to the phase in which it was included in the file.

What an exact match is would have to be defined, and there could even be grades of exact matches. An exact ID match, in which it is clear that the string belongs to the same string of the same file of the same program, could go directly into the <target>, not marking the string as fuzzy, nor for review. An exact match of unknown origin or from another application would be marked as a fuzzy match, by setting the state attribute of the <target> element to: “needs-review-translation”.

In the case of an exact match, the phase-name attribute of the <target> element of the <trans-unit> will make reference to the TM <phase> in which the content was filled.
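The placement rules for TM matches can be summarised in a small sketch. Since the definition of an exact match is still open, the sketch simply assumes that a quality of 100 means an exact text match and that a separate flag says whether the ID also matches; the function name `apply_tm_match` and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET

def apply_tm_match(unit, text, quality, same_id, phase_name):
    """Place a TM suggestion in a <trans-unit>.

    quality: match percentage from the TM engine (100 is assumed to mean exact).
    same_id: True when the TM entry is the same string of the same file.
    """
    if quality == 100:
        target = unit.find("target")
        if target is None:
            target = ET.SubElement(unit, "target")
        target.text = text
        target.set("phase-name", phase_name)
        if not same_id:
            # Exact text but unknown origin or another application: mark as fuzzy.
            target.set("state", "needs-review-translation")
        return target
    # Non-exact match: offer it as an alternative translation.
    alt = ET.SubElement(unit, "alt-trans", {"match-quality": str(quality)})
    alt_target = ET.SubElement(alt, "target", {"phase-name": phase_name})
    alt_target.text = text
    return alt

unit = ET.fromstring('<trans-unit id="1"><source>File</source></trans-unit>')
apply_tm_match(unit, "Fichier", 100, False, "xxx213")
apply_tm_match(unit, "Dossier", 92, False, "xxx213")
print(ET.tostring(unit, encoding="unicode"))
```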

In this phase we may also include the translation of the source messages to a third language that is not that of the source nor of the target of this file. This information is given to help the translator, who might be more proficient in an intermediate language (for example Spanish when translating to South or Central American indigenous languages) than in English. These translations would also be included in <alt-trans> units that would make reference to the phase-name.

==== Translation phase ====

Normal translation phase. The translator receives a file in which some or all the messages need translation, and performs the translation. The file might have glossary and Translation Memory data. A translation phase lasts for the whole period in which a translator is working on a file; the phase will only terminate when somebody else (such as a reviewer) starts working on the file. The start date of the phase is included in the phase; the termination date is the start date of the next phase.

The phase will look like:

<phase 
  phase-name="xxx234" 
  process-name="translation" 
  date = "2006-01-25T21:06:00Z" 
  contact-name="Alberto Martinez"
  contact-email="alberto@martinez.com">
 </phase>

Each time the translator creates or modifies a <target> element, the <target> must be associated to the current phase through the phase-name attribute and the state must be changed depending on the rights of the translator (and the workflow being followed). If the workflow requires a review phase after translation, the state will need to be changed to “needs-review-translation” (because it has been translated, but not yet reviewed).

==== Automatic review phase ====

In this phase the file goes through a number of automatic tests that may check conditions as different as alignment to glossary, capitalisation, spaces extra or missing at the beginning or end of the translation or correct tags inside the <target>.

The results of the tests that detect a problem are stored in the state-qualifier attribute of the <target>. As the tests can detect much more detailed reasons for rejection than the pre-defined values that the standard offers for the state-qualifier attribute, it would be interesting to have a non-standard attribute that may be present when the value of the state-qualifier is “rejected-inaccurate”. This way, editors that are not prepared for the extension may still see the general definition of rejection because of lack of accuracy. Nevertheless, an extension of the standard should be considered to add more standard values to this attribute.

In many cases it will still be important to give specific information about why the string is rejected (such as specifying where in the string the problem is, which variable is missing from the target, what error there is in the XML, etc.). This information can be included in a note of the <trans-unit>.

<phase 
  phase-name="xxx" 
  process-name="automatic review" 
  date = "2006-01-25T21:06:00Z" 
  tool ="Translate Toolkit pocheck 1.0">
</phase>

The problem with associating the translation to this phase is that we automatically lose the reference to the corresponding translation phase, as only one phase can be indicated for a translation unit. We will know when the information was checked, but no longer who translated it. We can assume that it is the translator in the prior translation phase, but this might not always be correct.

==== Manual authoritative review phase ====

As in the case of translation, a review phase will last for the whole period that a given reviewer spends working on a file. The reviewer will need to review messages from all prior translation phases since the last review was performed.

The <phase> element should contain:

<phase 
  phase-name="xxx" 
  process-name="authoritative-proofreading" 
  date = "2006-01-25T21:06:00Z" 
  contact-name="Alberto Martinez's reviewer"
  contact-email="boss@martinez.com">
 </phase>

With each <target>, the reviewer might accept the string, correct it or reject it.

Each time the reviewer reviews or modifies a <target> and considers it correct, the <target> must be associated to the current phase through the phase-name attribute and the state must be changed to “translated”. If the state-qualifier is fuzzy-match, the reviewer can fix it or leave it as it is, as if it were a non-translated message.

Again, the problem here is that we lose information. If there are two translation phases, followed by a review phase, we will not know in which of the two translation phases the translation took place.

<target 
   phase-name="xxx321" 
   state="translated"....

The reviewer might reject a translation without fixing it. In this case, the state-qualifier will be set to rejected-* and the state to needs-translation. If needed, a qualifying note can also be added.

 <note
   xml:lang="af"
   from="reviewer"
   annotates="target">
   "File" is a noun in this context not a verb
 </note>
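A rejection as described above touches two attributes of the <target> and optionally adds a <note>. The sketch below shows one way to do this; the function name `reject_target` is hypothetical, the fragment is not namespaced, and the <trans-unit> is assumed to already contain a <target>.

```python
import xml.etree.ElementTree as ET

# ElementTree's spelling of the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def reject_target(unit, qualifier, comment=None, lang=None):
    """Mark the <target> of a <trans-unit> as rejected, optionally with a note.

    qualifier: one of the rejected-* values, e.g. "rejected-inaccurate".
    """
    target = unit.find("target")  # assumed to exist
    target.set("state", "needs-translation")
    target.set("state-qualifier", qualifier)
    if comment:
        note = ET.SubElement(unit, "note", {"from": "reviewer", "annotates": "target"})
        note.text = comment
        if lang:
            note.set(XML_LANG, lang)
    return target

unit = ET.fromstring(
    '<trans-unit id="7"><source>File</source>'
    '<target state="needs-review-translation" phase-name="xxx234">Liasseer</target>'
    '</trans-unit>'
)
reject_target(unit, "rejected-inaccurate",
              '"File" is a noun in this context not a verb', lang="af")
print(ET.tostring(unit, encoding="unicode"))
```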

The reviewer might also accept a translation that is not aligned with the glossary. In that case it is necessary to change the state-qualifier, but it might be interesting to maintain the additional attribute that indicates lack of alignment to the present glossary, so that the glossary manager might add the new term to the glossary (or reject it). The reviewer might also accept other tests that have failed, if (s)he considers that the translation is correct.

The result of an authoritative review phase can either be a file that is considered correct by the reviewer or one that is too incorrect to be corrected by the reviewer. In the second case, the file would be sent back to the translator for a “review editing phase”.

In this case, the standard does not seem to have a global attribute to specify that a file has not been approved by the reviewer and should be sent back to the translator. Only looking at non-approved items will the system know that the next step is either continuing with the workflow or returning the file to the translator.
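Since there is no file-level flag, a tool has to derive the next step from the individual units. A minimal sketch of such a scan follows; the function name `next_step` and the two return values are invented for the example, and treating any rejected or needs-translation target as grounds for returning the file is an assumption about the workflow.

```python
import xml.etree.ElementTree as ET

def next_step(body):
    """Return "return-to-translator" if any unit was rejected, else "continue"."""
    for unit in body.iter("trans-unit"):
        target = unit.find("target")  # direct child only, not <alt-trans> targets
        if target is None:
            continue
        rejected = target.get("state-qualifier", "").startswith("rejected-")
        if rejected or target.get("state") == "needs-translation":
            return "return-to-translator"
    return "continue"

rejected_body = ET.fromstring(
    '<body>'
    '<trans-unit id="1"><source>File</source>'
    '<target state="translated">Fichier</target></trans-unit>'
    '<trans-unit id="2"><source>Edit</source>'
    '<target state="needs-translation" state-qualifier="rejected-inaccurate">Editer</target>'
    '</trans-unit>'
    '</body>'
)
clean_body = ET.fromstring(
    '<body><trans-unit id="1"><source>File</source>'
    '<target state="translated">Fichier</target></trans-unit></body>'
)
print(next_step(rejected_body), next_step(clean_body))
```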

==== Manual non-authoritative review phase ====

This phase is similar to the prior one, except that the reviewer is not allowed to change the <target> translated by the translator, only to qualify it and to accept or not the automatic tests results. The file in most cases will be sent back to the translator, as the reviewer does not have the power to move it forward, (s)he is only advising the translator.

==== Review Editing phase ====

In the workflows in which the reviewer(s) is non-authoritative, there should also be an editing phase, normally performed by the original translator. The translator receives back a file that has been annotated by the automatic review process and/or a non-authoritative reviewer. The translator in this phase pays attention to the strings that have comments, and produces a file for the approver.

<phase 
  phase-name="xxx238" 
  process-name="review-edit" 
  date = "2006-01-25T21:06:00Z" 
  contact-name="Alberto Martinez"
  contact-email="alberto@martinez.com">
 </phase>

==== Approval phase ====

In theory, the file that arrives for approval does not have any outstanding problems, and automatic test failures have been overridden by either translator or reviewer (whoever had the power to do it).

<phase 
  phase-name="xxx999" 
  process-name="approval" 
  date = "2006-01-25T21:06:00Z" 
  contact-name="Big Boss"
  contact-email="bigboss@martinez.com">
 </phase>

The approver can accept or reject translations. When a translation is accepted, the state is changed to final and the approved attribute of the <trans-unit> is changed to “yes”. The phase-name attribute of the translations made during this cycle (which had not been approved in a prior approval phase) will be changed to point to this approval phase.
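The changes made when a translation is accepted can be sketched as below. The function name `approve_unit` is hypothetical; the sketch uses the approved attribute of <trans-unit> as XLIFF names it, and it leaves units that were already approved untouched so that they keep pointing to their earlier approval phase.

```python
import xml.etree.ElementTree as ET

def approve_unit(unit, approval_phase):
    """Approve a <trans-unit>; returns False if there was nothing to change."""
    if unit.get("approved") == "yes":
        return False  # approved in a prior approval phase, keep its phase reference
    target = unit.find("target")
    if target is None:
        return False  # nothing to approve
    target.set("state", "final")
    target.set("phase-name", approval_phase)
    unit.set("approved", "yes")
    return True

old = ET.fromstring(
    '<trans-unit id="1" approved="yes"><source>File</source>'
    '<target state="final" phase-name="xxx500">Fichier</target></trans-unit>'
)
new = ET.fromstring(
    '<trans-unit id="2"><source>Edit</source>'
    '<target state="translated" phase-name="xxx321">Editer</target></trans-unit>'
)
changed_old = approve_unit(old, "xxx999")
changed_new = approve_unit(new, "xxx999")
print(changed_old, changed_new)
```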

If rejected they can use the same reasons and process used by the reviewer.

===== Uniqueness of phase names =====

All phase-name values in a file must be unique. It is important to ensure this, considering the case in which an upgrade brings in phases from another file. It is also important to ensure uniqueness of the phase name of the Creation phase, so that it will not clash with other phase names. Maybe the phase name should be the name of the phase followed by a number. This way the phase names will always be manageable.
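The "name followed by a number" idea could be implemented along these lines. The helper name `unique_phase_name` is hypothetical, and replacing spaces with hyphens is an extra assumption made so that the generated name is a single token.

```python
import xml.etree.ElementTree as ET

def unique_phase_name(group, process_name):
    """Return process_name plus the lowest number not yet used in the phase group."""
    existing = {p.get("phase-name") for p in group.findall("phase")}
    base = process_name.replace(" ", "-")
    number = 1
    while f"{base}{number}" in existing:
        number += 1
    return f"{base}{number}"

group = ET.fromstring(
    '<phase-group>'
    '<phase phase-name="translation1" process-name="translation"/>'
    '<phase phase-name="translation2" process-name="translation"/>'
    '</phase-group>'
)
next_translation = unique_phase_name(group, "translation")
first_glossary = unique_phase_name(group, "glossary inclusion")
print(next_translation, first_glossary)
```

When phases are copied in from another file during an upgrade, the same check can be run against the combined set of names before the copied phases are added.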