Some ideas that might be worth pursuing for terminology extraction. Good terminology means higher quality and more consistent translations.
Current techniques in poterminology rely on word frequency and stop lists to select words that should appear in the terminology list for an application. The problem is that high-frequency words are not always the ones you want. What you do want are domain-specific words, closely aligned words that are easily confused, and so on.
This is a collection of ideas and techniques to apply to try to get better hits on good terminology words. It also collects references to mathematical techniques and Python implementations.
Using a thesaurus to identify words that could cause translation confusion.
The idea is to use a thesaurus to find closely related words in the source text. If these words are related in English but used for different aspects within the application, there is a risk that a translator might use the same target term for both. This difference should be analysed and understood, and words chosen to ensure that, if needed, the target language preserves the separate meanings by using different target terms.
A simple idea is to take each word and check it against WordNet.
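The thesaurus check could be sketched roughly as follows. This is only an illustration: `TOY_THESAURUS` and `confusable_pairs` are hypothetical names, and the tiny hand-made mapping stands in for a real thesaurus such as WordNet, which would have to be wired in separately.

```python
import re
from itertools import combinations

# Hypothetical stand-in for a thesaurus: each word maps to the set of
# sense groups it belongs to. A real run might fill this from WordNet.
TOY_THESAURUS = {
    "delete": {"remove"},
    "remove": {"remove"},
    "erase": {"remove"},
    "folder": {"container"},
    "directory": {"container"},
    "save": {"store"},
}


def confusable_pairs(source_strings, thesaurus):
    """Return pairs of words used in the source that share a sense group."""
    words = set()
    for text in source_strings:
        words.update(re.findall(r"[a-z]+", text.lower()))
    # Only words the thesaurus knows about can be compared.
    known = sorted(w for w in words if w in thesaurus)
    return [
        (a, b)
        for a, b in combinations(known, 2)
        if thesaurus[a] & thesaurus[b]
    ]


messages = ["Delete the folder", "Remove this directory", "Save file"]
pairs = confusable_pairs(messages, TOY_THESAURUS)
print(pairs)  # [('delete', 'remove'), ('directory', 'folder')]
```

Each reported pair is a prompt for a human to decide whether the application really uses the two words for different things, and so whether the translation needs two distinct target terms.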
Words that are not in the English spell checker are potential candidates for terminology development.
A simple list of all words not recognised by the spell checker would give potential words that need to be checked. We would need some way of eliminating common spelling errors, otherwise the list could become overloaded.
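One possible way to do this is sketched below. It assumes the spell checker can be reduced to a plain set of known words, and it uses an invented heuristic for typos: an unknown word that occurs rarely and closely resembles a known word is dropped. The function name, the `min_count` and `cutoff` parameters and the sample data are all illustrative assumptions, not part of poterminology.

```python
import re
from collections import Counter
from difflib import get_close_matches


def unknown_word_candidates(source_strings, dictionary, min_count=2, cutoff=0.8):
    """List words missing from the dictionary, skipping likely typos."""
    counts = Counter()
    for text in source_strings:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    candidates = []
    for word, count in counts.items():
        if word in dictionary:
            continue
        # Assumed heuristic: a rare word that nearly matches a known word
        # is more likely a spelling error than a term.
        near_known = bool(get_close_matches(word, dictionary, n=1, cutoff=cutoff))
        if count < min_count and near_known:
            continue
        candidates.append(word)
    return sorted(candidates)


# Hypothetical word list standing in for the English spell checker.
known_words = {"open", "the", "file", "in", "a", "new", "tab",
               "settings", "save", "your"}
messages = [
    "Open the file in a new tab",
    "Save your settigns",
    "Open the gettext catalog",
    "gettext settings",
]
candidates = unknown_word_candidates(messages, known_words)
print(candidates)  # ['catalog', 'gettext']
```

Here "settigns" is discarded as a probable misspelling of "settings", while "gettext" and "catalog" survive as words worth reviewing. The thresholds would need tuning on real data, since a genuine term can also sit close to an ordinary word.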
Words that have a different frequency profile to those in the broader domain or in the language are potential terminology words.
If we can profile the frequency of words in an application or in an application domain (e.g. word processors) and compare that to a wider domain (e.g. computer software, written text), then we should be able to find words that do not follow the expected frequency pattern. These are then potential words for terminology analysis.
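A minimal sketch of such a comparison is shown below. It assumes a simple scoring rule that the text above does not prescribe: the ratio of a word's relative frequency in the application to its relative frequency in a reference corpus, with add-one smoothing so that words absent from the reference do not cause a division by zero. The function names, the `min_ratio` threshold and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(texts):
    """Count lower-cased alphabetic words across a list of strings."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def frequency_outliers(app_texts, reference_texts, min_ratio=2.0):
    """Return (word, log2 ratio) for words over-represented in the app."""
    app = word_counts(app_texts)
    ref = word_counts(reference_texts)
    vocab = set(app) | set(ref)
    # Add-one smoothing: every vocabulary word gets one extra count.
    app_total = sum(app.values()) + len(vocab)
    ref_total = sum(ref.values()) + len(vocab)
    scored = []
    for word in app:
        p_app = (app[word] + 1) / app_total
        p_ref = (ref[word] + 1) / ref_total
        ratio = p_app / p_ref
        if ratio >= min_ratio:
            scored.append((word, round(math.log2(ratio), 2)))
    # Highest score first, ties broken alphabetically.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


app_texts = [
    "Insert a footnote in the document",
    "Delete the footnote",
    "Format the footnote text",
]
reference_texts = [
    "The cat sat on the mat",
    "A dog ate the text in the document",
    "The end",
]
outliers = frequency_outliers(app_texts, reference_texts)
print([word for word, _ in outliers])
# ['footnote', 'delete', 'format', 'insert']
```

Common words such as "the" and "a" fall below the threshold because they are at least as frequent in the reference corpus, which is the effect a stop list currently approximates by hand. With corpora this small the scores are not meaningful; a real comparison would need a much larger reference.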