IATE Bilingual Termbase Extraction – The Problems

Dear Translator,

This website, SanTrans, offers you an easy way to make Europe’s IATE termbase, a collection of 8 million terms in the 24 languages of the EU, available in the CAT tool of your choice.
I can provide you (at a very low price) with files ready for import into the main CAT programs, such as SDL Trados Studio 2011/2014/2015, DVX2/3, CafeTran, memoQ 2013/2014/2015, WordFast Classic/Pro and others, without the problems you would face when trying to use the raw IATE data.


In July 2014, the Termcoord of the Translation Centre for the Bodies of the European Union published a large database (2.2 GB) on its webpage (Download IATE Termbase), with the intention of making its information accessible to users for their own purposes. The database now contains 1.3 million multilingual entries. Of course, most of the (prospective) users are translators rather than linguists or terminologists, simply because translators are far more numerous. However, a translator who wants to extract data from the database in order to integrate it into his own CAT tools is met with a large number of problems.

Version 2015 of the IATE database; added possibility of extracting subsets

In Jan 2015, the Termcoord published an updated version of its database. Apart from adding new terms (and deleting some), they also removed some of its most obvious flaws, viz. HTML tags and/or formatting strings, but most of the other problems listed below have not been touched.

These problems are detailed further below on this page; if you want to skip the details now, go to What you get…

Obviously, with a database of that size, handling it is a daunting task.

Various attempts to tackle this problem of size have been published. For instance, with the latest version of Xbench 3.0 (build 1243) you can extract language pairs and create output in various formats.

Also, Paul Filkin from SDL described a method on his blog Multifarious (What a Whopper!) for extracting several languages together and constructing an SDL Termbase.

Recently, the IATE Termcoord published a tool that enables the user to extract one or more subsets, ordered by language and Domain. This alleviates the problem of handling the large .tbx file but leaves all other problems untouched. Also, you can extract only one (sub)domain at a time, and an important Domain group cannot be extracted at all!

Whichever of these (complicated and time-consuming) methods is used to arrive at a termbase for pure translation work, the results are less than convincing, for the reasons explained below.
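To give an idea of what extracting a language pair from such a large file involves, here is a minimal Python sketch that reads a TBX file entry by entry instead of loading it all at once. It assumes the usual TBX nesting of termEntry, langSet and term elements with an xml:lang attribute on each langSet; the function name extract_pairs and the tiny sample document are illustrative only and are not taken from the IATE file or from any of the tools mentioned above.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pairs(source, src_lang, tgt_lang):
    """Yield (source_term, target_term) pairs, streaming one entry at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "termEntry":
            continue
        terms = {}
        for lang_set in elem.iter("langSet"):
            lang = lang_set.get(XML_LANG)
            found = [t.text.strip() for t in lang_set.iter("term") if t.text]
            terms.setdefault(lang, []).extend(found)
        # Every source term is paired with every target term of the same entry.
        for src in terms.get(src_lang, []):
            for tgt in terms.get(tgt_lang, []):
                yield src, tgt
        # Drop the processed entry's children to keep memory use low.
        elem.clear()


# Illustrative sample, assumed layout (not real IATE data).
SAMPLE = b"""<martif><text><body>
<termEntry id="1">
  <langSet xml:lang="en"><tig><term>dielectric</term></tig></langSet>
  <langSet xml:lang="de"><tig><term>Dielektrikum</term></tig></langSet>
</termEntry>
</body></text></martif>"""

pairs = list(extract_pairs(io.BytesIO(SAMPLE), "en", "de"))
print(pairs)
```

For the real file you would pass an opened file instead of the in-memory sample. Note that this only addresses the size problem; none of the content problems described below are solved by it.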

Description of the problem areas

Although the methods mentioned above may result in smaller databases that could eventually be used as input for a bi-, tri- or multilingual termbase or translation memory, these are still not well suited for use in a pure translation environment. The reasons for this are numerous.

  1. handling of synonyms: sometimes synonyms are strung together within one text record, separated by a semicolon, and sometimes they get separate text records;
  2. context notes are sometimes inserted in the text record between square brackets;
  3. the termbase lists subjects as numerical codes that can only be resolved by consulting the code definitions on the IATE website; however, the codes in the file do not correspond with the codes on the website, and the website does not define all codes found in the .tbx file;
  4. the termbase contains text entries varying from just one word or expression up to complete sentences, including remarks and explanations; these longer text entries have no purpose in a termbase;
  5. the IATE file contains numerous non-UTF-8 characters;
  6. many of the first users complained about the occurrence of a lot of 1-, 2- and even 3-letter words; others did not want acronyms and abbreviations in their termbase, or at least wanted the possibility of creating separate termbases for these terms;
  7. since many entries in the IATE file have multiple (sub)domains assigned, merging separately extracted files from multiple domains into one database can lead to a substantial overlap caused by duplicate entries.
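Points 6 and 7 of the list above can be illustrated with a short Python sketch. It is only a rough outline under stated assumptions: each extracted row is assumed to be a tuple (entry id, source term, target term), the acronym test is a crude heuristic (all capitals, at most six characters, no spaces), and the names classify and merge_unique as well as the sample rows are made up for illustration.

```python
def classify(term):
    """Sort a term into 'acronym', 'short' or 'regular' (crude heuristic)."""
    if term.isupper() and " " not in term and 1 < len(term) <= 6:
        return "acronym"
    if len(term) <= 3:
        return "short"
    return "regular"


def merge_unique(*extractions):
    """Merge several extractions, keeping the first copy of each identical row."""
    seen = set()
    merged = []
    for extraction in extractions:
        for row in extraction:
            if row not in seen:
                seen.add(row)
                merged.append(row)
    return merged


# Hypothetical rows from two separately extracted domains.
domain_a = [("1", "bank", "Bank"), ("2", "credit", "Kredit")]
domain_b = [("2", "credit", "Kredit"), ("3", "loan", "Darlehen")]

merged = merge_unique(domain_a, domain_b)
print(len(merged), "rows after merging")
print(classify("EU"), classify("tax"), classify("bank"))
```

With a classification like this, acronyms and very short words can be written to separate termbases instead of being thrown away.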

In the following examples, you will see some of the problems illustrated.

1 – Synonyms

The following picture shows two forms of synonyms in the IATE database: in the Portuguese termset two terms are defined, while in the Swedish termset the synonyms are separated by a semicolon within one term definition. No standard extraction or CAT program can cope with these two methods of synonym definition at the same time. For instance, when Xbench extracts the language pairs, the first English synonym is completely missed. When the database is converted into an SDL Studio TermBase, the second form is not recognized as containing synonyms.

Syn-2forms-3
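Bringing the two forms into one shape could look roughly like the following Python sketch. It assumes that a semicolon inside a term record always separates synonyms, which will not hold for every record; the function name split_synonyms and the placeholder terms are illustrative.

```python
def split_synonyms(term_records):
    """Return one flat list of synonyms from the term records of one language.

    Records holding several synonyms separated by semicolons are split;
    duplicates are dropped while the original order is kept.
    """
    synonyms = []
    for record in term_records:
        for part in record.split(";"):
            part = part.strip()
            if part and part not in synonyms:
                synonyms.append(part)
    return synonyms


# Both forms end up in the same shape.
print(split_synonyms(["alpha; beta"]))      # one record, two synonyms
print(split_synonyms(["alpha", "beta"]))    # two separate records
```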

2 – Context

Very often, context is added to a term description. Although this information is useful when you look up a term, it gets in the way if you just need a match for the word dielectrics.

Context-en-3
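One possible way to separate the term from its context note is sketched below in Python. It assumes that context notes are always enclosed in square brackets and are not nested; the function name split_context and the sample note are made up for illustration.

```python
import re

# A bracketed note together with the whitespace around it.
BRACKETED = re.compile(r"\s*\[([^\[\]]*)\]\s*")


def split_context(text):
    """Return (term, notes): the bare term and the list of bracketed notes."""
    notes = [note.strip() for note in BRACKETED.findall(text)]
    term = BRACKETED.sub(" ", text).strip()
    return term, notes


print(split_context("dielectrics [electrical engineering]"))
```

The note is kept rather than discarded, so it can still be stored in a separate field of the termbase.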

3 – Subject

For almost all term entries, a Subject is defined. This definition might help you decide whether the term is applicable in the context of the translation. However, the definition takes the form of a numerical code, the meaning of which can only be found on the IATE website. This is very impractical, to say the least. In this case, the meaning of the codes is “ECONOMICS”, “Financial institutions and credit”.

Subject-3
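Replacing the codes with readable labels amounts to a table lookup, as in the Python sketch below. The codes 9001 and 9002 are placeholders invented for this example, not real IATE codes, and the assumption that several codes are separated by commas is also mine; only the two labels come from the example above. Codes missing from the table are kept visible rather than silently dropped.

```python
# Placeholder codes (NOT real IATE codes) mapped to the labels from the example.
SUBJECT_LABELS = {
    "9001": "ECONOMICS",
    "9002": "Financial institutions and credit",
}


def resolve_subjects(raw_codes, labels=SUBJECT_LABELS):
    """Turn a comma-separated code string into a list of readable labels."""
    resolved = []
    for code in raw_codes.split(","):
        code = code.strip()
        if code:
            # Unknown codes stay recognisable so they can be looked up later.
            resolved.append(labels.get(code, f"unknown ({code})"))
    return resolved


print(resolve_subjects("9001, 9002, 1234"))
```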

4 – Long text entries

You would not like to have sentences as long as this one in your Termbase; they belong in a Translation Memory.

Long-sentence-3
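A simple way to route such entries is to count words, as in this Python sketch. The limit of five words is an arbitrary assumption chosen for the example, and the name route_entry is illustrative.

```python
MAX_TERM_WORDS = 5  # assumed limit; adjust to taste


def route_entry(text, max_words=MAX_TERM_WORDS):
    """Decide whether an entry belongs in the termbase or the translation memory."""
    if len(text.split()) <= max_words:
        return "termbase"
    return "translation_memory"


print(route_entry("financial institution"))
print(route_entry("this entry is a complete sentence with many more words than a term"))
```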

If you are interested in the solution to these problems, go to the next page.