What you get…

– Bilingual extraction already carried out

The complex task of extracting the language pairs from the gigantic (2.2 GB) IATE termbase has already been done, so you don’t have to wrestle with exotic text editors or third-party extraction tools like Xbench.
Depending on the language pair(s) ordered, you will get files with up to 1,200,000 cleaned-up term pairs, stripped of HTML codes, with synonyms handled according to the intended CAT-tool. These files can be imported into your CAT-tool in a breeze, using the supplied templates where necessary.

– Up to 6 different output files in the file set for one CAT-tool; domain code grouping for large file sets

The output of the extraction process is split into up to 6 different files, as follows:

  • Terms of up to 3 words in the source language are split across several output files, intended for separate termbases. This lets you keep them out of your general termbase and add them to dedicated termbases, to be used with increased penalties or other selection criteria, depending on the capabilities of your CAT-tool:
    1. Abbreviations and acronyms (either a single word written entirely in CAPITALS, or a term with the property “abbreviation” in the IATE termbase; some of these terms may also appear in one of the other files, depending on their character count)
    2. Terms consisting of 1 character only
    3. Terms consisting of 2 characters only
    4. Terms consisting of 3 characters only
    5. All other terms of up to 3 words in the source language, not belonging to any of the above classes; these are intended for your general termbase
    6. All terms of more than 3 words in the source language; these are output in a file (set) intended for creating a translation memory rather than a termbase.

Note that it is not advisable to use the bilingual termbases in ‘reverse’, i.e. using the target terms as source terms and vice versa, because the separation into different files is based on character and word counts in the source language only.
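The split described above can be sketched roughly as follows. This is a minimal illustration, not the actual program: the function name `classify_term`, the bucket labels and the `is_abbreviation` flag (standing in for the IATE “abbreviation” property) are all assumptions, and the choice to test the word count first is a guess.

```python
def classify_term(term: str, is_abbreviation: bool = False) -> list[str]:
    """Return the (hypothetical) output bucket(s) for one source-language term."""
    words = term.split()
    # More than 3 words: destined for a translation memory, not a termbase.
    if len(words) > 3:
        return ["translation_memory"]

    buckets = []
    # A single all-caps word, or a term flagged as an abbreviation.
    if is_abbreviation or (len(words) == 1 and term.isupper()):
        buckets.append("abbreviations")

    # Very short terms get their own files; an abbreviation can land here too,
    # which is where the duplicates mentioned in the list come from.
    n_chars = len(term.strip())
    if 1 <= n_chars <= 3:
        buckets.append(f"{n_chars}_char")

    # Everything else goes to the general termbase.
    if not buckets:
        buckets.append("general")
    return buckets


for sample in ("EU", "NATO", "value added tax", "common system of value added tax"):
    print(sample, "->", classify_term(sample))
```

Because only the source term is counted, swapping source and target would put many terms in the wrong file, which is the reason for the warning above.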

– Synonyms handled in a consistent and complete way

The synonyms, coded in the IATE termbase in at least 4 different ways, are automatically identified and, depending on the target termbase (MultiTerm, CafeTran, DVX or other), presented in groups or individually.

– HTML tags and formatting strings removed

These tags and strings may be useful in a terminology tool, but they have no place in a termbase destined for a pure translation environment. The IATE termbase contains many hundreds of thousands of them (in .xml format) that get in the way when using the terms in a CAT-tool; besides, many of them have incorrect syntax and/or lack their matching tag. These tags all begin with the ampersand “&”, but the ampersand (as well as the “<” and “>” characters) is sometimes used on its own in the terms, so handling them correctly is better left to a program.
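To illustrate why this is less trivial than a blanket search-and-replace, here is a minimal sketch. It assumes the tags appear as escaped entities such as `&lt;sub&gt;`, and it uses a small, hypothetical list of tag names; the real set of tags and formatting strings in IATE is not listed here, and this is not the actual cleaning program.

```python
import html
import re

# Hypothetical list of formatting tags to strip (an assumption for illustration).
TAG_NAMES = ("b", "i", "u", "br", "p", "sub", "sup", "span")

# Matches an escaped opening, closing or self-closing tag, e.g. "&lt;/sub&gt;".
ESCAPED_TAG = re.compile(
    r"&lt;/?(?:%s)\b[^&]*?/?&gt;" % "|".join(TAG_NAMES), re.IGNORECASE
)


def clean_term(raw: str) -> str:
    """Strip escaped formatting tags, then decode the entities that remain."""
    # Each tag is removed on its own, so an unpaired tag is handled as well.
    without_tags = ESCAPED_TAG.sub("", raw)
    # A genuine "&", "<" or ">" in the term survives as a literal character.
    text = html.unescape(without_tags)
    return " ".join(text.split())


print(clean_term("H&lt;sub&gt;2&lt;/sub&gt;O"))
print(clean_term("R&amp;D"))
print(clean_term("pH &lt; 7"))
```

Restricting the pattern to known tag names is what keeps a literal “<” in a term from being mistaken for the start of a tag.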

– Context information moved from term to separate property

The context information, sometimes located in the text record between square brackets, is moved to a separate property, so it does not hinder finding a match for the word when it occurs in the text to be translated.
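As a rough sketch of this step, assuming the context is a simple, non-nested bracketed fragment inside the term text (the function name `split_context` and the returned structure are hypothetical):

```python
import re

# A bracketed fragment without nested brackets, plus surrounding whitespace.
BRACKETED = re.compile(r"\s*\[([^\[\]]*)\]\s*")


def split_context(term: str) -> tuple[str, list[str]]:
    """Return the bare term and the list of bracketed context notes."""
    contexts = [c.strip() for c in BRACKETED.findall(term)]
    bare = " ".join(BRACKETED.sub(" ", term).split())
    return bare, contexts


print(split_context("charge [accounting]"))
print(split_context("plain term"))
```

The bare term is what goes into the term field, so the CAT-tool can match it in running text; the notes would go into a separate property.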

– English Subject property localized into target language

In the Dataset extracted from IATE, almost all terms have one or more Subject properties in the English language.
When creating a Language Pair, my program replaces the Subject properties with their translation into the target language, making the information easier to understand for translators who are not fluent in English.
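In essence this is a table lookup. The sketch below uses a tiny, purely illustrative English-to-Dutch table; the real subject list and its translations are not reproduced here, and keeping the English label when no translation is known is an assumption.

```python
# Illustrative entries only; not the actual translation table.
SUBJECTS_NL = {
    "taxation": "belastingen",
    "transport": "vervoer",
}


def localize_subjects(subjects: list[str], table: dict[str, str]) -> list[str]:
    """Replace each English subject by its translation, if one is known."""
    # Unknown subjects are kept in English rather than dropped (assumption).
    return [table.get(s.lower(), s) for s in subjects]


print(localize_subjects(["Taxation", "Fisheries"], SUBJECTS_NL))
```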

– Obvious errors in the IATE termbase corrected

The IATE termbase contains some errors, such as text that is not UTF-8 encoded and Greek characters used in clearly Dutch entries. Can you see the difference between a Latin “H” (LATIN CAPITAL LETTER H) and a Greek “Η” (GREEK CAPITAL LETTER ETA)? Your CAT program can, and it will not find a match.
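One way to spot such look-alikes is to check the Unicode name of every character, as in the sketch below. The mapping is a small, hand-picked subset of Greek capitals that resemble Latin ones, and both function names are hypothetical. A replacement like this should only be applied to entries known not to be Greek.

```python
import unicodedata

# Greek capitals that look like Latin capitals (illustrative subset).
GREEK_TO_LATIN = str.maketrans({
    "\u0391": "A", "\u0392": "B", "\u0395": "E", "\u0397": "H",
    "\u0399": "I", "\u039A": "K", "\u039C": "M", "\u039D": "N",
    "\u039F": "O", "\u03A1": "P", "\u03A4": "T", "\u03A7": "X",
})


def find_greek(term: str) -> list[tuple[int, str]]:
    """List (position, Unicode name) for every Greek character in the term."""
    return [
        (pos, unicodedata.name(ch))
        for pos, ch in enumerate(term)
        if unicodedata.name(ch, "").startswith("GREEK ")
    ]


def fix_lookalikes(term: str) -> str:
    """Replace Greek look-alike capitals by their Latin counterparts."""
    return term.translate(GREEK_TO_LATIN)


suspect = "\u0397uis"  # Dutch "Huis" typed with a Greek Eta
print(find_greek(suspect))
print(fix_lookalikes(suspect) == "Huis")
```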

– Simple filling of your termbase

All files in the fileset prepared by me are aimed at simplifying the process of bilingual termbase creation. Therefore, when a termbase template can help simplify that process, the template is delivered as well (currently for MultiTerm, memoQ TB and DVX2/3 TB).