I am happy to say that I can offer output files for every language combination, ready for use in SDL Studio 2011 and later, CafeTran, memoQ 2013 and later, DVX 2/3 and WordFast Classic/Pro, in which the problems described on the first page have been resolved.
In order to solve these problems I proceeded in two steps.
To begin with, I wrote a cleaning program (Unix ‘sed’, for insiders) to remove or change all offending strings in the original IATE database. I run this program only once for every new revision; it needs to be adapted only if the additions to the IATE data contain new, not previously occurring, problems.
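As an illustration of what such a cleaning pass does, here is a minimal Python sketch. It is a stand-in for the sed script, which is not reproduced here, and the three rules are assumptions chosen for the example; the actual offending strings in the IATE data are not listed in this text.

```python
import re

# Hypothetical cleaning rules (assumed for illustration only).
# Each entry plays the role of one sed "s/pattern/replacement/g" command.
RULES = [
    (re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]"), ""),  # drop control characters
    (re.compile(r"&amp;amp;"), "&amp;"),                 # undo double-escaped ampersands
    (re.compile(r"[ \t]+"), " "),                        # collapse runs of spaces and tabs
]


def clean_line(line):
    """Apply every rule in order, then trim surrounding whitespace."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line.strip()
```

Keeping the rules in one list mirrors the way the cleaning program is maintained: a new revision of the data only requires a new rule when it introduces a problem not seen before.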
Also, I downloaded the list of translations of the subject codes from the IATE website and replaced the numerical codes used in the IATE database with their names.
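The code-to-name replacement can be sketched as a simple table lookup. The table below is a placeholder with illustrative values; the real list is the one downloaded from the IATE website. The comma-separated field layout is also an assumption.

```python
# Placeholder lookup table (illustrative values only); the real
# table comes from the subject-code list on the IATE website.
SUBJECT_NAMES = {"12": "Law", "24": "Finance"}


def replace_codes(field, names, sep=","):
    """Replace each numeric code in a separated field with its name.

    Codes missing from the table are kept unchanged, so nothing is
    lost when a new revision introduces an unknown code.
    """
    codes = [c.strip() for c in field.split(sep) if c.strip()]
    return (sep + " ").join(names.get(code, code) for code in codes)
```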
The next step is to extract (again using a Unix program, ‘nawk’), per language pair, the relevant information contained in the IATE database and to process it according to the requirements of the various CAT tools for which Term Base and Translation Memory files are created. In this process, synonyms are ordered and grouped where relevant, context information is moved to where it no longer disturbs the terms, etc.
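The per-language-pair extraction with synonym grouping can be sketched as follows. This is a Python approximation of the idea, not the nawk program itself; the input layout (one row per term, carrying an entry id, a language code and the term) is an assumption.

```python
from collections import defaultdict


def extract_pair(rows, src, tgt):
    """Group the terms of each entry by language for one language pair.

    rows: iterable of (entry_id, lang, term) tuples (assumed layout).
    Returns {entry_id: (source_synonyms, target_synonyms)} for entries
    that have at least one term in both languages.
    """
    by_entry = defaultdict(lambda: {src: [], tgt: []})
    for entry_id, lang, term in rows:
        # Ignore other languages and skip duplicate synonyms.
        if lang in (src, tgt) and term not in by_entry[entry_id][lang]:
            by_entry[entry_id][lang].append(term)
    return {
        entry_id: (terms[src], terms[tgt])
        for entry_id, terms in by_entry.items()
        if terms[src] and terms[tgt]
    }
```

Entries that exist in only one of the two languages are dropped, since they contribute nothing to a bilingual term base.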
This process creates output files for filling specific Term Bases and Translation Memories.
The first set is for SDL Studio, in the file formats Studio requires: a set of .xml files with 100 000 entries each for filling the term base, including a format definition file for the term base, as well as a .tmx file for filling the Translation Memory.
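Splitting the entries into files of 100 000 each can be sketched with a small batching helper. Only the batching is shown; the Studio term base XML layout itself is not reproduced here, and the function name is hypothetical.

```python
from itertools import islice


def chunks(entries, size=100_000):
    """Yield successive lists of at most `size` entries."""
    iterator = iter(entries)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each batch would then be written to its own .xml file, so that no single import file exceeds the chosen size.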
Then, a dedicated .csv file for CafeTran with all synonyms in the same record is created. It just needs to be accessed by CafeTran.
Third, a .tmx file ready to be imported into a memoQ Translation Memory, complete with an .xml TM import scheme, is created. For the memoQ term base, the default .csv term base files can be used.
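For readers unfamiliar with the format, a .tmx file is plain XML with one translation unit per term pair. The sketch below builds a generic TMX 1.4 document with Python's standard library. The header values (tool name, version, and so on) are placeholders, and the memoQ import scheme is not covered.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src, tgt):
    """Return a TMX 1.4 document (as a string) for (source, target) pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "iate-sketch",   # placeholder tool name
        "creationtoolversion": "0.1",    # placeholder version
        "segtype": "phrase",
        "o-tmf": "csv",
        "adminlang": "en",
        "srclang": src,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source_text, target_text in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src, source_text), (tgt, target_text)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```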
Fourth, for a DVX term base, a template is created to facilitate import of the default .csv term base files. The DVX Translation Memory can be filled from the default .csv TM file.
For other CAT-tools the default .csv files that have been created, having an individual record for every synonym pair, can be used and processed.
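The difference between the default files (one record per synonym pair) and a grouped record can be made concrete with a short sketch. The two-column layout is an assumption; the actual default .csv files may carry additional fields.

```python
import csv
import io
from itertools import product


def pair_records(src_terms, tgt_terms):
    """Expand grouped synonyms into one record per source/target pair."""
    return list(product(src_terms, tgt_terms))


def to_csv(records):
    """Serialise the records as CSV text (quoting is handled by csv)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(records)
    return buffer.getvalue()
```

An entry with two source synonyms and one target term thus yields two records, which is why the default files grow faster than the grouped ones.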
Fifth, output file sets for import into WordFast Classic/Pro are created.
My standard offer contains, for those language pairs where both languages have more than 300 000 terms, file sets based on 6 predefined Domain groups. It is now also possible to order file sets that have been extracted for a group of Domain codes specified by you, for any language pair. This will allow you to create term bases that contain only terms for the Domain areas you are working with. You can have a look at the Domain code grouping by downloading the file via the following link: Domain group ordering
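Extracting only the entries for a chosen group of Domain codes amounts to a set-membership filter. The sketch below assumes each entry is a dictionary with a "domains" list; that structure is hypothetical and chosen only for the example.

```python
def filter_by_domains(entries, wanted):
    """Keep entries that carry at least one of the wanted Domain codes.

    entries: iterable of dicts with a "domains" list (assumed structure).
    wanted:  iterable of Domain codes chosen by the customer.
    """
    wanted = set(wanted)
    return [entry for entry in entries if wanted & set(entry["domains"])]
```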
As a result, you will no longer have problems with file sizes and term base definitions and can, after downloading my files, go straight ahead with creating and filling your CAT databases.