I am happy to say that I can offer output files for every language combination, ready for use in SDL Studio 2011/2014, CafeTran, memoQ, DVX 2/3 and WordFast Classic/Pro, in which the problems described on the first page have been taken care of.
In order to solve these problems I proceeded in two steps.
To begin with, I wrote a cleaning program (Unix ‘sed’ for insiders) to remove or change all offending strings in the original IATE database. I ran this program only once; should IATE make an update available, I can run the same program to clean the input again. It needs to be adapted only if the additions to the IATE data contain new, previously unseen problems.
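To illustrate the idea of such a cleaning pass, here is a minimal sketch in Python (not the sed program itself). The two rules shown are placeholders chosen for illustration only; the actual offending strings are not listed in this section, and the names `CLEANING_RULES`, `clean_line` and `clean_lines` are hypothetical.

```python
import re

# Hypothetical cleaning rules. The real list of offending strings is not
# given here, so these two patterns are placeholders only.
CLEANING_RULES = [
    (re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]"), ""),  # drop control characters
    (re.compile(r"[ \t]{2,}"), " "),                     # collapse runs of blanks
]


def clean_line(line: str) -> str:
    """Apply every rule in order, like a sequence of sed 's/.../.../g' commands."""
    for pattern, replacement in CLEANING_RULES:
        line = pattern.sub(replacement, line)
    return line


def clean_lines(lines):
    """Clean an iterable of lines lazily, so a large export need not fit in memory."""
    for line in lines:
        yield clean_line(line)
```

Keeping the rules in one list means that an IATE update with new problems would only require adding entries to that list, which matches the "adapt only when needed" approach described above.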
Also, I downloaded the list of translations of the subject codes from the IATE website and replaced the actual numerical codes used in the IATE database by their names.
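The code-to-name replacement can be sketched as a simple lookup. This is an illustration under assumptions: the sample code `01`, its name, and the `;` separator between multiple codes in one field are invented for the example and do not reflect the real IATE list or file layout.

```python
def load_code_names(rows):
    """Build a code -> name mapping from (code, name) pairs."""
    return {code.strip(): name.strip() for code, name in rows}


def replace_codes(field, code_names, sep=";"):
    """Replace each code in a separator-delimited field by its name.

    Unknown codes are kept unchanged so that no information is lost.
    The separator is an assumption, not the documented IATE format.
    """
    codes = [c.strip() for c in field.split(sep) if c.strip()]
    return (sep + " ").join(code_names.get(c, c) for c in codes)


# Invented sample data, for illustration only.
names = load_code_names([("01", "Example domain A"), ("02", "Example domain B")])
```

With this mapping, a field containing a known and an unknown code keeps the unknown one as it is, so missing translations remain visible instead of silently disappearing.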
The next step is to extract (again using a Unix program, ‘nawk’), per language pair, the relevant information contained in the IATE database and to process it according to the requirements of the various CAT tools for which Termbase and Translation Memory files are created. In this process, synonyms are ordered and grouped where relevant, context information is moved to where it no longer disturbs the terms, etc.
This process creates output files for filling specific Termbases and Translation Memories.
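The per-language-pair extraction with synonym grouping might look roughly like the following Python sketch. It assumes, purely for illustration, that the cleaned data can be read as `(entry_id, language, term)` rows; that layout and the function name `group_by_entry` are hypothetical and are not the nawk program itself.

```python
from collections import defaultdict


def group_by_entry(rows, source_lang, target_lang):
    """Group terms per entry for one language pair.

    rows: iterable of (entry_id, lang, term) tuples (assumed layout).
    Returns {entry_id: (source_terms, target_terms)}, keeping only
    entries that have at least one term in both languages.
    """
    grouped = defaultdict(lambda: ([], []))
    for entry_id, lang, term in rows:
        if lang == source_lang:
            grouped[entry_id][0].append(term)
        elif lang == target_lang:
            grouped[entry_id][1].append(term)
    return {eid: (src, tgt) for eid, (src, tgt) in grouped.items() if src and tgt}
```

Each value holds all synonyms of one entry in both languages, which is the form from which both the grouped and the pair-per-record outputs can be derived.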
The first set is for SDL Studio 2011/2014, in the file formats Studio requires: a set of .xml files with 10 000 entries each for filling the termbase, including a format definition file for the Termbase, as well as a .tmx file for filling the Translation Memory.
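Splitting the entries into files of 10 000 each can be sketched as follows. Only the chunking is shown; the .xml content that Studio expects is not reproduced here, and the file name pattern `termbase_001.xml` is a hypothetical choice for the example.

```python
from itertools import islice


def chunked(entries, size=10000):
    """Yield successive lists of at most `size` entries."""
    iterator = iter(entries)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def chunk_filenames(n_entries, size=10000, stem="termbase"):
    """Hypothetical file names for the chunks, numbered from 001."""
    n_files = (n_entries + size - 1) // size  # ceiling division
    return [f"{stem}_{i:03d}.xml" for i in range(1, n_files + 1)]
```

For 25 000 entries this gives two full files and one file with the remaining 5 000 entries.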
Then, a dedicated .csv file for CafeTran with all synonyms in the same record is created. It just needs to be accessed by CafeTran.
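A record that holds all synonyms of an entry could be produced along these lines. This is a sketch under assumptions: the tab as column delimiter and `; ` as the separator between synonyms are illustrative choices, not the documented layout of the CafeTran file.

```python
import csv
import io


def grouped_record(src_terms, tgt_terms, sep="; "):
    """One record per entry: all source synonyms in the first column,
    all target synonyms in the second (separator is an assumption)."""
    return [sep.join(src_terms), sep.join(tgt_terms)]


def write_grouped_csv(entries, delimiter="\t"):
    """Serialize (src_terms, tgt_terms) entries to delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    for src_terms, tgt_terms in entries:
        writer.writerow(grouped_record(src_terms, tgt_terms))
    return buffer.getvalue()
```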
Third, I create a .tmx file ready to be imported into a memoQ Translation Memory, complete with an .xml TM import scheme. For the memoQ termbase, the default .csv termbase files are used.
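For readers unfamiliar with the format, a minimal .tmx document for term pairs can be generated as in the sketch below. It follows the general TMX 1.4 structure (header, body, one translation unit per pair); the header values such as `example-script` are placeholders, and the files I actually provide may carry different header attributes and additional fields.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this namespaced attribute as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, source_lang, target_lang):
    """Return a minimal TMX 1.4 document as a string.

    pairs: iterable of (source_text, target_text).
    Header values are placeholders for illustration.
    """
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",
        "creationtoolversion": "0.1",
        "segtype": "phrase",
        "o-tmf": "csv",
        "adminlang": "en",
        "srclang": source_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((source_lang, src), (target_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```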
Fourth, for a DVX 2/3 termbase, a template is created to facilitate import of the default .csv termbase files. The DVX 2/3 Translation Memory can be filled from the default .csv TM file.
For other CAT tools, the default .csv files, which have an individual record for every synonym pair, can be used and processed.
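The "individual record for every synonym pair" layout amounts to pairing every source synonym with every target synonym of the same entry. A minimal sketch, with the hypothetical function name `synonym_pairs`:

```python
from itertools import product


def synonym_pairs(src_terms, tgt_terms):
    """Return one (source, target) record for every synonym combination."""
    return list(product(src_terms, tgt_terms))
```

An entry with two source synonyms and two target synonyms therefore yields four records, which is why these default files are larger than a file that keeps all synonyms in one record.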
Fifth, output file sets for import into WordFast Classic/Pro have been created and are ready for download.
My standard offer contains file sets based on 6 predefined Domain groups for those language pairs where both languages contain more than 300 000 terms. In addition, it is now possible to order file sets that have been extracted for a group of Domain codes specified by you, for any language pair. This allows you to create term bases that contain only terms for the Domain areas you are working with. You can have a look at the Domain code grouping by downloading the file at the following link: Domain group ordering
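Extracting only the entries for a chosen group of Domain codes is, in essence, a set-membership filter. The sketch below assumes each entry is a dictionary with a `domains` list; that structure and the sample codes are invented for illustration.

```python
def filter_by_domains(entries, wanted_codes):
    """Keep entries that carry at least one of the requested Domain codes."""
    wanted = set(wanted_codes)
    return [entry for entry in entries if wanted & set(entry["domains"])]


# Invented sample entries, for illustration only.
sample = [
    {"term": "alpha", "domains": ["A1", "B2"]},
    {"term": "beta", "domains": ["C3"]},
    {"term": "gamma", "domains": []},
]
```

Here, requesting the codes `B2` and `C3` keeps the first two entries and drops the one without any Domain code.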
As a result, you won’t have any more problems with file sizes and Termbase definitions and can, once you have downloaded my files, go straight ahead with creating and filling your CAT databases.