The Term Extraction Module (TEM) allows you to extract candidate terms from one or more documents and/or URLs. It identifies single-word and multi-word terms that appear more often in your uploaded documents than in general language. It then ranks these candidate terms by relevance, using scores based on how frequent, specific, and unique they are.
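As a rough illustration of the idea of "more frequent in the document than in general language", the sketch below ranks words by the ratio of their relative frequency in a document to their relative frequency in a reference text. This is not the module's actual scoring formula, which is not documented here; the function name `relative_frequency_scores`, the simple tokeniser, and the add-one smoothing are all assumptions made for the example.

```python
import re
from collections import Counter


def relative_frequency_scores(document: str, reference: str) -> list[tuple[str, float]]:
    """Rank words by how over-represented they are in `document` versus `reference`.

    Hypothetical helper for illustration only; it is not TEM's scoring method.
    """
    def tokenize(text: str) -> list[str]:
        # Very simple tokeniser: lower-cased runs of ASCII letters.
        return re.findall(r"[a-z]+", text.lower())

    doc_counts = Counter(tokenize(document))
    ref_counts = Counter(tokenize(reference))
    doc_total = sum(doc_counts.values())
    ref_total = sum(ref_counts.values())

    scores = {}
    for word, count in doc_counts.items():
        doc_rate = count / doc_total
        # Add-one smoothing so words missing from the reference do not divide by zero.
        ref_rate = (ref_counts[word] + 1) / (ref_total + 1)
        scores[word] = doc_rate / ref_rate

    # Highest ratio first: words typical of the document rather than of general language.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = relative_frequency_scores(
    "the tariff quota applies to the tariff line",
    "the cat sat on the mat and the dog sat on the rug",
)
print(ranked[:3])
```

In this toy input, a domain word such as "tariff" ends up at the top of the ranking, while a common word such as "the" ends up at the bottom, because it is just as frequent in the reference text.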
The module offers monolingual and bilingual extractions and currently supports DE, EN, ES, FR and IT. There is no need to specify the language of the documents that you submit, as the module recognises it automatically.
On this page, you will find out how to create a TEM request, how to use an exclusion list, and how the module processes your data.
Read more on user groups and access rights.
CREATE A TEM REQUEST
To create a term extraction request:
- Go to the ‘Term processing’ menu, and click on the ‘Term Extraction Module (TEM)’ tab.
- Use the sliding button to indicate whether the request is monolingual or bilingual.
- For monolingual requests, upload one or more source documents and/or insert one or more URLs. For bilingual requests, also provide the target files and/or URLs.
The most common editable formats are accepted (Word, Excel, PowerPoint, editable PDF, HTML, XML, CSV, etc.).
- Name your term extraction request.
- Choose whether to apply an exclusion list, i.e. a list of terms that should not be proposed as candidate terms.
- Click on ‘Create’ to submit your TEM request.
As the next step, read more on how to retrieve a TEM request.
EXCLUSION LIST
You can either generate your own exclusion list using the template available for download at the bottom of the page, or use one of two proposed exclusion lists containing:
- the most frequent EN words in the DGT corpus, or
- the most duplicated EN terms in IATE.
To apply an exclusion list, upload at least two source (or target) files. You will then be given the option to mark one of them as an exclusion file (only one exclusion list can be applied per request).
You can also apply both proposed exclusion lists automatically (this feature is only available for English). Choose one of the following three thresholds:
- Low: excludes 33 % of the content of the two lists.
- Medium: excludes 66 % of the content of the two lists.
- High: excludes all the content of the two lists.
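The effect of the three thresholds can be pictured with the minimal sketch below. It assumes that each proposed list is ordered from most to least frequent and that a threshold keeps the top share of each list; the page does not state how the 33 % and 66 % are selected, so this is only one possible reading. The names `THRESHOLDS`, `build_exclusion_set` and `apply_exclusion` are hypothetical.

```python
# Share of each proposed list that is excluded at each threshold (from the list above).
THRESHOLDS = {"low": 0.33, "medium": 0.66, "high": 1.0}


def build_exclusion_set(frequent_words: list[str],
                        duplicated_terms: list[str],
                        threshold: str) -> set[str]:
    """Combine the top share of both lists into one exclusion set.

    Assumption: both lists are ranked, most frequent entry first.
    """
    share = THRESHOLDS[threshold]
    excluded: set[str] = set()
    for ranked_list in (frequent_words, duplicated_terms):
        cutoff = round(len(ranked_list) * share)
        excluded.update(term.lower() for term in ranked_list[:cutoff])
    return excluded


def apply_exclusion(candidates: list[str], excluded: set[str]) -> list[str]:
    """Drop candidate terms that appear in the exclusion set (case-insensitive)."""
    return [term for term in candidates if term.lower() not in excluded]


frequent = ["the", "of", "and"]
duplicated = ["agreement", "committee", "regulation"]
excluded = build_exclusion_set(frequent, duplicated, "medium")
print(apply_exclusion(["Committee", "regulation", "tariff quota"], excluded))
```

With the "medium" threshold in this toy example, "Committee" is filtered out because it falls in the top two thirds of its list, while "regulation" is only excluded at the "high" threshold.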
PROCESS DATA IN TEM
The Term Extraction Module processes your documents in two main stages:
- Pre-processing – preparing and analysing the text:
  - The system converts the uploaded text into a standard text (.txt) file.
  - A word tokeniser (UIMA) breaks the text down into individual words and punctuation marks, and it also finds their dictionary forms (lemma) and root forms (stem).
  - A POS tagger (TreeTagger) identifies the part of speech for each word (like noun or verb).
  - A regex engine and a list of regex rules (also based on UIMA) find potential multi-word terms.
  - A contextualiser analyses the surrounding words for each single-word term.
  - The system applies a score calculation to each potential term and groups similar words together. This includes splitting multi-word terms and grouping variations based on prefixes, grammar structures, synonyms, and minor spelling differences.
- Post-processing – cleaning up the results:
  - The system applies a set of custom clean-up rules to filter out junk data, and it automatically removes:
    - Standalone two-letter words (e.g., ‘II’, ‘EU’, ‘OJ’).
    - Two-letter codes mixed with numbers or punctuation (e.g., ‘2014/17/EU’, ‘p.34’).
    - Words consisting only of repeated letters (e.g., ‘III’).
    - Email addresses and website links (containing ‘@’ or ‘.com’).
    - Standalone adjectives or adverbs (they are only kept if they are part of a larger multi-word term).
    - Any terms you included in an uploaded exclusion file.
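The post-processing rules listed above can be approximated with a few regular expressions and string checks. The sketch below is one interpretation of those rules, not the module's own implementation: the function names, the simplified part-of-speech labels `ADJ` and `ADV`, and the reading of "two-letter code" as "a single token with at most two letters plus digits or punctuation" are all assumptions.

```python
import re

TWO_LETTER_WORD = re.compile(r"^[A-Za-z]{2}$")      # e.g. 'II', 'EU', 'OJ'
REPEATED_LETTER = re.compile(r"^([A-Za-z])\1+$")    # e.g. 'III'


def is_short_code(term: str) -> bool:
    """One or two letters mixed with digits or punctuation, e.g. '2014/17/EU', 'p.34'."""
    if " " in term:
        return False
    letters = sum(ch.isalpha() for ch in term)
    others = len(term) - letters
    return 0 < letters <= 2 and others > 0


def clean_candidates(candidates, exclusion=frozenset()):
    """Filter (term, pos) pairs using the clean-up rules described in the text.

    `pos` is a hypothetical simplified tag ('ADJ', 'ADV', 'NOUN', ...); the
    tagset actually used by the module is not specified on this page.
    """
    excluded = {term.lower() for term in exclusion}
    kept = []
    for term, pos in candidates:
        single_word = " " not in term
        if TWO_LETTER_WORD.match(term) or REPEATED_LETTER.match(term):
            continue
        if is_short_code(term):
            continue
        if "@" in term or ".com" in term.lower():
            continue
        # Adjectives and adverbs survive only inside multi-word terms.
        if single_word and pos in {"ADJ", "ADV"}:
            continue
        if term.lower() in excluded:
            continue
        kept.append(term)
    return kept


sample = [
    ("EU", "NOUN"), ("2014/17/EU", "NOUN"), ("p.34", "NOUN"), ("III", "NOUN"),
    ("info@example.com", "NOUN"), ("rapid", "ADJ"),
    ("rapid alert system", "NP"), ("tariff quota", "NP"), ("committee", "NOUN"),
]
print(clean_candidates(sample, exclusion={"committee"}))
```

In this example only the two multi-word terms remain: the adjective "rapid" is dropped on its own but kept as part of "rapid alert system", and "committee" is removed because it is in the exclusion set.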
Any feedback to further improve this module will be very welcome.
(*) USER GROUPS AND ACCESS RIGHTS
Check below to see which IATE user groups can create term extraction requests:
| User group | Create TEM request |
|---|---|
| NON-LOGGED-IN USER | No |
| TRANSLATOR and above (except LIMITED) | Yes |