Term Extraction Module (TEM) (*)

The Term Extraction Module (TEM) allows you to extract candidate terms from one or more documents and/or from one or more URLs. It identifies single-word and multi-word terms that appear more often in your uploaded document than in general language. It then ranks these suggested terms by relevance, using scores based on how frequent, specific, and unique they are.
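
To illustrate the idea of a term appearing "more often in your document than in general language", here is a minimal Python sketch. It is not TEM's actual scoring, which this page does not describe. The function name `specificity_scores`, the add-one smoothing and the tiny sample texts are all assumptions made for the example.

```python
from collections import Counter

def specificity_scores(doc_tokens, reference_tokens):
    """Ratio of each word's relative frequency in the document to its
    relative frequency in a general-language reference corpus.
    Illustrative only: the real TEM scores are not documented here."""
    doc_counts = Counter(doc_tokens)
    ref_counts = Counter(reference_tokens)
    doc_total = len(doc_tokens)
    ref_total = len(reference_tokens)
    scores = {}
    for word, count in doc_counts.items():
        doc_rel = count / doc_total
        # Add-one smoothing (an assumption) so that words missing from
        # the reference corpus do not cause a division by zero.
        ref_rel = (ref_counts[word] + 1) / (ref_total + 1)
        scores[word] = doc_rel / ref_rel
    return scores

# Hypothetical sample data: a short "document" and a short "general" corpus.
doc = "the tariff quota applies to the tariff line".split()
reference = "the cat sat on the mat and the dog sat on the rug".split()

scores = specificity_scores(doc, reference)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this sample, 'tariff' ranks first because it is frequent in the document and absent from the reference text, while 'the' ranks last because it is common in both.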

The module offers monolingual and bilingual extractions, and the languages that are currently supported are DE, EN, ES, FR and IT. There is no need to specify the language of the documents that you submit, as the module will recognise it automatically.

On this page, you will find out how to:

  • Create a TEM request
  • Use an exclusion list
  • Process data in TEM

Read more on user groups and access rights.

CREATE A TEM REQUEST

To create a term extraction request:

  • Go to the ‘Term processing’ menu, and click on the ‘Term Extraction Module (TEM)’ tab.
  • Use the sliding button to indicate whether the request is monolingual or bilingual.
  • For monolingual requests, upload one or more source documents and/or insert one or more URLs. For bilingual requests, also provide the target files and/or URLs.
    The most common editable formats are accepted (Word, Excel, PowerPoint, editable PDF, HTML, XML, CSV, etc.).
    For bilingual term extractions, the process is run on parallel documents or URLs, with English required as either the source or target language. The bilingual extraction will then propose pairs of candidate terms. Users with Administrator and Terminologist+ roles can additionally create bilingual raw entries from these pairs.
  • Name your term extraction request.
  • Choose whether to apply an exclusion list, i.e. a list of terms that should not be proposed as candidate terms.
  • Click on ‘Create’ to submit your TEM request.

As the next step, read more on how to retrieve a TEM request.

EXCLUSION LIST

You can either generate your own exclusion list using the template available for download at the bottom of the page, or use one of two proposed exclusion lists containing:

  • the most frequent EN words in the DGT corpus, or
  • the most duplicated EN terms in IATE.

To apply an exclusion list, upload at least two source (or target) files. You will then be given the option to mark one of them as an exclusion file (only one exclusion list can be applied per request).

You can also apply both proposed exclusion lists automatically (feature only available for English). Choose from the following three thresholds:

  • Low: excludes 33 % of the content of the two lists.
  • Medium: excludes 66 % of the content of the two lists.
  • High: excludes all the content of the two lists.
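
The three thresholds can be pictured with a short Python sketch. This page does not say how the 33 % or 66 % portion is chosen, so the sketch assumes that each proposed list is ordered with its most frequent items first and that the portion is taken from the top of each list. The function name `select_exclusions` and the sample words are hypothetical.

```python
import math

# Fractions taken from the threshold descriptions above.
THRESHOLDS = {"low": 0.33, "medium": 0.66, "high": 1.0}

def select_exclusions(frequent_words, duplicated_terms, threshold):
    """Return the set of terms to exclude for the given threshold.
    Assumption: both lists are ordered most frequent first, and the
    selected portion is taken from the top of each list."""
    fraction = THRESHOLDS[threshold]
    selected = set()
    for ranked_list in (frequent_words, duplicated_terms):
        cutoff = math.ceil(len(ranked_list) * fraction)
        selected.update(ranked_list[:cutoff])
    return selected

# Hypothetical stand-ins for the two proposed EN lists.
frequent_words = ["the", "of", "and", "to", "in", "for"]
duplicated_terms = ["committee", "regulation", "member state"]

low = select_exclusions(frequent_words, duplicated_terms, "low")
high = select_exclusions(frequent_words, duplicated_terms, "high")
```

With these sample lists, the low threshold keeps only the top entries of each list, while the high threshold excludes every entry of both.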

PROCESS DATA IN TEM

The Term Extraction Module processes your documents in two main stages:

  1. Pre-processing – preparing and analysing the text:
    • The system converts the uploaded text into a standard text (.txt) file.
    • A word tokeniser (UIMA) breaks the text down into individual words and punctuation marks, and it also finds their dictionary forms (lemmas) and root forms (stems).
    • A POS tagger (TreeTagger) identifies the part of speech for each word (like noun or verb).
    • A regex engine and a list of regex rules (also based on UIMA) find potential multi-word terms.
    • A contextualiser analyses the surrounding words for each single-word term.
    • The system applies a score calculation to each potential term and groups similar words together. This includes splitting multi-word terms and grouping variations based on prefixes, grammar structures, synonyms, and minor spelling differences.
  2. Post-processing – cleaning up the results:
    • The system applies a set of custom clean-up rules to filter out junk data, and it automatically removes:
      • Standalone two-letter words (e.g., ‘II’, ‘EU’, ‘OJ’).
      • Two-letter codes mixed with numbers or punctuation (e.g., ‘2014/17/EU’, ‘p.34’).
      • Words consisting only of repeated letters (e.g., ‘III’).
      • Email addresses and website links (containing ‘@’ or ‘.com’).
      • Standalone adjectives or adverbs (they are only kept if they are part of a larger multi-word term).
      • Any terms you included in an uploaded exclusion file.
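
The clean-up rules listed above can be approximated with a few simple checks. The following Python sketch is one possible reading of those rules, not the module's actual implementation. The function name `is_junk`, the part-of-speech labels 'ADJ' and 'ADV', and the exact interpretation of "two-letter codes mixed with numbers or punctuation" (here: a single token with one or two letters plus at least one digit or punctuation mark) are assumptions.

```python
import re

# A single word made of one letter repeated, e.g. 'III'.
REPEATED_LETTER = re.compile(r"^([A-Za-z])\1+$", re.IGNORECASE)

def is_junk(term, pos_tag=None, exclusions=frozenset()):
    """Return True if a candidate term should be filtered out.
    pos_tag uses hypothetical labels ('ADJ', 'ADV', 'NOUN', ...)."""
    text = term.strip()
    # Terms from an uploaded exclusion file.
    if text.lower() in exclusions:
        return True
    # Email addresses and website links.
    if "@" in text or ".com" in text.lower():
        return True
    if " " in text:
        # Multi-word terms pass the remaining single-word rules.
        return False
    # Standalone two-letter words such as 'II', 'EU', 'OJ'.
    if len(text) == 2 and text.isalpha():
        return True
    # Short codes mixed with digits or punctuation, e.g. '2014/17/EU', 'p.34'.
    letters = [ch for ch in text if ch.isalpha()]
    has_non_letter = any(not ch.isalpha() for ch in text)
    if has_non_letter and 1 <= len(letters) <= 2:
        return True
    # Words consisting only of repeated letters.
    if REPEATED_LETTER.match(text):
        return True
    # Standalone adjectives or adverbs.
    return pos_tag in {"ADJ", "ADV"}

# Hypothetical candidates with their part-of-speech labels.
candidates = [
    ("EU", "NOUN"), ("2014/17/EU", "NOUN"), ("p.34", "NOUN"),
    ("III", "NOUN"), ("info@example.com", "NOUN"), ("important", "ADJ"),
    ("important decision", "NOUN"), ("tariff quota", "NOUN"),
    ("committee", "NOUN"),
]
exclusions = frozenset({"committee"})
kept = [term for term, pos in candidates if not is_junk(term, pos, exclusions)]
```

Here only the two multi-word candidates survive: the adjective 'important' is dropped on its own but kept as part of 'important decision', which matches the rule on standalone adjectives.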

Any feedback to further improve this module will be very welcome.

(*) User groups and access rights

Check below to see which IATE user groups can create term extraction requests:

User group                               Create TEM request
NON-LOGGED-IN USER                       No
TRANSLATOR and above (except LIMITED)    Yes

Related Pages

TEM candidate management


This handbook is part of IATE, the European Union terminology portal.
