
memoQ 2015: Term Extraction


extracting terms, creating a TB, creating a term base, EC TB

Term extraction can be used to identify terms in a project and create a term base ready for translation.

At STP, the process of term extraction is instigated by the PM but performed almost entirely by the linguist. Scroll down for linguist instructions.

PM instructions

Before a large, term-heavy project, PMs can allow linguists to spend some time using term extraction in a project. Linguists will identify term candidates, translate them and generate a term base to be used for the translation project.
This can considerably improve consistency and translation speed and also provide valuable terminology for later use. 

Basic workflow

  1. PM sets up a new project and identifies it as suitable for term extraction
  2. PM ensures that an empty project TB is attached to the project
  3. PM requests term extraction from a linguist – preferably a translator set to work on the project
  4. PM ensures that all files in the project are accessible to the linguist (either by assigning them all to this linguist or by giving the linguist PM rights)
  5. Linguist performs term extraction and informs the PM when ready
  6. PM launches translation phase of project
  7. PM informs TrM and LT that term extraction has been performed and a TB generated

Suitable projects

Always consider term extraction before starting a project that matches the following criteria:
  • The project has at least 10K weighted words
  • The project has no previous client terminology
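
The criteria above can be read as a simple check. A minimal sketch, assuming that both criteria must hold and that the PM already knows the weighted word count; the function and parameter names are hypothetical and not part of memoQ:

```python
MIN_WEIGHTED_WORDS = 10_000  # "at least 10K weighted words"


def is_term_extraction_candidate(weighted_words: int,
                                 has_client_terminology: bool) -> bool:
    """Return True if a project meets both criteria listed above."""
    return weighted_words >= MIN_WEIGHTED_WORDS and not has_client_terminology


print(is_term_extraction_candidate(12_500, False))  # True
print(is_term_extraction_candidate(12_500, True))   # False
```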

Multilingual projects

When extracting terms, the linguist can translate them directly, but for multilingual projects, it is more efficient to only have the first linguist extract the terms (build the source language TB) and then have this linguist and the other target language translators translate the terms simultaneously.
The simplest way to do this is by using the Term Base Editor but as it only allows editing by one user at a time, it's not very convenient.
A better solution is to export the source language TB as a CSV and import it into the project as a multilingual delimited file. This CSV can then be translated and exported back out, then imported to the project TB again. Note that you probably want to give each linguist access to the main project files so they can look for context (this can be done by creating a view of all other files and using the text filter to look for context). Ask LT for help if needed.

Time

Allow about 2h for terminology extraction/translation on a 10K job. The translator shouldn't spend more time than that but should inform you that the [...]
For a project with multiple target languages, allow 1h for one translator to extract terms and 1h for each translator who will translate the terms for their TL.
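
The time allowance above can be worked out as follows. This is a rough sketch for a job of about 10K words only; the text does not say how the figures scale for larger jobs, and the function name is hypothetical:

```python
def term_extraction_hours(target_languages: int) -> int:
    """Hours to allow for term extraction/translation on a ~10K job."""
    if target_languages < 1:
        raise ValueError("a project needs at least one target language")
    if target_languages == 1:
        # One linguist extracts and translates the terms.
        return 2
    # 1h for extraction plus 1h per target language for translating the terms.
    return 1 + target_languages


print(term_extraction_hours(3))  # 4
```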

Term extraction after translation

PMs can also set up term extraction after a large term-heavy project is finished, to extract both the source terms and their existing translations. This can be used to generate a term base of common terms in the project for future use or upon client request. (Term extraction can also be used to extract term candidates from translation memories or LiveDocs corpora, but this is less common.)

Linguist instructions

Term extraction has two phases: creating the session and extracting/translating the terms.

Creating the term extraction session

Sessions can only be created by linguists, in the checked out copy of the main project. Linguists will set up the session as per the instructions below.

A new term extraction session is generated by clicking Extract Terms.

These are the parameters to be aware of:

Session name: By default, the session is named after the date it is created. You can go back to existing sessions using Term Extraction Sessions at a later point, so it's good to name the sessions something recognizable.

Sources: You generally want to perform the term extraction on every translation document, but PMs may specify that you perform it only on selected documents or even on a TM or LiveDocs corpus.

Options: Here is where you can fine-tune the extraction.

General options
Maximum length (words): The number of words in the longest term candidate. memoQ will not list expressions that are longer than this. The default value is 4, but for term-heavy texts it can be set to a lower value.
Minimum frequency: memoQ only lists candidates that occur in the source text at least as many times as the number specified here. For example, if the minimum frequency is 3, the list will contain candidates that occur 3 or more times in the source text. The default value is 3, which is usually fine.
Expression delimiters: This is a list of characters that mark the beginning or the end of a term candidate. memoQ will not extract expressions where one or more of these characters occur inside the expression. Usually the default settings will be ok.
Length factor: This is a number between 0.5 and 3 that controls how much memoQ should favor longer expressions. Each term candidate (that is, extracted expression) receives a score during the extraction process. The larger the length factor, the larger the difference will be between the score of a longer and a shorter expression. The default value is 1.5.
Ignore words with numbers: If this check box is checked, memoQ will not include expressions if there is a word in it that contains one or more digits. The check box is not checked by default.

Single-word term options
memoQ uses a different approach to extract single-word term candidates. The settings below control how they are extracted.
Minimum length (characters): memoQ does not list words that are shorter than the number specified here. For example, if the minimum length is 3, memoQ extracts single-word candidates that are 3 characters long or longer. The default value is 3. Increasing the value is a good way to weed out a lot of false positives. Note that this limitation does not apply to term candidates that contain multiple words.
Minimum frequency: This works just like the General option with the same name, but only applies to single word term candidates. The default value is 3.
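
To get a feel for how these options interact, the sketch below counts word sequences in a text and applies the same kinds of thresholds. It is an illustration only and not memoQ's actual extraction algorithm: the delimiter set, the lowercasing and the assumption that the General minimum frequency applies only to multi-word candidates are all choices made for this example.

```python
import re
from collections import Counter


def extract_candidates(text, max_words=4, min_freq=3, delimiters=".,;:!?()",
                       ignore_words_with_numbers=False,
                       single_min_chars=3, single_min_freq=3):
    """Count word n-grams and keep those that pass the thresholds."""
    text = text.lower()  # assumption: candidates are compared case-insensitively
    if delimiters:
        # Split at delimiter characters so no candidate contains one inside it.
        chunks = re.split("[" + re.escape(delimiters) + "]", text)
    else:
        chunks = [text]

    counts = Counter()
    for chunk in chunks:
        words = chunk.split()
        for n in range(1, max_words + 1):          # maximum length (words)
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1

    candidates = {}
    for ngram, freq in counts.items():
        has_digit = any(ch.isdigit() for word in ngram for ch in word)
        if ignore_words_with_numbers and has_digit:
            continue
        if len(ngram) == 1:
            # Single-word candidates have their own length and frequency limits.
            if len(ngram[0]) < single_min_chars or freq < single_min_freq:
                continue
        elif freq < min_freq:                      # general minimum frequency
            continue
        candidates[" ".join(ngram)] = freq
    return candidates


sample = "The base unit is on. The base unit is off. Check the base unit."
print(extract_candidates(sample))
print(extract_candidates(sample, single_min_chars=4))  # drops short words such as "the"
```

Raising the single-word minimum length or lowering the maximum length shortens the list noticeably, which is the effect described for the options above.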

Term base lookup
When extracting candidates, memoQ looks for expressions in the source-language text only. However, memoQ can retrieve possible translations for the extracted candidates by looking them up in term bases used in the same project.
Look up candidates: Check this if you want memoQ to look up translations for each candidate in the term bases used in the current project. The check box is checked by default (if any TBs are attached to the project).
All term bases in project: Choose this if you wish to look up the candidates in all term bases in the current project. This is the default setting.
Term base with the highest rank only: Choose this if you wish to look up the candidates in the highest ranked term base only.

Stop word lists
You can create and save stop word lists that can be used to limit the term candidates. Stop words are words that generally do not occur in terms or in a particular position in a term. For example, the word "if" is very unlikely to occur anywhere in a term, while the word "the" is very unlikely to occur at the end of a term.

  • Existing stop word lists can be selected from the dropdown menu.
  • Stop words are added by typing them in the Word text box and clicking Add. After adding a word, use the tickboxes to select if the word is a stop word when it occurs first in a term candidate, inside a term candidate or last in a term candidate.
  • Stop word lists can be saved by clicking Save As.
  • STP does not currently maintain any stop word lists.
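
The three tickboxes (first, inside, last) can be pictured as position rules applied to each candidate. A minimal sketch of that idea, assuming a candidate is rejected as soon as one of its words breaks a rule; the class and function names are hypothetical and this is not memoQ's own implementation:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StopWord:
    word: str
    first: bool = True    # stop word when it occurs first in a candidate
    inside: bool = True   # stop word when it occurs inside a candidate
    last: bool = True     # stop word when it occurs last in a candidate


def is_blocked(candidate: str, stop_words: list[StopWord]) -> bool:
    """Return True if any word of the candidate violates a stop word rule."""
    words = candidate.lower().split()
    rules = {sw.word.lower(): sw for sw in stop_words}
    last_index = len(words) - 1
    for position, word in enumerate(words):
        rule = rules.get(word)
        if rule is None:
            continue
        if position == 0 and rule.first:
            return True
        if position == last_index and rule.last:
            return True
        if 0 < position < last_index and rule.inside:
            return True
    return False


stop_list = [
    StopWord("if"),                                   # blocked in any position
    StopWord("the", first=False, inside=False),       # blocked only at the end
]
print(is_blocked("check if valid", stop_list))  # True
print(is_blocked("unit of the", stop_list))     # True
print(is_blocked("the base unit", stop_list))   # False
```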

Term extraction

Once the session is created, the linguist can start on the actual term extraction.

The term extraction process can be summarized in 3 steps:

  1. Mark candidates as either Accepted or Dropped
  2. Edit candidates where necessary before Accepting them (remove redundant words or "noise" and fix capitalization in the source term, add or insert translation in the target term)
  3. Send Accepted terms to the term base

The result of term extraction is a list of term candidates. The linguist's task is to decide whether a term should be added to the term base (and possibly translated), or dropped from the list.

The Term Candidates list

The candidate list table contains the following columns:
$ (score): A number that shows how confident memoQ is that the candidate is a valid term. It is computed on the basis of the frequency and the length of the candidate (multiplied with the Length factor). memoQ computes a higher score if the candidate or part of it was found in other term bases attached to the project.
H (hidden): Each row contains a check box that shows whether or not the candidate is hidden. You can click the check box to hide the candidate. Hidden candidates always remain visible, but if you re-sort the list (using the Re-sort now link or the Ctrl+R key shortcut), they will be sorted to the end of the list.
Status: Shows the status of the candidate:
 Candidate: not hidden, dropped, or accepted. Initially, the status of all term candidates is Candidate.
 Accepted: accepted using the Accept button or the Ctrl+Enter key shortcut. Only accepted candidates are copied to the final term base.
 Dropped: discarded from the list using the Drop button or the Ctrl+D key shortcut. Dropped candidates are sorted to the end of the list when you use the Re-sort now link or the Ctrl+R key shortcut.
Source: The source term as it was extracted from the text. You can edit this cell. The Source row can contain additional information:
 Also means the candidate was merged with another, and now they form a single term base entry that has two or more source-language alternatives. The word Also: is followed by a list of alternatives.
 Original means the candidate was edited, and the source-language expression is not the same as the one memoQ extracted from the text. The word Original: is followed by the original candidate.
Target: The target term. You can either type translations for the candidates, or have memoQ fill in the cells by clicking Look up terms now (provided there are other term bases in the project). Term base creation usually presupposes that linguists translate the accepted candidates before sending them to the term base, but they can also be translated at a later point. For example, the term base might be intended for a multilingual project; in this case it is most efficient if one linguist extracts the candidates (and translates them into one target language, if able) and the other linguists then simply translate the terms into their target languages in an exported CSV.

Editing the list

You can type in both the Source and Target columns. When you edit the term candidate in the Source column, its original form is displayed at the bottom of the cell, with the Original: label. If you want to enter two or more translations, separate them with a semicolon (;).

If other term bases returned one or more hits for the current candidate, they are displayed in the lower-right corner of the candidate list editor tab. If there are multiple hits, they are listed on the left side of the Term base results panel. You can either click a hit to display its details, or you can move up and down in the list using the Ctrl+Up arrow and Ctrl+Down arrow key shortcuts. The selected entry is displayed in a formatted layout on the right side of the Term base results panel. You can select text in this entry, and drag it to the target cell.

In the lower-left corner, the candidate list editor displays the Occurrences panel, which shows the passages in the source text where the term was found during term extraction. This is a concordance view. If there are no term base hits, or they are not relevant, you can use the context of the term to determine its translation. If the source document contains translations, those are also displayed. You can drag selected text from the Occurrences panel to the target cell. If there are multiple occurrences, you can navigate between them using Ctrl+Up arrow and Ctrl+Down arrow.

Filtering and cleaning up the candidate list

The candidate list may contain many irrelevant phrases that should not be accepted as terms. Some of the candidates may be synonyms (different phrases with the same meaning) or parts of others, and one member of the group of synonyms might be accepted as a term. This means you need to clean the list before you can use it.

When you clean the list, you can accept, drop (discard), hide, and merge candidates. Merging candidates means that you treat a group of candidates as a single terminology entry where the source-language term can take multiple forms.

Only accepted candidates will be copied into the final term base. Candidates you hide and don't accept will not be copied.

You can filter the list using the Filter text box. If you type a phrase in the text box, memoQ restricts the list to those candidates that contain the phrase you typed in. The text box can also be pulled down to select previous filters. If you check the Only with TB result check box, the list will contain only those candidates that have one or more hits in the term bases used in the term extraction session. To access the Filter text box using the keyboard, press Ctrl+Shift+F.

The Term Extraction ribbon

The Term Extraction ribbon provides the following commands to clean the term candidate list. All commands have a shortcut key to speed up your work.

  • Drop Term marks the current or selected candidate(s) as Dropped. Dropped candidates remain on the list, but when you sort the list again using the Re-sort now command, they are sorted to the end of the list. Shortcut key: Ctrl+D.
  • Accept as term marks the current candidate or the selected candidates as Accepted. Accepted candidates will be copied to the final term bases. When you sort the list using the Re-sort now command, accepted candidates are sorted to the beginning of the list. Shortcut key: Ctrl+Enter.
  • Hide/unhide shorter hides those candidates where the source term is part of the current candidate but shorter. Hidden candidates are sorted to the end of the list. If the shorter candidates are already hidden, this command unhides them (it works like a toggle). For example, if the current candidate is "base unit", the Hide shorter command will hide "base" and "unit". Shortcut key: Ctrl+L.
  • Merge candidates merges two or more selected candidates into a single candidate (a single row in the list). The new row shows the first selected candidate as the main term, but displays the other candidates, marked with the word Also:. This command is similar to the Join segments command in the translation grid, so its shortcut key is Ctrl+J.
  • Unmerge splits the current candidate (if it is a merged one) into separate candidates again. This command is similar to the Split segment command in the translation grid, so its shortcut key is Ctrl+T.
  • Prefix merge and hide looks for candidates in the list that have the same prefix as the current one. If memoQ finds two or more candidates with the same prefix, it automatically merges them. Normally, a source term must include a prefix marker – the pipe | character – to run this command (example: system|s). However, if there is no prefix marker in the source term, memoQ displays the No prefix markers in term dialog to ask for confirmation. If that is confirmed, the entire source term in the current candidate is used as a prefix. Use with caution. Shortcut key: Ctrl+M.
  • Add as stopword displays the New stop word dialog where you can add the selected text as a new stop word to the stop word list used in the current session. Shortcut key: Ctrl+W.
  • Target language drop-down list allows you to switch between target languages in the project. You can look up and enter translations in all target languages.
  • Accepted items provide lookup results check box will allow the term extraction session to work like an ordinary term base, even without actually sending the terms to a term base. When you work on a document in the translation grid, accepted terms from the term extraction session appear in the Translation results pane. The check box is checked by default.
  • Export To TaaS opens an upload dialog to upload the accepted terms to TaaS. Not used by STP.
  • Restart Session displays the Extract candidates dialog, and starts the session again. The candidate list is cleared, and a new one is created. You will lose all changes you made to the candidate list – use this command with care.
  • Export To Term Base opens the Export accepted terms to term base dialog, and copies the accepted terms and their translations to a term base you choose. This is usually the last step of a term extraction session.
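
Two of the commands above, Hide/unhide shorter and Prefix merge and hide, select other candidates based on the current one. The sketch below shows one plausible reading of that selection, assuming "shorter" means a shorter run of whole words and that the prefix is matched against the start of each candidate. Both functions are hypothetical illustrations and do not reproduce memoQ's internal logic.

```python
def shorter_parts(current: str, candidates: list[str]) -> list[str]:
    """Candidates whose words form a shorter contiguous part of `current`."""
    cur = current.split()
    hidden = []
    for cand in candidates:
        words = cand.split()
        if not words or len(words) >= len(cur):
            continue
        spans = range(len(cur) - len(words) + 1)
        if any(cur[i:i + len(words)] == words for i in spans):
            hidden.append(cand)
    return hidden


def prefix_group(current: str, candidates: list[str]) -> list[str]:
    """Candidates that start with the prefix marked by '|' in `current`."""
    # Without a marker the whole source term is used as the prefix.
    prefix = current.split("|", 1)[0]
    return [cand for cand in candidates if cand.startswith(prefix)]


terms = ["base", "unit", "base unit", "unit price"]
print(shorter_parts("base unit", terms))  # ['base', 'unit']

terms = ["system", "systems", "systematic", "base unit"]
print(prefix_group("system|s", terms))    # ['system', 'systems', 'systematic']
```

The second call shows why the prefix command should be used with caution: under this reading, an unrelated word such as "systematic" shares the prefix and would be merged along with the intended forms.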
 

Workflow examples

Example 1:

  • New project
  • One set of source documents to translate into multiple target languages
  • Task: create term base to send to client for confirmation before translation

Workflow

  1. PM creates the project and imports the source documents
  2. PM creates a blank project TB if none is in the project
  3. PM assigns term base generation task to ONE linguist, preferably a translator who will later translate the main project into one of the target languages
  4. Linguist checks out project and generates term extraction session. He/she can try various settings until the session looks acceptable, or consult LT.
  5. Linguist processes candidate list, Accepting or Dropping candidates and adjusting the Accepted source terms where necessary. No translation of terms should be done at this point.
  6. Linguist finishes term extraction by exporting all Accepted source terms to the TB
  7. PM exports a CSV version of the TB, imports it back into the project as a multilingual delimited file and assigns it to all translators (including the one who extracted the candidates) so they can start translating the terms.
  8. Once all terms have been translated, PM exports the CSV.
  9. PM sends the CSV to the client for confirmation. The client can make changes to the CSV directly in Excel for any of the languages.
  10. PM imports the CSV back into a fresh TB and commences translation phase of project.
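
Before the CSV goes to the client in step 9, it can be useful to confirm that every term has a translation in every target language. The following sketch does this with Python's csv module. The column names and sample rows are invented for the example; a real memoQ export may use a different layout, so adjust the source column name accordingly.

```python
import csv
import io


def missing_translations(csv_text: str, source_column: str) -> dict[str, list[str]]:
    """Map each target-language column to the source terms it has no translation for."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing: dict[str, list[str]] = {}
    for row in reader:
        for column, value in row.items():
            # Skip the source column and any surplus, unnamed cells.
            if column is None or column == source_column:
                continue
            if not (value or "").strip():
                missing.setdefault(column, []).append(row[source_column])
    return missing


# Hypothetical three-language sheet with two gaps.
sample = (
    "English,German,Swedish\n"
    "base unit,Basiseinheit,\n"
    "control panel,,kontrollpanel\n"
)
for language, terms in missing_translations(sample, "English").items():
    print(language, "is missing:", ", ".join(terms))
```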


Example 2:

  • Delivered project(s)
  • One target language
  • Task: client requests building of term base based on one or more previous delivered projects

Workflow

  1. PM creates a blank project TB if none is in the project
  2. PM assigns a linguist (ideally the original translator or reviser) to the term base generation task
  3. Linguist checks out project and generates term extraction session. He/she can try various settings until the session looks acceptable, or consult LT.
  4. Linguist processes candidate list, Accepting or Dropping candidates, adjusting the Accepted source terms where necessary and selecting target terms from the translation directly.
  5. Linguist finishes term extraction by exporting all Accepted terms to the TB
  6. PM exports TB as CSV sheet and delivers to client for confirmation. Client can make changes to the CSV directly in Excel and send back for updating our TB if needed.
