Wednesday, 11 April 2012

English terminography

Sirius Computing has added terminography support to the Sirius English Text Analyser. A software tool needs terminological information to understand natural language, but compiling that information into a format the computer understands is a very big task. The Sirius English Text Analyser therefore tries to make it easy for terminographers to compile such a term base.


We are now using e-books to teach the English Text Analyser terminology. When we paste a text (or a fragment of one) into the software tool, the English Text Analyser displays the words (tokens) that it does not yet know. The terminographer should then assign the appropriate lexical category to the word lemma. The English Text Analyser uses built-in morphological rules to assist the terminographer. For example, if a word ends in -ly, the software tool automatically marks the word as an adverb. If a word ends in -ed or -ing, the Text Analyser will select the lexical category 'Verb'. By default, the lexical category 'Noun' is selected.
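The suffix rules described above can be sketched in a few lines of Python. This is only an illustration, not the Text Analyser's actual code: the function name `guess_category` and the order in which the rules are checked are assumptions.

```python
def guess_category(word: str) -> str:
    """Propose a lexical category from the word ending (hypothetical helper)."""
    lower = word.lower()
    if lower.endswith("ly"):
        return "Adverb"
    # str.endswith accepts a tuple of alternative suffixes.
    if lower.endswith(("ed", "ing")):
        return "Verb"
    # No rule matched: fall back to the default category.
    return "Noun"
```

Such rules are only a first guess. In this sketch 'family' would be proposed as an adverb and 'bed' as a verb, which is why the terminographer confirms or corrects the category.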
If the terminographer selects the lexical category 'Proper noun', the proposed word lemma will be the original word.
The software tool already tries to propose the most likely word lemma based on morphological analysis. For example, if a word ends in -s or -es, the proposed word lemma drops the -s or -es. Of course, the terminographer can easily change the word lemma.
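A minimal sketch of such a lemma proposal follows. The text does not say how the tool chooses between dropping -s and dropping -es, so the rule used here (drop -es only after -ss, -x, -z, -ch or -sh) is an assumption, as are the names `propose_lemma` and `SIBILANT_ENDINGS`.

```python
# Assumed rule: -es is only treated as the plural ending after these stems.
SIBILANT_ENDINGS = ("ss", "x", "z", "ch", "sh")


def propose_lemma(word: str, category: str = "Noun") -> str:
    """Propose a lemma for a word form (hypothetical helper)."""
    if category == "Proper noun":
        # Proper nouns keep their original form.
        return word
    if word.endswith("es") and word[:-2].endswith(SIBILANT_ENDINGS):
        return word[:-2]
    # Words such as 'class' end in -ss and are left alone.
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word
```

With these rules 'boxes' becomes 'box' and 'houses' becomes 'house', while 'Paris' marked as a proper noun stays 'Paris'. A word like 'buses' would still get the wrong proposal 'buse', so the terminographer keeps the last word.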
In some cases, we also add other word forms as terms. The word 'children', for example, should be added as a term; otherwise, the software tool would not be able to recognize the word. Via the Terminology tab, a term can be marked as plural and its root form (lemma) can be specified. Similarly, the term 'further' should be added to the database. This term can then be marked as a comparative of the root form 'far' via the Terminology tab.
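Irregular forms like these cannot be derived by suffix rules, so they have to be stored explicitly. One possible way to picture this is a small lookup table; the names `IrregularForm`, `IRREGULAR_FORMS` and `lookup_irregular` are made up for this illustration and say nothing about how the database is really organised.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IrregularForm:
    lemma: str    # root form the term points to
    feature: str  # how the term relates to the root form


# Hypothetical entries, taken from the two examples in the text.
IRREGULAR_FORMS = {
    "children": IrregularForm(lemma="child", feature="plural"),
    "further": IrregularForm(lemma="far", feature="comparative"),
}


def lookup_irregular(word: str) -> Optional[IrregularForm]:
    """Return the stored irregular form, or None if the word is not listed."""
    return IRREGULAR_FORMS.get(word.lower())
```

A lookup of 'children' yields the lemma 'child' marked as plural, while an unlisted word such as 'table' yields None and can be handled by the morphological rules instead.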


For a verb, the terminographer should specify the verb conjugation information. However, when the terminographer selects the lexical category 'Verb', the software tool automatically proposes verb conjugation information based on morphology.
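The text does not list which conjugation information the tool proposes. As a rough sketch, assuming it covers the third person singular, the past form and the present participle of regular verbs, a proposal could look like this (the function name and the dictionary keys are hypothetical):

```python
def propose_conjugation(lemma: str) -> dict:
    """Propose regular conjugation forms for a verb lemma (hypothetical helper)."""
    if lemma.endswith("e"):
        past = lemma + "d"
        # Drop a silent final -e before -ing, but keep a double e ('free').
        if lemma.endswith("ee"):
            participle = lemma + "ing"
        else:
            participle = lemma[:-1] + "ing"
    else:
        past = lemma + "ed"
        participle = lemma + "ing"
    # Verbs ending in a sibilant take -es in the third person singular.
    sibilant = lemma.endswith(("s", "x", "z", "ch", "sh"))
    third_person = lemma + ("es" if sibilant else "s")
    return {
        "third_person": third_person,
        "past": past,
        "present_participle": participle,
    }
```

This sketch ignores consonant doubling ('stop', 'stopped'), the change from -y to -i ('carry', 'carried') and irregular verbs, so a proposal of this kind still needs review by the terminographer.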

Pressing the 'Add lemma' button adds only the word lemma to the database and selects the next token for terminography. A lemma does not hold any further linguistic information.
To include the lexical category (and possibly the verb conjugation), the terminographer should press the 'Add term' button. This adds the word as a term. If desired, the terminographer may specify the meaning of the term with a second, disambiguating term. For example, to add the term 'fly' as an insect instead of a zipper, the terminographer could add the disambiguating term 'insect' after {.
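One way to picture a term with a disambiguating term is as a pair of term and sense, so that the same word can be stored more than once. The sketch below is an assumption about the idea only; the class `TermBase` and its methods are hypothetical and do not reflect the tool's actual storage or its input notation.

```python
from typing import Dict, List, Optional, Tuple


class TermBase:
    """Toy term base keyed by (term, disambiguating term)."""

    def __init__(self) -> None:
        self._terms: Dict[Tuple[str, Optional[str]], str] = {}

    def add_term(self, term: str, category: str, sense: Optional[str] = None) -> None:
        # The same term may be stored once per disambiguating term.
        self._terms[(term, sense)] = category

    def senses(self, term: str) -> List[str]:
        """List the disambiguating terms recorded for a term, sorted."""
        return sorted(s for (t, s) in self._terms if t == term and s is not None)
```

After adding 'fly' once with the sense 'insect' and once with the sense 'zipper', `senses("fly")` returns both, so the two meanings stay separate entries.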

Conclusion: The Sirius English Text Analyser makes it easier than ever to teach the computer English vocabulary and terminology.
I hope to describe soon how we deal with other linguistic information, such as grammar and phraseology.