Saturday 29 August 2009

Dutch Text Interpretation Aid 4

Today I added support for nouns and other lexical categories to the Dutch Text Interpretation Aid software tool. I downloaded the lists of adjectives, adverbs, conjunctions, nouns, prepositions, and pronouns available at www.muiswerk.nl. Then I used some software routines to retrieve dictionary information for the lemmas, where available on Wiktionary. As you can see in the following screenshot, a lot of words are now recognised. To further expand the lexicon, I will develop functionality to add lemmas (with conjugation information for verbs) and to manage the lemmatization rules within the software tool.


Wednesday 19 August 2009

Dutch Text Interpretation Aid 3

Today I wrote a small software tool that retrieves dictionary information from Wiktionary. This technique is often called screen scraping, since it extracts information from pages that are meant to be displayed on screen (in an Internet browser).
Most of the verb lemmas recognized by the Dutch verb lemmatizer are already described on Wiktionary. This dictionary information could thus be added to the electronic dictionary used in the Dutch Text Interpretation Aid software tool. I will manually add dictionary entries for the verb lemmas that do not yet exist later on.



Next, I will add dictionary entries for nouns and other lexical categories.

Sunday 16 August 2009

Dutch Text Interpretation Aid 2

The software tool, called Dutch Text Interpretation Aid, is now linked to an electronic dictionary.

Any text can be pasted into the text pane. The tool analyzes the words in the text and checks whether a lemma can be found. If a lemma is found, information about the lemma is looked up in the electronic dictionary. If a dictionary entry is found, the word is underlined in green. If a lemma is found but no dictionary entry, the word is underlined in orange.
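
The decision described above can be sketched in a few lines of Java. This is only an illustration under my own assumptions: the names WordMarker and Mark are hypothetical, the lemmatizer is modelled as a plain function, and the dictionary as a map from lemma to entry text. The actual tool may be organised differently.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch of the underline decision; not the tool's actual code. */
final class WordMarker {

    /** How a word is marked in the text pane. */
    enum Mark { NONE, ORANGE, GREEN }

    private final Function<String, Optional<String>> lemmatizer; // word form -> lemma, if any
    private final Map<String, String> dictionary;                // lemma -> dictionary entry

    WordMarker(Function<String, Optional<String>> lemmatizer, Map<String, String> dictionary) {
        this.lemmatizer = lemmatizer;
        this.dictionary = dictionary;
    }

    /** Returns GREEN for lemma plus entry, ORANGE for lemma only, NONE otherwise. */
    Mark mark(String word) {
        Optional<String> lemma = lemmatizer.apply(word.toLowerCase(Locale.ROOT));
        if (!lemma.isPresent()) {
            return Mark.NONE; // no lemma found: the word is not underlined
        }
        return dictionary.containsKey(lemma.get()) ? Mark.GREEN : Mark.ORANGE;
    }
}
```

Keeping the lemmatizer behind a function makes it easy to plug in lemmatizers for other lexical categories later without touching the marking logic.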

While hovering the mouse over the text, the text field at the bottom displays information about the word under the mouse pointer, i.e. lemma and dictionary information.



Currently, only the Dutch verb lemmatizer is used for lemmatisation. As you can see, only a few verbs have a dictionary entry so far, so a lot of work remains to be done.

Tuesday 11 August 2009

Dutch Text Interpretation Aid

As mentioned in my previous post, I wanted to develop a Dutch tokenizer that could be used to identify words in a Dutch text. A word could then be fed to a Dutch lemmatizer to find the lemma of the word. Using the lemma, dictionary information about the selected word might be found.

However, while investigating the possibilities for such a Dutch tokenizer, I found out that two Java methods already exist, i.e. Utilities.getWordStart() and Utilities.getWordEnd(), that may be used to identify a word in a text. Therefore, I decided to use these utilities instead of developing a tokenizer of my own.
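
As a minimal sketch of how these two methods can pick out a word, assuming the Utilities class meant here is javax.swing.text.Utilities and the text lives in a Swing text component: the class name WordAtOffset and the helper wordAt are hypothetical, and the offset is passed in directly rather than derived from the mouse position.

```java
import javax.swing.JTextArea;
import javax.swing.text.BadLocationException;
import javax.swing.text.Utilities;

/** Hypothetical helper that extracts the word around a character offset. */
final class WordAtOffset {

    /** Returns the word of the text that contains the given offset. */
    static String wordAt(String text, int offset) {
        JTextArea area = new JTextArea(text);
        try {
            // Both calls locate word boundaries around the offset.
            int start = Utilities.getWordStart(area, offset);
            int end = Utilities.getWordEnd(area, offset);
            return area.getDocument().getText(start, end - start);
        } catch (BadLocationException e) {
            throw new IllegalArgumentException("Offset outside the text: " + offset, e);
        }
    }
}
```

In the tool itself the offset would presumably come from the position of the mouse pointer in the text pane; that part is left out of this sketch.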

The following screenshot displays a prototype of the Dutch text interpretation aid I want to develop. Any text can be pasted into the text pane. While hovering the mouse over the text, the text field at the bottom should display information about the word under the mouse pointer.



Next, I will link an electronic dictionary to the tool to provide the necessary dictionary information.

Sunday 9 August 2009

Dutch Lemmatizer 2

Today I added support for Dutch noun lemmatisation in the Dutch lemmatizer.

Most Dutch nouns change when used in the plural or as a diminutive. For example, the plural of been (leg) is benen and the diminutive of schip (ship) is scheepje. It is now possible to add lemmatisation rules (based on the ending of the noun) to the Dutch lemmatizer.

In many cases the plural of a noun is formed by adding -en to the lemma. A good rule to derive the lemma from the plural form is then to remove the ending -en. Exceptions to this rule are nouns whose lemma ends in -e or -en. Therefore, it is possible to manage these nouns in a separate list.
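
The rule and its exception list could look roughly like this in Java. The class NounPluralRule and its methods are hypothetical names chosen for this sketch, and the example exceptions (keuken, ziekten) are my own. Spelling changes such as benen to been are deliberately not handled here.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the "remove -en" rule with a separate exception list. */
final class NounPluralRule {

    // Exceptions map a word form directly to its lemma.
    private final Map<String, String> exceptions = new HashMap<>();

    void addException(String form, String lemma) {
        exceptions.put(form, lemma);
    }

    /** Proposes a lemma for a noun form: exceptions first, then the -en rule. */
    String lemma(String form) {
        String known = exceptions.get(form);
        if (known != null) {
            return known;
        }
        if (form.length() > 2 && form.endsWith("en")) {
            return form.substring(0, form.length() - 2); // strip the plural ending
        }
        return form; // no rule applies: propose the form itself
    }
}
```

For example, with the exceptions keuken (a lemma that already ends in -en) and ziekten (plural of ziekte, a lemma ending in -e) registered, boeken still falls through to the general rule and yields boek.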



I would like to use the Dutch lemmatizer on Dutch electronic texts to provide a dictionary lookup functionality. Therefore, I will next develop a Dutch tokenizer that separates a Dutch text into tokens, i.e. words and punctuation.

Saturday 8 August 2009

Dutch Lemmatizer

I just added support for the lemmatisation of adjectives in the Dutch lemmatizer.

Adjectives are words that modify nouns. An adjective generally occurs in two forms, an undeclined one and a declined one ending in -e. I found a good description of the rules at Wikibooks.



Adjectives are also modified to form comparatives and superlatives. For example:
goed - beter - best.
Such special cases may be added as rules like:
beter=>goed
best=>goed
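
A small Java sketch of how such rule lines might be read and applied. The class AdjectiveRules is a hypothetical name, and the assumption that each rule is a single line of the form form=>lemma is taken from the two examples above. The general rules for declined forms are not covered.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: special-case rules written as "form=>lemma". */
final class AdjectiveRules {

    private final Map<String, String> rules = new HashMap<>();

    /** Parses one rule line such as "beter=>goed" and stores it. */
    void addRule(String line) {
        String[] parts = line.split("=>", 2);
        if (parts.length != 2 || parts[0].trim().isEmpty() || parts[1].trim().isEmpty()) {
            throw new IllegalArgumentException("Not a rule: " + line);
        }
        rules.put(parts[0].trim(), parts[1].trim());
    }

    /** Returns the lemma for a special case, or the form itself if no rule matches. */
    String lemma(String form) {
        return rules.getOrDefault(form, form);
    }
}
```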

Ordinal numbers may also be considered as adjectives, so the lemmatizer should propose the cardinal number of any ordinal number. For example:
drie - derde.

In most cases, the cardinal number is found by deleting the ending -de or -ste of the ordinal number.
For example:
twee - tweede
twintig - twintigste

Since this rule may only be applied to ordinal numbers, a list of cardinal numbers can be maintained to check the result. This list need not be long, since it is sufficient to cover the cardinal numbers in which the lemma might end.
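
The check against the list of cardinal numbers might be sketched as follows. OrdinalRule is a hypothetical name, the list of endings is illustrative and incomplete, and the irregular forms (derde, eerste) are handled through a small lookup table, which is my own assumption about how to treat them.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: derive a cardinal number from an ordinal number. */
final class OrdinalRule {

    // Illustrative, incomplete list of cardinals in which a lemma may end.
    private static final List<String> CARDINAL_ENDINGS = Arrays.asList(
        "twee", "vier", "vijf", "zes", "zeven", "acht", "negen", "tien",
        "elf", "twaalf", "twintig", "honderd", "duizend");

    // Irregular ordinals that the suffix rule cannot handle.
    private static final Map<String, String> IRREGULAR = new HashMap<>();
    static {
        IRREGULAR.put("derde", "drie");
        IRREGULAR.put("eerste", "een");
    }

    private static final String[] SUFFIXES = {"ste", "de"};

    /** Returns the cardinal number, or empty if the form is not a known ordinal. */
    static Optional<String> cardinal(String form) {
        String special = IRREGULAR.get(form);
        if (special != null) {
            return Optional.of(special);
        }
        for (String suffix : SUFFIXES) {
            if (form.endsWith(suffix)) {
                String stem = form.substring(0, form.length() - suffix.length());
                if (endsWithCardinal(stem)) {
                    return Optional.of(stem); // accept only if the stem ends in a cardinal
                }
            }
        }
        return Optional.empty();
    }

    private static boolean endsWithCardinal(String stem) {
        for (String cardinal : CARDINAL_ENDINGS) {
            if (stem.endsWith(cardinal)) {
                return true;
            }
        }
        return false;
    }
}
```

Because only the ending of the stem is checked, a compound such as vijfentwintigste is covered by the single entry twintig, while an ordinary word such as aarde is rejected.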

Next, I will add support for Dutch adverbs and nouns in the lemmatizer.

Wednesday 5 August 2009

Dutch Verb Lemmatizer

Today I developed a first version of a Dutch verb lemmatizer. The lemmatizer uses the validated verb information generated by the Dutch Verb Conjugation tool. Given a verb form, the lemmatizer proposes the correct lemma and displays an explanation.
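
One way to picture such a lemmatizer is as a reverse index from validated verb forms to their lemma and a short explanation. The Java sketch below rests on that assumption. VerbFormIndex and Analysis are hypothetical names, and the way the conjugation tool actually stores its data is not shown here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: look up a verb form and get lemma plus explanation. */
final class VerbFormIndex {

    /** One possible reading of a verb form. */
    static final class Analysis {
        final String lemma;
        final String explanation;

        Analysis(String lemma, String explanation) {
            this.lemma = lemma;
            this.explanation = explanation;
        }
    }

    // A form may have several readings, so each form maps to a list.
    private final Map<String, List<Analysis>> byForm = new HashMap<>();

    /** Registers a validated form, e.g. ("konden aan", "aankunnen", "simple past, plural"). */
    void add(String form, String lemma, String explanation) {
        byForm.computeIfAbsent(form, k -> new ArrayList<>())
              .add(new Analysis(lemma, explanation));
    }

    /** Returns all readings of the form, or an empty list if it is unknown. */
    List<Analysis> lookup(String form) {
        return byForm.getOrDefault(form, Collections.emptyList());
    }
}
```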



Next, I will try to extend the lemmatizer for Dutch adjectives, adverbs, and nouns.

Sunday 2 August 2009

Dutch Verb Conjugation 3

As described in my previous post, I am developing a Dutch verb conjugator that generates verb forms for the simple present and simple past based on basic verb information, e.g. aandrijven - dreef aan - heeft aangedreven.

The software is currently able to conjugate all regular verbs (available at www.muiswerk.nl).

I also added functionality to correct and save verb information. This is necessary because Dutch has quite a few irregular verbs, i.e. verbs whose conjugated forms cannot all be derived from the basic verb information.



For example, the verb aankunnen is described with the following basic information:
aankunnen - kon aan - heeft aangekund

The conjugation of the verb for the simple present is:
1st person, singular: kan aan
2nd person, singular: kan aan
3rd person, singular: kan aan
1st person, plural: kunnen aan
2nd person, plural: kunnen aan
3rd person, plural: kunnen aan

The conjugation of the verb for the simple past is:
1st person, singular: kon aan
2nd person, singular: kon aan
3rd person, singular: kon aan
1st person, plural: konden aan
2nd person, plural: konden aan
3rd person, plural: konden aan

This example shows that the conjugation of the simple present and simple past of the verb aankunnen does not follow the normal conjugation rules.

Currently, I am correcting and saving the conjugated forms for the verbs available at www.muiswerk.nl. Once that is done, I will develop the Dutch verb lemmatizer.