Sunday 6 December 2009

JavaDB for Dutch Text Interpretation Aid

Finally, I found some time to develop a JavaDB database to store the contents of the Dutch Text Interpretation Aid.

I believe this is important because a database scales better than plain text files for storing and retrieving information. The database also supports collaboration by recording time stamps and user information. I plan to distribute the software tool to interested users, who may contribute to and make use of the shared database.

Including all word forms for Dutch takes a huge amount of time and energy; Dutch is, after all, a highly inflected language. By dividing the work over many users, I hope that the benefits will outweigh the costs.

Currently, the tool recognises about 90% of the words in a news article. About 2% of the missing words are proper nouns (see the figure below).

Friday 18 September 2009

Gold Standard for Dutch Lemmatizer

I already use the Dutch Text Interpretation Aid to expand the Dutch lexicon and to improve the Dutch lemmatizer. Therefore, I included a gold standard in the software tool. Each time a lemma is added, a gold standard entry is added as well, consisting of the lexical category, the word, and the lemma.

Afterwards, this gold standard may be used to verify the lemmatizer.
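
A minimal sketch of how such a gold standard might be stored and used for verification. The class and method names (GoldStandard, Entry, mismatches) are hypothetical and are not taken from the tool, and the lemmatizer is assumed to be a simple function from word to lemma that ignores the lexical category.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical gold standard: one entry per (lexical category, word, lemma). */
final class GoldStandard {

    /** One gold standard entry, holding the three fields named in the post. */
    static final class Entry {
        final String category;
        final String word;
        final String lemma;

        Entry(String category, String word, String lemma) {
            this.category = category;
            this.word = word;
            this.lemma = lemma;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Would be called whenever a lemma is added in the tool. */
    void add(String category, String word, String lemma) {
        entries.add(new Entry(category, word, lemma));
    }

    /** Returns the entries for which the lemmatizer proposes a different lemma. */
    List<Entry> mismatches(Function<String, String> lemmatizer) {
        List<Entry> wrong = new ArrayList<>();
        for (Entry e : entries) {
            if (!e.lemma.equals(lemmatizer.apply(e.word))) {
                wrong.add(e);
            }
        }
        return wrong;
    }
}
```

An empty list from mismatches would mean that the lemmatizer reproduces every lemma recorded so far.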

Sunday 13 September 2009

Dutch Text Interpretation Aid 5

It is now possible to add adjectives, adverbs, nouns, proper nouns, verbs, and other word types to the Dutch Text Interpretation Aid. The procedure is very simple:
1. Paste a text into the text pane.
Words that are not yet recognized are underlined in red. If such an unknown word starts with a capital letter, it is underlined in orange; these words are usually proper nouns.
2. Select an underlined word and open the popup menu (on release of the left mouse button).
Choose the appropriate lexical category (adjective, adverb, noun, proper noun, verb, or other) from the popup menu.



3. Check or edit the word properties (e.g. lemma or verb conjugation) in the corresponding dialog window.
3.a. For proper nouns the following window is displayed.



3.b. For verbs the following window is displayed.



3.c. For adjectives, adverbs, nouns, and other word types the following window is displayed.



The following tasks are now on my to-do list:
1. Store the information in a database instead of in the current text files.
At the moment, some lemmas and the corresponding dictionary information are repeated in several text files. For example, the lemma groot appears both as an adverb and as an adjective.
2. Add functionality to update the dictionary information in the Dutch Text Interpretation Aid.
3. Add help information.

Saturday 29 August 2009

Dutch Text Interpretation Aid 4

Today I added support for nouns and other lexical categories in the Dutch Text Interpretation Aid software tool. I downloaded the lists of adjectives, adverbs, conjunctions, nouns, prepositions, and pronouns available at www.muiswerk.nl. Then I used some software routines to retrieve dictionary information for the lemmas, where available on Wiktionary. As you can see in the following screenshot, a lot of words are now recognised. To further expand the lexicon, I will develop functionality to add lemmas (with conjugation information for verbs) and to manage the lemmatization rules within the software tool.


Wednesday 19 August 2009

Dutch Text Interpretation Aid 3

Today I wrote a small software tool that retrieves dictionary information from Wiktionary. This technique is often called screen scraping, since it extracts the information from the page as it is displayed in the Internet browser.
Most of the verb lemmas recognized by the Dutch verb lemmatizer are already described on Wiktionary. This dictionary information could thus be added to the electronic dictionary used in the Dutch Text Interpretation Aid software tool. I will add the dictionary entries for verb lemmas that do not yet exist on Wiktionary manually later on.



Next, I will add dictionary entries for nouns and other lexical categories.

Sunday 16 August 2009

Dutch Text Interpretation Aid 2

The software tool, called Dutch Text Interpretation Aid, is now linked to an electronic dictionary.

Any text can be pasted into the text pane. The tool analyzes the words in the text and checks whether a lemma can be found. If a lemma is found, information about the lemma is looked up in the electronic dictionary. If a dictionary entry is found, the word is underlined in green. If a lemma is found but no dictionary entry, the word is underlined in orange.
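
The colouring decision described above could be sketched roughly as follows. The names (UnderlineRule, classify) are hypothetical, the lemmatizer is assumed to return null when no lemma is found, and the NONE case for words without a lemma is an assumption, since this post does not say how such words are shown.

```java
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical helper that picks the underline colour for a word. */
final class UnderlineRule {

    enum Underline { GREEN, ORANGE, NONE }

    /**
     * @param lemmatizer   returns the lemma of a word, or null if none is found (assumption)
     * @param inDictionary tells whether the electronic dictionary has an entry for a lemma
     */
    static Underline classify(String word,
                              Function<String, String> lemmatizer,
                              Predicate<String> inDictionary) {
        String lemma = lemmatizer.apply(word);
        if (lemma == null) {
            return Underline.NONE;   // no lemma found: left unmarked in this sketch
        }
        // Lemma found: green when a dictionary entry exists, orange otherwise.
        return inDictionary.test(lemma) ? Underline.GREEN : Underline.ORANGE;
    }
}
```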

While hovering the mouse over the text, the text field at the bottom displays information about the word under the mouse pointer, i.e. lemma and dictionary information.



Currently, only the Dutch verb lemmatizer is used for lemmatisation. As you can see, few verbs already have a dictionary entry, so a lot of work remains to be done.

Tuesday 11 August 2009

Dutch Text Interpretation Aid

As mentioned in my previous post, I wanted to develop a Dutch tokenizer that could be used to identify words in a Dutch text. A word could then be fed to a Dutch lemmatizer to find its lemma. Using the lemma, dictionary information about the selected word might be found.

However, while investigating the possibilities for such a Dutch tokenizer, I found out that Java already provides two functions, Utilities.getWordStart() and Utilities.getWordEnd(), that can be used to identify a word in a text. Therefore, I decided to use these utilities instead of developing a tokenizer of my own.
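
As a rough illustration, the two methods live in javax.swing.text.Utilities and take a text component plus a character offset. The sketch below is an assumption about how they could be combined, not the tool's actual code; the class name WordUnderCaret and the method wordAt are hypothetical.

```java
import javax.swing.JTextArea;
import javax.swing.text.BadLocationException;
import javax.swing.text.JTextComponent;
import javax.swing.text.Utilities;

/** Hypothetical helper that extracts the word at a given offset in a text pane. */
final class WordUnderCaret {

    /** Returns the word that contains the given document offset. */
    static String wordAt(JTextComponent pane, int offset) {
        try {
            int start = Utilities.getWordStart(pane, offset);
            int end = Utilities.getWordEnd(pane, offset);
            return pane.getDocument().getText(start, end - start);
        } catch (BadLocationException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new IllegalArgumentException("No word at offset " + offset, e);
        }
    }

    public static void main(String[] args) {
        JTextArea pane = new JTextArea("De schepen varen uit.");
        System.out.println(wordAt(pane, 5));
    }
}
```

In a mouse-motion listener, the offset could for example be obtained from the mouse position with the text component's viewToModel method.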

The following screenshot displays a prototype of the Dutch text interpretation aid I want to develop. Any text can be pasted into the text pane. While hovering the mouse over the text, the text field at the bottom should display information about the word under the mouse pointer.



Next, I will link an electronic dictionary to the tool to provide the necessary dictionary information.

Sunday 9 August 2009

Dutch Lemmatizer 2

Today I added support for Dutch noun lemmatisation in the Dutch lemmatizer.

Most Dutch nouns change when used in the plural or as a diminutive. For example, the plural of been (leg) is benen and the diminutive of schip (ship) is scheepje. It is now possible to add lemmatisation rules (based on the ending of the noun) to the Dutch lemmatizer.

In many cases the plural of a noun is formed by adding -en to the lemma. A good rule to derive the lemma from the plural form is then to remove the ending -en. Exceptions to this rule are nouns whose lemma itself ends in -e or -en. Therefore, it is possible to manage these nouns in a separate list.
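
A minimal sketch of such an ending-based rule with a separate exception list. The class name NounLemmatizer and its fields are hypothetical. The irregular forms benen and scheepje come from the examples above, while keuken and boeken are illustrative words chosen for this sketch. Handling vowel changes through a lookup table is an assumption, not necessarily how the tool does it.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical ending-based noun lemmatizer with exception lists. */
final class NounLemmatizer {

    // Forms whose lemma cannot be found by stripping an ending.
    private final Map<String, String> irregular = new HashMap<>();
    // Nouns whose lemma itself ends in -e or -en and must not be stripped.
    private final Set<String> keepAsIs = new HashSet<>();

    NounLemmatizer() {
        irregular.put("benen", "been");
        irregular.put("scheepje", "schip");
        keepAsIs.add("keuken");   // illustrative entry
    }

    /** Proposes a lemma: exceptions first, then the general -en rule. */
    String lemma(String noun) {
        if (irregular.containsKey(noun)) {
            return irregular.get(noun);
        }
        if (keepAsIs.contains(noun)) {
            return noun;
        }
        if (noun.length() > 2 && noun.endsWith("en")) {
            return noun.substring(0, noun.length() - 2);
        }
        return noun;
    }
}
```

Checking the exception lists before the general rule keeps the rule itself simple: new exceptions only require a new list entry, not a code change.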



I would like to use the Dutch lemmatizer on Dutch electronic texts to provide a dictionary lookup functionality. Therefore, I will next develop a Dutch tokenizer that separates a Dutch text in tokens, i.e. words and punctuation.

Saturday 8 August 2009

Dutch Lemmatizer

I just added support for the lemmatisation of adjectives in the Dutch lemmatizer.

Adjectives are words that modify nouns. An adjective generally occurs in two forms: an undeclined one and a declined one ending in -e. I found a good description of the rules at Wikibooks.



Adjectives are also modified to form comparatives and superlatives. For example:
goed - beter - best.
Such special cases may be added as rules like:
beter=>goed
best=>goed
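
Rules in this form=>lemma notation could be read into a lookup table along the following lines. This is only a sketch; the class name ExceptionRules and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical parser for special-case rules written as form=>lemma. */
final class ExceptionRules {

    /** Parses lines such as "beter=>goed" into a form-to-lemma map. */
    static Map<String, String> parse(String... lines) {
        Map<String, String> rules = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split("=>", 2);
            if (parts.length == 2) {
                rules.put(parts[0].trim(), parts[1].trim());
            }
        }
        return rules;
    }

    /** Returns the lemma from the rules, or the word itself if no rule applies. */
    static String lemma(Map<String, String> rules, String word) {
        return rules.getOrDefault(word, word);
    }
}
```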

Ordinal numbers may also be considered as adjectives, so the lemmatizer should propose the cardinal number of any ordinal number. For example:
drie - derde.

In most cases, the cardinal number is found by deleting the ending -de or -ste of the ordinal number.
For example:
twee - tweede
twintig - twintigste

Since this rule applies only to ordinal numbers, a list of cardinal numbers can be maintained. The list need not be long, since it only has to cover the cardinal numbers in which a lemma may end.
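
The ordinal rule might be sketched as follows, assuming a short list of cardinal numbers against which the end of the candidate lemma is checked. The class name OrdinalLemmatizer is hypothetical, and the list holds only the two cardinals from the examples above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical rule that maps an ordinal number to its cardinal number. */
final class OrdinalLemmatizer {

    // Cardinal numbers in which a lemma may end (deliberately incomplete).
    private static final Set<String> CARDINALS =
            new HashSet<>(Arrays.asList("twee", "twintig"));

    /** Strips -ste or -de, but only if what remains ends in a known cardinal. */
    static String cardinalOf(String word) {
        for (String ending : new String[] {"ste", "de"}) {
            if (word.endsWith(ending)) {
                String candidate = word.substring(0, word.length() - ending.length());
                for (String cardinal : CARDINALS) {
                    if (candidate.endsWith(cardinal)) {
                        return candidate;
                    }
                }
            }
        }
        return word;   // not recognised as an ordinal number
    }
}
```

A word such as aarde (earth) is left untouched, because aar does not end in a listed cardinal. An irregular ordinal such as derde is not covered by the ending rule and would need an explicit rule of its own.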

Next, I will add support for Dutch adverbs and nouns in the lemmatizer.

Wednesday 5 August 2009

Dutch Verb Lemmatizer

Today I developed a first version of a Dutch verb lemmatizer. The lemmatizer uses the validated verb information generated by the Dutch Verb Conjugation tool. Given a verb form, the lemmatizer proposes the correct lemma and displays an explanation.
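
One way to picture this is a reverse index from validated verb forms to their lemma. The sketch below is an assumption about the data structure, not the tool's actual design. The names VerbFormIndex, addForm, lemma, and explain are hypothetical, and a form that belongs to several lemmas would simply be overwritten here.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical reverse index from conjugated verb forms to lemmas. */
final class VerbFormIndex {

    // Maps a verb form to {lemma, description of the form}.
    private final Map<String, String[]> index = new HashMap<>();

    /** Registers one validated form, e.g. as produced by a conjugation tool. */
    void addForm(String form, String lemma, String description) {
        index.put(form, new String[] {lemma, description});
    }

    /** Returns the lemma of a verb form, or null if the form is unknown. */
    String lemma(String form) {
        String[] hit = index.get(form);
        return hit == null ? null : hit[0];
    }

    /** Returns a short explanation of why the lemma was proposed. */
    String explain(String form) {
        String[] hit = index.get(form);
        if (hit == null) {
            return form + ": unknown verb form";
        }
        return form + " = " + hit[1] + " of " + hit[0];
    }
}
```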



Next, I will try to extend the lemmatizer for Dutch adjectives, adverbs, and nouns.

Sunday 2 August 2009

Dutch Verb Conjugation 3

As described in my previous post, I am developing a Dutch verb conjugator that generates verb forms for the simple present and simple past based on basic verb information, e.g. aandrijven - dreef aan - heeft aangedreven.

The software is currently able to conjugate all regular verbs (available at www.muiswerk.nl).

I also added functionality to correct and save verb information. This is necessary because Dutch has quite a few irregular verbs, i.e. verbs for which not all conjugated forms can be derived from the basic verb information.



For example, the verb aankunnen is described with the following basic information:
aankunnen - kon aan - heeft aangekund

The conjugation of the verb for the simple present is:
1st person, singular: kan aan
2nd person, singular: kan aan
3rd person, singular: kan aan
1st person, plural: kunnen aan
2nd person, plural: kunnen aan
3rd person, plural: kunnen aan

The conjugation of the verb for the simple past is:
1st person, singular: kon aan
2nd person, singular: kon aan
3rd person, singular: kon aan
1st person, plural: konden aan
2nd person, plural: konden aan
3rd person, plural: konden aan

This example shows that the conjugation of the simple present and simple past of the verb aankunnen does not follow the normal conjugation rules.

Currently, I am correcting and saving the conjugated forms for the verbs available at www.muiswerk.nl. Once that is done, I will develop the Dutch verb lemmatizer.

Friday 31 July 2009

Dutch Verb Conjugation 2

As described in my previous post, I am developing a Dutch verb conjugator that generates verb forms for the simple present and simple past based on basic verb information, e.g. aandrijven - dreef aan - heeft aangedreven.

The software is currently able to conjugate all regular verbs (available at www.muiswerk.nl).

To identify the stem of certain verbs, such as aarzelen, correctly, it was necessary to maintain a list of valid stems. Initially, the proposed stem was aarzeelen, with the e incorrectly doubled.



Most other problems were caused by verbs borrowed from English. For example, the verb uploaden initially produced the stem uploaad, with the a incorrectly doubled. For the verb relaxen, I extended the rule that determines whether a verb is a T-verb or a D-verb from 't kofschip to 't ex-kofschip.

Currently, I am trying to improve the software so that it can also conjugate irregular verbs. As could be expected, this requires some extra coding.
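
The 't ex-kofschip check could look roughly like this. It is a simplified sketch that looks only at the last letter before the ending -en of the infinitive, whereas the actual rule is about sounds. The class name KofschipRule and the method isTVerb are hypothetical.

```java
/** Hypothetical check for T-verbs versus D-verbs using 't ex-kofschip. */
final class KofschipRule {

    // The consonants of 't ex-kofschip: t, x, k, f, s, c, h, p.
    private static final String CONSONANTS = "txkfschp";

    /**
     * Returns true if the past tense takes -te (T-verb) and false if it
     * takes -de (D-verb), judged by the letter before the ending -en.
     */
    static boolean isTVerb(String infinitive) {
        String base = infinitive.endsWith("en")
                ? infinitive.substring(0, infinitive.length() - 2)
                : infinitive;
        if (base.isEmpty()) {
            return false;
        }
        char last = base.charAt(base.length() - 1);
        return CONSONANTS.indexOf(last) >= 0;
    }
}
```

With the x included, relaxen is classified as a T-verb, while uploaden and aaien remain D-verbs.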

Monday 27 July 2009

Dutch Verb Conjugation

As explained in my previous post, I am trying to develop a rule-based Dutch verb lemmatizer.
To capture all the necessary verb information and lemmatization rules, I am developing a verb conjugator first.
Once I can conjugate all Dutch verbs, I should also have the information needed to derive the lemma of a given verb form.
It seems that I only need to generate the simple present and the simple past, since all other tenses combine the infinitive or the past participle with other verbs.
For the conjugation rules, I found good reference material at www.dutchgrammar.com.



So far, everything works fine. I am checking the output of the conjugator and, where necessary, doing some additional coding to handle exceptions. For example, when the stem ends in t, no extra t should be added.

To be continued...
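
The exception for stems ending in t can be sketched in a few lines. The class and method names are hypothetical, and the stem is assumed to have been derived already.

```java
/** Hypothetical helper for the simple present, 3rd person singular. */
final class PresentTense {

    /** Adds the ending -t to the stem, unless the stem already ends in t. */
    static String thirdPersonSingular(String stem) {
        return stem.endsWith("t") ? stem : stem + "t";
    }
}
```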

Saturday 25 July 2009

Dutch Verb Lemmatization

Today I started a project to build a lemmatization tool for Dutch verbs.

Most other lemmatization tools rely heavily on statistical techniques. Automatically generated rules are, however, very obscure to humans.
Because I would like humans to be able to improve the lemmatizer easily and to learn from its rules, I will use manually crafted lemmatization rules.

Another idea is that the lemmatizer could learn from the same didactic material that we humans use. I found such material on the website www.muiswerk.nl. For example, it provides verb information in the following text format:
aaien - aaide - heeft geaaid
aanbesteden - besteedde aan - heeft aanbesteed
aanduiden - duidde aan - heeft aangeduid
aanklagen - klaagde aan - heeft aangeklaagd
aankleden - kleedde aan - heeft aangekleed
aankondigen - kondigde aan - heeft aangekondigd
aanleggen - legde aan - heeft aangelegd
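
A line in this format could be split into its parts along the following lines. This is a sketch with hypothetical names (VerbInfo, parse), and it assumes that the three parts are always separated by " - " and that the last part consists of an auxiliary followed by the past participle.

```java
/** Hypothetical holder for one line of basic verb information. */
final class VerbInfo {
    final String infinitive;
    final String past;        // simple past, singular (may include a separated affix)
    final String auxiliary;   // e.g. heeft
    final String participle;

    private VerbInfo(String infinitive, String past, String auxiliary, String participle) {
        this.infinitive = infinitive;
        this.past = past;
        this.auxiliary = auxiliary;
        this.participle = participle;
    }

    /** Parses a line such as "aanduiden - duidde aan - heeft aangeduid". */
    static VerbInfo parse(String line) {
        String[] parts = line.split(" - ");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected three parts: " + line);
        }
        String[] perfect = parts[2].trim().split("\\s+", 2);
        if (perfect.length != 2) {
            throw new IllegalArgumentException("Expected auxiliary and participle: " + line);
        }
        return new VerbInfo(parts[0].trim(), parts[1].trim(), perfect[0], perfect[1]);
    }
}
```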

As a start, I structured this information according to the different verb conjugation rules that we humans would use. For example:
(affix+)stem+en - stem+de( affix) - heeft (affix+)ge+stem(+d)
(affix+)stem+en - stem+te( affix) - heeft (affix+)ge+stem(+t)


Currently, I am developing a software tool that can import the verb information based on the verb conjugation rules.



This proves harder than I thought, because the stem itself may change. For example, in the verb aanbesteden the stem appears as bested, and in the verb aanleggen it appears as legg.

Further development will thus be necessary!