As described in my previous post, I am developing a Dutch verb conjugator that generates the simple present and simple past forms from basic verb information, e.g. aandrijven - dreef aan - heeft aangedreven.
The software can currently conjugate all regular verbs in the verb list available at www.muiswerk.nl.
To identify the stem of certain verbs, such as aarzelen, correctly, I had to maintain a list of valid stems. Without that list, the proposed stem would be aarzeel, because the rule doubles the e.
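A minimal sketch of how such a stem list could sit in front of a naive stem rule. The names VALID_STEMS, naive_stem and stem are hypothetical, and the vowel-doubling rule is a simplified assumption, not the actual code of the conjugator:

```python
VOWELS = set("aeiou")

# Hypothetical hand-maintained list of stems that the naive rule gets wrong.
VALID_STEMS = {"aarzelen": "aarzel"}


def naive_stem(infinitive):
    """Derive a stem from the infinitive with two simplified spelling rules."""
    base = infinitive[:-2] if infinitive.endswith("en") else infinitive
    # A doubled final consonant is reduced to one: legg -> leg.
    if len(base) >= 2 and base[-1] == base[-2] and base[-1] not in VOWELS:
        return base[:-1]
    # A single a, e, o or u between two consonants is doubled: klag -> klaag.
    if (
        len(base) >= 3
        and base[-1] not in VOWELS
        and base[-2] in "aeou"
        and base[-3] not in VOWELS
    ):
        return base[:-1] + base[-2] + base[-1]
    return base


def stem(infinitive):
    """Prefer the hand-maintained stem; fall back to the naive rule."""
    return VALID_STEMS.get(infinitive, naive_stem(infinitive))
```

With this sketch, naive_stem("aarzelen") doubles the e, while stem("aarzelen") returns the listed stem aarzel.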
Most other problems were caused by verbs borrowed from English. For example, the verb uploaden initially generated the stem uploaad by doubling the a. For the verb relaxen, I extended the rule that determines whether a verb is a T-verb or a D-verb from 't kofschip to 't ex-kofschip.
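The 't ex-kofschip check could look roughly like the sketch below. It assumes the rule is applied to the infinitive without -en, and the names VOICELESS_ENDINGS and past_suffix are made up for this illustration:

```python
# Final sounds of 't ex-kofschip: t, x, k, f, s, ch, p.
VOICELESS_ENDINGS = ("t", "x", "k", "f", "s", "ch", "p")


def past_suffix(infinitive):
    """Return "te" for a T-verb and "de" for a D-verb (regular verbs only)."""
    base = infinitive[:-2] if infinitive.endswith("en") else infinitive
    # str.endswith accepts a tuple and matches any of the endings.
    return "te" if base.endswith(VOICELESS_ENDINGS) else "de"


print(past_suffix("relaxen"))   # te
print(past_suffix("uploaden"))  # de
```

Checking the infinitive rather than the stem matters for verbs whose final consonant changes in the stem: leven has the stem leef, but it is still a D-verb.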
I am currently improving the software so that it can also conjugate irregular verbs. As might be expected, this requires some extra coding.
Friday, 31 July 2009
Monday, 27 July 2009
Dutch Verb Conjugation
As explained in my previous post, I am trying to develop a rule-based Dutch verb lemmatizer.
To capture all the necessary verb information and lemmatization rules, I am developing a verb conjugator first.
Once I can conjugate all Dutch verbs, I should also have the information needed to derive the lemma of a given verb form.
It seems that I only need to generate the simple present and the simple past, since all other tenses combine the infinitive or the past participle with other verbs.
For the conjugation rules, I found good reference material at www.dutchgrammar.com.
So far, everything works fine. I am checking the output of the conjugator and, where necessary, adding code to handle exceptions. For example, when the stem ends in t, no extra t should be added.
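As an illustration of the t exception, here is a small sketch of the simple present for a regular verb. The function name present_tense is hypothetical, the stem is assumed to be known already, and separable prefixes and inverted word order are ignored:

```python
def present_tense(infinitive, stem):
    """Simple present of a regular verb, given its infinitive and stem."""
    # No extra t is added when the stem already ends in t: praat, not praatt.
    with_t = stem if stem.endswith("t") else stem + "t"
    return {
        "ik": stem,
        "jij": with_t,
        "hij": with_t,
        "wij": infinitive,
        "jullie": infinitive,
        "zij": infinitive,
    }


print(present_tense("werken", "werk")["hij"])   # werkt
print(present_tense("praten", "praat")["hij"])  # praat
```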
To be continued...
Saturday, 25 July 2009
Dutch Verb Lemmatization
Today I started a project to build a lemmatization tool for Dutch verbs.
Most other lemmatization tools rely heavily on statistical techniques. Automatically generated rules are, however, very obscure to humans.
Because I would like humans to be able to improve the lemmatizer easily and to learn from its rules, I will use manually crafted lemmatization rules.
Another idea is that the lemmatizer could learn from the same didactic material that we humans use. I found such material on the website www.muiswerk.nl, which provides verb information in, for example, the following text format:
aaien - aaide - heeft geaaid
aanbesteden - besteedde aan - heeft aanbesteed
aanduiden - duidde aan - heeft aangeduid
aanklagen - klaagde aan - heeft aangeklaagd
aankleden - kleedde aan - heeft aangekleed
aankondigen - kondigde aan - heeft aangekondigd
aanleggen - legde aan - heeft aangelegd
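Lines in this format could be read with something like the sketch below. The VerbEntry type and the parse_entry function are hypothetical names, and the sketch assumes every line has exactly three parts separated by " - ", with the auxiliary as the first word of the last part:

```python
from typing import NamedTuple


class VerbEntry(NamedTuple):
    infinitive: str
    past: str        # simple past, with any separated affix after it
    auxiliary: str   # e.g. "heeft"
    participle: str


def parse_entry(line):
    """Split one 'infinitive - past - auxiliary participle' line."""
    infinitive, past, perfect = (part.strip() for part in line.split(" - "))
    auxiliary, participle = perfect.split(" ", 1)
    return VerbEntry(infinitive, past, auxiliary, participle)


entry = parse_entry("aanbesteden - besteedde aan - heeft aanbesteed")
print(entry.past)        # besteedde aan
print(entry.participle)  # aanbesteed
```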
As a start, I structured this information according to the verb conjugation rules that we humans would use. For example:
(affix+)stem+en - stem+de( affix) - heeft (affix+)ge+stem(+d)
(affix+)stem+en - stem+te( affix) - heeft (affix+)ge+stem(+t)
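To make the two patterns concrete, the sketch below matches a simple past form against them and returns the affix, the stem and the rule that applied. The function name classify is hypothetical, and the sketch only covers regular verbs that follow one of these two patterns:

```python
def classify(infinitive, past):
    """Return (affix, stem, rule) for a past form such as 'legde aan'."""
    # The verb form comes first; a separated affix, if any, follows it.
    verb, _, affix = past.partition(" ")
    if affix and not infinitive.startswith(affix):
        raise ValueError(f"affix {affix!r} does not start {infinitive!r}")
    for rule in ("de", "te"):
        if verb.endswith(rule):
            return affix, verb[: -len(rule)], rule
    raise ValueError(f"{past!r} matches neither the -de nor the -te pattern")


print(classify("aanleggen", "legde aan"))  # ('aan', 'leg', 'de')
print(classify("aaien", "aaide"))          # ('', 'aai', 'de')
```

Note that the stem is read from the past form here, not from the infinitive, so spelling changes in the infinitive do not affect it.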
I am currently developing a software tool that imports the verb information using these verb conjugation rules.
This is proving harder than I thought because the spelling of the stem itself may change. For example, in the infinitive aanbesteden the stem appears as bested, and in aanleggen it appears as legg.
Further development will thus be necessary!