Today I started a project to build a lemmatization tool for Dutch verbs.
Most other lemmatization tools rely heavily on statistical techniques. Automatically generated rules, however, are very hard for humans to read.
Because I would like humans to be able to improve the lemmatizer easily and to learn from its rules, I will use manually crafted lemmatization rules.
Another idea is that the lemmatizer could learn from the same didactic material we humans use. I found such material on the website
www.muiswerk.nl. They have, for example, verb information in the following text format:
aaien - aaide - heeft geaaid
aanbesteden - besteedde aan - heeft aanbesteed
aanduiden - duidde aan - heeft aangeduid
aanklagen - klaagde aan - heeft aangeklaagd
aankleden - kleedde aan - heeft aangekleed
aankondigen - kondigde aan - heeft aangekondigd
aanleggen - legde aan - heeft aangelegd
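As an illustration, one line of this format could be read into its three forms roughly as follows. This is only a minimal sketch: the names VerbEntry and parse_entry are my own assumptions for the example, and it assumes the forms are always separated by " - ".

```python
from typing import NamedTuple


class VerbEntry(NamedTuple):
    """The three forms given on one line of the source text (hypothetical name)."""
    infinitive: str
    past: str
    perfect: str


def parse_entry(line: str) -> VerbEntry:
    """Split a line such as 'aaien - aaide - heeft geaaid' into its three forms."""
    parts = [part.strip() for part in line.split(" - ")]
    if len(parts) != 3:
        raise ValueError(f"expected three forms separated by ' - ': {line!r}")
    return VerbEntry(*parts)


entry = parse_entry("aanklagen - klaagde aan - heeft aangeklaagd")
print(entry.infinitive, "|", entry.past, "|", entry.perfect)
```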
As a start I structured this information according to the different rules for verb conjugation we humans would use. For example:
(affix+)stem+en - stem+de( affix) - heeft (affix+)ge+stem(+d)
(affix+)stem+en - stem+te( affix) - heeft (affix+)ge+stem(+t)
Currently, I am developing a software tool that can import the verb information based on the verb conjugation rules.
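To make the two patterns concrete, here is a small sketch of how a literal check against them might look. It is not the actual tool: the function name match_weak_rule is an assumption for this example, and the check is deliberately strict, so it returns None whenever the three forms do not fit the pattern letter for letter.

```python
from typing import Optional, Tuple


def match_weak_rule(infinitive: str, past: str, perfect: str) -> Optional[Tuple[str, str, str]]:
    """Return (affix, stem, ending) if the forms literally fit
    (affix+)stem+en - stem+de/te( affix) - heeft (affix+)ge+stem(+d/t),
    otherwise None.
    """
    # The past form is 'stem+de' or 'stem+te', optionally followed by the affix.
    past_word, _, affix = past.partition(" ")
    for ending in ("de", "te"):
        if not past_word.endswith(ending):
            continue
        stem = past_word[: -len(ending)]
        if not stem:
            continue
        # The infinitive must be exactly affix + stem + 'en'.
        if infinitive != affix + stem + "en":
            continue
        # The final d/t of the participle is optional in the pattern.
        participle = "heeft " + affix + "ge" + stem
        if perfect in (participle, participle + ending[0]):
            return affix, stem, ending
    return None


print(match_weak_rule("aankondigen", "kondigde aan", "heeft aangekondigd"))
print(match_weak_rule("aanklagen", "klaagde aan", "heeft aangeklaagd"))
```

The first call yields the affix "aan" and the stem "kondig"; the second yields None, because "aan" + "klaag" + "en" is not spelled the same as "aanklagen".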
This proves harder than I thought because the spelling of the stem itself may change. For example, in the verb aanbesteden the stem appears as bested and in the verb aanleggen it appears as legg, whereas the past tense forms besteedde and legde use besteed and leg.
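One possible way to deal with this, sketched here only as an idea, is to generate a few candidate spellings of the stem and keep the one that the past tense form confirms. The two spelling adjustments below (a doubled final consonant written single, and a single a/e/o/u written double) are simplifying assumptions that do not cover all Dutch verbs, and the function names are hypothetical.

```python
from typing import List, Optional

VOWELS = "aeiou"


def candidate_stems(infinitive_stem: str) -> List[str]:
    """Return possible spellings of the stem when it stands without '+en'."""
    s = infinitive_stem
    candidates = [s]
    # A doubled final consonant is written single: legg -> leg.
    if len(s) >= 2 and s[-1] == s[-2] and s[-1] not in VOWELS:
        candidates.append(s[:-1])
    # A single a/e/o/u between consonants may be written double: bested -> besteed.
    if len(s) >= 3 and s[-1] not in VOWELS and s[-2] in "aeou" and s[-3] not in VOWELS:
        candidates.append(s[:-2] + s[-2] * 2 + s[-1])
    return candidates


def resolve_stem(infinitive_stem: str, past_word: str) -> Optional[str]:
    """Pick the candidate stem for which stem+de or stem+te equals the past form."""
    for stem in candidate_stems(infinitive_stem):
        if past_word in (stem + "de", stem + "te"):
            return stem
    return None


print(resolve_stem("bested", "besteedde"))
print(resolve_stem("legg", "legde"))
```

Because the past tense form from the verb list decides which candidate is kept, a wrong guess among the candidates does no harm; a verb for which no candidate fits simply gives None.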
Further development will thus be necessary!