Saturday, 8 August 2009

Dutch Lemmatizer

I just added support for the lemmatisation of adjectives in the Dutch lemmatizer.

Adjectives are words that modify nouns. A Dutch adjective generally occurs in two forms: an undeclined one and a declined one ending in -e. I found a good description of the rules at Wikibooks.
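
To make the declined/undeclined relation concrete, here is a rough sketch of how candidate undeclined forms could be generated from a form ending in -e. It is not the lemmatizer's actual code: the function names and the small `LEXICON` set are hypothetical, and the spelling adjustments (single consonant, s/f for z/v, doubled long vowel) are only the common Dutch ones, not the full rule set described at Wikibooks.

```python
VOWELS = "aeiou"

# Hypothetical mini-lexicon, used only to choose among candidates.
LEXICON = {"groot", "wit", "boos", "lief", "mooi"}


def undeclined_candidates(word):
    """Return possible undeclined forms for an adjective ending in -e."""
    word = word.lower()
    if len(word) < 3 or not word.endswith("e"):
        return []
    stem = word[:-1]
    candidates = [stem]  # mooie -> mooi

    # A doubled final consonant is written single: witte -> wit
    if stem[-1] == stem[-2] and stem[-1] not in VOWELS:
        candidates.append(stem[:-1])

    # z and v are written s and f at the end of a word: lieve -> lief
    base = stem
    if stem[-1] == "z":
        base = stem[:-1] + "s"
    elif stem[-1] == "v":
        base = stem[:-1] + "f"
    if base != stem:
        candidates.append(base)

    # A long vowel is doubled in a closed syllable: grote -> groot
    if (len(base) >= 3 and base[-1] not in VOWELS
            and base[-2] in "aeou" and base[-3] not in VOWELS):
        candidates.append(base[:-2] + base[-2] * 2 + base[-1])
    return candidates


def undeclined_form(word, lexicon=LEXICON):
    """Pick the first candidate that is a known word, else None."""
    for candidate in undeclined_candidates(word):
        if candidate in lexicon:
            return candidate
    return None
```

Generating several candidates and checking them against a word list avoids having to decide up front which spelling adjustment applies to a given form.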



Adjectives are also inflected to form comparatives and superlatives, and some of these forms are irregular. For example:
goed - beter - best.
Such special cases can be added as rules like:
beter=>goed
best=>goed
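
As an illustration, rules in this `form=>lemma` notation could be loaded into a lookup table along these lines. This is only a sketch: `parse_rules` and `lemmatize_exception` are hypothetical names, and the real lemmatizer may store and apply its rules differently.

```python
def parse_rules(text):
    """Parse lines of the form 'form=>lemma' into a dictionary."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if "=>" not in line:
            continue  # skip blank or malformed lines
        form, lemma = line.split("=>", 1)
        rules[form.strip()] = lemma.strip()
    return rules


# The two example rules from this post.
RULES = parse_rules("""
beter=>goed
best=>goed
""")


def lemmatize_exception(word, rules=RULES):
    """Return the lemma for an irregular form, or None if no rule applies."""
    return rules.get(word.lower())
```

A lookup like this would run before any suffix-based rule, so that irregular forms never reach the general rules.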

Ordinal numbers can also be considered adjectives, so for any ordinal number the lemmatizer should propose the corresponding cardinal number as the lemma. For example:
drie - derde.

In most cases, the cardinal number is found by deleting the ending -de or -ste of the ordinal number.
For example:
twee - tweede
twintig - twintigste

Since this rule may only be applied to ordinal numbers, the lemmatizer can maintain a list of cardinal numbers and strip the ending only when what remains matches that list. The list need not be long: it is sufficient to cover the cardinal numbers on which the lemma might end.
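
The check described above might look roughly like this. It is a sketch under assumptions: `CARDINAL_ENDINGS` is an illustrative, incomplete list, `IRREGULAR_ORDINALS` holds special cases in the spirit of the exception rules, and `ordinal_to_cardinal` is a hypothetical name rather than the lemmatizer's actual function.

```python
# Illustrative (incomplete) list of cardinals a number word can end in.
CARDINAL_ENDINGS = (
    "twee", "vier", "vijf", "zes", "zeven", "acht", "negen", "tien",
    "elf", "twaalf", "twintig", "dertig", "honderd", "duizend",
)

# Irregular ordinals are handled as exceptions, not by suffix stripping.
IRREGULAR_ORDINALS = {"eerste": "een", "derde": "drie"}


def ordinal_to_cardinal(word):
    """Return the cardinal for a Dutch ordinal, or None if it is not one."""
    word = word.lower()
    if word in IRREGULAR_ORDINALS:
        return IRREGULAR_ORDINALS[word]
    for suffix in ("ste", "de"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            # Accept the stem only if it ends in a known cardinal.
            if stem.endswith(CARDINAL_ENDINGS):
                return stem
    return None
```

Because only the end of the stem is compared, a compound such as eenentwintigste is covered by the single entry twintig, while an ordinary word such as aarde is left alone.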

Next, I will add support for Dutch adverbs and nouns in the lemmatizer.
