Sunday, 9 August 2009

Dutch Lemmatizer 2

Today I added support for Dutch noun lemmatisation in the Dutch lemmatizer.

Most Dutch nouns change form in the plural or the diminutive. For example, the plural of been (leg) is benen and the diminutive of schip (ship) is scheepje. It is now possible to add lemmatisation rules, based on the ending of the noun, to the Dutch lemmatizer.

In many cases the plural of a noun is formed by adding -en to the lemma, so a good rule for deriving the lemma from the plural form is to remove the ending -en. Exceptions to this rule are nouns that end in -e or -en. It is therefore possible to manage these nouns in a separate list.
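As a rough illustration of the idea (not the actual code of the Dutch lemmatizer), the -en rule combined with a separate exception list could look like the sketch below. The function name lemmatize_noun, the EXCEPTIONS dictionary and its entries are assumptions made for this example only.

```python
# Hypothetical exception list: word forms that the generic -en rule would
# get wrong, mapped directly to their lemma.
EXCEPTIONS = {
    "keuken": "keuken",    # lemma itself ends in -en (kitchen)
    "ziekten": "ziekte",   # lemma ends in -e (illness)
}


def lemmatize_noun(word, exceptions=EXCEPTIONS):
    """Return a guess for the lemma of a Dutch noun form."""
    word = word.lower()
    # The exception list is consulted first, before any ending-based rule.
    if word in exceptions:
        return exceptions[word]
    # Generic rule: strip the plural ending -en.
    if word.endswith("en") and len(word) > 3:
        return word[:-2]
    # No rule applies: assume the word already is the lemma.
    return word


print(lemmatize_noun("boeken"))   # plural of boek (book)
print(lemmatize_noun("keuken"))   # handled by the exception list
```

Checking the exception list before the ending-based rule keeps the rule itself simple: any noun it handles badly can be added to the list without touching the rule.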

I would like to use the Dutch lemmatizer on Dutch electronic texts to provide dictionary lookup functionality. Therefore, I will next develop a Dutch tokenizer that separates a Dutch text into tokens, i.e. words and punctuation.
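The tokenizer does not exist yet, but a first approximation of splitting a text into words and punctuation could be a single regular expression, as in the sketch below. The names TOKEN_PATTERN and tokenize are hypothetical, and the pattern is only an assumption about what a simple starting point might look like.

```python
import re

# A word is a run of letters/digits, optionally joined by hyphens or
# apostrophes; any other non-space character is a punctuation token.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text):
    """Split a text into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("De benen van het schip, zei hij."))
```

Each word token could then be passed to the lemmatizer, while punctuation tokens are skipped during dictionary lookup.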
