Finally, I found some time to develop a JavaDB database to store the contents of the Dutch Text Interpretation Aid.
I believe this is important because a database scales better than plain text files for storing and retrieving information. The database also supports collaboration by recording time stamps and user information. I plan to distribute the tool to interested users, who can both contribute to and make use of the shared database.
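To make the role of time stamps and user information concrete, here is a minimal in-memory sketch in Java. It is not the actual JavaDB schema: the class `SharedLexiconSketch`, the fields `wordForm`, `contributor` and `modified`, and the "most recent contribution wins" rule are all assumptions chosen for illustration.

```java
import java.time.Instant;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SharedLexiconSketch {

    // One contributed entry: a word form plus who submitted it and when.
    // All field names are hypothetical, not taken from the real database.
    static final class Entry {
        final String wordForm;
        final String contributor;
        final Instant modified;

        Entry(String wordForm, String contributor, Instant modified) {
            this.wordForm = wordForm;
            this.contributor = contributor;
            this.modified = modified;
        }
    }

    // Merges contributions, keeping the most recently modified entry per word form.
    // "Latest time stamp wins" is an assumed policy, one of several possible ones.
    static Map<String, Entry> merge(List<Entry> contributions) {
        Map<String, Entry> merged = new HashMap<>();
        for (Entry candidate : contributions) {
            Entry current = merged.get(candidate.wordForm);
            if (current == null || candidate.modified.isAfter(current.modified)) {
                merged.put(candidate.wordForm, candidate);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Entry> contributions = Arrays.asList(
                new Entry("liep", "anna", Instant.parse("2009-12-01T10:00:00Z")),
                new Entry("liep", "bram", Instant.parse("2009-12-06T10:00:00Z")),
                new Entry("lopen", "anna", Instant.parse("2009-12-02T10:00:00Z")));

        // Print each surviving entry with its contributor and time stamp.
        for (Entry entry : merge(contributions).values()) {
            System.out.println(entry.wordForm + " <- " + entry.contributor + " @ " + entry.modified);
        }
    }
}
```

In a real database the same information would more likely live in two extra columns per row, with the conflict rule applied when contributions are synchronised.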
Covering all Dutch word forms takes a huge amount of time and energy; Dutch is, after all, a highly inflected language. By dividing the work among many users, I hope the benefits will outweigh the costs.
Currently, the tool recognises about 90% of the words in a news article. About 2% of the missing words are proper nouns (see the figure below).
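A recognition rate like this can be computed as the fraction of word tokens in a text that are found in the lexicon. The Java sketch below illustrates that calculation under simple assumptions: the class `CoverageSketch`, the letter-based tokenizer, and the tiny example lexicon are hypothetical and are not the tool's actual implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class CoverageSketch {

    // Splits text into lower-cased word tokens; any run of non-letters separates tokens.
    static String[] tokenize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"))
                .filter(token -> !token.isEmpty())
                .toArray(String[]::new);
    }

    // Returns the fraction of tokens found in the lexicon, or 0.0 for text without words.
    static double coverage(String text, Set<String> lexicon) {
        String[] tokens = tokenize(text);
        if (tokens.length == 0) {
            return 0.0;
        }
        long known = Arrays.stream(tokens).filter(lexicon::contains).count();
        return (double) known / tokens.length;
    }

    public static void main(String[] args) {
        // Toy lexicon; a real one would be loaded from the database.
        Set<String> lexicon = new HashSet<>(Arrays.asList("de", "kat", "zit", "op", "mat"));
        String article = "De kat zit op de mat in Amsterdam.";

        // 6 of the 8 tokens are in the lexicon ("in" and "amsterdam" are not).
        System.out.printf(Locale.ROOT, "coverage: %.2f%n", coverage(article, lexicon));
        // prints "coverage: 0.75"
    }
}
```

Because the sketch lower-cases every token, it cannot tell proper nouns apart from other unknown words; counting them separately would need extra information such as capitalisation or a name list.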
Sunday, 6 December 2009