Monday 25 January 2010

Sirius Dutch Text Editor 2

I have made some progress in developing the software tool. First, I parsed Wiktionary data to extend the dictionary (of the Dutch Text Interpretation Aid) with lexical categories. These lexical categories will be used to verify the grammar of sentences. Because a word may have multiple meanings (which may belong to different lexical categories), the user should first specify the proper meaning of each word. For this purpose, ambiguous words whose meaning has not yet been specified are underlined in blue. While the mouse pointer hovers over such an underlined word, a popup menu appears that lets the user select the proper meaning. Once all words of a sentence have a proper meaning attached to them, the grammar of the sentence may still be incorrect or unknown to the tool. Such a sentence is underlined in green.
I still have to develop an easy way to add valid syntactic patterns and to improve the analysis of the text.
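The two-stage check described above (first resolve ambiguous words, then compare the sentence against known syntactic patterns) could be sketched in Python roughly as follows. This is only an illustration, not the tool's actual code: the mini-lexicon, the category labels, the pattern set, and the function names are all hypothetical.

```python
# Hypothetical mini-lexicon: word -> list of (sense description, lexical category).
LEXICON = {
    "de": [("definite article", "DET")],
    "bank": [("piece of furniture", "NOUN"), ("financial institution", "NOUN")],
    "was": [("past tense of 'zijn'", "VERB"), ("laundry", "NOUN")],
    "dicht": [("closed", "ADJ")],
}

# Hypothetical set of accepted lexical-category sequences.
VALID_PATTERNS = {("DET", "NOUN", "VERB", "ADJ")}


def ambiguous_words(tokens, chosen):
    """Return the indices of tokens with several senses and no selection yet."""
    return [
        i for i, token in enumerate(tokens)
        if len(LEXICON.get(token, [])) > 1 and i not in chosen
    ]


def check_sentence(tokens, chosen):
    """Return 'blue' while senses are missing, 'green' if the category
    sequence is not a known pattern, and 'ok' otherwise.

    `chosen` maps a token index to the index of the selected sense.
    """
    if ambiguous_words(tokens, chosen):
        return "blue"
    categories = []
    for i, token in enumerate(tokens):
        senses = LEXICON.get(token, [])
        # Words missing from the lexicon get a placeholder category,
        # so the sentence will not match any known pattern.
        category = senses[chosen.get(i, 0)][1] if senses else "UNKNOWN"
        categories.append(category)
    return "ok" if tuple(categories) in VALID_PATTERNS else "green"
```

With the tokens `["de", "bank", "was", "dicht"]`, the sentence stays "blue" until a sense has been chosen for both "bank" and "was". Choosing the noun sense of "was" then yields a category sequence that is not in the pattern set, so the sentence would be marked "green".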
It is already possible to save and load the semantically enriched text in HTML format. However, I still need to improve this functionality (with JavaScript) so that the published texts remain easy to understand.
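One way such a semantically enriched HTML file could be written and read back is sketched below in Python. The markup is an assumption made for illustration: the `data-sense` attribute, the sense identifier format, and the names `to_enriched_html` and `EnrichedTextLoader` are hypothetical and do not describe the format the tool really uses.

```python
from html import escape
from html.parser import HTMLParser


def to_enriched_html(tokens, senses):
    """Serialize tokens to HTML, wrapping each word that has a chosen sense
    in a <span> carrying a (hypothetical) data-sense attribute.

    `senses` maps a token index to a sense identifier string.
    """
    parts = []
    for i, token in enumerate(tokens):
        if i in senses:
            sense = escape(senses[i], quote=True)
            parts.append(f'<span data-sense="{sense}">{escape(token)}</span>')
        else:
            parts.append(escape(token))
    return " ".join(parts)


class EnrichedTextLoader(HTMLParser):
    """Recover the tokens and their sense identifiers from the HTML above."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self.senses = {}
        self._pending = None  # sense of the <span> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._pending = dict(attrs).get("data-sense")

    def handle_endtag(self, tag):
        if tag == "span":
            self._pending = None

    def handle_data(self, data):
        for word in data.split():
            if self._pending is not None:
                self.senses[len(self.tokens)] = self._pending
            self.tokens.append(word)
```

Because the sense is stored in an attribute, a browser shows the text unchanged, and a script on the published page could read the attribute to display the chosen meaning.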

Monday 18 January 2010

Sirius Dutch Text Editor

The new software tool I am working on is a Dutch Text Editor. The software tool will support the writing of Dutch texts. While writing a text, the tool will not only highlight spelling and grammar mistakes, but will also indicate ambiguous and/or difficult words. Where possible the tool will offer the user a list of synonyms to replace difficult words. The user may also specify the meaning of ambiguous words by choosing the proper definition. This semantic information will be saved together with the text so that it is easier for readers (humans or machines) to interpret the semantically enriched text.
So far, I have developed a tokenizer and a sentence splitter for Dutch.
As you can see in the figure, the sentence splitter distinguishes between a period used as a digit group separator and a period used as a full stop. The tokenizer also replaces contractions like "zo'n" with the full form "zo een". This should make things easier for the parser, which still needs to be developed.
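The behaviour described above can be approximated with a small regular-expression tokenizer in Python. This is a minimal sketch under my own assumptions, not the tool's implementation: the token pattern, the contraction table (only "zo'n" comes from this post), and the function names are hypothetical.

```python
import re

# Hypothetical contraction table; only "zo'n" is taken from the post.
CONTRACTIONS = {"zo'n": ["zo", "een"]}

# A token is either a number (optionally with period-separated digit groups
# and a decimal comma, as in Dutch), a word (possibly containing
# apostrophes), or a single punctuation character.
TOKEN_RE = re.compile(r"\d+(?:\.\d{3})*(?:,\d+)?|\w+(?:'\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into tokens and expand known contractions.

    Note that an expanded contraction is emitted in lowercase.
    """
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        tokens.extend(CONTRACTIONS.get(token.lower(), [token]))
    return tokens


def split_sentences(tokens):
    """Group tokens into sentences, ending one at each '.', '!' or '?' token.

    A period inside a number never appears as a separate token, so it
    cannot end a sentence here.
    """
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```

For the input "Het boek kost 1.250 euro. Zo'n prijs is hoog." this keeps "1.250" as a single token and splits the text into two sentences, the second of which starts with "zo", "een". Abbreviations that end in a period (such as "bijv.") are not handled by this sketch and would need an exception list.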

Monday 4 January 2010

My Own Company

Happy New Year! Since 2010-01-01, I have had my own company (Sirius Computing). Through the company, I wish to develop and distribute useful software tools for natural language processing.
The first product available is the Dutch Text Interpretation Aid. Hopefully, many people will decide to buy and use the software tool. The profits can then be used to develop other software tools and information resources.