Does the current spellchecking software work well for your language? As software is translated into more and more languages, we are finding that writing-aid software based on European-language structures is less and less effective.
So what can we do about it? It's time to review our current resources, and work together to create writing-aid tools appropriate to a wider range of languages, or appropriate to specific languages. Some case studies are discussed below.
This page will link to discussions on this subject in different mailing lists, to wiki pages detailing it for and in different languages, and to software being developed, or which can be used, to meet these needs.
Please add any information you think could be useful. :)
I have been working on vspell, a Vietnamese spell checker, since 2003 (there were huge gaps when I did not work on it at all, though). The source code can be found here (click on the first “snapshot” link to get a tarball). The core idea is quite simple. It is trained with a word-segmented corpus. When a sentence is given (actually a phrase, because it still does not understand what a “sentence” is), it generates similar sentences based on common spelling errors. It then uses statistics from the corpus to determine which sentence (the original one or one of the generated ones) is “better”. If a generated one is better, it assumes the original one is misspelled. That's all. The rest of the work is matching the original sentence against the “right” one to find the differences between them and tell the user about them.
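The generate-and-compare idea described above can be sketched in a few lines. This is not vspell's code: the bigram model, the add-one smoothing, the `CONFUSIONS` table and the toy corpus are all assumptions chosen only to make the idea concrete.

```python
from collections import Counter
from itertools import product

# Hypothetical confusion sets: tokens commonly typed for one another.
CONFUSIONS = {"xinh": ["sinh"], "sinh": ["xinh"]}

# Toy word-segmented corpus (an assumption, far too small for real use).
CORPUS = [
    ["sinh", "viên", "đại", "học"],
    ["sinh", "viên", "mới"],
    ["xinh", "đẹp"],
    ["xinh", "xắn"],
]

def train(corpus):
    """Count single tokens and adjacent token pairs in the corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams

def score(phrase, unigrams, bigrams):
    """Add-one smoothed bigram score; higher means more plausible."""
    vocab = len(unigrams) + 1
    total = 1.0
    for a, b in zip(phrase, phrase[1:]):
        total *= (bigrams[(a, b)] + 1) / (unigrams[a] + vocab)
    return total

def candidates(phrase):
    """The original phrase first, then every variant built from CONFUSIONS."""
    options = [[tok] + CONFUSIONS.get(tok, []) for tok in phrase]
    return [list(c) for c in product(*options)]

def check(phrase, unigrams, bigrams):
    """Return the best-scoring phrase and the positions where it differs."""
    # max() keeps the first maximum, so ties favour the original phrase.
    best = max(candidates(phrase), key=lambda c: score(c, unigrams, bigrams))
    diffs = [(i, a, b) for i, (a, b) in enumerate(zip(phrase, best)) if a != b]
    return best, diffs

uni, bi = train(CORPUS)
best, diffs = check(["xinh", "viên"], uni, bi)
print(best, diffs)
```

An empty `diffs` list means no generated phrase scored higher than the input, so nothing is reported to the user.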
The result as of three months ago was not very promising: the precision was about 60%-70% (I expected at least 80% for it to be useful). I was investigating why the precision was low, and had some technical difficulties.
I've been working on a spelling checker for Vietnamese. The problems are:
The first problem means we must interact with OpenOffice without Hunspell. That's not easy to do.
But the main problem is the third. My approach is statistics-based. It requires lots of (correct) word-aligned sentences to be trained on. That kind of corpus does not exist for Vietnamese (at least not freely). So my workaround is to take a raw corpus and train repeatedly to get a better result with each iteration. The workaround has a number explosion problem: it easily breaks the 32-bit integer limit and also the “long double” limit. In short, until I find a feasible training method, my spell checker is of no use.
Vietnamese “words” are often composed of more than one word. Words are usually monosyllabic, so we can think of them as syllables of these longer words. However, current spellchecking tools treat each Vietnamese “syllable” as a separate word. This means that when you make a mistake that is still a valid word, e.g. typing « màu hình » instead of « màn hình », current spellchecking tools will still recognize « màu » and « hình » as valid separate words, and will not detect the error.
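A small sketch of why a per-syllable check misses this kind of error, next to a check that also looks at neighbouring syllables. The word lists are toy assumptions, and a pair that is merely missing from the list is only suspicious, not proven wrong.

```python
# Hypothetical data: valid syllables and syllable pairs seen together.
SYLLABLES = {"màn", "màu", "hình", "sắc"}
ATTESTED_PAIRS = {("màn", "hình"), ("màu", "sắc")}

def syllable_check(text, syllables):
    """Word-list check: flag each syllable that is not in the list."""
    return [s for s in text.split() if s not in syllables]

def pair_check(text, attested_pairs):
    """Flag adjacent syllable pairs that never occur together in the data."""
    toks = text.split()
    return [pair for pair in zip(toks, toks[1:]) if pair not in attested_pairs]

typo = "màu hình"
intended = "màn hình"

print(syllable_check(typo, SYLLABLES))    # no syllable is flagged
print(pair_check(typo, ATTESTED_PAIRS))   # the pair as a whole is flagged
print(pair_check(intended, ATTESTED_PAIRS))
```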
(Note: Ivan's current Hunspell spellchecking dictionary is being used in OpenOffice.org, and in Firefox and Thunderbird.)
I know my statistical method isn't ideal. A rule-based approach would be more realistic. But the rule-based one requires human power to build the ruleset. A rule generation approach like TBL (Transformation-Based Learning) requires an annotated corpus, which I don't have.
Identifying composed words is also what I want my spellchecker to do.
I don't think a grammar checker is viable, due to the complexity of Vietnamese grammar. My spellchecker is basically just a spellchecker, although it may be able to detect some semantic/grammatical mistakes as well.
Even using current grammar-checking tools, you will have more trouble with Vietnamese ;) Before you discuss grammar, you must split a sentence into words (actually annotated words, but that's not the point). It's already difficult to do that in Vietnamese. Now you are supposed to do that on a misspelt text. Good luck :D
European languages don't have this problem, as distinct words can be easily recognized. CJK languages do, though, but I guess the status of CJK spellchecking in OOo is just the same as it is for Vietnamese.
To have an idea how hard it is to split a sentence into words, let's take a corner case: « Ông già đi nhiều quá ». You can understand this sentence in a couple of ways:
Now suppose « già » is mistakenly written as « dà », then pass the sentence to a grammar checker ;)
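To make the ambiguity concrete, here is a minimal sketch that lists every way of splitting the example into entries of a word list. The lexicon is a toy assumption built for this one sentence, not a real dictionary.

```python
def segmentations(syllables, lexicon, max_len=3):
    """Return every split of `syllables` into words found in `lexicon`."""
    if not syllables:
        return [[]]
    results = []
    for n in range(1, min(max_len, len(syllables)) + 1):
        word = " ".join(syllables[:n])
        if word in lexicon:
            for rest in segmentations(syllables[n:], lexicon, max_len):
                results.append([word] + rest)
    return results

# Hypothetical lexicon: single syllables plus two two-syllable words.
LEXICON = {"ông", "già", "đi", "nhiều", "quá", "ông già", "già đi"}

sentence = "ông già đi nhiều quá".split()
for seg in segmentations(sentence, LEXICON):
    print(" | ".join(seg))

# The same sentence with « già » mistyped as « dà ».
misspelt = ["dà" if s == sentence[1] else s for s in sentence]
print(segmentations(misspelt, LEXICON))
```

With this lexicon the correct sentence has several competing segmentations, while the misspelt one has none at all, so a checker that depends on segmentation has nothing to work with.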
Is it possible to train a spellchecker, as one trains a Bayesian spam-filtering program like SpamSieve on OSX? If we have a group of volunteers each building up a corpus, and training the spellchecker each time it makes a mistake, perhaps we could amass the data we need.
It is interesting to hear about the difficulties of spell checking in Vietnamese. I have also run into problems using Hunspell to spellcheck Quechua (an indigenous language of the Andes), but of a very different nature.
Quechua is an agglutinative language. Most words have 1 or 2 suffixes, but some can have as many as 8 or 9. Most suffixes have to be added in a very specific order, but a few can appear in almost any order. Needless to say, the number of possible suffix combinations is almost infinite, and they are almost impossible to list in an affix file. When I tried to write out all the possible combinations, I got up to 500 pages of combinations, and that was only combining 1, 2, and 3 suffixes.
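A rough count shows how quickly the combinations grow. The sketch assumes a strict ordering, so each set of suffixes yields exactly one chain, and it treats every subset as allowed. The inventory size of 60 is hypothetical, not a count of real Quechua suffixes. Suffixes that can appear in any order would push the numbers higher, while grammatical restrictions would lower them.

```python
from math import comb

def ordered_suffix_chains(n_suffixes, max_chain):
    """Count chains of 1..max_chain distinct suffixes in one fixed order."""
    return sum(comb(n_suffixes, k) for k in range(1, max_chain + 1))

N = 60  # hypothetical inventory size
for limit in (3, 6, 9):
    print(limit, ordered_suffix_chains(N, limit))
```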
Hunspell is much better than Aspell and Ispell because it allows an infix and 2 levels of suffixes, but it is still woefully inadequate for languages like Quechua.
Another problem with Hunspell is the lack of a “sounds like” feature, as found in Aspell. In Quechua, K, K', KH, Q, Q', QH are all easily confusable letters, along with their Spanish equivalents: QU and C. In order to spellcheck Quechua properly, a “sounds like” function needs to be added to Hunspell. The code for “sounds like” shouldn't be that difficult to write, but the Hunspell code looks pretty complicated and I haven't figured it out.
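As an illustration of the kind of function meant here, the sketch below collapses the confusable spellings listed above into one code, so that words differing only in those letters share a key. It is not Aspell's or Hunspell's algorithm. The choice to leave “ch” alone and to handle only the ASCII apostrophe are assumptions.

```python
import re
from collections import defaultdict

# Longer spellings come first so "qh" is matched before "q".
# "c" is skipped when followed by "h" (assumption: "ch" is a separate sound).
_CONFUSABLE = re.compile(r"kh|qh|qu|k'|q'|k|q|c(?!h)")

def sounds_like_key(word):
    """Map K, K', KH, Q, Q', QH, QU and C to the single code 'K'."""
    return _CONFUSABLE.sub("K", word.lower())

def build_index(words):
    """Group dictionary words by key so suggestions can be looked up."""
    index = defaultdict(list)
    for w in words:
        index[sounds_like_key(w)].append(w)
    return index

index = build_index(["qhapaq", "chaki"])
print(index[sounds_like_key("capac")])
```

A misspelling is then looked up by its key, and every dictionary word stored under the same key becomes a suggestion.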
Note: Kevin Scannell, who was one of the developers of Aspell, contacted me to say that he intended to add the Aspell “SoundsLike” function to Hunspell. I hope that he finds the time to do this, since it will help me greatly. Kevin also noted that the metaphone function in Hunspell could act like SoundsLike to some degree.
If you are interested in learning more about the challenges of a Quechua spellchecker, see this note I wrote to the Hunspell developer explaining our difficulties.
The issue of agglutinative languages is quite interesting for me, since we are working on spellchecking in Zulu, which is also of this nature. We are now starting work on a program to help people review word lists and identify word roots. This is all done under the assumption that identifying roots is the most important part of the work on words and word lists, and that, combined with proper affix rules (developed separately), it lets us create a usable spell checker for an agglutinative language.
People interested can have a look here at our ideas for how this should work. This is currently really meant to be a small project that can be implemented quite quickly.
Kevin Scannell can generate word frequency lists and other useful statistics for many languages (more than 400 as of May 2008) using his web crawling software An Crúbadán. There's a good chance your language is already supported if it has a non-trivial presence on the web. Contact him (kscanne at gmail dot com) if you are beginning development of a spell checker and plan on releasing it under an open source license.