I have a corpus of text that contains some strings. Some of these strings are English words and some are random, such as VmsVKmGMY6eQE4eMI; there is no limit on the number of characters in each string.
Is there any way to test whether a string is an English word? I am looking for some kind of algorithm that does the job. This is in Java, and I would rather not implement an extra dictionary.
Answer
If you mean some kind of rule of thumb that distinguishes English words from random text, there is none. For reasonable accuracy you will need to query an external source, whether that is the Web, a dictionary, or a service.
If you only need to check whether a word exists, I would suggest WordNet. It is fairly simple to use, and there is a nice Java API for it called JWNL that makes querying the WordNet dictionary a breeze.
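To make the "query an external source" idea concrete, here is a minimal sketch that does not use JWNL at all. It assumes you have a plain word list with one word per line, and it checks membership in a hash set. The class name `WordListChecker` and its methods are hypothetical names chosen for illustration, not part of WordNet or JWNL.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: answers "is this string in my word list?".
// It is only as good as the list you load into it.
final class WordListChecker {
    private final Set<String> words = new HashSet<>();

    // Build the checker from any collection of words (null and blank entries are skipped).
    WordListChecker(Collection<String> entries) {
        for (String entry : entries) {
            if (entry == null) {
                continue;
            }
            String normalized = normalize(entry);
            if (!normalized.isEmpty()) {
                words.add(normalized);
            }
        }
    }

    // Assumption: the file is UTF-8 text with one word per line.
    static WordListChecker fromFile(Path wordList) throws IOException {
        return new WordListChecker(Files.readAllLines(wordList, StandardCharsets.UTF_8));
    }

    // Case-insensitive exact match; no stemming, so inflected forms
    // match only if the list itself contains them.
    boolean isKnownWord(String candidate) {
        return candidate != null && words.contains(normalize(candidate));
    }

    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }
}
```

Usage would look something like `WordListChecker.fromFile(Paths.get("/usr/share/dict/words")).isKnownWord("apple")`, where the path is an assumption (many Unix-like systems ship a list there, but yours may not). A lookup like this treats anything missing from the list as "not a word", so proper nouns, plurals, and rare words can be misclassified; WordNet has the same kind of limitation, just with a larger vocabulary.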