How to tell whether a string is randomly generated or plausibly an English word?

Question

I have a corpus of text that contains some strings. Some of these strings are English words and some are random, such as VmsVKmGMY6eQE4eMI; there is no limit on the number of characters in each string. Is there any way to test whether a string is an English word? I am looking for some kind of algorithm…

Accepted Answer

If you mean a rule of thumb that distinguishes an English word from random text, there is none. For reasonable accuracy you will need to query an external source, whether it's the Web, a dictionary, or a service. If you only need to check whether the word exists, I would suggest WordNet. It is pretty simple to use, and there is a nice Java API for it called JWNL that makes querying the WordNet dictionary a breeze.
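To illustrate the dictionary-lookup idea in general terms, here is a minimal Python sketch. It is not the JWNL or WordNet API: the names `SAMPLE_WORDS`, `load_words`, and `is_known_word` are hypothetical, and the tiny inline word set only stands in for a real dictionary source that you would load yourself.

```python
from pathlib import Path

# Hypothetical stand-in for a real dictionary; in practice this set would be
# loaded from an external source such as a word list file.
SAMPLE_WORDS = frozenset({"hello", "world", "corpus", "string", "random"})


def load_words(path):
    """Read a plain-text word list (one word per line) into a lowercase set.

    The file location and format are assumptions for this sketch.
    """
    text = Path(path).read_text(encoding="utf-8")
    return frozenset(
        line.strip().lower() for line in text.splitlines() if line.strip()
    )


def is_known_word(candidate, words=SAMPLE_WORDS):
    """Return True if the candidate, trimmed and lowercased, is in the word set."""
    return candidate.strip().lower() in words


if __name__ == "__main__":
    for s in ("Hello", "VmsVKmGMY6eQE4eMI"):
        print(s, is_known_word(s))
```

A set lookup like this only answers "is this exact string in my dictionary", so its accuracy depends entirely on how complete the word list is; inflected forms, names, and misspellings will be reported as unknown unless the source covers them.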

Answer