Lemmatization with Apache Lucene

Question

I’m developing a text analysis project using Apache Lucene. I need to lemmatize some text (transform the words to their canonical forms). I’ve already written the code that performs stemming. Using it, I am able to convert the following sentence: The stem is the part of the word that never changes eve…

Accepted Answer

In case someone still needs it, I decided to return to this question and illustrate how to use the russianmorphology library I found earlier to lemmatize English and Russian text.

First of all, you will need these dependencies besides lucene-core (listed as groupId, artifactId, version):

    org.apache.lucene.morphology  russian  1.1
    org.apache.lucene.morphology  english  1.1
    org.apache.lucene.morphology  morph    1.1

Then, make sure you import the right analyzers:

    import org.apache.lucene.morphology.english.EnglishAnalyzer;
    import org.apache.lucene.morphology.russian.RussianAnalyzer;

These analyzers, unlike the standard Lucene analyzers, use MorphologyFilter, which converts each word into a set of its normal forms.

So if you use the following code

    // Additional imports needed by this snippet:
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // Inner quotes must be escaped inside a Java string literal.
    String text = "The stem is the part of the word that never changes even when morphologically inflected; a lemma is the base form of the word. For example, from \"produced\", the lemma is \"produce\", but the stem is \"produc-\". This is because there are words such as production";
    Analyzer analyzer = new EnglishAnalyzer();
    TokenStream stream = analyzer.tokenStream("field", text);
    stream.reset();
    // Each token emitted by the stream is one normal form of a word.
    while (stream.incrementToken()) {
        String lemma = stream.getAttribute(CharTermAttribute.class).toString();
        System.out.print(lemma + " ");
    }
    stream.end();
    stream.close();

it will print

    the stem be the part of the word that never change even when
    morphologically inflected inflect a lemma be the base form of the word
    for example from produced produce the lemma be produce but the stem be
    produc this be because there are be word such as production

And for the Russian text

    String text = "Продолжаю цикл постов об астрологии и науке. Астрология не имеет научного обоснования, но является частью истории науки, частью культуры и общественного сознания. "
            + "Поэтому астрологический взгляд на науку весьма интересен.";

the RussianAnalyzer will print the following:

    продолжать цикл пост об астрология и наука астрология не иметь научный
    обоснование но являться часть частью история наука часть частью
    культура и общественный сознание поэтому астрологический взгляд на
    наука весьма интересный

You may notice that some words have more than one base form, e.g. inflected is converted to [inflected, inflect]. If you don’t like this behaviour, you would have to change the implementation of org.apache.lucene.morphology.analyzer.MorphologyFilter (if you are interested in how exactly to do it, let me know and I’ll elaborate on this).

Hope it helps, good luck!
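For the case where a word comes back with more than one base form, one alternative to modifying the filter is to post-process the forms yourself. The following is only a minimal sketch in plain Java that does not call the library at all. It assumes you have already collected, for each original word, the list of normal forms produced for it. The class name SingleLemma, the method pickLemma, and the "prefer a form that differs from the original word" heuristic are hypothetical and introduced here purely for illustration; they are not part of Lucene or russianmorphology.

```java
import java.util.List;

// Hypothetical helper, not part of Lucene or russianmorphology.
class SingleLemma {

    // Chooses one form from the normal forms reported for a single word.
    // Assumed heuristic: prefer the first form that differs from the
    // original word (e.g. "inflect" over "inflected"); otherwise keep the
    // first form, or the word itself if no forms were given.
    static String pickLemma(String word, List<String> normalForms) {
        for (String form : normalForms) {
            if (!form.equals(word)) {
                return form;
            }
        }
        return normalForms.isEmpty() ? word : normalForms.get(0);
    }

    public static void main(String[] args) {
        // Candidate lists taken from the outputs shown in the answer.
        System.out.println(pickLemma("inflected", List.of("inflected", "inflect")));
        System.out.println(pickLemma("are", List.of("are", "be")));
        System.out.println(pickLemma("word", List.of("word")));
    }
}
```

With the candidate lists above this picks inflect, be and word. How to group the emitted forms by original word depends on how the filter exposes them, which this sketch does not cover, and the heuristic can choose badly for words whose correct lemma is the word itself.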

Answer