Lemmatization with Apache Lucene




I’m developing a text analysis project using Apache Lucene. I need to lemmatize some text (transform the words into their canonical forms). I’ve already written code that performs stemming. Using it, I am able to convert the following sentence

The stem is the part of the word that never changes even when morphologically inflected; a lemma is the base form of the word. For example, from “produced”, the lemma is “produce”, but the stem is “produc-”. This is because there are words such as production

into

stem part word never chang even when morpholog inflect lemma base form word exampl from produc lemma produc stem produc becaus word product

However, I need to get the base forms of the words: example instead of exampl, produce instead of produc, and so on.

I am using Lucene because it has analyzers for many languages (I need at least English and Russian). I know about the Stanford NLP library, but it has no Russian language support.

So is there any way to do lemmatization for several languages, the same way I do stemming with Lucene?

The simplified version of my code responsible for stemming:

// Use Apache Tika to identify the language
LanguageIdentifier identifier = new LanguageIdentifier(text);
// Get the analyzer for that language (e.g. EnglishAnalyzer for 'en')
Analyzer analyzer = getAnalyzer(identifier.getLanguage());
TokenStream stream = analyzer.tokenStream("field", text);
// Fetch the term attribute once; it is updated in place by incrementToken()
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    String stem = term.toString();
    // do something with the stem
    System.out.print(stem + " ");
}
stream.end();
stream.close();
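
The getAnalyzer helper is not shown above. As a rough sketch of what such a lookup could look like, it can be a small table from language code to analyzer factory. Everything in it is hypothetical (the AnalyzerRegistry name, the register and forLanguage methods, and the type parameter A, which stands in for Lucene's Analyzer), and it is kept free of Lucene imports so that it compiles on its own:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

/**
 * Hypothetical lookup table from a language code ("en", "ru", ...) to a factory.
 * The type parameter A stands in for org.apache.lucene.analysis.Analyzer.
 */
final class AnalyzerRegistry<A> {
    private final Map<String, Supplier<A>> factories = new HashMap<>();
    private final Supplier<A> fallback;

    AnalyzerRegistry(Supplier<A> fallback) {
        this.fallback = fallback;
    }

    /** Registers a factory for one language code; returns this for chaining. */
    AnalyzerRegistry<A> register(String languageCode, Supplier<A> factory) {
        factories.put(normalize(languageCode), factory);
        return this;
    }

    /** Creates an analyzer for the code, or the fallback one if the code is unknown or null. */
    A forLanguage(String languageCode) {
        if (languageCode == null) {
            return fallback.get();
        }
        return factories.getOrDefault(normalize(languageCode), fallback).get();
    }

    private static String normalize(String code) {
        return code.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Strings stand in for analyzers so the sketch runs without Lucene.
        AnalyzerRegistry<String> registry = new AnalyzerRegistry<String>(() -> "standard")
                .register("en", () -> "english")
                .register("ru", () -> "russian");
        System.out.println(registry.forLanguage("ru")); // russian
        System.out.println(registry.forLanguage("de")); // standard (fallback)
    }
}
```

With Lucene on the classpath, the registrations would be something like register("en", EnglishAnalyzer::new) (the constructor may need arguments depending on the Lucene version), and the analyzer would be obtained with registry.forLanguage(identifier.getLanguage()).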

UPDATE: I found a library that does almost what I need (for English and Russian) and uses Apache Lucene (although in its own way). It’s definitely worth exploring.

Answer

In case someone still needs it, I decided to return to this question and illustrate how to use the russianmorphology library I found earlier to do lemmatization for English and Russian.

First of all, you will need these dependencies (besides lucene-core):

<!-- if you need Russian -->
<dependency>
    <groupId>org.apache.lucene.morphology</groupId>
    <artifactId>russian</artifactId>
    <version>1.1</version>
</dependency>

<!-- if you need English -->
<dependency>
    <groupId>org.apache.lucene.morphology</groupId>
    <artifactId>english</artifactId>
    <version>1.1</version>
</dependency>

<dependency>
    <groupId>org.apache.lucene.morphology</groupId>
    <artifactId>morph</artifactId>
    <version>1.1</version>
</dependency>

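Note that these artifacts are hosted in the CUBA repository (https://dl.bintray.com/cuba-platform/main/).

Then, make sure you import the right analyzers. Their class names are the same as those of Lucene's own EnglishAnalyzer and RussianAnalyzer; only the packages differ: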

import org.apache.lucene.morphology.english.EnglishAnalyzer;
import org.apache.lucene.morphology.russian.RussianAnalyzer;

These analyzers, unlike the standard Lucene analyzers, use a MorphologyFilter, which converts each word into the set of its normal forms.

So if you use the following code

// Inner double quotes must be escaped inside a Java string literal
String text = "The stem is the part of the word that never changes even when morphologically inflected; a lemma is the base form of the word. For example, from \"produced\", the lemma is \"produce\", but the stem is \"produc-\". This is because there are words such as production";
Analyzer analyzer = new EnglishAnalyzer();
TokenStream stream = analyzer.tokenStream("field", text);
// Fetch the term attribute once; it is updated in place by incrementToken()
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    String lemma = term.toString();
    System.out.print(lemma + " ");
}
stream.end();
stream.close();

it will print

the stem be the part of the word that never change even when morphologically inflected inflect a lemma be the base form of the word for example from produced produce the lemma be produce but the stem be produc this be because there are be word such as production

And for the Russian text

String text = "Продолжаю цикл постов об астрологии и науке. Астрология не имеет научного обоснования, но является частью истории науки, частью культуры и общественного сознания. Поэтому астрологический взгляд на науку весьма интересен.";

the same loop with RussianAnalyzer will print the following:

продолжать цикл пост об астрология и наука астрология не иметь научный обоснование но являться часть частью история наука часть частью культура и общественный сознание поэтому астрологический взгляд на наука весьма интересный

You may notice that some words have more than one base form, e.g. inflected is converted to [inflected, inflect]. If you don’t like this behaviour, you will have to change the implementation of org.apache.lucene.morphology.analyzer.MorphologyFilter (if you are interested in how exactly to do it, let me know and I’ll elaborate on this).
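
For anyone who only needs one base form per word, here is a rough sketch of the selection step. It does not touch MorphologyFilter itself. It assumes you already have the list of normal forms for a token (for instance [inflected, inflect]) and picks one with a simple heuristic: take the first form that differs from the original word, otherwise keep the word. The LemmaPicker class and the pickLemma method are hypothetical names, and the heuristic is only an assumption that happens to fit the examples above; it can pick the wrong form for genuinely ambiguous words.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper that reduces a list of normal forms to a single lemma. */
final class LemmaPicker {
    private LemmaPicker() {}

    /**
     * Returns the first normal form that differs from the lower-cased word,
     * or the lower-cased word itself if there is no such form.
     */
    static String pickLemma(String word, List<String> normalForms) {
        String surface = word.toLowerCase(Locale.ROOT);
        for (String form : normalForms) {
            if (!form.equals(surface)) {
                return form;
            }
        }
        return surface;
    }

    public static void main(String[] args) {
        System.out.println(pickLemma("inflected", Arrays.asList("inflected", "inflect"))); // inflect
        System.out.println(pickLemma("частью", Arrays.asList("часть", "частью")));          // часть
        System.out.println(pickLemma("stem", Arrays.asList("stem")));                       // stem
    }
}
```

The same rule could be applied inside a modified filter, or afterwards on the collected output if you keep track of which tokens belong to the same word.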

Hope it helps, good luck!



Source: stackoverflow