Hibernate Search with Lucene does not index similar names correctly

I’m learning Hibernate Search 6.1.3.Final with Lucene 8.11.1 as the backend and Spring Boot 2.6.6. I’m trying to create a search for product names, barcodes and manufacturers. Currently, I’m writing an integration test to see what happens when a couple of products have similar names:

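A simplified sketch of the test (the entity, constructor and setup details here are illustrative, not the exact code):

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.util.List;

import javax.persistence.EntityManager;

import org.hibernate.search.mapper.orm.Search;
import org.hibernate.search.mapper.orm.session.SearchSession;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.transaction.annotation.Transactional;

@SpringBootTest
@Transactional
class ProductSearchIT {

    @Autowired
    private EntityManager entityManager;

    @Test
    void searchByName_findsProductsWithSimilarNames() {
        entityManager.persist(new Product("TobaCcO GreEN", "1111111111111", "Nicotine Inc."));
        entityManager.persist(new Product("TobaCcO GreENhouse", "2222222222222", "Nicotine Inc."));
        entityManager.flush();

        SearchSession searchSession = Search.session(entityManager);
        // Write the pending index changes now instead of waiting for a commit
        searchSession.indexingPlan().execute();

        List<Product> hits = searchSession.search(Product.class)
                .where(f -> f.match()
                        .fields("name", "barcode", "manufacturer")
                        .matching("green")
                        .fuzzy())
                .fetch(10)
                .hits();

        // Expected: both tobaccos; observed: only "TobaCcO GreEN"
        assertThat(hits).hasSize(2);
    }
}
```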

As you can see in the test, I expect to obtain the two tobaccos with similar names, tobaccoGreen and tobaccoGreenhouse, by using green as the query for the search criteria. The entity is the following:

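A sketch of the entity (the id mapping and constructors are illustrative):

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

import org.hibernate.search.mapper.pojo.mapping.definition.annotation.FullTextField;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.Indexed;

@Entity
@Indexed
public class Product {

    @Id
    @GeneratedValue
    private Long id;

    @FullTextField(analyzer = "name")
    private String name;

    @FullTextField(analyzer = "name")
    private String barcode;

    @FullTextField(analyzer = "name")
    private String manufacturer;

    protected Product() {
        // required by JPA
    }

    public Product(String name, String barcode, String manufacturer) {
        this.name = name;
        this.barcode = barcode;
        this.manufacturer = manufacturer;
    }

    // getters and setters omitted
}
```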

I have followed the docs and configured an analyzer for names:

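A sketch of the configurer, along the lines of the Hibernate Search documentation (the exact token filters are illustrative; the configurer also has to be registered, e.g. via the spring.jpa.properties.hibernate.search.backend.analysis.configurer property):

```java
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurationContext;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurer;

public class CustomLuceneAnalysisConfigurer implements LuceneAnalysisConfigurer {

    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        // "name": split on word boundaries, lowercase, fold accents
        context.analyzer("name").custom()
                .tokenizer(StandardTokenizerFactory.class)
                .tokenFilter(LowerCaseFilterFactory.class)
                .tokenFilter(ASCIIFoldingFilterFactory.class);
    }
}
```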

And I am using a simple query with the fuzzy option:

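A sketch of the query (term is the user-provided text, “green” in the test above; searchSession is obtained from Search.session(entityManager)):

```java
List<Product> hits = searchSession.search(Product.class)
        .where(f -> f.match()
                .fields("name", "barcode", "manufacturer")
                .matching(term)   // "green" in the test above
                .fuzzy())
        .fetch(10)
        .hits();
```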

The test shows that the search is only able to find tobaccoGreen but not tobaccoGreenhouse, and I don’t understand why. How can I search for similar product names (or barcodes and manufacturers)?


Answer

Before I answer your question, I’d like to point out that calling .fetch(10).hits() is suboptimal, especially when using the default sort (like you do):

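A condensed sketch of that pattern:

```java
List<Product> hits = searchSession.search(Product.class)
        .where(f -> f.match()
                .fields("name", "barcode", "manufacturer")
                .matching(term)
                .fuzzy())
        .fetch(10)  // builds a full SearchResult, including the total hit count...
        .hits();    // ...which is then thrown away
```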

If you call .fetchHits(10) directly, Lucene will be able to skip part of the search (the part where it counts the total hit count), and in large indexes this could lead to sizeable performance gains. So, do this instead:

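A sketch of the same query fetching only the hits:

```java
List<Product> hits = searchSession.search(Product.class)
        .where(f -> f.match()
                .fields("name", "barcode", "manufacturer")
                .matching(term)
                .fuzzy())
        .fetchHits(10); // no total hit count is computed
```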

Now, the actual answer:


Approaching this through the search query

.fuzzy() isn’t magic; it won’t just match anything you think should match 🙂 There’s a specific definition of what it does, and that’s not what you want here.

To get the behavior you want, you could use this instead of your current predicate:

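For example, a sketch using the simple query string predicate over the same fields (terms is the user-provided query string):

```java
List<Product> hits = searchSession.search(Product.class)
        .where(f -> f.simpleQueryString()
                .fields("name", "barcode", "manufacturer")
                .matching(terms))  // e.g. "green*" to match all words starting with "green"
        .fetchHits(10);
```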

You lose fuzziness, but you get the ability to perform prefix queries, which would give the results you want (green* would match greenhouse).

However, prefix queries are explicit: the user must add * after “green” in order to match “all words that start with green”.

Which leads us to…

Approaching this through analyzers

If you want this “prefix matching” behavior to be automatic, without the need to add * in the query, then what you need is a different analyzer.

Your current analyzer breaks down indexed text using space as a separator (more or less; it’s a bit more complex but that’s the idea). But you apparently want it to break down “greenhouse” into “green” and “house”; that’s the only way a query with the word “green” would match the word “greenhouse”.

To do that, you can use an analyzer similar to yours, but with an additional “edge_ngram” filter, to generate additional indexed tokens for every prefix string of your existing tokens.

Add another analyzer to your configurer:

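A sketch of such an analyzer, added inside the configure(...) method shown earlier; the minGramSize/maxGramSize values and the preserveOriginal parameter (which also keeps the untruncated token) are assumptions chosen to match the token list shown further down:

```java
// requires org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory
context.analyzer("name_prefix").custom()
        .tokenizer(StandardTokenizerFactory.class)
        .tokenFilter(LowerCaseFilterFactory.class)
        .tokenFilter(ASCIIFoldingFilterFactory.class)
        .tokenFilter(EdgeNGramFilterFactory.class)
                .param("minGramSize", "2")
                .param("maxGramSize", "7")
                .param("preserveOriginal", "true"); // also index the full token
```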

And change your mapping to use the name analyzer when querying, but the name_prefix analyzer when indexing:

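A sketch for the three fields of the entity above:

```java
@FullTextField(analyzer = "name_prefix", searchAnalyzer = "name")
private String name;

@FullTextField(analyzer = "name_prefix", searchAnalyzer = "name")
private String barcode;

@FullTextField(analyzer = "name_prefix", searchAnalyzer = "name")
private String manufacturer;
```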

Now reindex your data.
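One way to do that is the mass indexer, which rebuilds the index from the database and blocks until it is done (a sketch):

```java
SearchSession searchSession = Search.session(entityManager);
searchSession.massIndexer(Product.class)
        .startAndWait(); // throws InterruptedException
```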

Now your query “green” will also match “TobaCcO GreENhouse”, because “GreENhouse” was indexed as ["greenhouse", "gr", "gre", "gree", "green", "greenh", "greenho"].

Variations

edgeNGram filter on distinct fields

Instead of changing the analyzer of your current fields, you could add new fields for the same Java properties, but using the new analyzer with the edgeNGram filter:

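A sketch; the *_prefix field names are arbitrary choices:

```java
@FullTextField(analyzer = "name")
@FullTextField(name = "name_prefix", analyzer = "name_prefix", searchAnalyzer = "name")
private String name;

@FullTextField(analyzer = "name")
@FullTextField(name = "barcode_prefix", analyzer = "name_prefix", searchAnalyzer = "name")
private String barcode;

@FullTextField(analyzer = "name")
@FullTextField(name = "manufacturer_prefix", analyzer = "name_prefix", searchAnalyzer = "name")
private String manufacturer;
```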

Then you can target these fields as well as the normal ones in your query:

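A sketch of such a query; the boost value is arbitrary:

```java
List<Product> hits = searchSession.search(Product.class)
        .where(f -> f.simpleQueryString()
                .fields("name", "barcode", "manufacturer").boost(5.0f)
                .fields("name_prefix", "barcode_prefix", "manufacturer_prefix")
                .matching(terms))
        .fetchHits(10);
```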

As you can see, I added a boost to the fields that don’t use prefix. This is the main advantage of this variant over the one I explained higher up: matches on actual words (not prefixes) will be deemed more important, yielding a better score and thus pulling documents to the top of the result list if you use a relevance sort (which is the default sort).

Handling only compound words instead of all words

I won’t detail it here, but there’s another approach if all you want is to handle compound words (“greenhouse” => “green” + “house”, “superman” => “super” + “man”, etc.). You can use the “dictionaryCompoundWord” filter. This is less powerful, but will generate less noise in your index (fewer meaningless tokens) and thus could lead to better relevance sorts. Another downside is that you need to provide the filter with a dictionary that contains all words that could possibly be “compounded”. For more information, see the source and javadoc of class org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilterFactory, or the documentation of the equivalent filter in Elasticsearch.
