Hibernate Search with Lucene does not index similar names correctly

I’m learning Hibernate Search 6.1.3.Final with Lucene 8.11.1 as the backend and Spring Boot 2.6.6. I’m trying to create a search for product names, barcodes and manufacturers. Currently, I’m writing an integration test to see what happens when a couple of products have similar names:

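A simplified sketch of the test (the entity, constructor and setup details here are illustrative, not the exact code):

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.util.List;

import javax.persistence.EntityManager;

import org.hibernate.search.mapper.orm.Search;
import org.hibernate.search.mapper.orm.session.SearchSession;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.transaction.annotation.Transactional;

@SpringBootTest
@Transactional
class ProductSearchIT {

    @Autowired
    private EntityManager entityManager;

    @Test
    void searchByName_findsProductsWithSimilarNames() {
        entityManager.persist(new Product("TobaCcO GreEN", "1111111111111", "Nicotine Inc."));
        entityManager.persist(new Product("TobaCcO GreENhouse", "2222222222222", "Nicotine Inc."));
        entityManager.flush();

        SearchSession searchSession = Search.session(entityManager);
        // Write the pending index changes now instead of waiting for a commit
        searchSession.indexingPlan().execute();

        List<Product> hits = searchSession.search(Product.class)
                .where(f -> f.match()
                        .fields("name", "barcode", "manufacturer")
                        .matching("green")
                        .fuzzy())
                .fetch(10)
                .hits();

        // Expected: both tobaccos; observed: only "TobaCcO GreEN"
        assertThat(hits).hasSize(2);
    }
}
```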

As you can see in the test, I expect to obtain the two tobaccos with similar names, tobaccoGreen and tobaccoGreenhouse, by using green as the query for the search criteria. The entity is the following:

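A sketch of the entity (the id mapping and constructors are illustrative):

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

import org.hibernate.search.mapper.pojo.mapping.definition.annotation.FullTextField;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.Indexed;

@Entity
@Indexed
public class Product {

    @Id
    @GeneratedValue
    private Long id;

    @FullTextField(analyzer = "name")
    private String name;

    @FullTextField(analyzer = "name")
    private String barcode;

    @FullTextField(analyzer = "name")
    private String manufacturer;

    protected Product() {
        // required by JPA
    }

    public Product(String name, String barcode, String manufacturer) {
        this.name = name;
        this.barcode = barcode;
        this.manufacturer = manufacturer;
    }

    // getters and setters omitted
}
```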

I have followed the docs and configured an analyzer for names:

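A sketch of the configurer, along the lines of the Hibernate Search documentation (the exact token filters are illustrative; the configurer also has to be registered, e.g. via the spring.jpa.properties.hibernate.search.backend.analysis.configurer property):

```java
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurationContext;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurer;

public class CustomLuceneAnalysisConfigurer implements LuceneAnalysisConfigurer {

    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        // "name": split on word boundaries, lowercase, fold accents
        context.analyzer("name").custom()
                .tokenizer(StandardTokenizerFactory.class)
                .tokenFilter(LowerCaseFilterFactory.class)
                .tokenFilter(ASCIIFoldingFilterFactory.class);
    }
}
```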

And I am using a simple query with the fuzzy option:

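A sketch of the query (term is the user-provided text, “green” in the test above; searchSession is obtained from Search.session(entityManager)):

```java
List<Product> hits = searchSession.search(Product.class)
        .where(f -> f.match()
                .fields("name", "barcode", "manufacturer")
                .matching(term)   // "green" in the test above
                .fuzzy())
        .fetch(10)
        .hits();
```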

The test shows that the search is only able to find tobaccoGreen but not tobaccoGreenhouse, and I don’t understand why. How can I search for similar product names (or barcodes and manufacturers)?


Answer

Before I answer your question, I’d like to point out that calling .fetch(10).hits() is suboptimal, especially when using the default sort (like you do):

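A condensed sketch of that pattern:

```java
List<Product> hits = searchSession.search(Product.class)
        .where(f -> f.match()
                .fields("name", "barcode", "manufacturer")
                .matching(term)
                .fuzzy())
        .fetch(10)  // builds a full SearchResult, including the total hit count...
        .hits();    // ...which is then thrown away
```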

If you call .fetchHits(10) directly, Lucene will be able to skip part of the search (the part where it counts the total hit count), and in large indexes this could lead to sizeable performance gains. So, do this instead:

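A sketch of the same query fetching only the hits:

```java
List<Product> hits = searchSession.search(Product.class)
        .where(f -> f.match()
                .fields("name", "barcode", "manufacturer")
                .matching(term)
                .fuzzy())
        .fetchHits(10); // no total hit count is computed
```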

Now, the actual answer:


Approaching this through the search query

.fuzzy() isn’t magic; it won’t just match anything you think should match 🙂 There’s a specific definition of what it does, and that’s not what you want here.

To get the behavior you want, you could use this instead of your current predicate:

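For example, a sketch using the simple query string predicate over the same fields (terms is the user-provided query string):

```java
List<Product> hits = searchSession.search(Product.class)
        .where(f -> f.simpleQueryString()
                .fields("name", "barcode", "manufacturer")
                .matching(terms))  // e.g. "green*" to match all words starting with "green"
        .fetchHits(10);
```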

You lose fuzziness, but you get the ability to perform prefix queries, which would give the results you want (green* would match greenhouse).

However, prefix queries are explicit: the user must add * after “green” in order to match “all words that start with green”.

Which leads us to…

Approaching this through analyzers

If you want this “prefix matching” behavior to be automatic, without the need to add * in the query, then what you need is a different analyzer.

Your current analyzer breaks down indexed text using space as a separator (more or less; it’s a bit more complex but that’s the idea). But you apparently want it to break down “greenhouse” into “green” and “house”; that’s the only way a query with the word “green” would match the word “greenhouse”.

To do that, you can use an analyzer similar to yours, but with an additional “edge_ngram” filter, to generate additional indexed tokens for every prefix string of your existing tokens.

Add another analyzer to your configurer:

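A sketch of such an analyzer, added inside the configure(...) method shown earlier; the minGramSize/maxGramSize values and the preserveOriginal parameter (which also keeps the untruncated token) are assumptions chosen to match the token list shown further down:

```java
// requires org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory
context.analyzer("name_prefix").custom()
        .tokenizer(StandardTokenizerFactory.class)
        .tokenFilter(LowerCaseFilterFactory.class)
        .tokenFilter(ASCIIFoldingFilterFactory.class)
        .tokenFilter(EdgeNGramFilterFactory.class)
                .param("minGramSize", "2")
                .param("maxGramSize", "7")
                .param("preserveOriginal", "true"); // also index the full token
```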

And change your mapping to use the name analyzer when querying, but the name_prefix analyzer when indexing:

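A sketch for the three fields of the entity above:

```java
@FullTextField(analyzer = "name_prefix", searchAnalyzer = "name")
private String name;

@FullTextField(analyzer = "name_prefix", searchAnalyzer = "name")
private String barcode;

@FullTextField(analyzer = "name_prefix", searchAnalyzer = "name")
private String manufacturer;
```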

Now reindex your data.
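One way to do that is the mass indexer, which rebuilds the index from the database and blocks until it is done (a sketch):

```java
SearchSession searchSession = Search.session(entityManager);
searchSession.massIndexer(Product.class)
        .startAndWait(); // throws InterruptedException
```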

Now your query “green” will also match “TobaCcO GreENhouse”, because “GreENhouse” was indexed as ["greenhouse", "gr", "gre", "gree", "green", "greenh", "greenho"].

Variations

edgeNGram filter on distinct fields

Instead of changing the analyzer of your current fields, you could add new fields for the same Java properties, but using the new analyzer with the edgeNGram filter:

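A sketch; the *_prefix field names are arbitrary choices:

```java
@FullTextField(analyzer = "name")
@FullTextField(name = "name_prefix", analyzer = "name_prefix", searchAnalyzer = "name")
private String name;

@FullTextField(analyzer = "name")
@FullTextField(name = "barcode_prefix", analyzer = "name_prefix", searchAnalyzer = "name")
private String barcode;

@FullTextField(analyzer = "name")
@FullTextField(name = "manufacturer_prefix", analyzer = "name_prefix", searchAnalyzer = "name")
private String manufacturer;
```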

Then you can target these fields as well as the normal ones in your query:

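A sketch of such a query; the boost value is arbitrary:

```java
List<Product> hits = searchSession.search(Product.class)
        .where(f -> f.simpleQueryString()
                .fields("name", "barcode", "manufacturer").boost(5.0f)
                .fields("name_prefix", "barcode_prefix", "manufacturer_prefix")
                .matching(terms))
        .fetchHits(10);
```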

As you can see, I added a boost to the fields that don’t use prefix. This is the main advantage of this variant over the one I explained higher up: matches on actual words (not prefixes) will be deemed more important, yielding a better score and thus pulling documents to the top of the result list if you use a relevance sort (which is the default sort).

Handling only compound words instead of all words

I won’t detail it here, but there’s another approach if all you want is to handle compound words (“greenhouse” => “green” + “house”, “superman” => “super” + “man”, etc.). You can use the “dictionaryCompoundWord” filter. This is less powerful, but will generate less noise in your index (fewer meaningless tokens) and thus could lead to better relevance sorts. Another downside is that you need to provide the filter with a dictionary that contains all words that could possibly be “compounded”. For more information, see the source and javadoc of class org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilterFactory, or the documentation of the equivalent filter in Elasticsearch.
