Why is Lucene sometimes not matching InChIKeys?

I have indexed my database using Hibernate Search. I use a custom analyzer, both for indexing and for querying. I have a field called inchikey that should not get tokenized. Example values are:

  • BBBAWACESCACAP-UHFFFAOYSA-N
  • KEZLDSPIRVZOKZ-AUWJEWJLSA-N

When I look into my index with Luke I can confirm that they are not tokenized, as required.

However, when I try to search for them using the web app, some inchikeys are found and others are not. Curiously, for the inchikeys that are not found, the search DOES work when I replace the last hyphen with a space, like so: BBBAWACESCACAP-UHFFFAOYSA N

I have not been able to find a common element in the inchikeys that are not found.

Any idea what is going on here?

I use a MultiFieldQueryParser to search over the different fields in the database:

    String[] searchfields = Compound.getSearchfields();
    // Same analyzer as used at index time
    MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_29, searchfields, new ChemicalNameAnalyzer());
    //Disable the following if search performance is too slow
    parser.setAllowLeadingWildcard(true);
    // "searchterms" stands in for the query string entered by the user
    FullTextQuery fullTextQuery = fullTextSession.createFullTextQuery(parser.parse("searchterms"), Compound.class);
    List<Compound> hits = fullTextQuery.list();

More details about our setup have been posted here by Tim and me.

Answer

It turns out the last entries in the input file are not being indexed correctly. These ARE being tokenized. In fact, it seems they are indexed twice: once without being tokenized and once with. When I search, the un-tokenized version is not found.
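
To make the symptom concrete, here is a toy sketch of why a key that was tokenized at index time cannot be found by a query that keeps the key whole. This is not Lucene code: the class InchiKeyIndexSketch and its methods are hypothetical, and the splitting rule (break on non-alphanumeric characters, then lower-case) is only an assumption about how a general-purpose analyzer might treat an InChIKey.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy inverted index (NOT Lucene): whole-value terms vs. tokenized terms. */
final class InchiKeyIndexSketch {

    // term -> ids of the documents that contain that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Index the value as one single term, as intended for the inchikey field. */
    void addUntokenized(int docId, String value) {
        postings.computeIfAbsent(value, k -> new HashSet<>()).add(docId);
    }

    /** Index the value after splitting it (assumed fallback-analyzer behaviour). */
    void addTokenized(int docId, String value) {
        for (String term : tokenize(value)) {
            postings.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
        }
    }

    /** Assumption: split on anything that is not a letter or digit, then lower-case. */
    static List<String> tokenize(String value) {
        List<String> terms = new ArrayList<>();
        for (String part : value.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                terms.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    /** Exact term lookup; the query term is used as-is, without analysis. */
    Set<Integer> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        InchiKeyIndexSketch index = new InchiKeyIndexSketch();
        index.addUntokenized(1, "BBBAWACESCACAP-UHFFFAOYSA-N"); // indexed as intended
        index.addTokenized(2, "KEZLDSPIRVZOKZ-AUWJEWJLSA-N");   // indexed by the wrong analyzer

        System.out.println(index.lookup("BBBAWACESCACAP-UHFFFAOYSA-N")); // [1]
        System.out.println(index.lookup("KEZLDSPIRVZOKZ-AUWJEWJLSA-N")); // []
        System.out.println(index.lookup("kezldspirvzokz"));              // [2]
    }
}
```

In this sketch the whole key only matches document 1. Document 2 is reachable only through its fragments, which is consistent with some inchikeys being found and others not, depending on which analyzer handled them at index time.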

I have not yet found the reason, but I suspect it has to do with our parser finishing while Lucene is still indexing the last entries, so that Lucene falls back to the default analyzer (StandardAnalyzer) for them. When I find the culprit I will report back here.

Adding @Analyzer(impl = ChemicalNameAnalyzer.class) to the fields solves the problem, but what I want is my original setup, with the default analyzer defined once, in config, like so:

    <property name="hibernate.search.analyzer">path.to.ChemicalNameAnalyzer</property>
User contributions licensed under: CC BY-SA