
Add weights to documents in Lucene 8

I am currently working on a small search engine for college using Lucene 8. I already built it before, but without applying any weights to documents.

I am now required to add the PageRanks of documents as a weight for each document, and I already computed the PageRank values. How can I add a weight to a Document object (not query terms) in Lucene 8? I looked up many solutions online, but they only work for older versions of Lucene. Example source

Here is my (updated) code that generates a Document object from a File object:

public static Document getDocument(File f) throws FileNotFoundException, IOException {
    Document d = new Document();

    //adding a field
    FieldType contentType = new FieldType();
    contentType.setStored(true);
    contentType.setTokenized(true);
    contentType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    contentType.setStoreTermVectors(true);

    String fileContents = String.join(" ", Files.readAllLines(f.toPath(), StandardCharsets.UTF_8));
    d.add(new Field("content", fileContents, contentType));

    //adding other fields, then...

    //the boost coefficient (updated):
    double coef = 1.0 + ranks.get(path);
    d.add(new DoubleDocValuesField("boost", coef));

    return d;

}

The issue with my current approach is that I would need a CustomScoreQuery object to search the documents, but this is not available in Lucene 8. Also, I don’t want to downgrade now to Lucene 7 after all the code I wrote in Lucene 8.


Edit:

After some (lengthy) research, I added a DoubleDocValuesField to each document holding the boost (see updated code above), and used a FunctionScoreQuery for searching as advised by @EricLavault. However, now all my documents have a score of exactly their boost, regardless of the query! How do I fix that? Here is my searching function:

public static TopDocs search(String query, IndexSearcher searcher, String outputFile) {
    try {
        Query q_temp = buildQuery(query); //the original query, was working fine alone

        Query q = new FunctionScoreQuery(q_temp, DoubleValuesSource.fromDoubleField("boost")); //the new query
        // No manual rewrite here: IndexSearcher.search() rewrites the query itself, and the extra DirectoryReader was never closed.
        TopDocs results = searcher.search(q, 10);

        ScoreDoc[] filterScoreDosArray = results.scoreDocs;
        for (int i = 0; i < filterScoreDosArray.length; ++i) {
            int docId = filterScoreDosArray[i].doc;
            Document d = searcher.doc(docId);

            //here, when printing, I see that the document's score is the same as its "boost" value. WHY??
            System.out.println((i + 1) + ". " + d.get("path")+" Score: "+ filterScoreDosArray[i].score);
        }

        return results;
    }
    catch(Exception e) {
        e.printStackTrace();
        return null;
    }
}

//function that builds the query, working fine
public static Query buildQuery(String query) {
    try {
        PhraseQuery.Builder builder = new PhraseQuery.Builder();
        TokenStream tokenStream = new EnglishAnalyzer().tokenStream("content", query);
        tokenStream.reset();

        while (tokenStream.incrementToken()) {
          CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
          builder.add(new Term("content", charTermAttribute.toString()));
        }

        tokenStream.end(); tokenStream.close();
        builder.setSlop(1000);
        PhraseQuery q = builder.build();

        return q;
    }
    catch(Exception e) {
        e.printStackTrace();
        return null;
    }
}


Answer

Regarding my edited problem (boost value completely replacing search score instead of boosting it), here is what the documentation says about FunctionScoreQuery (emphasis mine):

A query that wraps another query, and uses a DoubleValuesSource to replace or modify the wrapped query’s score.

So, when does it replace, and when does it modify?

It turns out that the constructor I was using replaces the query's score entirely with the boost value:

Query q = new FunctionScoreQuery(q_temp, DoubleValuesSource.fromDoubleField("boost")); //the new query

What I needed instead was the static method boostByValue, which modifies the search score by multiplying it by the boost value:

Query q = FunctionScoreQuery.boostByValue(q_temp, DoubleValuesSource.fromDoubleField("boost"));

And now it works! Thanks @EricLavault for the help!
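
To make the difference concrete, here is a minimal plain-Java sketch of the arithmetic behind the two modes. It does not call Lucene at all: the class BoostModel, its method names, and the sample BM25 and PageRank numbers are hypothetical and only model what the two calls above do when the source is a per-document field such as "boost".

```java
/** Plain-Java model of the two scoring modes; it does not use the Lucene API. */
final class BoostModel {

    /** Models new FunctionScoreQuery(query, fieldSource): the field value becomes the score. */
    static double replaceScore(double queryScore, double boost) {
        return boost; // the query score is ignored
    }

    /** Models FunctionScoreQuery.boostByValue(query, fieldSource): score times boost. */
    static double multiplyScore(double queryScore, double boost) {
        return queryScore * boost;
    }

    /** Same coefficient as in the question: 1.0 + PageRank, so a rank of 0 leaves the score unchanged. */
    static double coefficient(double pageRank) {
        return 1.0 + pageRank;
    }

    public static void main(String[] args) {
        // Hypothetical query scores and PageRank values for two documents.
        double[] bm25 = {4.0, 1.0};
        double[] pageRank = {0.25, 0.5};
        for (int i = 0; i < bm25.length; i++) {
            double coef = coefficient(pageRank[i]);
            System.out.println("doc" + i
                    + " replaced=" + replaceScore(bm25[i], coef)
                    + " multiplied=" + multiplyScore(bm25[i], coef));
        }
    }
}
```

With these made-up numbers, the replace mode ranks doc1 (1.5) above doc0 (1.25) even though doc0 matches the query far better, which is the symptom described in the question. The multiply mode gives 5.0 for doc0 and 1.5 for doc1, so the query score still dominates and the PageRank only adjusts it.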

User contributions licensed under: CC BY-SA