12.21
On my current project we have a large central store of articles. Each article is tagged with terms from a thesaurus and indexed in a Solr collection. There are a couple of issues with getting the expected results from the search.
Scoring
If we have a super simple schema:
- Id
- Title
- Summary
- Tags
.. and add two documents to the store ..
<doc>
<field name="id">FELINE-ARTICLE-1</field>
<field name="title">The biggest feline in the world</field>
<field name="summary">Today a moggy of enormous proportions was found.</field>
<field name="tags">cat dog monkey</field>
</doc>
<doc>
<field name="id">CANINE-ARTICLE-1</field>
<field name="title">The biggest canine in the world</field>
<field name="summary">Today a mutt of enormous proportions was found.</field>
<field name="tags">dog monkey cat</field>
</doc>
.. obviously we have one document that is more about cats than dogs and vice versa, but when we search the tags field for ‘cat’ both results come back with an equal score according to Solr. This is because the two documents contain the same terms; the order of the terms does not affect the score. It means that the dog article could appear before the cat article.
The ideal situation would be to weight tags so that we can say that one tag is more relevant than the others. Luckily, in Solr 1.4 we now have the option of a new field type called payloads.
Payloads work by pairing each term with a numerical weighting, e.g. cat|1.5 dog|6.0 monkey|0.1. This would mean that the dog tag is more relevant in this article than the others. We should therefore change our field type in schema.xml to use the payloads type and update our documents accordingly.
<doc>
<field name="id">FELINE-ARTICLE-1</field>
<field name="title">The biggest feline in the world</field>
<field name="summary">Today a moggy of enormous proportions was found.</field>
<field name="tags">cat|2.4 dog|1.2 monkey|0.1</field>
</doc>
<doc>
<field name="id">CANINE-ARTICLE-1</field>
<field name="title">The biggest canine in the world</field>
<field name="summary">Today a mutt of enormous proportions was found.</field>
<field name="tags">dog|3.4 monkey|1.2 cat|0.1</field>
</doc>
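To make the tag format concrete, here is a minimal sketch in plain Java that splits a tags value such as the ones above into term and weight pairs. It is only an illustration of the format: the class name PayloadTagParser and the fallback weight of 1.0 for a tag with no payload are assumptions of this sketch, and inside Solr this job is done by the analyser configured on the payloads field type, not by code like this.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Solr or Lucene.
class PayloadTagParser {

    /** Splits a value like "cat|2.4 dog|1.2" into an ordered term -> weight map. */
    static Map<String, Float> parse(String tags) {
        Map<String, Float> weights = new LinkedHashMap<>();
        for (String token : tags.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields one empty token
            }
            int bar = token.lastIndexOf('|');
            if (bar < 0) {
                // Assumption of this sketch: a tag without a payload gets a neutral weight.
                weights.put(token, 1.0f);
            } else {
                String term = token.substring(0, bar);
                float weight = Float.parseFloat(token.substring(bar + 1));
                weights.put(term, weight);
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        // prints {cat=2.4, dog=1.2, monkey=0.1}
        System.out.println(parse("cat|2.4 dog|1.2 monkey|0.1"));
    }
}
```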
Interlude – Code Section
Unfortunately this is only half the battle. Getting this to work in Solr takes a bit of work: while payloads are supported in the underlying Lucene engine and have a field type defined in the example schema, some Java code is needed to get it all to hang together. I found the code rather hard to track down; it was scattered in lots of little pieces, and some of the syntax had changed between versions. I’ll repeat the code here in case it helps anyone get up and running quicker.
The first part is a custom similarity class, registered in your schema.xml, so that the payload values are decoded and used when scoring.
package uk.org.company.solr;

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

public class PayloadSimilarity extends DefaultSimilarity
{
    @Override
    public float scorePayload(int docId, String fieldName, int start, int end, byte[] payload, int offset, int length)
    {
        // can ignore length here, because we know it is encoded as 4 bytes
        return PayloadHelper.decodeFloat(payload, offset);
    }
}
<similarity class="uk.org.company.solr.PayloadSimilarity" />
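The comment in scorePayload says the weight is "encoded as 4 bytes". As a rough illustration of what that means, the sketch below round-trips a float through its four IEEE 754 bytes in big-endian order using only the JDK. The class PayloadBytes is a hypothetical stand-in written for this post; I believe PayloadHelper does something equivalent, but this is not the Lucene source and should not be treated as its exact behaviour.

```java
// Hypothetical stand-in for illustration only; in Solr use Lucene's PayloadHelper.
class PayloadBytes {

    /** Encodes a float as 4 bytes, most significant byte first. */
    static byte[] encodeFloat(float value) {
        int bits = Float.floatToIntBits(value);
        return new byte[] {
            (byte) (bits >>> 24),
            (byte) (bits >>> 16),
            (byte) (bits >>> 8),
            (byte) bits
        };
    }

    /** Decodes 4 bytes starting at offset back into a float. */
    static float decodeFloat(byte[] bytes, int offset) {
        int bits = ((bytes[offset] & 0xFF) << 24)
                 | ((bytes[offset + 1] & 0xFF) << 16)
                 | ((bytes[offset + 2] & 0xFF) << 8)
                 | (bytes[offset + 3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        byte[] payload = encodeFloat(2.4f);
        // The length is always 4, which is why scorePayload can ignore it.
        System.out.println(payload.length + " bytes -> " + decodeFloat(payload, 0));
    }
}
```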
The next step is to add a payload query parser; this is taken directly from https://issues.apache.org/jira/browse/SOLR-1485.
You must also register the parser in your solrconfig.xml, using the queryParser line that follows the class.
package uk.org.company.solr;

import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.payloads.*;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.QueryParsing;

public class PayloadTermQueryPlugin extends QParserPlugin {
    public void init(NamedList args) {
        // no configuration needed
    }

    @Override
    public QParser createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
        return new QParser(qstr, localParams, params, req) {
            public Query parse() throws ParseException {
                // f = field to search, v = the term, func = how payloads are combined
                return new PayloadTermQuery(
                    new Term(localParams.get(QueryParsing.F), localParams.get(QueryParsing.V)),
                    createPayloadFunction(localParams.get("func")),
                    false);
            }
        };
    }

    private PayloadFunction createPayloadFunction(String func) {
        // TODO: refactor so that payload functions are registered as plugins and loaded
        // through SolrResourceLoader.
        PayloadFunction payloadFunction = null;
        if ("min".equals(func)) {
            payloadFunction = new MinPayloadFunction();
        } else if ("avg".equals(func)) {
            payloadFunction = new AveragePayloadFunction();
        } else if ("max".equals(func)) {
            payloadFunction = new MaxPayloadFunction();
        }
        if (payloadFunction == null) {
            throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "unknown PayloadFunction: " + func);
        }
        return payloadFunction;
    }
}
<queryParser name="payload" class="uk.org.company.solr.PayloadTermQueryPlugin" />
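The func parameter handled by createPayloadFunction decides how several payloads for the same term in one document are folded into a single factor. The sketch below shows that idea in plain Java. It is a simplified model under my own assumptions, not the Lucene implementation: the class PayloadCombiner and its combine method are hypothetical names, and the real PayloadFunction classes work incrementally inside the scorer.

```java
// Hypothetical model of min / avg / max payload combination; not Lucene code.
class PayloadCombiner {

    /** Combines the payloads seen for one term in one document. */
    static float combine(String func, float[] payloads) {
        if (payloads.length == 0) {
            throw new IllegalArgumentException("no payloads to combine");
        }
        float min = payloads[0];
        float max = payloads[0];
        float sum = 0f;
        for (float p : payloads) {
            min = Math.min(min, p);
            max = Math.max(max, p);
            sum += p;
        }
        if ("min".equals(func)) {
            return min;
        } else if ("avg".equals(func)) {
            return sum / payloads.length;
        } else if ("max".equals(func)) {
            return max;
        }
        // Mirrors the plugin's behaviour of rejecting unknown function names.
        throw new IllegalArgumentException("unknown PayloadFunction: " + func);
    }

    public static void main(String[] args) {
        float[] repeated = {1f, 2f, 3f}; // the same tag indexed three times with different weights
        System.out.println(combine("min", repeated)); // 1.0
        System.out.println(combine("avg", repeated)); // 2.0
        System.out.println(combine("max", repeated)); // 3.0
    }
}
```

When each tag occurs only once per document, as in the two example articles, all three modes return the same value, so the choice only matters when a tag is repeated within a field.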
Querying
We can now use this new query parser to search our documents and should get the results we expect, i.e. if we search for ‘cat’, the feline article should appear higher in the search results because of the weighting given in its payload.
The query parser allows us to specify three modes for combining the payload weights: max, average and minimum.
http://localhost:8983/solr/select?q={!payload%20f=tags%20func=avg}cat&debugQuery=true&indent=on .. (AVG)
http://localhost:8983/solr/select?q={!payload%20f=tags%20func=min}cat&debugQuery=true&indent=on .. (MIN)
.. with debugQuery on we can see how the new scoring is in effect.
<str name="FELINE-ARTICLE-1">
0.5044795 = (MATCH) weight(payloads:cat in 0), product of:
0.99999994 = queryWeight(payloads:cat), product of:
0.5945349 = idf(payloads: cat=2)
1.681987 = queryNorm
0.5044796 = (MATCH) fieldWeight(payloads:cat in 0), product of:
1.6970563 = (MATCH) btq, product of:
0.70710677 = tf(phraseFreq=0.5)
2.4 = scorePayload(...)
0.5945349 = idf(payloads: cat=2)
0.5 = fieldNorm(field=payloads, doc=0)
</str>
<str name="CANINE-ARTICLE-1">
0.02101998 = (MATCH) weight(payloads:cat in 1), product of:
0.99999994 = queryWeight(payloads:cat), product of:
0.5945349 = idf(payloads: cat=2)
1.681987 = queryNorm
0.021019982 = (MATCH) fieldWeight(payloads:cat in 1), product of:
0.07071068 = (MATCH) btq, product of:
0.70710677 = tf(phraseFreq=0.5)
0.1 = scorePayload(...)
0.5945349 = idf(payloads: cat=2)
0.5 = fieldNorm(field=payloads, doc=1)
</str>
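The numbers in the debug output can be checked by hand: btq is tf multiplied by the payload, and fieldWeight is btq multiplied by idf and fieldNorm. The sketch below redoes that arithmetic in plain Java using the values copied from the output above. The class and method names are hypothetical, and this only reproduces the products shown in the explanation; it does not call Solr.

```java
// Hypothetical helper that reproduces the products listed in the debugQuery output.
class ExplainArithmetic {

    /** fieldWeight = (tf * payload) * idf * fieldNorm, as listed in the explanation. */
    static double fieldWeight(double tf, double payload, double idf, double fieldNorm) {
        double btq = tf * payload;
        return btq * idf * fieldNorm;
    }

    /** Final score = queryWeight * fieldWeight, where queryWeight = idf * queryNorm. */
    static double score(double tf, double payload, double idf, double queryNorm, double fieldNorm) {
        double queryWeight = idf * queryNorm;
        return queryWeight * fieldWeight(tf, payload, idf, fieldNorm);
    }

    public static void main(String[] args) {
        double tf = 0.70710677;
        double idf = 0.5945349;
        double queryNorm = 1.681987;
        double fieldNorm = 0.5;

        // FELINE-ARTICLE-1 has cat|2.4, so the result is roughly 0.50448
        System.out.println(score(tf, 2.4, idf, queryNorm, fieldNorm));
        // CANINE-ARTICLE-1 has cat|0.1, so the result is roughly 0.02102
        System.out.println(score(tf, 0.1, idf, queryNorm, fieldNorm));
    }
}
```

Everything except the payload is identical for the two documents, so their scores differ by the ratio of the payloads, 2.4 / 0.1 = 24.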
As the project develops I will try to post some more on the techniques and technologies we are using to achieve our goals. Hopefully payloads can give us the flexibility and results we need for this part of our project.
Hi,
I also want to boost the payload terms at query time.
That is, we have three payload terms: cat, dog and monkey.
I want to search with two or more payload terms, specifying a query-time weightage for each, like
http://localhost:8983/solr/select?q={!payload%20f=tags%20func=max}cat^5 dog^4&debugQuery=true
The score calculation should happen like this:
first doc = 2.4*5 + 1.2*4 = 16.8
second doc = 0.1*5 + 3.4*4 = 14.1
and the docs should be displayed in that order.
Can anyone please suggest how to achieve this? I have the above setup ready.
Thanks,
Leela