Haystack makes integrating ElasticSearch into Django projects a breeze, but custom configuration takes a little bit more work.
A lot of feature requirements in Django projects are solved by domain specific third-party modules that smartly fit the bill and end up becoming something of a community standard. For search, Haystack is that touchstone: it supports some of the most common search engines and its API closely mirrors that of existing Django APIs, making it easy for developers to get started.
We’ve been using Haystack with a Lucene-backed engine called ElasticSearch - you know, for search. Unlike the popular Solr search engine, ElasticSearch uses schema-free JSON instead of XML and can be deployed without a separate Java servlet container. For our needs it offers the best balance of simplicity and power.
Note: ElasticSearch support is only available in Haystack 2.0.0 beta. To use it you’ll need to grab the code from source, not PyPI.
Rather than simply filtering your content, a search engine performs textual matching. Unlike a LIKE query in SQL, the query and indexed content can be given different relevancy weights, language characteristics can be chosen, and even synonyms can be matched. And it can do all of this across different types of content, or rather, different types of ‘documents’.
The search engine does this by tokenizing and filtering the content - both the indexed content and the query terms. ElasticSearch lets you configure which tokenizers and filters are applied, and you can add your own as well. With the available filters and tokenizers you can define analyzers for different languages, use custom stop words, and filter on synonyms. The index is then built and updated according to this configuration.
Here’s an example from the ElasticSearch docs for setting up an analyzer to filter on synonyms using a provided synonym file.
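The example isn’t reproduced here, but it looks roughly like the following, written as the Python dictionary you’d pass as index settings; the analyzer and filter names are illustrative, and the synonym file path points at a file you provide on the search server:

```python
# Index settings defining a custom analyzer that expands synonyms,
# adapted from the ElasticSearch docs.
SYNONYM_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "synonym_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["synonym"],
                },
            },
            "filter": {
                "synonym": {
                    "type": "synonym",
                    # Path to a synonym file on the ElasticSearch server.
                    "synonyms_path": "analysis/synonym.txt",
                },
            },
        },
    },
}
```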
This looks like a pretty useful feature until you realize that Haystack’s ElasticSearch backend only supports a default setting configuration. Here’s what our index settings look like (source).
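The settings in question live in a DEFAULT_SETTINGS dictionary on the backend class. As of the 2.0 beta it looks roughly like this, lightly abridged:

```python
# From haystack.backends.elasticsearch_backend, abridged.
DEFAULT_SETTINGS = {
    'settings': {
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_ngram"],
                },
                "edgengram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_edgengram"],
                },
            },
            "filter": {
                "haystack_ngram": {
                    "type": "nGram",
                    "min_gram": 3,
                    "max_gram": 15,
                },
                "haystack_edgengram": {
                    "type": "edgeNGram",
                    "min_gram": 2,
                    "max_gram": 15,
                },
            },
        },
    },
}
```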
And here’s the snippet showing how these are used (source).
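Paraphrasing the backend’s setup method, the important detail is that DEFAULT_SETTINGS is passed straight to create_index, with no hook for substituting your own dictionary:

```python
def setup(self):
    # ...the mapping is built from the unified index, then compared
    # with the existing mapping (details elided)...
    if current_mapping != self.existing_mapping:
        try:
            # The hard-coded DEFAULT_SETTINGS are applied when the
            # index is created.
            self.conn.create_index(self.index_name, self.DEFAULT_SETTINGS)
            self.conn.put_mapping(self.index_name, 'modelresult',
                                  current_mapping)
            self.existing_mapping = current_mapping
        except Exception:
            if not self.silently_fail:
                raise
```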
The settings configure two nGram analyzers for Haystack, but we’re left without a way of changing the filter or tokenizer attributes, or of adding a new analyzer.
The solution, for the time being, is to use a custom search backend. The first step is to update the settings used for updating the index. Here’s a custom backend extending the original.
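A minimal sketch, assuming a project setting named ELASTICSEARCH_INDEX_SETTINGS (a name of our own choosing):

```python
from django.conf import settings
from haystack.backends.elasticsearch_backend import ElasticsearchSearchBackend


class ConfigurableElasticBackend(ElasticsearchSearchBackend):

    def __init__(self, connection_alias, **connection_options):
        super(ConfigurableElasticBackend, self).__init__(
            connection_alias, **connection_options)
        # Swap in the project's own index settings, if provided.
        user_settings = getattr(settings, 'ELASTICSEARCH_INDEX_SETTINGS', None)
        if user_settings:
            setattr(self, 'DEFAULT_SETTINGS', user_settings)
```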
This extended backend does nothing more than look for a custom settings dictionary in your project settings file and then replace the backend settings with your own. But now we can swap out those settings.
Even though we’ve updated the settings, our changes are still unavailable: Haystack assigns an analyzer to each search field, and that analyzer is hard coded.
The default analyzer for non-nGram fields is the “snowball” analyzer. Snowball is basically a stemming analyzer, which means it reduces words to a common root so that related forms match one another - “swimming” is indexed against “swim”, for instance. It also adds a stop word filter, which keeps common words, such as prepositions and articles, out of the index. The analyzer is language specific as well, which could be problematic: the default language is English, and to change it you need to specify the language in the index settings.
Here’s the snippet in which the default analyzer is set in the build_schema method, with minor formatting changes for this page (source).
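Reconstructed from the 2.0 beta source and abridged:

```python
def build_schema(self, fields):
    content_field_name = ''
    mapping = {}

    for field_name, field_class in fields.items():
        field_mapping = {
            'boost': field_class.boost,
            'index': 'analyzed',
            'store': 'yes',
            'type': 'string',
        }

        if field_class.document is True:
            content_field_name = field_class.index_fieldname

        # ...type-specific mappings for dates, integers, floats,
        # booleans, nGram fields, and locations elided...

        if field_class.field_type not in ('ngram', 'edge_ngram'):
            # The default analyzer is hard coded right here.
            field_mapping['analyzer'] = "snowball"

        mapping[field_name] = field_mapping

    return (content_field_name, mapping)
```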
The chosen analyzer should be configurable, so let’s make it so.
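A sketch of that change, building on the backend above and assuming one more project setting of our own invention, ELASTICSEARCH_DEFAULT_ANALYZER:

```python
class ConfigurableElasticBackend(ElasticsearchSearchBackend):

    # Set explicitly so there is one obvious place to override it.
    DEFAULT_ANALYZER = "snowball"

    def __init__(self, connection_alias, **connection_options):
        super(ConfigurableElasticBackend, self).__init__(
            connection_alias, **connection_options)
        user_settings = getattr(settings, 'ELASTICSEARCH_INDEX_SETTINGS', None)
        user_analyzer = getattr(settings, 'ELASTICSEARCH_DEFAULT_ANALYZER', None)
        if user_settings:
            setattr(self, 'DEFAULT_SETTINGS', user_settings)
        if user_analyzer:
            setattr(self, 'DEFAULT_ANALYZER', user_analyzer)

    def build_schema(self, fields):
        content_field_name, mapping = super(
            ConfigurableElasticBackend, self).build_schema(fields)
        for field_name, field_class in fields.items():
            field_mapping = mapping[field_class.index_fieldname]
            # Only analyzed string fields get the default analyzer;
            # nGram and facet fields are left alone.
            if field_mapping['type'] == 'string' and field_class.indexed:
                if not hasattr(field_class, 'facet_for') and \
                        field_class.field_type not in ('ngram', 'edge_ngram'):
                    field_mapping['analyzer'] = self.DEFAULT_ANALYZER
        return (content_field_name, mapping)
```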
This update closely follows how the base method is written, iterating through the fields and ignoring nGram fields. Now, on reindexing, all of your non-nGram indexed content will be analyzed with your specified analyzer. For explicitness, the default analyzer is set directly as a class attribute.
We’ve now set up a configurable default analyzer, but why not control this on a field by field basis? It should be pretty straightforward. We’ll just subclass the fields, adding an analyzer attribute via a keyword argument.
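A small mixin along these lines should do it:

```python
class ConfigurableFieldMixin(object):
    """Accepts an ``analyzer`` keyword argument and stores it on the field."""

    def __init__(self, **kwargs):
        self.analyzer = kwargs.pop('analyzer', None)
        super(ConfigurableFieldMixin, self).__init__(**kwargs)
```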
And then define a new field class using the mixin:
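```python
from haystack import indexes


class CharField(ConfigurableFieldMixin, indexes.CharField):
    pass
```

Any other field types you need can be wrapped the same way, and the analyzer is then chosen per field, e.g. CharField(document=True, analyzer='synonym_analyzer').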
Just be sure to import and use the new field rather than the field from the indexes module as you’d normally do. This establishes which analyzer the field should use, but doesn’t actually use the analyzer for indexing. Again, we need to extend the subclassed backend to do so, focusing on the build_schema method.
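A sketch of the updated method, preferring the field’s own analyzer and falling back to the configurable default:

```python
class ConfigurableElasticBackend(ElasticsearchSearchBackend):

    # DEFAULT_ANALYZER and __init__ unchanged from above.

    def build_schema(self, fields):
        content_field_name, mapping = super(
            ConfigurableElasticBackend, self).build_schema(fields)
        for field_name, field_class in fields.items():
            field_mapping = mapping[field_class.index_fieldname]
            if field_mapping['type'] == 'string' and field_class.indexed:
                # Remove this conditional to allow per-field analyzers
                # on nGram fields as well.
                if not hasattr(field_class, 'facet_for') and \
                        field_class.field_type not in ('ngram', 'edge_ngram'):
                    field_mapping['analyzer'] = getattr(
                        field_class, 'analyzer', None) or self.DEFAULT_ANALYZER
        return (content_field_name, mapping)
```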
If you want to control nGram analysis on a field by field basis, simply remove the conditional.
When you update your project settings to use the new backend, ensure that you’re referring to an engine (a subclass of BaseEngine), not a backend (a subclass of BaseSearchBackend). Given that we’ve just defined a new backend, we’ll also need to define a new search engine.
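This needs nothing more than pointing the engine class at the new backend:

```python
from haystack.backends.elasticsearch_backend import ElasticsearchSearchEngine


class ConfigurableElasticSearchEngine(ElasticsearchSearchEngine):
    backend = ConfigurableElasticBackend
```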
Now simply update your project settings to reference your new search engine and you’re good to go.
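Assuming the classes above live in a module called search_backends (a placeholder name), the connection settings would look something like:

```python
HAYSTACK_CONNECTIONS = {
    'default': {
        # Dotted path to wherever you defined the engine.
        'ENGINE': 'myproject.search_backends.ConfigurableElasticSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
    },
}
```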
Don’t forget to update your index, e.g. with Haystack’s rebuild_index management command, so the new analyzers take effect.
Update: you can grab the code used here in the reusable elasticstack app and install it from PyPI as well.