A Basic Full Text Search Server in Erlang
In the beginner’s mind there are many possibilities, but in the expert’s there are few
This post explains how to build a basic full text search server in Erlang. The server has the following features:
- indexing
- stemming
- ranking
- faceting
- asynchronous search results
- web frontend using websockets
Familiarity with the OTP design principles is recommended.
The sample application (build with help from my colleague Michel Rijnders mies@tty.nl
) uses the Creative Commons Data Dump from StackExchange as demo data.
We cover the following subjects:
- running the sample application
- OTP supervision tree
- importing demo data
- indexing
- stemming
- faceting
- querying and relevance ranking
- displaying search results
- improvements
Running the Sample Application
Clone the source from GitHub:
git clone git://github.com/tty/async_search.git
And start the application:
$ rebar get-deps compile && erl -pa `pwd`/ebin `pwd`/deps/*/ebin +P 134217727
Eshell> application:start(async).
Eshell> stackoverflow_importer_ser:import().
Visit http://localhost:3000, you should see the following page:
Sample ranked search output for the query erlang armstrong
:
Sample tags facets output for the query java
:
OTP Supervision Tree
Looking at the OTP application supervision tree is a good way to understand the architecture of an OTP application.
The application supervisor async_sup
starts up the following supervisors:
keyword_sup
. Akeyword_ser
process is created for every unique word in the StackExchange posts. Thiskeyword_ser
is linked to thekeyword_sup
supervisor (asimple_one_for_one
supervisor). Thekeyword_ser
child process maintains a list of document positions of a keyword (an inverted index).facet_sup
. Akeyword_ser
process is also created for every unique facet category in the StackExchange posts. Thiskeyword_ser
process is linked to thefacet_sup
supervisor (asimple_one_for_one
supervisor as well). Thekeyword_ser
child process maintains a list of facet values with the IDs of the documents the facets appear in.
The application supervisor also start the following gen_server
singleton processes:
stackoverflow_importer_ser
. This server imports the demo Stack Overflow data.document_ser
. This server holds a copy of all documents, so it can return the original title and body of matching Stack Overflow posts in the results.query_ser
. This server’s task is to run the actual query and return results.websocket_ser
. This server provides a HTTP frontend for the search engine.
No attention is given to fault tolerance (apart from the basic restart strategies), thus parts of the search index are lost if a keyword_ser
process terminates.
Demo Data Import
The StackExchange data is provided as XML. Since some of the documents are quite large, it’s not recommended to load the full XML documents in memory. The solution is to use a SAX parser which treats a XML file as a stream, and triggers events when new elements are discovered. The search server uses the excellent SAX parser from the Erlsom library by Willem de Jong.
In the example below erlsom:parse_sax
reads the XML file from FilePath
and calls the function sax_event
if an XML element is found.
When the element is a row
element (i.e. a post element), attributes like Id
, Title
and Body
are stored in a dictionary. For every post a copy of all the attributes in document_ser
is saved. This is used for returning the actual posts for a query match. After that the add_attribute_tokens
function is called:
The add_attribute_tokens
function does two things. It calls add_facet
(discussed later) and it creates a list of tuples with all the words and their position in the document. This process is called tokenization. Each token/position tuple is then submitted to the add_keyword_position
function of the keyword_ser
for indexing.
Indexing
Indexing of the tuples, or keywords, is handled by the keyword_ser
. For every unique word a keyword_ser
process is started if not already present. The state of a keyword_ser
process is a dictionary with the document ID as key and a list of positions as value. The document ID corresponds to the ID of the Stack Overflow post.
The keyword_server_name
function generates a unique name under which the keyword_ser
process is registered, so the module can check if a keyword already has a process or a new process needs to be created.
Stemming
Stemming is the process for reducing inflected words to their base form. Computing
and computer
both are stemmed to comput
. So when a user searches on computing
, it also matches text that contains computer
. This makes it possible to return results that are relevant, but do not exactly match the query.
In our sample application all keywords are stemmed using the popular Porter Algorithm. The Erlang implementation by Alden Dima is used in the application.
erlang:phash2
is used to transform the stemmed name to a hash, to make sure the registered process name is valid.
Faceting
Faceted search is an important navigation feature for search engines. A user can drill down the search results by filtering on pre-defined attributes, like in this example of a digital camera search on CNET:
As mentioned above, the data import the function add_attribute_tokens
also calls the add_facet
function. Using pattern matching the Tags
and the Creationdate
attributes are selected for faceting. Tags
is a so called multivalue facet, as a Stack Overflow post can have one or more tags assigned. For every tag and creation date the facet_ser:add_facet_value
function is called.
facet_ser
works very similar to keyword_ser
. For every facet category, Tag
or Creationdate
in our case, a facet_ser
processes is started. The state of a facet_ser
is a dictionary with the Tag
or Creationdate
values as key and their document IDs as dictionary values.
Querying and Relevance Ranking
In previous sections is shown:
- how the XML demo data is parsed.
- how this data is stemmed and indexed by creating a
keyword_ser
process for every unique keyword. - how this data is indexed for faceted search by creating a
facet_ser
process for every facet category.
With the function stackoverflow_importer_ser:import()
these steps are executed, and your Erlang node is now ready for querying. So how does that work?
Querying
Querying is handled by passing the user’s query terms to the function do_async_query
of the singleton query_ser
server. When calling this function you need to specify the module, function and optional reference attribute which will be called when query results are available.
In the handle_cast
the following steps are executed:
keyword_ser:do_query
return all document ids that contain one or more of the user’s query terms, including the relevance ranking score, which will be discussed below.- All original documents are stored during indexing in a
document_ser
process. All matching documents are collected. - The callback function is invoked with the matching documents and their ranking scores as arguments.
- Facet results are retrieved for any
FacetCategories
that are specified by callingfacet_ser:get_facets
. - And the callback function is invoked a second time with the facet results as arguments.
Relevance Ranking
Relevance in this context denotes how well a retrieved document matches the user’s search query. Most fulltext search-engines use the BM25 algorithm to determine the ranking score of each document, so let’s use that too.
BM25 calculates a ranking score based on the query term frequency in each documents.
See the async_bm25.erl for the implementation.
Displaying the Search Results
As discussed, the query_ser:do_async_query
can be called to query our full-text search engine. To allow users to send queries and see the result the websocket_ser
module is created. This singleton gen_server
starts up a Misultin HTTP server on Port 3000. If you browse to http://localhost:3000 you will see a search box. Communication with the search engine is done through websockets.
So, when a user posts a query, this message is received by the websockets_ser:handle_websocket
receive block. The query_ser:do_async_query
function is called and query results are expected on websockets_ser:query_results
function.
The query_results
function formats the results as HTML and sends this through the websocket. When received, the HTML is appended to the user’s page.
A similar process is executed when the facet results are received:
Improvements
Some obvious features that are lacking from this sample application:
- The author of this post is an Erlang newbie. Corrections/suggestions to the code are most welcome. You can send them to
<ward@tty.nl>
- Pretty much no attention is given to performance / memory usage.
- Fault tolerence for the index data. When a server containing index state dies, it will not be revived.
- Tuple structures passed between modules are not specified. Would be nice to use record syntax for it.
- No unit/quickcheck/common test added.
- No function/type specifications.
- etc..
So, that why it’s called a sample application ;-)