A Basic Full Text Search Server in Erlang

written in Erlang

In the beginner’s mind there are many possibilities, but in the expert’s there are few

Shunryu Suzuki Zen Mind Beginner’s Mind

This post explains how to build a basic full text search server in Erlang. The server has the following features:

  • indexing
  • stemming
  • ranking
  • faceting
  • asynchronous search results
  • web frontend using websockets

Familiarity with the OTP design principles is recommended.

The sample application (build with help from my colleague Michel Rijnders mies@tty.nl) uses the Creative Commons Data Dump from StackExchange as demo data.

We cover the following subjects:

Running the Sample Application

Clone the source from GitHub:

 git clone git://github.com/tty/async_search.git

And start the application:

$ rebar get-deps compile && erl -pa `pwd`/ebin `pwd`/deps/*/ebin +P 134217727
Eshell> application:start(async).
Eshell> stackoverflow_importer_ser:import().

Visit http://localhost:3000, you should see the following page:

http://localhost:3000/

Sample ranked search output for the query erlang armstrong:

http://localhost:3000/

Sample tags facets output for the query java:

http://localhost:3000/

OTP Supervision Tree

supervisor tree

Looking at the OTP application supervision tree is a good way to understand the architecture of an OTP application.

The application supervisor async_sup starts up the following supervisors:

  • keyword_sup. A keyword_ser process is created for every unique word in the StackExchange posts. This keyword_ser is linked to the keyword_sup supervisor (a simple_one_for_one supervisor). The keyword_ser child process maintains a list of document positions of a keyword (an inverted index).
  • facet_sup. A keyword_ser process is also created for every unique facet category in the StackExchange posts. This keyword_ser process is linked to the facet_sup supervisor (a simple_one_for_one supervisor as well). The keyword_ser child process maintains a list of facet values with the IDs of the documents the facets appear in.

The application supervisor also start the following gen_server singleton processes:

  • stackoverflow_importer_ser. This server imports the demo Stack Overflow data.
  • document_ser. This server holds a copy of all documents, so it can return the original title and body of matching Stack Overflow posts in the results.
  • query_ser. This server’s task is to run the actual query and return results.
  • websocket_ser. This server provides a HTTP frontend for the search engine.

No attention is given to fault tolerance (apart from the basic restart strategies), thus parts of the search index are lost if a keyword_ser process terminates.

Demo Data Import

The StackExchange data is provided as XML. Since some of the documents are quite large, it’s not recommended to load the full XML documents in memory. The solution is to use a SAX parser which treats a XML file as a stream, and triggers events when new elements are discovered. The search server uses the excellent SAX parser from the Erlsom library by Willem de Jong.

In the example below erlsom:parse_sax reads the XML file from FilePath and calls the function sax_event if an XML element is found.

When the element is a row element (i.e. a post element), attributes like Id, Title and Body are stored in a dictionary. For every post a copy of all the attributes in document_ser is saved. This is used for returning the actual posts for a query match. After that the add_attribute_tokens function is called:

The add_attribute_tokens function does two things. It calls add_facet (discussed later) and it creates a list of tuples with all the words and their position in the document. This process is called tokenization. Each token/position tuple is then submitted to the add_keyword_position function of the keyword_ser for indexing.

Indexing

Indexing of the tuples, or keywords, is handled by the keyword_ser. For every unique word a keyword_ser process is started if not already present. The state of a keyword_ser process is a dictionary with the document ID as key and a list of positions as value. The document ID corresponds to the ID of the Stack Overflow post.

The keyword_server_name function generates a unique name under which the keyword_ser process is registered, so the module can check if a keyword already has a process or a new process needs to be created.

Stemming

Stemming is the process for reducing inflected words to their base form. Computing and computer both are stemmed to comput. So when a user searches on computing, it also matches text that contains computer. This makes it possible to return results that are relevant, but do not exactly match the query.

In our sample application all keywords are stemmed using the popular Porter Algorithm. The Erlang implementation by Alden Dima is used in the application.

erlang:phash2 is used to transform the stemmed name to a hash, to make sure the registered process name is valid.

Faceting

Faceted search is an important navigation feature for search engines. A user can drill down the search results by filtering on pre-defined attributes, like in this example of a digital camera search on CNET:

Faceted search example

As mentioned above, the data import the function add_attribute_tokens also calls the add_facet function. Using pattern matching the Tags and the Creationdate attributes are selected for faceting. Tags is a so called multivalue facet, as a Stack Overflow post can have one or more tags assigned. For every tag and creation date the facet_ser:add_facet_value function is called.

facet_ser works very similar to keyword_ser. For every facet category, Tag or Creationdate in our case, a facet_ser processes is started. The state of a facet_ser is a dictionary with the Tag or Creationdate values as key and their document IDs as dictionary values.

Querying and Relevance Ranking

In previous sections is shown:

  • how the XML demo data is parsed.
  • how this data is stemmed and indexed by creating a keyword_ser process for every unique keyword.
  • how this data is indexed for faceted search by creating a facet_ser process for every facet category.

With the function stackoverflow_importer_ser:import() these steps are executed, and your Erlang node is now ready for querying. So how does that work?

Querying

Querying is handled by passing the user’s query terms to the function do_async_query of the singleton query_ser server. When calling this function you need to specify the module, function and optional reference attribute which will be called when query results are available.

In the handle_cast the following steps are executed:

  • keyword_ser:do_query return all document ids that contain one or more of the user’s query terms, including the relevance ranking score, which will be discussed below.
  • All original documents are stored during indexing in a document_ser process. All matching documents are collected.
  • The callback function is invoked with the matching documents and their ranking scores as arguments.
  • Facet results are retrieved for any FacetCategories that are specified by calling facet_ser:get_facets.
  • And the callback function is invoked a second time with the facet results as arguments.

Relevance Ranking

Relevance in this context denotes how well a retrieved document matches the user’s search query. Most fulltext search-engines use the BM25 algorithm to determine the ranking score of each document, so let’s use that too.

BM25 calculates a ranking score based on the query term frequency in each documents.

See the async_bm25.erl for the implementation.

Displaying the Search Results

As discussed, the query_ser:do_async_query can be called to query our full-text search engine. To allow users to send queries and see the result the websocket_ser module is created. This singleton gen_serverstarts up a Misultin HTTP server on Port 3000. If you browse to http://localhost:3000 you will see a search box. Communication with the search engine is done through websockets.

So, when a user posts a query, this message is received by the websockets_ser:handle_websocket receive block. The query_ser:do_async_query function is called and query results are expected on websockets_ser:query_results function.

The query_results function formats the results as HTML and sends this through the websocket. When received, the HTML is appended to the user’s page.

A similar process is executed when the facet results are received:

Improvements

Some obvious features that are lacking from this sample application:

  • The author of this post is an Erlang newbie. Corrections/suggestions to the code are most welcome. You can send them to <ward@tty.nl>
  • Pretty much no attention is given to performance / memory usage.
  • Fault tolerence for the index data. When a server containing index state dies, it will not be revived.
  • Tuple structures passed between modules are not specified. Would be nice to use record syntax for it.
  • No unit/quickcheck/common test added.
  • No function/type specifications.
  • etc..

So, that why it’s called a sample application ;-)


Comments