A Basic Full Text Search Server in Erlang
In the beginner’s mind there are many possibilities, but in the expert’s there are few
This post explains how to build a basic full text search server in Erlang. The server has the following features:
- asynchronous search results
- web frontend using websockets
Familiarity with the OTP design principles is recommended.
The sample application (build with help from my colleague Michel Rijnders
email@example.com) uses the Creative Commons Data Dump from StackExchange as demo data.
We cover the following subjects:
- running the sample application
- OTP supervision tree
- importing demo data
- querying and relevance ranking
- displaying search results
Running the Sample Application
Clone the source from GitHub:
git clone git://github.com/tty/async_search.git
And start the application:
$ rebar get-deps compile && erl -pa `pwd`/ebin `pwd`/deps/*/ebin +P 134217727 Eshell> application:start(async). Eshell> stackoverflow_importer_ser:import().
Visit http://localhost:3000, you should see the following page:
Sample ranked search output for the query
Sample tags facets output for the query
OTP Supervision Tree
Looking at the OTP application supervision tree is a good way to understand the architecture of an OTP application.
The application supervisor
async_sup starts up the following supervisors:
keyword_serprocess is created for every unique word in the StackExchange posts. This
keyword_seris linked to the
keyword_serchild process maintains a list of document positions of a keyword (an inverted index).
keyword_serprocess is also created for every unique facet category in the StackExchange posts. This
keyword_serprocess is linked to the
simple_one_for_onesupervisor as well). The
keyword_serchild process maintains a list of facet values with the IDs of the documents the facets appear in.
The application supervisor also start the following
gen_server singleton processes:
stackoverflow_importer_ser. This server imports the demo Stack Overflow data.
document_ser. This server holds a copy of all documents, so it can return the original title and body of matching Stack Overflow posts in the results.
query_ser. This server’s task is to run the actual query and return results.
websocket_ser. This server provides a HTTP frontend for the search engine.
No attention is given to fault tolerance (apart from the basic restart strategies), thus parts of the search index are lost if a
keyword_ser process terminates.
Demo Data Import
The StackExchange data is provided as XML. Since some of the documents are quite large, it’s not recommended to load the full XML documents in memory. The solution is to use a SAX parser which treats a XML file as a stream, and triggers events when new elements are discovered. The search server uses the excellent SAX parser from the Erlsom library by Willem de Jong.
In the example below
erlsom:parse_sax reads the XML file from
FilePath and calls the function
sax_event if an XML element is found.
When the element is a
row element (i.e. a post element), attributes like
Body are stored in a dictionary. For every post a copy of all the attributes in
document_ser is saved. This is used for returning the actual posts for a query match. After that the
add_attribute_tokens function is called:
add_attribute_tokens function does two things. It calls
add_facet (discussed later) and it creates a list of tuples with all the words and their position in the document. This process is called tokenization. Each token/position tuple is then submitted to the
add_keyword_position function of the
keyword_ser for indexing.
Indexing of the tuples, or keywords, is handled by the
keyword_ser. For every unique word a
keyword_ser process is started if not already present. The state of a
keyword_ser process is a dictionary with the document ID as key and a list of positions as value. The document ID corresponds to the ID of the Stack Overflow post.
keyword_server_name function generates a unique name under which the
keyword_ser process is registered, so the module can check if a keyword already has a process or a new process needs to be created.
Stemming is the process for reducing inflected words to their base form.
computer both are stemmed to
comput. So when a user searches on
computing, it also matches text that contains
computer. This makes it possible to return results that are relevant, but do not exactly match the query.
erlang:phash2 is used to transform the stemmed name to a hash, to make sure the registered process name is valid.
Faceted search is an important navigation feature for search engines. A user can drill down the search results by filtering on pre-defined attributes, like in this example of a digital camera search on CNET:
As mentioned above, the data import the function
add_attribute_tokens also calls the
add_facet function. Using pattern matching the
Tags and the
Creationdate attributes are selected for faceting.
Tags is a so called multivalue facet, as a Stack Overflow post can have one or more tags assigned. For every tag and creation date the
facet_ser:add_facet_value function is called.
facet_ser works very similar to
keyword_ser. For every facet category,
Creationdate in our case, a
facet_ser processes is started. The state of a
facet_ser is a dictionary with the
Creationdate values as key and their document IDs as dictionary values.
Querying and Relevance Ranking
In previous sections is shown:
- how the XML demo data is parsed.
- how this data is stemmed and indexed by creating a
keyword_serprocess for every unique keyword.
- how this data is indexed for faceted search by creating a
facet_serprocess for every facet category.
With the function
stackoverflow_importer_ser:import() these steps are executed, and your Erlang node is now ready for querying. So how does that work?
Querying is handled by passing the user’s query terms to the function
do_async_query of the singleton
query_ser server. When calling this function you need to specify the module, function and optional reference attribute which will be called when query results are available.
handle_cast the following steps are executed:
keyword_ser:do_queryreturn all document ids that contain one or more of the user’s query terms, including the relevance ranking score, which will be discussed below.
- All original documents are stored during indexing in a
document_serprocess. All matching documents are collected.
- The callback function is invoked with the matching documents and their ranking scores as arguments.
- Facet results are retrieved for any
FacetCategoriesthat are specified by calling
- And the callback function is invoked a second time with the facet results as arguments.
Relevance in this context denotes how well a retrieved document matches the user’s search query. Most fulltext search-engines use the BM25 algorithm to determine the ranking score of each document, so let’s use that too.
BM25 calculates a ranking score based on the query term frequency in each documents.
See the async_bm25.erl for the implementation.
Displaying the Search Results
As discussed, the
query_ser:do_async_query can be called to query our full-text search engine. To allow users to send queries and see the result the
websocket_ser module is created. This singleton
gen_serverstarts up a Misultin HTTP server on Port 3000. If you browse to http://localhost:3000 you will see a search box. Communication with the search engine is done through websockets.
So, when a user posts a query, this message is received by the
websockets_ser:handle_websocket receive block. The
query_ser:do_async_query function is called and query results are expected on
query_results function formats the results as HTML and sends this through the websocket. When received, the HTML is appended to the user’s page.
A similar process is executed when the facet results are received:
Some obvious features that are lacking from this sample application:
- The author of this post is an Erlang newbie. Corrections/suggestions to the code are most welcome. You can send them to
- Pretty much no attention is given to performance / memory usage.
- Fault tolerence for the index data. When a server containing index state dies, it will not be revived.
- Tuple structures passed between modules are not specified. Would be nice to use record syntax for it.
- No unit/quickcheck/common test added.
- No function/type specifications.
So, that why it’s called a sample application ;-)