Implementing BK-tree in Clojure

As part of returning to blogging, or at least attempting to, I've decided to implement a data structure that can be quite useful yet is considered somewhat obscure: the BK-tree. Named after its inventors, W. A. Burkhard and R. M. Keller, it can provide a significant performance boost to applications in need of fuzzy string search.

What is a BK-tree?

The idea underlying a BK-tree is relatively simple; however, just by skimming the depths of the Internet I've seen a lot of people struggling to understand it.

The most common use case for BK-trees is spell-checking, although most spell-checkers out there use simpler and, to be frank, often more efficient approaches. Nevertheless, spell-checking makes for a good running example, so we'll stick with it.

Before we go further, we need to understand how spell-checking commonly works. I'd split spell-checking in modern applications into three parts:

  • Check whether the word exists in the dictionary
  • Find possible fixes for the misspelled word
  • Order suggestions based on some sort of heuristic

While a BK-tree can be used to address the first point, it's often faster and easier to use a set/bag implementation to achieve close to constant-time checks (a minimal sketch follows the list below). The second point can be tackled in a variety of ways, but for the purposes of this article we'll consider only the simplest ones:

  • naively in linear time by scanning all words in the dictionary and calculating edit distance
  • by generating all possible edits of the word being checked, as described in Peter Norvig's http://norvig.com/spell-correct.html
  • BK-tree (duh)
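
As an aside, the first point really is a one-liner in Clojure: a plain set gives near-constant-time membership checks. A minimal sketch (the dictionary path is the OS X word list I'll use in the benchmarks later):

```clojure
(require '[clojure.string :as str])

;; Load the dictionary into a set for near-constant-time lookups.
(def dictionary
  (set (str/split-lines (slurp "/usr/share/dict/words"))))

(contains? dictionary "squirrel") ;; => true
(contains? dictionary "squirl")   ;; => false, assuming the typo isn't in the list
```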

With this in mind, let's write some code.
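
All three approaches need an edit distance, so let's get that out of the way first. Below is a minimal, unoptimized Levenshtein distance in Clojure — a sketch of the classic dynamic-programming version, computed one row at a time; real implementations tend to be more heavily optimized:

```clojure
(defn levenshtein
  "Levenshtein edit distance between strings a and b, computed
  row by row with the classic dynamic-programming recurrence."
  [a b]
  (peek
   (reduce
    (fn [prev-row ca]                ; one DP row per character of a
      (reduce
       (fn [cur-row [j cb]]
         (conj cur-row
               (min (inc (peek cur-row))          ; insertion
                    (inc (nth prev-row (inc j)))  ; deletion
                    (+ (nth prev-row j)           ; substitution
                       (if (= ca cb) 0 1)))))
       [(inc (first prev-row))]      ; first column is the row index
       (map-indexed vector b)))
    (vec (range (inc (count b))))    ; row zero: distances from ""
    a)))

(levenshtein "squirrel" "shard") ;; => 6
```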

Implementing BK-tree

Since a BK-tree is an index, we have two separate steps: building said index and querying it. I've decided to illustrate the process instead of relying on a dry algorithm description. Well, at least I found it easier to understand by using drawings.

Building the tree

Building a BK-tree is pretty easy; our algorithm is basically this (a Clojure sketch follows the list):

  • Calculate the Levenshtein distance between the root of the tree and the word being indexed
  • If the slot for the resulting distance is free, insert a new node with the word being indexed
  • Otherwise, take the node occupying said slot and continue recursively
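
Here is a sketch of this insertion procedure in Clojure, using the levenshtein function from above. The node representation — a map of the form {:word w :children {distance child}} — is my own choice, not anything canonical:

```clojure
(defn bk-insert
  "Inserts word into the BK-tree rooted at node; nil grows a new node.
  A node is a map of the form {:word w :children {distance child}}."
  [node word]
  (if (nil? node)
    {:word word :children {}}
    (let [d (levenshtein (:word node) word)]
      (if (zero? d)
        node ; the word is already in the tree
        ;; recur into the child at distance d, creating it if absent
        (update-in node [:children d] bk-insert word)))))

(defn bk-tree
  "Builds a BK-tree from a sequence of words."
  [words]
  (reduce bk-insert nil words))
```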

Let's imagine that our dictionary consists of the words squirrel, shard, square and circus. We start our process by setting squirrel at the root of the tree:

Setting the root

As this is the root, no calculation needs to be done. Next in the queue is the word shard:

Adding first word

Since the Levenshtein distance between squirrel and shard is 6, we check whether there's any child node at the same distance, and since there isn't, we simply insert shard as a child node of squirrel. Next up, square:

Adding second word

Again, we calculate the Levenshtein distance (which is 3 this time) and happily insert a new child node. Now to the next word:

A WILD CONFLICT APPEARS

Oops, this time we can't just insert a child node, since there's already one at a Levenshtein distance of 6. Following the algorithm, we simply try to insert the word under the child node at that distance, which happens to be the shard node. In the end we end up with something that looks like this:

Look how the tree has grown

That should show the basic principle of building the tree.

Querying the tree

It's a bad, bad, bad, bad math!

Querying the tree is straightforward; its efficiency is guaranteed by a simple property of the Levenshtein distance, namely the triangle inequality. Warning: what follows is math which may be a bit off. So, given the root of the tree \(R\), its children \(C_i\), where \(i = Levenshtein(R, C_i)\), and the query word \(W\), we can say that

$$Levenshtein(W, R) + Levenshtein(W, C_i) \ge Levenshtein(R, C_i)$$

and consequently

$$Levenshtein(W, R) - Levenshtein(W, C_i) \le Levenshtein(R, C_i)$$

Since \(i = Levenshtein(R, C_i)\) and \(Levenshtein(W, C_i) \le d_{max}\), where \(d_{max}\) is our maximum tolerable distance between the query word and dictionary words, we end up with a simple filter: we only need to recursively query the child nodes \(C_i, \forall i \in [Levenshtein(W, R) - d_{max}, Levenshtein(W, R) + d_{max}]\).
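
Translated into Clojure, the filter is just a range check on the edges. Again a sketch, built on the levenshtein function and node representation from above:

```clojure
(defn bk-search
  "Returns all words in the tree within dmax edits of word, pruning
  children whose edge distance falls outside [d - dmax, d + dmax]."
  [node word dmax]
  (when node
    (let [d       (levenshtein (:word node) word)
          matches (if (<= d dmax) [(:word node)] [])]
      (into matches
            (mapcat (fn [[edge child]]
                      (when (<= (- d dmax) edge (+ d dmax))
                        (bk-search child word dmax)))
                    (:children node))))))
```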


Illustrating the query

Given this overly complicated and most definitely flawed explanation of the math behind BK-trees, let's illustrate how this works in the real world. For the purpose of illustration I've created a pretty simple tree.

Just the tree

We'd like to find all words in the tree that are within a Levenshtein distance of at most 1 from the word "brine". The first step is to match against the root, which in our tree is "trine".

Querying root

Since the Levenshtein distance between those words is 1, we've got our first match. Now, the trick that makes the BK-tree efficient comes into play: of all the children of this node we only need to query those within the distance range \([0..2]\). While in this simple example it seems like a small improvement, in real-world scenarios it effectively cuts off a significant number of branches.

Highlight matching children nodes

We then continue using the same approach recursively, until we either terminate at a leaf node or all remaining edges fail the triangle inequality.

Show matched values

Voila! Basically, searching a BK-tree is a BFS with a small filter on top.
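
Putting the sketches together at the REPL, with the tree we built earlier (the query word shart is just a made-up misspelling of shard):

```clojure
(def tree (bk-tree ["squirrel" "shard" "square" "circus"]))

(bk-search tree "shart" 1)
;; => ["shard"]
;; circus hangs off shard at edge distance 6, so it gets pruned:
;; at the shard node d = 1, and 6 falls outside [0, 2]
```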

Performance

Theory and pictures are all nice and dandy, however I couldn't just write about BK-trees without actually implementing one. Mine is a pretty bare implementation, with very little thought put into optimizations and such, thus its performance cannot be directly compared to other existing approaches without embarrassing the author of this article. This of course didn't stop me from doing exactly that. For the performance test I've used the standard dictionary found on OS X under /usr/share/dict/words. For no reason other than to entertain myself, I've decided to look into the word length distribution of said dictionary, so I proudly present to you the graphical representation of my research:

Word length bar chart
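
(For the curious: computing the distribution itself is essentially a one-liner in Clojure; the chart above was plotted separately.)

```clojure
(require '[clojure.string :as str])

;; word length -> number of words of that length, sorted by length
(->> (slurp "/usr/share/dict/words")
     str/split-lines
     (map count)
     frequencies
     (sort-by key))
```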

Benchmark

Armed with that information, I went ahead and benchmarked my BK-tree implementation against the linear dictionary scan and Peter Norvig's approach. The methodology was pretty simple: each benchmark was executed with the same list of misspelled words, with only the maximum edit distance changing. Below you can see the results of said test:

Benchmark results

One glaring omission is Norvig's spellchecker at an edit distance of 3. This is because for longer words the number of all misspellings at an edit distance of 3 is humongous (we are talking on the order of tens of millions).
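
To put a rough number on it, following the edit counting in Norvig's article, a word of length \(n\) over a 26-letter alphabet has

$$\underbrace{n}_{\text{deletes}} + \underbrace{n-1}_{\text{transposes}} + \underbrace{26n}_{\text{replaces}} + \underbrace{26(n+1)}_{\text{inserts}} = 54n + 25$$

edits at distance 1. So a 10-letter word has 565 of them, and naively chaining three rounds produces on the order of \(565^3 \approx 1.8 \times 10^8\) candidate strings before deduplication.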

The actual results show that while the BK-tree is in fact faster than the brute-force approach, it loses significantly to Norvig's approach at edit distances of 1 and 2. The latter is also helped by the highly tuned implementation of clojure.lang.PersistentHashSet. Obviously, proper real-world spell checkers use more sophisticated solutions, such as suffix tries, phonetic algorithms, bitap, etc.

Conclusion

The BK-tree is an interesting application of a seemingly unrelated concept (the triangle inequality) to a well-established problem space. Implementing a BK-tree is easy; however, I'm still on the fence regarding the practicality of using one. Some smart people before me considered using BK-trees in Lucene but ultimately decided it was simply not worth it. I still liked writing it. :)