KD-Tree fun in QuestionTime
This past week, I was working on QuestionTime, and needed to work out where the nearest points of interest were to the user’s location in an expedient fashion. Now, the last time I did anything involving location data, my queries were ridiculously slow, and would not have held up to the speed required for a performant Android application to be demonstrable as a university project.
So to say that I freaked out a little bit was an understatement. Fortunately, this time I didn’t need to search across shapes; I just needed to search between discrete points, making it the ideal candidate for some interesting nearest neighbour algorithms.
My good friend, @IndecisiveMatt, suggested I use a kd-tree to perform the search, as it has a reasonably reliable read rate, and is quite fast for a nearest neighbour search at roughly O(log(n)) time on average. I had read a little bit about them in my pre-research for this project (to investigate whether what I wanted to do was possible), but compsci data structures are not my strong suit. I won’t cover the theory here - if you want a good intro to kd-trees, here’s a good start. It assumes some knowledge about binary search trees, so set aside an hour or two to go through it all.
Luckily the QuestionTime API is built in Python, meaning I have all the neat mathsy and machine learning libraries at my disposal. The one Matt pointed me to was scikit-learn’s KDtree class implementation, which abstracts the data struct away from you and makes it ridiculously easy to query.
How to implement it
All you need to do is create a numpy array from the coordinates:coords = numpy.array([[latitude1, longitude1], [latitude2, longitude2]...[latitudeN, longitudeN]])
And add them to your kd-tree:
kdtree = sklearn.neighbors.KDTree(coords)
Finally run your query with the relevant latitude and longitude (there’s an outer list because scikit-learn allows for multiple points to be queried against at the same time):
# Search for five nearest points, and store the indices of the first result
_, (indices,) = kdtree.query([[latitude,longitude], k=5]
Use the indices to fetch your original dataset :
closest_coords = [coords[indice] for indice in indices]
Here’s our implementation in questiontime on GitHub if you want some pointers.
The amazing thing is how performant this query is; it runs in an instant even on a low-grade Heroku box - here’s an example query. I encourage you to give it a go with your own parameters to see how fast the thing can be!
Hint: if your dataset isn’t likely to change frequently, cache the kdtree once it’s been constructed using sklearn.joblib for better ongoing performance.