Posts Tagged ‘algorithm’

How to search a data tree efficiently

Monday, February 23rd, 2009

This nice little method is one we used for PenguinPlay.

Assumptions: You know some C, and understand tree structures

If you have a whole load of data, lets say a hierarchy of categories with items in them, like in an online store, it is common to store this data in a tree. This allows you to move up and down the tree logically to access the data in the same way it is visualised.

With the logical way of doing this, there comes a problem of scalability. How do you find all of the categories that are below a certain point in the tree. Lets take a tree as an example

            +----(2)Windows
(1)Games----|               +----(5)Strategy
            +----(3)Linux---|                       +----(7)Action
            |               +----(6)First Person----|
            +----(4)Mac                             +----(8)RPG

Now to find all categories that are children of Linux, how do you do this?

Well, most of the time, the simplest way to write some code for this will find all of the categories that are children of Linux (5,6) and then for each of these find all of the children of these (7,8) and so on and so forth down the hierarchy.

This isn’t too inefficient in smaller trees, but imagine having ten levels, each with five branches at each level. You very quickly get an impossible number of items to deal with using this method.

So, how do we do this efficiently?

The solution is quite simple. It takes a moment to understand, and then it just makes you think ‘Why didn’t I think of that!’

Initially lets draw the above example again, but in a different way

+----------+
|          |
|          +------------+
|          | (2)Windows |
|          +------------+
|          |
|          +------------+
|          |            |
|          |            +-----------------+
|          |            | (5)Strategy     |
|          |            +-----------------+
|          |            |
|          |            +-----------------+
|          |            |                 |
|          |            |                 +-----------+
|          | (3)Linux   |                 | (7)Action |
| (1)Games |            | (6)First Person +-----------+
|          |            |                 |
|          |            |                 +-----------+
|          |            |                 | (8)RPG    |
|          |            |                 +-----------+
|          |            |                 |
|          |            +-----------------+
|          |            |
|          +------------+
|          |
|          +------------+
|          | (4)Mac     |
|          +------------+
|          |
+----------+

As you can see, if you look at it, the layout above shows the exact same data that is shown in the first tree, just displayed differently.

So, now onto how this helps us. Firstly, take your first category, (1)Games. We assign each category a minimum and maximum number. As we are starting, we use 1 and 2

1 +---------+
  | (1)Game |
2 +---------+

Simple enough. Now, lets add the first child, the (2)Windows. Now, as this is a child of (1)Game, then its minimum and maximum must be inside of the min and max for its parent. So, we simply change the maximum of (1)Game to 4, and set the min of (2)Windows to be 2 and the max to be 3

1  +----------+
   |          |
2  |          2------------+
   | (1)Games | (2)Windows |
3  |          3------------+
   |          |
4  +----------+

Now, lets add  in category 3 and 4, in each case, we insert the categories with min and max values that are inside of the parents, and not overlapping with any other children of that parent. Like so.

1  1----------+
   |          |
2  |          2------------+
   |          | (2)Windows |
3  |          3------------+
   |          |
4  |          4------------+
   | (1)Games | (3)Linux   |
5  |          5------------+
   |          |
6  |          6------------+
   |          | (4)Mac     |
7  |          7------------+
   |          |
8  8----------+

Simple enough, but what does that get us? Not a lot yet. Lets go a bit further down the tree, and add some children to the Linux category, and see what happens.

So, using the same rule, any child we add  must have a min and max inside its parents, and its parent must expand to accommodate it. Lets add (5)strategy in as a child of Linux.

1  1----------+
   |          |
2  |          2------------+
   |          | (2)Windows |
3  |          3------------+
   |          |
4  |          4------------+
   |          |            |
5  |          |            5-------------+
   | (1)Games | (3)Linux   | (5)Strategy |
6  |          |            6-------------+
   |          |            |
7  |          7------------+
   |          |
8  |          8------------+
   |          | (4)Mac     |
9  |          9------------+
   |          |
10 10---------+

Now, by adding the child to Linux, you have to change lots of mins and maxes. It looks like it would be a real pain to keep track of, but simply, you can just go through a flat list of all categories, and add 2 to each min and max that are greater than or equal to the old maximum of the parent. So in this case, the Linux category min/max were previously 4 and 5, so anything higher than a 4, add 2 to it, and then you have 2 spare numbers for the min/max for the new category. Simple enough. Now we can add in the final categories, and the tree looks like it did in the example further up, but with each category given a min and max number

 1 1----------+
   |          |
 2 |          2------------+
   |          | (2)Windows |
 3 |          3------------+
   |          |
 4 |          4------------+
   |          |            |
 5 |          |            5-----------------+
   |          |            | (5)Strategy     |
 6 |          |            6-----------------+
   |          |            |
 7 |          |            7-----------------+
   |          |            |                 |
 8 |          |            |                 8-----------+
   |          | (3)Linux   |                 | (7)Action |
 9 | (1)Games |            | (6)First Person 9-----------+
   |          |            |                 |
10 |          |            |                 10----------+
   |          |            |                 | (8)RPG    |
11 |          |            |                 11----------+
   |          |            |                 |
12 |          |            12----------------+
   |          |            |
13 |          13-----------+
   |          |
14 |          14-----------+
   |          | (4)Mac     |
15 |          15-----------+
   |          |
16 16---------+

OK, so there we have it, we have a tree, showing a hierarchy, and all of the categories have numbers for a min and a max value. All numbers are unique, and all numbers belonging to a child are inside the scope of their parents. Great, but how does this help us?

Take a look. If you want a list of all categories that are children of Linux (min=4, max=13) then simply, get a list of all categories whose min is more than 4 and whose max is less than 13

Simple. In a single pass through a flat list of categories, you obtain a full list of all children. No matter how deep the tree system goes, as long as you keep to the rules of calculating the min and max values, this will always work.

It also works the other way. If say, you want a list of all parents of the RPG category (min=10,max=11), simply look for all categories whose min is less than 10 and whose max is more than 11, and you get the entire ancestry of that category. All in one pass.

This all comes in especially useful if you combine it with SQL. Complete tree segments can be obtained from a single efficient SQL query. Queries are easy to generate that can find, using the above tree as an example, all Linux games of any category!

I hope some people found this useful, and I hope that some of you had the reaction I had when someone first introduced this to me several years ago – ‘Thats very cool!’

  • Share/Bookmark