Implementing a searching function has been a challenge for me personally. Recently I was assigned to implement a tagging function inside our product. Of course, searching records by tags are also needed. My first thought, well, went to ElasticSearch, which is definitely the best choice of implementing something really capable. But is it really worth it when you don't need to build a large scale system? Or what if you got limited time and you want to make it simple and stupid but you don't really want to compromise your system performance. The answer is using bitmap.
The problem can be described as, implementing a function which allows you to tag the DB entries records (uploaded by multiple users). The tags are later used in searching for specific DB entries associated with requested tags. The natural though would be adding database columns to store those tags, which is a simple and intuitive solution.
Let's make up a real world scenario, you've already got a database table that denotes all vehicle brands in North America, and you need to tag those brands with different properties, for example the vehicle types they sell in North America market as of 2020. Then the
| id | make | Sedan | Suv | Truck | Coupe |
|---|---|---|---|---|---|
| 1 | Toyota | Y | Y | Y | Y |
| 2 | BMW | Y | Y | N | Y |
| 3 | GMC | N | Y | Y | N |
| 4 | Volkswagen | Y | Y | N | N |
Well, that's it! If using ORM like Spring Data JPA, you would be able to write several method queries, although the method name itself could be a little longer than you usually write.
@Repository
interface CarBrandRepository : JpaRepository<CarBrand, Long> {
// A simple method query
fun findBySedanAndSuvAndTruckAndCoupe(sedan: Boolean, suv: Boolean, truck: Boolean, coupe: Boolean): List<CarBrand>
// Other combinations of parameters
}
This implementation would work really well before your manager want you to add more properties (tags) to the DB model, which will increase the number of method queries you need to write in a factorial scale. And you need to modify your entity model code every time you want to add a property which is not really practical and will ruin the code maintainability.
So is there a better way of doing it?
My major in college wasn't really related to computer science. But I still recalled when we were doing the Data Lab Assignment from course 15-513, I learned that we can do a plenty of logic operations using bits. So what if we store those tags by allocating just a 1-bit space to each DB record? And for example if we have 8 tags, all we need to do is assigning a 8-bit space to a DB record and the best part is we don't even need to store them as varchar, instead we just convert this 8-bit to an Integer. Using bitmap, the searching process of certain tags would become significantly faster (since bit operation itself is very fast) and it would be saving us a lot of memory spaces. More importantly, we can simply extend our definition of the tags without even modifying the code!
| Bitmap | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|
tag.id |
8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
Table 1 - Simple demonstration of bitmap (Big endian)
| id | tag name |
|---|---|
| 1 | Compact Sedan |
| 2 | Midsize Sedan |
| 3 | MidSize SUV |
| 4 | Full Size SUV |
| 5 | Convertible |
| 6 | Pickup Truck |
| 7 | Station Wagon |
| 8 | Commercial truck |
Table 2 - Tags table
If we use the bitmap above to store the vehicle types (properties) we tagged to a car brand. This car brand comes with tags including Commercial Truck, Convertible, Midsize Sedan (The tags with id \(n\) where \(n^{th}\) bit in the bitmap has the value 1). Isn't that simple and intuitive?
But how do we filter out records with certain tags? That's easy, the answer is we can search the tags by doing a logical conjunction (aka bitwise AND operation in majority of the programming languages).
For example, we can loop over all database records by simply doing a query.
select * from cars c where c.bitmap & B'1' << (n - 1);
However, you still have to loop over all the records. For a very large table, that gradually becomes unacceptable, and you may also find it hard to change the definition of the tags, for example if you want to replace the \(3^{rd}\) bit's definition to another tag, you might want to do a full table update just to update the bitmap column for reflecting the definition changes.
Now considering an alternative solution, what if we store the records ID in the bitmap other than storing
— Jun 5, 2021
Made with ❤ and at Earth.