I hear you ask: “why are you writing about columnar databases?”. To answer that, I have spent some time researching about PowerPivot and Tabular Model these days which is powered by what Microsoft calls xVelocity (previously called Vertipaq) engine. It’s a columnar in memory Engine. And curiosity got better of me when I read the word: Columnar and I wondered what’s the benefit of Columnar Databases (aka column oriented databases) ? And why do they just not use the row-oriented database that powers OLTP workload and to answer the questions I did some research and here’s what I found:
First of all, let’s understand the difference between how row-oriented database and column-oriented database stores data:
Consider that we have a Table like this:
Now, in a row-oriented databases, it’s stored like this:
And in a column oriented databases, it’s stored like this:
Benefits of columnar approach:
Imagine that we want to compute the aggregation for column “Sales”. So if we have row-oriented database then we have to get access to every “page” and then extract the value of column from each row. Lot of work, right? But in columnar databases, if we want to compute aggregations over Sales column then we just access the page that has ALL values of sales column. Does it not reduce the IO by not accessing lot of pages?
I know there’s more to the story and this is just a over-simplification of the process – But you get the point, don’t you?
Thus remember that:
Column-oriented systems are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data. – wikipedia
The OTHER benefit is in the level of compression that can be achieved. Let’s see this concept in simplest of terms.
consider the following data stored in columnar database:
|Q1||Q1||Q1||Q1||Q1……30 more times||Q2||Q2||Q2…….40 more times|
The above data can be compressed by using the form:
|value||Begin position||End position|
And as you can see, we stored 76 values using just 9 values. So when they say we do 10x compression – this is how they do it!
This was an example of Run-length encoding.
Technically there are algorithms like: Run Length Encoding, Dictionary Encoding, GZIP, LZ compression that are used to compress data. The engine (like xVelocity) decides the best algorithm to use.
so let’s connect the dots.
In-memory technologies need that ENTIRE data is loaded into MEMORY before processing. It’s beneficial IF we can compress the data and so more data can be loaded into the same amount of memory. Also, in the realm of OLAP, aggregating needs to be as efficient as it can be and as we have seen columnar databases are efficient at computing aggregations over many rows of a given column. So there’s certainly some serious benefits of using columnar databases in OLAP scenario’s and NOW I GET IT that why xVelocity is columnar and not row-oriented. Do not get me wrong, I am not saying that row oriented databases are bad, but in context of OLAP, column stores offer some benefits. And in general, the row oriented databases and column oriented databases have their set of pros and cons and there’s no superior way of doing things. Also, column stores does not need to be in-memory or in-memory does not need be to columnar. In case of xVelocity – it’s in-memory + columnar. But again In-memory does NOT equal to columnar databases.
Thus, in this blog post:
1) How do columnar databases store data?
2) Benefits of column stores
What we’ve not seen:
1) we’ve not explored the disadvantages and caveats of columnar databases.