Column-Oriented Databases – Old Idea, New wave
During the last few years I went through several POC of different Column-Store databases reviewing their functionality, performance and use cases. Usually at the beginning of every exercise I saw the impressive vendor promises of reach functionality, great performance and scalability. Some even said: this is a new trend in database world, even a standard! You do not need RDBMS anymore!
In this type of cases I usually act as conservative database architect. And you know what – that always helped eliminating additional companies’ efforts and frustrations in implementing specialized database solutions. This time I share some experiences in evaluating Column-Store databases. But let start with basics first.
While most commercial RDBMS products store data in some form of row format, some database vendors provide column-oriented storage of data. The supposed advantages of storing the data by column rather than by row include a better ability to compress the data, something that would reduce the need for disk-I/O. The idea of column-based storage is not new and has been used in commercial products from former Sybase and Sand Technology for well over a decade. In reality, each storage format has its own set of advantages and disadvantages and there is no free lunch – only tradeoffs.
The tradeoffs associated with column-based storage include the cost of tracking and eventual reconstruction of the rows to which the column values belong as well as additional complexity for ETL and OLTP processing. While recognizing that each storage format has its pros and cons and that there are scenarios where a column-based format has some merit, it is worth examining whether the column-based format lives up to its recent hype.
Beware of disingenuous benchmark numbers
Yes folks – PERFORMANCE is the main sales factor of the columnar databases!
There are claims that Column-Stores outperform a commercial row-store RDBMS by large factors. I just want to warn you to not rely blindly on magic performance benchmarks the vendors have done, in house themselves. Usually these performance test cases are not similar to the real production database loads, created often for read-only data using database engines that lacks RDBMS features and functionality that would be required in a production system.
A second observation is that the often benchmarks against Column-Stores do not test joins. Instead, all the queries that contain joins are executed against materialized views where the joins have been pre-computed. While materialized views are a legitimate technology in data warehousing, it is hardly realistic to assume that so many of them are maintained that there is never a need to perform a join in an ad-hoc query environment.
If you start digging into those performance results you can be surprised that playing with creating proper bitmap indexes in Oracle and using the right materialized views you can get much better performance results that those represented by Column-Store vendors.
So I suggest if you seriously want to compare apples with apples try to simulate a real performance comparison experiment between those database systems and do not trust magic benchmark numbers generated to promote any product or technology.
Column-Stores lack functionality
I personally consider RDBMS as a universal solder that can do everything in database world. Well, perhaps not all the operations can be done with extreme performance there but still they can be tuned well with proper coding and right hardware. Regarding Column-Stores I sometimes have the feeling that some of them explicitly remove important product functionality to achieve best performance results. Imaging, if two cars compete on fuel consumption and one of them refuses from hitting, air conditioning, automatic transmission and extra luggage locker on the roof. Well those tricks might definitely help in reducing fuel consumption but who would like driving this car in reality? For tests – maybe, for live – better not
Most Column-Stores in general support ACID (Atomicity, Consistency, Isolation, Durability) – an approach to a set of properties that guarantee that database transactions are processed reliably. However I would properly test this “in general approach to ACID”. Yes, test extreme disaster scenarios including all cluster nodes failure and see if your cluster can come up with consistent data and all transactions automatically recovered.
On the other hand, the commercial RDBMS presumably have features and functionality relating to locking, logging, backup and recovery, security, auditing, manageability and diagnoseability, replication, etc. that however require more complex data structures and longer code paths. Additionally Oracle database for example have data mining, OLAP, relational, in-memory, TEXT, XML DB and many more comprehensive components built-in in the database engine. That all definitely can eat performance to a small extent but enrich your functionality enormously.
RDBMS is fast with proper coding
I always suggest before blaming RDBMS for poor performance, teach your developers to use proper coding and database structures like bitmap indexes, materialized join view, index organized tables, etc. I often see developers to complain on slow data loads in RDBMS admiring Column-Store tests results. I often say on that: if you are guys slow on RDBMS why you think you will be faster on another database system. It will still be you who can not write the efficient code.
Organizing data by rows does have its advantages. And you have to use this performance benefits over Column-Stores in proper coding. For example, writing data to disk in row format is faster than doing so by columns. As a result, columnar databases have certain “loading” issues that are special. Another approach to significantly mitigating the performance problem in RDBMS is doing batch INSERTs and performing sorting, compression, and disk writes in large groups. Developers that complain on slow row-by-row processing in RDBMS should simply change the way they do the things.
After performance comparison I’ll continue in the next article with other differences between a Columnar database (Column-Store) and RDBMS