As the Cloudera IPO approaches (at a much-noticed discount from their valuation 4 years ago), they have taken the opportunity to call Greenplum a “Traditional Analytic Database” worth benchmarking against their Impala engine. To quote a line from the movie Moana, “What can I say except, You’re Welcome”: I’m glad that Greenplum, the world’s first open source MPP analytics database platform, was able to be there as a Juggernaut to throw rocks at. Not surprisingly, as with many vendor-provided benchmarks, I feel that their conclusion (link here) that Impala is faster than Greenplum is quite misleading.
A few key points to consider:
- Cloudera picked Greenplum as the gold standard to benchmark against, which shows which player in the market they respect. Thanks, Cloudera. I’ve been saying this for years, and it is great to be validated!
- Impala and Greenplum both crushed all the other SQL-on-Hadoop databases and database-like technologies in Cloudera’s testing, including Presto, SparkSQL, Hive, etc.
- Cloudera had to remove a significant number of queries from the benchmark, presumably because Cloudera’s engine struggles with those. These are the queries Cloudera has excluded:
  - 11 queries with ROLLUP (TPC-DS permitted variants not used in this testing)
  - 3 queries with INTERSECT or EXCEPT
  - 8 queries with advanced subquery placements (e.g. a subquery in the HAVING clause)
I find it surprising that Impala does not handle ROLLUP. CUBE, I get, but ROLLUP? That is a very basic operation analysts use for aggregation. When the Greenplum team ran experiments comparing Greenplum’s GPORCA query optimizer with Cloudera Impala, Impala could not support such operations and hung.
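For readers who have not used ROLLUP, here is a minimal sketch of what it computes. It is illustrative only: the `sales` table, its columns, and the `rollup_sales` helper are made up for this example, and it runs on SQLite (chosen only because it ships with Python), not on Greenplum or Impala. SQLite has no ROLLUP either, so each grouping level is spelled out with UNION ALL; on an engine with ROLLUP support, the whole query collapses to a single `GROUP BY ROLLUP (region, product)`.

```python
import sqlite3


def rollup_sales(rows):
    """Emulate GROUP BY ROLLUP(region, product) on a hypothetical sales table.

    Returns one row per (region, product), one subtotal row per region
    (product is None), and one grand-total row (both columns None).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    query = """
        -- detail level: one row per (region, product)
        SELECT region, product, SUM(amount) FROM sales GROUP BY region, product
        UNION ALL
        -- subtotal level: one row per region
        SELECT region, NULL, SUM(amount) FROM sales GROUP BY region
        UNION ALL
        -- grand total
        SELECT NULL, NULL, SUM(amount) FROM sales
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [("east", "a", 10), ("east", "b", 5), ("west", "a", 7)]
    for row in rollup_sales(sample):
        print(row)
```

The point of the sketch is that ROLLUP is just a hierarchy of subtotals over the same data, which is why analysts lean on it so heavily.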
A subquery in the HAVING clause is a super awesome case of a correlated subquery. If you are serious about using SQL technology, you need to support correlated queries. Maybe they removed these queries because they can be a memory hog if they are not implemented right. Given that Impala is a memory hog by itself, this is probably the case.
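As a rough illustration of the query shape being discussed, the sketch below puts a correlated subquery inside a HAVING clause. Everything in it is an assumption made for the example: the `sales` table, the column names, and the `groups_above_region_average` helper are hypothetical, it is not one of the excluded TPC-DS queries, and it runs on SQLite via Python’s standard library rather than on either engine in the benchmark.

```python
import sqlite3


def groups_above_region_average(rows):
    """Return (region, product, total) groups whose total exceeds the
    average single-sale amount of their own region.

    The subquery in HAVING is correlated: it refers to s1.region from the
    outer, grouped query, so it is evaluated per group.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    query = """
        SELECT s1.region, s1.product, SUM(s1.amount) AS total
        FROM sales AS s1
        GROUP BY s1.region, s1.product
        HAVING SUM(s1.amount) > (
            SELECT AVG(s2.amount)
            FROM sales AS s2
            WHERE s2.region = s1.region
        )
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [
        ("east", "a", 10), ("east", "a", 20), ("east", "b", 5),
        ("west", "a", 7), ("west", "b", 9), ("west", "b", 3),
    ]
    for row in groups_above_region_average(sample):
        print(row)
```

A naive engine re-runs the inner query for every group, which is where the memory and runtime cost can come from; a good optimizer rewrites it into a join.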
- Cloudera has omitted important test details. We don’t really know their test setup. Did they use the old Greenplum query optimizer, Planner, or did they use GPORCA? Did they partition the data or not? There are lots of unknowns, and again, one of the challenges with vendor-provided benchmarks is that vendors know how to configure their own platforms and are less knowledgeable about other platforms.
- Looking at open source momentum, in the last 30 days Impala has had 23 code contributors and 71 commits, while Greenplum has had 60 contributors and 374 commits. Greenplum’s momentum and growth curve continue to outpace Impala’s, even more so when you consider that Greenplum is backed by PostgreSQL core technology, which has hundreds of additional contributors, while Impala is trying to reinvent the database wheel from scratch on top of a Hadoop infrastructure not originally designed for SQL databases.
- Gartner believes Greenplum is more capable than Impala. Gartner has ranked Pivotal Greenplum quite a bit higher in the Critical Capabilities ranking for data warehouse technologies when compared to Cloudera. See image below:
After nearly a decade of Hadoop deployments, it seems the real technology advances that have come from Hadoop are HDFS (for scale-out file system storage) and Spark (for an in-memory compute engine). Yet customers are still looking for what are essentially data warehouse features, so vendors like Cloudera keep trying to turn Hadoop into a SQL relational data warehouse to meet pent-up customer demand. Having a pure-play data warehouse is a much more straightforward approach to filling this customer demand, IMHO.