Jacque Istok - My Personal Site About Data Wrangling

Change Data Capture with Amazon RDS MariaDB and Pivotal Cloud Foundry

admin July 29, 2017

Recently I needed to work out how to implement change data capture against an existing MariaDB database running on Amazon RDS. The applications using this database were written in a variety of languages (Perl, PHP, Node, Java) and were spread across several different environments, so the solution really needed to be implemented at the database level. Amazon RDS gives us fantastic operational benefits, but also some limitations that we feared would hamper our ability to do this. We have been making great use of Pivotal Web Services (PWS), which is an implementation of Cloud Foundry, and there was a big appetite to have this solution run within our PWS space.

Analytics Anywhere

admin May 18, 2017

Having just had the opportunity to present at Dell EMC World 2017, I thought I would post a quick test of my marketing skills here. The talk track was around Analytics Anywhere. Being able to run your analytics in an infrastructure-agnostic way is paramount to creating a modern data environment.

The teaser I created is here:

The platform I advocate is here:

Cloudera Just Legitimized Greenplum’s Dominance in Data

admin April 27, 2017

As the upcoming Cloudera IPO approaches (at a much-noticed discount from their valuation 4 years ago), they have taken the opportunity to call Greenplum a “Traditional Analytic Database” worth benchmarking against their Impala engine. To quote a line from the movie Moana, “What can I say except, You’re Welcome”. I’m glad that Greenplum, the world’s first open source MPP analytics database platform, was able to be there as a Juggernaut to throw rocks at. Not surprisingly, as with many vendor-provided benchmarks, I feel that their conclusion (link here) that Impala is faster than Greenplum is quite misleading.

A few key points to consider:

Cloudera picked Greenplum as the gold standard to benchmark against, showing which player in the market they have respect for. Thanks Cloudera, I’ve been saying this for years – great to be validated!
Impala and Greenplum both crushed all other SQL-on-Hadoop databases and database-like technologies, including Presto, Spark SQL, and Hive, in Cloudera’s testing.
Cloudera had to remove a significant number of queries from the benchmark, presumably because Cloudera’s engine struggles with those. These are the queries Cloudera has excluded:

11 queries with ROLLUP (TPC-DS permitted variants not used in this testing)
3 queries with INTERSECT or EXCEPT
8 queries with advanced subquery placements (e.g., a subquery in the HAVING clause)

I find it surprising that Impala does not handle ROLLUP. CUBE, I get, but ROLLUP? That is a very basic operation analysts use to do aggregation. When the Greenplum team ran experiments comparing Greenplum’s GPORCA query optimizer with Cloudera Impala, Impala could not support such operations and hung.

A subquery in the HAVING clause is a super awesome case of a correlated subquery. If you are serious about using SQL technology, you need to support correlated subqueries. Maybe the reason they removed these queries is that they can be a memory hog if you do not implement them right. Given that Impala is a memory hog by itself, this is probably the case.
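For readers less familiar with the two query shapes discussed above, here is a minimal sketch of what they compute. It runs against SQLite through Python's built-in sqlite3 module purely as a convenient stand-in; it is not Greenplum, not Impala, and not any of the TPC-DS queries. The `sales` table, its columns, and the function names are made up for illustration. SQLite does not accept ROLLUP syntax, so the first query spells out with UNION ALL the rows that `GROUP BY ROLLUP (region, product)` would return on an engine that supports it; the second puts a correlated subquery in the HAVING clause.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory table (hypothetical data, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [
            ("east", "a", 10),
            ("east", "a", 20),
            ("east", "b", 5),
            ("west", "a", 7),
            ("west", "b", 3),
        ],
    )
    return conn


# Hand-expanded equivalent of GROUP BY ROLLUP (region, product):
# detail rows, then one subtotal per region, then a grand total.
ROLLUP_EQUIVALENT = """
SELECT region, product, SUM(amount) FROM sales GROUP BY region, product
UNION ALL
SELECT region, NULL, SUM(amount) FROM sales GROUP BY region
UNION ALL
SELECT NULL, NULL, SUM(amount) FROM sales
"""

# Correlated subquery in HAVING: keep a (region, product) group only if its
# total exceeds the average single sale amount within the same region.
HAVING_SUBQUERY = """
SELECT s.region, s.product, SUM(s.amount) AS total
FROM sales AS s
GROUP BY s.region, s.product
HAVING SUM(s.amount) > (
    SELECT AVG(s2.amount) FROM sales AS s2 WHERE s2.region = s.region
)
ORDER BY s.region, s.product
"""


def rollup_equivalent(conn):
    """Return detail, subtotal, and grand-total rows as a list of tuples."""
    return conn.execute(ROLLUP_EQUIVALENT).fetchall()


def above_regional_average(conn):
    """Return the groups that pass the correlated HAVING filter."""
    return conn.execute(HAVING_SUBQUERY).fetchall()


if __name__ == "__main__":
    db = make_db()
    for row in rollup_equivalent(db):
        print(row)
    for row in above_regional_average(db):
        print(row)
```

The subtotal and grand-total rows carry NULL in the rolled-up columns, which is how ROLLUP output is conventionally read. The inner query in the second example depends on the region of the group being evaluated, which is what makes it correlated.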

Cloudera has omitted important test details. We don’t really know their test setup. Did they use the old Greenplum query optimizer, Planner, or did they use GPORCA? Did they partition the data or not? There are lots of unknowns, and again, one of the challenges with vendor-provided benchmarks is that vendors know how to configure their own platforms and are less knowledgeable about other platforms.
Looking at the open source momentum of Impala, in the last 30 days it has had 23 code contributors and 71 commits. Greenplum has had 60 contributors and 374 code commits. Greenplum’s momentum and growth curve continue to outpace Impala’s, even more so when you consider that Greenplum is backed by PostgreSQL core technology, which has hundreds of additional contributors, while Impala is trying to reinvent the database wheel from scratch on top of a Hadoop infrastructure not originally designed for SQL databases.
Gartner believes Greenplum is more capable than Impala. Gartner has ranked Pivotal Greenplum quite a bit higher in the Critical Capabilities ranking for data warehouse technologies when compared to Cloudera. See image below:

After nearly a decade of Hadoop deployments, it seems the real technology advances that have come from Hadoop are HDFS (for scale-out file system storage) and Spark (for an in-memory compute engine). Yet customers are still looking for essentially data warehouse features, so vendors like Cloudera keep trying to turn Hadoop into a SQL relational data warehouse due to pent-up customer demand. Having a pure-play data warehouse is a much more straightforward approach to filling this customer demand, IMHO.

I’m a Greenplum, I’m a Netezza – Details & Bloopers

admin September 6, 2016

Hopefully you remember my previous post where I introduced the first set in a series of videos. It almost goes without saying, but I’ll say it anyway: the creation of a video, any video, from a full-length feature film to a home video, requires a lot of content. From that content, much of which is repetitive and redundant, we whittle it down into something consumable.

You may have seen pictures of a movie clapper:

movie clapper

And you may have heard stories of doing the same “scene” over and over again. In recent years, it has become standard operating procedure to include some of the outtakes or bloopers at the end of many videos (sometimes those are even funnier than the movie itself). Today I would like to show you the outtakes from my last personal adventure.

Bloopers

Jeff and I had a lot of fun doing this project, and at times got a little silly as Amy would attest to. But I think that when you’re locked in a room for 5 hours – wearing a tie and jacket, with nothing but body heat and hot lights, sweating like a big man eatin’ pasta – silliness is bound to happen.

I also received some questions around the idea and the motivation behind each of these videos. Due to the extreme lack of creativity of this particular engineer, the ideas and content were largely spoofed verbatim. So, for a side by side comparison please look here:

Original	Jacque-ified
InfoWorld

Better

Counselor

Self-Pity

Roadmap

I’m a Greenplum, I’m a Netezza

admin August 22, 2016

I travel. A lot. This means that I get the opportunity to watch a lot of movies, a lot of TV, surf a lot of Internet, and see a lot of commercials. When you are as lucky as I am to do that, it gives you ideas. Some of those ideas are awesome (others are really not). For example, I was recently reminded of the clever, funny “I’m a Mac, I’m a PC” commercials from Apple many years ago.

This gave me an idea. I really wanted to spoof them in some way that was true to form, but more relevant to things that I work on directly (Mac, long before these commercials, won the PC battle for me). It’s in the spirit of those, and with a bit of nostalgia, that I enlisted some help from my colleagues Amy Benoit and Jeff Kelly. With extra special thanks to my new friends Vinny DiBenedetto and Jeff Magni, I’d like to introduce you to my own personal pet project of humor.

I plan to write a separate post describing how we came up with each of these spoofs, with some insights around where the ideas came from and what the videos actually tried to say.

I hope you have as much fun watching these spoofs as I had making them. Stay tuned for a future post where I present the outtakes/bloopers!

InfoWorld

Better

Counselor

Self-Pity

Roadmap
