How to set up a data warehouse in a day without any code

December 20, 2017

How to set up a data warehouse in a day without any code

For a short period of time, I was the Director of User Acquisition for a company called US Mobile. In my eyes, user acquisition requires a huge amount of cross functional visibility in order to get the answers to questions like "how do I spend our budget?" or "where is the growth bottleneck?" At the time, we didn't really have easy ways to get answers to those questions, so we had to solve the problem of getting easy access to data.

At every company that I've ever worked for (ie. Amazon), the gap between what the engineers can get access to and what the business team can get their hands on is large. There are, of course, vertical specific solutions that tailor themselves to specific demographics. For example, Google Analytics, Mixpanel, and Kissmetrics give a view into website analytics which is great for PMs and marketers. Intercom, Olark, et al. typically have some analytics for customer success. Some of the data is going to be internally in the database. Others will be in logs. Maybe you might have Adwords, Facebook, and affiliate CPC data. Maybe there's a CRM that contains B2B sales data. Oh, and email analytics in Mailchimp (for marketing emails), Sendgrid (for transactional emails), maybe GMail (for adhoc emails). Stripe/Braintree for transactional data. And so on.

That's an awful lot of data sources that you'd need to correlate to get any kind of answers that require the whole picture. If I wanted the answer to the question "for customers who have been contacted using email drip sequence A, can I increase LTV by increasing on site customer interactions through the chat window on the landing page?" It'd take me an awful lot of work as I'd need to stitch together data from my email provider, order data, session data, chat provider, and user data. If my entire customer acquisition strategy depends on email drip to landing page, for example, this is an important thing to know.

There are three common approaches to this problem:

Gut driven decision making.
The business person will painstakingly download CSVs from a variety of sources, dump, them into Excel, attempt to analyze the raw data, crash their computer because there is too much data, and eventually get to a point where they can put a graph into a deck. In companies that move at the speed of molasses, this is par for the course.
The engineering person (or team) will spend half a year authoring various ETL processes (programs that nicely aggregate data into a data warehouse) to import data into a data warehouse, then perpetually spend time maintaining that process.

To be clear, I think in a Fortune 500-style enterprise, the third option isn't a bad option and is likely the right way to go. But for everybody else, having added friction to get answers means less likelihood to get them. For most early stage startups, half a man-year of engineering time just isn't a tradeoff that is easily justified. In our case, #2 was what we did. There just so happens to be another way and it takes on the order of hours (minutes in some cases) to set up.

These are the 4 steps of how we set this up:

Inventory: Create an inventory of all the data sources that your company is interested in.
Data Store: Select and set up a data store.
Aggregate: Aggregate all data sources using a service provider.
Visualization: Select visualization tools according to team needs, technical competencies, and cultural quirks.

Inventory: Create an inventory of all the data sources that your company is interested in

Probably the most straightforward step. Write down all the data sources that you've got that informs or will inform business decisions. It's important to consider future data sources including sources to be made by the engineering team. Finally, prioritize them in terms of need as it may not be possible to get all data sources integrated on the first pass. In US Mobile's case, we had a mix of external services and internal databases.

Data Store: Select and set up a data store.

Before we head into the weeds, a disclaimer: because of the depth of the topic, I'm only going to touch on the topic of data store. So please note that this a very light treatment.

In any case, the data store has a tremendous effect on who can effectively use this project, who can't, and to what degree they can get their answers. It dictates what software integrates well, how fast people can get answers (if at all), if there's engineering time required to maintain the data store, etc.

There's no one size fits all, however, the following requirements are what I'd think would be good assumption for most teams:

Should integrate with as many third party BI tools as possible
Most BI tools use SQL underneath it all, any analyst worth their salt will know SQL, SQL is the most mainstream language for data, so it must have a SQL interface
Should be able to crunch massive amounts of data without having expertly written/optimized SQL (as business analysts tend not to be at the level of DB administrators)
Shouldn't require (much) maintenance
Should be cheap
Should be scalable by the amount of data, consumers, frequency of requests, etc.
Should have enterprise level support available
Should have evidence of longevity (ie. not some database written that's going to disappear tomorrow)

Given those requirements, there are the two choices we had that tick all/most of those boxes:

Amazon Redshift
Google BigQuery

BigQuery was picked over Redshift. Here's why:

BigQuery fully abstracts the machine from the database meaning less maintenance, easier scalability, greater granularity in billing
Startups tend to love Google Drive/Docs, and BigQuery integration with Google Sheets is wonderful
BigQuery has a free tier, so if this is done early enough, we've got the benefit of having this part of the equation for free

Once you've chosen your database and set it up, head over to step 3.

Aggregate: Aggregate all data sources using a service provider.

Traditionally, normalizing and aggregating data into a single source of truth has been the most time consuming step for most companies. However, because the Internet isn't so new anymore, many data sources are standardized. For example, most companies are probably not going to implement their own Google Analytics from scratch when Google Analytics is free. This has led to a new breed of companies which build the extraction for one data source and sell that as a service to many companies.

A few of those companies include:

Stitch
Alooma
Xplenty
Matillion
Fivetran
Rivery
Treasure Data
Blend

It's important to note that not only do these services work with external services, but also databases that work in house. Because they do work with internal databases (ie. PostgreSQL, MariaDB, MongoDB, ElasticSearch), there are some security issues involved. Whole most of this can be implemented by someone who's not technical, this step might require an engineer's help.

Based on step 1, you should have a list of data sources you're interested in. Your choice of which provider you go with should likely include:

How much overlap you have with the provider
Cost
Support/reputation/risk

Once you've selected your provider, aggregating all of your data sources into a single data warehouse should take on the order of a few hours. For the most part, these services a something of a commodity with fairly limited differentiating factors (at least for the time being). At the time of writing and on the low end, Stitch was by far the most cost effective option and what we used for that reason.

Visualization: Select visualization tools according to team needs, technical competencies, and cultural quirks.

Every team, company, and person has different needs. There might be someone in marketing with zero technical experience and the need for understanding spend, ROI, churn, and LTV. Accounting and legal might want financial and compliance data. Engineers might need to understand where and how users are hitting bugs. Product managers might be concerned, specifically, with user experience and conversion metrics.

As a result, there's really no one size fits all but a whole slew of different solutions out there. The wonderful thing about aggregating all our data sources is not only can we correlate data from disparate sources, but most visualization tools will work with the data warehouse you should have chosen.

It's of note, that at the time of writing, there are likely over 200 companies working on this particular problem, so it's likely there's a solution that's perfect for you.

The following is a list of my personal favorite visualization tools:

Redash (Free and OSS)
Periscope
Looker
Chartio
Google Data Studio (Free)

Here's a set of criteria that I had used:

Cost
Team's data concerns
Team's technical competence
Integration with third parties (ie. Slack)
Reporting capabilities

My recommendation in many cases would be Looker. However, if your team is highly competent in SQL and you don't have a budget, Redash is a wonderful solution and what we used at US Mobile.

Hook the tool up to your data warehouse and we're done! You now should have the ability to visualize your entire business, get answers to virtually any question at your disposal without having written a single line of code in a few hours.

Search This Blog

Brandon Wang