This page provides you with instructions on how to extract data from a MySQL database and load it into Amazon Redshift on an ongoing basis. (If this manual process is a bit more involved than you’d prefer, check out Stitch, which can do all the heavy lifting for you in just a few clicks.)
What is MySQL?
MySQL is the world’s most popular open-source relational database management system (RDBMS). It is the backbone of countless websites and applications, and chances are you interact with MySQL-powered technology every day. However, MySQL is largely used as a transactional or operational database, and is not nearly as optimized for analytics as Amazon Redshift.
Pulling Data Out of MySQL
There are several methods for extracting data from MySQL, and the one you use will probably be dependent upon your needs (and skill set).
The most common way is simply writing queries. SELECT queries allow you to pull exactly the data you want by specifying filters, ordering, and limiting results. If you have a specific subset of data in mind or are looking to continuously monitor a subset of a specific table, SELECT queries may be a good fit.
If you’re just looking to export data in bulk, however, there may be an easier way. Most MySQL installs include a handy command-line tool called mysqldump that allows you to export entire tables and databases in a format you specify (i.e. delimited text, CSV, or SQL queries that would restore the database if run).
Preparing MySQL Data for Redshift
Here’s the tricky part: for every table in MySQL, you need a matching table in Redshift to receive the data. Thankfully, Redshift’s syntax is based on the syntax of Postgres, another relational database built on the SQL standard. If you run MySQL command like SHOW CREATE TABLE, you will receive syntax that will serve as a good starting point for creating a Redshift table that can receive the data.
That said, however, it’s not a one-to-one match. You’ll need to familiarize yourself with the Redshift CREATE TABLE statement and the implications of, for example, selecting certain sort keys at creation time. Redshift is a very different beast than MySQL and it’s important that you appreciate the performance implications that your table structure can create.
Inserting MySQL Data into Redshift
With a table built, it may seem like the easiest way to add your exported data (especially if there isn’t much of it), is to build INSERT statements to add data to your Redshift table row-by-row. If you have any experience with SQL, this will be your gut reaction. But beware! Redshift isn’t optimized for inserting data one row at a time, and if you have any kind of high-volume data being inserted, you would be much better off loading the data into Amazon S3 and then using the COPY command to load it into Redshift.
Keeping Data Up-To-Date
So, now what? You’ve built a script that pulls data from MySQL and loads it into Redshift, but what happens tomorrow when you have new and updated records in your MySQL database?
Depending on how you’ve built your script, you may be forced to load your entire database into Redshift again. This might be slow and painful, or even have performance implications on your MySQL instance.
The key is to build your script in such a way that it can also identify incremental updates to your data. If your MySQL tables have fields like modified_at or auto-incrementing primary keys, you can build a script that can quickly identify records that are new or changed since your last update (or since the newest record you’ve copied into Redshift). You can set your script up as a cron job or continuous loop to keep pulling down new data as it appears.
Other Data Warehouse Options
Redshift is totally awesome, but sometimes you need to start smaller or optimize for different things. In this case, many people choose to get started with Postgres, which is an open source RDBMS that uses nearly identical SQL syntax to Redshift. If you’re interested in seeing the relevant steps for loading this data into Postgres, check out MySQL to Postgres
Easier and Faster Alternatives
Here’s the deal: if you have all the skills necessary to go through this process, chances are building and maintaining a script like this isn’t a very high-leverage use of your time.
Thankfully, products like Stitch were built to solve this problem automatically. With just a few clicks, Stitch starts extracting your MySQL data, building a matching structure in your Amazon Redshift data warehouse, and inserting that data automatically as it changes.