Benchmarking Redshift COPY

    I’ve recently been working with AWS Redshift to build a data warehouse for Toro. Redshift is an interesting technology for a number of reasons. MPP is compelling in contrast to Hadoop: a SQL interface is far simpler and more efficient than writing MapReduce jobs, and query performance is superior [1,2,3,4]. Among MPP databases, Redshift is an attractive option since it is a hosted solution, which reduces DevOps workload, and it has a lower TCO than competing MPP options.

    For a bit of context, the system is for real-time analytics: events are produced to a data broker in real time and then loaded into the warehouse periodically, in this case every 15 minutes. Vacuum isn’t run on every load but once a day, at night. At a high level, this is the same architecture as the system at Zynga, which processed billions of events daily in real time. Since the data is loaded incrementally as it is produced, I expect to see far smaller load sizes than if I were primarily using Hadoop for storing event data and occasionally loading Redshift via ETL as a query store (e.g., Pinterest’s use case for Redshift). Benchmarking loads with billions of rows at once isn’t interesting in the context of this system; the data volumes per load are moderate. However, the constraints on how long a load takes are much more stringent, as long loads would delay analysis and could affect the operational stability of the warehouse. For this exercise, I wanted to put concrete performance numbers around several schemes for loading data and to get a sense of how quickly we should plan to add capacity to the warehouse, since load times directly affect how quickly the cluster needs to be expanded.

    As for actually getting data into Redshift, there are a number of options. However, when working with data of any substantial volume, your options whittle down to one: the COPY command, preferably from a file in S3. I tested three factors for performance: file format, compression, and splitting the input file and loading via a manifest. The tests were performed on a single-node Redshift cluster with a dw2.xlarge instance. The table was dropped at the start of each run, and each test then ran the load up to 96 times (to simulate one day’s worth of data loading).
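
    To make the setup concrete, here is a minimal sketch in Python of the kind of timed COPY each run issued. The events table, bucket, cluster endpoint, and credentials are placeholders rather than the actual harness used for these numbers, but the shape of the COPY is the same:

        # A minimal sketch of a single timed COPY from S3. The 'events' table,
        # bucket, endpoint, and credentials below are placeholders.
        import time

        import psycopg2

        COPY_SQL = """
            COPY events
            FROM 's3://example-bucket/loads/events.psv.gz'
            CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
            DELIMITER '|'
            GZIP;
        """

        conn = psycopg2.connect(
            host='example-cluster.abc123.us-east-1.redshift.amazonaws.com',
            port=5439, dbname='analytics', user='master', password='<password>')
        conn.autocommit = True

        start = time.time()
        with conn.cursor() as cur:
            cur.execute(COPY_SQL)  # Redshift pulls the file directly from S3
        print('COPY took %.2f seconds' % (time.time() - start))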

    The tested data sets contained 10K, 100K, and 1M events (using a set of events collected from past activity). For reference, this translates into the following event volumes in a system loading events every 15 minutes:

    Events Loaded per Interval   Events/Sec   Events/Day
    10,000                       11           960,000
    100,000                      111          9,600,000
    1,000,000                    1,111        96,000,000

    Benchmarking Formats & Compression

    I tested two formats: JSON (using the ‘auto’ option to map properties to Redshift columns) and pipe-delimited. I tested both formats with both compressed and uncompressed data; the COPY variants are sketched below, followed by the results.
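
    For reference, the two statements below sketch the JSON (‘auto’) and pipe-delimited variants, reusing the placeholder table and bucket from the earlier example; the GZIP keyword is simply dropped for the uncompressed runs:

        # Sketch of the two format variants compared below, using the same
        # placeholder table and bucket as above.
        JSON_AUTO_COPY = """
            COPY events
            FROM 's3://example-bucket/loads/events.json.gz'
            CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
            JSON 'auto'
            GZIP;
        """

        PIPE_DELIMITED_COPY = """
            COPY events
            FROM 's3://example-bucket/loads/events.psv.gz'
            CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
            DELIMITER '|'
            GZIP;
        """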

    10K events

    Format                      Avg Copy Time (s)   Size (MB)   Speed (MB/sec)
    JSON                        3.28                5.19        1.58
    Pipe Delimited              2.35                2.05        0.87
    Compressed JSON             1.30                0.18        0.14
    Compressed Pipe Delimited   1.23                0.16        0.13

    100K events

    Format                      Avg Copy Time (s)   Size (MB)   Speed (MB/sec)
    JSON                        25.83               51.90       2.01
    Pipe Delimited              9.57                20.48       2.14
    Compressed JSON             2.97                1.85        0.63
    Compressed Pipe Delimited   2.30                1.56        0.68

    1M events

    Format                      Avg Copy Time (s)   Size (MB)   Speed (MB/sec)
    JSON                        262.98              518.95      1.97
    Pipe Delimited              99.35               204.85      2.06
    Compressed JSON             17.92               18.54       1.03
    Compressed Pipe Delimited   10.32               15.58       1.51

    While I expected that compressing the input file would improve performance, I was surprised by just how much of a difference it made. As the data clearly illustrates, the size of the input file is the dominant factor in loading performance in this test. Additionally, we can see that, for a single dw2.xlarge instance, load speeds from a single file have an upper bound of ~2 MB/sec.

    I also found it surprising how quickly loading uncompressed data becomes unmanageable. With 10M events daily loaded as uncompressed JSON, the node would already spend about 26 seconds every 15 minutes doing a load, or cumulatively over 40 minutes a day loading data. At 100M events daily, the node would spend roughly 30% of its time loading data, assuming nothing else is running on the cluster.
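
    A quick back-of-the-envelope check of those figures, using the uncompressed JSON per-load times from the tables above:

        # Cumulative load time implied by the uncompressed JSON rows above
        # (25.83s per 100K-event load, 262.98s per 1M-event load).
        LOADS_PER_DAY = 24 * 4  # one load every 15 minutes

        for daily_events, seconds_per_load in [(10000000, 25.83), (100000000, 262.98)]:
            total_seconds = seconds_per_load * LOADS_PER_DAY
            pct_of_day = 100.0 * total_seconds / (24 * 3600)
            print('%dM events/day: %.0f minutes loading, %.0f%% of the day'
                  % (daily_events / 1e6, total_seconds / 60, pct_of_day))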

    Benchmarking Manifest

    In order to load a multi-node Redshift cluster efficiently, you absolutely want to use multiple files so that Redshift can spread the loading work across the nodes in the cluster [1]. The question I had here was how much of a difference this would make on a single node; there are some indications that it can significantly improve performance [1,2].

    In this case, for 1M rows, I tested splitting the file into two and loading via a manifest referencing those two files.
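
    A sketch of what that looks like: a two-entry manifest uploaded to S3 and a COPY that points at the manifest instead of a single file (bucket, keys, and credentials are placeholders, as before):

        # Sketch of the manifest variant: a two-entry manifest plus a COPY
        # that references it. All names are placeholders.
        import json

        manifest = {
            "entries": [
                {"url": "s3://example-bucket/loads/events.part0.json.gz", "mandatory": True},
                {"url": "s3://example-bucket/loads/events.part1.json.gz", "mandatory": True},
            ]
        }
        # json.dumps(manifest) would be uploaded to, e.g.,
        # s3://example-bucket/loads/events.manifest before issuing the COPY below.

        MANIFEST_COPY = """
            COPY events
            FROM 's3://example-bucket/loads/events.manifest'
            CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
            JSON 'auto'
            GZIP
            MANIFEST;
        """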

    Compressed JSON, 1M events

    Format               Avg Copy Time (s)   Size (MB)   Speed (MB/sec)   Performance Improvement
    Single file          17.92               18.54       1.03             –
    Manifest – 2 files   15.14               18.54       1.22             15.52%

    Compressed Pipe-Delimited, 1M events

    Format               Avg Copy Time (s)   Size (MB)   Speed (MB/sec)   Performance Improvement
    Single file          10.32               15.58       1.51             –
    Manifest – 2 files   7.30                15.58       2.14             29.32%

    I only tested 1M events here, since the 10K and 100K compressed data sets are trivial to load from a single file. The speedup on one node from splitting into two files is noticeable, though load speeds still appear to be limited to ~2 MB/sec.

    Conclusion

    The key takeaway is that the size of the loaded file is the dominant factor in COPY performance into Redshift. Even at moderate data volumes, the performance difference between loading methods is significant: loading 1M events took, on one extreme, 262.98 seconds as uncompressed JSON and, on the other, 7.30 seconds as a manifest of two compressed pipe-delimited files. That is a difference of roughly 36x.

    The other takeaway is that a single-node Redshift cluster doing periodic incremental loading can easily handle 100M events daily, which translates to roughly 3B events and 45GB of compressed data per month. That is pretty good, as other factors (e.g., query performance, or storage capacity and desired data retention) will become limiting well before loading performance does.
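
    For completeness, the arithmetic behind that monthly estimate, using the compressed pipe-delimited size from the tables above:

        # Rough monthly volume implied by 100M events/day, using the compressed
        # pipe-delimited size from the tables above (15.58 MB per 1M events).
        daily_events = 100e6
        mb_per_million_events = 15.58

        monthly_events = daily_events * 30
        monthly_gb = (daily_events / 1e6) * mb_per_million_events * 30 / 1024.0
        print('%.0fB events, ~%.0f GB compressed per month'
              % (monthly_events / 1e9, monthly_gb))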
