Lazy, impatient and hubristic: July 2013

Are you trying out RedShift, but not quite getting what you want? Confused why there are no INDEX statements? Here is a very quick troubleshooting guide. Disclaimer: RedShift is still new and I haven't used it for that long yet. If you find something inaccurate in this article, please let me know.

Common design mistakes

Writing directly from applications

Directly writing from applications incurs high write latency. The most common way of loading data to RedShift is to periodically export your data to S3 and then use the COPY command.

Frequently modifying rows

Addition of rows can be handled much better in RedShift. You should consider if representing your modifications as new rows (i.e. insert 'deleted' rows instead of actually deleting the original row) will work better.

Choosing a bad DISTKEY (we'll cover this in detail later)

The most expensive queries in RedShift are those that do large re-distribution of data. This occurs when you join tables that use a different DISTKEYs.
Another common mistake is to choose a DISTKEY that causes a "data skew".

Overly avoiding JOINs

Joining large tables isn't something to be scared of if both tables use the join key as DISTKEY. If it makes things easier, don't be afraid of it.

Common confusions

Why do I have inconsistent data in my tables?! I had defined primary key/foreign key/unique constraints!

RedShift uses them to optimise queries, but it does not enforce it. You need to enforce it yourself in the ETL process.

OK, how do I create indexes?

RedShift doesn't have the usual INDEXes you'll find in other RDBMS.
You have knobs to turn, though. DISTKEY and SORTKEY can be thought as indexes that you fiddle with.

Stuff that you should do

Consider choosing DISTKEY

What is "DISTKEY" anyways?

DISTKEY essentially decides which row goes to which node. For example, if you declare "user_id" as DISTKEY, RedShift will do node_id = hash(user_id) % num_nodes to choose the node to store that row. Well, it's not THAT simple, but you get the idea.

Why does it matter?

DISTKEY primarily matters when you do a join. Let's say a SQL statement SELECT * FROM User INNER JOIN Post ON (User.UserId = Post.UserId) WHERE Post.Type = 1 is issued. If User and Post both used UserId as DISTKEY, a RedShift node can just take the allocated shard, join them, filter them and send the (much smaller) contribution over the wire to be combined. However, if User was distributed by UserId and Post was distributed by ArticleId, Posts that belong to Users on a node will be on other nodes. Therefore the nodes have to ship the entire shard over the network to perform the join, which is expensive.

What should I do?

If a table is large and you anticipate a join with another large table, then consider choosing the key that will be used for the join to be the DISTKEY. In other words, unless this is the case don't declare a DISTKEY (RedShift will distribute the rows evenly)

What is "data skew"?

Data skew is when data concentrates on small number of nodes due to a badly chosen DISTKEY. Imagine you have a huge user base which are predominantly located in US. If you use "country_code" as DISTKEY, most of the data will end up on one node because most users will have the same counry_code "US". This means that this one node will do most of the work while other nodes will remain idle, which is inefficient. Therefore, it's important to choose a DISTKEY that will result in an even(-ish) distribution among the nodes.

Consider "series table" to deal with writes to your tables

A big part of RedShift's performance comes from the optimised data storage. When you newly load data into a table, its storage is neatly optimised. As you make modifications to the table, you start to disrupt this optimised state, a bit like "fragmenting your hard disk". That's why you have to perform ANALYZE/VACUUM time to time to correct this (a bit like doing a "defrag"). This can however become expensive at some point. This is where "series tables" helps. For example, you can create a "daily" table for each day and use UNION statement to provide a view that combines these tables. This way, you can perform ANALYZE/VACUUM only on the latest table as you load data & simply get rid of old tables to expire data rather than having to delete rows from a huge table and optimising it afterwards. This is also recommended in the RedShift documentation.

Use SORTKEY

SORTKEY essentially defines how the data will be sorted in the storage.
This feature is useful to limit the amount of data that has to be scanned. For example, if I have a large table full of news paper articles over a century and want to find article published between 1980 - 1985 that mention "Tiger", it's useful to have articles sorted by published_date on the storage, because that way I can limit the scanning on blocks that contain these dates.
They are also useful for joining if the key is also the DISTKEY because the query planner can skip a lot of work.
You *can* specify multiple SORTKEYs. When you specify SORTKEY(a, b), the data is effectively sorted as if with "ORDER BY (a, b). If cardinality of a is high enough, filtering by a is very effective, but having a second SORTKEY will make small sense, and vice versa. Therefore the utility of setting multiple SORTKEY is more difficult to judge. Start with a single SORTKEY and see how it goes.

Consider replicating the table as a "JOIN INDEX" ala Teradata if you have more than one column you want to choose as DISTKEY

You have more than one column you'd want to elect as DISTKEY but RedShift only lets you choose one. In such cases, you can simply create a replicated table that only differs in which key is declared as DISTKEY. This might seem like a poor idea, but it's essentially what Teradata (a similar technology to RedShift) does for its join index feature. You might be worried about maintaining the consistency between these tables, but because you are usually doing analytics & load data in bulk, it's usually not a problem.

Miscellaneous Notes

Don't worry if your CPU utilisation is high

Part of what makes these technologies powerful is the ability to exploit HW through efficient parallell processing, which means high CPU utilisation (spikes). Don't think you need to add nodes just because CPU utilisation sometimes hits 100%.
Don't focus on CPU and overlook other signs, like high network usage (which may indicate data re-distribution).

Use WLM to counter resource hogging

When queries are issued concurrently, resource hogging can become a problem. For example, if somebody issues 10 queries that take 1 hour each, another guy with a 5 min query can wait for a long time before he can get his query done. To prevent this kind of problem, consider using WLM.

Happy RedShift development! :)

Lazy, impatient and hubristic

Saturday, July 6, 2013

A simplistic RedShift troubleshooting guide

Table of contents:

Common design mistakes

Common confusions

Stuff that you should do

Miscellaneous Notes

What will happen with Ethereum in 30 years?

Search This Blog