Saturday, July 6, 2013

A simplistic RedShift troubleshooting guide

Are you trying out RedShift, but not quite getting what you want? Confused why there are no INDEX statements? Here is a very quick troubleshooting guide. Disclaimer: RedShift is still new and I haven't used it for that long yet. If you find something inaccurate in this article, please let me know. 


Table of contents:

  • Are you using it for something it wasn't built for?
  • Common design mistakes
  • Common confusions
  • Stuff that you should do
  • Miscellaneous Notes

Are you using it for something it wasn't built for?
This is the most common source of trouble. If you answer YES to any of the following questions, you might be applying RedShift to the WRONG problem.
  1. I'm using it for something other than analytics
    • Yellow flag. RedShift isn't a replacement for "normal" relational DBs (aka "OLTP" DBs) like MySQL, Oracle etc. If you are directly writing to RedShift from your business applications, it's probably wrong.
  2. I don't do aggregation (SUM, COUNT etc.) or TopN (ORDER BY/LIMIT) queries
    • RedShift is primarily good at scanning large numbers of rows. If you want to quickly fetch a small number of records, you might get better results with other DBs that index these attributes.
  3. I'm looking for very short response time for reads (e.g. < 1 sec)
    • RedShift is not optimised for "light" queries and tends to have longer response times than other DBs in this domain.
  4. I'm looking for very short response time for writes
    • RedShift's latency & throughput for trickle loading were very poor in our experiments. At the moment I recommend periodically performing a bulk load instead. (However, AWS claims it does handle trickle loading well; let me know of your experiences!)
  5. I want to run custom functions on my data
    • At the moment, you can't run arbitrary operations on your data in RedShift. If you want to scrub data, parse text, apply models etc. using non-SQL languages like Java or Python, you can't use RedShift for that (hint: use EMR for that stage). 

Common design mistakes

  1. Writing directly from applications
    • Directly writing from applications incurs high write latency. The most common way of loading data into RedShift is to periodically export your data to S3 and then use the COPY command.
  2. Frequently modifying rows
    • Adding rows is handled much better by RedShift than modifying them. Consider whether representing your modifications as new rows (e.g. inserting 'deleted' marker rows instead of actually deleting the originals) will work better.
  3. Choosing a bad DISTKEY (we'll cover this in detail later)
    • The most expensive queries in RedShift are those that do a large re-distribution of data. This occurs when you join tables that use different DISTKEYs.
    • Another common mistake is to choose a DISTKEY that causes a "data skew".
  4. Overly avoiding JOINs
    • Joining large tables isn't something to be scared of as long as both tables use the join key as their DISTKEY. If it makes things easier, don't be afraid of it.
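The bulk-load flow from point 1 looks roughly like this (the bucket path and credentials are placeholders, and `post` is a hypothetical table):

```sql
-- Export data to S3 first, then load it in bulk.
COPY post
FROM 's3://my-bucket/exports/post/'
CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
GZIP
DELIMITER '|';

-- After a big load, refresh planner statistics and reclaim space.
ANALYZE post;
VACUUM post;
```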

Common confusions

  1. Why do I have inconsistent data in my tables?! I had defined primary key/foreign key/unique constraints!
    • RedShift uses them to optimise queries, but it does not enforce them. You need to enforce them yourself in the ETL process.
  2. OK, how do I create indexes?
    • RedShift doesn't have the usual INDEXes you'll find in other RDBMS.
    • You have knobs to turn, though. DISTKEY and SORTKEY can be thought of as the indexes that you fiddle with. 
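As a sketch of those knobs: both are declared when the table is created. The table and column names here are made up for illustration.

```sql
-- Rows are placed on nodes by hash of user_id,
-- and stored sorted by created_at.
CREATE TABLE post (
  post_id    BIGINT,
  user_id    BIGINT,
  type       SMALLINT,
  created_at TIMESTAMP
)
DISTKEY (user_id)
SORTKEY (created_at);
```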

Stuff that you should do

  1. Consider choosing DISTKEY
    • What is "DISTKEY" anyways?
      • DISTKEY essentially decides which row goes to which node. For example, if you declare "user_id" as DISTKEY, RedShift will do node_id = hash(user_id) % num_nodes to choose the node to store that row. Well, it's not THAT simple, but you get the idea.
    • Why does it matter?
      • DISTKEY primarily matters when you do a join. Let's say a SQL statement SELECT * FROM User INNER JOIN Post ON (User.UserId = Post.UserId) WHERE Post.Type = 1 is issued. If User and Post both used UserId as DISTKEY, a RedShift node can just take the allocated shard, join them, filter them and send the (much smaller) contribution over the wire to be combined. However, if User was distributed by UserId and Post was distributed by ArticleId, Posts that belong to Users on a node will be on other nodes. Therefore the nodes have to ship the entire shard over the network to perform the join, which is expensive.
    • What should I do?
      • If a table is large and you anticipate a join with another large table, then consider choosing the key that will be used for the join as the DISTKEY. In other words, unless this is the case, don't declare a DISTKEY (RedShift will then distribute the rows evenly).
    • What is "data skew"?
      • Data skew is when data concentrates on a small number of nodes due to a badly chosen DISTKEY. Imagine you have a huge user base that is predominantly located in the US. If you use "country_code" as DISTKEY, most of the data will end up on one node because most users will have the same country_code "US". This means that this one node will do most of the work while other nodes remain idle, which is inefficient. Therefore, it's important to choose a DISTKEY that will result in an even(-ish) distribution among the nodes.
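The hash placement and the skew it can cause are easy to sketch in a few lines of Python. This is a conceptual model only: `crc32` stands in for RedShift's real (undocumented) hash, and the node counts are made up.

```python
import zlib
from collections import Counter

NUM_NODES = 4

def node_for(distkey_value, num_nodes=NUM_NODES):
    # Conceptually: node_id = hash(distkey_value) % num_nodes.
    return zlib.crc32(str(distkey_value).encode()) % num_nodes

# High-cardinality key (user_id): rows spread evenly across nodes.
user_rows = Counter(node_for(user_id) for user_id in range(10000))

# Low-cardinality, skewed key (country_code): most rows land on one node.
countries = ["US"] * 9000 + ["GB"] * 500 + ["DE"] * 500
country_rows = Counter(node_for(c) for c in countries)

print(sorted(user_rows.values()))     # four roughly equal counts
print(sorted(country_rows.values()))  # one node holds at least 90% of rows
```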
  2. Consider "series table" to deal with writes to your tables
    • A big part of RedShift's performance comes from its optimised data storage. When you first load data into a table, its storage is neatly optimised. As you make modifications to the table, you start to disrupt this optimised state, a bit like "fragmenting your hard disk". That's why you have to perform ANALYZE/VACUUM from time to time to correct this (a bit like doing a "defrag"). This can however become expensive at some point. This is where "series tables" help. For example, you can create a "daily" table for each day and use a UNION ALL view to combine these tables. This way, you can perform ANALYZE/VACUUM only on the latest table as you load data & simply drop old tables to expire data rather than having to delete rows from a huge table and optimise it afterwards. This is also recommended in the RedShift documentation.
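A minimal sketch of the series-table pattern, with hypothetical daily event tables:

```sql
-- Daily tables with identical schemas.
CREATE TABLE events_20130705 (user_id BIGINT, event_type SMALLINT, occurred_at TIMESTAMP);
CREATE TABLE events_20130706 (user_id BIGINT, event_type SMALLINT, occurred_at TIMESTAMP);

-- A view that presents them as one logical table.
CREATE VIEW events AS
SELECT * FROM events_20130705
UNION ALL
SELECT * FROM events_20130706;

-- Load and ANALYZE/VACUUM only the newest table. To expire old data,
-- recreate the view without the oldest table, then drop that table whole:
CREATE OR REPLACE VIEW events AS
SELECT * FROM events_20130706;
DROP TABLE events_20130705;
```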
  3. Use SORTKEY
    • SORTKEY essentially defines how the data will be sorted in the storage.
    • This feature is useful to limit the amount of data that has to be scanned. For example, if I have a large table full of newspaper articles spanning a century and want to find articles published between 1980 and 1985 that mention "Tiger", it's useful to have the articles sorted by published_date on the storage, because that way I can limit the scan to blocks that contain these dates.
    • They are also useful for joining if the key is also the DISTKEY because the query planner can skip a lot of work.
    • You *can* specify multiple SORTKEYs. When you specify SORTKEY(a, b), the data is effectively sorted as if with "ORDER BY (a, b)". If the cardinality of a is high enough, filtering by a is very effective, but the second SORTKEY adds little value; and vice versa. The utility of multiple SORTKEYs is therefore harder to judge. Start with a single SORTKEY and see how it goes.
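A sketch of the newspaper example above, with a made-up articles table:

```sql
-- A century of articles, stored sorted by published_date.
CREATE TABLE articles (
  article_id     BIGINT,
  published_date DATE,
  body           VARCHAR(65535)
)
SORTKEY (published_date);

-- The range predicate lets RedShift skip storage blocks whose
-- min/max published_date fall entirely outside 1980-1985.
SELECT article_id
FROM articles
WHERE published_date BETWEEN '1980-01-01' AND '1985-12-31'
  AND body LIKE '%Tiger%';
```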
  4. Consider replicating the table as a "JOIN INDEX" à la Teradata if you have more than one column you want to choose as DISTKEY
    • You might have more than one column you'd want to elect as DISTKEY, but RedShift only lets you choose one. In such cases, you can simply create a replicated table that only differs in which key is declared as DISTKEY. This might seem like a poor idea, but it's essentially what Teradata (a similar technology to RedShift) does for its join index feature. You might be worried about maintaining the consistency between these tables, but because you are usually doing analytics & loading data in bulk, it's usually not a problem.
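A sketch of the replicated-table trick; the table and column names are made up:

```sql
-- Same rows, two physical layouts: one per candidate join key.
CREATE TABLE post_by_user (
  post_id BIGINT, user_id BIGINT, article_id BIGINT
)
DISTKEY (user_id);

CREATE TABLE post_by_article (
  post_id BIGINT, user_id BIGINT, article_id BIGINT
)
DISTKEY (article_id);

-- Load both copies in the same bulk-load step, then point each query
-- at whichever copy matches its join key.
```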

Miscellaneous Notes

  1. Don't worry if your CPU utilisation is high
    • Part of what makes these technologies powerful is the ability to exploit the hardware through efficient parallel processing, which means high CPU utilisation (spikes). Don't think you need to add nodes just because CPU utilisation sometimes hits 100%.
    • Don't focus on CPU and overlook other signs, like high network usage (which may indicate data re-distribution). 
  2. Use WLM to counter resource hogging
    • When queries are issued concurrently, resource hogging can become a problem. For example, if somebody issues 10 queries that take 1 hour each, another user with a 5-minute query can wait a long time before it completes. To prevent this kind of problem, consider using WLM.
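One piece of WLM is routing queries to queues. This assumes a WLM queue has already been configured for the query group 'short' in the cluster's parameter group (configuration not shown); the table name is made up:

```sql
-- Opt this session's queries into the 'short' queue.
SET query_group TO 'short';
SELECT COUNT(*) FROM post;
RESET query_group;
```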



Happy RedShift development! :)
