• About Dangerous DBA
  • Table of Contents
Dangerous DBA A blog for those DBA's who live on the edge

Testing Amazon Redshift: Distribution keys and styles

May 24, 2015 10:01 am / Leave a Comment / dangerousDBA

We have been using Amazon Redshift for a little while using schema’s and ideas lifted straight from our designs for SQL server when we thought we were going to be using that platform. This was mainly because we did not really have time to test other solutions to storing our data. For a while me and my colleagues have been wanting to test this for a while and myself and Nicola finally got chance to do some testing the other day.

So based on this article on Optimizing for Star Schemas on Amazon Redshift we wanted to see the affects of extending the “primary distribution key” across tables that in our current star (snowflake schema) they would not normally be present on.

The scenario:

The example we tested was our companies web data where there are the following basic concepts:

  • A session is the time that a user is on the site and is created when a user first comes to the site. Table:sessions PK:session_id
  • A user view pages. Table:pages PK:page_id FK:sessions.session_id
  • When a user views a page it generates parameters (from the URL). Table:page_params PK:page_param_id FK:pages.page_id
  • A user carries out actions on a page. Table:page_action_params PK:page_action_id FK:pages.page_id
  • An action can generate parameters based on what the user did or potentially input in the page. Table:page_action_params PK:page_action_param_id FK:page_actions.page_action_id
  • There are a number of supporting tables for normalising data like paths, domains and protocols and they join to sessions and pages.

In our conversion between SQL server we decided that essentially (wrongly or rightly) that the foreign keys were going to be the distribution keys on the tables in attempt to co-located nodes / slices. Therefore from the above the sessions and the pages data got co-located but data related to the pages that is far more numerous was not co-located and was spread between the nodes and slices. This did not lead to noticeable querying issues but we did wonder if querying would be improved by adding the session_id to the tables down the chain e.g. page_action_params which is 3 tables away from sessions in the original snowflake design.

Hypothesis:

The inclusion of session_id on all tables in the snowflake would mean that the skew of data across the nodes and slices in Amazon Redshift becomes greater and therefore co-locate all the related session data together and with that query times will improve as it does not have to seek as much.

Measures of success:

1) Skew across the nodes and slices has changed
2) Query costs to be reduced (based on explain function values)
3) Time for queries to execute to be reduced

What we did:

  1. Created a new 4 node dw2.large redshift cluster
  2. Took the current web data schema and created two versions
    1. Normal version – As is currently
    2. Session id version – session_id added to various tables and distribution styles changed to key apart from sessions which is even. Changing also the smaller dimension tables too all distribution styles
  3. We created two databases on the cluster one to hold the original schema and one to hold the new schema that we were testing
  4. Exported and imported 2 weeks worth of production data (approx 200 million rows in page actions and large volumes in other tables)
  5. Analyzed and vacuum’ed all the tables
  6. Ran queries to asses the success of our testing:
    1. The skew of the data across the nodes and slices
    2. Queries against the data to test timings
    3. Complex queries that run long to see if there is a speed increase

What did we find:

  1. The cost of queries in the session_id version of the schemas was lower than that in the original schemas. Normally in the range of 10x higher for the original schema vs session_id version.
  2. The cost difference between the first row value and the last row value was closer together for the session_id schema
  3. The queries had mixed results in timings but in general the original version was slower. Some queries executed a lot faster (original:25 seconds VS session_id:13 seconds) for complex reporting queries. Some executed in the same time for complex reporting queries and simple “select * from a number of tables joined together). Some executed slightly faster for the original (original:3 seconds VS session_id:4 seconds) none of these were complex reporting queries.
  4. There was more skew across the nodes in the session_id version of the schema. This was especially noticeable on the tables where we added the session_id to proving that the rows are now co-located on the same slices as the “parent” session
GOTCHA:

We had previously tried to do the different schema designs on the same database but got very inconclusive results. We think this is because schemas in Amazon Redshift are only for logical grouping (opposed to separation) and therefore the presence of so many schemas trying to do the same thing clouded the results.

What we have not tested:

Impact on loading tasks. This would take a couple of days to set up but we feel it would not adversely affect loading

Conclusions:

  1. Smaller tables used for dimensions and lookups should be set to distribution style all
  2. Putting the session_id across more tables and including the field in the joins did speed up queries and therefore you should include the distribution key of the main table that will be in most of your queries in as many of your tables as possible; set as its distribution key as distribution style key in these other tables.
Posted in: Amazon, Distribution Key, Distribution Style, Example, Redshift, Results, Schema Design, Testing / Tagged: ALL, Amazon, Distribution Keys, Distribution Styles, EVEN, KEY, Redshift, Schema Design

Leave a Reply Cancel reply

Post Navigation

← Previous Post
Next Post →

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 757 other subscribers

Recent Posts

  • Self generating Simple SQL procedures – MySQL
  • Google Cloud Management – My Idea – My White Whale?
  • Position Tracker – The Stub – Pandas:
  • Position Tracker – The Stub
  • Position Tracker – In the beginning
  • Whats been going on in the world of the Dangerous DBA:
  • QCon London Day 1
  • Testing Amazon Redshift: Distribution keys and styles
  • Back to dangerous blogging
  • DB2 10.1 LUW Certification 611 notes 1 : Physical Design

Dangerous Topics

added functionality ADMIN_EST_INLINE_LENGTH Bootcamp colum convert data types DB2 db2 DB2 Administration DB2 Development db2advis db2licm Decompose XML EXPORT GCP Google IBM IBM DB2 LUW idug information centre infosphere IOT LOAD merry christmas and a happy new year Position Tracking python Recursive Query Recursive SQL Reorganisation Reorganise Reorganise Indexes Reorganise Tables Runstats sql statement Stored Procedures SYSPROC.ADMIN_CMD Time UDF User Defined Functions V9.7 V10.1 Varchar XML XML PATH XMLTABLE

DangerousDBA Links

  • DB2 for WebSphere Commerce
  • My Personal Blog

Disclaimer:

The posts here represent my personal views and not those of my employer. Any technical advice or instructions are based on my own personal knowledge and experience, and should only be followed by an expert after a careful analysis. Please test any actions before performing them in a critical or nonrecoverable environment. Any actions taken based on my experiences should be done with extreme caution. I am not responsible for any adverse results. DB2 is a trademark of IBM. I am not an employee or representative of IBM.

Advertising

© Copyright 2023 - Dangerous DBA
Infinity Theme by DesignCoral / WordPress