We have been using Amazon Redshift for a little while using schema’s and ideas lifted straight from our designs for SQL server when we thought we were going to be using that platform. This was mainly because we did not really have time to test other solutions to storing our data. For a while me and my colleagues have been wanting to test this for a while and myself and Nicola finally got chance to do some testing the other day.
So based on this article on Optimizing for Star Schemas on Amazon Redshift we wanted to see the affects of extending the “primary distribution key” across tables that in our current star (snowflake schema) they would not normally be present on.
The scenario:
The example we tested was our companies web data where there are the following basic concepts:
- A session is the time that a user is on the site and is created when a user first comes to the site. Table:sessions PK:session_id
- A user view pages. Table:pages PK:page_id FK:sessions.session_id
- When a user views a page it generates parameters (from the URL). Table:page_params PK:page_param_id FK:pages.page_id
- A user carries out actions on a page. Table:page_action_params PK:page_action_id FK:pages.page_id
- An action can generate parameters based on what the user did or potentially input in the page. Table:page_action_params PK:page_action_param_id FK:page_actions.page_action_id
- There are a number of supporting tables for normalising data like paths, domains and protocols and they join to sessions and pages.
In our conversion between SQL server we decided that essentially (wrongly or rightly) that the foreign keys were going to be the distribution keys on the tables in attempt to co-located nodes / slices. Therefore from the above the sessions and the pages data got co-located but data related to the pages that is far more numerous was not co-located and was spread between the nodes and slices. This did not lead to noticeable querying issues but we did wonder if querying would be improved by adding the session_id to the tables down the chain e.g. page_action_params which is 3 tables away from sessions in the original snowflake design.
Hypothesis:
The inclusion of session_id on all tables in the snowflake would mean that the skew of data across the nodes and slices in Amazon Redshift becomes greater and therefore co-locate all the related session data together and with that query times will improve as it does not have to seek as much.
Measures of success:
1) Skew across the nodes and slices has changed
2) Query costs to be reduced (based on explain function values)
3) Time for queries to execute to be reduced
What we did:
- Created a new 4 node dw2.large redshift cluster
- Took the current web data schema and created two versions
- Normal version – As is currently
- Session id version – session_id added to various tables and distribution styles changed to key apart from sessions which is even. Changing also the smaller dimension tables too all distribution styles
- We created two databases on the cluster one to hold the original schema and one to hold the new schema that we were testing
- Exported and imported 2 weeks worth of production data (approx 200 million rows in page actions and large volumes in other tables)
- Analyzed and vacuum’ed all the tables
- Ran queries to asses the success of our testing:
- The skew of the data across the nodes and slices
- Queries against the data to test timings
- Complex queries that run long to see if there is a speed increase
What did we find:
- The cost of queries in the session_id version of the schemas was lower than that in the original schemas. Normally in the range of 10x higher for the original schema vs session_id version.
- The cost difference between the first row value and the last row value was closer together for the session_id schema
- The queries had mixed results in timings but in general the original version was slower. Some queries executed a lot faster (original:25 seconds VS session_id:13 seconds) for complex reporting queries. Some executed in the same time for complex reporting queries and simple “select * from a number of tables joined together). Some executed slightly faster for the original (original:3 seconds VS session_id:4 seconds) none of these were complex reporting queries.
- There was more skew across the nodes in the session_id version of the schema. This was especially noticeable on the tables where we added the session_id to proving that the rows are now co-located on the same slices as the “parent” session
GOTCHA:
We had previously tried to do the different schema designs on the same database but got very inconclusive results. We think this is because schemas in Amazon Redshift are only for logical grouping (opposed to separation) and therefore the presence of so many schemas trying to do the same thing clouded the results.
What we have not tested:
Impact on loading tasks. This would take a couple of days to set up but we feel it would not adversely affect loading
Conclusions:
- Smaller tables used for dimensions and lookups should be set to distribution style all
- Putting the session_id across more tables and including the field in the joins did speed up queries and therefore you should include the distribution key of the main table that will be in most of your queries in as many of your tables as possible; set as its distribution key as distribution style key in these other tables.