Dangerous DBA – A blog for those DBAs who live on the edge

QCon London Day 1

March 7, 2016 11:15 pm / Leave a Comment / dangerousDBA

So this is not a run-of-the-mill conference for a DBA, but in my evolving role it is good to get out there and experience a conference that is a little different from the usual data conferences I go to. QCon is aimed primarily at software developers, has been running for about 10 years and now takes place all over the world. On this first day I took the “Stream Processing @ Scale” track exclusively, apart from the keynotes, as this is the focus of the project at work at the moment.

QCon London Day 1: Keynote – UNEVENLY DISTRIBUTED by Adrian Colyer

Adrian reads a paper a day, summarises it and then publishes it on “the morning paper” – no mean feat, and I admire him for the dedication this must take! This was an interesting keynote that extolled the virtues of reading a paper a day because:

  1. They are great thinking tools – They get you to think about what people are doing and what you could do or try
  2. They raise your expectations – Make your solutions better, or what you think you should be getting as a solution
  3. They give you real life lessons you can learn from – Read about what people have implemented, or given to the community to implement for themselves
  4. They are a great conversation – Can see how ideas progress through time, who has built what on top of what
  5. They are unevenly distributed – Across subjects

Basically: read more papers and you will know more stuff and be more awesome in your job, bringing more to the table for your employer and yourself.

QCon London Day 1: Talks

PATTERNS OF RELIABLE IN-STREAM PROCESSING @ SCALE – Alexey Kharlamov

This was a rather short but interesting talk, and it started a theme for the day which has left me with a question that as yet has not been fully answered. Alexey went through all the different patterns that the company he works for has gone through to process data in streams. They had tried Lambda and Kappa and were working towards something else now, but what that is was not elaborated on.

Learnt
  • Need event time as well as event capture time for proper windowing. This reinforces what we see at my workplace and validates everything else that is out there

STREAM PROCESSING WITH APACHE FLINK – Robert Metzger

Apache Flink is a relatively new product that gets its full release tomorrow (08/03/2016). It promises to completely subsume batch processing by allowing windowing over “large” timescales, using in-memory and disk-persisted aggregations, as well as a host of other interesting features that other systems do not offer.

Learnt
  • Google Dataflow is being made into an Apache incubator project called Apache Beam

MICROSERVICES FOR A STREAMING WORLD – Ben Stopford

This is a brave new world, and it’s a world where things that you (I) would traditionally use a database for (looking up values) can now be done with variations on the open source streaming projects. This talk looked at an add-on for Apache Kafka called KStreams that allows you to persist the latest version of a key so that it can be used by a microservice, in combination with a stream, to create other services. We also need to embrace decentralisation.

Learnt
  • KStreams can be used to make KTables that can be joined with data from a stream to enable querying for a microservice
  • Kafka has compacted topics that allow you to store just the latest value for a key if you so wish!

STREAMING AUTO-SCALING IN GOOGLE CLOUD DATAFLOW – Manuel Fahndrich

This talk was quite a long explanation of the planning that goes into the “auto scaler” and how it decides whether to scale or not. It was interesting to see some of the formulas, but as in practice this is all hidden from the user behind the developer console, it was for informational purposes only. Manuel also went through some of the challenges they have not solved yet, e.g. quanta: as you downsize the number of machines (and the virtual disks that sit behind them) you end up with an uneven distribution of disks, and the machines can then only process at the rate of the machine with the fewest disks.

Learnt
  • Google are still going to make money from you even if you enable auto scaling, just maybe not as much!
  • Quanta

DATA STREAMING OPEN SPACE

This was not very well subscribed, with only me, the facilitator and three other participants. I was hoping more people would be there so I could learn about what people are doing, what they have tried and what to avoid! As it turns out I was the second most experienced with streaming there, which given our infancy of usage is slightly worrying in terms of what the rest of the world is doing.

Learnt
  • Streaming is new and not many people are sharing, if they are doing it at all.

REALTIME STREAM COMPUTING & ANALYTICS @ UBER – Sudhir Tonse

Good to see what a disruptive tech company is doing, and that they are building tools because they cannot find any that support their needs. The talk took you through the Uber tech stack for producing data, processing it, storing it, querying it and consuming it.

Learnt
  • Uber’s world is hexagons
  • There are loads of tools out there and they come out all the time; use the one that best suits your needs at the time you need it, and change only when there is a better one, not just because it is two weeks later

QCon London Day 1: Evening Keynote – BLT: BABBAGE LOVELACE TURING (SO WHO DID INVENT THAT COMPUTER?) – John Graham-Cumming & Sydney Padua

Found this a little long and not quite the content I was hoping for; I was thinking Drunk Histories. It was interesting and well put together, but dwelt on some points too much.

Questions From Today

  1. Why do people use Apache X, Y and Z and manage all of that themselves rather than using an “autoscaling solution” such as GCS?
  2. Why, if there are so many people using Apache X, Y and Z, are there not more people talking about running it in production, apart from large “disruptive” organisations such as Uber?
  3. Why, if we have the ability to output this data to so many different (heterogeneous) stores, are there very few (if any) tools that pull it all together again?
Posted in: 2016, Conference, QCon, Streaming / Tagged: 2016, GCS, Google, Kafka, Kstreams, London, QCon, Stream processing, Stream processing at scale

Testing Amazon Redshift: Distribution keys and styles

May 24, 2015 10:01 am / Leave a Comment / dangerousDBA

We have been using Amazon Redshift for a little while, using schemas and ideas lifted straight from our designs for SQL Server from when we thought we were going to be using that platform. This was mainly because we did not really have time to test other solutions for storing our data. My colleagues and I have been wanting to test this for a while, and Nicola and I finally got the chance to do some testing the other day.

So, based on this article on Optimizing for Star Schemas on Amazon Redshift, we wanted to see the effects of extending the “primary distribution key” across tables on which, in our current star (snowflake) schema, it would not normally be present.

The scenario:

The example we tested was our company’s web data, where there are the following basic concepts:

  • A session is the time that a user is on the site and is created when a user first comes to the site. Table:sessions PK:session_id
  • A user views pages. Table:pages PK:page_id FK:sessions.session_id
  • When a user views a page it generates parameters (from the URL). Table:page_params PK:page_param_id FK:pages.page_id
  • A user carries out actions on a page. Table:page_actions PK:page_action_id FK:pages.page_id
  • An action can generate parameters based on what the user did or potentially input in the page. Table:page_action_params PK:page_action_param_id FK:page_actions.page_action_id
  • There are a number of supporting tables for normalising data like paths, domains and protocols and they join to sessions and pages.

In our conversion from SQL Server we decided, rightly or wrongly, that the foreign keys were going to be the distribution keys on the tables, in an attempt to co-locate related rows on the same nodes / slices. From the above, the sessions and the pages data therefore got co-located, but the data related to the pages, which is far more numerous, was not co-located and was spread across the nodes and slices. This did not lead to noticeable querying issues, but we did wonder if querying would be improved by adding the session_id to the tables down the chain, e.g. page_action_params, which is three tables away from sessions in the original snowflake design (see the sketch below).
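To make that concrete, here is a minimal sketch of the two approaches for the page_action_params table. The column list is trimmed and the names are only illustrative of our schema, not the real DDL:


-- Original design: distributed on the immediate foreign key only
CREATE TABLE page_action_params (
    page_action_param_id BIGINT NOT NULL,
    page_action_id       BIGINT NOT NULL,   -- FK to page_actions
    param_name           VARCHAR(255),
    param_value          VARCHAR(2000)
)
DISTSTYLE KEY
DISTKEY (page_action_id);

-- session_id version: the sessions key is carried down the chain
-- so these rows land on the same slice as their parent session
CREATE TABLE page_action_params (
    page_action_param_id BIGINT NOT NULL,
    page_action_id       BIGINT NOT NULL,   -- FK to page_actions
    session_id           BIGINT NOT NULL,   -- carried down from sessions
    param_name           VARCHAR(255),
    param_value          VARCHAR(2000)
)
DISTSTYLE KEY
DISTKEY (session_id);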

Hypothesis:

The inclusion of session_id on all tables in the snowflake will increase the skew of data across the nodes and slices in Amazon Redshift, because all the related session data becomes co-located together, and with that query times will improve as Redshift does not have to seek across slices as much.

Measures of success:

1) Skew across the nodes and slices has changed
2) Query costs to be reduced (based on explain function values)
3) Time for queries to execute to be reduced

What we did:

  1. Created a new 4 node dw2.large redshift cluster
  2. Took the current web data schema and created two versions
    1. Normal version – As is currently
    2. Session id version – session_id added to various tables and distribution styles changed to KEY, apart from sessions which stayed EVEN. The smaller dimension tables were also changed to distribution style ALL
  3. We created two databases on the cluster one to hold the original schema and one to hold the new schema that we were testing
  4. Exported and imported 2 weeks’ worth of production data (approx 200 million rows in page actions and large volumes in the other tables)
  5. Analyzed and vacuumed all the tables
  6. Ran queries to assess the success of our testing:
    1. The skew of the data across the nodes and slices (see the query sketch after this list)
    2. Queries against the data to test timings
    3. Complex queries that run long to see if there is a speed increase
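For the skew check, a minimal sketch of the kind of queries involved is below. SVV_TABLE_INFO and STV_TBL_PERM are standard Redshift system views/tables, but the exact queries we ran differed:


-- Overall skew per table (ratio of the largest slice to the smallest)
SELECT "table", diststyle, skew_rows
FROM svv_table_info
WHERE "table" IN ('page_actions', 'page_action_params');

-- Rows per slice for one table: a flat distribution means low skew
SELECT slice, SUM(rows) AS rows_on_slice
FROM stv_tbl_perm
WHERE name = 'page_action_params'
GROUP BY slice
ORDER BY slice;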

What did we find:

  1. The cost of queries in the session_id version of the schema was lower than in the original schema, normally by around 10x (original vs session_id).
  2. The cost difference between the first-row value and the last-row value was closer together for the session_id schema
  3. The queries had mixed results on timings, but in general the original version was slower. Some queries executed a lot faster (original: 25 seconds vs session_id: 13 seconds) for complex reporting queries. Some executed in the same time, both complex reporting queries and simple “select * from a number of tables joined together” queries. Some executed slightly faster on the original (original: 3 seconds vs session_id: 4 seconds); none of these were complex reporting queries.
  4. There was more skew across the nodes in the session_id version of the schema. This was especially noticeable on the tables where we added the session_id, showing that those rows are now co-located on the same slices as their “parent” session
GOTCHA:

We had previously tried to run the different schema designs in the same database but got very inconclusive results. We think this is because schemas in Amazon Redshift are only for logical grouping (as opposed to physical separation), and therefore the presence of so many schemas trying to do the same thing clouded the results.

What we have not tested:

Impact on loading tasks. This would take a couple of days to set up, but we feel it would not adversely affect loading.

Conclusions:

  1. Smaller tables used for dimensions and lookups should be set to distribution style ALL
  2. Putting the session_id on more tables and including the field in the joins did speed up queries. Therefore, include the distribution key of the main table that appears in most of your queries in as many of your tables as possible, and set it as the distribution key (distribution style KEY) on those other tables.
Posted in: Amazon, Distribution Key, Distribution Style, Example, Redshift, Results, Schema Design, Testing / Tagged: ALL, Amazon, Distribution Keys, Distribution Styles, EVEN, KEY, Redshift, Schema Design

Back to dangerous blogging

May 20, 2015 9:51 pm / Leave a Comment / dangerousDBA

Looking at this, I have not written a blog post on here in over two years! I think it is time for a change. First of all to let you guys know what I have been up to, and then to let you in on what I will be blogging about in the future:

What I have been up to

The past two years have been very interesting. The company that I work for has seen a data explosion, and unfortunately this meant that the DB2 estate needed to support it is quite expensive in terms of licensing and hardware costs compared to other offerings out there. We were also struggling to find people with the requisite skills for our architecture of DB2 plus an in-house-built ETL tool. It was a rather unique solution, in an area of the country where London draws away a lot of the people with DB2 skills.

Therefore we started looking at other platforms to house our data warehouse and power the business’s data future. We started with SQL Server and all its offerings. This seemed like a winner for a time, and we started building out the platform with a home-built auto-runner and other interesting features that I may blog about later. This solution was far cheaper in licensing terms than DB2 (sorry IBM), came with a lot of features such as SSRS and SSIS, and had a more readily available talent pool in the local area. This architecture still had one major problem: hardware. We fill it up and stress it out far quicker than our normal five-year planning caters for.

Then Amazon brought out Redshift. This is a winner: it has features that will support our business’s data growth, and an ecosystem of products around it that will support the business’s data hunger. It does not come with an ETL tool like SSIS, so in an effort to standardise, as opposed to our current scripted batch system, we found Talend. As a reporting solution we found an awesome product called Looker. We are also looking at other big data solutions such as Hadoop, EMR, Databricks and many other data-related things.

What I will be blogging about in the future

So I now intend to get back to blogging about the solutions to technical issues we face and solve, and cool features we use (and that you can too) around Amazon Redshift, Looker, Talend and maybe occasionally DB2 as we wind down our usage. I will also be blogging more about the conferences I attend and the big data solutions we try out. So watch out for more content and hopefully interesting articles and findings.



Posted in: Amazon, Big Data, Redshift / Tagged: Amazon, Big Data, databricks, Hadoop, Looker, Redshift, Talend

DB2 10.1 LUW Certification 611 notes 1 : Physical Design

June 3, 2013 1:00 pm / Leave a Comment / dangerousDBA

As part of my career development, doing some certification has been on the cards at my current company for about four and a half years, and now they are giving me some time for it again. As you know from a previous post, I have been trying to revise for this using something that IBM published just after DB2 10.1 LUW came out, and it is very verbose and boring (sorry). It reads more like a technical manual of all the ins and outs of the product than a revision guide. There are now some better resources, which can be found here. This blog post covers some of the things I learnt from Part 2: Physical Design, which can be found here; please take the time to read the doc for yourself as it is full of good stuff, some of which I think I will be trying out in future.

Physical Design – Exam notes

WITH CHECK OPTION

This seems to be a crafty little three-word clause that can add a little security to your views, but at the cost of more of a maintenance overhead for an application. The example in the document uses “WHERE dept = 10”, and goes on to say that if you tried an insert for any other dept number it would not work. This is good as it would stop insertion for dept 11, but it would mean you need one of these views for each “dept” that you need.
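A minimal sketch of the idea; the table and column names here are just illustrative, not taken from the exam paper:


-- Only dept 10 rows are visible through the view, and WITH CHECK OPTION
-- also rejects inserts/updates that would create rows the view cannot see
CREATE VIEW dept10_employees AS
    SELECT empno, lastname, dept
    FROM employees
    WHERE dept = 10
WITH CHECK OPTION;

-- Fails: the new row falls outside the view's WHERE clause
INSERT INTO dept10_employees (empno, lastname, dept)
VALUES (123, 'SMITH', 11);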

Informational Constraints

I have come across these before but am still not really sure what they are useful for, and there is not a good explanation of them in this doc. If they were ENFORCED, how would that be different from a normal check constraint? And if it is ENABLE QUERY OPTIMIZATION, why would you want to make it harder to find the wrong data rows?
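For reference, this is roughly what the syntax looks like (a hedged sketch with made-up names). The usual combination is NOT ENFORCED with ENABLE QUERY OPTIMIZATION: DB2 skips the cost of checking the rule on every insert, but the optimizer is still allowed to assume it holds, which is only safe if your loading process already guarantees it:


-- The rule is not checked on INSERT / UPDATE (NOT ENFORCED),
-- but the optimizer may use it to simplify and speed up queries
ALTER TABLE sales
    ADD CONSTRAINT ck_sale_amount CHECK (sale_amount > 0)
    NOT ENFORCED
    ENABLE QUERY OPTIMIZATION;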

Storage Groups / Multi Temperature data

This is a new feature in DB2 LUW 10.1, and I have found it hard so far to get a clear definition of how the data paths you supply when you create a database (for automatic storage) relate to multi-temperature data storage groups. This doc does not make it as easy as it could, but I think I have now got it, so I will give it a go in a series of bullet points (how I understand it), with a DDL sketch after the list:

  • The data paths that you define when you create a database go to create the default storage group: IBMSTOGROUP.
  • You can then create additional storage groups using the CREATE STOGROUP command for hot data etc.
  • The three basic table spaces created at database creation (USERSPACE1, TEMPSPACE1 and SYSCATSPACE) are created in the default IBMSTOGROUP, that is, on the paths that you specified to begin with.
  • Tablespaces are then created USING the newly created storage groups, and rather than the data being stored in the default data path it is stored where specified.
  • If you are lucky enough to have workload manager then storage groups can be given tags to optimize performance – bonus.
  • As I understand it, though, if you only specify one path in your storage group then your data will not be striped, as only one container will be created
  • You then create your partitioned tables (or tables) in these tablespaces appropriately, so the hot part of your partitioned table goes in the tablespace attached to the hot STOGROUP.
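A minimal DDL sketch of that flow, with made-up paths and names:


-- Additional storage groups on top of the default IBMSTOGROUP
CREATE STOGROUP hot_sg ON '/db2/ssd01', '/db2/ssd02';
CREATE STOGROUP cold_sg ON '/db2/sata01';

-- Tablespaces are placed on a storage group...
CREATE TABLESPACE ts_hot USING STOGROUP hot_sg;
CREATE TABLESPACE ts_cold USING STOGROUP cold_sg;

-- ...and the table partitions go into the appropriate tablespace
CREATE TABLE sales (
    sale_date DATE NOT NULL,
    amount    DECIMAL(12,2)
)
PARTITION BY RANGE (sale_date) (
    PARTITION p_2012 STARTING ('2012-01-01') ENDING ('2012-12-31') IN ts_cold,
    PARTITION p_2013 STARTING ('2013-01-01') ENDING ('2013-12-31') IN ts_hot
);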

So hopefully the above makes this a little clearer; it does for me. We are currently looking at completely re-architecting our data pile, and a lot of the design decisions that this offers will be taken into consideration.

Range Clustered Tables

This seems like a most useful feature for star schema data warehouses where there are surrogate or business keys that are generally some form of whole number (SMALLINT, INTEGER or BIGINT). There do seem to be two massive GOTCHAs in the text though (taken from the text):

  1. DB2 pre-allocates the disk space required for the RCT at creation time. This is done by calculating the number of distinct key values and multiplying it by the table row size. This mandates that the space required for holding the entire table should be available at creation time
  2. you cannot issue an ALTER TABLE statement to alter the physical characteristics of an RCT after its creation

Now I can see this being an issue if you have the full range of a BIGINT multiplied by the size of the table fields allocated and used at creation time, but you can limit this by using the STARTING FROM and ENDING clauses on the key sequence.

Range clustered tables also have another useful clause, DISALLOW OVERFLOWS. This apparently means that reorganization operations are not required, but you will not be allowed to insert any rows that fall outside the limits you have set, so it could be another way to stop GIGO.
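A hedged sketch of the DDL with an illustrative key range:


-- Space for key values 1 to 1,000,000 is pre-allocated at create time;
-- DISALLOW OVERFLOWS avoids the need for reorgs but rejects inserts
-- whose key falls outside the declared range
CREATE TABLE customer_dim (
    customer_key  INTEGER NOT NULL,
    customer_name VARCHAR(100)
)
ORGANIZE BY KEY SEQUENCE
    (customer_key STARTING FROM 1 ENDING AT 1000000)
    DISALLOW OVERFLOWS;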

Admin_move_table

This seems a most useful stored procedure for moving tables from, say, DEV to PRODUCTION, or for when you find that your creation script has accidentally left the wrong table space in the statement and your table needs moving. Tables can be moved online, but this takes more resource in terms of processor and disk space. It is most interesting in terms of keeping your data online while generating a new compression dictionary (and therefore effectively also a reorg?). The full command can be found here.
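A minimal sketch of an online move into different tablespaces; the schema, table and tablespace names are made up, and the linked documentation covers the many other options:


-- Move MYSCHEMA.SALES online into new data / index / LOB tablespaces,
-- keeping the table available while the move happens
CALL SYSPROC.ADMIN_MOVE_TABLE(
    'MYSCHEMA',      -- table schema
    'SALES',         -- table name
    'TS_DATA_NEW',   -- target data tablespace
    'TS_IDX_NEW',    -- target index tablespace
    'TS_LOB_NEW',    -- target long / LOB tablespace
    '', '', '', '',  -- MDC columns, partitioning key, data partitions, column definitions
    '',              -- options
    'MOVE');         -- run all phases in one call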

Temporal Tables

These are new and will definitely feature when we re-architect the data dump that we currently have. I have had to code logic to capture the versions of the subject, and then of other objects, to find the latest version efficiently across multi-million-row data sets. The new temporal tables will take care of this for you, and still allow you to access the old data in a nearly standard query. The other bonus is that these tables can also be partitioned, but you must make sure the history table can cope with the rows that will be inserted into it.
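A hedged sketch of a system-period temporal table, the kind that versions rows for you; the names are illustrative:


-- Base table with the generated system-time columns and the SYSTEM_TIME period
CREATE TABLE policy (
    policy_id INTEGER NOT NULL PRIMARY KEY,
    premium   DECIMAL(10,2),
    sys_start TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
    sys_end   TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
    trans_id  TIMESTAMP(12) GENERATED ALWAYS AS TRANSACTION START ID,
    PERIOD SYSTEM_TIME (sys_start, sys_end)
);

-- History table with the same shape, then switch versioning on
CREATE TABLE policy_history LIKE policy;
ALTER TABLE policy ADD VERSIONING USE HISTORY TABLE policy_history;

-- Old versions remain reachable with a nearly standard query
SELECT * FROM policy FOR SYSTEM_TIME AS OF '2013-01-01-00.00.00';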

Conclusion

Well, these are my high-level takeaways on the extra material that seems to be on this paper compared to the V9.7 exam (which I never got round to sitting). There is a lot less detail in this paper than in the one that has taken me so long to read, which leaves me wondering whether the exam is a middle ground of the two. I would highly suggest reading the series, and I will be doing more posts in the meantime on the other papers in the set, found at the link at the top of the article.



Posted in: ADMIN_MOVE_TABLE, DB2, DB2 Administration, DB2 Built-in Stored Procedures, DB2 DBA Certification, DB2 DBA Exam 611, IBM, IBM DB2 LUW, Physical Design, Physical Design / Tagged: ADMIN_MOVE_TABLE, Certification, DB2, DB2 Administration, DB2 DBA Exam 611, DB2 Development, Exam, IBM DB2 LUW, Informational Constraints, Multi temperature data, Storage groups, Stored Procedures, Temporal tables

Getting an estimate – DB2 LUW V10.1 Compression

May 20, 2013 8:00 am / Leave a Comment / dangerousDBA

So, you want to add compression to your house: you need to get a tradesman in to give you an estimate and then carry out the work, and DB2 can do all of this. Just like building an extension, you need to make sure all the appropriate permissions from the “council” (IBM) are in place: you either need to buy the Storage Optimisation feature separately or get it as part of Advanced Enterprise Edition of DB2. Please be careful when trying out compression, because as soon as you include “COMPRESS YES” DB2 will mark the feature as used, and if you get audited you could face a hefty bill.

Benefits of extending to compression

At a high level there are three ways of looking at this.

No compression
Benefits: Not having to pay the licensing fee to IBM for compression.
Costs: Large amounts of disk space used for the data, and minimal amounts of data in your bufferpools as the pages are not made any smaller.

Classic compression
Benefits: Data is compressed on disk, which saves you space there; data is also compressed in the bufferpools, so more pages fit in them, meaning less I/O and quicker queries. Data is also compressed in the backup images.
Costs: Licensing fee to IBM. Slight increase in CPU usage for the compression dictionary. You need to reset the dictionary with a REORG from time to time to make sure you get the most out of the compression.

Adaptive compression
Benefits: Data is compressed on disk; data is also compressed in the bufferpools, so more pages fit in them, meaning less I/O and quicker queries. Data is also compressed in the backup images. Data is continually compressed, with no need for the RESETDICTIONARY REORG in the same way as classic compression.
Costs: Licensing fee to IBM. Increase in CPU usage for compression. Only available in the latest DB2 V10.1.

Here’s what you could be saving – SYSPROC.ADMIN_GET_TAB_COMPRESS_INFO

Handily, IBM have included a very useful table function, SYSPROC.ADMIN_GET_TAB_COMPRESS_INFO. The full information for this can be found in the information centre here. This table function will estimate the savings that you will get with no compression, “classic” compression and adaptive compression. GOTCHAs for this are below:

SYSPROC.ADMIN_GET_TAB_COMPRESS_INFO – GOTCHAs

  1. Tables that are partitioned will come through the table function as multiple rows. You do get a partition ID, which you can either join out to (as sketched below) or look up in the table SYSCAT.DATAPARTITIONS.
  2. If the table has one (or more) XML columns then you will get an additional row in the results returned: a “DATA” and an “XML” compression estimation row. Together with the other gotcha, you could end up with a lot of rows returned for a partitioned table with XML columns.
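For the partitioned-table case, a hedged sketch of joining the output out to SYSCAT.DATAPARTITIONS so each estimate row is labelled with its partition name:


-- One estimate row per data partition, labelled with the partition name
SELECT c.tabschema,
       c.tabname,
       p.datapartitionname,
       c.rowcompmode,
       c.pctpagessaved_static,
       c.pctpagessaved_adaptive
FROM TABLE(SYSPROC.ADMIN_GET_TAB_COMPRESS_INFO('MYSCHEMA', 'MYTABLE')) AS c
LEFT JOIN SYSCAT.DATAPARTITIONS AS p
       ON  p.tabschema       = c.tabschema
       AND p.tabname         = c.tabname
       AND p.datapartitionid = c.datapartitionid;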

Getting an estimate – SYSPROC.ADMIN_GET_TAB_COMPRESS_INFO

This table function can be used to get information on either a single table or an entire schema; obviously the latter can take some time to run, from what I have found, especially when the tables are large. The simplest form is:


SELECT * 
FROM TABLE(SYSPROC.ADMIN_GET_TAB_COMPRESS_INFO({SCHEMA NAME}, {TABLE NAME}))

This will get you a result a little like this (transposed here for readability):


TABSCHEMA                  SCHEMA
TABNAME                    TABLE
DBPARTITIONNUM             0
DATAPARTITIONID            0
OBJECT_TYPE                DATA
ROWCOMPMODE                S
PCTPAGESSAVED_CURRENT      0
AVGROWSIZE_CURRENT         495
PCTPAGESSAVED_STATIC       65
AVGROWSIZE_STATIC          173
PCTPAGESSAVED_ADAPTIVE     65
AVGROWSIZE_ADAPTIVE        170

The example above shows that this table is currently using “classic” compression, represented by the S in ROWCOMPMODE; a blank would mean no row compression, and an A would be the new adaptive compression in DB2 V10.1. As you can see, it gives you an estimate of the average row size under the different compression modes; this is in bytes, so you will then need to work out what the full GB / MB saving might be based on the cardinality of the table.

The table function is telling us, though, that there are potentially 65% savings to be made with both adaptive and classic compression; but there is a 3 byte difference in average row size, and adaptive compression is in my opinion far better, so I would ALTER the TABLE to COMPRESS YES ADAPTIVE.
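The change itself is a one-liner, but remember that existing rows only get compressed once the pages are rebuilt; a hedged sketch (run the REORG via the command line or SYSPROC.ADMIN_CMD):


-- Switch the table to adaptive compression (DB2 V10.1 syntax)
ALTER TABLE myschema.mytable COMPRESS YES ADAPTIVE;

-- A classic (offline) reorg rebuilds the pages and the dictionary,
-- compressing the rows that are already in the table
REORG TABLE myschema.mytable RESETDICTIONARY;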

If you want to run the table function against a whole schema, leave the table part as a blank string:


SELECT * 
FROM TABLE(SYSPROC.ADMIN_GET_TAB_COMPRESS_INFO({SCHEMA NAME}, ''))

This will get you a row per table in the schema (plus any extra rows for XML columns / partitioned tables).

The future

In a future post I will look at using this table function to record the values for all tables; you can then compare before and after, and therefore prove that the change in compression and the associated REORGs have worked.
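As a teaser, the recording could be as simple as inserting the table function’s output into a history table stamped with the run date; a minimal sketch with a made-up table name:


-- Create the history table from the table function's definition
CREATE TABLE dba.compress_estimate_history AS (
    SELECT CURRENT DATE AS run_date, t.*
    FROM TABLE(SYSPROC.ADMIN_GET_TAB_COMPRESS_INFO('MYSCHEMA', '')) AS t
) WITH NO DATA;

-- Capture one snapshot row per table per run, for before / after comparison
INSERT INTO dba.compress_estimate_history
SELECT CURRENT DATE, t.*
FROM TABLE(SYSPROC.ADMIN_GET_TAB_COMPRESS_INFO('MYSCHEMA', '')) AS t;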



Posted in: DB2, DB2 Administration, DB2 Built in commands, DB2 Built-in Stored Procedures, DB2 Maintenance, db2licm, IBM, IBM DB2 LUW, SYSPROC.ADMIN_GET_TAB_COMPRESS_INFO / Tagged: ADMIN_GET_TAB_COMPRESS_INFO, DB2, DB2 Administration, db2licm, IBM DB2 LUW, Stored Procedures, SYSPROC.ADMIN_GET_TAB_COMPRESS_INFO, V10.1, V9.7, XML


Disclaimer:

The posts here represent my personal views and not those of my employer. Any technical advice or instructions are based on my own personal knowledge and experience, and should only be followed by an expert after a careful analysis. Please test any actions before performing them in a critical or nonrecoverable environment. Any actions taken based on my experiences should be done with extreme caution. I am not responsible for any adverse results. DB2 is a trademark of IBM. I am not an employee or representative of IBM.
