
XML Shredding – Data Model and code

October 19, 2012 10:01 pm / dangerousDBA

In a previous post I wrote about shredding XML, especially when you don’t know how your XML is formatted or when it is always different. Where I work the XML is not really XML; it is a pseudo-XML that merely conforms to an XML column.

What am I giving you

So below you will find some “light” documentation on a system like the one that I have created at my workplace to shred XML and store it in a way that is easy to query. I will also go through some of the GOTCHA’s that I have seen so far in doing this. At the bottom of the page you will find a file with all the code you need.

GOTCHA’s

  1. You may find that no size of VARCHAR column is big enough to hold your XML data, especially in the initial stages of the recursive query, so you may need to use a CLOB column instead. I have found this is a little quicker than a very large VARCHAR column, and you will not need to create a temporary tablespace bufferpool large enough to hold the query.
  2. You may have elements that are named the same thing (e.g. email, Email, EMAIL, eMAIL) that exist down the same XPath. For this I have created the SHREDDED_REQUEST_ELEMENTS_GROUPS table, which creates a store for all the same-named and same-located elements, normalised around the lower-cased XPath and element name.
  3. The query code will produce entries that contain multiple values when the element is not the lowest element. So if you had /booking/personal with /email and /name under it, you would get three rows output: /booking/personal with the values of /email and /name concatenated together, plus the two additional rows of /booking/personal/email and /booking/personal/name. You will therefore need to build something into the WHERE clause to exclude paths like /booking/personal from the result sets (see the sketch just after this list).
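
As a minimal sketch of what that exclusion might look like, using the staging table described below (the /booking paths are just made-up examples):

SELECT REQUEST_ID,
       XML_XPATH,
       XML_ELEMENT_VALUE
FROM {SCHEMA}.SHREDDED_REQUESTS_STAGE
WHERE XML_XPATH NOT IN ('/booking', '/booking/personal') --Keep only the leaf paths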

XML Shredding – ERD

[Figure: ShreddingERD – the entity relationship diagram for the four tables described below]

XML Shredding – The Tables

There are four tables in the solution: a staging table, and then three others that make up a mini snowflake-like schema.

SHREDDED_REQUESTS_STAGE

The staging table for the solution, into which the initial shredding of the XML is inserted. This table is where the rest of the processing starts. The thing to watch out for here is the size of the XML_ELEMENT_VALUE column; make sure that you do not make this column too small. You could get rid of the date column, but I find it useful if you are processing multiple days.

SHREDDED_REQUESTS

This table is where the data resides at the end of the process, and where it stays for the rest of the data life cycle. It is essentially the same as the staging table, but I have replaced the XML_ELEMENT_NAME and XML_XPATH columns with a surrogate key. This reduces the storage needs of the table and makes it easier to search.

SHREDDED_REQUEST_ELEMENTS

This is the store for the XML_ELEMENT_NAME and XML_XPATH values once they have been given a surrogate key. It also holds the foreign key out to the grouping table.

SHREDDED_REQUEST_ELEMENTS_GROUPS

This is the “final” table in the solution, and you may be able to leave it out; I had to create it to enable the normalisation of all the same element names in different cases (e.g. EMAIL, Email, email and eMail) in XML_ELEMENT_NAME and XML_XPATH.
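
To make the shape of the schema concrete, here is a rough DDL sketch of the four tables. The column names come from the descriptions above and the procedures below; the data types, sizes, identity columns and keys are my assumptions, so adjust them to your own data (and see GOTCHA 1 about VARCHAR versus CLOB for the value columns). The downloadable file at the end also adds COMPRESS YES to these tables – see the GOTCHA’s down there about licensing.

CREATE TABLE {SCHEMA}.SHREDDED_REQUESTS_STAGE(
    REQUEST_ID        BIGINT         NOT NULL,
    REQUEST_DATE      DATE           NOT NULL,
    XML_ELEMENT_NAME  VARCHAR(300)   NOT NULL,
    XML_XPATH         VARCHAR(1000)  NOT NULL,
    XML_ELEMENT_VALUE VARCHAR(12000) --See GOTCHA 1: you may need a CLOB here
);

CREATE TABLE {SCHEMA}.SHREDDED_REQUEST_ELEMENTS_GROUPS(
    SHREDDED_REQUEST_ELEMENT_GROUP_ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY,
    LOWER_XML_ELEMENT_NAME VARCHAR(300)  NOT NULL,
    LOWER_XML_XPATH        VARCHAR(1000) NOT NULL,
    PRIMARY KEY (SHREDDED_REQUEST_ELEMENT_GROUP_ID)
);

CREATE TABLE {SCHEMA}.SHREDDED_REQUEST_ELEMENTS(
    SHREDDED_REQUEST_ELEMENT_ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY,
    XML_ELEMENT_NAME VARCHAR(300)  NOT NULL,
    XML_XPATH        VARCHAR(1000) NOT NULL,
    SHREDDED_REQUEST_ELEMENT_GROUP_ID INTEGER NOT NULL
        REFERENCES {SCHEMA}.SHREDDED_REQUEST_ELEMENTS_GROUPS,
    PRIMARY KEY (SHREDDED_REQUEST_ELEMENT_ID)
);

CREATE TABLE {SCHEMA}.SHREDDED_REQUESTS(
    REQUEST_ID   BIGINT NOT NULL,
    REQUEST_DATE DATE   NOT NULL,
    SHREDDED_REQUEST_ELEMENT_ID INTEGER NOT NULL
        REFERENCES {SCHEMA}.SHREDDED_REQUEST_ELEMENTS,
    XML_ELEMENT_VALUE VARCHAR(12000)
);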

XML Shredding – The Process

The process happens in four stages:

1. Shred the XML into the staging table

This stage has to happen, otherwise there is very little point in this process. It uses a bit of recursive SQL, as outlined in a previous post, to pull out all the element values, XPaths and element names along with the appropriate identifier.

To speed this up, in the code below (and in the attachment) you will see that I use an EXPORT statement to unload the query results and a LOAD to get them into the staging table, so that the process happens as fast as possible with the minimum of log contention and use. It is basic maths that you are going to be inserting a lot of rows: say your site does 100,000 requests and there are 50 elements on each request, that is 5,000,000 rows!

So something like:


CREATE PROCEDURE {SCHEMA}.ADD_SHREDDED_REQUESTS_STAGE(IN IN_START DATE, IN IN_END DATE)
LANGUAGE SQL
BEGIN
    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------
    --Takes in two dates
    --Does this using an EXPORT and a LOAD so that it works quicker than an INSERT,
    --   as the actual query takes no time at all and it is only the inserting 
    --   that takes the time.
    --Uses the ADMIN_EST_INLINE_LENGTH Function from http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.sql.rtn.doc/doc/r0054083.html 
    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------

    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------
    --Declare Vars
        DECLARE EXPORTSTRING VARCHAR(3000);
        DECLARE LOADSTRING VARCHAR(3000);

    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------
    --Process
            --Create the EXPORT string
            SET EXPORTSTRING = 'EXPORT TO "/{some file system you can output to}/SHREDDED_REQUESTS_STAGE" OF DEL MODIFIED BY COLDEL0x09 
                                WITH pathstable (REQUEST_ID,REQUEST_DATE, name, node, xpath, value) AS (
                                  SELECT HM.REQUEST_ID,
                                         HM.REQUEST_DATE,
                                         x.name AS name, 
                                         x.node AS xmlnode,
                                         ''/'' || x.name AS xpath,
                                         x.value as value
                                  FROM {SCHEMA}.{TABLE WITH XML COLUMN} HM,
                                       XMLTABLE(''$REQUEST_XML/*''
                                        COLUMNS
                                          name varchar(300) PATH ''./local-name()'',
                                          value varchar(12000) PATH ''xs:string(.)'',
                                          node    XML PATH ''.'') AS x
                                   WHERE HM.REQUEST_DATE BETWEEN ''' || IN_START || ''' AND ''' || IN_END || '''
                                        AND ADMIN_EST_INLINE_LENGTH(HM.REQUEST_XML) <> -1
                                  UNION ALL
                                  SELECT REQUEST_ID,
                                         REQUEST_DATE,
                                         y.name AS name, 
                                         y.node AS xmlnode, 
                                         xpath|| ''/'' || y.name AS xpath,
                                         y.value as value
                                  FROM pathstable,
                                       XMLTABLE(''$XMLNODE/(*,@*)'' PASSING pathstable.node AS "XMLNODE"
                                        COLUMNS
                                         name varchar(300) PATH ''local-name()'',
                                         value varchar(12000) PATH ''xs:string(.)'',
                                         node    XML PATH ''.'') AS y
                                ) SELECT REQUEST_ID, REQUEST_DATE, name, xpath, value
                                  FROM pathstable
                                  WHERE NAME <> ''serialized''
                                    AND NAME <> ''_viewstate''
                                    AND NAME <> ''__shorturl_postdata'' 
                                    AND NAME <> ''API_Reply''
                                    AND NAME <> ''apiReply'' ';

            --Execute the EXPORT string
            CALL SYSPROC.ADMIN_CMD(EXPORTSTRING);

            --Create the LOAD string
            SET LOADSTRING = 'LOAD FROM "/{some file system you can output to}/SHREDDED_REQUESTS_STAGE" OF DEL MODIFIED BY COLDEL0x09
                              METHOD P (1,2,3,4,5)
                              INSERT INTO {SCHEMA}.SHREDDED_REQUESTS_STAGE(
                                REQUEST_ID,
                                REQUEST_DATE,
                                XML_ELEMENT_NAME,
                                XML_XPATH,
                                XML_ELEMENT_VALUE
                              ) COPY YES TO "/{some file system you can output to}-copy/" INDEXING MODE AUTOSELECT';

            --Execute the LOAD string
            CALL SYSPROC.ADMIN_CMD(LOADSTRING);

END
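
Once the procedure is in place, shredding a range of days is just a call with the two dates (the dates here are purely illustrative):

CALL {SCHEMA}.ADD_SHREDDED_REQUESTS_STAGE('2012-10-01', '2012-10-07')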

2. Gather the group elements

If you have nicely formed XML, without upper case, lower case and everything in between, you might not need this stage. Here I gather the element names and paths, normalise them to lower case, and insert the new ones to be used in the next stage. All the normalised values are given an ID.

So something like:


CREATE PROCEDURE {SCHEMA}.ADD_SHREDDED_REQUEST_ELEMENTS_GROUPS()
LANGUAGE SQL
BEGIN
    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------
    --Add based on what is not in the main elements table the new combinations
    --   from the shredded requests stage table that do not exist. 
    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------

    INSERT INTO {SCHEMA}.SHREDDED_REQUEST_ELEMENTS_GROUPS(
        LOWER_XML_ELEMENT_NAME,
        LOWER_XML_XPATH
    )
    SELECT DISTINCT LOWER(HSRS.XML_ELEMENT_NAME) AS LOWER_XML_ELEMENT_NAME,
                    LOWER(HSRS.XML_XPATH) AS LOWER_XML_XPATH
     FROM {SCHEMA}.SHREDDED_REQUESTS_STAGE HSRS
     WHERE NOT EXISTS (SELECT *
                      FROM {SCHEMA}.SHREDDED_REQUEST_ELEMENTS_GROUPS HSRE
                      WHERE LOWER(HSRS.XML_ELEMENT_NAME) = HSRE.LOWER_XML_ELEMENT_NAME
                        AND LOWER(HSRS.XML_XPATH) = HSRE.LOWER_XML_XPATH);

    COMMIT;
END

3. Gather the elements

At this stage I gather all the new elements and paths (un-normalised) and put them into the table matched against the grouped version. This gives you a table that can be used in the next stage to create the final table, with a far more index-friendly integer for the elements and paths.

Again an example of the SP:


CREATE PROCEDURE {SCHEMA}.ADD_SHREDDED_REQUEST_ELEMENTS()
LANGUAGE SQL
BEGIN
    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------
    --Add based on what is not in the main elements table the new combinations
    --   from the shredded requests stage table that do not exist. 
    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------

    INSERT INTO {SCHEMA}.SHREDDED_REQUEST_ELEMENTS(
        XML_ELEMENT_NAME,
        XML_XPATH,
        SHREDDED_REQUEST_ELEMENT_GROUP_ID
    )
    SELECT DISTINCT HSRS.XML_ELEMENT_NAME,
                    HSRS.XML_XPATH,
                    HSREG.SHREDDED_REQUEST_ELEMENT_GROUP_ID
    FROM {SCHEMA}.SHREDDED_REQUESTS_STAGE HSRS INNER JOIN {SCHEMA}.SHREDDED_REQUEST_ELEMENTS_GROUPS HSREG ON LOWER(HSRS.XML_ELEMENT_NAME) = HSREG.LOWER_XML_ELEMENT_NAME
                                                                                                    AND LOWER(HSRS.XML_XPATH) = HSREG.LOWER_XML_XPATH
    WHERE NOT EXISTS (SELECT *
                      FROM {SCHEMA}.SHREDDED_REQUEST_ELEMENTS HSRE
                      WHERE HSRS.XML_ELEMENT_NAME = HSRE.XML_ELEMENT_NAME
                        AND HSRS.XML_XPATH = HSRE.XML_XPATH);

    COMMIT;
END

4. Add to the permanent store

This stage uses the table from the previous stage to load the data into the final table, SHREDDED_REQUESTS, storing only the surrogate key for the element and path of each XML value.

Something like:


CREATE PROCEDURE {SCHEMA}.ADD_SHREDDED_REQUESTS()
LANGUAGE SQL
BEGIN
    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------
    --Add to the main table based on what is in the staging table. 
    --Everything to go over so if there is 1 day or 100 all goes!!
    --Using Export and load to get it done quicker and save on log space. 
    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------

    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------
    --Declare Vars
        DECLARE EXPORTSTRING VARCHAR(3000);
        DECLARE LOADSTRING VARCHAR(3000);

    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------
    --Process

    --Create the EXPORT string
    SET EXPORTSTRING = 'EXPORT TO "/{some file system you can output to}/SHREDDED_REQUESTS" OF DEL MODIFIED BY COLDEL0x09
                        SELECT HSRS.REQUEST_ID,
                               HSRS.REQUEST_DATE,
                               HSRE.SHREDDED_REQUEST_ELEMENT_ID,
                               HSRS.XML_ELEMENT_VALUE
                        FROM {SCHEMA}.SHREDDED_REQUESTS_STAGE HSRS INNER JOIN {SCHEMA}.SHREDDED_REQUEST_ELEMENTS HSRE ON HSRS.XML_ELEMENT_NAME = HSRE.XML_ELEMENT_NAME
                                                                                                                AND HSRS.XML_XPATH = HSRE.XML_XPATH';

    --Execute the EXPORT string
    CALL SYSPROC.ADMIN_CMD(EXPORTSTRING);

    --Create the LOAD string
    SET LOADSTRING = 'LOAD FROM "/{some file system you can output to}/SHREDDED_REQUESTS" OF DEL MODIFIED BY COLDEL0x09
                      METHOD P (1,2,3,4)
                      INSERT INTO {SCHEMA}.SHREDDED_REQUESTS(
                        REQUEST_ID, 
                        REQUEST_DATE, 
                        SHREDDED_REQUEST_ELEMENT_ID, 
                        XML_ELEMENT_VALUE
                      ) COPY YES TO "/{some file system you can output to}-copy/" INDEXING MODE AUTOSELECT';
                      
    --Execute the LOAD string
    CALL SYSPROC.ADMIN_CMD(LOADSTRING);
END
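
Stages 2, 3 and 4 take no parameters, so a full run is just the four CALLs in order. Once the data is in, getting it back out is a join through the element tables; here is a sketch with a made-up XPath, which also shows the pay-off of the case normalisation from GOTCHA 2:

SELECT SR.REQUEST_ID,
       SR.REQUEST_DATE,
       SR.XML_ELEMENT_VALUE
FROM {SCHEMA}.SHREDDED_REQUESTS SR INNER JOIN {SCHEMA}.SHREDDED_REQUEST_ELEMENTS SRE ON SR.SHREDDED_REQUEST_ELEMENT_ID = SRE.SHREDDED_REQUEST_ELEMENT_ID
                                   INNER JOIN {SCHEMA}.SHREDDED_REQUEST_ELEMENTS_GROUPS SREG ON SRE.SHREDDED_REQUEST_ELEMENT_GROUP_ID = SREG.SHREDDED_REQUEST_ELEMENT_GROUP_ID
WHERE SREG.LOWER_XML_XPATH = '/booking/personal/email' --Catches email, Email, EMAIL and eMAIL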

XML Shredding – The file

As a reward for getting through all of the above, or for skipping it all and going straight for the juicy bit at the end, here is the file – enjoy.

GOTCHA’s
  1. I have compression on the servers at work, so I have kept the COMPRESS YES statement in; if you are not licensed for it then you will need to remove it, otherwise you will be in violation of your licences.
  2. You will have to change all the {values}, to the schemas etc. that are relevant to your servers, to get it to work.
  3. Please make sure you test before you use – you have been warned!




SYSPROC.ADMIN_CMD – Start of an adventure

July 4, 2012 2:01 pm / dangerousDBA

On my old site I did a post on SYSPROC.ADMIN_CMD and how you can get it to carry out a RUNSTATS of a table. The full list of everything it can do can be found in the IBM DB2 LUW Info Center and, as the title of its page there suggests (“ADMIN_CMD procedure – Run administrative commands”), it allows you to run administrative tasks that you would normally have to type in manually or use the GUI (Data Studio) for.

Use of SYSPROC.ADMIN_CMD

The procedure is quite useful in that it takes up to a 2MB string in as a parameter and executes it. It is less useful in that it has a limited list of commands it will accept; e.g. it will not take any of the SIUD (SELECT, INSERT, UPDATE, DELETE) statements, which should be evident from the “administrative commands” part of the description. But equally I can think of many occasions where this would be more useful than PREPARE and EXECUTE for running dynamic commands.

The real power of this command, in my opinion, is that SYSPROC.ADMIN_CMD allows you to create stored procedures in DB2 where you would normally create a bash script. This makes your database administration scripts a lot more portable and gives you the ability to create a “Database Maintenance” (DB_MAIN) schema that is installable as part of a standard set up.
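
As a sketch of that idea – the DB_MAIN schema and procedure name here are just my example, not any standard API – a wrapper that builds a RUNSTATS command dynamically would look something like:

CREATE PROCEDURE DB_MAIN.RUNSTATS_TABLE(IN IN_SCHEMA VARCHAR(128), IN IN_TABLE VARCHAR(128))
LANGUAGE SQL
BEGIN
    --Declare Vars
    DECLARE CMDSTRING VARCHAR(2000);

    --Build the administrative command from the parameters
    SET CMDSTRING = 'RUNSTATS ON TABLE ' || IN_SCHEMA || '.' || IN_TABLE ||
                    ' ON ALL COLUMNS AND INDEXES ALL ALLOW WRITE ACCESS';

    --Run it through ADMIN_CMD
    CALL SYSPROC.ADMIN_CMD(CMDSTRING);
END

Which you would then call with your own schema and table:

CALL DB_MAIN.RUNSTATS_TABLE('{Schema}', '{Table Name}')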

Standard usage is along the lines of:

CALL SYSPROC.ADMIN_CMD('{Command}')

The SP can be called by anyone who has permissions on it, but for the commands issued through it the calling user still has to be of an appropriate authority level to run them; therefore you can’t just have any Joe (or Joanne) running a backup of the database.

Typical of how I use it is along the lines of:

CALL SYSPROC.ADMIN_CMD('RUNSTATS ON TABLE {Schema}.{Table Name} ON ALL COLUMNS AND INDEXES ALL ALLOW WRITE ACCESS')

But as you will notice, this still requires me to type in a command that would in essence still work at the command line. This brings me onto …..

Future posts

I use it to run things like RUNSTATS, reorganisations of tables and indexes, and exporting and loading data. In later blog posts I will be looking at how I use this command, and how you might be able to as well, to run standard but dynamic administrative commands.



