• About Dangerous DBA
  • Table of Contents
Dangerous DBA A blog for those DBA's who live on the edge

Tag Archives: Admin_est_inline_length

XML Shredding – Data Model and code

October 19, 2012 10:01 pm / Leave a Comment / dangerousDBA
ShreddingERD

In a previous post I wrote about shredding xml especially when you don’t know how your XML is formatted or it is always different, which where I work the XML is not XML it is a pseudo XML that conforms to an XML column.

What am I giving you

So below you will find some “light” documentation on the system that is like the system that I have created at my workplace to shred XML and store it in a way that will be easy to query. I will also go through some of the GOTCHA’s that I have seen so far in doing this. At the bottom of the page you will find a file with all the code you need.

GOTCHA’s

  1. You may find that no size of VARCHAR column is big enough to hold your XML data in the query especially when the initial stages of the recursive query therefore you may need to to use a CLOB column. I have found this is a little quicker than a very large VARCHAR column, and you will not need to create a temporary tablespace bufferpool so large to hold the query.
  2. You may have elements that are named the same thing (e.g. email, Email, EMAIL, eMAIL) and exist down the same XPATH then I have created the SHREDDED_REQUESTS_ELEMENTS_GROUP table and this will create a store for all the same named and located elements normalised around the XPATH and element name being lowered.
  3. The query code will produce entries that contain multiple values when the element is not the lowest element. So if you say had /booking/personal with /email and /name under it then you would get three rows output with /booking/personal with a value of the the values of /email and /name concatenated together, and the two additional rows of /booking/personal/email and /booking/personal/name, therefore you will need to build into the WHERE clause to exclude any paths like /booking/personal out of the result sets.

XML Shredding – ERD

XML Shredding – The Tables

There are four tables in the solution a staging table and then three others that make up a mini-flake like schema.

SHREDDED_REQUESTS_STAGE

The staging tables for the solution where the initial shredding of the XML is inserted into. This table is where the rest of the processing starts. The thing to watch out for here is the size of the XML_ELEMENT_VALUE column; here you need to make sure that you do not make this column too small. You could get rid of the date column but I find it useful if you are processing multiple days.

SHREDDED_REQUESTS

This table is where the data resides at the end of the process and resides for the rest of the data life cycle. It is essentially the same staging table but I have made a surrogate key of the the column XML_ELEMENT_NAME and XML_XPATH. This is to reduce the storage needs of the table and makes it easier to search.

SHREDDED_REQUEST_ELEMENTS

This is the store for the XML_ELEMENT_NAME and XML_XPATH once they have been made a surrogate. It also holds the key out too the grouping table.

SHREDDED_REQUESTS_ELEMENTS_GROUPS

This is the “final” table in the solution, and you may be able to leave it off because I had to create it too enable the normalisation of all the same element names in different case (e.g. EMAIL, Email, email and eMail) issues in the XML_ELEMENT_NAME and XML_XPATH.

XML Shredding – The Process

The process happens in four stages:

1.Shed the XML into the staging table

So this stage has to happen otherwise there is very little point in this process. It uses a bit of recursive SQL as out lined in a previous post and all the element values, XPaths and element names with the appropriate identifier.

To speed this up and in the code supplied in the attachment you will see that I use an EXPORT statement to unload and an LOAD to make sure that the process happens as fast as possible, with the minimum of log contention and use. It is basic maths that you are going to be inserting a lot of rows, say your site does 100,000 requests and there are 50 elements on each request that is 5,000,000 rows!

So something like:


CREATE PROCEDURE {SCHEMA}.ADD_SHREDDED_REQUESTS_STAGE(IN IN_START DATE, IN IN_END DATE)
LANGUAGE SQL
BEGIN
    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------
    --Takes in two dates
    --Does this using a EXPORT and LOAD so that it works quicker than insert
    --   as the actual query takes no time at all and it is only the inserting 
    --   that takes the time.
    --Uses the ADMIN_EST_INLINE_LENGTH Function from http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.sql.rtn.doc/doc/r0054083.html 
    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------

    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------
    --Declare Vars
        DECLARE EXPORTSTRING VARCHAR(3000);
        DECLARE LOADSTRING VARCHAR(3000);

    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------
    --Process
            --Create the EXPORT string
            SET EXPORTSTRING = 'EXPORT TO "/{some file system you can output too}/SHREDDED_REQUESTS_STAGE" OF DEL MODIFIED BY COLDEL0x09 
                                WITH pathstable (REQUEST_ID,REQUEST_DATE, name, node, xpath, value) AS (
                                  SELECT HM.REQUEST_ID,
                                         HM.REQUEST_DATE,
                                         x.name AS name, 
                                         x.node AS xmlnode,
                                         ''/'' || x.name AS xpath,
                                         x.value as value
                                  FROM {SCHEMA}.{TABLE WITH XML COLUMN} HM,
                                       XMLTABLE(''$REQUEST_XML/*''
                                        COLUMNS
                                          name varchar(300) PATH ''./local-name()'',
                                          value varchar(12000) PATH ''xs:string(.)'',
                                          node    XML PATH ''.'') AS x
                                   WHERE HR.REQUEST_DATE BETWEEN ''' || IN_START || ''' AND ''' || IN_END || '''                                        
                                        AND ADMIN_EST_INLINE_LENGTH(HM.REQUEST_XML) {} -1
                                  UNION ALL
                                  SELECT REQUEST_ID,
                                         REQUEST_DATE,
                                         y.name AS name, 
                                         y.node AS xmlnode, 
                                         xpath|| ''/'' || y.name AS xpath,
                                         y.value as value
                                  FROM pathstable,
                                       XMLTABLE(''$XMLNODE/(*,@*)'' PASSING pathstable.node AS "XMLNODE"
                                        COLUMNS
                                         name varchar(300) PATH ''local-name()'',
                                         value varchar(12000) PATH ''xs:string(.)'',
                                         node    XML PATH ''.'') AS y
                                ) SELECT REQUEST_ID, REQUEST_DATE, name, xpath, value
                                  FROM pathstable
                                  WHERE NAME {} ''serialized''
                                    AND NAME {} ''_viewstate''
                                    AND NAME {} ''__shorturl_postdata'' 
                                    AND NAME {} ''API_Reply''
                                    AND NAME {} ''apiReply'' ';

            --Execute the EXPORT string
            CALL SYSPROC.ADMIN_CMD(EXPORTSTRING);

            --Create the LOAD string
            SET LOADSTRING = 'LOAD FROM "/{some file system you can output too}/SHREDDED_REQUESTS_STAGE" OF DEL MODIFIED BY COLDEL0x09
                              METHOD P (1,2,3,4,5)
                              INSERT INTO {SCHEMA}.SHREDDED_REQUESTS_STAGE(
                                REQUEST_ID,
                                REQUEST_DATE,
                                XML_ELEMENT_NAME,
                                XML_XPATH,
                                XML_ELEMENT_VALUE
                              ) COPY YES TO "/{some file system you can output too}-copy/" INDEXING MODE AUTOSELECT';

            --Execue the LOAD string
            CALL SYSPROC.ADMIN_CMD(LOADSTRING);

END

2.Gather the group elements

If you have nicely formed XML without upper / lower case and everything in-between you might not need this stage, but here I gather the element names and paths and normalise them to lower, and insert the new ones to be used in the next stage. All the normalised values are given an ID

So something like:


CREATE PROCEDURE {SCHEMA}.ADD_SHREDDED_REQUEST_ELEMENTS_GROUPS()
LANGUAGE SQL
BEGIN
    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------
    --Add based on what is not in the main elements table the new combinations
    --   from the shredded requests stage table that do not exist. 
    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------

    INSERT INTO {SCHEMA}.SHREDDED_REQUEST_ELEMENTS_GROUPS(
        LOWER_XML_ELEMENT_NAME,
        LOWER_XML_XPATH
    )
    SELECT DISTINCT LOWER(HSRS.XML_ELEMENT_NAME) AS LOWER_XML_ELEMENT_NAME,
                    LOWER(HSRS.XML_XPATH) AS LOWER_XML_XPATH
     FROM {SCHEMA}.SHREDDED_REQUESTS_STAGE HSRS
     WHERE NOT EXISTS (SELECT *
                      FROM {SCHEMA}.SHREDDED_REQUEST_ELEMENTS_GROUPS HSRE
                      WHERE LOWER(HSRS.XML_ELEMENT_NAME) = HSRE.LOWER_XML_ELEMENT_NAME
                        AND LOWER(HSRS.XML_XPATH) = HSRE.LOWER_XML_XPATH);

    COMMIT;
END

3.Gather the elements

At this stage I gather all the new elements and paths (un-normailised) and put them into the table matched against the grouped version. This gives you a table that can be used in the next stage to create the final table with a far more index friendly integer for the elements and paths.

Again an example of the SP:


CREATE PROCEDURE {SCHEMA}.ADD_SHREDDED_REQUEST_ELEMENTS()
LANGUAGE SQL
BEGIN
    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------
    --Add based on what is not in the main elements table the new combinations
    --   from the shredded requests stage table that do not exist. 
    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------

    INSERT INTO {SCHEMA}.SHREDDED_REQUEST_ELEMENTS(
        XML_ELEMENT_NAME,
        XML_XPATH,
        SHREDDED_REQUEST_ELEMENT_GROUP_ID
    )
    SELECT DISTINCT HSRS.XML_ELEMENT_NAME,
                    HSRS.XML_XPATH,
                    HSREG.SHREDDED_REQUEST_ELEMENT_GROUP_ID
    FROM {SCHEMA}.SHREDDED_REQUESTS_STAGE HSRS INNER JOIN {SCHEMA}.SHREDDED_REQUEST_ELEMENTS_GROUPS HSREG ON LOWER(HSRS.XML_ELEMENT_NAME) = HSREG.LOWER_XML_ELEMENT_NAME
                                                                                                    AND LOWER(HSRS.XML_XPATH) = HSREG.LOWER_XML_XPATH
    WHERE NOT EXISTS (SELECT *
                      FROM {SCHEMA}.SHREDDED_REQUEST_ELEMENTS HSRE
                      WHERE HSRS.XML_ELEMENT_NAME = HSRE.XML_ELEMENT_NAME
                        AND HSRS.XML_XPATH = HSRE.XML_XPATH);

    COMMIT;
END
GO

4.Add to the permanent store

Use the table from the previous stage and load the data into the final table SHREDDED_REQUESTS. Loading only the surrogate key for the element and path of the XML value.

Something like:


CREATE PROCEDURE {SCHEMA}.ADD_SHREDDED_REQUESTS()
LANGUAGE SQL
BEGIN
    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------
    --Add to the main table based on what is in the staging table. 
    --Everything to go over so if there is 1 day or 100 all goes!!
    --Using Export and load to get it done quicker and save on log space. 
    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------

    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------
    --Declare Vars
        DECLARE EXPORTSTRING VARCHAR(3000);
        DECLARE LOADSTRING VARCHAR(3000);

    ----------------------------------------------------------------------------
    ----------------------------------------------------------------------------
    --Process

    --Create the EXPORT string
    SET EXPORTSTRING = 'EXPORT TO "/{some file system you can output too}/SHREDDED_REQUESTS" OF DEL MODIFIED BY COLDEL0x09
                        SELECT HSRS.REQUEST_ID,
                               HSRS.REQUEST_DATE,
                               HSRE.SHREDDED_REQUEST_ELEMENT_ID,
                               HSRS.XML_ELEMENT_VALUE
                        FROM {SCHEMA}.SHREDDED_REQUESTS_STAGE HSRS INNER JOIN {SCHEMA}.SHREDDED_REQUEST_ELEMENTS HSRE ON HSRS.XML_ELEMENT_NAME = HSRE.XML_ELEMENT_NAME
                                                                                                                AND HSRS.XML_XPATH = HSRE.XML_XPATH';

    --Execute the EXPORT string
    CALL SYSPROC.ADMIN_CMD(EXPORTSTRING);

    --Create the LOAD string
    SET LOADSTRING = 'LOAD FROM "/{some file system you can output too}/SHREDDED_REQUESTS" OF DEL MODIFIED BY COLDEL0x09
                      METHOD P (1,2,3,4)
                      INSERT INTO {SCHEMA}.SHREDDED_REQUESTS(
                        REQUEST_ID, 
                        REQUEST_DATE, 
                        SHREDDED_REQUEST_ELEMENT_ID, 
                        XML_ELEMENT_VALUE
                      ) COPY YES TO "/{some file system you can output too}-copy/" INDEXING MODE AUTOSELECT';
                      
    --Execue the LOAD string
    CALL SYSPROC.ADMIN_CMD(LOADSTRING);
END

XML Shredding – The file

As a reward for getting through all of the above, or for skipping it all and going for the juicy bit at the end here is the file – enjoy

GOTCHA’s
  1. I have compression on the servers at work, I have kept COMPRESS YES statement in, if you don’t have it then you will need to remove it other wise you will be in violation of your licences.
  2. You will have to change all {values} to get it to work, to your schemas etc that are relevant to your servers
  3. Please make sure you test before you use, you have been warned



Posted in: Date, DB2 Administration, DB2 Data Types, DB2 Development, Decompose XML, IBM DB2 LUW, Recursive Query, Recursive SQL, Shred XML, V10, V9.7, Vargraphic, XML, XMLTABLE / Tagged: ADMIN_EST_INLINE_LENGTH, DB2, DB2 Administration, DB2 Development, Decompose XML, EXPORT, IBM, IBM DB2 LUW, LOAD, Recursive Query, Recursive SQL, V10.1, V9.7, XML, XML PATH, XMLTABLE

Shredding XML in DB2 – XML Document shredding

August 9, 2012 11:00 am / Leave a Comment / dangerousDBA

XML is most useful, especially when it is well formatted and you know what it contains and all the documents stored in your XML column are the same. When they are not and you can’t exploit fully DB2’s very good XML features, and you don’t know how to query your XML it can be very hard.

The problem

Millions of records of business information with data in them, no one really knows what’s in there, and no one knows what we might be missing out on. There is a solution in place at the moment that uses a Java application to read the record from DB2 parse it as a string, this is because the element names that you could use XQuery against are all in different cases and appear in different locations in the XML therefore writing an XQuery against it is near impossible without missing something, which the Java App does anyway.

The solution – XML Shredder

I can’t and don’t take credit fully for my solution as it is largely based on this article How to list the paths of all elements in an XML document? – Dr.Matthias Nicola. I have adapted it to list both the element name, path and value. First of all we will look at a simple version, then a more advanced probably more useful version with a built in function that I found very useful so that you don’t parse a document bigger than you buffer pool attached to the temp tablespaces where the query is run.

XML Shredder – Basic Code

WITH pathstable (name, node, xpath, value) AS (
  SELECT x.name AS name, 
         x.node AS xmlnode,
         '/' || x.name AS xpath,
         x.value as value
  FROM {SCHEMA}.{TABLE} {ALIAS} ,
       XMLTABLE('${XML COLUMN YOU WISH TO SHRED}/*'
        COLUMNS
          name varchar(300) PATH './local-name()',
          value varchar(12000) PATH 'xs:string(.)',
          node    XML PATH '.') AS x
  WHERE {ALIAS}.{FIELD} etc etc
  UNION ALL
  SELECT y.name AS name, 
         y.node AS xmlnode, 
         xpath|| '/' || y.name AS xpath,
         y.value as value
  FROM pathstable,
       XMLTABLE('$XMLNODE/(*,@*)' PASSING pathstable.node AS "XMLNODE"
        COLUMNS
         name varchar(300) PATH 'local-name()',
         value varchar(12000) PATH 'xs:string(.)',
         node    XML PATH '.') AS y
) SELECT name, xpath, value
  FROM pathstable

So that's the basic version of the SQL and you will probably supply the WHERE clause something to identify the row that you wish to shred. The section "'${XML COLUMN YOU WISH TO SHRED}/*'" you need to keep the '$ and /*' to wrap your XML column. It will get you a result set something like:

name          xpath                        value
-------------------------------------------------------
email         /personal/email              a@b.com
first_name    /personal/name/first_name    Philip

GOTCHA's

  1. The VARCHAR (in this case 12000) size attached to the value column needs to be long enough to all the values strung together, but if you are going to insert it anywhere this long return value can be taken out in the WHERE clause as it is useless any case and therefore the column that you insert into can be far shorter.
  2. Make sure that you can inline you XML otherwise this code will not work. This can be done using ADMIN_EST_INLINE_LENGTH({XML FIELD}) result is -1 as explained below.

XML Shredder - "Advanced" Code

So the original "Basic" code gave you no identifier to the XML, whatever yours may be so by using something like the below you can create an extended "triple-store" a quad store. I do this as the same element name can exist down different paths and can mean slightly different things depending on where it exists, plus the identifier for the row then it can make it unique and useful.


WITH pathstable ({ADDITIONAL FIELD(S)}, name, node, xpath, value) AS (
SELECT {ADDITIONAL FIELD(S)},
     x.name AS name, 
     x.node AS xmlnode,
     '/' || x.name AS xpath,
     x.value as value
FROM {SCHEMA}.{TABLE} {ALIAS} {JOIN TYPE} JOIN {SCHEMA}.{TABLE} {ALIAS} ON {ALIAS ... etc, etc},
       XMLTABLE('${XML COLUMN YOU WISH TO SHRED}/*'
    COLUMNS
      name varchar(300) PATH './local-name()',
      value varchar(12000) PATH 'xs:string(.)',
      node    XML PATH '.') AS x
WHERE ADMIN_EST_INLINE_LENGTH(XML FIELD) <> -1
    AND {ANY OTHER WHERE CLAUSES THAT AFFECT THE RESULT SET}
UNION ALL
SELECT {ADDITIONAL FIELD(S)},
     y.name AS name, 
     y.node AS xmlnode, 
     xpath|| '/' || y.name AS xpath,
     y.value as value
FROM pathstable,
   XMLTABLE('$XMLNODE/(*,@*)' PASSING pathstable.node AS "XMLNODE"
    COLUMNS
     name varchar(300) PATH 'local-name()',
     value varchar(12000) PATH 'xs:string(.)',
     node    XML PATH '.') AS y
) SELECT {ADDITIONAL FIELD(S)}, name, xpath, value
FROM pathstable
WHERE NAME {SOME PREDICATE} {SOME VALUE}
          OR
      VALUE {SOME PREDICATE} {SOME VALUE}
          OR
      {ADDITIONAL FIELD(S)} {SOME PREDICATE} {SOME VALUE}

In addition to the all the new code there to identify the record so you could store it in a table to use later you will notice the line ADMIN_EST_INLINE_LENGTH({XML FIELD}) <> -1. This inline function, full information from IBM DB2 Info Center here is used to make sure that no record that is too long too process. This is needed as your XML is stored separately therefore is not restricted by the page size of table space where your XML holding table exists and if you do not have a temporary table space large enough then this code will not work.

Future posts

Looking at how you could make this into a useful system, with the ability to be queried easily



Posted in: ADMIN_EST_INLINE_LENGTH, DB2 built in functions, DB2 Data Types, DB2 Development, Decompose XML, IBM DB2 LUW, Recursive Query, Recursive SQL, Shred XML, Varchar, XML, XMLTABLE / Tagged: ADMIN_EST_INLINE_LENGTH, DB2 Development, Decompose XML, IBM DB2 LUW, Recursive Query, Recursive SQL, Shred XML, Varchar, XML, XML PATH, XMLTABLE

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 757 other subscribers

Recent Posts

  • Self generating Simple SQL procedures – MySQL
  • Google Cloud Management – My Idea – My White Whale?
  • Position Tracker – The Stub – Pandas:
  • Position Tracker – The Stub
  • Position Tracker – In the beginning
  • Whats been going on in the world of the Dangerous DBA:
  • QCon London Day 1
  • Testing Amazon Redshift: Distribution keys and styles
  • Back to dangerous blogging
  • DB2 10.1 LUW Certification 611 notes 1 : Physical Design

Dangerous Topics

added functionality ADMIN_EST_INLINE_LENGTH Bootcamp colum convert data types DB2 db2 DB2 Administration DB2 Development db2advis db2licm Decompose XML EXPORT GCP Google IBM IBM DB2 LUW idug information centre infosphere IOT LOAD merry christmas and a happy new year Position Tracking python Recursive Query Recursive SQL Reorganisation Reorganise Reorganise Indexes Reorganise Tables Runstats sql statement Stored Procedures SYSPROC.ADMIN_CMD Time UDF User Defined Functions V9.7 V10.1 Varchar XML XML PATH XMLTABLE

DangerousDBA Links

  • DB2 for WebSphere Commerce
  • My Personal Blog

Disclaimer:

The posts here represent my personal views and not those of my employer. Any technical advice or instructions are based on my own personal knowledge and experience, and should only be followed by an expert after a careful analysis. Please test any actions before performing them in a critical or nonrecoverable environment. Any actions taken based on my experiences should be done with extreme caution. I am not responsible for any adverse results. DB2 is a trademark of IBM. I am not an employee or representative of IBM.

Advertising

© Copyright 2023 - Dangerous DBA
Infinity Theme by DesignCoral / WordPress