In a previous post I wrote about shredding XML, especially when you don't know how your XML is formatted or when it is always different. Where I work the XML is not really XML at all: it is pseudo-XML that happens to conform to an XML column.
What am I giving you?
So below you will find some “light” documentation on a system like the one I have created at my workplace to shred XML and store it in a way that is easy to query. I will also go through some of the gotchas that I have seen so far in doing this. At the bottom of the page you will find a file with all the code you need.
GOTCHAs
- You may find that no size of VARCHAR column is big enough to hold your XML data in the query, especially during the initial stages of the recursive query, so you may need to use a CLOB column instead. I have found this a little quicker than a very large VARCHAR column, and you will not need to create such a large temporary tablespace bufferpool to hold the query.
- You may have elements that are named the same thing (e.g. email, Email, EMAIL, eMAIL) and exist down the same XPath. To handle this I created the SHREDDED_REQUEST_ELEMENTS_GROUPS table, which stores all the same-named, same-located elements normalised around the lower-cased XPath and element name.
- The query will produce entries that contain multiple values when the element is not the lowest element. So if you had /booking/personal with /email and /name under it, then you would get three rows output: /booking/personal with a value of the values of /email and /name concatenated together, plus the two rows /booking/personal/email and /booking/personal/name. You will therefore need to build a WHERE clause that excludes non-leaf paths like /booking/personal from your result sets.
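The last gotcha is easiest to see in miniature. Below is a small Python sketch (standard library only; the function name `shred` and the sample XML are my own illustration, not part of the solution) of the rows the recursive query produces, including the concatenated values on non-leaf paths and how to filter them out:

```python
import xml.etree.ElementTree as ET

def shred(xml_text):
    """Walk the document and emit (element_name, xpath, value) rows,
    mimicking the recursive SQL: every element gets a row, and a
    non-leaf element's value is all of its descendant text joined up."""
    rows = []

    def walk(node, parent_path):
        xpath = parent_path + "/" + node.tag
        # xs:string(.) on a non-leaf node concatenates descendant text
        value = "".join(node.itertext()).strip()
        rows.append((node.tag, xpath, value))
        for child in node:
            walk(child, xpath)

    walk(ET.fromstring(xml_text), "")
    return rows

rows = shred("<booking><personal><email>a@b.com</email>"
             "<name>Jo</name></personal></booking>")
# Non-leaf paths such as /booking/personal carry the concatenated
# values of their children, so keep only paths no other path extends.
leaves = [r for r in rows if not any(o[1].startswith(r[1] + "/") for o in rows)]
```

The `leaves` filter at the end is the same exclusion your WHERE clause has to express when querying the shredded data.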
XML Shredding – ERD
XML Shredding – The Tables
There are four tables in the solution: a staging table and then three others that make up a mini snowflake-like schema.
SHREDDED_REQUESTS_STAGE
The staging table for the solution, where the initial shred of the XML is inserted. This table is where the rest of the processing starts. The thing to watch out for here is the size of the XML_ELEMENT_VALUE column: make sure that you do not make this column too small. You could get rid of the date column, but I find it useful if you are processing multiple days.
SHREDDED_REQUESTS
This table is where the data resides at the end of the process, and for the rest of the data life cycle. It is essentially the same as the staging table, but I have made a surrogate key of the columns XML_ELEMENT_NAME and XML_XPATH. This reduces the storage needs of the table and makes it easier to search.
SHREDDED_REQUEST_ELEMENTS
This is the store for the XML_ELEMENT_NAME and XML_XPATH values once they have been given a surrogate key. It also holds the foreign key out to the grouping table.
SHREDDED_REQUEST_ELEMENTS_GROUPS
This is the “final” table in the solution, and you may be able to leave it out. I had to create it to enable the normalisation of all the same element names in different cases (e.g. EMAIL, Email, email and eMail) in the XML_ELEMENT_NAME and XML_XPATH columns.
XML Shredding – The Process
The process happens in four stages:
1. Shred the XML into the staging table
So this stage has to happen, otherwise there is very little point in this process. It uses a bit of recursive SQL, as outlined in a previous post, to pull out all the element values, XPaths and element names along with the appropriate identifier.
To speed this up, in the code supplied in the attachment you will see that I use an EXPORT statement to unload and a LOAD to make sure that the process happens as fast as possible, with the minimum of log contention and use. It is basic maths that you are going to be inserting a lot of rows: say your site does 100,000 requests and there are 50 elements on each request, that is 5,000,000 rows!
So something like:
CREATE PROCEDURE {SCHEMA}.ADD_SHREDDED_REQUESTS_STAGE(IN IN_START DATE, IN IN_END DATE)
LANGUAGE SQL
BEGIN
----------------------------------------------------------------------------
----------------------------------------------------------------------------
--Takes in two dates
--Does this using a EXPORT and LOAD so that it works quicker than insert
-- as the actual query takes no time at all and it is only the inserting
-- that takes the time.
--Uses the ADMIN_EST_INLINE_LENGTH Function from http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.sql.rtn.doc/doc/r0054083.html
----------------------------------------------------------------------------
----------------------------------------------------------------------------
----------------------------------------------------------------------------
----------------------------------------------------------------------------
--Declare Vars
DECLARE EXPORTSTRING VARCHAR(3000);
DECLARE LOADSTRING VARCHAR(3000);
----------------------------------------------------------------------------
----------------------------------------------------------------------------
--Process
--Create the EXPORT string
SET EXPORTSTRING = 'EXPORT TO "/{some file system you can output to}/SHREDDED_REQUESTS_STAGE" OF DEL MODIFIED BY COLDEL0x09
WITH pathstable (REQUEST_ID,REQUEST_DATE, name, node, xpath, value) AS (
SELECT HM.REQUEST_ID,
HM.REQUEST_DATE,
x.name AS name,
x.node AS xmlnode,
''/'' || x.name AS xpath,
x.value as value
FROM {SCHEMA}.{TABLE WITH XML COLUMN} HM,
XMLTABLE(''$REQUEST_XML/*'' PASSING HM.REQUEST_XML AS "REQUEST_XML"
COLUMNS
name varchar(300) PATH ''./local-name()'',
value varchar(12000) PATH ''xs:string(.)'',
node XML PATH ''.'') AS x
WHERE HM.REQUEST_DATE BETWEEN ''' || IN_START || ''' AND ''' || IN_END || '''
AND ADMIN_EST_INLINE_LENGTH(HM.REQUEST_XML) <> -1
UNION ALL
SELECT REQUEST_ID,
REQUEST_DATE,
y.name AS name,
y.node AS xmlnode,
xpath|| ''/'' || y.name AS xpath,
y.value as value
FROM pathstable,
XMLTABLE(''$XMLNODE/(*,@*)'' PASSING pathstable.node AS "XMLNODE"
COLUMNS
name varchar(300) PATH ''local-name()'',
value varchar(12000) PATH ''xs:string(.)'',
node XML PATH ''.'') AS y
) SELECT REQUEST_ID, REQUEST_DATE, name, xpath, value
FROM pathstable
WHERE NAME <> ''serialized''
AND NAME <> ''_viewstate''
AND NAME <> ''__shorturl_postdata''
AND NAME <> ''API_Reply''
AND NAME <> ''apiReply'' ';
--Execute the EXPORT string
CALL SYSPROC.ADMIN_CMD(EXPORTSTRING);
--Create the LOAD string
SET LOADSTRING = 'LOAD FROM "/{some file system you can output to}/SHREDDED_REQUESTS_STAGE" OF DEL MODIFIED BY COLDEL0x09
METHOD P (1,2,3,4,5)
INSERT INTO {SCHEMA}.SHREDDED_REQUESTS_STAGE(
REQUEST_ID,
REQUEST_DATE,
XML_ELEMENT_NAME,
XML_XPATH,
XML_ELEMENT_VALUE
) COPY YES TO "/{some file system you can output to}-copy/" INDEXING MODE AUTOSELECT';
--Execute the LOAD string
CALL SYSPROC.ADMIN_CMD(LOADSTRING);
END
2. Gather the group elements
If you have nicely formed XML without mixed upper / lower case and everything in-between, you might not need this stage. Here I gather the element names and paths, normalise them to lower case, and insert any new ones to be used in the next stage. Each normalised pair is given an ID.
So something like:
CREATE PROCEDURE {SCHEMA}.ADD_SHREDDED_REQUEST_ELEMENTS_GROUPS()
LANGUAGE SQL
BEGIN
----------------------------------------------------------------------------
----------------------------------------------------------------------------
--Insert the new lower-cased element name / XPath combinations
-- from the shredded requests stage table that do not already
-- exist in the groups table.
----------------------------------------------------------------------------
----------------------------------------------------------------------------
INSERT INTO {SCHEMA}.SHREDDED_REQUEST_ELEMENTS_GROUPS(
LOWER_XML_ELEMENT_NAME,
LOWER_XML_XPATH
)
SELECT DISTINCT LOWER(HSRS.XML_ELEMENT_NAME) AS LOWER_XML_ELEMENT_NAME,
LOWER(HSRS.XML_XPATH) AS LOWER_XML_XPATH
FROM {SCHEMA}.SHREDDED_REQUESTS_STAGE HSRS
WHERE NOT EXISTS (SELECT *
FROM {SCHEMA}.SHREDDED_REQUEST_ELEMENTS_GROUPS HSRE
WHERE LOWER(HSRS.XML_ELEMENT_NAME) = HSRE.LOWER_XML_ELEMENT_NAME
AND LOWER(HSRS.XML_XPATH) = HSRE.LOWER_XML_XPATH);
COMMIT;
END
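To make the intent of this stage concrete, here is a minimal Python sketch of the same dedupe-and-insert logic (the function name and the dict standing in for the groups table are my own illustration, not the DB2 code):

```python
def add_element_groups(stage_rows, groups):
    """Mimic ADD_SHREDDED_REQUEST_ELEMENTS_GROUPS: insert any lower-cased
    (element name, xpath) pair from the stage that the groups table does
    not already hold. `groups` maps each pair to its group ID."""
    for name, xpath, _value in stage_rows:
        key = (name.lower(), xpath.lower())
        if key not in groups:              # the NOT EXISTS check
            groups[key] = len(groups) + 1  # stand-in for an identity column
    return groups

groups = {}
stage = [("Email", "/booking/personal/Email", "a@b.com"),
         ("EMAIL", "/booking/personal/EMAIL", "c@d.com")]
add_element_groups(stage, groups)
# Both casings collapse into one group: ("email", "/booking/personal/email")
```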
3. Gather the elements
At this stage I gather all the new elements and paths (un-normalised) and put them into the table matched against the grouped version. This gives you a table that can be used in the next stage to create the final table, with a far more index-friendly integer for the elements and paths.
Again an example of the SP:
CREATE PROCEDURE {SCHEMA}.ADD_SHREDDED_REQUEST_ELEMENTS()
LANGUAGE SQL
BEGIN
----------------------------------------------------------------------------
----------------------------------------------------------------------------
--Insert the new element name / XPath combinations
-- from the shredded requests stage table that do not already
-- exist in the elements table.
----------------------------------------------------------------------------
----------------------------------------------------------------------------
INSERT INTO {SCHEMA}.SHREDDED_REQUEST_ELEMENTS(
XML_ELEMENT_NAME,
XML_XPATH,
SHREDDED_REQUEST_ELEMENT_GROUP_ID
)
SELECT DISTINCT HSRS.XML_ELEMENT_NAME,
HSRS.XML_XPATH,
HSREG.SHREDDED_REQUEST_ELEMENT_GROUP_ID
FROM {SCHEMA}.SHREDDED_REQUESTS_STAGE HSRS INNER JOIN {SCHEMA}.SHREDDED_REQUEST_ELEMENTS_GROUPS HSREG ON LOWER(HSRS.XML_ELEMENT_NAME) = HSREG.LOWER_XML_ELEMENT_NAME
AND LOWER(HSRS.XML_XPATH) = HSREG.LOWER_XML_XPATH
WHERE NOT EXISTS (SELECT *
FROM {SCHEMA}.SHREDDED_REQUEST_ELEMENTS HSRE
WHERE HSRS.XML_ELEMENT_NAME = HSRE.XML_ELEMENT_NAME
AND HSRS.XML_XPATH = HSRE.XML_XPATH);
COMMIT;
END
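The effect of this stage can be sketched in Python as follows (hypothetical names; the dicts stand in for the groups and elements tables, and `len(elements) + 1` stands in for an identity column):

```python
def add_elements(stage_rows, groups, elements):
    """Mimic ADD_SHREDDED_REQUEST_ELEMENTS: give each exact-case
    (element name, xpath) pair a surrogate integer key, linked to the
    ID of its lower-cased group."""
    for name, xpath, _value in stage_rows:
        key = (name, xpath)
        if key not in elements:  # the NOT EXISTS check
            group_id = groups[(name.lower(), xpath.lower())]
            elements[key] = (len(elements) + 1, group_id)
    return elements

groups = {("email", "/booking/personal/email"): 1}
stage = [("Email", "/booking/personal/Email", "a@b.com"),
         ("EMAIL", "/booking/personal/EMAIL", "c@d.com")]
elements = add_elements(stage, groups, {})
# Two exact-case spellings get two surrogate keys but share one group ID
```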
4. Add to the permanent store
Use the table from the previous stage and load the data into the final table SHREDDED_REQUESTS, storing only the surrogate key for the element and path alongside the XML value.
Something like:
CREATE PROCEDURE {SCHEMA}.ADD_SHREDDED_REQUESTS()
LANGUAGE SQL
BEGIN
----------------------------------------------------------------------------
----------------------------------------------------------------------------
--Add to the main table based on what is in the staging table.
--Everything goes over, so whether there is 1 day or 100 it all goes!!
--Using EXPORT and LOAD to get it done quicker and save on log space.
----------------------------------------------------------------------------
----------------------------------------------------------------------------
----------------------------------------------------------------------------
----------------------------------------------------------------------------
--Declare Vars
DECLARE EXPORTSTRING VARCHAR(3000);
DECLARE LOADSTRING VARCHAR(3000);
----------------------------------------------------------------------------
----------------------------------------------------------------------------
--Process
--Create the EXPORT string
SET EXPORTSTRING = 'EXPORT TO "/{some file system you can output to}/SHREDDED_REQUESTS" OF DEL MODIFIED BY COLDEL0x09
SELECT HSRS.REQUEST_ID,
HSRS.REQUEST_DATE,
HSRE.SHREDDED_REQUEST_ELEMENT_ID,
HSRS.XML_ELEMENT_VALUE
FROM {SCHEMA}.SHREDDED_REQUESTS_STAGE HSRS INNER JOIN {SCHEMA}.SHREDDED_REQUEST_ELEMENTS HSRE ON HSRS.XML_ELEMENT_NAME = HSRE.XML_ELEMENT_NAME
AND HSRS.XML_XPATH = HSRE.XML_XPATH';
--Execute the EXPORT string
CALL SYSPROC.ADMIN_CMD(EXPORTSTRING);
--Create the LOAD string
SET LOADSTRING = 'LOAD FROM "/{some file system you can output to}/SHREDDED_REQUESTS" OF DEL MODIFIED BY COLDEL0x09
METHOD P (1,2,3,4)
INSERT INTO {SCHEMA}.SHREDDED_REQUESTS(
REQUEST_ID,
REQUEST_DATE,
SHREDDED_REQUEST_ELEMENT_ID,
XML_ELEMENT_VALUE
) COPY YES TO "/{some file system you can output to}-copy/" INDEXING MODE AUTOSELECT';
--Execute the LOAD string
CALL SYSPROC.ADMIN_CMD(LOADSTRING);
END
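Ignoring the EXPORT/LOAD mechanics, the transformation this stage performs is just a key lookup. A minimal Python sketch (illustrative names only, with REQUEST_ID and REQUEST_DATE dropped for brevity):

```python
def add_shredded_requests(stage_rows, elements):
    """Mimic ADD_SHREDDED_REQUESTS: store the surrogate element ID next
    to each value instead of the full element name and XPath strings."""
    return [(elements[(name, xpath)][0], value)
            for name, xpath, value in stage_rows]

elements = {("Email", "/booking/personal/Email"): (1, 1)}
stage = [("Email", "/booking/personal/Email", "a@b.com")]
final = add_shredded_requests(stage, elements)
# final holds (SHREDDED_REQUEST_ELEMENT_ID, XML_ELEMENT_VALUE) pairs
```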
XML Shredding – The file
As a reward for getting through all of the above, or for skipping it all and going straight for the juicy bit at the end, here is the file – enjoy
GOTCHAs
- I have compression licensed on the servers at work, so I have kept the COMPRESS YES statement in; if you don't have it then you will need to remove it, otherwise you will be in violation of your licences.
- You will have to change all the {values} to get it to work, to the schemas etc. that are relevant to your servers
- Please make sure you test before you use; you have been warned