XML is most useful, especially when it is well formatted and you know what it contains and all the documents stored in your XML column are the same. When they are not and you can’t exploit fully DB2’s very good XML features, and you don’t know how to query your XML it can be very hard.
The problem
Millions of records of business information with data in them, no one really knows what’s in there, and no one knows what we might be missing out on. There is a solution in place at the moment that uses a Java application to read the record from DB2 parse it as a string, this is because the element names that you could use XQuery against are all in different cases and appear in different locations in the XML therefore writing an XQuery against it is near impossible without missing something, which the Java App does anyway.
The solution – XML Shredder
I can’t and don’t take credit fully for my solution as it is largely based on this article How to list the paths of all elements in an XML document? – Dr.Matthias Nicola. I have adapted it to list both the element name, path and value. First of all we will look at a simple version, then a more advanced probably more useful version with a built in function that I found very useful so that you don’t parse a document bigger than you buffer pool attached to the temp tablespaces where the query is run.
XML Shredder – Basic Code
WITH pathstable (name, node, xpath, value) AS ( SELECT x.name AS name, x.node AS xmlnode, '/' || x.name AS xpath, x.value as value FROM {SCHEMA}.{TABLE} {ALIAS} , XMLTABLE('${XML COLUMN YOU WISH TO SHRED}/*' COLUMNS name varchar(300) PATH './local-name()', value varchar(12000) PATH 'xs:string(.)', node XML PATH '.') AS x WHERE {ALIAS}.{FIELD} etc etc UNION ALL SELECT y.name AS name, y.node AS xmlnode, xpath|| '/' || y.name AS xpath, y.value as value FROM pathstable, XMLTABLE('$XMLNODE/(*,@*)' PASSING pathstable.node AS "XMLNODE" COLUMNS name varchar(300) PATH 'local-name()', value varchar(12000) PATH 'xs:string(.)', node XML PATH '.') AS y ) SELECT name, xpath, value FROM pathstable
So that's the basic version of the SQL and you will probably supply the WHERE clause something to identify the row that you wish to shred. The section "'${XML COLUMN YOU WISH TO SHRED}/*'" you need to keep the '$ and /*' to wrap your XML column. It will get you a result set something like:
name xpath value ------------------------------------------------------- email /personal/email a@b.com first_name /personal/name/first_name Philip
GOTCHA's
- The VARCHAR (in this case 12000) size attached to the value column needs to be long enough to all the values strung together, but if you are going to insert it anywhere this long return value can be taken out in the WHERE clause as it is useless any case and therefore the column that you insert into can be far shorter.
- Make sure that you can inline you XML otherwise this code will not work. This can be done using ADMIN_EST_INLINE_LENGTH({XML FIELD}) result is -1 as explained below.
XML Shredder - "Advanced" Code
So the original "Basic" code gave you no identifier to the XML, whatever yours may be so by using something like the below you can create an extended "triple-store" a quad store. I do this as the same element name can exist down different paths and can mean slightly different things depending on where it exists, plus the identifier for the row then it can make it unique and useful.
WITH pathstable ({ADDITIONAL FIELD(S)}, name, node, xpath, value) AS (
SELECT {ADDITIONAL FIELD(S)},
x.name AS name,
x.node AS xmlnode,
'/' || x.name AS xpath,
x.value as value
FROM {SCHEMA}.{TABLE} {ALIAS} {JOIN TYPE} JOIN {SCHEMA}.{TABLE} {ALIAS} ON {ALIAS ... etc, etc},
XMLTABLE('${XML COLUMN YOU WISH TO SHRED}/*'
COLUMNS
name varchar(300) PATH './local-name()',
value varchar(12000) PATH 'xs:string(.)',
node XML PATH '.') AS x
WHERE ADMIN_EST_INLINE_LENGTH(XML FIELD) <> -1
AND {ANY OTHER WHERE CLAUSES THAT AFFECT THE RESULT SET}
UNION ALL
SELECT {ADDITIONAL FIELD(S)},
y.name AS name,
y.node AS xmlnode,
xpath|| '/' || y.name AS xpath,
y.value as value
FROM pathstable,
XMLTABLE('$XMLNODE/(*,@*)' PASSING pathstable.node AS "XMLNODE"
COLUMNS
name varchar(300) PATH 'local-name()',
value varchar(12000) PATH 'xs:string(.)',
node XML PATH '.') AS y
) SELECT {ADDITIONAL FIELD(S)}, name, xpath, value
FROM pathstable
WHERE NAME {SOME PREDICATE} {SOME VALUE}
OR
VALUE {SOME PREDICATE} {SOME VALUE}
OR
{ADDITIONAL FIELD(S)} {SOME PREDICATE} {SOME VALUE}
In addition to the all the new code there to identify the record so you could store it in a table to use later you will notice the line ADMIN_EST_INLINE_LENGTH({XML FIELD}) <> -1. This inline function, full information from IBM DB2 Info Center here is used to make sure that no record that is too long too process. This is needed as your XML is stored separately therefore is not restricted by the page size of table space where your XML holding table exists and if you do not have a temporary table space large enough then this code will not work.
Future posts
Looking at how you could make this into a useful system, with the ability to be queried easily