From: Uri Dimant
Johnny
Unfortunately I have very limited knowledge about using XML in SQL
Server... hopefully Erland will jump in.



"Johnny Persson" <a(a)a.a> wrote in message
news:%23qXjhgsvKHA.5940(a)TK2MSFTNGP02.phx.gbl...
> Thank you for your answer,
>
> we use XML as the column data type, so the first article does not seem to
> affect us. It is however interesting how the query optimizer works - or
> doesn't work :)
>
> The second article had a really interesting part about typed/untyped XML.
> We have had no XML schema for any of our XML, so I tried that...
>
> I created an XML schema for an XML file (~5MB) and compared the
> performance between a stored procedure which uses the XML schema and one
> which does not.
>
> The result is to me a bit strange. When we use a "typed xml column" the
> execution time is ~15s and the subtree cost is 428. When we use the
> normal, untyped xml column, the execution time is the same, BUT the
> subtree cost is 866!
>
> Do you have any thoughts about the result and why the execution time isn't
> affected?
>
> Regards,
> Johnny
>
> On 2010-03-08 11:06, Uri Dimant wrote:
>> Johnny
>>
>> http://sqlblog.com/blogs/adam_machanic/archive/2010/01/12/t-sql-tuesday-002-is-it-xml-or-not.aspx
>>
>>
>> http://sqlcat.com/msdnmirror/archive/2010/03/01/performance-tips-of-using-xml-data-in-sql-server.aspx
>>
>>
>>
>> "Johnny Persson"<a(a)a.a> wrote in message
>> news:%23Ht3vLqvKHA.1796(a)TK2MSFTNGP02.phx.gbl...
>>> Hi,
>>>
>>> we are having some performance issues regarding xml shredding.
>>>
>>> At this point we are extracting data from XML files from nearly 60
>>> different companies - and therefore 60 different XML structures. The
>>> total amount of XML is about 350MB, and we are trying to extract the
>>> data as fast as possible.
>>>
>>> Our current system extracts, transforms and loads the data in about five
>>> minutes. We would, however, like to get this down to about one minute.
>>>
>>> We use the "nodes/CROSS APPLY" technique to shred the XML into our
>>> internal format.
>>>
>>> This is how we shred the data.
>>> ------------------------------
>>>
>>> 1) Load xml into a temporary table (#XmlTable)
>>> 2) Set an xml index
>>> 3) Query (like below)
>>>
>>> INSERT INTO #TransformedData
>>> SELECT
>>>     T0.T.value('asasd', 'nvarchar(255)'),
>>>     T1.T.value('asasd', 'nvarchar(255)')
>>> FROM
>>>     #XmlTable
>>> CROSS APPLY
>>>     data.nodes('asd') AS T0(T)
>>> CROSS APPLY
>>>     T0.T.nodes('level1') AS T1(T)
>>>
>>> DROP TABLE #XmlTable
>>>
>>> 4) Pass the temporary table #TransformedData into the common/shared
>>> transformation procedure
>>>
>>> EXEC LookupData
>>>
>>> -------------------------------
>>>
>>> This is very I/O intensive and it makes the system slow. Are there any
>>> other good ways to parse the XML inside SQL Server? Should we perhaps
>>> move the shredding outside the SQL environment into, for instance, a C#
>>> method which bulk loads the data?
>>>
>>> Regards,
>>> Johnny
>>
>>


From: Erland Sommarskog
Johnny Persson (a(a)a.a) writes:
> The result is to me a bit strange. When we use a "typed xml column" the
> execution time is ~15s and the subtree cost is 428. When we use the
> normal, untyped xml column, the execution time is the same, BUT the
> subtree cost is 866!

The subtree cost is just an estimate, so take it for what it's worth.
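
If you want hard numbers, measure elapsed time and I/O directly instead
of comparing estimated plan costs, for instance:

   SET STATISTICS TIME ON;
   SET STATISTICS IO ON;
   -- run the shredding query here and read the messages output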

Schemabound XML may have some advantages if you extract a lot from the
same document, but it can also cut the other way, as schema validation
is quite expensive.
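
For reference, typing the column means binding it to a schema collection,
along these lines (only a sketch - the names are invented and the schema
body is elided):

   CREATE XML SCHEMA COLLECTION CompanyXmlSchema AS
   N'<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <!-- element declarations for one company format -->
   </xsd:schema>';

   ALTER TABLE dbo.XmlStaging
      ALTER COLUMN data xml(CompanyXmlSchema);

The ALTER validates every row already in the table, and each subsequent
insert is validated as well - with 350MB of XML that is not free.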

As I said in another post, you should look at how you address the nodes.
There is a lot to win in that area.
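
The main thing is to use a text() step together with a singleton
predicate when you extract atomic values, so that the optimizer knows it
fetches exactly one text node and does not have to assemble mixed
content. That is, instead of

   T.value('SomeElement[1]', 'nvarchar(255)')

write

   T.value('(SomeElement/text())[1]', 'nvarchar(255)')

(The element name is invented - adapt it to your documents.)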


--
Erland Sommarskog, SQL Server MVP, esquel(a)sommarskog.se

Links for SQL Server Books Online:
SQL 2008: http://msdn.microsoft.com/en-us/sqlserver/cc514207.aspx
SQL 2005: http://msdn.microsoft.com/en-us/sqlserver/bb895970.aspx
SQL 2000: http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx

From: Michael Coles
"Johnny Persson" <a(a)a.a> wrote in message
news:%23Ht3vLqvKHA.1796(a)TK2MSFTNGP02.phx.gbl...
> INSERT INTO #TransformedData
> SELECT
>     T0.T.value('asasd', 'nvarchar(255)'),
>     T1.T.value('asasd', 'nvarchar(255)')
> FROM
>     #XmlTable
> CROSS APPLY
>     data.nodes('asd') AS T0(T)
> CROSS APPLY
>     T0.T.nodes('level1') AS T1(T)

Just adding to what Erland said, if your XML data looks like this:

<level1>
<asd>...</asd>
<asd>...</asd>
</level1>

Consider using the '//asd' path. Also consider putting a numeric predicate
on your .value() function calls, like this: 'asasd[1]'. It would help to
have some representative XML data if you can supply some (and some expected
results). The XML schema can give the optimizer some hints that can help
performance, but you have to weigh the benefit against the extra cost of
validating your data against it. Another option is to perform the shredding
outside of SQL Server in your application. If you don't want to bother with
SQLXML Bulk Load, you can use .NET to shred the data and bulk load it.
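
With the placeholder names from your post, those two tips would make the
query look roughly like this (untested, and I have guessed nvarchar(255)
for the data types):

   INSERT INTO #TransformedData
   SELECT
       T0.T.value('asasd[1]', 'nvarchar(255)'),
       T1.T.value('asasd[1]', 'nvarchar(255)')
   FROM #XmlTable
   CROSS APPLY data.nodes('//asd') AS T0(T)
   CROSS APPLY T0.T.nodes('level1') AS T1(T);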

--
Thanks

Michael Coles
SQL Server MVP
Author, "Expert SQL Server 2008 Encryption"
(http://www.apress.com/book/view/1430224649)
----------------

From: Johnny Persson
Hi Erland,

Yes, below is a real example of one of our parsers. As far as I
understand, the node addressing is correct... or am I wrong? It has
happened before :)

SELECT
-- /A/B
TG.T.value('@name', 'nvarchar(255)'),
TG.T.value('(ElementX)[1]', 'nvarchar(255)'),
TG.T.value('(ElementY)[1]', 'nvarchar(255)'),

-- /A/B/C
TOS.T.value('@type', 'nvarchar(255)'),
TOS.T.value('@name', 'nvarchar(255)'),

-- /A/B/C/D
TOO.T.value('@name', 'nvarchar(255)'),
TOO.T.value('@id', 'nvarchar(255)')
FROM
#XmlTable
CROSS APPLY Data.nodes('/A/B') AS TG(T)
CROSS APPLY TG.T.nodes('C') AS TOS(T)
CROSS APPLY TOS.T.nodes('D') AS TOO(T)

Regards,
Johnny

On 2010-03-09 00:05, Erland Sommarskog wrote:
> Johnny Persson (a(a)a.a) writes:
>> The result is to me a bit strange. When we use a "typed xml column" the
>> execution time is ~15s and the subtree cost is 428. When we use the
>> normal, untyped xml column, the execution time is the same, BUT the
>> subtree cost is 866!
>
> The subtree cost is just an estimate, so take it for what it's worth.
>
> Schemabound XML may have some advantages if you extract a lot from the
> same document, but it can also cut the other way, as schema validation
> is quite expensive.
>
> As I said in another post, you should look at how you address the nodes.
> There is a lot to win in that area.
>
>
From: Johnny Persson
Hi,

I have now tried to shred the data in a C# console project. The
performance gain is huge.

Which is the best option to pass the data to SQL Server? Bulk load into
a table variable, or "send the data" through a table-valued CLR method?
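
To make the question concrete, for the table-valued option I picture a
table type and a procedure on the SQL side, roughly like this (the type,
procedure and table names are just placeholders):

   CREATE TYPE dbo.TransformedDataType AS TABLE
   (
       Name  nvarchar(255),
       Value nvarchar(255)
   );

   CREATE PROCEDURE dbo.LoadTransformedData
       @rows dbo.TransformedDataType READONLY
   AS
   INSERT INTO dbo.TransformedData (Name, Value)
   SELECT Name, Value FROM @rows;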

Regards,
Johnny

On 2010-03-09 03:31, Michael Coles wrote:
> "Johnny Persson" <a(a)a.a> wrote in message
> news:%23Ht3vLqvKHA.1796(a)TK2MSFTNGP02.phx.gbl...
>> INSERT INTO #TransformedData
>> SELECT
>>     T0.T.value('asasd', 'nvarchar(255)'),
>>     T1.T.value('asasd', 'nvarchar(255)')
>> FROM
>>     #XmlTable
>> CROSS APPLY
>>     data.nodes('asd') AS T0(T)
>> CROSS APPLY
>>     T0.T.nodes('level1') AS T1(T)
>
> Just adding to what Erland said, if your XML data looks like this:
>
> <level1>
> <asd>...</asd>
> <asd>...</asd>
> </level1>
>
> Consider using the '//asd' path. Also consider putting a numeric
> predicate on your .value() function calls, like this: 'asasd[1]'. It would
> help to have some representative XML data if you can supply some (and
> some expected results). The XML schema can give the optimizer some hints
> that can help performance, but you have to weigh the benefit against the
> extra cost of validating your data against it. Another option is to
> perform the shredding outside of SQL Server in your application. If you
> don't want to bother with SQLXML Bulk Load, you can use .NET to shred
> the data and bulk load it.
>