For the best part of the last 12 months I am using the Azure stack in one of our clients Data Warehouse projects. Although the vast majority of our input files are comma separated (in some way), there are exceptions. One of these exceptions are XML files. Surprisingly enough they are not supported out of the box in the Copy Activity of the Azure Data Factory and I had to do some workarounds to get it going.
Good news! This workaround is obsolete by now, as Azure Data Factory now natively supports XML datatype for datasets.
Let’s start at the beginning - the naive way
The XML files should be loaded in the Staging Area in our SQL Server instance. Starting there the data is integrated into our Core Data Warehouse.
That task of staging sounded like a perfect fit for the Copy Activity of the Azure Data Factory. I created a pipeline and added a Copy Activity. My sink was a dataset pointing to a table named XML_Staging in the SQL Server. The SQL Server has good support to save and query XML. So I decided to use the native XML data type for now:
CREATE TABLE [dbo].[XML_Staging](
[XmlCol] [xml] NULL
) ON [PRIMARY]
The source XML files are copied to an Azure Data Lake Storage (v2). Creating a new dataset to use as source of the Copy Activity was a dead end: Alongside JSON, Delimited Files, and Binary files there are things like Avro, ORC, and Parquet. But no XML!!
Luckily there is a way around that. I remembered a Microsoft article about processing fixed-length text files with the Azure Data Factory - which isn’t supported right away either. In short: use a DelimitedText dataset with no column delimiter - meaning: read the full row at once - and split it into substrings in a Data Flow to derive all columns needed.
So why not read the full XML file at once from a pseudo DelimitedText dataset and dump it into the database?
Surprisingly that worked with my sample file: Just be careful with the connection parameters of the dataset. My XML files used Line Feed (\n) as line ending, so I configured the Row Delimiter of the dataset to be Carriage return (\r). As Column Delimiter I configured a Tab (\t) which would not appear in the data of the XML itself. I would have prefered to use “No delimiter” instead but that did not pass debugging / validation of the pipeline.
And that is all the magic…
With that configuration all there was left to do in the Copy Activity is to define a Mapping. The source dataset (configured as explained above) provided a single column named 1 which had to be mapped to the XmlCol of the database table.
Data truncation - Oh, there are limits!?
Of course there are limits. I created several sample files with 1.000, 10.000, and 30.000 customer records in XML format. That’s a five-liner using the FOR XML PATH clause in a SELECT statement in the SQL Server:
SELECT TOP (30000) [CUSTOMERNO]
FOR XML PATH, ROOT('Customers')
The resulting files were 0.2 to 4.5MB of size.
Running the Azure Data Factory pipeline to read the XML files into the database shows no errors, but only some data ends up in the target table. According to the logs the file is read completely, data is written to database but that :
I figured out the maximum file size is 2MB to be read completely. My interpretation is this is a limit of the Copy Activity. The docs stated that data is mapped to interim data types and then converted to native sink types. According to a statement of a Microsoft empoyee, there is no such limit for the Copy Activity. Anyhow…
Becoming more creative with Stored Procedures
The 2MB limit was a No-Go in my situation. Our XML files would almost always exceed that. But before rejecting this 3-Minute-Good-To-Go solution I wanted to test another option:
It is possible to invoke a stored sprocedure from a SQL sink to process the incoming data. I was hoping the best to get around the limit. When the incoming data would be passed to the procedure without the double conversion mentioned above, it could be working?! tl;dr: It didn’t work.
So I followed the instructions from the docs, created a Table Type in the SQL Server with one column to hold the incoming string, and a stored procedure. Using the Table Type should allow us to handle data larger that 4000 Unicode characters. The xml data type is limited to 2GB:
CREATE TYPE [dbo].[XmlDumpType] AS TABLE(
[xml] xml NOT NULL
GOCREATE PROCEDURE dbo.spOverwriteXmlStaging @inputs [dbo].[XmlDumpType] READONLY
BEGINMERGE dbo.XML_Staging AS target
USING @inputs AS source
WHEN MATCHED THEN
UPDATE SET XmlCol = source.xml
WHEN NOT MATCHED THEN
Merging was not neccessary because the target table was truncated before loading using a preCopyScript. But I wanted to be as close to the sample code in the docs as possible. The pipeline code as follows:
"value": "TRUNCATE TABLE [dbo].[XML_Staging];",
The result is comparable to the version without stored procedure: 4.676.487 bytes read, 9.352.968 bytes written. Doing a
SELECT TOP 1 XmlCol FROM dbo.XML_Staging
shows the data is still truncated.
At that point I reached out to Microsoft via Twitter.
UPDATE: Finally I solved the problem by decomposing the XML string in the stored procedure.
I couldn’t stop thinking about the “dataWritten” property of the log. When that value was correct, all data got passed to the SQL Server and neither the Copy Activity nor the Azure Data Factory were the culprits.
To get into the problem in more detail, I added another numeric column named XmlLength to the XML_Staging table, and altered the MERGE statement of the stored procedure to additionally log LEN(source.xml) into the new field. After another run of the pipeline I knew for certainty the complete XML string was available to the stored procedure!!
No flat staging, but decompose the XML
As stated initially, the XML should be stored in the Staging Area of our Data Warehouse to be decomposed and integrated into the Core data model afterwards. But at that point - knowing the dumping ofthe complete string into a single XML column wouldn’t work- it was time to add decomposition of XML to the stored procedure itself. So the extended version of the procesure looked like that:
CREATE PROCEDURE dbo.spOverwriteXmlStaging @inputs [dbo].[XmlDumpType] READONLY
BEGINDECLARE @XMLDOC xml;SET @XMLDOC = (SELECT TOP 1 xml FROM @inputs);INSERT INTO dbo.CUSTOMERS_FROM_XML
doc.col.value('CUSTOMERNO', 'bigint') AS [CUSTOMERNO],
doc.col.value('OPERATION', 'nvarchar(100)') AS [OPERATION],
doc.col.value('EMAIL', 'nvarchar(1000)') AS [EMAIL]
FROM @XMLDOC.nodes('//row') doc(col)END
Basically the procedure now selected every row of the XML, extracted the information needed with a suitable data type and inserted it into a new table. More on the XML functions available in SQL Server here.
At last, problem solved!