Data Factory: How to flatten json hierarchy
I have a JSON file in a blob container in an Azure Storage Account and I want to use the "Copy Data" activity in ADF to get the data into a SQL DB. I have also looked into using Data Flows in ADF, but haven't succeeded there either.
When I use the Copy Data activity, the output only contains the first entry in "lines".
The JSON file has the following hierarchy (a sample is included as text further down):
My goal is to have each "line" of each "order" as a separate row in the SQL DB.
EDIT 1: I am now using a Data Flow, and data is added to both the blob (sink1) and the SQL DB (sink2) the way I want, i.e. the data is flattened. The problem is that the Data Flow gives errors that I do not understand.
The flow looks like this:
And even though I have specified the file name in the Data Flow, the output file is named "part-00000-609332d2-8494-4b68-b481-f237f62cc6c8-c000.json".
The output error details of the pipeline that runs the Data Flow are as follows:
{"StatusCode":"DFExecutorUserError","Message":"Job failed due to reason: at Sink 'sink1': org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: This operation is not permitted on a non-empty directory.","Details":"org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: This operation is not permitted on a non-empty directory.\n\tat org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.delete(AzureNativeFileSystemStore.java:2607)\n\tat org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.delete(AzureNativeFileSystemStore.java:2617)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.deleteFile(NativeAzureFileSystem.java:2657)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem$2.execute(NativeAzureFileSystem.java:2391)\n\tat org.apache.hadoop.fs.azure.AzureFileSystemThreadPoolExecutor.executeParallel(AzureFileSystemThreadPoolExecutor.java:223)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.deleteWithoutAuth(NativeAzureFileSystem.java:2403)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.delete(NativeAzureFileSystem.java:2453)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.delete(NativeAzureFileSystem.java:1936)\n\tat org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter."}"
Here is a sample of the JSON data as text:
{ "customerId": 2241, "soidGt": null, "timestampGt": "2022-04-25T00:00:00", "timestampLt": null, "orders": [ { "soid": 68810264, "id": "a4b84f56-c6a4-4b37-bffb-a34d04c513c4", "tableId": 4676, "revenueUnitId": 682, "lines": [ { "solid": 147557444, "articleId": 70949, "quantity": 3, "taxPctValue": 25, "articleName": "Diavola", "netAmount": 516, "grossAmount": 645 }, { "solid": 147557445, "articleId": 70961, "quantity": 1, "taxPctValue": 25, "articleName": "Parma ai pomodori secchi", "netAmount": 183.2, "grossAmount": 229 } ], "payments": [ { "soptid": 70655447, "paymentTypeId": 2, "amount": 874 } ] }, { "soid": 68810622, "id": "1b356f45-7df7-42ba-8d50-8b14cf67180d", "tableId": 4546, "revenueUnitId": 83, "lines": [ { "solid": 147557985, "articleId": 71159, "quantity": 2, "taxPctValue": 25, "articleName": "Hansa 0,4L", "netAmount": 152, "grossAmount": 190 }, { "solid": 147557986, "articleId": 70948, "quantity": 1, "taxPctValue": 25, "articleName": "Parma", "netAmount": 175.2, "grossAmount": 219 }, { "solid": 147557987, "articleId": 70918, "quantity": 1, "taxPctValue": 25, "articleName": "Focaccia sarda", "netAmount": 71.2, "grossAmount": 89 }, { "solid": 147557988, "articleId": 70935, "quantity": 1, "taxPctValue": 25, "articleName": "Pasta di manzo", "netAmount": 196, "grossAmount": 245 } ], "payments": [ { "soptid": 70655798, "paymentTypeId": 2, "amount": 750 } ] }
2 answers
-
answered 2022-05-04 07:49
Grekkq
In Data Flow you can find the Flatten transformation, which can unroll arrays. It's the cyan one in the Formatters menu. According to this StackOverflow answer it is also possible with a ForEach loop, but that feels hacky.
-
answered 2022-05-04 12:48
wBob
Just in case you do ever consider using Azure SQL DB's built-in abilities to work with JSON, eg OPENJSON, JSON_VALUE and JSON_QUERY, here is a common pattern I use: land the data in a SQL table using Azure Data Factory (ADF), then work with it in SQL, eg:

CREATE TABLE #tmp (
    jsonId  INT IDENTITY PRIMARY KEY,
    [json]  NVARCHAR(MAX)
);
GO

DECLARE @json NVARCHAR(MAX) = '{ "customerId": 2241, "soidGt": null, "timestampGt": "2022-04-25T00:00:00", "timestampLt": null, "orders": [ { "soid": 68810264, "id": "a4b84f56-c6a4-4b37-bffb-a34d04c513c4", "tableId": 4676, "revenueUnitId": 682, "lines": [ { "solid": 147557444, "articleId": 70949, "quantity": 3, "taxPctValue": 25, "articleName": "Diavola", "netAmount": 516, "grossAmount": 645 }, { "solid": 147557445, "articleId": 70961, "quantity": 1, "taxPctValue": 25, "articleName": "Parma ai pomodori secchi", "netAmount": 183.2, "grossAmount": 229 } ], "payments": [ { "soptid": 70655447, "paymentTypeId": 2, "amount": 874 } ] }, { "soid": 68810622, "id": "1b356f45-7df7-42ba-8d50-8b14cf67180d", "tableId": 4546, "revenueUnitId": 83, "lines": [ { "solid": 147557985, "articleId": 71159, "quantity": 2, "taxPctValue": 25, "articleName": "Hansa 0,4L", "netAmount": 152, "grossAmount": 190 }, { "solid": 147557986, "articleId": 70948, "quantity": 1, "taxPctValue": 25, "articleName": "Parma", "netAmount": 175.2, "grossAmount": 219 }, { "solid": 147557987, "articleId": 70918, "quantity": 1, "taxPctValue": 25, "articleName": "Focaccia sarda", "netAmount": 71.2, "grossAmount": 89 }, { "solid": 147557988, "articleId": 70935, "quantity": 1, "taxPctValue": 25, "articleName": "Pasta di manzo", "netAmount": 196, "grossAmount": 245 } ], "payments": [ { "soptid": 70655798, "paymentTypeId": 2, "amount": 750 } ] } ] }'

INSERT INTO #tmp ( json )
VALUES ( @json );

-- CAST, JSON_VALUE, OPENJSON
SELECT
    JSON_VALUE( t.[json], '$.customerId' )  customerId,
    JSON_VALUE( t.[json], '$.soidGt' )      soidGt,
    JSON_VALUE( t.[json], '$.timestampGt' ) timestampGt,
    JSON_VALUE( t.[json], '$.timestampLt' ) timestampLt,
    -- orders
    JSON_VALUE( o.[value], '$.soid' )          soid,
    JSON_VALUE( o.[value], '$.id' )            id,
    JSON_VALUE( o.[value], '$.tableId' )       tableId,
    JSON_VALUE( o.[value], '$.revenueUnitId' ) revenueUnitId,
    -- lines
    JSON_VALUE( l.[value], '$.solid' )       solid,
    JSON_VALUE( l.[value], '$.articleId' )   articleId,
    JSON_VALUE( l.[value], '$.quantity' )    quantity,
    JSON_VALUE( l.[value], '$.taxPctValue' ) taxPctValue,
    JSON_VALUE( l.[value], '$.articleName' ) articleName,
    JSON_VALUE( l.[value], '$.netAmount' )   netAmount,
    JSON_VALUE( l.[value], '$.grossAmount' ) grossAmount,
    -- payments
    JSON_VALUE( p.[value], '$.soptid' )        soptid,
    JSON_VALUE( p.[value], '$.paymentTypeId' ) paymentTypeId,
    JSON_VALUE( p.[value], '$.amount' )        amount
FROM #tmp t
    CROSS APPLY OPENJSON( t.[json], '$.orders' ) o
    CROSS APPLY OPENJSON( o.[value], '$.lines' ) l
    CROSS APPLY OPENJSON( o.[value], '$.payments' ) p;
My results:
This is more of an ELT (and code-heavy) approach that takes advantage of the compute you already have running (Azure SQL DB), rather than an ETL (low-code) approach such as Mapping Data Flows, which spins up additional compute in the background. It's all managed for you, but there's always more than one way to do things; whatever works for you.
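A side note not from the original answer: the same pattern can also be written with OPENJSON ... WITH clauses, so the unrolled columns come back already typed instead of as strings from JSON_VALUE. A minimal sketch, assuming the same #tmp table and limited to the "lines" level:

-- Sketch only: typed columns via OPENJSON ... WITH, one row per order line
SELECT  j.customerId,
        o.soid,
        l.*
FROM #tmp t
CROSS APPLY OPENJSON( t.[json] )
    WITH (
        customerId INT            '$.customerId',
        orders     NVARCHAR(MAX)  '$.orders' AS JSON   -- pass the nested array down
    ) j
CROSS APPLY OPENJSON( j.orders )
    WITH (
        soid   INT            '$.soid',
        lines  NVARCHAR(MAX)  '$.lines' AS JSON        -- pass the nested array down
    ) o
CROSS APPLY OPENJSON( o.lines )
    WITH (
        solid        INT            '$.solid',
        articleId    INT            '$.articleId',
        quantity     INT            '$.quantity',
        taxPctValue  DECIMAL(5,2)   '$.taxPctValue',
        articleName  NVARCHAR(200)  '$.articleName',
        netAmount    DECIMAL(18,2)  '$.netAmount',
        grossAmount  DECIMAL(18,2)  '$.grossAmount'
    ) l;

The AS JSON columns (orders, lines) just hand the nested arrays to the next CROSS APPLY, so the final result is still one row per order line.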