Data Factory: How to flatten json hierarchy

I have a JSON file in a blob container in an Azure Storage Account and I want to use the "Copy Data" activity in ADF to get the data into a SQL DB. I have also looked into using Data Flows in ADF but haven't succeeded there either.

Now, when I use the Copy Data activity, the output only contains the first entry in "lines".

The JSON file has the following hierarchy:

[Screenshot: the JSON file's hierarchy]

My goal is to have each "line" of each "order" as a separate row in the SQL DB.

EDIT 1: I am using Data Flows, and data is added to both the blob (sink1) and the SQL DB (sink2) the way I want, i.e. the data is flattened. The problem is that the Data Flow gives errors that I do not understand.

The flow looks like this:

[Screenshot: the Data Flow]

And even though I have specified the file name in the Data Flow, the output file is named "part-00000-609332d2-8494-4b68-b481-f237f62cc6c8-c000.json".

The output error details of the pipeline that runs the Data Flow are as follows:

{"StatusCode":"DFExecutorUserError","Message":"Job failed due to reason: at Sink 'sink1': org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: This operation is not permitted on a non-empty directory.","Details":"org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: This operation is not permitted on a non-empty directory.\n\tat org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.delete(AzureNativeFileSystemStore.java:2607)\n\tat org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.delete(AzureNativeFileSystemStore.java:2617)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.deleteFile(NativeAzureFileSystem.java:2657)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem$2.execute(NativeAzureFileSystem.java:2391)\n\tat org.apache.hadoop.fs.azure.AzureFileSystemThreadPoolExecutor.executeParallel(AzureFileSystemThreadPoolExecutor.java:223)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.deleteWithoutAuth(NativeAzureFileSystem.java:2403)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.delete(NativeAzureFileSystem.java:2453)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.delete(NativeAzureFileSystem.java:1936)\n\tat org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter."}"

Here is a sample of the JSON data as text:

{ "customerId": 2241, "soidGt": null, "timestampGt": "2022-04-25T00:00:00", "timestampLt": null, "orders": [ { "soid": 68810264, "id": "a4b84f56-c6a4-4b37-bffb-a34d04c513c4", "tableId": 4676, "revenueUnitId": 682, "lines": [ { "solid": 147557444, "articleId": 70949, "quantity": 3, "taxPctValue": 25, "articleName": "Diavola", "netAmount": 516, "grossAmount": 645 }, { "solid": 147557445, "articleId": 70961, "quantity": 1, "taxPctValue": 25, "articleName": "Parma ai pomodori secchi", "netAmount": 183.2, "grossAmount": 229 } ], "payments": [ { "soptid": 70655447, "paymentTypeId": 2, "amount": 874 } ] }, { "soid": 68810622, "id": "1b356f45-7df7-42ba-8d50-8b14cf67180d", "tableId": 4546, "revenueUnitId": 83, "lines": [ { "solid": 147557985, "articleId": 71159, "quantity": 2, "taxPctValue": 25, "articleName": "Hansa 0,4L", "netAmount": 152, "grossAmount": 190 }, { "solid": 147557986, "articleId": 70948, "quantity": 1, "taxPctValue": 25, "articleName": "Parma", "netAmount": 175.2, "grossAmount": 219 }, { "solid": 147557987, "articleId": 70918, "quantity": 1, "taxPctValue": 25, "articleName": "Focaccia sarda", "netAmount": 71.2, "grossAmount": 89 }, { "solid": 147557988, "articleId": 70935, "quantity": 1, "taxPctValue": 25, "articleName": "Pasta di manzo", "netAmount": 196, "grossAmount": 245 } ], "payments": [ { "soptid": 70655798, "paymentTypeId": 2, "amount": 750 } ] }

2 answers

  • answered 2022-05-04 07:49 Grekkq

    In a Data Flow you can find the Flatten transformation, which can unroll arrays. It's the cyan one in the Formatters menu.

    According to this StackOverflow answer it is also possible with a ForEach loop, but it feels hacky.

  • answered 2022-05-04 12:48 wBob

    Just in case you ever do consider using Azure SQL DB's built-in abilities to work with JSON, e.g. OPENJSON, JSON_VALUE and JSON_QUERY, here is a common pattern I use: land the data in a SQL table using Azure Data Factory (ADF), then work with it in SQL, e.g.:

    -- Staging table: ADF lands the raw JSON here, one document per row
    CREATE TABLE #tmp ( 
        jsonId INT IDENTITY PRIMARY KEY, 
        [json] NVARCHAR(MAX)
        );
    GO
    
    DECLARE @json NVARCHAR(MAX) = '{
        "customerId": 2241,
        "soidGt": null,
        "timestampGt": "2022-04-25T00:00:00",
        "timestampLt": null,
        "orders": [
            {
                "soid": 68810264,
                "id": "a4b84f56-c6a4-4b37-bffb-a34d04c513c4",
                "tableId": 4676,
                "revenueUnitId": 682,
                "lines": [
                    {
                        "solid": 147557444,
                        "articleId": 70949,
                        "quantity": 3,
                        "taxPctValue": 25,
                        "articleName": "Diavola",
                        "netAmount": 516,
                        "grossAmount": 645
                    },
                    {
                        "solid": 147557445,
                        "articleId": 70961,
                        "quantity": 1,
                        "taxPctValue": 25,
                        "articleName": "Parma ai pomodori secchi",
                        "netAmount": 183.2,
                        "grossAmount": 229
                    }
                ],
                "payments": [
                    {
                        "soptid": 70655447,
                        "paymentTypeId": 2,
                        "amount": 874
                    }
                ]
            },
            {
                "soid": 68810622,
                "id": "1b356f45-7df7-42ba-8d50-8b14cf67180d",
                "tableId": 4546,
                "revenueUnitId": 83,
                "lines": [
                    {
                        "solid": 147557985,
                        "articleId": 71159,
                        "quantity": 2,
                        "taxPctValue": 25,
                        "articleName": "Hansa 0,4L",
                        "netAmount": 152,
                        "grossAmount": 190
                    },
                    {
                        "solid": 147557986,
                        "articleId": 70948,
                        "quantity": 1,
                        "taxPctValue": 25,
                        "articleName": "Parma",
                        "netAmount": 175.2,
                        "grossAmount": 219
                    },
                    {
                        "solid": 147557987,
                        "articleId": 70918,
                        "quantity": 1,
                        "taxPctValue": 25,
                        "articleName": "Focaccia sarda",
                        "netAmount": 71.2,
                        "grossAmount": 89
                    },
                    {
                        "solid": 147557988,
                        "articleId": 70935,
                        "quantity": 1,
                        "taxPctValue": 25,
                        "articleName": "Pasta di manzo",
                        "netAmount": 196,
                        "grossAmount": 245
                    }
                ],
                "payments": [
                    {
                        "soptid": 70655798,
                        "paymentTypeId": 2,
                        "amount": 750
                    }
                ]
            }
        ]
    }'
    
    
    INSERT INTO #tmp ( json ) VALUES ( @json );
    
    -- CAST, JSON_VALUE, OPENJSON
    SELECT
        JSON_VALUE( t.[json], '$.customerId' ) customerId,
        JSON_VALUE( t.[json], '$.soidGt' ) soidGt,
        JSON_VALUE( t.[json], '$.timestampGt' ) timestampGt,
        JSON_VALUE( t.[json], '$.timestampLt' ) timestampLt,
    
        -- orders
        JSON_VALUE( o.[value], '$.soid' ) soid,
        JSON_VALUE( o.[value], '$.id' ) id,
        JSON_VALUE( o.[value], '$.tableId' ) tableId,
        JSON_VALUE( o.[value], '$.revenueUnitId' ) revenueUnitId,
    
        -- lines
        JSON_VALUE( l.[value], '$.solid' ) solid,
        JSON_VALUE( l.[value], '$.articleId' ) articleId,
        JSON_VALUE( l.[value], '$.quantity' ) quantity,
        JSON_VALUE( l.[value], '$.taxPctValue' ) taxPctValue,
        JSON_VALUE( l.[value], '$.articleName' ) articleName,
        JSON_VALUE( l.[value], '$.netAmount' ) netAmount,
        JSON_VALUE( l.[value], '$.grossAmount' ) grossAmount,
    
        -- payments
        JSON_VALUE( p.[value], '$.soptid' ) soptid,
        JSON_VALUE( p.[value], '$.paymentTypeId' ) paymentTypeId,
        JSON_VALUE( p.[value], '$.amount' ) amount
    
    FROM #tmp t
        CROSS APPLY OPENJSON( t.[json], '$.orders' ) o
            CROSS APPLY OPENJSON( o.[value], '$.lines' ) l
            CROSS APPLY OPENJSON( o.[value], '$.payments' ) p;
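    -- Note: because lines and payments are both CROSS APPLYed, each order
    -- returns one row per (line, payment) combination; in this sample every
    -- order has a single payment, so the result is one row per order line.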
    

    My results:

    [Screenshot: the flattened result set]

    This is more of an ELT (and code-heavy) approach that takes advantage of the compute you already have running (Azure SQL DB), rather than an ETL approach such as Mapping Data Flows (low code), which spins up additional compute in the background. Whatever works for you; it's all managed for you, but there's always more than one way to do things.
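
    If you prefer typed columns rather than the NVARCHAR values that JSON_VALUE returns, the same flattening can also be written with OPENJSON and an explicit WITH schema. The following is only a rough sketch of that variant (it is not part of the original answer, and the column types are assumptions based on the sample data):

    -- Sketch only: flatten orders and lines with typed OPENJSON schemas
    -- (the types are guesses based on the sample JSON)
    SELECT
        JSON_VALUE( t.[json], '$.customerId' ) customerId,
        o.soid, o.id, o.tableId, o.revenueUnitId,
        l.solid, l.articleId, l.quantity, l.taxPctValue,
        l.articleName, l.netAmount, l.grossAmount
    FROM #tmp t
        CROSS APPLY OPENJSON( t.[json], '$.orders' )
            WITH (
                soid          INT              '$.soid',
                id            UNIQUEIDENTIFIER '$.id',
                tableId       INT              '$.tableId',
                revenueUnitId INT              '$.revenueUnitId',
                [lines]       NVARCHAR(MAX)    '$.lines' AS JSON
            ) o
        CROSS APPLY OPENJSON( o.[lines] )
            WITH (
                solid        INT            '$.solid',
                articleId    INT            '$.articleId',
                quantity     INT            '$.quantity',
                taxPctValue  INT            '$.taxPctValue',
                articleName  NVARCHAR(200)  '$.articleName',
                netAmount    DECIMAL(18,2)  '$.netAmount',
                grossAmount  DECIMAL(18,2)  '$.grossAmount'
            ) l;

    The payments array could be unrolled the same way with a third CROSS APPLY if you also need the payment columns.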
