Druid Cookbook

In the previous episode of the Druid Data Cookbook, I showed how to extract and process all elements out of nested array of objects in Druid. But what if the elements you want to process are scattered over different levels of the object hierarchy? By extending the UNNEST paradigm, we can handle these cases too!

A data sample

Our data looks like this:

{
  "timestamp": "2024-09-01",
  "id": "1",
  "org": {
    "name": "Team 1",
    "members": [
      {
        "name": "Alice",
        "gender": "F"
      },
      {
        "name": "Bob",
        "gender": "M"
      },
      {
        "name": "Carol",
        "gender": "F"
      }
    ]
  }
}

Note how there are name fields at the first nesting leel of org, but also at the memberslevel.

Here’s the full dataset in jsonl format:

{ "timestamp": "2024-09-01", "id": "1", "org": { "name": "Team 1", "members": [ { "name": "Alice", "gender": "F" }, { "name": "Bob", "gender": "M" }, { "name": "Carol", "gender": "F" } ] } }
{ "timestamp": "2024-09-01", "id": "2", "org": { "name": "Team 2", "members": [ { "name": "Dan", "gender": "M" }, { "name": "Eve", "gender": "F" }, { "name": "Frank", "gender": "M" } ] } }

Load this data sample into Druid (version 30 or higher.) It should be easy to ingest using the wizard, or you can submit this query:

REPLACE INTO "teams_nested" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"inline","data":"{ \"timestamp\": \"2024-09-01\", \"id\": \"1\", \"org\": { \"name\": \"Team 1\", \"members\": [ { \"name\": \"Alice\", \"gender\": \"F\" }, { \"name\": \"Bob\", \"gender\": \"M\" }, { \"name\": \"Carol\", \"gender\": \"F\" } ] } }\n{ \"timestamp\": \"2024-09-01\", \"id\": \"2\", \"org\": { \"name\": \"Team 2\", \"members\": [ { \"name\": \"Dan\", \"gender\": \"M\" }, { \"name\": \"Eve\", \"gender\": \"F\" }, { \"name\": \"Frank\", \"gender\": \"M\" } ] } }"}',
      '{"type":"json"}'
    )
  ) EXTEND ("timestamp" VARCHAR, "id" VARCHAR, "org" TYPE('COMPLEX<json>'))
)
SELECT
  TIME_PARSE(TRIM("timestamp")) AS "__time",
  "id",
  "org"
FROM "ext"
PARTITIONED BY DAY

Extracting the leaf paths

Now, because the data does not neatly come in an array, we cannot extract elements at a specified level. But we can do another trick: we can obtain an array of the JSONPaths of all leaf elements in an object by calling JSON_PATHS. Run this query:

SELECT
  t.__time,
  t.id,
  t.org,
  x.leaf_path
FROM "teams_nested" t CROSS JOIN UNNEST(JSON_PATHS(org)) x("leaf_path")

It returns one row for each leaf element in the org object.

UNNEST query

In the next step, let’s feed these values back into JSON_VALUE.

(This will really only work with Druid 30 or better. In earlier versions, JSON_PATHS required a string literal as its second element, limiting its flexibility.)

Some query examples

We can use the above query as a CTE and do further processing in the main query.

Get all name fields, regardless of the hierarchy level, using a LIKE filter:

WITH cte AS (
  SELECT
    t.__time,
    t.id,
    t.org,
    x.leaf_path
  FROM "teams_nested" t CROSS JOIN UNNEST(JSON_PATHS(org)) x("leaf_path")
)
SELECT
  __time,
  id,
  leaf_path,
  JSON_VALUE(org, "leaf_path")
FROM cte
WHERE leaf_path LIKE '%.name'

All names

Or put all those names back into an array per team with an array aggregator:

WITH cte AS (
  SELECT
    t.__time,
    t.id,
    t.org,
    x.leaf_path
  FROM "teams_nested" t CROSS JOIN UNNEST(JSON_PATHS(org)) x("leaf_path")
)
SELECT
  __time,
  id,
  ARRAY_AGG(JSON_VALUE(org, "leaf_path"))
FROM cte
WHERE leaf_path LIKE '%.name'
GROUP BY __time, id

All names, array

You can do arbitrary gymnastics on the JSONPath expressions using regular expression filters. Here I use this technique to emulate a JSONPath expression like "$.members[*].gender":

WITH cte AS (
  SELECT
    t.__time,
    t.id,
    t.org,
    x.leaf_path
  FROM "teams_nested" t CROSS JOIN UNNEST(JSON_PATHS(org)) x("leaf_path")
)
SELECT
  __time,
  id,
  leaf_path,
  JSON_VALUE(org, "leaf_path")
FROM cte
WHERE REGEXP_LIKE(leaf_path, 'members\[\d+\]\.gender$')

JSON wildcard emulation with regex

Conclusion

  • By unnesting the result of JSON_PATHS, you get access to the entire structure of a nested (JSON) object in Druid.
  • In conjunction with regular expression filters, you get processing capabilities that are almost as powerful as jq.

This image is taken from Page 500 of Praktisches Kochbuch für die gewöhnliche und feinere Küche” by Medical Heritage Library, Inc. is licensed under CC BY-NC-SA 2.0 .