<p><em>Hellmar Becker’s Blog: Random thoughts and technical experiments</em></p>
<h1><a href="/2024/03/01/new-in-druid-29-exporting-query-results">New in Druid 29: Exporting Query Results</a></h1>
<p><em>2024-03-01</em></p>
<p><img src="/assets/2021-12-21-elf.jpg" alt="Druid Cookbook" /></p>
<h2 id="the-problem">The problem</h2>
<p>Customers often come to me with the requirement to extract large and/or detailed data sets from Druid, and to store the results in a well-known format for further processing by other tools. With <a href="https://druid.apache.org/docs/latest/multi-stage-query/concepts#multi-stage-query-task-engine">multi-stage query</a>, you can issue an asynchronous query against deep storage that handles (almost) unlimited amounts of data.</p>
<p>However, obtaining a result is a multi step process:</p>
<ul>
<li>First, <a href="https://druid.apache.org/docs/latest/api-reference/sql-api#submit-a-query-1">submit the query</a>;</li>
<li>then <a href="https://druid.apache.org/docs/latest/api-reference/sql-api#get-query-status">poll the task endpoint</a> until it is done</li>
<li>and finally, <a href="https://druid.apache.org/docs/latest/api-reference/sql-api#get-query-results">retrieve the result</a>.</li>
</ul>
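<p>In practice, the flow looks roughly like this against a quickstart router; a sketch using <code class="language-plaintext highlighter-rouge">curl</code> and <code class="language-plaintext highlighter-rouge">jq</code>, with endpoint paths as in the linked API reference (response field names may vary by version, and the query itself is just an example):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># 1. submit the query to the asynchronous SQL statements API
QUERY_ID=$(curl -s -X POST http://localhost:8888/druid/v2/sql/statements \
  -H 'Content-Type: application/json' \
  -d '{"query": "SELECT channel, COUNT(*) AS cnt FROM wikipedia GROUP BY channel"}' \
  | jq -r '.queryId')

# 2. poll the status endpoint until the state is SUCCESS
curl -s "http://localhost:8888/druid/v2/sql/statements/$QUERY_ID" | jq -r '.state'

# 3. retrieve the result
curl -s "http://localhost:8888/druid/v2/sql/statements/$QUERY_ID/results"
</code></pre></div></div>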
<p>Behind the scenes, the data that you download in step 3 has already been written to a storage location inside Druid. You can define a path and even instruct Druid to use <a href="https://druid.apache.org/docs/latest/operations/durable-storage#enable-durable-storage">durable storage</a> for query results, but this data is still in a Druid-specific format and cannot easily be read by other tools.</p>
<p>What if we could skip that step (persisting the result) completely and send the result directly to a file in a format of our choice?</p>
<p>Druid 29 can do this. For now, the feature is somewhat limited: it only supports CSV, and it can only export to the local filesystem or S3. But other formats, such as Parquet, are coming.</p>
<p>Let’s try this out with a <a href="https://druid.apache.org/docs/latest/tutorials/">Druid Quickstart</a> installation!</p>
<p>In this tutorial, you will</p>
<ul>
<li>learn how to configure the settings for MSQ export</li>
<li>export a sample dataset.</li>
</ul>
<h2 id="preparation">Preparation</h2>
<p>We are going to export to local storage. To limit the attack surface for malicious or inexperienced users, you have to define a specific filesystem path where Druid is allowed to store export files.</p>
<p>On your local machine, install Druid 29 from the <a href="https://druid.apache.org/downloads/">tarball</a>.</p>
<p>Create a directory <code class="language-plaintext highlighter-rouge">/tmp/druid-export</code> on your local disk.</p>
<p>In your Druid installation, edit the file <code class="language-plaintext highlighter-rouge">conf/druid/auto/_common/common.runtime.properties</code> and add the line</p>
<div class="language-properties highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">druid.export.storage.baseDir</span><span class="p">=</span><span class="s">/tmp/druid-export</span>
</code></pre></div></div>
<p>at the end of the file.</p>
<p>Then start Druid like so, from within your Druid install directory:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bin/start-druid <span class="nt">-m5g</span>
</code></pre></div></div>
<p>Ingest the <em>wikipedia</em> sample data following the instructions using either <a href="https://druid.apache.org/docs/latest/tutorials/#load-data">classic batch</a> or <a href="https://druid.apache.org/docs/latest/tutorials/tutorial-msq-extern">SQL ingestion</a>.</p>
<p>Then go to the <code class="language-plaintext highlighter-rouge">Query</code> tab in the Druid console.</p>
<h2 id="exporting-data">Exporting data</h2>
<p>Run this query:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span>
<span class="n">EXTERN</span><span class="p">(</span><span class="k">local</span><span class="p">(</span><span class="n">exportPath</span> <span class="o">=></span> <span class="s1">'/tmp/druid-export/wikipedia-export'</span><span class="p">))</span>
<span class="k">AS</span> <span class="n">CSV</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">wikipedia</span>
</code></pre></div></div>
<p><img src="/assets/2024-03-01-01.jpg" alt="Screenshot of running query" /></p>
<p>When the query finishes, check the export directory and you will find a CSV file containing the data:</p>
<p><img src="/assets/2024-03-01-02.jpg" alt="Preview of result file in a shell window" /></p>
<p>Note: the target directory has to be empty; otherwise you will get an error message.</p>
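<p>If you want to rerun the export, clear out the previous result first, for example:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># remove the previous export so the target directory is empty again
rm -rf /tmp/druid-export/wikipedia-export
</code></pre></div></div>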
<p>This also works for export to <a href="https://druid.apache.org/docs/latest/multi-stage-query/reference/#s3">S3</a>.</p>
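<p>The query shape is the same for S3; here is a sketch with a placeholder bucket and prefix (on top of this, S3 export needs the S3 extension and the server-side export settings described in the linked reference):</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INSERT INTO
EXTERN(s3(bucket => 'my-bucket', prefix => 'druid-exports/wikipedia'))
AS CSV
SELECT * FROM wikipedia
</code></pre></div></div>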
<h2 id="learnings">Learnings</h2>
<ul>
<li>With MSQ, you can now export query results directly to external storage.</li>
<li>This is a new feature in Druid 29. It is currently limited to CSV format and either local storage or S3, but expect more options to be added soon.</li>
</ul>
<hr />
<p>“<a href="https://www.flickr.com/photos/mhlimages/48051262646/">This image is taken from Page 500 of Praktisches Kochbuch für die gewöhnliche und feinere Küche</a>” by <a href="https://www.flickr.com/photos/mhlimages/">Medical Heritage Library, Inc.</a> is licensed under <a target="_blank" rel="noopener noreferrer" href="https://creativecommons.org/licenses/by-nc-sa/2.0/">CC BY-NC-SA 2.0 <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/nc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /></a>.</p>Druid 29 Preview: Transposing Data with PIVOT and UNPIVOT2024-01-15T00:00:00+01:002024-01-15T00:00:00+01:00/2024/01/15/druid-29-preview-transposing-data-with-PIVOT-and-UNPIVOT<p><img src="/assets/2021-12-21-elf.jpg" alt="Druid Cookbook" /></p>
<p>Imagine that you are tasked with getting a spreadsheet of sales data into Druid that looks like this:</p>
<p><img src="/assets/2024-01-15-01-rawdata-table.png" alt="Raw data in table format" /></p>
<p>You’ve got the sales figures in the cells, with the regions down the rows and the years across the columns. While you <em>can</em> work with the data in this form in Druid, it may not be your best option. Druid 29 brings two new SQL functions that can help with transforming the data into a format that is better suited for analytics. Let’s see how that works!</p>
<h2 id="getting-set-up">Getting set up</h2>
<p>This is a sneak peek into Druid 29 functionality. In order to use the new functions, you can (as of the time of writing) <a href="https://druid.apache.org/docs/latest/development/build.html">build Druid</a> from the HEAD of the master branch:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/apache/druid.git
<span class="nb">cd </span>druid
mvn clean <span class="nb">install</span> <span class="nt">-Pdist</span> <span class="nt">-DskipTests</span>
</code></pre></div></div>
<p>Then follow the instructions to locate and install the tarball. Make sure you have <a href="https://druid.apache.org/docs/latest/multi-stage-query/#load-the-extension">the <code class="language-plaintext highlighter-rouge">druid-multi-stage-query</code> extension enabled</a>.</p>
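<p>In my builds, the distribution tarball ends up under <code class="language-plaintext highlighter-rouge">distribution/target</code>; a sketch of the unpack step (the exact version string will differ, and the target directory is just an example):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># the -Pdist profile produces a binary tarball under distribution/target
mkdir -p /tmp/druid-test
tar -xzf distribution/target/apache-druid-*-bin.tar.gz -C /tmp/druid-test
</code></pre></div></div>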
<p>In this tutorial, you will</p>
<ul>
<li>learn how to use the <code class="language-plaintext highlighter-rouge">PIVOT</code> and <code class="language-plaintext highlighter-rouge">UNPIVOT</code> functions to transpose rows into columns and vice versa</li>
<li>and use this knowledge to transform a dataset during ingestion in Druid.</li>
</ul>
<p><em><strong>Disclaimer:</strong> This tutorial uses undocumented functionality and unreleased code. This blog is neither endorsed by Imply nor by the Apache Druid PMC. It merely collects the results of personal experiments. The features described here might, in the final release, work differently, or not at all. In addition, the entire build, or execution, may fail. Your mileage may vary.</em></p>
<h2 id="ingesting-the-data">Ingesting the data</h2>
<p>The dataset is very simple and looks like this:</p>
<pre><code class="language-csv">region,2022,2023
Central,215000,240000
East,350000,360000
West,415000,450000
</code></pre>
<p>The easiest way to get these data into Druid is with the ingestion wizard in the Druid console, using the <code class="language-plaintext highlighter-rouge">Paste data</code> input source:</p>
<p><img src="/assets/2024-01-15-02-ingest.jpg" alt="Druid wizard with Paste data sample" /></p>
<p>Run the ingestion wizard; make sure to give a meaningful name to the target datasource. Or you can paste the SQL below directly into a query window:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"sales_data"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"inline","data":"region,2022,2023</span><span class="se">\n</span><span class="s1">Central,215000,240000</span><span class="se">\n</span><span class="s1">East,350000,360000</span><span class="se">\n</span><span class="s1">West,415000,450000"}'</span><span class="p">,</span>
<span class="s1">'{"type":"csv","findColumnsFromHeader":true}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"region"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"2022"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"2023"</span> <span class="nb">BIGINT</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="nb">TIMESTAMP</span> <span class="s1">'2000-01-01 00:00:00'</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"region"</span><span class="p">,</span>
<span class="nv">"2022"</span><span class="p">,</span>
<span class="nv">"2023"</span>
<span class="k">FROM</span> <span class="nv">"ext"</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">ALL</span>
</code></pre></div></div>
<h2 id="pivot---transpose-rows-to-columns"><code class="language-plaintext highlighter-rouge">PIVOT</code> - transpose rows to columns</h2>
<p>Let’s represent the data in a different form. We want one column per region and per year. Here is the query for this transformation:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="nv">"sales_data"</span>
<span class="n">PIVOT</span> <span class="p">(</span>
<span class="k">SUM</span><span class="p">(</span><span class="nv">"2022"</span><span class="p">)</span> <span class="k">AS</span> <span class="n">sales_2022</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="nv">"2023"</span><span class="p">)</span> <span class="k">AS</span> <span class="n">sales_2023</span>
<span class="k">FOR</span> <span class="nv">"region"</span> <span class="k">IN</span> <span class="p">(</span><span class="s1">'East'</span> <span class="k">AS</span> <span class="n">east</span><span class="p">,</span> <span class="s1">'Central'</span> <span class="k">AS</span> <span class="n">central</span><span class="p">))</span>
</code></pre></div></div>
<p><img src="/assets/2024-01-15-03-pivot.jpg" alt="PIVOT query" /></p>
<p>A few things worth noting:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">PIVOT</code> takes a list of <em>aggregations over existing value columns</em> to calculate the values in the final columns.</li>
<li>The aggregations are needed because <code class="language-plaintext highlighter-rouge">PIVOT</code> implicitly <em>groups by the values</em> in the value columns.</li>
<li>The <code class="language-plaintext highlighter-rouge">FOR</code> clause lists the <em>pivot column</em>.</li>
<li>To keep the column list finite, you have to give it a list of values to filter by (like an implicit <code class="language-plaintext highlighter-rouge">HAVING</code> clause).</li>
<li>You can define aliases for the values; these will serve as column prefixes.</li>
<li>You can use the generated column names in query clauses. The following, for instance, is a legitimate query:</li>
</ul>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">east_sales_2022</span>
<span class="k">FROM</span> <span class="nv">"sales_data"</span>
<span class="n">PIVOT</span> <span class="p">(</span>
<span class="k">SUM</span><span class="p">(</span><span class="nv">"2022"</span><span class="p">)</span> <span class="k">AS</span> <span class="n">sales_2022</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="nv">"2023"</span><span class="p">)</span> <span class="k">AS</span> <span class="n">sales_2023</span>
<span class="k">FOR</span> <span class="nv">"region"</span> <span class="k">IN</span> <span class="p">(</span><span class="s1">'East'</span> <span class="k">AS</span> <span class="n">east</span><span class="p">,</span> <span class="s1">'Central'</span> <span class="k">AS</span> <span class="n">central</span><span class="p">))</span>
</code></pre></div></div>
<h2 id="unpivot---transpose-columns-to-rows"><code class="language-plaintext highlighter-rouge">UNPIVOT</code> - transpose columns to rows</h2>
<p>To collect a list of columns into one, transposing the columns to rows, you can use <code class="language-plaintext highlighter-rouge">UNPIVOT</code>. Here is a query that creates a format that you would probably prefer for further analytical processing:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="nv">"sales_data"</span>
<span class="n">UNPIVOT</span> <span class="p">(</span> <span class="nv">"sales"</span> <span class="k">FOR</span> <span class="nv">"year"</span> <span class="k">IN</span> <span class="p">(</span><span class="nv">"2022"</span> <span class="k">AS</span> <span class="s1">'previous'</span><span class="p">,</span> <span class="nv">"2023"</span> <span class="k">AS</span> <span class="s1">'current'</span><span class="p">)</span> <span class="p">)</span>
</code></pre></div></div>
<p><img src="/assets/2024-01-15-04-unpivot.jpg" alt="UNPIVOT query" /></p>
<ul>
<li>An <code class="language-plaintext highlighter-rouge">UNPIVOT</code> query needs no aggregation since it only reorders the values.</li>
<li>You need to define two aliases:
<ul>
<li>the first one, <code class="language-plaintext highlighter-rouge">"sales"</code> in the example, is the column where the <em>values</em> end up;</li>
<li>the second one, <code class="language-plaintext highlighter-rouge">"year"</code>, is where the column names are collected, expressed as strings.</li>
</ul>
</li>
<li>Again, you can also define alias values for the column names.</li>
</ul>
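<p>With the sample data, the query above should return something like this (plus the constant dummy <code class="language-plaintext highlighter-rouge">__time</code> column; row order may differ):</p>
<pre><code class="language-csv">region,year,sales
Central,previous,215000
Central,current,240000
East,previous,350000
East,current,360000
West,previous,415000
West,current,450000
</code></pre>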
<h2 id="unpivot-during-ingestion"><code class="language-plaintext highlighter-rouge">UNPIVOT</code> during ingestion</h2>
<p>Back to the beginning of the story. As you may have noticed, the original table does not have a proper timestamp because the time information is in the column headers. Instead, we just let Druid fill in a constant dummy timestamp. This is not optimal, particularly since the input data is very obviously time-based!</p>
<p>Can we use our new knowledge to generate a proper timestamp?</p>
<p>Let’s see how to do this using SQL based ingestion. We’ll generate the timestamp column by <code class="language-plaintext highlighter-rouge">UNPIVOT</code>ing the year column headers into a single new column, and parsing that column as a timestamp:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"sales_data_unpivot"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"inline","data":"region,2022,2023</span><span class="se">\n</span><span class="s1">Central,215000,240000</span><span class="se">\n</span><span class="s1">East,350000,360000</span><span class="se">\n</span><span class="s1">West,415000,450000"}'</span><span class="p">,</span>
<span class="s1">'{"type":"csv","findColumnsFromHeader":true}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"region"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"2022"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"2023"</span> <span class="nb">BIGINT</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">TIME_PARSE</span><span class="p">(</span><span class="nv">"year"</span><span class="p">,</span> <span class="s1">'YYYY'</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"region"</span><span class="p">,</span>
<span class="nv">"sales"</span>
<span class="k">FROM</span> <span class="nv">"ext"</span>
<span class="n">UNPIVOT</span> <span class="p">(</span> <span class="nv">"sales"</span> <span class="k">FOR</span> <span class="nv">"year"</span> <span class="k">IN</span> <span class="p">(</span><span class="nv">"2022"</span><span class="p">,</span> <span class="nv">"2023"</span> <span class="p">)</span> <span class="p">)</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="nb">YEAR</span>
</code></pre></div></div>
<p><img src="/assets/2024-01-15-05-unpivot-ingest.jpg" alt="UNPIVOT ingestion" /></p>
<p>Let’s check the result:</p>
<p><img src="/assets/2024-01-15-06-select.jpg" alt="Query table with timestamp" /></p>
<p>We have a proper timestamp. (You can also check the <code class="language-plaintext highlighter-rouge">Segments</code> view to verify that the data is actually partitioned by year.)</p>
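<p>Alternatively, a query against the <code class="language-plaintext highlighter-rouge">sys.segments</code> system table shows the segment intervals directly (column names as in the documented metadata tables):</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT "start", "end", "num_rows"
FROM sys.segments
WHERE "datasource" = 'sales_data_unpivot'
</code></pre></div></div>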
<h2 id="conclusion">Conclusion</h2>
<ul>
<li><code class="language-plaintext highlighter-rouge">PIVOT</code> transposes rows to columns, aggregating values on the way.</li>
<li><code class="language-plaintext highlighter-rouge">UNPIVOT</code> transposes columns to rows.</li>
<li>The behavior of both functions can be fine-tuned by choosing suitable column aliases.</li>
<li>One case where this is especially handy is spreadsheet data that has the time axis across the columns.</li>
</ul>
<hr />
<p>“<a href="https://www.flickr.com/photos/mhlimages/48051262646/">This image is taken from Page 500 of Praktisches Kochbuch für die gewöhnliche und feinere Küche</a>” by <a href="https://www.flickr.com/photos/mhlimages/">Medical Heritage Library, Inc.</a> is licensed under <a target="_blank" rel="noopener noreferrer" href="https://creativecommons.org/licenses/by-nc-sa/2.0/">CC BY-NC-SA 2.0 <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/nc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /></a>.</p>Druid 29 Preview: Handling Nested Arrays2023-12-17T00:00:00+01:002023-12-17T00:00:00+01:00/2023/12/17/druid-29-preview-handling-nested-arrays<p><img src="/assets/2022-11-23-00-pizza.jpg" alt="Pizza" /></p>
<p>Imagine you have a data sample like this:</p>
<pre><code class="language-json-nd">{'id': 93, 'shop': 'Circular Pi Pizzeria', 'name': 'David Murillo', 'phoneNumber': '305-351-2631', 'address': '746 Chelsea Plains Suite 656\nNew Richard, MA 16940', 'pizzas': [{'pizzaName': 'Salami', 'additionalToppings': ['🥓 bacon']}], 'timestamp': 1702815411410}
{'id': 94, 'shop': 'Marios Pizza', 'name': 'Darius Roach', 'phoneNumber': '344.571.9608x0590', 'address': '58235 Robert Cliffs\nAguilarland, PR 76249', 'pizzas': [{'pizzaName': 'Diavola', 'additionalToppings': []}, {'pizzaName': 'Salami', 'additionalToppings': ['🧄 garlic']}, {'pizzaName': 'Peperoni', 'additionalToppings': ['🫒 olives', '🧅 onion', '🍅 tomato', '🍓 strawberry']}, {'pizzaName': 'Diavola', 'additionalToppings': ['🫒 olives', '🍌 banana', '🍍 pineapple']}, {'pizzaName': 'Margherita', 'additionalToppings': ['🍓 strawberry', '🍍 pineapple', '🥚 egg', '🐟 tuna', '🐟 tuna']}, {'pizzaName': 'Margherita', 'additionalToppings': ['🥚 egg']}, {'pizzaName': 'Margherita', 'additionalToppings': ['🫑 green peppers', '🥚 egg', '🥚 egg']}, {'pizzaName': 'Peperoni', 'additionalToppings': []}, {'pizzaName': 'Salami', 'additionalToppings': []}], 'timestamp': 1702815415518}
{'id': 95, 'shop': 'Mammamia Pizza', 'name': 'Ryan Juarez', 'phoneNumber': '(041)278-5690', 'address': '934 Melissa Lights\nPaulland, UT 40700', 'pizzas': [{'pizzaName': 'Marinara', 'additionalToppings': ['🫑 green peppers', '🧅 onion']}, {'pizzaName': 'Marinara', 'additionalToppings': ['🍅 tomato', '🥓 bacon', '🍌 banana', '🌶️ hot pepper']}, {'pizzaName': 'Peperoni', 'additionalToppings': ['🍓 strawberry', '🍌 banana', '🐟 tuna', '🧀 blue cheese']}, {'pizzaName': 'Marinara', 'additionalToppings': ['🐟 tuna', '🧅 onion', '🍍 pineapple', '🍓 strawberry']}, {'pizzaName': 'Mari & Monti', 'additionalToppings': ['🫒 olives', '🐟 tuna']}, {'pizzaName': 'Marinara', 'additionalToppings': ['🍍 pineapple', '🍅 tomato', '🍌 banana', '🧀 blue cheese', '🫒 olives']}, {'pizzaName': 'Marinara', 'additionalToppings': ['🍌 banana', '🫑 green peppers', '🧄 garlic', '🍅 tomato']}], 'timestamp': 1702815418643}
</code></pre>
<p>I created the data sample using <a href="https://github.com/Aiven-Labs/python-fake-data-producer-for-apache-kafka">Francesco’s pizza simulator</a>. The structure of these simulated pizza orders is quite deeply nested:</p>
<ul>
<li>Each order has a field <code class="language-plaintext highlighter-rouge">pizzas</code>, which is an array of JSON objects.</li>
<li>Each individual pizza item has
<ul>
<li>a <code class="language-plaintext highlighter-rouge">pizzaName</code> field, which is a string</li>
<li><code class="language-plaintext highlighter-rouge">additionalToppings</code>, an array of strings that may be empty.</li>
</ul>
</li>
</ul>
<p>Arrays of objects are a bit unwieldy, and I would like to create a data model that breaks down the orders so that each row in Druid represents a line item (a single pizza).
To that end, it would be nice to use some combination of JSON functions and <a href="/2023/04/08/druid-sneak-peek-timeseries-interpolation/"><code class="language-plaintext highlighter-rouge">UNNEST</code></a> during ingestion. But how exactly? Let’s find out!</p>
<h2 id="getting-set-up">Getting set up</h2>
<p>This is a sneak peek into Druid 29 functionality. In order to use the new functions, you can (as of the time of writing) <a href="https://druid.apache.org/docs/latest/development/build.html">build Druid</a> from the HEAD of the master branch:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/apache/druid.git
<span class="nb">cd </span>druid
mvn clean <span class="nb">install</span> <span class="nt">-Pdist</span> <span class="nt">-DskipTests</span>
</code></pre></div></div>
<p>Then follow the instructions to locate and install the tarball. Make sure you have <a href="https://druid.apache.org/docs/latest/multi-stage-query/#load-the-extension">the <code class="language-plaintext highlighter-rouge">druid-multi-stage-query</code> extension enabled</a>.</p>
<p>In this tutorial, you will</p>
<ul>
<li>examine how to model deeply nested JSON data with arrays in Druid and</li>
<li>break down a nested JSON array into individual rows using new functionality that is currently being built.</li>
</ul>
<p><em><strong>Disclaimer:</strong> This tutorial uses undocumented functionality and unreleased code. This blog is neither endorsed by Imply nor by the Apache Druid PMC. It merely collects the results of personal experiments. The features described here might, in the final release, work differently, or not at all. In addition, the entire build, or execution, may fail. Your mileage may vary.</em></p>
<h2 id="the-data">The data</h2>
<p>Right now, the technique we are looking at is limited to batch ingestion. So, we need to capture the simulator data in a file.</p>
<p>I assume you have a local Kafka service at <em>localhost:9092</em>.</p>
<p>Check out the <a href="https://github.com/Aiven-Labs/python-fake-data-producer-for-apache-kafka">pizza simulator</a> and run it like so:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python3 main.py <span class="nt">--security-protocol</span> PLAINTEXT <span class="nt">--host</span> localhost <span class="nt">--port</span> 9092 <span class="nt">--topic-name</span> pizza-orders <span class="nt">--nr-messages</span> 0 <span class="nt">--max-waiting-time</span> 5
</code></pre></div></div>
<p>Capture the output using <code class="language-plaintext highlighter-rouge">kcat</code> and redirect to a file:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kcat <span class="nt">-b</span> localhost:9092 <span class="nt">-t</span> pizza-orders <span class="o">>></span>./pizza-orders.json
</code></pre></div></div>
<p>You can stop the simulator after a while and use the <code class="language-plaintext highlighter-rouge">pizza-orders.json</code> file as input for the next steps.</p>
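<p>A quick sanity check on the captured file before moving on:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wc -l pizza-orders.json     # one order per line
head -n 1 pizza-orders.json # eyeball the nested pizzas array
</code></pre></div></div>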
<h2 id="basic-ingestion-the-pizza-orders-table">Basic ingestion: the <code class="language-plaintext highlighter-rouge">pizza-orders</code> table</h2>
<p>Let’s start by setting up a naïve data model using the <a href="https://druid.apache.org/docs/latest/operations/web-console#data-loader">web console wizard</a>. Note how in the SQL view, the type of the <code class="language-plaintext highlighter-rouge">pizzas</code> field is correctly recognized as <code class="language-plaintext highlighter-rouge">COMPLEX<json></code>, but the wizard does not know about the array structure:</p>
<p><img src="/assets/2023-12-17-01-ingest-orders.jpg" alt="Ingestion view for pizza-orders" /></p>
<p>Here is the ingestion query using MSQ:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"pizza-orders"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"local","baseDir":"/Users/hellmarbecker/meetup-talks/jsonarray","filter":"*json"}'</span><span class="p">,</span>
<span class="s1">'{"type":"json"}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"id"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"shop"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"name"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"phoneNumber"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"address"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"pizzas"</span> <span class="k">TYPE</span><span class="p">(</span><span class="s1">'COMPLEX<json>'</span><span class="p">),</span> <span class="nv">"timestamp"</span> <span class="nb">BIGINT</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">MILLIS_TO_TIMESTAMP</span><span class="p">(</span><span class="nv">"timestamp"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"id"</span><span class="p">,</span>
<span class="nv">"shop"</span><span class="p">,</span>
<span class="nv">"name"</span><span class="p">,</span>
<span class="nv">"phoneNumber"</span><span class="p">,</span>
<span class="nv">"address"</span><span class="p">,</span>
<span class="nv">"pizzas"</span>
<span class="k">FROM</span> <span class="nv">"ext"</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">DAY</span>
</code></pre></div></div>
<p>When we query this table, we see that we indeed have a general nested column here; it is not marked as an array:</p>
<p><img src="/assets/2023-12-17-02-select-orders.jpg" alt="Sample of a query over orders" /></p>
<p>We can look at the detailed values in the column:</p>
<p><img src="/assets/2023-12-17-03-orders-detail.jpg" alt="Detail view of a pizzas object" /></p>
<p>Again, what we would <em>like</em> is a table model where each row represents a <em>line item</em>, i.e. an individual pizza!</p>
<h2 id="first-attempt-at-breaking-down-the-line-items">First attempt at breaking down the line items</h2>
<p>Let’s try to craft a new ingestion query that breaks down the line items using <code class="language-plaintext highlighter-rouge">UNNEST</code>. We want to unnest the line items using something like <code class="language-plaintext highlighter-rouge">UNNEST(JSON_QUERY(pizzas, '$'))</code>, and then extract the individual fields into separate columns: <code class="language-plaintext highlighter-rouge">JSON_VALUE(p, '$.pizzaName') AS pizzaName</code> and so forth.</p>
<p>Here’s the first attempt at such a query:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"pizza-lineitems"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"local","baseDir":"/Users/hellmarbecker/meetup-talks/jsonarray","filter":"*json"}'</span><span class="p">,</span>
<span class="s1">'{"type":"json"}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"id"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"shop"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"name"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"phoneNumber"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"address"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"pizzas"</span> <span class="k">TYPE</span><span class="p">(</span><span class="s1">'COMPLEX<json>'</span><span class="p">),</span> <span class="nv">"timestamp"</span> <span class="nb">BIGINT</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">MILLIS_TO_TIMESTAMP</span><span class="p">(</span><span class="nv">"timestamp"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"id"</span><span class="p">,</span>
<span class="nv">"shop"</span><span class="p">,</span>
<span class="nv">"name"</span><span class="p">,</span>
<span class="nv">"phoneNumber"</span><span class="p">,</span>
<span class="nv">"address"</span><span class="p">,</span>
<span class="n">JSON_VALUE</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="s1">'$.pizzaName'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">pizzaName</span><span class="p">,</span>
<span class="n">JSON_QUERY</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="s1">'$.additionalToppings'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">additionalToppings</span>
<span class="k">FROM</span> <span class="nv">"ext"</span> <span class="k">CROSS</span> <span class="k">JOIN</span> <span class="k">UNNEST</span><span class="p">(</span><span class="n">JSON_QUERY</span><span class="p">(</span><span class="n">pizzas</span><span class="p">,</span> <span class="s1">'$'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">lineitems</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">DAY</span>
</code></pre></div></div>
<p>This, unfortunately, fails with a screaming error message:</p>
<p><img src="/assets/2023-12-17-04-error.jpg" width="50%" /></p>
<p>We cannot unnest arrays of objects the same way as arrays of primitives! But why is that? Look at the error message more closely: Druid thinks this is a call to <code class="language-plaintext highlighter-rouge">UNNEST(COMPLEX<JSON>)</code>. So <code class="language-plaintext highlighter-rouge">JSON_QUERY</code> doesn’t know about the array nature of its output. What now?</p>
<h2 id="a-new-function-json_query_array">A new function: <code class="language-plaintext highlighter-rouge">JSON_QUERY_ARRAY</code></h2>
<p>The Druid team has added a new function that does just the right thing for our case:</p>
<blockquote>
<p><code class="language-plaintext highlighter-rouge">JSON_QUERY_ARRAY(expr, path)</code></p>
<p>Extracts an <code class="language-plaintext highlighter-rouge">ARRAY<COMPLEX<json>></code> value from <code class="language-plaintext highlighter-rouge">expr</code> at the specified <code class="language-plaintext highlighter-rouge">path</code>. If value is not an <code class="language-plaintext highlighter-rouge">ARRAY</code>, it gets translated into a single element <code class="language-plaintext highlighter-rouge">ARRAY</code> containing the value at <code class="language-plaintext highlighter-rouge">path</code>. The primary use of this function is to extract arrays of objects to use as inputs to other array functions.</p>
</blockquote>
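<p>You can try the new function in isolation before touching the ingestion query; a minimal sketch using <code class="language-plaintext highlighter-rouge">PARSE_JSON</code>, with literal values chosen purely for illustration:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT
  JSON_QUERY_ARRAY(PARSE_JSON('{"a": [1, 2, 3]}'), '$.a') AS already_an_array,
  JSON_QUERY_ARRAY(PARSE_JSON('{"a": {"b": 1}}'), '$.a') AS wrapped_into_an_array
</code></pre></div></div>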
<p>Let’s rewrite the above query, substituting <code class="language-plaintext highlighter-rouge">JSON_QUERY_ARRAY</code> for <code class="language-plaintext highlighter-rouge">JSON_QUERY</code> in both cases:</p>
<p><img src="/assets/2023-12-17-05-ingest-lineitems.jpg" alt="Ingestion using JSON_QUERY_ARRAY" /></p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"pizza-lineitems"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"local","baseDir":"/Users/hellmarbecker/meetup-talks/jsonarray","filter":"*json"}'</span><span class="p">,</span>
<span class="s1">'{"type":"json"}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"id"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"shop"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"name"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"phoneNumber"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"address"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"pizzas"</span> <span class="k">TYPE</span><span class="p">(</span><span class="s1">'COMPLEX<json>'</span><span class="p">),</span> <span class="nv">"timestamp"</span> <span class="nb">BIGINT</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">MILLIS_TO_TIMESTAMP</span><span class="p">(</span><span class="nv">"timestamp"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"id"</span><span class="p">,</span>
<span class="nv">"shop"</span><span class="p">,</span>
<span class="nv">"name"</span><span class="p">,</span>
<span class="nv">"phoneNumber"</span><span class="p">,</span>
<span class="nv">"address"</span><span class="p">,</span>
<span class="n">JSON_VALUE</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="s1">'$.pizzaName'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">pizzaName</span><span class="p">,</span>
<span class="n">JSON_QUERY_ARRAY</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="s1">'$.additionalToppings'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">additionalToppings</span>
<span class="k">FROM</span> <span class="nv">"ext"</span> <span class="k">CROSS</span> <span class="k">JOIN</span> <span class="k">UNNEST</span><span class="p">(</span><span class="n">JSON_QUERY_ARRAY</span><span class="p">(</span><span class="n">pizzas</span><span class="p">,</span> <span class="s1">'$'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">lineitems</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">DAY</span>
</code></pre></div></div>
<p>That way, we can also be sure that the <code class="language-plaintext highlighter-rouge">additionalToppings</code> column will be represented as an array.</p>
<p>After the ingestion has finished, query the table and note how</p>
<ul>
<li>there is now one row per line item</li>
<li>the <code class="language-plaintext highlighter-rouge">additionalToppings</code> subcolumn is represented as an array, as you can see by the <code class="language-plaintext highlighter-rouge">[⋯]</code> instead of the tree symbol:</li>
</ul>
<p><img src="/assets/2023-12-17-06-select-lineitems.jpg" alt="Query on line items" /></p>
<p>You can actually run a query over the new table that shows how <code class="language-plaintext highlighter-rouge">JSON_QUERY</code> forgets about the “array-ness” of the array column, while <code class="language-plaintext highlighter-rouge">JSON_QUERY_ARRAY</code> enforces it:</p>
<p><img src="/assets/2023-12-17-07-compare.jpg" alt="Comparison query" /></p>
<p>It is, however, preferable to use <code class="language-plaintext highlighter-rouge">JSON_QUERY_ARRAY</code> at ingestion time and to represent the result in your data model. This is part of optimizing the data model to achieve those fast queries that Druid is known for!</p>
<h2 id="conclusion">Conclusion</h2>
<ul>
<li>We have seen how it is now possible to unnest even columns that contain arrays of objects. With this capability, Druid takes another big step in handling nested objects.</li>
<li>Using <code class="language-plaintext highlighter-rouge">JSON_QUERY_ARRAY</code> on an array retains the “array-ness” and passes it on to functions that require an array input.</li>
<li>Using <code class="language-plaintext highlighter-rouge">JSON_QUERY_ARRAY</code> on a single object wraps it into an array.</li>
<li>You should use <code class="language-plaintext highlighter-rouge">JSON_QUERY_ARRAY</code> at ingestion rather than query time.</li>
</ul>
<hr />
<p class="attribution">"<a target="_blank" rel="noopener noreferrer" href="https://www.flickr.com/photos/26242865@N04/5919366429">Pizza</a>" by <a target="_blank" rel="noopener noreferrer" href="https://www.flickr.com/photos/26242865@N04">Katrin Gilger</a> is licensed under <a target="_blank" rel="noopener noreferrer" href="https://creativecommons.org/licenses/by-sa/2.0/?ref=openverse">CC BY-SA 2.0 <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /></a>. </p>Druid Data Cookbook: Upserts in Druid SQL2023-11-25T00:00:00+01:002023-11-25T00:00:00+01:00/2023/11/25/druid-data-cookbook-upserts-in-druid-sql<p><img src="/assets/2021-12-21-elf.jpg" alt="Druid Cookbook" /></p>
<p>In <a href="/2023/03/07/selective-bulk-upserts-in-apache-druid/">an earlier blog</a>, I demonstrated a technique to combine existing and new data in Druid batch ingestion in a way that more or less emulates what is usually expressed in SQL as a <code class="language-plaintext highlighter-rouge">MERGE</code> or <code class="language-plaintext highlighter-rouge">UPSERT</code> statement. That technique involves a <code class="language-plaintext highlighter-rouge">combine</code> datasource and works only in JSON-based ingestion. Also, it works on bulk data, where you replace an entire range of data based on a time interval and key range.</p>
<p>Today I am going to look at a similar, albeit more surgical, construction that achieves the same <code class="language-plaintext highlighter-rouge">MERGE</code>/<code class="language-plaintext highlighter-rouge">UPSERT</code> behavior. I will be using <a href="https://druid.apache.org/docs/latest/multi-stage-query/">SQL-based ingestion</a>, which is available in newer versions of Druid.</p>
<p>The <code class="language-plaintext highlighter-rouge">MERGE</code> statement, in a simplified way, works like this:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">MERGE</span> <span class="k">INTO</span> <span class="n">druid_table</span>
<span class="p">(</span><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">external_table</span><span class="p">)</span>
<span class="k">ON</span> <span class="n">druid_table</span><span class="p">.</span><span class="n">keys</span> <span class="o">=</span> <span class="n">external_table</span><span class="p">.</span><span class="n">keys</span>
<span class="k">WHEN</span> <span class="n">MATCHED</span> <span class="k">THEN</span> <span class="k">UPDATE</span> <span class="p">...</span>
<span class="k">WHEN</span> <span class="k">NOT</span> <span class="n">MATCHED</span> <span class="k">THEN</span> <span class="k">INSERT</span> <span class="p">...</span>
</code></pre></div></div>
<p>So, you compare old <em>(druid_table)</em> and new data <em>(external_table)</em> with respect to a <em>matching condition</em>. This is typically a combination of timestamp and key fields, which in the above pseudocode is denoted by <em>keys</em>. There are three possible outcomes for any combination of <em>keys</em>:</p>
<ol>
<li>If <em>keys</em> exists only in <em>druid_table</em>, leave that data untouched.</li>
<li>If <em>keys</em> exists in both tables, replace the row(s) in <em>druid_table</em> with those in <em>external_table</em>.</li>
<li>If <em>keys</em> exists only in <em>external_table</em>, insert that data into <em>druid_table</em>.</li>
</ol>
<p>But Druid SQL does not offer a <code class="language-plaintext highlighter-rouge">MERGE</code> statement, at least not at the time of this writing. Can we do this in SQL anyway? Stay tuned if you want to know!</p>
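<p>As a preview of where this is going: in the same pseudocode spirit as the <code class="language-plaintext highlighter-rouge">MERGE</code> statement above, the three outcomes can be expressed as an anti-join on the old data, unioned with the new data. Table and key names are placeholders here, and this is not yet runnable Druid SQL:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>REPLACE INTO druid_table OVERWRITE ALL
SELECT * FROM druid_table t   -- outcome 1: keep rows whose keys do not occur in the new data
WHERE NOT EXISTS (SELECT 1 FROM external_table e WHERE e.keys = t.keys)
UNION ALL
SELECT * FROM external_table  -- outcomes 2 and 3: matched rows are replaced, new rows inserted
</code></pre></div></div>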
<p>This tutorial works with <a href="https://druid.apache.org/docs/latest/tutorials/">the Druid 28 quickstart</a>.</p>
<h2 id="recap-the-data">Recap: the data</h2>
<p>Let’s use the same data as in <a href="/2023/03/07/selective-bulk-upserts-in-apache-druid/">the bulk upsert blog</a>: daily aggregated viewership data from various ad networks.</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-01T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">2770</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">330.69</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-01T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"fakebook"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">9646</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">137.85</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-01T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"twottr"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">1139</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">493.73</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-02T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">9066</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">368.66</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-02T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"fakebook"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">4426</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">170.96</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-02T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"twottr"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">9110</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">452.2</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-03T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">3275</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">363.53</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-03T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"fakebook"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">9494</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">426.37</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-03T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"twottr"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">4325</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">107.44</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-04T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">8816</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">311.53</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-04T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"fakebook"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">8955</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">254.5</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-04T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"twottr"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">6905</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">211.74</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-05T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">3075</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">382.41</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-05T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"fakebook"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">4870</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">205.84</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-05T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"twottr"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">1418</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">282.21</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-06T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">7413</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">322.43</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-06T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"fakebook"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">1251</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">265.52</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-06T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"twottr"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">8055</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">394.56</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-07T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">4279</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">317.84</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-07T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"fakebook"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">5848</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">162.96</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-07T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"twottr"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">9449</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">379.21</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Save this file as <code class="language-plaintext highlighter-rouge">data1.json</code>. Also, save the “new data” bit:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-03T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">4521</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">378.65</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-04T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">4330</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">464.02</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-05T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">6088</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">320.57</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-06T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">3417</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">162.77</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-07T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">9762</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">76.27</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-08T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">1484</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">188.17</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-09T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">1845</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">287.5</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>as <code class="language-plaintext highlighter-rouge">data2.json</code>.</p>
<h2 id="initial-data-ingestion">Initial data ingestion</h2>
<p>Let’s ingest the first data set. We want to set the segment granularity to <code class="language-plaintext highlighter-rouge">month</code>, so the ingestion statement uses a <code class="language-plaintext highlighter-rouge">PARTITIONED BY MONTH</code> clause. Moreover, we enforce secondary partitioning by choosing <code class="language-plaintext highlighter-rouge">REPLACE</code> mode and by including a <code class="language-plaintext highlighter-rouge">CLUSTERED BY</code> clause. Here’s the complete statement (replace the path in <code class="language-plaintext highlighter-rouge">baseDir</code> with the path you saved the sample file to):</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"ad_data"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"local","baseDir":"/<my base path>","filter":"data1.json"}'</span><span class="p">,</span>
<span class="s1">'{"type":"json"}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"date"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"ad_network"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"ads_impressions"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"ads_revenue"</span> <span class="nb">DOUBLE</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">TIME_PARSE</span><span class="p">(</span><span class="nv">"date"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"ad_network"</span><span class="p">,</span>
<span class="nv">"ads_impressions"</span><span class="p">,</span>
<span class="nv">"ads_revenue"</span>
<span class="k">FROM</span> <span class="nv">"ext"</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">MONTH</span>
<span class="n">CLUSTERED</span> <span class="k">BY</span> <span class="nv">"ad_network"</span>
</code></pre></div></div>
<p>You can run this SQL from the <code class="language-plaintext highlighter-rouge">Query</code> tab in the Druid console:</p>
<p><img src="/assets/2023-11-25-01-ingest1.jpg" alt="Console running initial ingestion" /></p>
<p>Or you can use the Ingest wizard to enter the same code.</p>
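<p>Once the job has finished, a quick sanity check (a sketch, using the column names from the ingestion statement above) confirms that the data has landed:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT "ad_network", COUNT(*) AS "rows", SUM("ads_revenue") AS "total_revenue"
FROM "ad_data"
GROUP BY 1
</code></pre></div></div>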
<h2 id="the-merge-query">The merge query</h2>
<p>Many thanks to <a href="https://www.linkedin.com/in/jkowtko/">John Kowtko</a> for pointing out this approach. Since Druid doesn’t have a <code class="language-plaintext highlighter-rouge">MERGE</code> statement, let’s emulate it using a <code class="language-plaintext highlighter-rouge">FULL OUTER JOIN</code>. Druid’s <a href="https://druid.apache.org/docs/latest/multi-stage-query/concepts#multi-stage-query-task-engine">MSQ engine</a> supports sort/merge joins of tables of arbitrary size, so we can actually pull this off!</p>
<p>Important note: the new join algorithm needs to be explicitly requested by <a href="https://druid.apache.org/docs/latest/multi-stage-query/reference#joins">setting a query context parameter</a>. Open up the query engine menu next to the <code class="language-plaintext highlighter-rouge">Preview</code> button, and select <code class="language-plaintext highlighter-rouge">Edit context</code>:</p>
<p><img src="/assets/2023-11-25-02-ingest2.jpg" alt="Second ingestion with context" /></p>
<p>Add <code class="language-plaintext highlighter-rouge">{ "sqlJoinAlgorithm": "sortMerge" }</code> to the query context.</p>
<p><img src="/assets/2023-11-25-03-context.jpg" width="30%" /></p>
<p>Then run the ingestion query:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"ad_data"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"local","baseDir":"/<my base path>","filter":"data2.json"}'</span><span class="p">,</span>
<span class="s1">'{"type":"json"}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"date"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"ad_network"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"ads_impressions"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"ads_revenue"</span> <span class="nb">DOUBLE</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">COALESCE</span><span class="p">(</span><span class="nv">"new_data"</span><span class="p">.</span><span class="nv">"__time"</span><span class="p">,</span> <span class="nv">"ad_data"</span><span class="p">.</span><span class="nv">"__time"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="n">COALESCE</span><span class="p">(</span><span class="nv">"new_data"</span><span class="p">.</span><span class="nv">"ad_network"</span><span class="p">,</span> <span class="nv">"ad_data"</span><span class="p">.</span><span class="nv">"ad_network"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"ad_network"</span><span class="p">,</span>
<span class="k">CASE</span> <span class="k">WHEN</span> <span class="nv">"new_data"</span><span class="p">.</span><span class="nv">"ad_network"</span> <span class="k">IS</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="k">THEN</span> <span class="nv">"new_data"</span><span class="p">.</span><span class="nv">"ads_impressions"</span> <span class="k">ELSE</span> <span class="nv">"ad_data"</span><span class="p">.</span><span class="nv">"ads_impressions"</span> <span class="k">END</span> <span class="k">AS</span> <span class="nv">"ads_impressions"</span><span class="p">,</span>
<span class="k">CASE</span> <span class="k">WHEN</span> <span class="nv">"new_data"</span><span class="p">.</span><span class="nv">"ad_network"</span> <span class="k">IS</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="k">THEN</span> <span class="nv">"new_data"</span><span class="p">.</span><span class="nv">"ads_revenue"</span> <span class="k">ELSE</span> <span class="nv">"ad_data"</span><span class="p">.</span><span class="nv">"ads_revenue"</span> <span class="k">END</span> <span class="k">AS</span> <span class="nv">"ads_revenue"</span>
<span class="k">FROM</span>
<span class="nv">"ad_data"</span>
<span class="k">FULL</span> <span class="k">OUTER</span> <span class="k">JOIN</span>
<span class="p">(</span> <span class="k">SELECT</span>
<span class="n">TIME_PARSE</span><span class="p">(</span><span class="nv">"date"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"ad_network"</span><span class="p">,</span>
<span class="nv">"ads_impressions"</span><span class="p">,</span>
<span class="nv">"ads_revenue"</span>
<span class="k">FROM</span> <span class="nv">"ext"</span> <span class="p">)</span> <span class="nv">"new_data"</span>
<span class="k">ON</span> <span class="nv">"ad_data"</span><span class="p">.</span><span class="nv">"__time"</span> <span class="o">=</span> <span class="nv">"new_data"</span><span class="p">.</span><span class="nv">"__time"</span> <span class="k">AND</span> <span class="nv">"ad_data"</span><span class="p">.</span><span class="nv">"ad_network"</span> <span class="o">=</span> <span class="nv">"new_data"</span><span class="p">.</span><span class="nv">"ad_network"</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">MONTH</span>
<span class="n">CLUSTERED</span> <span class="k">BY</span> <span class="nv">"ad_network"</span>
</code></pre></div></div>
<h2 id="analysis-of-the-query">Analysis of the query</h2>
<p>What have we done here?</p>
<p>We are emulating the <code class="language-plaintext highlighter-rouge">MERGE</code> statement with a full outer join. The left side table is the data we already have in Druid; the right side is the new data. Our merge key is a combination of timestamp (daily granularity) and ad network.</p>
<p>For each key combination there are three possible outcomes:</p>
<ol>
<li>If the right hand side is <em>null</em>, leave the left hand side data as the result (leave old data untouched).</li>
<li>If neither side is <em>null</em>, replace the row(s) in the existing table with new data from the right hand side (update rows).</li>
<li>If the left hand side is <em>null</em>, insert the right hand side data into Druid.</li>
</ol>
<p>This is exactly what we wanted to happen.</p>
<p>In order to identify the correct data to be inserted, we look at the join key:</p>
<ul>
<li>Data rows that refer to <em>key fields</em> are modeled with a <code class="language-plaintext highlighter-rouge">COALESCE</code> expression: <code class="language-plaintext highlighter-rouge">COALESCE("new_data"."ad_network", "ad_data"."ad_network") AS "ad_network"</code> selects the key field from the right hand side, falling back to the left hand side if the right hand side is <em>null</em> (that is, no matching new row exists).</li>
<li>For <em>non-key fields</em> the statement is a bit more complex because we still have to select based on the <em>key field</em>. Otherwise some real <em>null</em> values in the data might create inconsistencies, where we would overwrite rows only partially. Hence an expression like <code class="language-plaintext highlighter-rouge">CASE WHEN "new_data"."ad_network" IS NOT NULL THEN "new_data"."ads_impressions" ELSE "ad_data"."ads_impressions" END AS "ads_impressions"</code>.</li>
</ul>
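<p>For readers who know the <code class="language-plaintext highlighter-rouge">MERGE</code> statement from other databases, the join corresponds roughly to the following statement. This is hypothetical and for illustration only - Druid SQL does not accept this syntax. Note how outcome 1 from the list above (leaving unmatched old rows untouched) needs no clause at all, because it is the default behavior of <code class="language-plaintext highlighter-rouge">MERGE</code>:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical ANSI MERGE equivalent, not valid Druid SQL.
-- "ext" stands for the same external input table as in the query above.
MERGE INTO "ad_data"
USING (
  SELECT TIME_PARSE("date") AS "__time", "ad_network", "ads_impressions", "ads_revenue"
  FROM "ext"
) AS "new_data"
ON "ad_data"."__time" = "new_data"."__time" AND "ad_data"."ad_network" = "new_data"."ad_network"
WHEN MATCHED THEN UPDATE SET
  "ads_impressions" = "new_data"."ads_impressions",
  "ads_revenue" = "new_data"."ads_revenue"
WHEN NOT MATCHED THEN INSERT ("__time", "ad_network", "ads_impressions", "ads_revenue")
  VALUES ("new_data"."__time", "new_data"."ad_network", "new_data"."ads_impressions", "new_data"."ads_revenue")
</code></pre></div></div>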
<h2 id="can-we-be-more-selective">Can we be more selective?</h2>
<p>You might be thinking that this approach entails rewriting all the data in the existing table, even if the range of new data is much more limited. And you would be right. Fortunately, it is possible to <a href="https://druid.apache.org/docs/latest/multi-stage-query/reference#replace-specific-time-ranges">limit the date range to be overwritten</a>.</p>
<p>Let’s try this. Apparently we can specify the date range like so:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"ad_data"</span> <span class="n">OVERWRITE</span> <span class="k">WHERE</span> <span class="n">__time</span> <span class="o">>=</span> <span class="nb">TIMESTAMP</span><span class="s1">'2023-01-03'</span> <span class="k">AND</span> <span class="n">__time</span> <span class="o"><</span> <span class="nb">TIMESTAMP</span><span class="s1">'2023-01-10'</span>
<span class="p">...</span>
</code></pre></div></div>
<p>Alas, this doesn’t work:</p>
<p><img src="/assets/2023-11-25-04-granularity-error.jpg" alt="Granularity error" /></p>
<p><strong>The date filter has to be aligned with the segment boundaries</strong>, otherwise Druid will refuse to run the query. This is actually a Good Thing: in JSON ingestion mode you would be able to overwrite a whole segment with data covering a smaller date range, potentially deleting data that you actually wanted to keep!</p>
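<p>If you are unsure where your segment boundaries lie, you can look them up in the <code class="language-plaintext highlighter-rouge">sys</code> schema. A minimal sketch (adjust the datasource name as needed):</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT "start", "end", "num_rows"
FROM sys.segments
WHERE "datasource" = 'ad_data' AND is_active = 1
ORDER BY "start"
</code></pre></div></div>
<p>With <code class="language-plaintext highlighter-rouge">PARTITIONED BY MONTH</code>, this should show segments starting at 2023-01-01 and ending at 2023-02-01.</p>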
<p>If we adjust the date range clause to match the segment boundaries:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"ad_data"</span> <span class="n">OVERWRITE</span> <span class="k">WHERE</span> <span class="n">__time</span> <span class="o">>=</span> <span class="nb">TIMESTAMP</span><span class="s1">'2023-01-01'</span> <span class="k">AND</span> <span class="n">__time</span> <span class="o"><</span> <span class="nb">TIMESTAMP</span><span class="s1">'2023-02-01'</span>
<span class="p">...</span>
</code></pre></div></div>
<p>the ingestion query works fine and we get the desired result:</p>
<p><img src="/assets/2023-11-25-05-query.jpg" alt="Query" /></p>
<p>Use the new <a href="/2023/07/30/druid-sneak-peek-graphical-data-exploration/">graphical exploration mode</a> of Druid to get an idea of the data:</p>
<p><img src="/assets/2023-11-25-06-explore.jpg" alt="Explore" /></p>
<h2 id="learnings">Learnings</h2>
<ul>
<li>You can emulate the effect of a <code class="language-plaintext highlighter-rouge">MERGE</code> statement in Druid with a full outer join.</li>
<li>Make sure to enable the sort/merge join algorithm in the query context.</li>
<li>Some care must be taken around <em>null</em> values in the outer join result.</li>
<li>You can limit the range of data for reprocessing using <code class="language-plaintext highlighter-rouge">OVERWRITE WHERE ...</code>, but take care to align the time filter with your segment granularity.</li>
</ul>
<hr />
<p>“<a href="https://www.flickr.com/photos/mhlimages/48051262646/">This image is taken from Page 500 of Praktisches Kochbuch für die gewöhnliche und feinere Küche</a>” by <a href="https://www.flickr.com/photos/mhlimages/">Medical Heritage Library, Inc.</a> is licensed under <a target="_blank" rel="noopener noreferrer" href="https://creativecommons.org/licenses/by-nc-sa/2.0/">CC BY-NC-SA 2.0 <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/nc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /></a>.</p>Druid SQL: BETWEEN considered harmful2023-11-03T00:00:00+01:002023-11-03T00:00:00+01:00/2023/11/03/druid-sql-between-considered-harmful<p><img src="/assets/2023-11-03-903932_platinumfusi0n_grug.png" width="50%" /></p>
<p>When querying data in Druid (or another analytical database), your query will in almost all cases include a filter on the primary timestamp. And this timestamp filter will usually take the form of an interval.</p>
<p>The easiest way to describe such an interval seems to be the SQL <code class="language-plaintext highlighter-rouge">BETWEEN</code> operator.</p>
<p>Advice from a <a href="https://grugbrain.dev/">grug brained developer</a>: <strong>Don’t do that.</strong></p>
<p>Here’s why.</p>
<h2 id="a-harmless-data-sample">A harmless data sample</h2>
<p>Imagine you have a table like this:</p>
<table>
<thead>
<tr>
<th>__time</th>
<th>val</th>
</tr>
</thead>
<tbody>
<tr>
<td>2023-01-01T01:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-02T00:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-02T06:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-03T00:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-03T01:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-04T00:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-04T07:00:00.000Z</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>You can populate such a table in Druid using SQL ingestion like so:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"sample"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"inline","data":"datetime,val</span><span class="se">\n</span><span class="s1">2023-01-01 01:00:00,1</span><span class="se">\n</span><span class="s1">2023-01-02 00:00:00,1</span><span class="se">\n</span><span class="s1">2023-01-02 06:00:00,1</span><span class="se">\n</span><span class="s1">2023-01-03 00:00:00,1</span><span class="se">\n</span><span class="s1">2023-01-03 01:00:00,1</span><span class="se">\n</span><span class="s1">2023-01-04 00:00:00,1</span><span class="se">\n</span><span class="s1">2023-01-04 07:00:00,1"}'</span><span class="p">,</span>
<span class="s1">'{"type":"csv","findColumnsFromHeader":true}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"datetime"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"val"</span> <span class="nb">BIGINT</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">TIME_PARSE</span><span class="p">(</span><span class="k">TRIM</span><span class="p">(</span><span class="nv">"datetime"</span><span class="p">))</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"val"</span>
<span class="k">FROM</span> <span class="nv">"ext"</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">DAY</span>
</code></pre></div></div>
<p>You want to list all rows for 2nd and 3rd January. You write:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="nv">"sample"</span>
<span class="k">WHERE</span> <span class="n">__time</span> <span class="k">BETWEEN</span> <span class="nb">TIMESTAMP</span><span class="s1">'2023-01-02'</span> <span class="k">AND</span> <span class="nb">TIMESTAMP</span><span class="s1">'2023-01-03'</span>
</code></pre></div></div>
<p>And here’s the result:</p>
<table>
<thead>
<tr>
<th>__time</th>
<th>val</th>
</tr>
</thead>
<tbody>
<tr>
<td>2023-01-02T00:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-02T06:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-03T00:00:00.000Z</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>You notice that all the rows for 2nd January are in the result, but only one row for 3rd January. What happened?</p>
<h2 id="the-solution">The solution</h2>
<p>We are being hit by the interplay of two perfectly documented features here, which together create a minor footgun.</p>
<ol>
<li>The <code class="language-plaintext highlighter-rouge">BETWEEN</code> operator creates a closed interval, that is, it includes both the left and right boundary value. This would by itself not be a problem, were it not for the second feature.</li>
<li>The literal <code class="language-plaintext highlighter-rouge">TIMESTAMP'2023-01-03'</code> does <em>not</em> mean “the entire day of 3rd January”, as one might naïvely think. It is equivalent to “3rd January, 00:00”.</li>
</ol>
<p>In effect, we have created a query that includes all of 2nd January, but only the data for exactly 00:00 on 3rd January!</p>
<p>You could fix this by writing something like <code class="language-plaintext highlighter-rouge">TIMESTAMP'2023-01-03 23:59:59'</code> for the right interval boundary. But does this really catch every last bit of the data for that day? What if you have fractional timestamps? Is your precision milliseconds, or even microseconds?</p>
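<p>For example, assuming millisecond precision (which is what Druid’s <code class="language-plaintext highlighter-rouge">__time</code> column uses), a row stamped one millisecond before midnight silently falls out of the closed interval:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- 23:59:59.999 is later than 23:59:59, so BETWEEN ... AND TIMESTAMP'2023-01-03 23:59:59' misses it
SELECT TIMESTAMP'2023-01-03 23:59:59.999' <= TIMESTAMP'2023-01-03 23:59:59'
-- returns false
</code></pre></div></div>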
<p>This is why I argue that the proper way to model such time filter conditions is to use a right-open interval, which includes the left boundary value <em>but not</em> the right boundary value. If you do that, you have to set the right boundary to the <em>next</em> day (4th January), in order to still catch all of 3rd January in your filter:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="nv">"sample"</span>
<span class="k">WHERE</span> <span class="n">__time</span> <span class="o">>=</span> <span class="nb">TIMESTAMP</span><span class="s1">'2023-01-02'</span> <span class="k">AND</span> <span class="n">__time</span> <span class="o"><</span> <span class="nb">TIMESTAMP</span><span class="s1">'2023-01-04'</span>
</code></pre></div></div>
<p>This query returns the correct result:</p>
<table>
<thead>
<tr>
<th>__time</th>
<th>val</th>
</tr>
</thead>
<tbody>
<tr>
<td>2023-01-02T00:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-02T06:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-03T00:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-03T01:00:00.000Z</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>This way of filtering is also in line with the treatment of time intervals almost everywhere in Druid. Segment time chunks, for instance, are defined in terms of right open intervals, too.</p>
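<p>You can see this in the segments created by the ingestion above: each <code class="language-plaintext highlighter-rouge">DAY</code> time chunk has an inclusive start and an exclusive end. A quick way to check is a query against the <code class="language-plaintext highlighter-rouge">sys</code> schema:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT "start", "end"
FROM sys.segments
WHERE "datasource" = 'sample'
ORDER BY "start"
</code></pre></div></div>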
<p>Edit 2023-11-06: <a href="https://pmio.hashnode.dev/">Peter</a> pointed out that you can instead use the <a href="https://druid.apache.org/docs/latest/querying/sql-scalar/#date-and-time-functions"><code class="language-plaintext highlighter-rouge">TIME_IN_INTERVAL</code></a> function. This uses ISO interval notation and creates exactly the right-open intervals we want. So a more elegant way of rewriting the query is:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="nv">"sample"</span>
<span class="k">WHERE</span> <span class="n">TIME_IN_INTERVAL</span><span class="p">(</span><span class="n">__time</span><span class="p">,</span> <span class="s1">'2023-01-02/2023-01-04'</span><span class="p">)</span>
</code></pre></div></div>
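<p>Since the argument is any ISO 8601 interval, you can also combine a start date with a duration, which expresses the intent (“two days starting 2nd January”) even more directly:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT * FROM "sample"
WHERE TIME_IN_INTERVAL(__time, '2023-01-02/P2D')
</code></pre></div></div>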
<h2 id="learnings">Learnings</h2>
<ul>
<li>Don’t use the <code class="language-plaintext highlighter-rouge">BETWEEN</code> operator in SQL. Especially not for time intervals. Because the operator creates an inclusive (closed) interval, the result may not be what you expect.</li>
<li>Use a <code class="language-plaintext highlighter-rouge">WHERE</code> clause with simple comparison operators instead, to create a right open interval.</li>
</ul>
<hr />
<p class="attribution">"<a target="_blank" rel="noopener noreferrer" href="https://www.newgrounds.com/art/view/platinumfusi0n/grug">Grug</a>" by <a target="_blank" rel="noopener noreferrer" href="https://platinumfusi0n.newgrounds.com/">PlatinumFusi0n</a> is licensed under <a target="_blank" rel="noopener noreferrer" href="https://creativecommons.org/licenses/by/3.0/">CC BY 3.0 <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /></a>. </p>Druid 28 Sneak Peek: Ingesting Multiple Kafka Topics into One Datasource2023-10-29T00:00:00+02:002023-10-29T00:00:00+02:00/2023/10/29/druid-28-sneak-peek-ingesting-multiple-kafka-topics-into-one-datasource<p><img src="/assets/2022-11-23-00-pizza.jpg" alt="Pizza" /></p>
<p><a href="https://druid.apache.org/">Apache Druid</a> has the concept of <a href="https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion">supervisors</a> that orchestrate ingestion jobs and handle data handoff and failure recovery. Per datasource, you can have exactly one supervisor.</p>
<p>Until recently, that meant that one datasource could only ingest data from one stream. But many of my customers asked whether it would be possible to multiplex several streams into one datasource. With Druid 28, this becomes possible!</p>
<p>In this quick tutorial, you will learn how to utilize the new options in Kafka ingestion so as to stream multiple topics into one Druid datasource. You will need:</p>
<ul>
<li>a Druid 28 preview build (see below)</li>
<li>any Kafka installation</li>
<li>a test data generator: I am using <a href="https://github.com/Aiven-Labs/python-fake-data-producer-for-apache-kafka">Francesco’s pizza simulator</a>.</li>
</ul>
<h2 id="building-the-distribution">Building the distribution</h2>
<p>You can use the 30-day free trial of <a href="https://imply.io/download-imply/">Imply’s Druid release</a>, which already contains the new features. <a href="https://docs.imply.io/latest/druid/development/extensions-core/kafka-supervisor-reference/#ingesting-from-multiple-topics">Documentation is also available</a>.</p>
<p>But if you want to build the open source version:</p>
<p>Clone the Druid repository, check out the 28.0.0 branch, and build the tarball:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/apache/druid.git
<span class="nb">cd </span>druid
git checkout 28.0.0
mvn clean <span class="nb">install</span> <span class="nt">-Pdist</span> <span class="nt">-DskipTests</span>
</code></pre></div></div>
<p>Then follow the <a href="https://druid.apache.org/docs/latest/development/build.html">instructions</a> to locate and install the tarball, and start Druid. Make sure you are <a href="https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion#load-the-kafka-indexing-service">loading the Kafka indexing extension</a>. (It is included in the quickstart but not by default in the Docker image.)</p>
<h2 id="generating-test-data">Generating test data</h2>
<p>I am assuming that you are running Kafka locally on the standard port and that you have enabled auto topic creation.</p>
<p>Clone the simulator repository, change to the simulator directory and run three instances of pizza delivery:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python3 main.py <span class="nt">--host</span> localhost <span class="nt">--port</span> 9092 <span class="nt">--topic-name</span> pizza-mario <span class="nt">--max-waiting-time</span> 5 <span class="nt">--security-protocol</span> PLAINTEXT <span class="nt">--nr-messages</span> 0 <span class="o">></span>/dev/null &
python3 main.py <span class="nt">--host</span> localhost <span class="nt">--port</span> 9092 <span class="nt">--topic-name</span> pizza-luigi <span class="nt">--max-waiting-time</span> 5 <span class="nt">--security-protocol</span> PLAINTEXT <span class="nt">--nr-messages</span> 0 <span class="o">></span>/dev/null &
python3 main.py <span class="nt">--host</span> localhost <span class="nt">--port</span> 9092 <span class="nt">--topic-name</span> my-pizza <span class="nt">--max-waiting-time</span> 5 <span class="nt">--security-protocol</span> PLAINTEXT <span class="nt">--nr-messages</span> 0 <span class="o">></span>/dev/null &
</code></pre></div></div>
<p>If you have set up Kafka differently, you may have to modify these instructions.</p>
<h2 id="connecting-druid-to-the-streams">Connecting Druid to the streams</h2>
<p>Navigate your browser to the Druid GUI (in the quickstart, this is http://localhost:8888), and start configuring a streaming ingestion:</p>
<p><img src="/assets/2023-10-29-01-streaming.jpg" width="35%" /></p>
<p>Choose Kafka as the input source. Note how there is a new option <code class="language-plaintext highlighter-rouge">topicPattern</code> in the connection settings:</p>
<p><img src="/assets/2023-10-29-02-pattern-setting.jpg" alt="Connection screen" /></p>
<p>This is a <a href="https://en.wikipedia.org/wiki/Regular_expression">regular expression</a> that you can specify in place of the topic name. Let’s try to gobble up all our pizza-related topics by setting the pattern to <em>“pizza”</em>:</p>
<p><img src="/assets/2023-10-29-03-naive-pattern.jpg" alt="Naive attempt" /></p>
<p>Oh, this didn’t work as expected. But the documentation and the tooltip show us the solution: the topic pattern has to match <em>the entire topic name</em>. So the above expression actually behaves like the regular expression <code class="language-plaintext highlighter-rouge">^pizza$</code>.</p>
<p>Armed with this knowledge, let’s correct the pattern:</p>
<p><img src="/assets/2023-10-29-04-match-both.jpg" alt="Preview with prefix match" /></p>
<p>This matches all topic names that start with <em>“pizza-“</em>.</p>
<h2 id="building-the-data-model">Building the data model</h2>
<p>Let’s have a look at the <code class="language-plaintext highlighter-rouge">Parse data</code> screen. Among the <a href="/2022/11/23/processing-nested-json-data-and-kafka-metadata-in-apache-druid/">Kafka metadata</a>, there is a new field containing the source topic for each row of data. The default column name is <code class="language-plaintext highlighter-rouge">kafka.topic</code> but this is configurable in the Kafka metadata settings on the right hand side:</p>
<p><img src="/assets/2023-10-29-05-topic-field.jpg" alt="Parse screen with metadata settings" /></p>
<p>Proceed to the final data model - the topic name is automatically included as a <code class="language-plaintext highlighter-rouge">string</code> column:</p>
<p><img src="/assets/2023-10-29-06-data-model.jpg" alt="Data model" /></p>
<p>Before kicking off the ingestion job, you may want to review and edit the datasource name</p>
<p><img src="/assets/2023-10-29-07-rename-datasource.jpg" width="40%" /></p>
<p>because by default, the datasource name is derived from the topic pattern and may contain a lot of special characters.</p>
<p>Once the supervisor is running, you should see data coming in from both the <code class="language-plaintext highlighter-rouge">pizza-mario</code> and <code class="language-plaintext highlighter-rouge">pizza-luigi</code> topics:</p>
<p><img src="/assets/2023-10-29-08-query.jpg" alt="Query example" /></p>
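<p>For example, a query along the following lines shows how the ingested rows are distributed over the source topics. (The datasource name <em>pizza</em> is an assumption - use whatever name you chose in the wizard, and the metadata column name if you changed it from the default <code class="language-plaintext highlighter-rouge">kafka.topic</code>.)</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT "kafka.topic", COUNT(*) AS "rows"
FROM "pizza"
GROUP BY 1
ORDER BY 2 DESC
</code></pre></div></div>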
<p>What if you want to pick up all 3 topics? From the above, it should be clear - you need to pad the regular expression with <code class="language-plaintext highlighter-rouge">.*</code> on <em>both</em> sides:</p>
<p><img src="/assets/2023-10-29-09-open-pattern.jpg" width="30%" /></p>
<p>You can try it yourself!</p>
<h2 id="task-management">Task management</h2>
<p>Druid will pick up any topics that match the <code class="language-plaintext highlighter-rouge">topicPattern</code>, even if new topics are added during the ingestion.</p>
<p>How are partitions assigned to tasks?</p>
<p>The Supervisor fetches the list of all partitions from all topics and assigns these partitions in the same way as it assigns the partitions of a single topic. In detail this means (quote from the <a href="https://docs.imply.io/latest/druid/development/extensions-core/kafka-supervisor-reference/#ingesting-from-multiple-topics">documentation</a>):</p>
<blockquote>
<p>When ingesting data from multiple topics, partitions are assigned based on the hashcode of the topic name and the id of the partition within that topic. The partition assignment might not be uniform across all the tasks.</p>
</blockquote>
<p>And looking at the code, this boils down to</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Math.abs(31 * topic().hashCode() + partitionId) % taskCount
</code></pre></div></div>
<p>This heuristic should give a fairly uniform load, provided that the data volumes per <em>partition</em> are comparable.</p>
<h2 id="conclusion">Conclusion</h2>
<ul>
<li>You can use <code class="language-plaintext highlighter-rouge">topicPattern</code> instead of <code class="language-plaintext highlighter-rouge">topic</code> in a Kafka Supervisor spec, to enable ingesting from multiple topics.</li>
<li><code class="language-plaintext highlighter-rouge">topicPattern</code> is a regex, but it has to match the whole topic name.</li>
<li>You can have as many active ingestion tasks as the total number of partitions across all topics. Partitions are assigned to tasks using a hashing algorithm.</li>
</ul>
<hr />
<p class="attribution">"<a target="_blank" rel="noopener noreferrer" href="https://www.flickr.com/photos/26242865@N04/5919366429">Pizza</a>" by <a target="_blank" rel="noopener noreferrer" href="https://www.flickr.com/photos/26242865@N04">Katrin Gilger</a> is licensed under <a target="_blank" rel="noopener noreferrer" href="https://creativecommons.org/licenses/by-sa/2.0/?ref=openverse">CC BY-SA 2.0 <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /></a>. </p>New in Imply Polaris: Data Retention Policy2023-09-24T00:00:00+02:002023-09-24T00:00:00+02:00/2023/09/24/new-in-imply-polaris-data-retention-policy<p><a href="https://druid.apache.org/">Apache Druid</a> has always had built-in data lifecycle management by way of <a href="https://druid.apache.org/docs/latest/operations/rule-configuration/">retention rules</a>. Specifying fixed time intervals or relative periods, you would tell Druid to retain only data segments that are not older than <em>x</em> days.</p>
<p>The <a href="https://docs.imply.io/polaris/release#20230816">mid-August release</a> of Polaris brings retention management to Imply Polaris, the fully managed analytics service powered by Druid. You can set the retention policy by table. Here is how it’s done:</p>
<p>In the <em>Tables</em> view, select the <code class="language-plaintext highlighter-rouge">...</code> menu for the table that you want to set the retention policy for.</p>
<p><img src="/assets/2023-09-24-01.jpg" alt="Tables view with context menu" /></p>
<p>In the <em>Edit table</em> screen, find the barrel icon with <code class="language-plaintext highlighter-rouge">Data retention</code> next to it. Select <code class="language-plaintext highlighter-rouge">Specific</code>, and enter the desired period. The format is <a href="https://en.wikipedia.org/wiki/ISO_8601#Durations">ISO-8601 duration</a>, so for instance, <code class="language-plaintext highlighter-rouge">P7D</code> means 7 days (before the current date). Any data that is older (by primary timestamp) is dropped from the table and permanently deleted after 30 days.</p>
<p><img src="/assets/2023-09-24-02.jpg" alt="Table editor with retention menu" /></p>
<p>Then hit <code class="language-plaintext highlighter-rouge">Update</code> to apply the changes.</p>
<h2 id="conclusion">Conclusion</h2>
<ul>
<li>Data retention management is now available in Polaris.</li>
<li>Unlike Druid’s default behavior (which retains data in deep storage indefinitely), data dropped from Polaris will be deleted permanently after 30 days.</li>
</ul>New in Apache Druid 27: Querying Deep Storage2023-09-07T00:00:00+02:002023-09-07T00:00:00+02:00/2023/09/07/new-in-apache-druid-27-querying-deep-storage<p>In realtime analytics, a common scenario is that you want to retain a lot of (years of) historical data in order to run analytics over a longer period of time. But these analytical queries occur infrequently and their performance is usually not critical. The bulk of everyday queries, however, accesses only a limited set of relatively fresh data, typically 1 or 2 weeks worth.</p>
<p>In the standard configuration of Druid, until recently you would have to preload all data that you wanted to be queryable to your data servers. That would mean a lot of local storage would be required, most of which would be accessed very rarely. You could mitigate this problem to a certain extent using <a href="https://druid.apache.org/docs/latest/operations/mixed-workloads#historical-tiering">data tiering</a>, but the cost associated with just having that storage around would still be considerable.</p>
<p>Druid 27 comes with the ability to <a href="https://druid.apache.org/docs/latest/querying/query-deep-storage">query deep storage</a> directly, meaning in the above scenario you can actually keep only your 1-2 weeks of fresh data on local SSDs and retain all your historical data in deep storage only. Because of the higher latency of cloud storage, deep storage queries are generally executed asynchronously, and there is a new API endpoint just for deep storage queries.</p>
<p>Let’s run a small example to learn how deep storage query is configured and used!</p>
<p>This tutorial works with the Druid 27 quickstart.</p>
<h2 id="building-the-test-data-set">Building the test data set</h2>
<p>Ingest the <em>wikipedia</em> example data set. We want to have a bunch of segments so let’s partition by hour. You can configure the ingestion job using the wizard, or just use this SQL statement:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"wikipedia"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span><span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}'</span><span class="p">,</span>
<span class="s1">'{"type":"json"}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"isRobot"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"channel"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"timestamp"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"flags"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isUnpatrolled"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"page"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"diffUrl"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"added"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"comment"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"commentLength"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"isNew"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isMinor"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"delta"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"isAnonymous"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"user"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"deltaBucket"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"deleted"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"namespace"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"cityName"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"countryName"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"regionIsoCode"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"metroCode"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"countryIsoCode"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"regionName"</span> <span class="nb">VARCHAR</span><span class="p">))</span>
<span class="k">SELECT</span>
<span class="n">TIME_PARSE</span><span class="p">(</span><span class="nv">"timestamp"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"isRobot"</span><span class="p">,</span>
<span class="nv">"channel"</span><span class="p">,</span>
<span class="nv">"flags"</span><span class="p">,</span>
<span class="nv">"isUnpatrolled"</span><span class="p">,</span>
<span class="nv">"page"</span><span class="p">,</span>
<span class="nv">"diffUrl"</span><span class="p">,</span>
<span class="nv">"added"</span><span class="p">,</span>
<span class="nv">"comment"</span><span class="p">,</span>
<span class="nv">"commentLength"</span><span class="p">,</span>
<span class="nv">"isNew"</span><span class="p">,</span>
<span class="nv">"isMinor"</span><span class="p">,</span>
<span class="nv">"delta"</span><span class="p">,</span>
<span class="nv">"isAnonymous"</span><span class="p">,</span>
<span class="nv">"user"</span><span class="p">,</span>
<span class="nv">"deltaBucket"</span><span class="p">,</span>
<span class="nv">"deleted"</span><span class="p">,</span>
<span class="nv">"namespace"</span><span class="p">,</span>
<span class="nv">"cityName"</span><span class="p">,</span>
<span class="nv">"countryName"</span><span class="p">,</span>
<span class="nv">"regionIsoCode"</span><span class="p">,</span>
<span class="nv">"metroCode"</span><span class="p">,</span>
<span class="nv">"countryIsoCode"</span><span class="p">,</span>
<span class="nv">"regionName"</span>
<span class="k">FROM</span> <span class="nv">"ext"</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="n">HOUR</span>
</code></pre></div></div>
<p>You should end up with 22 segments, each spanning an hour.</p>
<h2 id="recap-retention-rules">Recap: Retention rules</h2>
<p>By default, Druid retains all data in deep storage that it has ever ingested. You have to run an explicit <a href="https://druid.apache.org/docs/latest/tutorials/tutorial-delete-data#run-a-kill-task">kill task</a> to delete data permanently.</p>
<p>However, standard Druid queries can only work with data segments that have been preloaded to the data servers. Preloading of data is configured using <a href="https://druid.apache.org/docs/latest/operations/rule-configuration">retention rules</a>, which you can configure on a per-datasource basis. Retention rules are evaluated for each segment, from top to bottom, until a rule is found that matches the segment in question. Each rule is either a <em>Load</em> rule (which tells the Coordinator to make that segment available for queries), or a <em>Drop</em> rule (which removes the segment from the list of available segments.) Rules specify either a time period (relative to the current time), or an absolute time interval.</p>
<p>In production setups you would usually find period rules (“retain only data for the last 2 weeks”), but for the tutorial we are going to use interval rules because we are working with a fixed dataset.</p>
<h2 id="first-attempt-to-configure-deep-storage-query">First attempt to configure deep storage query</h2>
<p>The data sample includes one day’s worth of data. Let’s <em>load</em> all data from noon onward, and <em>drop</em> all data from before noon, and see if we can query the data using the endpoint for deep storage.</p>
<p>Here is the first set of retention rules:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"interval"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2016-06-27T12:00:00.000Z/2020-01-01T00:00:00.000Z"</span><span class="p">,</span><span class="w">
</span><span class="nl">"tieredReplicants"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"_default_tier"</span><span class="p">:</span><span class="w"> </span><span class="mi">2</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"useDefaultTierForNull"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"loadByInterval"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"dropForever"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>
<p>If you run a standard query in the console, you see that the rules have been applied:</p>
<p><img src="/assets/2023-09-07-01-query-historical.jpg" alt="Query using standard engine, showing 10 segments" /></p>
<p>Using <code class="language-plaintext highlighter-rouge">curl</code>, I am sending the same query to <a href="https://druid.apache.org/docs/latest/api-reference/sql-api#query-from-deep-storage">the endpoint for deep storage query</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -L -H 'Content-Type: application/json' localhost:8888/druid/v2/sql/statements -d'{
"query": "SELECT DATE_TRUNC('\''hour'\'', __time), COUNT(*) FROM \"wikipedia\" GROUP BY 1 ORDER BY 1",
"context":{
"executionMode":"ASYNC"
}
}'
{"queryId":"query-db8b79ae-f28b-466e-b876-3f987d0e87fc","state":"ACCEPTED","createdAt":"2023-09-06T11:33:39.839Z","schema":[{"name":"EXPR$0","type":"TIMESTAMP","nativeType":"LONG"},{"name":"EXPR$1","type":"BIGINT","nativeType":"LONG"}],"durationMs":-1}
</code></pre></div></div>
<p>This is an asynchronous endpoint - it returns immediately and hands me back a query ID. I have to append the query ID to the URL in order to poll the status and eventually get the result:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -L -H 'Content-Type: application/json' localhost:8888/druid/v2/sql/statements/query-db8b79ae-f28b-466e-b876-3f987d0e87fc
{"queryId":"query-db8b79ae-f28b-466e-b876-3f987d0e87fc","state":"SUCCESS","createdAt":"2023-09-06T11:33:39.839Z","schema":[{"name":"EXPR$0","type":"TIMESTAMP","nativeType":"LONG"},{"name":"EXPR$1","type":"BIGINT","nativeType":"LONG"}],"durationMs":13944,"result":{"numTotalRows":10,"totalSizeInBytes":374,"dataSource":"__query_select","sampleRecords":[[1467028800000,1219],[1467032400000,1211],[1467036000000,1353],[1467039600000,1422],[1467043200000,1442],[1467046800000,1339],[1467050400000,1321],[1467054000000,1175],[1467057600000,1213],[1467061200000,603]],"pages":[{"id":0,"numRows":10,"sizeInBytes":374}]}}
</code></pre></div></div>
<p>Oops. We got the same ten rows as from the interactive query. The naïve approach of just dropping the segments didn’t work. Or rather, it worked as intended.</p>
<h2 id="doing-it-right">Doing it right</h2>
<p>Druid actually distinguishes whether a segment is <em>unavailable</em> (and exists in deep storage only) or whether it is <em>available but not preloaded</em>, which is a new thing in Druid 27. The latter case is expressed by configuring a <em>load</em> rule for that segment, <em>but with a replication factor of 0</em>.</p>
<p>Also worth noting is that at least one segment for the datasource in question has to be preloaded, or else Druid won’t be able to query it at all.</p>
<p>So instead of dropping the segments, let’s load them with a replication factor of 0:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"interval"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2016-06-27T12:00:00.000Z/2020-01-01T00:00:00.000Z"</span><span class="p">,</span><span class="w">
</span><span class="nl">"tieredReplicants"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"_default_tier"</span><span class="p">:</span><span class="w"> </span><span class="mi">2</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"useDefaultTierForNull"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"loadByInterval"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"interval"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2010-01-01T00:00:00.000Z/2016-06-27T12:00:00.000Z"</span><span class="p">,</span><span class="w">
</span><span class="nl">"tieredReplicants"</span><span class="p">:</span><span class="w"> </span><span class="p">{},</span><span class="w">
</span><span class="nl">"useDefaultTierForNull"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"loadByInterval"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>
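<p>You can set these rules in the console, or submit them programmatically through the Coordinator rules API. Here is a sketch, assuming the rules JSON above is saved as <code class="language-plaintext highlighter-rouge">rules.json</code> and the datasource is named <code class="language-plaintext highlighter-rouge">wikipedia</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># POST the retention rules; the Router on port 8888 proxies this to the Coordinator
curl -X POST -H 'Content-Type: application/json' \
  localhost:8888/druid/coordinator/v1/rules/wikipedia \
  -d @rules.json
</code></pre></div></div>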
<p>This is what the rules look like in the console view:</p>
<p><img src="/assets/2023-09-07-04-final-load-rules.jpg" width="75%" /></p>
<p>Use the <em>Mark as used all segments</em> function to force the Coordinator to reapply the retention rules:</p>
<p><img src="/assets/2023-09-07-02-reapply-coordinator-rules.jpg" width="60%" /></p>
<p>This forces the morning segments to be available for asynchronous query only. You will see this reflected in the <code class="language-plaintext highlighter-rouge">Datasources</code> view like this:</p>
<p><img src="/assets/2023-09-07-03-segments-preloaded.jpg" width="52%" /></p>
<p>Then run the same query again:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -L -H 'Content-Type: application/json' localhost:8888/druid/v2/sql/statements -d'{
"query": "SELECT DATE_TRUNC('\''hour'\'', __time), COUNT(*) FROM \"wikipedia\" GROUP BY 1 ORDER BY 1",
"context":{
"executionMode":"ASYNC"
}
}'
{"queryId":"query-7f972571-b26e-4206-a7a8-61503d386d4b","state":"ACCEPTED","createdAt":"2023-09-06T11:38:57.369Z","schema":[{"name":"EXPR$0","type":"TIMESTAMP","nativeType":"LONG"},{"name":"EXPR$1","type":"BIGINT","nativeType":"LONG"}],"durationMs":-1}
</code></pre></div></div>
<p>This time, the result has 22 rows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -L -H 'Content-Type: application/json' localhost:8888/druid/v2/sql/statements/query-7f972571-b26e-4206-a7a8-61503d386d4b
{"queryId":"query-7f972571-b26e-4206-a7a8-61503d386d4b","state":"SUCCESS","createdAt":"2023-09-06T11:38:57.369Z","schema":[{"name":"EXPR$0","type":"TIMESTAMP","nativeType":"LONG"},{"name":"EXPR$1","type":"BIGINT","nativeType":"LONG"}],"durationMs":14294,"result":{"numTotalRows":22,"totalSizeInBytes":782,"dataSource":"__query_select","sampleRecords":[[1466985600000,876],[1466989200000,870],[1466992800000,960],[1466996400000,1025],[1467000000000,936],[1467003600000,836],[1467007200000,969],[1467010800000,1135],[1467014400000,1141],[1467018000000,1137],[1467021600000,1135],[1467025200000,1115],[1467028800000,1219],[1467032400000,1211],[1467036000000,1353],[1467039600000,1422],[1467043200000,1442],[1467046800000,1339],[1467050400000,1321],[1467054000000,1175],[1467057600000,1213],[1467061200000,603]],"pages":[{"id":0,"numRows":22,"sizeInBytes":782}]}}
</code></pre></div></div>
<p>We have successfully queried data that partially exists in deep storage only!</p>
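<p>As an aside, the status response only contains a limited number of <code class="language-plaintext highlighter-rouge">sampleRecords</code>. To download the complete result set, append <code class="language-plaintext highlighter-rouge">/results</code> to the status URL - for instance, with the query ID from above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Retrieve the full result set of a finished deep storage query
curl -L localhost:8888/druid/v2/sql/statements/query-7f972571-b26e-4206-a7a8-61503d386d4b/results
</code></pre></div></div>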
<h2 id="learnings">Learnings</h2>
<ul>
<li>Deep storage query is a great new feature that helps organizations run Druid in a cost-effective way while retaining the ability to query large amounts of historical data.</li>
<li>There is a new API endpoint for queries that include segments from deep storage. These queries run asynchronously.</li>
<li>You have to configure a <em>load</em> rule with a replication factor of 0 in order to make segments available for deep storage queries.</li>
<li>At least one segment of a datasource needs to be preloaded on the historical servers in order to run deep storage queries.</li>
</ul>Using Druid with MinIO2023-08-29T00:00:00+02:002023-08-29T00:00:00+02:00/2023/08/29/using-druid-with-minio<p>With on premise setups, compute/storage separation is often implemented using a NAS or similar storage unit that exposes an S3 API endpoint.</p>
<p>I want to emulate S3-related behavior in a self-contained demo that I can run on my laptop without an internet connection. This is conveniently done using MinIO as my S3-compatible storage.</p>
<p>Let’s deploy MinIO using this docker compose file:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">version</span><span class="pi">:</span> <span class="s2">"</span><span class="s">3"</span>
<span class="na">services</span><span class="pi">:</span>
<span class="na">minio</span><span class="pi">:</span>
<span class="na">image</span><span class="pi">:</span> <span class="s">minio/minio</span>
<span class="na">container_name</span><span class="pi">:</span> <span class="s">minio</span>
<span class="na">environment</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">MINIO_ROOT_USER=admin</span>
<span class="pi">-</span> <span class="s">MINIO_ROOT_PASSWORD=password</span>
<span class="pi">-</span> <span class="s">MINIO_DOMAIN=minio</span>
<span class="na">networks</span><span class="pi">:</span>
<span class="na">minio_net</span><span class="pi">:</span>
<span class="na">aliases</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">druid.minio</span>
<span class="na">ports</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">9001:9001</span>
<span class="pi">-</span> <span class="s">9000:9000</span>
<span class="na">command</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">server"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">/data"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">--console-address"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">:9001"</span><span class="pi">]</span>
<span class="na">mc</span><span class="pi">:</span>
<span class="na">depends_on</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">minio</span>
<span class="na">image</span><span class="pi">:</span> <span class="s">minio/mc</span>
<span class="na">container_name</span><span class="pi">:</span> <span class="s">mc</span>
<span class="na">networks</span><span class="pi">:</span>
<span class="na">minio_net</span><span class="pi">:</span>
<span class="na">environment</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">AWS_ACCESS_KEY_ID=admin</span>
<span class="pi">-</span> <span class="s">AWS_SECRET_ACCESS_KEY=password</span>
<span class="pi">-</span> <span class="s">AWS_REGION=us-east-1</span>
<span class="na">entrypoint</span><span class="pi">:</span> <span class="pi">></span>
<span class="s">/bin/sh -c "</span>
<span class="s">until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;</span>
<span class="s">/usr/bin/mc rm -r --force minio/indata;</span>
<span class="s">/usr/bin/mc mb minio/indata;</span>
<span class="s">/usr/bin/mc policy set public minio/indata;</span>
<span class="s">/usr/bin/mc rm -r --force minio/deepstorage;</span>
<span class="s">/usr/bin/mc mb minio/deepstorage;</span>
<span class="s">/usr/bin/mc policy set public minio/deepstorage;</span>
<span class="s">tail -f /dev/null</span>
<span class="s">"</span>
<span class="na">networks</span><span class="pi">:</span>
<span class="na">minio_net</span><span class="pi">:</span>
</code></pre></div></div>
<p>Save this file as <code class="language-plaintext highlighter-rouge">docker-compose.yaml</code> to your work directory and run the command</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker compose up <span class="nt">-d</span>
</code></pre></div></div>
<p>This gives us a MinIO instance and the <code class="language-plaintext highlighter-rouge">mc</code> client. It will also automatically create two buckets in MinIO, named <code class="language-plaintext highlighter-rouge">indata</code> and <code class="language-plaintext highlighter-rouge">deepstorage</code>, that we will need for this tutorial. If you point your browser to the MinIO console at localhost:9001, you can verify that the buckets have been created:</p>
<p><img src="/assets/2023-08-29-01-minio-buckets.jpg" alt="MinIO Bucket Explorer screenshot" /></p>
<p>(Kudos to <a href="https://github.com/tabular-io/docker-spark-iceberg">Tabular</a> from whose GitHub repository I adapted the docker compose file.)</p>
<h2 id="configuring-minio-as-deep-storage-and-log-target">Configuring MinIO as deep storage and log target</h2>
<p>I am using the standard Druid 27.0 quickstart. If you want to start Druid using the new <code class="language-plaintext highlighter-rouge">start-druid</code> script, you will find the relevant configuration settings in <code class="language-plaintext highlighter-rouge">conf/druid/auto/_common/common.runtime.properties</code> under your Druid installation directory.</p>
<p>First of all, we need to load the S3 extension by adding it to the load list - it should look similar to this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>druid.extensions.loadList=["druid-s3-extensions", "druid-hdfs-storage", "druid-kafka-indexing-service", "druid-datasketches", "druid-multi-stage-query"]
</code></pre></div></div>
<p>Also configure the S3 default settings (endpoint, authentication):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>druid.s3.accessKey=admin
druid.s3.secretKey=password
druid.s3.protocol=http
druid.s3.enablePathStyleAccess=true
druid.s3.endpoint.signingRegion=us-east-1
druid.s3.endpoint.url=http://localhost:9000/
</code></pre></div></div>
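<p>Before going any further, you can do a quick sanity check that the endpoint is reachable - MinIO exposes a simple liveness probe:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Should return HTTP 200 while the MinIO server is up
curl -i http://localhost:9000/minio/health/live
</code></pre></div></div>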
<p>For using MinIO as deep storage, comment out the default settings for <code class="language-plaintext highlighter-rouge">druid.storage.*</code>, and insert this section instead:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>druid.storage.type=s3
druid.storage.bucket=deepstorage
druid.storage.baseKey=segments
</code></pre></div></div>
<p>Likewise, change the default configuration for the indexer logs to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=deepstorage
druid.indexer.logs.s3Prefix=indexing-logs
</code></pre></div></div>
<p>Then start Druid like this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bin/start-druid <span class="nt">-m5g</span>
</code></pre></div></div>
<h2 id="ingesting-data-from-minio">Ingesting data from MinIO</h2>
<p>By default, Druid uses the same settings in <code class="language-plaintext highlighter-rouge">common.runtime.properties</code> for ingestion from S3, too. So, for instance, you can upload the <code class="language-plaintext highlighter-rouge">wikipedia</code> data sample to the <code class="language-plaintext highlighter-rouge">indata</code> bucket in your MinIO instance, taking advantage of the same settings as for deep storage. Just use <code class="language-plaintext highlighter-rouge">s3://indata/</code> as the S3 prefix in the ingestion wizard, and it should work out of the box.</p>
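<p>Any S3 compatible client will do for the upload. A sketch using the AWS CLI, assuming you run it from your Druid install directory where the quickstart sample file lives:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Point the AWS CLI at MinIO and upload the sample data
export AWS_ACCESS_KEY_ID=admin
export AWS_SECRET_ACCESS_KEY=password
aws --endpoint-url http://localhost:9000 s3 cp \
  quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz s3://indata/
</code></pre></div></div>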
<p>Here is my example JSON ingestion spec:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"spec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"ioConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"inputSource"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"s3"</span><span class="p">,</span><span class="w">
</span><span class="nl">"prefixes"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"s3://indata/"</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"inputFormat"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"json"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"tuningConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"partitionsSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"dynamic"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dataSchema"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"dataSource"</span><span class="p">:</span><span class="w"> </span><span class="s2">"wikipedia_s3_2"</span><span class="p">,</span><span class="w">
</span><span class="nl">"timestampSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"column"</span><span class="p">:</span><span class="w"> </span><span class="s2">"time"</span><span class="p">,</span><span class="w">
</span><span class="nl">"format"</span><span class="p">:</span><span class="w"> </span><span class="s2">"iso"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"granularitySpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"queryGranularity"</span><span class="p">:</span><span class="w"> </span><span class="s2">"none"</span><span class="p">,</span><span class="w">
</span><span class="nl">"rollup"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"segmentGranularity"</span><span class="p">:</span><span class="w"> </span><span class="s2">"day"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dimensionsSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"dimensions"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"channel"</span><span class="p">,</span><span class="w">
</span><span class="s2">"cityName"</span><span class="p">,</span><span class="w">
</span><span class="s2">"comment"</span><span class="p">,</span><span class="w">
</span><span class="s2">"countryIsoCode"</span><span class="p">,</span><span class="w">
</span><span class="s2">"countryName"</span><span class="p">,</span><span class="w">
</span><span class="s2">"isAnonymous"</span><span class="p">,</span><span class="w">
</span><span class="s2">"isMinor"</span><span class="p">,</span><span class="w">
</span><span class="s2">"isNew"</span><span class="p">,</span><span class="w">
</span><span class="s2">"isRobot"</span><span class="p">,</span><span class="w">
</span><span class="s2">"isUnpatrolled"</span><span class="p">,</span><span class="w">
</span><span class="s2">"metroCode"</span><span class="p">,</span><span class="w">
</span><span class="s2">"namespace"</span><span class="p">,</span><span class="w">
</span><span class="s2">"page"</span><span class="p">,</span><span class="w">
</span><span class="s2">"regionIsoCode"</span><span class="p">,</span><span class="w">
</span><span class="s2">"regionName"</span><span class="p">,</span><span class="w">
</span><span class="s2">"user"</span><span class="p">,</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"long"</span><span class="p">,</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"delta"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"long"</span><span class="p">,</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"added"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"long"</span><span class="p">,</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"deleted"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Or in SQL (using the automatic conversion function):</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"wikipedia_s3_2"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"source"</span> <span class="k">AS</span> <span class="p">(</span><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"s3","prefixes":["s3://indata/"]}'</span><span class="p">,</span>
<span class="s1">'{"type":"json"}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"time"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"channel"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"cityName"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"comment"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"countryIsoCode"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"countryName"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isAnonymous"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isMinor"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isNew"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isRobot"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isUnpatrolled"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"metroCode"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"namespace"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"page"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"regionIsoCode"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"regionName"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"user"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"delta"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"added"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"deleted"</span> <span class="nb">BIGINT</span><span class="p">))</span>
<span class="k">SELECT</span>
<span class="n">TIME_PARSE</span><span class="p">(</span><span class="nv">"time"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"channel"</span><span class="p">,</span>
<span class="nv">"cityName"</span><span class="p">,</span>
<span class="nv">"comment"</span><span class="p">,</span>
<span class="nv">"countryIsoCode"</span><span class="p">,</span>
<span class="nv">"countryName"</span><span class="p">,</span>
<span class="nv">"isAnonymous"</span><span class="p">,</span>
<span class="nv">"isMinor"</span><span class="p">,</span>
<span class="nv">"isNew"</span><span class="p">,</span>
<span class="nv">"isRobot"</span><span class="p">,</span>
<span class="nv">"isUnpatrolled"</span><span class="p">,</span>
<span class="nv">"metroCode"</span><span class="p">,</span>
<span class="nv">"namespace"</span><span class="p">,</span>
<span class="nv">"page"</span><span class="p">,</span>
<span class="nv">"regionIsoCode"</span><span class="p">,</span>
<span class="nv">"regionName"</span><span class="p">,</span>
<span class="nv">"user"</span><span class="p">,</span>
<span class="nv">"delta"</span><span class="p">,</span>
<span class="nv">"added"</span><span class="p">,</span>
<span class="nv">"deleted"</span>
<span class="k">FROM</span> <span class="nv">"source"</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">DAY</span>
</code></pre></div></div>
<p>In either case, you can easily verify that both the segment files and the indexer logs end up in MinIO.</p>
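<p>For instance, a recursive listing of the deep storage bucket shows both the segment files and the indexing logs:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Recursively list everything Druid wrote to the deep storage bucket
docker exec mc /usr/bin/mc ls -r minio/deepstorage/
</code></pre></div></div>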
<h2 id="changing-the-endpoint-settings-in-the-ingestion-command">Changing the endpoint settings in the ingestion command</h2>
<p>Now let’s go back to local deep storage, so that we can no longer rely on endpoint settings baked into the service properties file. Instead, we need to establish those settings right in the ingestion spec.</p>
<p>Restore the common properties to their default values and restart Druid. (You still need the S3 extension loaded.)</p>
<h3 id="json-version">JSON version</h3>
<p>Start the wizard as for a standard S3 ingestion. Then switch to the JSON view and edit the S3 settings in the ingestion spec:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w"> </span><span class="nl">"inputSource"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"s3"</span><span class="p">,</span><span class="w">
</span><span class="nl">"prefixes"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"s3://indata/"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"accessKeyId"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"default"</span><span class="p">,</span><span class="w">
</span><span class="nl">"password"</span><span class="p">:</span><span class="w"> </span><span class="s2">"admin"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"secretAccessKey"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"default"</span><span class="p">,</span><span class="w">
</span><span class="nl">"password"</span><span class="p">:</span><span class="w"> </span><span class="s2">"password"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"endpointConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://localhost:9000"</span><span class="p">,</span><span class="w">
</span><span class="nl">"signingRegion"</span><span class="p">:</span><span class="w"> </span><span class="s2">"us-east-1"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"clientConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"disableChunkedEncoding"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"enablePathStyleAccess"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"forceGlobalBucketAccessEnabled"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Note: In this case, because we are using plain HTTP, we need to include the <code class="language-plaintext highlighter-rouge">http://</code> in the endpoint URL. If we put it in <code class="language-plaintext highlighter-rouge">clientConfig.protocol</code> instead, as the sample in the documentation might suggest, it is not recognized.</p>
<h3 id="sql-version">SQL version</h3>
<p>In the SQL version, we copy the same settings into the EXTERN statement, like so:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"wikipedia_s3_2"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"source"</span> <span class="k">AS</span> <span class="p">(</span><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{ "type": "s3", "prefixes": [ "s3://indata/" ], "properties": { "accessKeyId": { "type": "default", "password": "admin" }, "secretAccessKey": { "type": "default", "password": "password" } }, "endpointConfig": { "url": "http://localhost:9000", "signingRegion": "us-east-1" }, "clientConfig": { "disableChunkedEncoding": true, "enablePathStyleAccess": true, "forceGlobalBucketAccessEnabled": false } }'</span><span class="p">,</span>
<span class="s1">'{"type":"json"}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"time"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"channel"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"cityName"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"comment"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"countryIsoCode"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"countryName"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isAnonymous"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isMinor"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isNew"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isRobot"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isUnpatrolled"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"metroCode"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"namespace"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"page"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"regionIsoCode"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"regionName"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"user"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"delta"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"added"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"deleted"</span> <span class="nb">BIGINT</span><span class="p">))</span>
<span class="k">SELECT</span>
<span class="n">TIME_PARSE</span><span class="p">(</span><span class="nv">"time"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"channel"</span><span class="p">,</span>
<span class="nv">"cityName"</span><span class="p">,</span>
<span class="nv">"comment"</span><span class="p">,</span>
<span class="nv">"countryIsoCode"</span><span class="p">,</span>
<span class="nv">"countryName"</span><span class="p">,</span>
<span class="nv">"isAnonymous"</span><span class="p">,</span>
<span class="nv">"isMinor"</span><span class="p">,</span>
<span class="nv">"isNew"</span><span class="p">,</span>
<span class="nv">"isRobot"</span><span class="p">,</span>
<span class="nv">"isUnpatrolled"</span><span class="p">,</span>
<span class="nv">"metroCode"</span><span class="p">,</span>
<span class="nv">"namespace"</span><span class="p">,</span>
<span class="nv">"page"</span><span class="p">,</span>
<span class="nv">"regionIsoCode"</span><span class="p">,</span>
<span class="nv">"regionName"</span><span class="p">,</span>
<span class="nv">"user"</span><span class="p">,</span>
<span class="nv">"delta"</span><span class="p">,</span>
<span class="nv">"added"</span><span class="p">,</span>
<span class="nv">"deleted"</span>
<span class="k">FROM</span> <span class="nv">"source"</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">DAY</span>
</code></pre></div></div>
<p><img src="/assets/2023-08-29-02-druid-msq.jpg" alt="SQL ingestion from the Query tab" /></p>
<h2 id="conclusion">Conclusion</h2>
<ul>
<li>You can use MinIO or another S3 compatible storage with Druid. You configure the endpoint, protocol, and authentication settings in the common properties file.</li>
<li>If you need to ingest from a different MinIO instance, or you want to use MinIO for ingestion only, you can set or override the S3 settings in the ingestion spec. This works both in JSON and SQL mode.</li>
<li>Either way, make sure you have the S3 extension loaded.</li>
</ul>Druid Sneak Peek: Graphical Data Exploration2023-07-30T00:00:00+02:002023-07-30T00:00:00+02:00/2023/07/30/druid-sneak-peek-graphical-data-exploration<p><img src="/assets/2023-07-30-01-timechart.jpg" alt="Screenshot of time chart" /></p>
<p>Druid’s unified console is mostly directed at data management. Among other things, you can control your ingestion tasks, manage segments and their compaction settings, monitor services, and there is also a query manager GUI that understands both SQL and Druid native queries.</p>
<p>For data visualization, up until now you had to use external tools such as Superset or Tableau, or Imply’s own <a href="https://docs.imply.io/latest/pivot-overview/">Pivot</a> that comes bundled with the commercial distribution of the software.</p>
<p>But this is going to change. Druid 28 is going to add an exploration GUI that allows visual analysis of data!</p>
<p>This is a sneak peek into Druid 28 functionality. In order to use the new functions, you can (as of the time of writing) <a href="https://druid.apache.org/docs/latest/development/build.html">build Druid</a> from the HEAD of the master branch:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/apache/druid.git
<span class="nb">cd </span>druid
mvn clean <span class="nb">install</span> <span class="nt">-Pdist</span> <span class="nt">-DskipTests</span>
</code></pre></div></div>
<p>Then follow the instructions to locate and install the tarball.</p>
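<p>The build should leave the tarball under <code class="language-plaintext highlighter-rouge">distribution/target</code>. A sketch for unpacking it - the exact version string in the file name will vary:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Unpack the freshly built distribution and change into it
tar -xzf distribution/target/apache-druid-*-bin.tar.gz -C ~/
cd ~/apache-druid-*
</code></pre></div></div>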
<p><em><strong>Disclaimer:</strong> This tutorial uses undocumented functionality and unreleased code. This blog is neither endorsed by Imply nor by the Apache Druid PMC. It merely collects the results of personal experiments. The features described here might, in the final release, work differently, or not at all. In addition, the entire build, or execution, may fail. Your mileage may vary.</em></p>
<p>For this post, I ingested the Wikipedia sample data, as described in <a href="https://druid.apache.org/docs/latest/tutorials/tutorial-msq-extern.html">the quickstart tutorial</a>. You are of course encouraged to try out different data sets with the new explorer.</p>
<h2 id="how-to-access-the-explorer-view">How to access the Explorer view</h2>
<p>To access the data explorer, go to the three dots <code class="language-plaintext highlighter-rouge">...</code> right next to the Services tab, open the menu and click <code class="language-plaintext highlighter-rouge">Explore</code>:</p>
<p><img src="/assets/2023-07-30-02-select-explore.jpg" alt="Screenshot of console with Explore menu selected" /></p>
<p>You will be greeted with a canvas in the middle, and surrounding GUI controls:</p>
<ul>
<li>In the top left field you select the datasource (table) that you wish to explore.</li>
<li>As soon as a datasource is selected, the left panel shows a list of all fields as they occur in the datasource. The list does not distinguish between dimensions and metrics.</li>
<li>In the top bar you can set filters. Time filters come with an option of relative or absolute times. For character values, there is a regular expression filter as well as the ability to pick literal values.</li>
<li>In the right panel you choose one of the supported visualization types. Depending on your selection, different configuration options appear below. There’s also a <code class="language-plaintext highlighter-rouge">...</code> button, behind which you can find the query history list. This is handy if you want to know which SQL queries are generated by the Explorer.</li>
</ul>
<p>Let’s go through the list of visualization types.</p>
<h2 id="time-chart">Time chart</h2>
<p>The Time chart visualizes the development of a metric over time. This is an area chart, or optionally (if you select a dimension to stack by) a stacked area chart.</p>
<p>It is possible to limit the number of items to be displayed in the stacked dimension.</p>
<p><img src="/assets/2023-07-30-01-timechart.jpg" alt="Screenshot of time chart" /></p>
<p>This visualization allows selecting as metrics:</p>
<ul>
<li>total count</li>
<li>unique count of any column</li>
<li>minimum and maximum of timestamp</li>
<li>for numeric columns, moreover, the standard aggregators <em>sum</em>, <em>min</em>, <em>max</em>, and <em>98th percentile</em>.</li>
</ul>
<p><img src="/assets/2023-07-30-03-timechart-metrics.jpg" width="40%" /></p>
<p>This mechanism of selecting metrics is the same for all other visualizations, too.</p>
<h2 id="bar-chart">Bar chart</h2>
<p>The bar chart displays one bar column (dimension) and one metric. It is possible to sort by a metric other than the one displayed.</p>
<p><img src="/assets/2023-07-30-04-barchart.jpg" alt="Screenshort of bar chart" /></p>
<h2 id="table">Table</h2>
<p>The table chart has the most flexibility in selecting and arranging table fields. These are the options:</p>
<ul>
<li><em>Group by</em>: These are your regular BI dimensions, things to aggregate by. While discrete dimensions just create one row per value, <code class="language-plaintext highlighter-rouge">__time</code> has built-in intelligence: when you select it, you can choose the bucketing (granularity). You can select multiple dimension columns.</li>
<li><em>Show</em>: displays a column without aggregating by it. You could view this as interpreting a dimension as a metric where you pick either the latest value or the number of values. You can add multiple columns here, too.</li>
<li><em>Pivot</em>: This displays a dimension across instead of down. The query mechanism is a bit different: it currently uses filtered metrics with one expression per dimension value.</li>
<li><em>Aggregates</em>: These are the metrics, the selection is the same as for the time chart. But you can have multiple metrics.</li>
</ul>
<p><img src="/assets/2023-07-30-05-table-pivot.jpg" alt="Screenshot of pivot table view" /></p>
<ul>
<li><em>Compares</em>: compare by time interval. You can include multiple comparisons. But Compare and Pivot are for now mutually exclusive.</li>
</ul>
<p>You can sort by any column if you click on the column header.</p>
<p><img src="/assets/2023-07-30-06-table-compare.jpg" alt="Screenshot of time comparison table view" /></p>
<h2 id="pie-chart">Pie chart</h2>
<p>This displays one metric, broken down by one dimension. You can specify the number of named slices; the rest goes into Other.</p>
<p><img src="/assets/2023-07-30-07-piechart.jpg" alt="Screenshot of Pie chart" /></p>
<h2 id="multi-axis-chart">Multi-axis chart</h2>
<p>This is a variety of the time chart, but with many metrics. They are drawn as line charts, overlaid, each with its own scale. The first metric’s axis is displayed to the left; all others are displayed to the right of the chart.</p>
<p><img src="/assets/2023-07-30-08-multi-axis.jpg" alt="Screenshot of multi axis chart" /></p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post, I have shown a glimpse of the upcoming data exploration GUI that is built right into Druid. While this is currently not a replacement for a full BI suite, it is a valuable tool for the data engineer to get a better idea of what the data looks like. This can assist in understanding the distribution of the data and optimizing the data model inside Druid. It’s also valuable when an analyst asks the data team why a particular chart looks the way it does.</p>
<p>Note that the data explorer is not part of any official release (yet), and that it is likely going to change and evolve a lot. Feel free to experiment!</p>Merging Realtime Segments in Apache Druid2023-07-25T00:00:00+02:002023-07-25T00:00:00+02:00/2023/07/25/merging-realtime-segments-in-apache-druid<p>So, you want your realtime analytical queries to be really fast, and that’s why you selected <a href="https://druid.apache.org/">Apache Druid</a>! Today, let’s have a look at another aspect of how Druid achieves its amazing performance.</p>
<h2 id="data-layout-and-druid-performance">Data Layout and Druid Performance</h2>
<p>Druid’s query performance can be influenced by multiple factors in the data layout:</p>
<ul>
<li><strong>Segment size</strong>. The optimum size of a <a href="https://druid.apache.org/docs/latest/design/segments.html">data segment</a> is about 500 MB. If segments are much bigger than that, those segments need more resources for querying and also parallelism suffers. More often you encounter the opposite problem: there are too many small segments, which slows down query performance.</li>
<li><strong>Partitioning and sorting of data</strong>. Partitioning gives an extra performance boost when you can partition the segments according to the expected query pattern; also, inside a segment, data is sorted first by time and then by partitioning key, which further speeds up segment scans by increasing the compression ratio. For this to work you need to enable <a href="https://blog.hellmar-becker.de/2022/01/25/partitioning-in-druid-part-3-multi-dimension-range-partitioning/">range partitioning</a>.</li>
<li><strong>Rollup</strong>. This reduces both storage and query needs by pre-aggregating data. Ideally you want to have <a href="https://druid.apache.org/docs/latest/ingestion/rollup.html#perfect-rollup-vs-best-effort-rollup">perfect rollup</a> so that each unique combination of dimension values corresponds to exactly one aggregate row. For this to work, again one has to use range or hash partitioning. In fact, with range or hash partitioning, rollup is always perfect; with <a href="https://blog.hellmar-becker.de/2022/01/06/partitioning-in-druid-part-1-dynamic-and-hash-partitioning/">dynamic partitioning</a>, rollup is only best effort - the resulting table may be multiple times bigger than the optimum.</li>
</ul>
<p>Let’s find out how Druid optimizes these factors for streaming data - without any external processes!</p>
<h2 id="the-problem-with-streaming-data">The Problem with Streaming Data</h2>
<p>In batch processing, all the above factors can be addressed easily. Streaming data, however, usually does not come in neatly ordered. The point of streaming ingestion is to have these data available for analytics within a split second after an event occurs: and so, segments are built up in memory and persisted frequently. As a result, after <em>hand-off</em> (the process of persisting a segment to deep storage), streaming ingested segments are not optimal:</p>
<ul>
<li>Segments will usually be fragmented and <em>smaller than optimum</em> because we cannot wait long to initiate a handoff. In addition, we may have to juggle multiple time chunks simultaneously because of late arriving data.</li>
<li>Range partitioning requires multiple steps of mapping, shuffling, and merging. This is not possible during streaming ingestion, so <em>the only allowed partitioning scheme is dynamic</em>.</li>
<li>Because data can always be added incrementally, rollup is <em>best effort</em>.</li>
</ul>
<h2 id="managing-the-lifecycle">Managing the Lifecycle</h2>
<p>This is where many databases would add an external maintenance process that reorganizes data. It is the beauty of Druid that it handles this reorganization largely automatically by a process called <em>autocompaction</em>. Here are a few notes in passing about autocompaction and its capabilities.</p>
<p>I discussed autocompaction briefly in <a href="https://blog.hellmar-becker.de/2023/01/22/apache-druid-data-lifecycle-management/">my blog about data lifecycle management in Druid</a>. It is a data compaction process that:</p>
<ul>
<li>is done automatically by the Coordinator in the background</li>
<li>has a simple configuration, either through the Druid API or through the Unified Console GUI</li>
<li>is basically a reindexing job - it takes all the segments for a given time chunk and re-ingests them into the same datasource, creating a new version.</li>
</ul>
<p>Autocompaction can:</p>
<ul>
<li>make sure segments have a <strong>size close to the target value</strong>;</li>
<li>set/modify the <strong>partitioning scheme</strong>;</li>
<li>modify <strong>rollup</strong> settings;</li>
<li>modify <strong>segment granularity</strong>;</li>
<li>modify <strong>query granularity</strong>.</li>
</ul>
<p>It also has a setting to leave the newest data alone so as not to interfere with the ongoing ingestion.</p>
<h3 id="set-partitioning-scheme">Set partitioning scheme</h3>
<p>Because streaming ingestion always produces dynamic partitions, you have to use autocompaction to organize your data in a better scheme. While hash and range partitioning both achieve perfect rollup, range partitioning is recommended for most cases - particularly if you know typical query patterns in advance.</p>
<h3 id="modify-rollup-settings">Modify rollup settings</h3>
<p>You can go from a detail to a rollup table using autocompaction. There are some caveats though: this approach makes sense mostly if you are using the same aggregation functions in your queries and in rollup.</p>
<h3 id="modify-segment-granularity">Modify segment granularity</h3>
<p>Segment granularity defines the time period for each time chunk. If your data volume is low enough to have only one segment per time chunk, you might consider increasing segment granularity: if there is only one segment per time chunk, secondary partitioning will do essentially nothing, so you need to make the time chunks bigger in order to force secondary partitioning into effect.</p>
<p>Make sure segment granularities roll up into each other neatly (for instance, don’t do week to month), or else you are in <a href="https://blog.hellmar-becker.de/2023/01/22/apache-druid-data-lifecycle-management/">for some surprises</a>.</p>
<h3 id="modify-query-granularity">Modify query granularity</h3>
<p>Query granularity defines the aggregation level inside a segment. The primary timestamp will be truncated to the precision defined by the query granularity, and data is aggregated at that level.</p>
<p>You can define additional aggregation during autocompaction by making the query granularity coarser. This is a data lifecycle operation and some organizations use it to retain detail data up to a certain period, and aggregates for older data. When configuring a segment merge autocompaction, you would not usually do this.</p>
<h3 id="configure-grace-period-for-recent-data">Configure grace period for recent data</h3>
<p>Druid will soon have the ability to run ingestion and autocompaction over the same time chunk simultaneously. For now, there’s a setting <code class="language-plaintext highlighter-rouge">skipOffsetFromLatest</code>, which is by default set to <code class="language-plaintext highlighter-rouge">P1D</code> (one day). Its effect is that data younger than that period are left alone by autocompaction, because we anticipate more data to be ingested for that period. Increase this setting if you expect a lot of late arriving data.</p>
<p>This is an <a href="https://en.wikipedia.org/wiki/ISO_8601">ISO 8601</a> time period.</p>
<h2 id="configuring-it-in-the-wizard">Configuring it in the wizard</h2>
<p>Autocompaction can be configured using the unified console wizard or the <a href="https://druid.apache.org/docs/latest/data-management/automatic-compaction.html#compaction-configuration-api">API</a>.</p>
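<p>Via the API, a minimal configuration could look like the sketch below - the datasource name and the partitioning settings are just examples:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Create or update the autocompaction config for a datasource
curl -X POST -H 'Content-Type: application/json' \
  localhost:8888/druid/coordinator/v1/config/compaction -d'{
  "dataSource": "wikipedia",
  "skipOffsetFromLatest": "P1D",
  "tuningConfig": {
    "partitionsSpec": {
      "type": "range",
      "partitionDimensions": ["channel"],
      "targetRowsPerSegment": 5000000
    }
  }
}'
</code></pre></div></div>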
<p>In the console, autocompaction settings can be accessed from the <code class="language-plaintext highlighter-rouge">Datasources</code> tab. Clicking the compaction settings for a datasource opens a dialog for the basic settings like partitioning and recent data grace period:</p>
<p><img src="/assets/2023-07-25-01.jpg" alt="Screenshot of autocompaction wizard" /></p>
<p>For configuring rollup and granularity settings, you have to enter JSON mode and follow the reference in <a href="https://druid.apache.org/docs/latest/data-management/automatic-compaction.html#configure-automatic-compaction">the documentation</a>.</p>
<h2 id="outlook">Outlook</h2>
<p>Autocompaction has been with Druid <a href="https://druid.apache.org/docs/0.13.0-incubating/design/coordinator.html#compacting-segments">since version 0.13</a>, but it has seen a lot of improvement recently. Some notable changes that will (likely) be released in the near future:</p>
<ul>
<li>The algorithm that selects segments for compaction is being tuned to grab segments faster and to use free system resources more efficiently, resulting in a considerable speedup.</li>
<li>Fully concurrent ingestion and autocompaction - so data layout will be optimized on the fly!</li>
<li>A lot more options are available to fine tune autocompaction: refer to <a href="https://druid.apache.org/docs/latest/data-management/automatic-compaction.html">the documentation</a> for more detail!</li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>This article gave only a glimpse into the capabilities of Druid autocompaction. What we learned:</p>
<ul>
<li>Autocompaction is a process that merges and optimizes (among others) realtime ingested segments.</li>
<li>Autocompaction runs automatically in the background. It requires no extra program invocation or scheduler setup.</li>
<li>In addition to merging segments, autocompaction can also perform more advanced data lifecycle management tasks with minimal configuration.</li>
</ul>Analyzing GitHub Stars with Imply Polaris2023-07-12T00:00:00+02:002023-07-12T00:00:00+02:00/2023/07/12/analyzing-github-stars-with-imply-polaris<p><img src="/assets/2023-07-12-01-Ludwig_Richter-The_Star_Money-2-1862.jpg" alt="Sterntaler drawing" /></p>
<h2 id="why-all-this">Why all this?</h2>
<p>A while ago, <a href="https://twitter.com/whycaniuse">Will</a> asked if we could measure <a href="https://www.swyx.io/measuring-devrel">community engagement</a> in the <a href="https://druid.apache.org/">Apache Druid</a> community by analyzing the number of <a href="https://docs.github.com/en/rest/activity/starring">GitHub stars</a> that the <a href="https://github.com/apache/druid">Druid source repository</a> got over time. He wanted to compare that development with other repositories within the realtime analytics ecosystem, and possibly identify segments of GitHub users that had starred multiple repositories out of the list we are looking at.</p>
<p>This blog is <em>not</em> about the results of that endeavor. Instead, I am going to look at an interesting data/query modeling problem I encountered on the way.</p>
<h2 id="the-dataset">The dataset</h2>
<p>Let’s get the stargazers for various repos that are either competitive or complementary with Druid. This includes</p>
<ul>
<li>other realtime analytics datastores</li>
<li>streaming platforms</li>
<li>stream processors</li>
<li>frontend (business intelligence) tools.</li>
</ul>
<p>For each stargazer record, we store</p>
<ul>
<li>the user</li>
<li>the repository</li>
<li>date and time when it was starred; this will be the primary timestamp for the Druid data model.</li>
</ul>
<h3 id="how-to-get-the-data">How to get the data</h3>
<p>The data we are going to analyze comes from the <a href="https://docs.github.com/en/rest/activity/starring?apiVersion=2022-11-28#list-stargazers">GitHub stargazers API</a>. <a href="https://dev.to/vnarayaj/analysing-github-stars-extracting-and-analyzing-data-from-github-using-apache-nifir-apache-kafkar-and-apache-druidr-280">Vijay has written a great blog about this</a>; I am using a simpler approach with a Python script that runs once and tries to retrieve all the data.</p>
<p>This probably warrants another blog about the quirks of the GitHub API, so for now let a few remarks suffice.</p>
<ul>
<li>Surprise: <a href="https://twitter.com/elonmusk">Elon Musk</a> did not invent <a href="https://docs.github.com/en/rest/rate-limit/rate-limit?apiVersion=2022-11-28">API rate limiting</a>! Our first idea was to get <em>all the repositories</em> that Druid stargazers also starred. This approach is not viable.</li>
<li>There are primary (hard) and secondary rate limits. Either way, if you hit a limit, GitHub throws a 403 error at you. The required action depends on the type of rate limit that was applied, and this needs to be parsed from response headers.</li>
<li>The API imposes <a href="https://docs.github.com/en/rest/guides/using-pagination-in-the-rest-api?apiVersion=2022-11-28">pagination</a> with a maximum page size of 100 records (see the sketch after this list).</li>
<li>The maximum page index you can retrieve is 399.</li>
<li>As a consequence, <em>you will not get more than 40,000 stars for any one repository</em>, which will soon become important.</li>
</ul>
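<p>For reference, one page of stargazer data can be fetched like this - note the media type header, without which the API omits the <code class="language-plaintext highlighter-rouge">starred_at</code> timestamp. This sketch assumes a personal access token in the <code class="language-plaintext highlighter-rouge">GITHUB_TOKEN</code> environment variable:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Fetch one page of up to 100 stargazer records, including starred_at
curl -H "Accept: application/vnd.github.star+json" \
     -H "Authorization: Bearer $GITHUB_TOKEN" \
     "https://api.github.com/repos/apache/druid/stargazers?per_page=100&page=1"
</code></pre></div></div>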
<p>You can find the code that I used, as well as all the SQL samples from this post, in <a href="https://github.com/hellmarbecker/druid-stargazers">my GitHub repository</a>.</p>
<h3 id="loading-the-data-into-polaris">Loading the data into Polaris</h3>
<p>While the basic SQL analysis works just as well with open source Druid, I am using <a href="https://imply.io/imply-fully-managed-dbaas-polaris/">Imply Polaris</a> because of its ease of use and built-in visualization. Ingesting file data into Polaris is a streamlined process that is well described in <a href="https://docs.imply.io/polaris/quickstart/#upload-a-file-and-view-sample-data">the quickstart guide</a> - follow the instructions there.</p>
<p>Here are some sample records from my script:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{"starred_at": "2012-10-23T19:08:07Z", "user": {"login": "bennettandrews", "id": 1143, "node_id": "MDQ6VXNlcjExNDM=", "avatar_url": "https://avatars.githubusercontent.com/u/1143?v=4", "gravatar_id": "", "url": "https://api.github.com/users/bennettandrews", "html_url": "https://github.com/bennettandrews", "followers_url": "https://api.github.com/users/bennettandrews/followers", "following_url": "https://api.github.com/users/bennettandrews/following{/other_user}", "gists_url": "https://api.github.com/users/bennettandrews/gists{/gist_id}", "starred_url": "https://api.github.com/users/bennettandrews/starred{/owner}{/repo}", "subscriptions_url": "https://api.github.com/users/bennettandrews/subscriptions", "organizations_url": "https://api.github.com/users/bennettandrews/orgs", "repos_url": "https://api.github.com/users/bennettandrews/repos", "events_url": "https://api.github.com/users/bennettandrews/events{/privacy}", "received_events_url": "https://api.github.com/users/bennettandrews/received_events", "type": "User", "site_admin": false}, "starred_repo": "apache/druid"}
{"starred_at": "2012-10-23T19:08:07Z", "user": {"login": "xwmx", "id": 1246, "node_id": "MDQ6VXNlcjEyNDY=", "avatar_url": "https://avatars.githubusercontent.com/u/1246?v=4", "gravatar_id": "", "url": "https://api.github.com/users/xwmx", "html_url": "https://github.com/xwmx", "followers_url": "https://api.github.com/users/xwmx/followers", "following_url": "https://api.github.com/users/xwmx/following{/other_user}", "gists_url": "https://api.github.com/users/xwmx/gists{/gist_id}", "starred_url": "https://api.github.com/users/xwmx/starred{/owner}{/repo}", "subscriptions_url": "https://api.github.com/users/xwmx/subscriptions", "organizations_url": "https://api.github.com/users/xwmx/orgs", "repos_url": "https://api.github.com/users/xwmx/repos", "events_url": "https://api.github.com/users/xwmx/events{/privacy}", "received_events_url": "https://api.github.com/users/xwmx/received_events", "type": "User", "site_admin": false}, "starred_repo": "apache/druid"}
</code></pre></div></div>
<p>Upload the output file to Polaris and ingest only the <code class="language-plaintext highlighter-rouge">starred_at</code>, <code class="language-plaintext highlighter-rouge">user["login"]</code>, <code class="language-plaintext highlighter-rouge">user["id"]</code>, and <code class="language-plaintext highlighter-rouge">starred_repo</code> columns. (You will need to use <code class="language-plaintext highlighter-rouge">JSON_VALUE</code> to extract the nested fields.)</p>
<p>Create a <a href="https://docs.imply.io/polaris/managing-data-cubes/">data cube</a> with default settings. By default, you will get an event count measure, but you can add your own filtered or computed measures if you want.</p>
<h2 id="naïve-visualization">Naïve visualization</h2>
<p>This first data model shows only the new stars for every point in time. This looks a bit confusing, but there is one interesting fact to be gleaned already:</p>
<p><img src="/assets/2023-07-12-02-eventdata.jpg" alt="Visualization: New Star Events over Time" /></p>
<p>The new star data for the <code class="language-plaintext highlighter-rouge">superset</code> repository is gone after a certain date! Why is that?</p>
<p>Remember, we can only retrieve 40,000 stargazer records per repository. But Superset has more than 52,000 stars, so we cannot get them all.</p>
<p>This is a starting point, but what Will really wanted to see is the growth of stars over time. This is something you would normally address with a window function and a <code class="language-plaintext highlighter-rouge">ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</code> clause. But since <a href="/2023/03/26/druid-26-sneak-peek-window-functions/">window functions in Druid</a> are not quite production ready yet, we have to model these queries with a different syntax.</p>
<p>Let’s do this with monthly resolution so we can track the month over month growth curve for each repository.</p>
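<p>For reference, here is roughly what the window function formulation we cannot use yet would look like - a sketch only, assuming standard window syntax:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT
  DATE_TRUNC('MONTH', "__time") AS date_month,
  starred_repo,
  SUM(COUNT(*)) OVER (
    PARTITION BY starred_repo
    ORDER BY DATE_TRUNC('MONTH', "__time")
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS sum_cume
FROM "stargazers-ecosystem"
GROUP BY 1, 2
</code></pre></div></div>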
<h2 id="first-attempt-at-cumulative-sums-self-join">First attempt at cumulative sums: self join</h2>
<p>Last year, I wrote about <a href="/2022/11/05/druid-data-cookbook-cumulative-sums-in-druid-sql/">emulating window functions in Druid SQL</a>, and one of the techniques I used was to join a table with itself. To keep the intermediate result sets small, we roll the data up by month before joining. Since we use the same subquery on both sides of the join, let’s formulate it as a common table expression.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="n">cte</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="n">DATE_TRUNC</span><span class="p">(</span><span class="s1">'MONTH'</span><span class="p">,</span> <span class="nv">"__time"</span><span class="p">)</span> <span class="k">AS</span> <span class="n">date_month</span><span class="p">,</span> <span class="n">starred_repo</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">count_monthly</span>
<span class="k">FROM</span> <span class="nv">"stargazers-ecosystem"</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">cte</span><span class="p">.</span><span class="n">date_month</span><span class="p">,</span>
<span class="n">cte</span><span class="p">.</span><span class="n">starred_repo</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="n">t2</span><span class="p">.</span><span class="n">count_monthly</span><span class="p">)</span> <span class="k">AS</span> <span class="n">sum_cume</span>
<span class="k">FROM</span> <span class="n">cte</span> <span class="k">INNER</span> <span class="k">JOIN</span> <span class="n">cte</span> <span class="n">t2</span> <span class="k">ON</span> <span class="n">cte</span><span class="p">.</span><span class="n">starred_repo</span> <span class="o">=</span> <span class="n">t2</span><span class="p">.</span><span class="n">starred_repo</span>
<span class="k">WHERE</span> <span class="n">t2</span><span class="p">.</span><span class="n">date_month</span> <span class="o"><=</span> <span class="n">cte</span><span class="p">.</span><span class="n">date_month</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
</code></pre></div></div>
<p>The interesting measure in this data model is <code class="language-plaintext highlighter-rouge">sum_cume</code>: the sum of all stars from the beginning of the data up to the reference date, per repository. Let’s visualize this in Polaris over a time period of 10 years!</p>
<p><img src="/assets/2023-07-12-03-selfjoin.jpg" alt="Visualization: Cumulative Sums with Self Join" /></p>
<p>This is <em>almost</em> good, but did you notice how the superset line drops to zero? Why is that?</p>
<p>Well, remember the 40,000 stars limit? Because there are no new Superset entries after a certain date, the inner join finds nothing to match for later months - and those months vanish from the result.</p>
<p>We have been hit by a well known problem in data modeling, <a href="https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/factless-fact-table/"><em>factless facts</em></a>. Generally, this problem of “holes” in the data is addressed by creating a canvas table that provides a data point for each <em>possible</em> combination of dimension values, not only those that we have fact data for.</p>
<h2 id="so-lets-build-up-a-calendar-dimension-instead-shall-we">So let’s build up a calendar dimension instead, shall we</h2>
<p>The straightforward approach to this task is to create a <a href="https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/calendar-date-dimension/"><em>calendar dimension</em></a>. Fortunately, since Druid 26, we have the ability <a href="https://blog.hellmar-becker.de/2023/04/08/druid-sneak-peek-timeseries-interpolation/">to generate an array of equally spaced points in time (with <code class="language-plaintext highlighter-rouge">DATE_EXPAND</code>), and to transform such an array into a set of single value rows (with <code class="language-plaintext highlighter-rouge">UNNEST</code>)</a>. This is not quite a fully featured sequence generator, but it should work for our case.</p>
<p>Note that for all the sample queries you will need to set a query context flag to enable <code class="language-plaintext highlighter-rouge">UNNEST</code>:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"enableUnnest"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Let’s try to fill out the time dimension with one record per month, from the minimum to maximum timestamp that is in the data:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">t</span><span class="p">.</span><span class="n">dateByWeek</span>
<span class="k">FROM</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_FLOOR</span><span class="p">(</span><span class="k">MIN</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">minDate</span><span class="p">,</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_CEIL</span><span class="p">(</span><span class="k">MAX</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">maxDate</span>
<span class="k">FROM</span>
<span class="nv">"stargazers-ecosystem"</span>
<span class="p">),</span>
<span class="k">UNNEST</span><span class="p">(</span><span class="n">DATE_EXPAND</span><span class="p">(</span><span class="n">minDate</span><span class="p">,</span> <span class="n">maxDate</span><span class="p">,</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">t</span><span class="p">(</span><span class="n">dateByWeek</span><span class="p">)</span>
</code></pre></div></div>
<p>Unfortunately, the query fails. But the error message indicates clearly why:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Error: Unsupported operation
Cannot convert to Duration as this period contains months and months vary in length
</code></pre></div></div>
<p>So instead, let’s use the largest interval that does work with <code class="language-plaintext highlighter-rouge">DATE_EXPAND</code> - a week, which always has the same length. We then truncate the generated timestamps to months and deduplicate the values:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="k">DISTINCT</span> <span class="n">TIME_FLOOR</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">dateByWeek</span><span class="p">,</span> <span class="s1">'P1M'</span><span class="p">)</span>
<span class="k">FROM</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_FLOOR</span><span class="p">(</span><span class="k">MIN</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">minDate</span><span class="p">,</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_CEIL</span><span class="p">(</span><span class="k">MAX</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">maxDate</span>
<span class="k">FROM</span>
<span class="nv">"stargazers-ecosystem"</span>
<span class="p">),</span>
<span class="k">UNNEST</span><span class="p">(</span><span class="n">DATE_EXPAND</span><span class="p">(</span><span class="n">minDate</span><span class="p">,</span> <span class="n">maxDate</span><span class="p">,</span> <span class="s1">'P1W'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">t</span><span class="p">(</span><span class="n">dateByWeek</span><span class="p">)</span>
</code></pre></div></div>
<p>This works!</p>
<h2 id="join-up-against-the-fact-data">Join up against the fact data</h2>
<p>Let’s try to join the calendar dimension against the fact data. We know already that we can’t have a “less than or equal” condition in the <code class="language-plaintext highlighter-rouge">JOIN</code> clause. So let’s try and write a Cartesian join with a <code class="language-plaintext highlighter-rouge">WHERE</code> clause that does the time windowing:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span>
<span class="n">cte_calendar</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="k">DISTINCT</span> <span class="n">TIME_FLOOR</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">dateByWeek</span><span class="p">,</span> <span class="s1">'P1M'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">date_month</span>
<span class="k">FROM</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_FLOOR</span><span class="p">(</span><span class="k">MIN</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">minDate</span><span class="p">,</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_CEIL</span><span class="p">(</span><span class="k">MAX</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">maxDate</span>
<span class="k">FROM</span>
<span class="nv">"stargazers-ecosystem"</span>
<span class="p">),</span>
<span class="k">UNNEST</span><span class="p">(</span><span class="n">DATE_EXPAND</span><span class="p">(</span><span class="n">minDate</span><span class="p">,</span> <span class="n">maxDate</span><span class="p">,</span> <span class="s1">'P1W'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">t</span><span class="p">(</span><span class="n">dateByWeek</span><span class="p">)</span>
<span class="p">),</span>
<span class="n">cte_stars</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">DATE_TRUNC</span><span class="p">(</span><span class="s1">'MONTH'</span><span class="p">,</span> <span class="nv">"__time"</span><span class="p">)</span> <span class="k">AS</span> <span class="n">date_month</span><span class="p">,</span>
<span class="n">starred_repo</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">count_monthly</span>
<span class="k">FROM</span> <span class="nv">"stargazers-ecosystem"</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">cte_calendar</span><span class="p">.</span><span class="n">date_month</span><span class="p">,</span>
<span class="n">cte_stars</span><span class="p">.</span><span class="n">starred_repo</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="n">cte_stars</span><span class="p">.</span><span class="n">count_monthly</span><span class="p">)</span> <span class="k">AS</span> <span class="n">sum_cume</span>
<span class="k">FROM</span> <span class="n">cte_calendar</span><span class="p">,</span> <span class="n">cte_stars</span>
<span class="k">WHERE</span> <span class="n">cte_stars</span><span class="p">.</span><span class="n">date_month</span> <span class="o"><=</span> <span class="n">cte_calendar</span><span class="p">.</span><span class="n">date_month</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
</code></pre></div></div>
<p>Alas, this fails too - Druid’s query planner still treats this as a <code class="language-plaintext highlighter-rouge">JOIN</code> with a non-equality condition, and refuses to plan it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SQL requires a join with 'LESS_THAN_OR_EQUAL' condition that is not supported.
</code></pre></div></div>
<p>The message is clear: we need an equi-join. As a workaround, let’s add <code class="language-plaintext highlighter-rouge">starred_repo</code> to the calendar canvas as well, so we can use it as a join key. The canvas definition thus becomes a cross join between the monthly calendar we created above and the list of all unique repositories:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">SELECT</span>
<span class="n">TIME_FLOOR</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">dateByWeek</span><span class="p">,</span> <span class="s1">'P1M'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">date_month</span><span class="p">,</span>
<span class="n">starred_repo</span>
<span class="k">FROM</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_FLOOR</span><span class="p">(</span><span class="k">MIN</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">minDate</span><span class="p">,</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_CEIL</span><span class="p">(</span><span class="k">MAX</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">maxDate</span>
<span class="k">FROM</span>
<span class="nv">"stargazers-ecosystem"</span>
<span class="p">),</span>
<span class="k">UNNEST</span><span class="p">(</span><span class="n">DATE_EXPAND</span><span class="p">(</span><span class="n">minDate</span><span class="p">,</span> <span class="n">maxDate</span><span class="p">,</span> <span class="s1">'P1W'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">t</span><span class="p">(</span><span class="n">dateByWeek</span><span class="p">),</span>
<span class="p">(</span> <span class="k">SELECT</span> <span class="k">DISTINCT</span> <span class="n">starred_repo</span> <span class="k">FROM</span> <span class="nv">"stargazers-ecosystem"</span> <span class="p">)</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
</code></pre></div></div>
<p>Then define this as a CTE, join the facts on <code class="language-plaintext highlighter-rouge">starred_repo</code>, and tuck the unbounded preceding condition away into a <a href="https://druid.apache.org/docs/latest/querying/sql-aggregations.html">filtered metric</a>:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span>
<span class="n">cte_calendar</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">TIME_FLOOR</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">dateByWeek</span><span class="p">,</span> <span class="s1">'P1M'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">date_month</span><span class="p">,</span>
<span class="n">starred_repo</span>
<span class="k">FROM</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_FLOOR</span><span class="p">(</span><span class="k">MIN</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">minDate</span><span class="p">,</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_CEIL</span><span class="p">(</span><span class="k">MAX</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">maxDate</span>
<span class="k">FROM</span>
<span class="nv">"stargazers-ecosystem"</span>
<span class="p">),</span>
<span class="k">UNNEST</span><span class="p">(</span><span class="n">DATE_EXPAND</span><span class="p">(</span><span class="n">minDate</span><span class="p">,</span> <span class="n">maxDate</span><span class="p">,</span> <span class="s1">'P1W'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">t</span><span class="p">(</span><span class="n">dateByWeek</span><span class="p">),</span>
<span class="p">(</span> <span class="k">SELECT</span> <span class="k">DISTINCT</span> <span class="n">starred_repo</span> <span class="k">FROM</span> <span class="nv">"stargazers-ecosystem"</span> <span class="p">)</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
<span class="p">),</span>
<span class="n">cte_stars</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">DATE_TRUNC</span><span class="p">(</span><span class="s1">'MONTH'</span><span class="p">,</span> <span class="nv">"__time"</span><span class="p">)</span> <span class="k">AS</span> <span class="n">date_month</span><span class="p">,</span>
<span class="n">starred_repo</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">count_monthly</span>
<span class="k">FROM</span> <span class="nv">"stargazers-ecosystem"</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">cte_calendar</span><span class="p">.</span><span class="n">date_month</span><span class="p">,</span>
<span class="n">cte_stars</span><span class="p">.</span><span class="n">starred_repo</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="n">cte_stars</span><span class="p">.</span><span class="n">count_monthly</span><span class="p">)</span> <span class="n">FILTER</span><span class="p">(</span><span class="k">WHERE</span> <span class="n">cte_stars</span><span class="p">.</span><span class="n">date_month</span> <span class="o"><=</span> <span class="n">cte_calendar</span><span class="p">.</span><span class="n">date_month</span><span class="p">)</span> <span class="k">AS</span> <span class="n">sum_cume</span>
<span class="k">FROM</span> <span class="n">cte_calendar</span> <span class="k">INNER</span> <span class="k">JOIN</span> <span class="n">cte_stars</span> <span class="k">ON</span> <span class="n">cte_calendar</span><span class="p">.</span><span class="n">starred_repo</span> <span class="o">=</span> <span class="n">cte_stars</span><span class="p">.</span><span class="n">starred_repo</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
</code></pre></div></div>
<p>Use this query to define a cube in the Polaris GUI, and see the result:</p>
<p><img src="/assets/2023-07-12-04-calendar-canvas.jpg" alt="Visualization: Cumulative Sums" /></p>
<p>And now the number of Superset stars maxes out at 40,000, but it no longer drops to zero!</p>
<h2 id="learnings">Learnings</h2>
<ul>
<li>The <a href="https://blog.hellmar-becker.de/2022/11/05/druid-data-cookbook-cumulative-sums-in-druid-sql/">self join approach to cumulative sums</a> fails when there are “holes” in the data (aka factless facts).</li>
<li>The best approach to counter this is building an explicit calendar dimension.</li>
<li><code class="language-plaintext highlighter-rouge">DATE_EXPAND</code> can be used to build a calendar canvas but has some limitations. We showed how to work around those.</li>
<li>We also learned how to work around the <code class="language-plaintext highlighter-rouge">JOIN</code> limitation in Druid SQL by adding a synthetic join key to the calendar dimension and using a filtered metric.</li>
</ul>
<hr />
<p>“Ludwig_Richter-The_Star_Money-2-1862” (via <a href="https://commons.wikimedia.org/wiki/File:Ludwig_Richter-The_Star_Money-2-1862.jpg">Wikimedia Commons</a>) is in the <b><a href="https://en.wikipedia.org/wiki/public_domain" class="extiw" title="en:public domain">public domain</a></b> in its country of origin and other countries and areas where the <a href="https://en.wikipedia.org/wiki/List_of_countries%27_copyright_lengths" class="extiw" title="w:List of countries' copyright lengths">copyright term</a> is the author’s <b>life plus 100 years or fewer</b>.</p>Indexes in Apache Druid2023-06-28T00:00:00+02:002023-06-28T00:00:00+02:00/2023/06/28/indexes-in-apache-druid<p>If you come from a traditional database background, you are probably used to creating and maintaining indexes on most of your data. In a relational database, indexes can speed up queries but at a cost of slower data insertion.</p>
<p>In Druid, on the other hand, you never see a <code class="language-plaintext highlighter-rouge">CREATE INDEX</code> statement. Instead, Druid automatically indexes all data, creating optimized storage segments that provide high performance for all data types - and you never need to select or manage indexes. Let’s look at some of these data organization features!</p>
<h2 id="druid-bitmap-indexes">Druid Bitmap Indexes</h2>
<p>Druid uses <strong><em><a href="https://en.wikipedia.org/wiki/Bitmap_index">bitmap indexes</a></em></strong>. These are created automatically on all string columns and on each subfield of a JSON column. Let’s look at this design choice in some more detail.</p>
<h3 id="types-of-indexes-in-a-relational-database">Types of indexes in a relational database</h3>
<p>Relational databases use a B-tree index as their primary index type. A relational table often has a primary key that can be used to uniquely identify a row in the table. A B-tree index maps individual keys to the rows that contain them. Its use cases are:</p>
<ul>
<li>enforcing uniqueness of a key during inserting</li>
<li>quickly looking up a single value for updates, inserts, and (sometimes) join queries.</li>
</ul>
<p>A B-tree index is not a good choice for analytical queries where you typically have many rows with the same value, and you want to retrieve and aggregate data in bulk. Note also that, due to the structure of a B-tree index, lookups have <em>O(log n)</em> complexity, which may be impractical for large tables.</p>
<h3 id="bitmap-indexes---why">Bitmap indexes - why?</h3>
<p><strong><em>Bitmap indexes</em></strong> came up as relational databases were enhanced with analytical features. A bitmap index stores, for each value, a bit array that has a <em>1</em> bit at the position of each row containing that value, and a <em>0</em> at all other positions. It can be thought of as an <strong><em>inverted index</em></strong>: it maps not a row number to a value, but a value to the collection of rows where the value occurs.</p>
<p>This has a number of advantages:</p>
<ul>
<li>Fast lookup of all rows for a value. Because the bitmap index is an array, such lookups are <em>O(1)</em>.</li>
<li>Even better, bitmap indexes are mergeable in any combination. To model logical conditions such as the union or intersection of filters, just apply bitwise logical <em>OR</em> and <em>AND</em> operations to the bitmaps - see the example after this list.</li>
<li>Bitmaps are always segment local and thus fast to maintain. If your data is partitioned or sharded, the bitmap index is partitioned in the same way.</li>
</ul>
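<p>For instance, a conjunctive filter like the one below - datasource and column names are made up for illustration - can be answered by bitwise ANDing the two per-value bitmaps before a single row is scanned:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Druid intersects the bitmap for country = 'DE'
-- with the bitmap for device = 'mobile'
SELECT COUNT(*) AS matching_rows
FROM "web_events"
WHERE country = 'DE' AND device = 'mobile'
</code></pre></div></div>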
<p>For high cardinality and sparse data, a forward index such as a B-tree may be faster but there are ways to get the best of both worlds. I’ll get to that in a moment.</p>
<p>Why doesn’t Druid use B-tree indexes as a general option? Unlike a bitmap index, a B-tree index has to be global to be fast. (A global index spans the whole table, disregarding any partitioning.) This makes insertion and index maintenance quite expensive.</p>
<h3 id="how-druid-implements-the-best-of-forward-and-inverted-index-druid-roaring-bitmaps">How Druid implements the best of forward and inverted index: Druid roaring bitmaps</h3>
<p>Let’s talk about <em>sparse indexes</em> for a moment. Contrary to a widespread belief, regular bitmaps are best for columns with medium cardinality. If the cardinality of a column is very low, the index is not very selective and you need to read a lot of data anyway. If the cardinality is very high, you have a different problem: Each value is only present in a small fraction of rows, so you would waste a lot of space storing zeroes for each value.</p>
<p>This is why Druid does not implement plain bitmap indexes. Instead, bitmap indexes are by default compressed using <a href="https://www.roaringbitmap.org/">Roaring bitmaps</a>. The Roaring algorithm cuts the bitmap up into pages of 2<sup>16</sup> rows; if a page has very few <em>1</em> bits, it stores a list of row IDs instead.</p>
<p>Roaring bitmaps also support run-length encoding of pages, which is particularly effective when indexing a dimension that is also used to pre-sort the data - more about this later.</p>
<h3 id="bitmap-indexes-and-multi-value-dimensions">Bitmap indexes and multi-value dimensions</h3>
<p><a href="/2021/08/07/multivalue-dimensions-in-apache-druid-part-1/">Multi-value dimensions</a> go nicely with bitmap indexes. A multi-value field would just have a bit set for every value that occurs in the cell. That is another reason to prefer bitmap indexes.</p>
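<p>As a quick illustration with a hypothetical datasource: filtering on a single value of a multi-value column matches every row whose value list contains that value, and the value’s bitmap answers the filter directly:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- matches all rows where ANY of the values in "tags" is 'news'
SELECT COUNT(*) AS tagged_articles
FROM "articles"
WHERE tags = 'news'
</code></pre></div></div>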
<h2 id="colocating-data-partitioning-and-clustering">Colocating Data: Partitioning and Clustering</h2>
<p>In relational data modeling, the main abstraction is that you look at the table as a whole. There is no implicit ordering in the way the data is laid out. It has long been known that this is not the best model for analytical queries. That is why there are options in Druid that inform the physical layout of the data.</p>
<h3 id="time-partitioning-granularities-and-sorting-by-time">Time partitioning, granularities and sorting by time</h3>
<p>All data in Druid is partitioned and sorted by time. Each row has a primary timestamp, and part of the data modeling process is to define a <em>segment granularity</em> and <em>query granularity</em>.</p>
<p><em>Segment granularity</em> is defined by the <code class="language-plaintext highlighter-rouge">PARTITIONED BY</code> clause in SQL based ingestion and it translates directly into the time chunks that define the segment timeline. (Within each time chunk, there may be multiple segments.) Within a segment, data is sorted by primary timestamp. This creates the equivalent of a <strong><em>timeseries index</em></strong>.</p>
<p><em>Query granularity</em> is defined by truncating the primary timestamp in the ingestion query. Druid uses query granularity to deliberately define the time resolution such that data can be rolled up efficiently. This can greatly improve query performance and storage use.</p>
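<p>As a sketch - datasource and column names are invented - a SQL based ingestion that sets daily segment granularity and rolls the data up to hourly query granularity might look like this:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>REPLACE INTO "web_events" OVERWRITE ALL
SELECT
  TIME_FLOOR("__time", 'PT1H') AS "__time",  -- query granularity: truncate to the hour
  country,
  device,
  COUNT(*) AS event_count                    -- rollup metric
FROM "web_events_raw"
GROUP BY 1, 2, 3
PARTITIONED BY DAY                           -- segment granularity: daily time chunks
</code></pre></div></div>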
<h3 id="special-case-multiple-time-granularities">Special case: Multiple time granularities</h3>
<p>If you want to achieve primary sorting by a column other than time, you should set segment and query granularity to the same value. If you still need detailed timestamps, you can define the detailed time as a <a href="https://druid.apache.org/docs/latest/ingestion/schema-design.html#secondary-timestamps">secondary timestamp</a>. The main criterion for this design decision is whether you expect to run predominantly analytical queries that do not have timeseries characteristics, while retaining the ability to run some timeseries queries. The number of timestamp fields is in principle not limited.</p>
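<p>A minimal sketch of this pattern, assuming raw input with a string timestamp column <code class="language-plaintext highlighter-rouge">ts</code>: the primary timestamp is truncated to the segment granularity, while the detailed time lives on as a regular column.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>REPLACE INTO "web_events_daily" OVERWRITE ALL
SELECT
  TIME_FLOOR(TIME_PARSE("ts"), 'P1D') AS "__time",  -- truncated primary timestamp
  TIME_PARSE("ts") AS exact_time,                   -- secondary timestamp
  country,
  device
FROM "web_events_raw"
PARTITIONED BY DAY
</code></pre></div></div>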
<h3 id="secondary-partitioning-pruning-and-range-queries">Secondary partitioning: Pruning and range queries</h3>
<p>Below the timestamp level, there is <em>secondary partitioning</em>, which is usually implemented as <a href="/partitioning-in-druid-part-3-multi-dimension-range-partitioning/">range partitioning</a>. This defines a list of dimension fields to partition by. In SQL based ingestion, this corresponds to the <code class="language-plaintext highlighter-rouge">CLUSTERED BY</code> clause. You want to order your partitioning columns first in the ingestion query, too. Then your data will be sorted according to the partitioning columns, and like values will be grouped together physically. If you filter by the partitioning key in a query, Druid uses this information to determine which data segments to look at, even before scanning any data. This is called <strong><em>partition pruning</em></strong> and is a great way to speed up queries.</p>
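<p>Continuing the invented example from above, the <code class="language-plaintext highlighter-rouge">CLUSTERED BY</code> clause defines the secondary partitioning at ingestion time, and a query that filters on the clustering key can then skip whole segments:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- at ingestion time
REPLACE INTO "web_events" OVERWRITE ALL
SELECT TIME_FLOOR("__time", 'PT1H') AS "__time", country, device, COUNT(*) AS event_count
FROM "web_events_raw"
GROUP BY 1, 2, 3
PARTITIONED BY DAY
CLUSTERED BY country, device

-- at query time: segments whose country range does not include 'DE' are pruned
SELECT device, COUNT(*) AS events
FROM "web_events"
WHERE country = 'DE'
GROUP BY 1
</code></pre></div></div>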
<h3 id="how-druid-implements-composite-index-functionality">How Druid implements composite index functionality</h3>
<p>With multi-dimension range partitioning, Druid achieves the same functionality as a <strong><em>composite index</em></strong>. In an RDBMS, you would use a composite index whenever you have a combination of columns that you use to filter or group by in most of the queries that you typically run.</p>
<p>That being said, because we use bitmap indexes on all columns, we also achieve composite index functionality by merging bitmap indexes across columns.</p>
<h3 id="how-druid-implements-range-index-functionality">How Druid implements range index functionality</h3>
<p>Another advantage of multi-dimension range partitioning is where you query for a range of values. Because the partitioning key also determines sort order, values within a range are grouped together. This achieves the functionality of a <strong><em>range index</em></strong>.</p>
<h3 id="be-extra-space-efficient-front-coding">Be extra space efficient: Front coding</h3>
<p>In addition to range sorting, Druid implements <em>front coding</em> for character data. All string data is represented by a dictionary (which can be thought of as a <strong><em>forward index</em></strong>), and common prefixes are shared between dictionary entries. That way, we optimize space usage without sacrificing speed.</p>
<h2 id="structured-data-nested-columns">Structured Data: Nested Columns</h2>
<p>For nested (JSON) columns, Druid creates a bitmap index <em>for each nested field</em>. With that, you get the functionality of a <strong><em>document (JSON) index</em></strong>. Again, Druid does the right thing automatically without requiring any explicit configuration.</p>
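<p>For example, a filter on a nested field can use that field’s own bitmap index directly (datasource and path are hypothetical):</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT COUNT(*) AS matching_rows
FROM "events_nested"
WHERE JSON_VALUE("payload", '$.user.country') = 'DE'
</code></pre></div></div>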
<h2 id="conclusion">Conclusion</h2>
<p>In this article, I gave a quick tour of data organization and indexing features in Apache Druid. What have we learned?</p>
<ul>
<li>You might be asking: where are the indexes? In Druid, indexes are created and maintained automatically. And a lot of index functionality is done with features that are not technically indexes, but achieve the same effect.</li>
<li>For analytical queries, bitmap indexes are the best choice for many scenarios. Druid creates bitmap indexes on all (string) columns by default.</li>
<li>Bitmap indexes allow merging and logical operations, and thus support arbitrary column combinations, superseding composite indexes.</li>
<li>Our implementation of Roaring bitmaps uses forward lookup for sparse columns: this optimizes both query speed and storage.</li>
<li>Time partitioning aids pruning in time based queries.</li>
<li>Time sorting is great for time series and time range queries.</li>
<li>Secondary partitioning replaces composite and range indexes.</li>
<li>Each field inside a nested column (document column) has its own bitmap index so JSON index functionality is achieved.</li>
</ul>If you come from a traditional database background, you are probably used to creating and maintaining indexes on most of your data. In a relational database, indexes can speed up queries but at a cost of slower data insertion.New in Druid 26: Data Provenance Tracking with Kafka Headers, Automatically2023-06-27T00:00:00+02:002023-06-27T00:00:00+02:00/2023/06/27/new-in-druid-26-data-provenance-tracking-with-kafka-metadata-automatically<p><img src="/assets/2023-06-27-00-airplane.jpg" alt="Lufthansa Airbus A350 XWB D-AIXP arrives SFO L1060413, by wbaiv (Bill Abbott)" /></p>
<p>I have previously written about <a href="https://blog.hellmar-becker.de/2022/08/30/processing-flight-radar-ads-b-data-with-decodable-and-imply/">processing</a> and <a href="https://blog.hellmar-becker.de/2023/02/01/street-level-maps-in-imply-pivot-with-flight-data-and-confluent-cloud/">visualizing</a> ADS-B flight radar data with Kafka and Druid. This time, let’s look at some new possibilities with ingesting those data in a bit more detail.</p>
<p>The story starts with a discussion within our DevRel team at <a href="https://imply.io/">Imply</a>. Wouldn’t it be nice to have multiple flight radar receivers in different locations, and have them all produce data into the same Kafka topic (which lives in Confluent Cloud)? But then, one should also be able to add a unique client ID (and possibly other metadata) to each message. In short, we need data provenance tracking. This is indeed of practical use: in any serious enterprise use case, <a href="https://en.wikipedia.org/wiki/Data_lineage">data lineage</a> tracking is indispensable!</p>
<p>In Kafka, data lineage is tracked with <a href="https://www.confluent.io/blog/5-things-every-kafka-developer-should-know/#tip-5-record-headers">message headers</a>. These are basically key-value pairs that can be defined freely. Inside Kafka, the header values are coded as binary bytes - their meaning and encoding is governed by your data contract, something to keep in mind for later.</p>
<p>Druid has been able to ingest Kafka metadata for a while, <a href="https://blog.hellmar-becker.de/2022/11/23/processing-nested-json-data-and-kafka-metadata-in-apache-druid/">and I have written about it before</a>. But before version 26, you had to edit the ingestion spec manually to enable this feature. Now, it is supported by the Druid console, making things a lot easier. Let’s see how this works for our flight radar data!</p>
<p>In this tutorial, you will</p>
<ul>
<li>generate Kafka messages with headers from flight radar data</li>
<li>ingest and model these data inside Druid</li>
<li>and show how these data can be queried just like any other table column using Druid SQL.</li>
</ul>
<p>For the tutorial, use at least Druid version 26.0. The Druid quickstart works fine.</p>
<h2 id="generating-the-data">Generating the data</h2>
<p>In my <a href="https://blog.hellmar-becker.de/2022/08/30/processing-flight-radar-ads-b-data-with-decodable-and-imply/">blog, I’ve previously described</a> how you can use a Raspberry Pi with a DVB-T stick to receive flight radar data. Let’s modify the Kafka connector script to generate some data with Kafka headers. <code class="language-plaintext highlighter-rouge">kcat</code> comes with a <code class="language-plaintext highlighter-rouge">-H</code> option to inject arbitrary headers into a Kafka message.</p>
<p>Edit the following script, entering a unique client ID of your choice and your geographical coordinates. Then follow the instructions in the blog above to install the script as a service on your Raspberry Pi.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="nv">CC_BOOTSTRAP</span><span class="o">=</span><span class="s2">"<confluent cloud bootstrap server>"</span>
<span class="nv">CC_APIKEY</span><span class="o">=</span><span class="s2">"<api key>"</span>
<span class="nv">CC_SECRET</span><span class="o">=</span><span class="s2">"<secret>"</span>
<span class="nv">CC_SECURE</span><span class="o">=</span><span class="s2">"-X security.protocol=SASL_SSL -X sasl.mechanism=PLAIN -X sasl.username=</span><span class="k">${</span><span class="nv">CC_APIKEY</span><span class="k">}</span><span class="s2"> -X sasl.password=</span><span class="k">${</span><span class="nv">CC_SECRET</span><span class="k">}</span><span class="s2">"</span>
<span class="nv">CLIENT_ID</span><span class="o">=</span><span class="s2">"<client id>"</span>
<span class="nv">LON</span><span class="o">=</span><span class="s2">"0.0"</span>
<span class="nv">LAT</span><span class="o">=</span><span class="s2">"0.0"</span>
<span class="nv">TOPIC_NAME</span><span class="o">=</span><span class="s2">"adsb-raw"</span>
nc localhost 30003 <span class="se">\</span>
| <span class="nb">awk</span> <span class="nt">-F</span> <span class="s2">","</span> <span class="s1">'{ print $5 "|" $0 }'</span> <span class="se">\</span>
| kafkacat <span class="nt">-P</span> <span class="se">\</span>
<span class="nt">-t</span> <span class="k">${</span><span class="nv">TOPIC_NAME</span><span class="k">}</span> <span class="se">\</span>
<span class="nt">-b</span> <span class="k">${</span><span class="nv">CC_BOOTSTRAP</span><span class="k">}</span> <span class="se">\</span>
<span class="nt">-H</span> <span class="s2">"ClientID=</span><span class="k">${</span><span class="nv">CLIENT_ID</span><span class="k">}</span><span class="s2">"</span> <span class="se">\</span>
<span class="nt">-H</span> <span class="s2">"ReceiverLon=</span><span class="k">${</span><span class="nv">LON</span><span class="k">}</span><span class="s2">"</span> <span class="se">\</span>
<span class="nt">-H</span> <span class="s2">"ReceiverLat=</span><span class="k">${</span><span class="nv">LAT</span><span class="k">}</span><span class="s2">"</span> <span class="se">\</span>
<span class="nt">-K</span> <span class="s2">"|"</span> <span class="se">\</span>
<span class="k">${</span><span class="nv">CC_SECURE</span><span class="k">}</span>
</code></pre></div></div>
<p>This adds a Kafka key (the aircraft hex ID), a unique ID for the radar receiver, and also the receiver coordinates, as Kafka headers.</p>
<h2 id="ingesting-the-data">Ingesting the data</h2>
<p>In Druid, create a Kafka connection. In my lab, I am using Confluent Cloud so I have to encode the credentials in the consumer properties as described <a href="https://blog.hellmar-becker.de/2021/10/19/reading-avro-streams-from-confluent-cloud-into-druid/">in another of my blog posts</a>. (If you are using a local, unsecured Kafka service, it is sufficient to enter the bootstrap server and Kafka topic.)</p>
<p>Note how the preview looks different from previous Druid versions:</p>
<p><img src="/assets/2023-06-27-01-preview.jpg" alt="Kafka topic preview with metadata" /></p>
<p>It now lists the Kafka metadata:</p>
<ul>
<li>timestamp</li>
<li>key</li>
<li>headers</li>
</ul>
<p>along with the payload.</p>
<p>In the <code class="language-plaintext highlighter-rouge">Parse data</code> wizard, enter the column headers for the flight data:</p>
<pre><code class="language-csv">MT,TT,SID,AID,Hex,FID,DMG,TMG,DML,TML,CS,Alt,GS,Trk,Lat,Lng,VR,Sq,Alrt,Emer,SPI,Gnd
</code></pre>
<p>Also make sure to enable the switch for parsing Kafka metadata (it should be on by default):</p>
<p><img src="/assets/2023-06-27-02-parse-kafka.jpg" alt="Kafka Parser with metadata" /></p>
<p>If you scroll down the right window pane, you will find a number of new options about handling the metadata.</p>
<p><img src="/assets/2023-06-27-03-kafka-metadata-options.jpg" alt="Kafka metadata options" /></p>
<p>Here you specify how the key is parsed. (You could in theory have a structured key, because the key is parsed into an input format just like the payload. In practice, you will usually have a single string that can be parsed using a regular expression or <a href="https://blog.hellmar-becker.de/2022/11/23/processing-nested-json-data-and-kafka-metadata-in-apache-druid/">a degenerate CSV parser</a>.)</p>
<p>Moreover, this is where you define the prefixes to be used for the metadata in your final data model. And last but not least, you define how to decode the header values. In most cases, UTF-8 is a good choice, but it really depends on what your producer puts in at the other end.</p>
<p>The Kafka timestamp is automatically suggested as the primary Druid timestamp:</p>
<p><img src="/assets/2023-06-27-04-kafka-timestamp.jpg" alt="Model with timestampt" /></p>
<p>So, with minimal configuration (as usual, you have to define your segment granularity and datasource name), you have your Kafka ingestion ready:</p>
<p><img src="/assets/2023-06-27-05-view-spec.jpg" alt="Ingestion spec" /></p>
<p>After submitting the spec, run a quick query to verify that indeed, the Kafka metadata has been parsed and ingested correctly:</p>
<p><img src="/assets/2023-06-27-06-query.jpg" alt="Example query" /></p>
<p>And that is how easily Kafka metadata goes into Apache Druid!</p>
<h2 id="conclusion">Conclusion</h2>
<ul>
<li>Data lineage can be tracked with Kafka headers.</li>
<li>Starting with Druid 26, Kafka metadata (timestamp, key, headers) are supported by the unified console wizard.</li>
<li>With this, we can easily build a distributed flight data service using only one Kafka topic.</li>
</ul>
<hr />
<p class="attribution">"<a target="_blank" rel="noopener noreferrer" href="https://www.flickr.com/photos/wbaiv/52202356360/">Lufthansa Airbus A350 XWB D-AIXP arrives SFO L1060413</a>" by <a target="_blank" rel="noopener noreferrer" href="https://www.flickr.com/photos/wbaiv">wbaiv</a> is licensed under <a target="_blank" rel="noopener noreferrer" href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0 <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /></a>. </p>Overlaying Multiple Metrics in Imply Pivot2023-05-31T00:00:00+02:002023-05-31T00:00:00+02:00/2023/05/31/overlaying-multiple-metrics-in-imply-pivot<p><img src="/assets/2023-05-31-01.jpg" alt="Screenshot with 3 metrics overlayed" /></p>
<p>Today we are going to look at a new enhancement for line chart graphs in <a href="https://docs.imply.io/latest/pivot-overview/">Imply’s Pivot</a>, such as timeseries curves. Up until recently, one chart would only show a single measure<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.
If you pulled in multiple metrics, you would get each in its own chart, like this:</p>
<p><img src="/assets/2023-05-31-02.jpg" alt="Screenshot with 3 metrics in rows" /></p>
<p>What analysts asked for was to have all curves overlaid in one chart, like in the screenshot at the beginning of this article.</p>
<p>This is possible now. But how do you go about it? Let’s have a look!</p>
<h2 id="two-or-more-measures-in-one-chart">Two (or more) measures in one chart</h2>
<p>Here is how to show multiple measures in one chart. In this example, we are looking at clickstream data and we want to show the total number of events, the number of clicks, and the number of sessions.</p>
<p>Drag all the measures you want to show into the show bar:</p>
<p><img src="/assets/2023-05-31-03.jpg" alt="Screenshots with 3 measures in rows, highlight the drag and drop from events, clicks, unique sessions" /></p>
<p>Select the paintbrush icon on the right sidebar and from the option menu, select “Show measures in” “Cell”:</p>
<p><img src="/assets/2023-05-31-04.jpg" alt="Screenshot with the menu options highlighted, and the curves overlaid" /></p>
<p>This looks quite good. But what if the measures are on vastly different scales?</p>
<h2 id="two-measures-with-separate-axis-scaling">Two measures with separate axis scaling</h2>
<p>Let’s stick to the clickstream data and say we have a conversion goal and we want to look at both the total traffic and the conversion rate. We follow the same steps as before, but this time we use the number of clicks and the conversion rate as measures.</p>
<p><img src="/assets/2023-05-31-05.jpg" alt="Screenshot with clicks and conversion rate, have a balloon on the curve to show the numbers at one point" /></p>
<p>As you can see, the scales are so vastly different that the conversion rate all but disappears. But there is a solution: if you have only two measures you can show them on different axes so that both curves fill the canvas.</p>
<p><img src="/assets/2023-05-31-06.jpg" alt="Screenshot with clicks and conversion rate, highlight dual axis menu" /></p>
<p>In the formatting options, choose whether you want to show horizontal grid lines for both axes or only for the first:</p>
<p><img src="/assets/2023-05-31-07.jpg" alt="Highlight show horizontal grid menu and lines for both axes" /></p>
<h2 id="learnings">Learnings</h2>
<ul>
<li>Pivot can now display multiple line graphs in one chart.</li>
<li>If you show more than two measures, they all share the same <em>y</em> axis scaling.</li>
<li>If you show only two measures, you can scale the <em>y</em> axis for each of them independently.</li>
</ul>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>You would be able to display a second measure as a dotted line using the comparison feature, but options are limited. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Druid Sneak Peek: Schema Inference and Arrays2023-05-01T00:00:00+02:002023-05-01T00:00:00+02:00/2023/05/01/druid-sneak-peek-schema-inference-and-arrays<p>One of the strong points of Druid has always been <a href="/2021/08/13/experiments-with-schema-evolution-in-apache-druid/">built-in schema evolution</a>. However, upon getting data of changing shape into Druid, you had two choices:</p>
<ul>
<li>either, specify each field with its type in the ingestion spec, which requires to know all the fields ahead of time</li>
<li>or pick up whatever comes in using <a href="https://druid.apache.org/docs/latest/ingestion/schema-design.html#schema-less-dimensions">schemaless ingestion</a>, with the downside that any dimension ingested that way would be interpreted as a string.</li>
</ul>
<p>The good news is that this is going to change. Druid 26 is going to come with the ability to infer its schema completely from the input data, and even ingest structured data automatically.</p>
<p><em><strong>Disclaimer:</strong> This tutorial uses undocumented functionality and unreleased code. This blog is neither endorsed by Imply nor by the Apache Druid PMC. It merely collects the results of personal experiments. The features described here might, in the final release, work differently, or not at all. In addition, the entire build, or execution, may fail. Your mileage may vary.</em></p>
<p>Druid 26 hasn’t been released yet, but you can <a href="https://druid.apache.org/docs/latest/development/build.html">build Druid</a> from the master branch of the repository and try out the new features.</p>
<p>I am going to pick up the <a href="/2023/04/23/multivalue-dimensions-in-apache-druid-part-5/">multi-value dimensions example from last week</a>, but this time I want you to get an idea how these types of scenarios are going to be handled in the future. We are going to:</p>
<ul>
<li>ingest data using the new schema discovery feature</li>
<li>ingest structured data into an SQL ARRAY</li>
<li>show how <code class="language-plaintext highlighter-rouge">GROUP BY</code> and lateral joins work with that array.</li>
</ul>
<h2 id="ingestion-schema-inference">Ingestion: Schema Inference</h2>
<p>We are using the <code class="language-plaintext highlighter-rouge">ristorante</code> dataset that you can find <a href="/2021/09/25/multivalue-dimensions-in-apache-druid-part-3/">here</a>, but with a little twist: On the <code class="language-plaintext highlighter-rouge">Configure schema</code> tab, uncheck <code class="language-plaintext highlighter-rouge">Explicitly specify dimension list</code>.</p>
<p><img src="/assets/2023-05-01-01-autodetect.jpg" alt="Set autodetect" /></p>
<p>Confirm the warning dialog that pops up, and continue modeling the data. When you proceed to the <code class="language-plaintext highlighter-rouge">Edit spec</code> stage, you can see a new setting that slipped in:</p>
<p><img src="/assets/2023-05-01-02-useSchemaDiscovery.jpg" alt="Autodetect" /></p>
<p>The <code class="language-plaintext highlighter-rouge">dimensionsSpec</code> has no dimension list now, but there is a new flag <code class="language-plaintext highlighter-rouge">useSchemaDiscovery</code>:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w"> </span><span class="nl">"dimensionsSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"useSchemaDiscovery"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"includeAllDimensions"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"dimensionExclusions"</span><span class="p">:</span><span class="w"> </span><span class="p">[]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<h2 id="querying-the-data">Querying the data</h2>
<p>Let’s look at the resulting data with a simple <code class="language-plaintext highlighter-rouge">SELECT *</code> query:</p>
<p><img src="/assets/2023-05-01-03-select-trueArray.jpg" alt="Select all" /></p>
<p>Notice how Druid has automatically detected that <code class="language-plaintext highlighter-rouge">orders</code> is an array of primitives (strings, in this case). You recognize this by the symbol next to the column’s name, which now looks like this: [··]. In older versions, this would have been ingested as a multi-value string. But now, Druid has true <code class="language-plaintext highlighter-rouge">ARRAY</code> columns!</p>
<p>(In the more general case of nested objects, Druid would have generated a nested JSON column.)</p>
<p>In order to take the arrays apart, we can once again make use of the <code class="language-plaintext highlighter-rouge">UNNEST</code> function. This has to be enabled using a query context flag. In the console, use the <code class="language-plaintext highlighter-rouge">Edit context</code> function inside the query engine menu</p>
<p><img src="/assets/2023-05-01-04-editcontext.jpg" width="40%" /></p>
<p>and enter the context:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"enableUnnest"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>In the REST API, you can pass the context directly.</p>
<p>Then, unnest and group the items:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">order_item</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">order_count</span>
<span class="k">FROM</span> <span class="nv">"ristorante_auto"</span><span class="p">,</span> <span class="k">UNNEST</span><span class="p">(</span><span class="n">orders</span><span class="p">)</span> <span class="k">AS</span> <span class="n">t</span><span class="p">(</span><span class="n">order_item</span><span class="p">)</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span>
</code></pre></div></div>
<p><img src="/assets/2023-05-01-05-groupby.jpg" alt="Select groupby" /></p>
<p>Once you have done this, you can filter by individual order items and you don’t have all the quirks that we talked about when doing multi-value dimensions:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">customer</span><span class="p">,</span>
<span class="n">order_item</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">order_count</span>
<span class="k">FROM</span> <span class="nv">"ristorante_auto"</span><span class="p">,</span> <span class="k">UNNEST</span><span class="p">(</span><span class="n">orders</span><span class="p">)</span> <span class="k">AS</span> <span class="n">t</span><span class="p">(</span><span class="n">order_item</span><span class="p">)</span>
<span class="k">WHERE</span> <span class="n">order_item</span> <span class="o">=</span> <span class="s1">'tiramisu'</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
</code></pre></div></div>
<p><img src="/assets/2023-05-01-06-filter.jpg" alt="Filtered groupby" /></p>
<h2 id="conclusion">Conclusion</h2>
<ul>
<li>Druid can now do schema inference.</li>
<li>It can automatically detect primitive types, but also nested objects and arrays of primitives.</li>
<li>Typical Druid queries that would use multi-value dimensions in the past can now be done in a more standard way using array columns and <code class="language-plaintext highlighter-rouge">UNNEST</code>.</li>
</ul>Multi-Value Dimensions in Apache Druid (Part 5)2023-04-23T00:00:00+02:002023-04-23T00:00:00+02:00/2023/04/23/multivalue-dimensions-in-apache-druid-part-5<p><img src="/assets/2023-04-23-07.jpg" alt="" /></p>
<p>An interesting discussion that I had with a Druid user prompts me to continue the loose miniseries about multi-value dimensions in Apache Druid. The previous posts can be found here:</p>
<ul>
<li><a href="/2021/08/07/multivalue-dimensions-in-apache-druid-part-1/">part 1</a></li>
<li><a href="/2021/08/29/multivalue-dimensions-in-apache-druid-part-2/">part 2</a></li>
<li><a href="/2021/09/25/multivalue-dimensions-in-apache-druid-part-3/">part 3</a></li>
<li><a href="/2021/10/03/multivalue-dimensions-in-apache-druid-part-4/">part 4</a></li>
</ul>
<p>In <a href="/2021/08/07/multivalue-dimensions-in-apache-druid-part-1/">part 1</a> I pointed out what multi-value dimensions (MVD) are, and how they behave with respect to <code class="language-plaintext highlighter-rouge">GROUP BY</code> (they do an implicit unnest or, if you will, a lateral join), and also with respect to filtering using a <code class="language-plaintext highlighter-rouge">WHERE</code> clause (you get all the rows that match the <code class="language-plaintext highlighter-rouge">WHERE</code> condition, but no unnesting happens.)</p>
<p>But what if you want to combine grouping and filtering? The behavior of Druid in this case can be a bit surprising. Let’s have a look!</p>
<p>I am using Imply’s version 2023.03.01 of Druid, because I am going to show a few things using Imply’s graphical frontend. If you want to run the SQL examples only, Druid 25 quickstart works fine.</p>
<p>We are using the <code class="language-plaintext highlighter-rouge">ristorante</code> datasource from <a href="/2021/09/25/multivalue-dimensions-in-apache-druid-part-3/">part 3</a>; to create the datasource, follow the instructions given there. (You can make your life a bit easier: by now, Druid allows you to specify the multi-value handling mode directly in the ingestion wizard.)</p>
<p>Start with a simple analysis, breaking down the count of items by item and customer:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">customer</span><span class="p">,</span> <span class="n">orders</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">numOrders</span>
<span class="k">FROM</span> <span class="nv">"ristorante"</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span><span class="mi">2</span>
</code></pre></div></div>
<p><img src="/assets/2023-04-23-01.jpg" alt="" /></p>
<p>No surprises here. The MVD is unnested and the counts are broken down by item, as expected.</p>
<h2 id="quirks-in-multi-value-filtering">Quirks in multi-value filtering</h2>
<p>Now let’s filter by one specific item.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">customer</span><span class="p">,</span> <span class="n">orders</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">numOrders</span>
<span class="k">FROM</span> <span class="nv">"ristorante"</span>
<span class="k">WHERE</span> <span class="n">orders</span> <span class="o">=</span> <span class="s1">'tiramisu'</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span><span class="mi">2</span>
</code></pre></div></div>
<p><img src="/assets/2023-04-23-02.jpg" alt="" /></p>
<p>The result contains a lot of items that are definitely not Tiramisu! We got the filtering behavior of the plain query (without <code class="language-plaintext highlighter-rouge">GROUP BY</code>), and only after that was the unnesting applied!</p>
<p>Maybe if we try to filter <em>after</em> the grouping step, it would work?</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">customer</span><span class="p">,</span> <span class="n">orders</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">numOrders</span>
<span class="k">FROM</span> <span class="nv">"ristorante"</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span><span class="mi">2</span>
<span class="k">HAVING</span> <span class="n">orders</span> <span class="o">=</span> <span class="s1">'tiramisu'</span>
</code></pre></div></div>
<p><img src="/assets/2023-04-23-03.jpg" alt="" /></p>
<p>Alas, the result is the same. No matter how you write the filter, the query plan always selects whole rows of data as they are in the datasource. This is a common trap for the unwary, although the behavior is documented <a href="https://docs.imply.io/latest/druid/querying/multi-value-dimensions/#filtering">here</a> for native queries, into which SQL queries are translated internally.</p>
<p>The same paragraph also mentions <a href="https://docs.imply.io/latest/druid/querying/sql-multivalue-string-functions/">SQL multi-value functions</a>. This is where the path to a solution lies.</p>
<h2 id="filtering-multi-value-strings-properly">Filtering multi-value strings, properly</h2>
<p>The core of the solution is the <code class="language-plaintext highlighter-rouge">MV_FILTER_ONLY</code> function, which is applied to a multi-value field in the <em>projection</em> clause of the <code class="language-plaintext highlighter-rouge">SELECT</code> statement. Its first argument is the field that you want to filter on; the second argument is an <em>array literal</em> of the values that you want to keep.</p>
<p>Arrays are currently the red-headed stepchild of Druid data modeling, although this is about to change soon and there will be a lot more support for them. For now, you cannot declare an <code class="language-plaintext highlighter-rouge">ARRAY</code> column (MVDs are of type string). But you can define an array literal with the <code class="language-plaintext highlighter-rouge">ARRAY</code> constructor. There is also a set of multi-value functions that manipulate such <code class="language-plaintext highlighter-rouge">ARRAY</code>s, but that is another story for another time.</p>
<p>(The complementary function to <code class="language-plaintext highlighter-rouge">MV_FILTER_ONLY</code>, <code class="language-plaintext highlighter-rouge">MV_FILTER_NONE</code>, keeps only the values that are <em>not</em> contained in the array that you pass as the second argument.)</p>
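<p>A minimal sketch of <code class="language-plaintext highlighter-rouge">MV_FILTER_NONE</code> against the same datasource - this is just an illustration, not part of the tutorial query:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- list, per customer, every order item except tiramisu
SELECT
  customer,
  MV_FILTER_NONE(orders, ARRAY['tiramisu']) AS otherItems
FROM "ristorante"
</code></pre></div></div>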
<p>Let’s put together the query:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">customer</span><span class="p">,</span>
<span class="n">MV_FILTER_ONLY</span><span class="p">(</span><span class="n">orders</span><span class="p">,</span> <span class="n">ARRAY</span><span class="p">[</span><span class="s1">'tiramisu'</span><span class="p">])</span> <span class="k">AS</span> <span class="n">orderItem</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">numOrders</span>
<span class="k">FROM</span> <span class="nv">"ristorante"</span>
<span class="k">WHERE</span> <span class="n">orders</span> <span class="o">=</span> <span class="s1">'tiramisu'</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span><span class="mi">2</span>
</code></pre></div></div>
<p><img src="/assets/2023-04-23-04.jpg" alt="" /></p>
<p>You might be thinking that we can do without the <code class="language-plaintext highlighter-rouge">WHERE</code> clause, now that the filter is applied in the projection. Let’s try it out:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">customer</span><span class="p">,</span>
<span class="n">MV_FILTER_ONLY</span><span class="p">(</span><span class="n">orders</span><span class="p">,</span> <span class="n">ARRAY</span><span class="p">[</span><span class="s1">'tiramisu'</span><span class="p">])</span> <span class="k">AS</span> <span class="n">orderItem</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">numOrders</span>
<span class="k">FROM</span> <span class="nv">"ristorante"</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span><span class="mi">2</span>
</code></pre></div></div>
<p><img src="/assets/2023-04-23-05.jpg" alt="" /></p>
<p>Unfortunately, now the result set has rows even for customers that didn’t order Tiramisu, and what is worse, they get a <code class="language-plaintext highlighter-rouge">numOrders</code> value of 1. You have to apply both filters in order to get the correct result.</p>
<h2 id="more-complex-filters">More complex filters</h2>
<p>What if we want to list the orders not for one, but for multiple items? Sure, you could write a query like</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">customer</span><span class="p">,</span>
<span class="n">MV_FILTER_ONLY</span><span class="p">(</span><span class="n">orders</span><span class="p">,</span> <span class="n">ARRAY</span><span class="p">[</span><span class="s1">'espresso'</span><span class="p">,</span> <span class="s1">'tiramisu'</span><span class="p">])</span> <span class="k">AS</span> <span class="n">orderItem</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">numOrders</span>
<span class="k">FROM</span> <span class="nv">"ristorante"</span>
<span class="k">WHERE</span> <span class="n">orders</span> <span class="o">=</span> <span class="s1">'tiramisu'</span> <span class="k">OR</span> <span class="n">orders</span> <span class="o">=</span> <span class="s1">'espresso'</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span><span class="mi">2</span>
</code></pre></div></div>
<p>with a boolean condition in the <code class="language-plaintext highlighter-rouge">WHERE</code> clause. But there is a more elegant way, and it involves more <code class="language-plaintext highlighter-rouge">MV_</code> functions. Instead of the <code class="language-plaintext highlighter-rouge">OR</code> condition, write this:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">customer</span><span class="p">,</span>
<span class="n">MV_FILTER_ONLY</span><span class="p">(</span><span class="n">orders</span><span class="p">,</span> <span class="n">ARRAY</span><span class="p">[</span><span class="s1">'espresso'</span><span class="p">,</span> <span class="s1">'tiramisu'</span><span class="p">])</span> <span class="k">AS</span> <span class="n">orderItem</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">numOrders</span>
<span class="k">FROM</span> <span class="nv">"ristorante"</span>
<span class="k">WHERE</span> <span class="n">MV_OVERLAP</span><span class="p">(</span><span class="n">orders</span><span class="p">,</span> <span class="n">ARRAY</span><span class="p">[</span><span class="s1">'espresso'</span><span class="p">,</span> <span class="s1">'tiramisu'</span><span class="p">])</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span><span class="mi">2</span>
</code></pre></div></div>
<p><img src="/assets/2023-04-23-06.jpg" alt="" /></p>
<ul>
<li><code class="language-plaintext highlighter-rouge">MV_OVERLAP</code> returns 1 when both array arguments have any elements in common, meaning it can be used to model an <code class="language-plaintext highlighter-rouge">OR</code> condition which is true if any of the filter elements is in the data column.</li>
<li>Likewise, <code class="language-plaintext highlighter-rouge">MV_CONTAINS</code> returns 1 if <em>all</em> elements of its second parameter array are contained within the first parameter, and can be used to model an <code class="language-plaintext highlighter-rouge">AND</code> condition (see the sketch after this list).</li>
</ul>
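<p>For instance, the <code class="language-plaintext highlighter-rouge">AND</code> variant - all customers whose order includes <em>both</em> espresso and tiramisu - could look like this (a minimal sketch, analogous to the query above):</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- keep only rows whose orders array contains both items
SELECT
  customer,
  COUNT(*) AS numOrders
FROM "ristorante"
WHERE MV_CONTAINS(orders, ARRAY['espresso', 'tiramisu'])
GROUP BY 1
</code></pre></div></div>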
<h2 id="visualizing-it-with-imply-pivot">Visualizing it with Imply Pivot</h2>
<p>Imply Pivot now has an option to enable this strict filtering. If you filter by an MVD, there is an additional checkbox “Hide filtered-out values” that enables the behavior we just built manually with <code class="language-plaintext highlighter-rouge">MV</code> functions.</p>
<p><img src="/assets/2023-04-23-07.jpg" alt="" /></p>
<p>With the checkbox checked, we get the correct result:</p>
<p><img src="/assets/2023-04-23-08.jpg" alt="" /></p>
<p>With the checkbox unchecked, we get the same result as in the beginning - all orders of all people that had Tiramisu:</p>
<p><img src="/assets/2023-04-23-09.jpg" alt="" /></p>
<h2 id="learnings">Learnings</h2>
<ul>
<li>Because of the way implicit unnesting works with Apache Druid, you may be surprised by the result when you filter and group by the same multi-value column.</li>
<li>Strict filtering can be enabled using SQL multi-value functions.</li>
<li><code class="language-plaintext highlighter-rouge">MV_FILTER_ONLY</code> and <code class="language-plaintext highlighter-rouge">MV_FILTER_NONE</code> are used in the <em>projection</em> clause to eliminate unwanted values.</li>
<li><code class="language-plaintext highlighter-rouge">MV_CONTAINS</code> and <code class="language-plaintext highlighter-rouge">MV_OVERLAP</code> are used in the <em>filter</em> clause to eliminate rows that have none of the wanted values at all, and would not be caught in the projection clause.</li>
<li>The two sets of functions usually have to be used together to obtain correct results.</li>
<li>Imply Pivot is able to apply this logic transparently when querying one of its data cubes.</li>
</ul>Druid Sneak Peek: Timeseries Interpolation2023-04-08T00:00:00+02:002023-04-08T00:00:00+02:00/2023/04/08/druid-sneak-peek-timeseries-interpolation<p><img src="/assets/2023-04-08-01-hotandcold.jpg" alt="Druid Cookbook" /></p>
<p>Today I am going to look at another new Druid feature.</p>
<p>This is currently only available in <a href="https://imply.io/download-imply/">Imply Enterprise</a>, which ships with all the features discussed today and comes with a free 30 day trial license. I sure hope it will come to open source Druid too.</p>
<p>In this tutorial, you will</p>
<ul>
<li>ingest a data sample and</li>
<li>run a query to fill in missing values at regular time intervals, using a simple linear interpolation scheme.</li>
</ul>
<p>Why is it cool? It uses</p>
<ul>
<li>the new <code class="language-plaintext highlighter-rouge">UNNEST</code> function, which takes a collection and joins it laterally against the main table</li>
<li>the new <code class="language-plaintext highlighter-rouge">DATE_EXPAND</code> function, which takes a start and end date and a step interval, and creates an array of timestamps, spaced by the step interval, between the start and end points</li>
<li><a href="/2023/03/26/druid-26-sneak-peek-window-functions/">window functions</a>, in our case the <code class="language-plaintext highlighter-rouge">LEAD</code> function to retrieve values from the succeeding row.</li>
</ul>
<h2 id="the-data-sample">The data sample</h2>
<p>Today’s data set is a simple time series of temperature measurements, taken every 6 hours:</p>
<pre><code class="language-csv">date_start,temperature
2023-04-07T00:00:00Z,5
2023-04-07T06:00:00Z,8
2023-04-07T12:00:00Z,14
2023-04-07T18:00:00Z,12
2023-04-08T00:00:00Z,3
2023-04-08T06:00:00Z,6
2023-04-08T12:00:00Z,11
2023-04-08T18:00:00Z,5
</code></pre>
<p>We would like to fill the gaps, interpolating temperature values for each hour between the measurements.</p>
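<p>For instance, between the 5 degree reading at midnight and the 8 degree reading at 06:00, linear interpolation should produce an hourly grid that rises by 0.5 degrees per hour:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2023-04-07T00:00:00Z  5.0   (measured)
2023-04-07T01:00:00Z  5.5
2023-04-07T02:00:00Z  6.0
2023-04-07T03:00:00Z  6.5
2023-04-07T04:00:00Z  7.0
2023-04-07T05:00:00Z  7.5
2023-04-07T06:00:00Z  8.0   (measured)
</code></pre></div></div>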
<p>Let’s ingest the data into Druid.</p>
<p>The ingestion spec is straightforward:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"spec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"ioConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"inputSource"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"inline"</span><span class="p">,</span><span class="w">
</span><span class="nl">"data"</span><span class="p">:</span><span class="w"> </span><span class="s2">"date_start,temperature</span><span class="se">\n</span><span class="s2">2023-04-07T00:00:00Z,5</span><span class="se">\n</span><span class="s2">2023-04-07T06:00:00Z,8</span><span class="se">\n</span><span class="s2">2023-04-07T12:00:00Z,14</span><span class="se">\n</span><span class="s2">2023-04-07T18:00:00Z,12</span><span class="se">\n</span><span class="s2">2023-04-08T00:00:00Z,3</span><span class="se">\n</span><span class="s2">2023-04-08T06:00:00Z,6</span><span class="se">\n</span><span class="s2">2023-04-08T12:00:00Z,11</span><span class="se">\n</span><span class="s2">2023-04-08T18:00:00Z,5"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"inputFormat"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"csv"</span><span class="p">,</span><span class="w">
</span><span class="nl">"findColumnsFromHeader"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"tuningConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"partitionsSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"dynamic"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dataSchema"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"dataSource"</span><span class="p">:</span><span class="w"> </span><span class="s2">"iot_data"</span><span class="p">,</span><span class="w">
</span><span class="nl">"timestampSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"column"</span><span class="p">:</span><span class="w"> </span><span class="s2">"date_start"</span><span class="p">,</span><span class="w">
</span><span class="nl">"format"</span><span class="p">:</span><span class="w"> </span><span class="s2">"iso"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"granularitySpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"queryGranularity"</span><span class="p">:</span><span class="w"> </span><span class="s2">"none"</span><span class="p">,</span><span class="w">
</span><span class="nl">"rollup"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"segmentGranularity"</span><span class="p">:</span><span class="w"> </span><span class="s2">"month"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dimensionsSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"dimensions"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"double"</span><span class="p">,</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"temperature"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<h2 id="query-the-data">Query the data</h2>
<p>Both the window functions and the <code class="language-plaintext highlighter-rouge">UNNEST</code> function are currently hidden behind context flags. Use the following query context:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"windowsAreForClosers"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"enableUnnest"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>With that, here is the query:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="n">cte</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">__time</span> <span class="k">AS</span> <span class="n">thisTime</span><span class="p">,</span>
<span class="n">temperature</span><span class="p">,</span>
<span class="n">LEAD</span><span class="p">(</span><span class="n">__time</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="n">__time</span><span class="p">)</span> <span class="n">nextTime</span><span class="p">,</span>
<span class="n">LEAD</span><span class="p">(</span><span class="n">temperature</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="n">__time</span><span class="p">)</span> <span class="n">nextTemperature</span>
<span class="k">FROM</span> <span class="nv">"iot_data"</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span><span class="mi">2</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">timeByHour</span><span class="p">,</span>
<span class="k">CASE</span> <span class="p">(</span><span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">nextTime</span><span class="p">)</span> <span class="o">-</span> <span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">thisTime</span><span class="p">))</span>
<span class="k">WHEN</span> <span class="mi">0</span> <span class="k">THEN</span> <span class="n">temperature</span>
<span class="k">ELSE</span> <span class="p">((</span><span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">nextTime</span><span class="p">)</span> <span class="o">-</span> <span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">timeByHour</span><span class="p">))</span> <span class="o">*</span> <span class="n">temperature</span>
<span class="o">+</span> <span class="p">(</span><span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">timeByHour</span><span class="p">)</span> <span class="o">-</span> <span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">thisTime</span><span class="p">))</span> <span class="o">*</span> <span class="n">nextTemperature</span><span class="p">)</span>
<span class="o">/</span> <span class="p">(</span><span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">nextTime</span><span class="p">)</span> <span class="o">-</span> <span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">thisTime</span><span class="p">))</span>
<span class="k">END</span> <span class="n">interpTemp</span>
<span class="k">FROM</span> <span class="n">cte</span><span class="p">,</span> <span class="k">UNNEST</span><span class="p">(</span><span class="n">DATE_EXPAND</span><span class="p">(</span><span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">thisTime</span><span class="p">),</span> <span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">NVL</span><span class="p">(</span><span class="n">nextTime</span><span class="p">,</span> <span class="n">thisTime</span><span class="p">)),</span> <span class="s1">'PT1H'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">t</span><span class="p">(</span><span class="n">timeByHour</span><span class="p">)</span>
<span class="k">WHERE</span> <span class="n">timeByHour</span> <span class="o"><></span> <span class="n">nextTime</span>
</code></pre></div></div>
<p>It uses the common table expression technique that already came in handy last time.</p>
<p>Here is the result:</p>
<p><img src="/assets/2023-04-08-02.jpg" alt="query result" /></p>
<p>As you can see in the last column, the values have been neatly interpolated.</p>
<h2 id="side-quests">Side Quests</h2>
<p>It is worth looking at some details of the query. Some of these are common SQL techniques, others are due to quirks in the Druid query engine.</p>
<h3 id="the-date-expansion">The date expansion</h3>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DATE_EXPAND</span><span class="p">(</span><span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">thisTime</span><span class="p">),</span> <span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">NVL</span><span class="p">(</span><span class="n">nextTime</span><span class="p">,</span> <span class="n">thisTime</span><span class="p">)),</span> <span class="s1">'PT1H'</span><span class="p">)</span>
</code></pre></div></div>
<p>The general syntax would be <code class="language-plaintext highlighter-rouge">DATE_EXPAND(from, to, interval)</code>. But since we are using <code class="language-plaintext highlighter-rouge">LEAD()</code> to get the <code class="language-plaintext highlighter-rouge">to</code> value, the last row will have <em>null</em> in that place. Unfortunately, <code class="language-plaintext highlighter-rouge">DATE_EXPAND</code> doesn’t handle that situation well and the query fails. That’s why in the case of a <em>null</em> value, I use the row time instead, generating only one row.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WHERE</span> <span class="n">timeByHour</span> <span class="o"><></span> <span class="n">nextTime</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">DATE_EXPAND</code> considers its time interval as left and right inclusive. This means that the end values will be duplicated with the start values of the next interval. The <code class="language-plaintext highlighter-rouge">WHERE</code> clause filters out the duplicates.</p>
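<p>Schematically, for two adjacent measurement rows (the real function arguments are millisecond timestamps):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DATE_EXPAND(00:00, 06:00, 'PT1H')  ->  00:00 01:00 02:00 03:00 04:00 05:00 06:00
DATE_EXPAND(06:00, 12:00, 'PT1H')  ->  06:00 07:00 08:00 09:00 10:00 11:00 12:00
                                       ^^^^^ appears twice, removed by the WHERE clause
</code></pre></div></div>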
<h3 id="the-interpolation">The interpolation</h3>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">CASE</span> <span class="p">(</span><span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">nextTime</span><span class="p">)</span> <span class="o">-</span> <span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">thisTime</span><span class="p">))</span>
<span class="k">WHEN</span> <span class="mi">0</span> <span class="k">THEN</span> <span class="n">temperature</span> <span class="p">...</span>
</code></pre></div></div>
<p>The general formula for linear interpolation has to divide by the length of the time interval between the two measurements. If this interval is 0 - which happens for the last row, where the corner case treatment above substituted the row’s own time - just use the one temperature value that is provided.</p>
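<p>Spelled out with the column names from the CTE, the interpolated value at a grid point <code class="language-plaintext highlighter-rouge">timeByHour</code> is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>interpTemp = ( (nextTime - timeByHour) * temperature
             + (timeByHour - thisTime) * nextTemperature )
             / (nextTime - thisTime)
</code></pre></div></div>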
<h2 id="conclusion">Conclusion</h2>
<p>Druid’s timeseries capabilities are ever expanding.</p>
<ul>
<li>With <code class="language-plaintext highlighter-rouge">DATE_EXPAND</code> and <code class="language-plaintext highlighter-rouge">UNNEST</code>, it is possible to generate evenly spaced time series.</li>
<li>Using window functions and standard interpolation algorithms, this can be used to fill in missing values.</li>
<li>Currently this is only available in Imply’s release.</li>
</ul>
<hr />
<p class="attribution">"<a target="_blank" rel="noopener noreferrer" href="https://www.flickr.com/photos/53575715@N02/6620214217">Hot & Cold</a>" by <a target="_blank" rel="noopener noreferrer" href="https://www.flickr.com/photos/53575715@N02">astronomy_blog</a> is licensed under <a target="_blank" rel="noopener noreferrer" href="https://creativecommons.org/licenses/by-nc-sa/2.0/?ref=openverse">CC BY-NC-SA 2.0
<img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" />
<img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="height: 1em; margin-right: 0.125em; display: inline;" />
<img src="https://mirrors.creativecommons.org/presskit/icons/nc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" />
<img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /></a>. </p>Druid 26 Sneak Peek: Window Functions2023-03-26T00:00:00+01:002023-03-26T00:00:00+01:00/2023/03/26/druid-26-sneak-peek-window-functions<p><img src="/assets/2021-12-21-elf.jpg" alt="Druid Cookbook" /></p>
<p><a href="https://www.linkedin.com/feed/update/urn:li:activity:7043593237915148288/">Great changes have been announced for the upcoming Druid 26.0 release.</a> The one that excites me the most is the introduction of <a href="https://github.com/paul-rogers/druid/wiki/Window-Functions">window functions</a>.</p>
<p>Window functions allow a query to interrelate and aggregate rows beyond a simple <code class="language-plaintext highlighter-rouge">GROUP BY</code>. <a href="/2022/11/05/druid-data-cookbook-cumulative-sums-in-druid-sql/">Previously</a>, I have looked at ways to emulate such processing patterns using self joins or grouping sets in Druid. But now, we are close to getting window functions as first class citizens.</p>
<p>This is a sneak peek into Druid 26 functionality. In order to use the new functions, you can (as of the time of writing) <a href="https://druid.apache.org/docs/latest/development/build.html">build Druid</a> from the HEAD of the master branch:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/apache/druid.git
<span class="nb">cd </span>druid
mvn clean <span class="nb">install</span> <span class="nt">-Pdist</span> <span class="nt">-DskipTests</span>
</code></pre></div></div>
<p>Then follow the instructions to locate and install the tarball.</p>
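<p>A minimal sketch of that step - the tarball is produced under <code class="language-plaintext highlighter-rouge">distribution/target</code>, but the exact file name depends on the snapshot version being built:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># extract the freshly built distribution tarball and switch into it
tar -xzf distribution/target/apache-druid-*-bin.tar.gz
cd apache-druid-*/
</code></pre></div></div>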
<p>All this is still under development so it is undocumented, and hidden behind a secret query context option. (We will look at that in a moment.) Also notice that window functions only work within <code class="language-plaintext highlighter-rouge">GROUP BY</code> queries, and there are still some other limitations. But it is fast progressing work.</p>
<p>In this tutorial, you will</p>
<ul>
<li>ingest a data sample and</li>
<li>do a quick cumulative report using window functions.</li>
</ul>
<p><em><strong>Disclaimer:</strong> This tutorial uses undocumented functionality and unreleased code. This blog is neither endorsed by Imply nor by the Apache Druid PMC. It merely collects the results of personal experiments. The features described here might, in the final release, work differently, or not at all. In addition, the entire build, or execution, may fail. Your mileage may vary.</em></p>
<h2 id="lets-do-it-in-practice">Let’s do it in practice</h2>
<p>I am taking a data sample from <a href="https://www.tinybird.co/blog-posts/coming-soon-on-clickhouse-window-functions">the Tinybird blog</a> which is simulated data from an ecommerce store. The data is downloadable from <a href="https://storage.googleapis.com/tinybird-assets/datasets/guides/events_10K.csv">here</a> and has a straightforward format:</p>
<ul>
<li>a <em>timestamp</em></li>
<li>string fields for <em>product id, user id,</em> and <em>event type</em></li>
<li>an <em>extra data</em> field: this is a variable JSON object whose schema depends on the event type.</li>
</ul>
<p>Let’s see if we can do some interesting things with this!</p>
<h2 id="ingestion">Ingestion</h2>
<p>Ingest the data using <a href="https://druid.apache.org/docs/latest/multi-stage-query/index.html">SQL based ingestion</a>. In order to keep the <code class="language-plaintext highlighter-rouge">extra_data</code> column as nested JSON, apply the <code class="language-plaintext highlighter-rouge">PARSE_JSON</code> function in the ingestion query:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"events"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span><span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"http","uris":["https://storage.googleapis.com/tinybird-assets/datasets/guides/events_10K.csv"]}'</span><span class="p">,</span>
<span class="s1">'{"type":"csv","findColumnsFromHeader":false,"columns":["date","product_id","user_id","event","extra_data"]}'</span><span class="p">,</span>
<span class="s1">'[{"name":"date","type":"string"},{"name":"product_id","type":"string"},{"name":"user_id","type":"long"},{"name":"event","type":"string"},{"name":"extra_data","type":"string"}]'</span>
<span class="p">)</span>
<span class="p">))</span>
<span class="k">SELECT</span>
<span class="n">TIME_PARSE</span><span class="p">(</span><span class="nv">"date"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"product_id"</span><span class="p">,</span>
<span class="nv">"user_id"</span><span class="p">,</span>
<span class="nv">"event"</span><span class="p">,</span>
<span class="n">PARSE_JSON</span><span class="p">(</span><span class="nv">"extra_data"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"extra_data"</span>
<span class="k">FROM</span> <span class="nv">"ext"</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">MONTH</span>
</code></pre></div></div>
<p>You can run this in the query tab of the Druid console like so:</p>
<p><img src="/assets/2023-03-26-01-ingest.jpg" alt="MSQ ingestion of data sample" /></p>
<p>or you can enter the same SQL in the SQL ingestion wizard and monitor progress in the ingestion tab.</p>
<h2 id="looking-at-the-data">Looking at the data</h2>
<p>Let’s get an idea of the amount of data in there. One of the neat things in the Druid console is that it has the queries for these basic aggregations in the context menu for each datasource in the query window:</p>
<p><img src="/assets/2023-03-26-02-selectminmaxtime.jpg" width="50%" /></p>
<p>This gives us a quick query for the date range of the sample</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="k">MIN</span><span class="p">(</span><span class="nv">"__time"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"min_time"</span><span class="p">,</span>
<span class="k">MAX</span><span class="p">(</span><span class="nv">"__time"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"max_time"</span>
<span class="k">FROM</span> <span class="nv">"events"</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="p">()</span>
</code></pre></div></div>
<p>which shows that the data spans more than 3 years (2017-2020).</p>
<p>This is why I chose monthly time partitions - given the small size of the sample, yearly would also work well.</p>
<p>Look at the data with a <code class="language-plaintext highlighter-rouge">SELECT * FROM "events"</code> query:</p>
<p><img src="/assets/2023-03-26-03-selectstar.jpg" alt="Select all data" /></p>
<p>We are interested in <code class="language-plaintext highlighter-rouge">buy</code> events: for these, the amount of the purchase is in the <code class="language-plaintext highlighter-rouge">price</code> subfield that we can extract with <code class="language-plaintext highlighter-rouge">JSON_VALUE</code>. One of the latest additions in Druid is that you can specify the expected return type inside the function call like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>JSON_VALUE(extra_data, '$.price' RETURNING DOUBLE)
</code></pre></div></div>
<p>Thus we guarantee that we get only <code class="language-plaintext highlighter-rouge">DOUBLE</code> values.</p>
<h2 id="building-the-report">Building the report</h2>
<p>I would like to get a report like this: For each day, give me</p>
<ul>
<li>the number of purchase transactions for that day</li>
<li>the cumulative number of transactions from all history up to and including that day</li>
<li>the total revenue of that day</li>
<li>the total revenue up to and including that day.</li>
</ul>
<h3 id="using-a-cte-to-prepare-the-fields">Using a CTE to prepare the fields</h3>
<p>In order to prepare that report, let’s first collect the fields we need in a <em><a href="https://learnsql.com/blog/what-is-common-table-expression/">common table expression (CTE)</a>:</em></p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">SELECT</span>
<span class="n">FLOOR</span><span class="p">(</span><span class="n">__time</span> <span class="k">TO</span> <span class="k">DAY</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"date"</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">purchases</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="n">JSON_VALUE</span><span class="p">(</span><span class="n">extra_data</span><span class="p">,</span> <span class="s1">'$.price'</span> <span class="n">RETURNING</span> <span class="nb">DOUBLE</span><span class="p">))</span> <span class="k">AS</span> <span class="n">revenue</span>
<span class="k">FROM</span> <span class="nv">"events"</span>
<span class="k">WHERE</span> <span class="n">event</span> <span class="o">=</span> <span class="s1">'buy'</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span>
</code></pre></div></div>
<p>Here, we filter the data, extract the <code class="language-plaintext highlighter-rouge">price</code> field, and group everything by day. We will package that into a <code class="language-plaintext highlighter-rouge">WITH</code> clause that defines the input for the main query.</p>
<h3 id="setting-the-context-flag-to-enable-experimental-window-functions">Setting the context flag to enable experimental window functions</h3>
<p>From the menu next to the <code class="language-plaintext highlighter-rouge">Run</code> button, select <code class="language-plaintext highlighter-rouge">Edit Context</code></p>
<p><img src="/assets/2023-03-26-04-editcontext.jpg" width="50%" /></p>
<p>and enter the option <code class="language-plaintext highlighter-rouge">"windowsAreForClosers": true</code> to enable window functions:</p>
<p><img src="/assets/2023-03-26-05-contextoption.png" width="50%" /></p>
<p>You could also specify the context when running the query through the <a href="https://druid.apache.org/docs/latest/querying/sql-api.html">REST API endpoint</a> (unfortunately not yet through JDBC).</p>
<h3 id="putting-the-query-together">Putting the query together</h3>
<p>Now we have everything we need. The cumulative sums will be computed using a window clause like this:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SUM</span><span class="p">(</span><span class="n">purchases</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="nv">"date"</span> <span class="k">ASC</span> <span class="k">ROWS</span> <span class="k">BETWEEN</span> <span class="n">UNBOUNDED</span> <span class="k">PRECEDING</span> <span class="k">AND</span> <span class="k">CURRENT</span> <span class="k">ROW</span><span class="p">)</span>
</code></pre></div></div>
<p>where the daily sums have been computed by the <code class="language-plaintext highlighter-rouge">GROUP BY</code> in the CTE, and the window aggregation does the cumulative sums.</p>
<p>Here is the whole query:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="n">cte</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">FLOOR</span><span class="p">(</span><span class="n">__time</span> <span class="k">TO</span> <span class="k">DAY</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"date"</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">purchases</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="n">JSON_VALUE</span><span class="p">(</span><span class="n">extra_data</span><span class="p">,</span> <span class="s1">'$.price'</span> <span class="n">RETURNING</span> <span class="nb">DOUBLE</span><span class="p">))</span> <span class="k">AS</span> <span class="n">revenue</span>
<span class="k">FROM</span> <span class="nv">"events"</span>
<span class="k">WHERE</span> <span class="n">event</span> <span class="o">=</span> <span class="s1">'buy'</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="nv">"date"</span><span class="p">,</span>
<span class="n">purchases</span> <span class="k">AS</span> <span class="n">daily_purchases</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="n">purchases</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="nv">"date"</span> <span class="k">ASC</span> <span class="k">ROWS</span> <span class="k">BETWEEN</span> <span class="n">UNBOUNDED</span> <span class="k">PRECEDING</span> <span class="k">AND</span> <span class="k">CURRENT</span> <span class="k">ROW</span><span class="p">)</span> <span class="k">AS</span> <span class="n">cume_purchases</span><span class="p">,</span>
<span class="n">revenue</span> <span class="k">AS</span> <span class="n">daily_revenue</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="n">revenue</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="nv">"date"</span> <span class="k">ASC</span> <span class="k">ROWS</span> <span class="k">BETWEEN</span> <span class="n">UNBOUNDED</span> <span class="k">PRECEDING</span> <span class="k">AND</span> <span class="k">CURRENT</span> <span class="k">ROW</span><span class="p">)</span> <span class="k">AS</span> <span class="n">cume_revenue</span>
<span class="k">FROM</span> <span class="n">cte</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="mi">1</span> <span class="k">ASC</span>
</code></pre></div></div>
<p>You can run it in the console:</p>
<p><img src="/assets/2023-03-26-06-query.jpg" alt="Window query in Druid console" /></p>
<p>The columns named <em>cume…</em> contain the result of the window aggregations.</p>
<p>And using the <code class="language-plaintext highlighter-rouge">Explain</code> function, notice that this SQL actually translates to a new native query type:</p>
<p><img src="/assets/2023-03-26-07-nativequery.jpg" width="70%" /></p>
<h2 id="conclusion">Conclusion</h2>
<ul>
<li>If you take a sneak peek at the public Druid repository, you can follow the work that is being done on window functions. While these are currently a bit rough around the edges, you can already do quite a bit with this new functionality.</li>
<li>Because it is work in progress, this is currently undocumented and hidden behind a feature flag that needs to be enabled in the query context for each query that uses it.</li>
<li>This is evolving rapidly and will likely see a lot of enhancements very soon.</li>
</ul>
<p><em>Edit 2023-03-27:</em> One of my readers pointed out a simplification of the query - the first version carried a redundant <code class="language-plaintext highlighter-rouge">GROUP BY</code> in the final query, but it turns out that Druid is smart enough to plan a grouped (timeseries) query based on the grouping in the CTE. This is reflected above now.</p>
<hr />
<p>“<a href="https://www.flickr.com/photos/mhlimages/48051262646/">This image is taken from Page 500 of Praktisches Kochbuch für die gewöhnliche und feinere Küche</a>” by <a href="https://www.flickr.com/photos/mhlimages/">Medical Heritage Library, Inc.</a> is licensed under <a target="_blank" rel="noopener noreferrer" href="https://creativecommons.org/licenses/by-nc-sa/2.0/">CC BY-NC-SA 2.0 <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/nc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /></a>.</p>Selective Bulk Upserts in Apache Druid2023-03-07T00:00:00+01:002023-03-07T00:00:00+01:00/2023/03/07/selective-bulk-upserts-in-apache-druid<p><a href="https://druid.apache.org/">Apache Druid</a> is designed for high query speed. The <a href="https://druid.apache.org/docs/latest/design/segments.html">data segments</a> that make up a Druid datasource (think: table) are generally immutable: You do not update or replace individual rows of data; however you can replace an entire segment with a new version of itself.</p>
<p>Sometimes in analytics, you have to update or insert rows of data in a segment. This may be due to a state change - such as an order being shipped, or canceled, or returned. Generally, you would have a <em>key</em> column in your data, and based on that key you would update a row if it exists in the table already, and insert it otherwise. This is called <code class="language-plaintext highlighter-rouge">upsert</code>, after the name of the command that is used in many SQL dialects.</p>
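<p>For reference, here is what the pattern looks like in a dialect that supports it natively - PostgreSQL in this sketch; the table and column names are made up for illustration:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- update the row with key 4711 if it exists, insert it otherwise
INSERT INTO orders (order_id, status, amount)
VALUES (4711, 'shipped', 42.50)
ON CONFLICT (order_id)
DO UPDATE SET status = EXCLUDED.status, amount = EXCLUDED.amount;
</code></pre></div></div>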
<p><a href="https://imply.io/blog/upserts-and-data-deduplication-with-druid/">This Imply blog</a> talks about the various strategies to handle such scenarios with Druid. But today, I want to look at a special case of Upsert, where you want to update or insert a bunch of rows based on a key and time interval.</p>
<h2 id="the-use-case">The use case</h2>
<p>I encountered this scenario with some of my AdTech customers. They obtain performance analytics data by issuing API calls to the ad network providers. These API calls have to cover certain predefined time ranges - data is downloaded in bulk. Moreover, depending on factors like late arriving conversion data or changes of the attribution model, metrics associated with the data rows may change over time.</p>
<p>If we want to make these data available in Druid, we will have to cut out existing data by key and interval, and transplant the new data instead, like in this diagram:</p>
<p><img src="/assets/2023-03-07-01.png" alt="Combining ingestion" /></p>
<h2 id="solution-outline">Solution outline</h2>
<p>In order to achieve this behavior in Druid, we will use a <a href="https://druid.apache.org/docs/latest/ingestion/native-batch-input-sources.html#combining-input-source"><code class="language-plaintext highlighter-rouge">combining</code> input source</a> in the ingestion spec. A combining input source contains a list of delegate input sources - we will use two, but you can actually have more than two.</p>
<p>The ingestion process will read data from all delegate input sources and ingest them, much like what a <code class="language-plaintext highlighter-rouge">union all</code> in SQL does. The nice thing is that this process is transactional - it will succeed either completely, or not at all.</p>
<p>We have to make sure that all input sources have the same schema and, where that applies, the same input format. In practice this means:</p>
<ul>
<li>you can combine multiple external sources only if they are all parsed in the same way</li>
<li>or you can combine external sources like above with any number of <code class="language-plaintext highlighter-rouge">druid</code> input sources (reindexing).</li>
</ul>
<p>The latter is what we are going to do.</p>
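<p>Before we build the real spec, here is a minimal sketch of the shape of such a combining <code class="language-plaintext highlighter-rouge">inputSource</code> fragment - all field values here are placeholders:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"inputSource": {
  "type": "combining",
  "delegates": [
    { "type": "druid", "dataSource": "ad_data", "interval": "2023-01-01/2023-01-08" },
    { "type": "local", "baseDir": "/path/to/data", "filter": "data2.json" }
  ]
}
</code></pre></div></div>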
<h2 id="tutorial-how-to-do-it-in-practise">Tutorial: How to do it in practice</h2>
<p>In this tutorial, we will set up a bulk upsert using the combining input source technique and two stripped down sample data sets.</p>
<p>We will:</p>
<ul>
<li>load an initial data sample for multiple ad networks</li>
<li>show the upsert technique by replacing data for one network and a specific date range.</li>
</ul>
<p>The tutorial can be done using the <a href="https://druid.apache.org/docs/latest/tutorials/index.html">Druid 25.0 quickstart</a>.</p>
<p>Note: Because the tutorial assumes that you are running all Druid processes on a single machine, it can work with local file system data. In a cluster setup, you would have to use a network mount or (more commonly) cloud storage, like S3.</p>
<h3 id="initial-load">Initial load</h3>
<p>The first data sample serves to populate the table. It has one week’s worth of data from three ad networks:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{"date": "2023-01-01T00:00:00Z", "ad_network": "gaagle", "ads_impressions": 2770, "ads_revenue": 330.69}
{"date": "2023-01-01T00:00:00Z", "ad_network": "fakebook", "ads_impressions": 9646, "ads_revenue": 137.85}
{"date": "2023-01-01T00:00:00Z", "ad_network": "twottr", "ads_impressions": 1139, "ads_revenue": 493.73}
{"date": "2023-01-02T00:00:00Z", "ad_network": "gaagle", "ads_impressions": 9066, "ads_revenue": 368.66}
{"date": "2023-01-02T00:00:00Z", "ad_network": "fakebook", "ads_impressions": 4426, "ads_revenue": 170.96}
{"date": "2023-01-02T00:00:00Z", "ad_network": "twottr", "ads_impressions": 9110, "ads_revenue": 452.2}
{"date": "2023-01-03T00:00:00Z", "ad_network": "gaagle", "ads_impressions": 3275, "ads_revenue": 363.53}
{"date": "2023-01-03T00:00:00Z", "ad_network": "fakebook", "ads_impressions": 9494, "ads_revenue": 426.37}
{"date": "2023-01-03T00:00:00Z", "ad_network": "twottr", "ads_impressions": 4325, "ads_revenue": 107.44}
{"date": "2023-01-04T00:00:00Z", "ad_network": "gaagle", "ads_impressions": 8816, "ads_revenue": 311.53}
{"date": "2023-01-04T00:00:00Z", "ad_network": "fakebook", "ads_impressions": 8955, "ads_revenue": 254.5}
{"date": "2023-01-04T00:00:00Z", "ad_network": "twottr", "ads_impressions": 6905, "ads_revenue": 211.74}
{"date": "2023-01-05T00:00:00Z", "ad_network": "gaagle", "ads_impressions": 3075, "ads_revenue": 382.41}
{"date": "2023-01-05T00:00:00Z", "ad_network": "fakebook", "ads_impressions": 4870, "ads_revenue": 205.84}
{"date": "2023-01-05T00:00:00Z", "ad_network": "twottr", "ads_impressions": 1418, "ads_revenue": 282.21}
{"date": "2023-01-06T00:00:00Z", "ad_network": "gaagle", "ads_impressions": 7413, "ads_revenue": 322.43}
{"date": "2023-01-06T00:00:00Z", "ad_network": "fakebook", "ads_impressions": 1251, "ads_revenue": 265.52}
{"date": "2023-01-06T00:00:00Z", "ad_network": "twottr", "ads_impressions": 8055, "ads_revenue": 394.56}
{"date": "2023-01-07T00:00:00Z", "ad_network": "gaagle", "ads_impressions": 4279, "ads_revenue": 317.84}
{"date": "2023-01-07T00:00:00Z", "ad_network": "fakebook", "ads_impressions": 5848, "ads_revenue": 162.96}
{"date": "2023-01-07T00:00:00Z", "ad_network": "twottr", "ads_impressions": 9449, "ads_revenue": 379.21}
</code></pre></div></div>
<p>Save this sample locally to a file named <code class="language-plaintext highlighter-rouge">data1.json</code> and ingest it using this ingestion spec (replace the path in <code class="language-plaintext highlighter-rouge">baseDir</code> with the path you saved the sample file to):</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"spec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"ioConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"inputSource"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"local"</span><span class="p">,</span><span class="w">
</span><span class="nl">"baseDir"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/<my base path>"</span><span class="p">,</span><span class="w">
</span><span class="nl">"filter"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data1.json"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"inputFormat"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"json"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"appendToExisting"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"tuningConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"partitionsSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"hashed"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"forceGuaranteedRollup"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dataSchema"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"dataSource"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ad_data"</span><span class="p">,</span><span class="w">
</span><span class="nl">"timestampSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"column"</span><span class="p">:</span><span class="w"> </span><span class="s2">"date"</span><span class="p">,</span><span class="w">
</span><span class="nl">"format"</span><span class="p">:</span><span class="w"> </span><span class="s2">"iso"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dimensionsSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"dimensions"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"ad_network"</span><span class="p">,</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"long"</span><span class="p">,</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ads_impressions"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ads_revenue"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"double"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"granularitySpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"queryGranularity"</span><span class="p">:</span><span class="w"> </span><span class="s2">"none"</span><span class="p">,</span><span class="w">
</span><span class="nl">"rollup"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"segmentGranularity"</span><span class="p">:</span><span class="w"> </span><span class="s2">"week"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
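<p>If you prefer the API over the console, you can submit this spec directly to the task endpoint. A minimal sketch, assuming the quickstart router listens on <code class="language-plaintext highlighter-rouge">localhost:8888</code> and the spec is saved as <code class="language-plaintext highlighter-rouge">ingest1.json</code> (the file name is my choice):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -X POST http://localhost:8888/druid/indexer/v1/task \
  -H "Content-Type: application/json" \
  -d @ingest1.json
</code></pre></div></div>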
<p>You can create this ingestion spec by clicking through the console wizard, too. There are a few notable settings here, though:</p>
<ul>
<li>I’ve used hash partitioning, which by default hashes over all dimensions. The default in the wizard is dynamic partitioning, but with batch data you would usually choose dynamic partitioning only if you want to append data to an existing data set. In all other cases, use hash or range partitioning (see the sketch after this list).</li>
<li>I’ve configured weekly segments. This is to show that the technique works even if the updated range does not align with segment boundaries.</li>
</ul>
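<p>For reference, here is a sketch of what a range <code class="language-plaintext highlighter-rouge">partitionsSpec</code> could look like; the partition dimension and the row target are illustrative values, not recommendations:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  "partitionsSpec": {
    "type": "range",
    "partitionDimensions": ["ad_network"],
    "targetRowsPerSegment": 5000000
  }
</code></pre></div></div>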
<h3 id="doing-the-upsert">Doing the upsert</h3>
<p>Now, let’s fast-forward two days in time. We have downloaded a bunch of new and updated data from the <code class="language-plaintext highlighter-rouge">gaagle</code> network. The new data looks like this:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-03T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">4521</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">378.65</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-04T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">4330</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">464.02</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-05T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">6088</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">320.57</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-06T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">3417</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">162.77</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-07T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">9762</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">76.27</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-08T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">1484</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">188.17</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-09T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">1845</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">287.5</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Save this sample as <code class="language-plaintext highlighter-rouge">data2.json</code> and proceed to replace/insert the new data using this spec:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"spec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"ioConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"inputSource"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"combining"</span><span class="p">,</span><span class="w">
</span><span class="nl">"delegates"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"druid"</span><span class="p">,</span><span class="w">
</span><span class="nl">"dataSource"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ad_data"</span><span class="p">,</span><span class="w">
</span><span class="nl">"interval"</span><span class="p">:</span><span class="w"> </span><span class="s2">"1000/3000"</span><span class="p">,</span><span class="w">
</span><span class="nl">"filter"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"not"</span><span class="p">,</span><span class="w">
</span><span class="nl">"field"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"and"</span><span class="p">,</span><span class="w">
</span><span class="nl">"fields"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"selector"</span><span class="p">,</span><span class="w">
</span><span class="nl">"dimension"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ad_network"</span><span class="p">,</span><span class="w">
</span><span class="nl">"value"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"interval"</span><span class="p">,</span><span class="w">
</span><span class="nl">"dimension"</span><span class="p">:</span><span class="w"> </span><span class="s2">"__time"</span><span class="p">,</span><span class="w">
</span><span class="nl">"intervals"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"2023-01-03T00:00:00Z/2023-01-10T00:00:00Z"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"extractionFn"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"local"</span><span class="p">,</span><span class="w">
</span><span class="nl">"files"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"/<my base path>/data2.json"</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"inputFormat"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"json"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"tuningConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"partitionsSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"hashed"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"forceGuaranteedRollup"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"maxNumConcurrentSubTasks"</span><span class="p">:</span><span class="w"> </span><span class="mi">2</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dataSchema"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"timestampSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"column"</span><span class="p">:</span><span class="w"> </span><span class="s2">"__time"</span><span class="p">,</span><span class="w">
</span><span class="nl">"missingValue"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2010-01-01T00:00:00Z"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"transformSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"transforms"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"__time"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"expression"</span><span class="p">,</span><span class="w">
</span><span class="nl">"expression"</span><span class="p">:</span><span class="w"> </span><span class="s2">"nvl(timestamp_parse(date), </span><span class="se">\"</span><span class="s2">__time</span><span class="se">\"</span><span class="s2">)"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"granularitySpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"rollup"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"queryGranularity"</span><span class="p">:</span><span class="w"> </span><span class="s2">"none"</span><span class="p">,</span><span class="w">
</span><span class="nl">"segmentGranularity"</span><span class="p">:</span><span class="w"> </span><span class="s2">"week"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dimensionsSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"dimensions"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ad_network"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ads_impressions"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"long"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ads_revenue"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"double"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dataSource"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ad_data"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Here’s the result of a <code class="language-plaintext highlighter-rouge">SELECT *</code> query after the ingestion finishes:</p>
<table>
<thead>
<tr>
<th>__time</th>
<th>ad_network</th>
<th>ads_impressions</th>
<th>ads_revenue</th>
</tr>
</thead>
<tbody>
<tr>
<td>2023-01-01T00:00:00.000Z</td>
<td>fakebook</td>
<td>9646</td>
<td>137.85</td>
</tr>
<tr>
<td>2023-01-01T00:00:00.000Z</td>
<td>gaagle</td>
<td>2770</td>
<td>330.69</td>
</tr>
<tr>
<td>2023-01-01T00:00:00.000Z</td>
<td>twottr</td>
<td>1139</td>
<td>493.73</td>
</tr>
<tr>
<td>2023-01-02T00:00:00.000Z</td>
<td>fakebook</td>
<td>4426</td>
<td>170.96</td>
</tr>
<tr>
<td>2023-01-02T00:00:00.000Z</td>
<td>gaagle</td>
<td>9066</td>
<td>368.66</td>
</tr>
<tr>
<td>2023-01-02T00:00:00.000Z</td>
<td>twottr</td>
<td>9110</td>
<td>452.2</td>
</tr>
<tr>
<td>2023-01-03T00:00:00.000Z</td>
<td>fakebook</td>
<td>9494</td>
<td>426.37</td>
</tr>
<tr>
<td><em>2023-01-03T00:00:00.000Z</em></td>
<td><em>gaagle</em></td>
<td><em>4521</em></td>
<td><em>378.65</em></td>
</tr>
<tr>
<td>2023-01-03T00:00:00.000Z</td>
<td>twottr</td>
<td>4325</td>
<td>107.44</td>
</tr>
<tr>
<td>2023-01-04T00:00:00.000Z</td>
<td>fakebook</td>
<td>8955</td>
<td>254.5</td>
</tr>
<tr>
<td><em>2023-01-04T00:00:00.000Z</em></td>
<td><em>gaagle</em></td>
<td><em>4330</em></td>
<td><em>464.02</em></td>
</tr>
<tr>
<td>2023-01-04T00:00:00.000Z</td>
<td>twottr</td>
<td>6905</td>
<td>211.74</td>
</tr>
<tr>
<td>2023-01-05T00:00:00.000Z</td>
<td>fakebook</td>
<td>4870</td>
<td>205.84</td>
</tr>
<tr>
<td><em>2023-01-05T00:00:00.000Z</em></td>
<td><em>gaagle</em></td>
<td><em>6088</em></td>
<td><em>320.57</em></td>
</tr>
<tr>
<td>2023-01-05T00:00:00.000Z</td>
<td>twottr</td>
<td>1418</td>
<td>282.21</td>
</tr>
<tr>
<td>2023-01-06T00:00:00.000Z</td>
<td>fakebook</td>
<td>1251</td>
<td>265.52</td>
</tr>
<tr>
<td><em>2023-01-06T00:00:00.000Z</em></td>
<td><em>gaagle</em></td>
<td><em>3417</em></td>
<td><em>162.77</em></td>
</tr>
<tr>
<td>2023-01-06T00:00:00.000Z</td>
<td>twottr</td>
<td>8055</td>
<td>394.56</td>
</tr>
<tr>
<td>2023-01-07T00:00:00.000Z</td>
<td>fakebook</td>
<td>5848</td>
<td>162.96</td>
</tr>
<tr>
<td><em>2023-01-07T00:00:00.000Z</em></td>
<td><em>gaagle</em></td>
<td><em>9762</em></td>
<td><em>76.27</em></td>
</tr>
<tr>
<td>2023-01-07T00:00:00.000Z</td>
<td>twottr</td>
<td>9449</td>
<td>379.21</td>
</tr>
<tr>
<td><em>2023-01-08T00:00:00.000Z</em></td>
<td><em>gaagle</em></td>
<td><em>1484</em></td>
<td><em>188.17</em></td>
</tr>
<tr>
<td><em>2023-01-09T00:00:00.000Z</em></td>
<td><em>gaagle</em></td>
<td><em>1845</em></td>
<td><em>287.5</em></td>
</tr>
</tbody>
</table>
<p>Note how all the rows in <em>italics</em> come from the second data set. They have either been newly inserted (the last two rows), or they have replaced previous rows for the same time interval and network.</p>
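<p>If you want to verify the result yourself, one way is to run the query through the SQL API. A minimal sketch, assuming the quickstart router listens on <code class="language-plaintext highlighter-rouge">localhost:8888</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -X POST http://localhost:8888/druid/v2/sql \
  -H "Content-Type: application/json" \
  -d "{\"query\": \"SELECT * FROM ad_data ORDER BY __time, ad_network\"}"
</code></pre></div></div>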
<h3 id="taking-a-closer-look">Taking a closer look</h3>
<p>Let’s go through some interesting points in the ingestion spec.</p>
<h4 id="the-input-sources">The input sources</h4>
<p>As mentioned above, the <code class="language-plaintext highlighter-rouge">combining</code> input source works like a <code class="language-plaintext highlighter-rouge">union all</code>. The members of the union are specified in the <code class="language-plaintext highlighter-rouge">delegates</code> array, and they are input source definitions themselves.</p>
<p>This tutorial uses only two input sources, but generally you could have more than two. A delegate input source can be any input source, but with one important restriction: all input sources that need an <code class="language-plaintext highlighter-rouge">inputFormat</code> have to share the same <code class="language-plaintext highlighter-rouge">inputFormat</code>.</p>
<p>This means that as soon as file-shaped input sources are involved, they all have to use the same format. But you can freely combine file-shaped input with Druid reindexing, and probably also with SQL input (although I haven’t tested that).</p>
<p>Here is the combine clause for our tutorial:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "inputSource": {
"type": "combining",
"delegates": [
{
"type": "druid",
"dataSource": "ad_data",
...
},
{
"type": "local",
"files": ["/<my base path>/data2.json"]
}
]
}
</code></pre></div></div>
<p>The first part pulls data from the existing Druid datasource. It applies a filter (left out above for brevity), which I cover below. The second part gets the new data from a file.</p>
<p>The file input source does not have the ability to specify a filter, but we don’t need one here because the file contains exactly the data we want to ingest.</p>
<p>The schemas of the two sources almost match, but not quite. We will come back to this when we look at the timestamp definition.</p>
<h4 id="druid-reindexing-interval-boundaries">Druid reindexing: Interval boundaries</h4>
<p>Any Druid reindexing job needs to define the interval that will be considered as the domain of reindexing. If you want to consider all data that exists in the datasource, specify an interval that is large enough to cover all possible timestamps:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "interval": "1000/3000",
</code></pre></div></div>
<p>This shorthand is actually a pair of <a href="https://en.wikipedia.org/wiki/ISO_8601">ISO 8601</a> timestamps with year granularity: a year number by itself is a perfectly legal ISO 8601 timestamp.</p>
<p>Why do we not specify the timestamp filter here? We cannot use the <code class="language-plaintext highlighter-rouge">"interval"</code> setting because we want to <em>cut out</em> an interval. I’ll come to this in the next section.</p>
<p>(What we <em>can</em> do with <code class="language-plaintext highlighter-rouge">"interval"</code>, though, is limit the amount of data that Druid needs to reindex. If you know that all the data you are going to touch is within a specific time range, this can speed things up. But make sure that your interval boundaries are aligned with the segment boundaries in Druid, otherwise you will lose data.)</p>
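<p>For instance, with the weekly segments used in this tutorial, segment boundaries fall on Mondays (Druid weeks follow the ISO calendar), so a narrower interval that still aligns with the segments covering our sample data could look like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  "interval": "2022-12-26/2023-01-16",
</code></pre></div></div>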
<h4 id="ingestion-filter-on-the-druid-reindexing-part">Ingestion filter on the Druid reindexing part</h4>
<p>This is where the cutting out of data happens. The Druid input source allows you to specify a set of filters that work the same way as filters inside the <code class="language-plaintext highlighter-rouge">transformSpec</code>, but, and this is important, are applied to that input source only.</p>
<p><a href="https://druid.apache.org/docs/latest/querying/filters.html">Filters</a> offer various ways to specify filter conditions, and to string them together using boolean operators in prefix notation. The condition tells us which rows to <em>keep</em>. Here is what the filter for our case looks like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "filter": {
"type": "not",
"field": {
"type": "and",
"fields": [
{
"type": "selector",
"dimension": "ad_network",
"value": "gaagle"
},
{
"type": "interval",
"dimension": "__time",
"intervals": [
"2023-01-03T00:00:00Z/2023-01-10T00:00:00Z"
],
"extractionFn": null
}
]
}
}
</code></pre></div></div>
<p>This filter keeps all rows that satisfy the condition <code class="language-plaintext highlighter-rouge">not(and(ad_network=gaagle, timestamp in [interval]))</code>. Or, in simpler words, it drops all rows that are from <code class="language-plaintext highlighter-rouge">gaagle</code> and fall within the time interval from 3 January (inclusive) to 10 January (exclusive).</p>
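<p>One way to sanity-check the filter logic before running the job is to express the same condition in SQL and inspect the rows that would be kept. A sketch, again assuming the quickstart setup:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -X POST http://localhost:8888/druid/v2/sql \
  -H "Content-Type: application/json" \
  -d "{\"query\": \"SELECT COUNT(*) FROM ad_data WHERE NOT (ad_network = 'gaagle' AND __time >= TIMESTAMP '2023-01-03' AND __time < TIMESTAMP '2023-01-10')\"}"
</code></pre></div></div>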
<h4 id="schema-alignment-timestamp-definition">Schema alignment: Timestamp definition</h4>
<p>Most of the fields in the Druid datasource and in the input file match by name and type, because we defined them that way. There is one notable exception, though:</p>
<p>The primary timestamp comes from a column <code class="language-plaintext highlighter-rouge">date</code> and is in ISO-8601 format, but in Druid the timestamp is a <code class="language-plaintext highlighter-rouge">long</code> value, expressed in milliseconds since Epoch, and is always named <code class="language-plaintext highlighter-rouge">__time</code>.</p>
<p><strong>If you do not reconcile these different timestamps, you will get confusing errors.</strong> Maybe Druid will not ingest fresh data at all; in another scenario, I saw an error complaining about a missing interval definition in the partition configuration. At any rate, watch out for your timestamps.</p>
<p>Luckily, it is easy to <a href="https://blog.hellmar-becker.de/2022/02/09/druid-data-cookbook-ingestion-transforms/#composite-timestamps">populate the timestamp using a Druid expression</a>. Here’s how it works:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "timestampSpec": {
"column": "__time",
"missingValue": "2010-01-01T00:00:00Z"
},
"transformSpec": {
"transforms": [
{
"name": "__time",
"type": "expression",
"expression": "nvl(timestamp_parse(date), \"__time\")"
}
]
}
</code></pre></div></div>
<ul>
<li>The default is to pick up the timestamp from the <code class="language-plaintext highlighter-rouge">__time</code> column, which works for the reindexing case. This is coded in <code class="language-plaintext highlighter-rouge">timestampSpec</code>.</li>
<li>A transform overrides the value, replacing it with whatever is found in the <code class="language-plaintext highlighter-rouge">date</code> column (the file case). If that value doesn’t exist, we fall back to <code class="language-plaintext highlighter-rouge">__time</code>.</li>
</ul>
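<p>The SQL functions <code class="language-plaintext highlighter-rouge">TIME_PARSE</code> and <code class="language-plaintext highlighter-rouge">NVL</code> mirror the native expressions used above, so you can get a feeling for the fallback behavior with a quick ad-hoc query: <code class="language-plaintext highlighter-rouge">TIME_PARSE</code> returns <code class="language-plaintext highlighter-rouge">NULL</code> for anything it cannot parse, and <code class="language-plaintext highlighter-rouge">NVL</code> then falls through to its second argument. A sketch:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -X POST http://localhost:8888/druid/v2/sql \
  -H "Content-Type: application/json" \
  -d "{\"query\": \"SELECT NVL(TIME_PARSE('not a date'), TIMESTAMP '2010-01-01') AS fallback_time\"}"
</code></pre></div></div>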
<h4 id="tuning-configuration">Tuning configuration</h4>
<p>The documentation mentions that</p>
<blockquote>
<p>The secondary partitioning method determines the requisite number of concurrent worker tasks that run in parallel to complete ingestion with the Combining input source. Set this value in <code class="language-plaintext highlighter-rouge">maxNumConcurrentSubTasks</code> in <code class="language-plaintext highlighter-rouge">tuningConfig</code> based on the secondary partitioning method:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">range</code> or <code class="language-plaintext highlighter-rouge">single_dim</code> partitioning: greater than or equal to 1</li>
<li><code class="language-plaintext highlighter-rouge">hashed</code> or <code class="language-plaintext highlighter-rouge">dynamic</code> partitioning: greater than or equal to 2</li>
</ul>
</blockquote>
<p><strong>This advice is to be taken seriously.</strong> If you try to run with an insufficient number of subtasks, you will get a highly misleading error message that looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>java.lang.UnsupportedOperationException: Implement this method properly if needsFormat() = true
</code></pre></div></div>
<p>Make sure you configure at least two concurrent subtasks if you are using <code class="language-plaintext highlighter-rouge">hashed</code> or <code class="language-plaintext highlighter-rouge">dynamic</code> partitioning.</p>
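<p>For reference, this is the relevant fragment from the spec above, with hashed partitioning and two concurrent subtasks:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  "tuningConfig": {
    "type": "index_parallel",
    "partitionsSpec": {
      "type": "hashed"
    },
    "forceGuaranteedRollup": true,
    "maxNumConcurrentSubTasks": 2
  }
</code></pre></div></div>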
<h2 id="conclusion">Conclusion</h2>
<p>This tutorial showed how to fold new and updated data into an existing datasource, to the effect of a <em>selective bulk upsert</em>. Let’s recap a few learnings:</p>
<ul>
<li>Selective bulk upserts are done using the <code class="language-plaintext highlighter-rouge">combining inputSource</code> idiom in Druid.</li>
<li>For reindexing Druid data, choose the <code class="language-plaintext highlighter-rouge">interval</code> to align with segment boundaries, or to be large enough to cover all data. You can apply fine grained date/time filters in the <code class="language-plaintext highlighter-rouge">filter</code> clause.</li>
<li>Ingestion filters are very expressive and allow a detailed specification of which data to retain or replace.</li>
<li>Make sure timestamp definitions are aligned between your Druid datasource and external data.</li>
<li>Configure a sufficient number of subtasks, according to the documentation.</li>
</ul>