Multi-Value Dimensions in Apache Druid (Part 2)

This is part 2 of a miniseries about multi-value dimensions in Apache Druid.

Let’s continue our journey into the world of multi-value dimensions in Apache Druid, starting with this:

Source snippet

Multi-Value Handling!

Remember the casual remark from last time about the multiValueHandling setting in the ingestion spec?

Payload

By default, a multi-value dimension is represented as a sorted array, but there are alternatives. Let’s look at them in a bit more detail. Using the data sample from the movie-tags tutorial, let’s create some copies of our tags dimension. This is conveniently done in the Transform stage of the ingestion wizard, just by defining new dimensions and referencing the original field as the expression:

Transform

Define two additional dimensions, tags_unsorted and tags_set, then proceed through the Filter step to Configure schema.
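
For reference, here is roughly what those two transforms look like in the JSON spec. Treat this as a sketch: it assumes the copies are plain expression transforms that simply reference the original tags field, and the exact JSON the wizard generates may differ between Druid versions.

      "transformSpec": {
        "transforms": [
          {
            "type": "expression",
            "name": "tags_unsorted",
            "expression": "tags"
          },
          {
            "type": "expression",
            "name": "tags_set",
            "expression": "tags"
          }
        ]
      },

Both new fields carry the same values as tags at this point. The difference only comes in later, through the multiValueHandling setting of each dimension.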

It is possible that you do not see the new fields now. Flip the automatic schema detection switch off and on again

Auto-Schema

and you should see them all.

Continue like last time until the Edit spec stage. Find the "dimensionsSpec" part and replace it with this snippet:

      "dimensionsSpec": {
        "dimensions": [
          {
            "type": "string",
            "name": "movie",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "tags",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "tags_unsorted",
            "multiValueHandling": "ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "tags_set",
            "multiValueHandling": "SORTED_SET",
            "createBitmapIndex": true
          }
        ],
        "dimensionExclusions": [
          "__time",
          "timestamp"
        ]
      },

Submit the ingestion job and see what you get:

Query Result

The tags_set dimension, ingested with SORTED_SET, keeps each tag value only once per row. This is great when you need to deduplicate tag values.
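
To make the difference between the three modes concrete, here is a made-up row and the values each dimension would hold after ingestion. The movie title and tags are purely illustrative and not taken from the tutorial data, and the "input"/"stored" wrapper is only there for presentation.

      {
        "input": {
          "movie": "Example Movie",
          "tags": ["thriller", "action", "thriller"]
        },
        "stored": {
          "tags": ["action", "thriller", "thriller"],
          "tags_unsorted": ["thriller", "action", "thriller"],
          "tags_set": ["action", "thriller"]
        }
      }

SORTED_ARRAY sorts the values but keeps the duplicate, ARRAY leaves the values in their original order, and SORTED_SET sorts and removes the duplicate.

If you want to compare the columns side by side on your own data, a native scan query along these lines should do. The datasource name movie_tags is an assumption, so substitute whatever name you chose during ingestion.

      {
        "queryType": "scan",
        "dataSource": "movie_tags",
        "intervals": ["1000-01-01/3000-01-01"],
        "columns": ["movie", "tags", "tags_unsorted", "tags_set"],
        "resultFormat": "compactedList",
        "limit": 10
      }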

Learnings

By default, multi-value dimensions are represented as sorted arrays.
Alternatives are unsorted arrays and sets.
Sets are especially neat because they have deduplication built in.
As of now, you need to specify the handling manually in the JSON spec.
Documentation of this feature is somewhat terse, although it is expected to improve soon.