<p><em>Hellmar Becker’s Blog: Random thoughts and technical experiments</em></p>
<h1><a href="/2024/03/01/new-in-druid-29-exporting-query-results">New in Druid 29: Exporting Query Results</a></h1>
<p><em>2024-03-01</em></p>
<p><img src="/assets/2021-12-21-elf.jpg" alt="Druid Cookbook" /></p>
<h2 id="the-problem">The problem</h2>
<p>Customers often come to me with the requirement to extract large and/or detailed data sets from Druid, and to store the results in a well-known format for further processing by other tools. With <a href="https://druid.apache.org/docs/latest/multi-stage-query/concepts#multi-stage-query-task-engine">multi-stage query</a>, you can issue an asynchronous query against deep storage that handles (almost) unlimited amounts of data.</p>
<p>However, obtaining a result is a multi step process:</p>
<ul>
<li>First, <a href="https://druid.apache.org/docs/latest/api-reference/sql-api#submit-a-query-1">submit the query</a>;</li>
<li>then <a href="https://druid.apache.org/docs/latest/api-reference/sql-api#get-query-status">poll the task endpoint</a> until it is done</li>
<li>and finally, <a href="https://druid.apache.org/docs/latest/api-reference/sql-api#get-query-results">retrieve the result</a>.</li>
</ul>
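<p>In practice, the flow looks roughly like this against a quickstart router; a sketch using <code class="language-plaintext highlighter-rouge">curl</code> and <code class="language-plaintext highlighter-rouge">jq</code>, with endpoint paths as in the linked API reference (response field names may vary by version, and the query itself is just an example):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># 1. submit the query to the asynchronous SQL statements API
QUERY_ID=$(curl -s -X POST http://localhost:8888/druid/v2/sql/statements \
  -H 'Content-Type: application/json' \
  -d '{"query": "SELECT channel, COUNT(*) AS cnt FROM wikipedia GROUP BY channel"}' \
  | jq -r '.queryId')

# 2. poll the status endpoint until the state is SUCCESS
curl -s "http://localhost:8888/druid/v2/sql/statements/$QUERY_ID" | jq -r '.state'

# 3. retrieve the result
curl -s "http://localhost:8888/druid/v2/sql/statements/$QUERY_ID/results"
</code></pre></div></div>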
<p>Behind the scenes, the data that you download in step 3 has already been written to a storage location inside Druid. You can define a path and even instruct Druid to use <a href="https://druid.apache.org/docs/latest/operations/durable-storage#enable-durable-storage">durable storage</a> for query results, but this data is still in a Druid-specific format and cannot easily be read by other tools.</p>
<p>What if we could skip that step (persisting the result) completely and send the result directly to a file in a format of our choice?</p>
<p>Druid 29 can do this. For now, the feature is somewhat limited: it only supports CSV, and it can only export to the local filesystem or S3. But other formats, such as Parquet, are coming.</p>
<p>Let’s try this out with a <a href="https://druid.apache.org/docs/latest/tutorials/">Druid Quickstart</a> installation!</p>
<p>In this tutorial, you will</p>
<ul>
<li>learn how to configure the settings for MSQ export</li>
<li>export a sample dataset.</li>
</ul>
<h2 id="preparation">Preparation</h2>
<p>We are going to export to local storage. To limit the attack surface for malicious or inexperienced users, you have to define a specific filesystem path where Druid is allowed to store export files.</p>
<p>On your local machine, install Druid 29 from the <a href="https://druid.apache.org/downloads/">tarball</a>.</p>
<p>Create a directory <code class="language-plaintext highlighter-rouge">/tmp/druid-export</code> on your local disk.</p>
<p>In your Druid installation, edit the file <code class="language-plaintext highlighter-rouge">conf/druid/auto/_common/common.runtime.properties</code> and add the line</p>
<div class="language-properties highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">druid.export.storage.baseDir</span><span class="p">=</span><span class="s">/tmp/druid-export</span>
</code></pre></div></div>
<p>at the end of the file.</p>
<p>Then start Druid like so, from within your Druid install directory:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bin/start-druid <span class="nt">-m5g</span>
</code></pre></div></div>
<p>Ingest the <em>wikipedia</em> sample data following the instructions using either <a href="https://druid.apache.org/docs/latest/tutorials/#load-data">classic batch</a> or <a href="https://druid.apache.org/docs/latest/tutorials/tutorial-msq-extern">SQL ingestion</a>.</p>
<p>Then go to the <code class="language-plaintext highlighter-rouge">Query</code> tab in the Druid console.</p>
<h2 id="exporting-data">Exporting data</h2>
<p>Run this query:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span>
<span class="n">EXTERN</span><span class="p">(</span><span class="k">local</span><span class="p">(</span><span class="n">exportPath</span> <span class="o">=></span> <span class="s1">'/tmp/druid-export/wikipedia-export'</span><span class="p">))</span>
<span class="k">AS</span> <span class="n">CSV</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">wikipedia</span>
</code></pre></div></div>
<p><img src="/assets/2024-03-01-01.jpg" alt="Screenshot of running query" /></p>
<p>When the query finishes, check the export directory and you will find a CSV file containing the data:</p>
<p><img src="/assets/2024-03-01-02.jpg" alt="Preview of result file in a shell window" /></p>
<p>Note: the target directory has to be empty; otherwise you will get an error message.</p>
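<p>If you want to rerun the export, clear out the previous result first, for example:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># remove the previous export so the target directory is empty again
rm -rf /tmp/druid-export/wikipedia-export
</code></pre></div></div>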
<p>This also works for export to <a href="https://druid.apache.org/docs/latest/multi-stage-query/reference/#s3">S3</a>.</p>
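<p>The query shape is the same for S3; here is a sketch with a placeholder bucket and prefix (on top of this, S3 export needs the S3 extension and the server-side export settings described in the linked reference):</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INSERT INTO
EXTERN(s3(bucket => 'my-bucket', prefix => 'druid-exports/wikipedia'))
AS CSV
SELECT * FROM wikipedia
</code></pre></div></div>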
<h2 id="learnings">Learnings</h2>
<ul>
<li>With MSQ, you can now export query results directly to external storage.</li>
<li>This is a new feature in Druid 29. It is currently limited to CSV format and either local storage or S3, but expect more options to be added soon.</li>
</ul>
<hr />
<p>“<a href="https://www.flickr.com/photos/mhlimages/48051262646/">This image is taken from Page 500 of Praktisches Kochbuch für die gewöhnliche und feinere Küche</a>” by <a href="https://www.flickr.com/photos/mhlimages/">Medical Heritage Library, Inc.</a> is licensed under <a target="_blank" rel="noopener noreferrer" href="https://creativecommons.org/licenses/by-nc-sa/2.0/">CC BY-NC-SA 2.0 <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/nc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /></a>.</p>Druid 29 Preview: Transposing Data with PIVOT and UNPIVOT2024-01-15T00:00:00+01:002024-01-15T00:00:00+01:00/2024/01/15/druid-29-preview-transposing-data-with-PIVOT-and-UNPIVOT<p><img src="/assets/2021-12-21-elf.jpg" alt="Druid Cookbook" /></p>
<p>Imagine that you are tasked with getting a spreadsheet of sales data into Druid that looks like this:</p>
<p><img src="/assets/2024-01-15-01-rawdata-table.png" alt="Raw data in table format" /></p>
<p>You’ve got the sales figures in the cells, with the regions down the rows and the years across the columns. While you <em>can</em> work with the data in this form in Druid, it may not be your best option. Druid 29 brings two new SQL functions that can help with transforming the data into a format that is better suited for analytics. Let’s see how that works!</p>
<h2 id="getting-set-up">Getting set up</h2>
<p>This is a sneak peek into Druid 29 functionality. In order to use the new functions, you can (as of the time of writing) <a href="https://druid.apache.org/docs/latest/development/build.html">build Druid</a> from the HEAD of the master branch:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/apache/druid.git
<span class="nb">cd </span>druid
mvn clean <span class="nb">install</span> <span class="nt">-Pdist</span> <span class="nt">-DskipTests</span>
</code></pre></div></div>
<p>Then follow the instructions to locate and install the tarball. Make sure you have <a href="https://druid.apache.org/docs/latest/multi-stage-query/#load-the-extension">the <code class="language-plaintext highlighter-rouge">druid-multi-stage-query</code> extension enabled</a>.</p>
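<p>In my builds, the distribution tarball ends up under <code class="language-plaintext highlighter-rouge">distribution/target</code>; a sketch of the unpack step (the exact version string will differ, and the target directory is just an example):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># the -Pdist profile produces a binary tarball under distribution/target
mkdir -p /tmp/druid-test
tar -xzf distribution/target/apache-druid-*-bin.tar.gz -C /tmp/druid-test
</code></pre></div></div>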
<p>In this tutorial, you will</p>
<ul>
<li>learn how to use the <code class="language-plaintext highlighter-rouge">PIVOT</code> and <code class="language-plaintext highlighter-rouge">UNPIVOT</code> functions to transpose rows into columns and vice versa</li>
<li>and use this knowledge to transform a dataset during ingestion in Druid.</li>
</ul>
<p><em><strong>Disclaimer:</strong> This tutorial uses undocumented functionality and unreleased code. This blog is neither endorsed by Imply nor by the Apache Druid PMC. It merely collects the results of personal experiments. The features described here might, in the final release, work differently, or not at all. In addition, the entire build, or execution, may fail. Your mileage may vary.</em></p>
<h2 id="ingesting-the-data">Ingesting the data</h2>
<p>The dataset is very simple and looks like this:</p>
<pre><code class="language-csv">region,2022,2023
Central,215000,240000
East,350000,360000
West,415000,450000
</code></pre>
<p>The easiest way to get these data into Druid is with the ingestion wizard in the Druid console, using the <code class="language-plaintext highlighter-rouge">Paste data</code> input source:</p>
<p><img src="/assets/2024-01-15-02-ingest.jpg" alt="Druid wizard with Paste data sample" /></p>
<p>Run the ingestion wizard; make sure to give a meaningful name to the target datasource. Or you can paste the SQL below directly into a query window:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"sales_data"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"inline","data":"region,2022,2023</span><span class="se">\n</span><span class="s1">Central,215000,240000</span><span class="se">\n</span><span class="s1">East,350000,360000</span><span class="se">\n</span><span class="s1">West,415000,450000"}'</span><span class="p">,</span>
<span class="s1">'{"type":"csv","findColumnsFromHeader":true}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"region"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"2022"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"2023"</span> <span class="nb">BIGINT</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="nb">TIMESTAMP</span> <span class="s1">'2000-01-01 00:00:00'</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"region"</span><span class="p">,</span>
<span class="nv">"2022"</span><span class="p">,</span>
<span class="nv">"2023"</span>
<span class="k">FROM</span> <span class="nv">"ext"</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">ALL</span>
</code></pre></div></div>
<h2 id="pivot---transpose-rows-to-columns"><code class="language-plaintext highlighter-rouge">PIVOT</code> - transpose rows to columns</h2>
<p>Let’s represent the data in a different form. We want one column per region and per year. Here is the query for this transformation:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="nv">"sales_data"</span>
<span class="n">PIVOT</span> <span class="p">(</span>
<span class="k">SUM</span><span class="p">(</span><span class="nv">"2022"</span><span class="p">)</span> <span class="k">AS</span> <span class="n">sales_2022</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="nv">"2023"</span><span class="p">)</span> <span class="k">AS</span> <span class="n">sales_2023</span>
<span class="k">FOR</span> <span class="nv">"region"</span> <span class="k">IN</span> <span class="p">(</span><span class="s1">'East'</span> <span class="k">AS</span> <span class="n">east</span><span class="p">,</span> <span class="s1">'Central'</span> <span class="k">AS</span> <span class="n">central</span><span class="p">))</span>
</code></pre></div></div>
<p><img src="/assets/2024-01-15-03-pivot.jpg" alt="PIVOT query" /></p>
<p>A few things worth noting:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">PIVOT</code> takes a list of <em>aggregations over existing value columns</em> to calculate the values in the final columns.</li>
<li>The aggregations are needed because <code class="language-plaintext highlighter-rouge">PIVOT</code> implicitly <em>groups by the values</em> in the value columns.</li>
<li>The <code class="language-plaintext highlighter-rouge">FOR</code> clause lists the <em>pivot column</em>.</li>
<li>To keep the column list finite, you have to give it a list of values to filter by (like an implicit <code class="language-plaintext highlighter-rouge">HAVING</code> clause).</li>
<li>You can define aliases for the values; these will serve as column prefixes.</li>
<li>You can use the generated column names in query clauses. The following, for instance, is a legitimate query:</li>
</ul>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">east_sales_2022</span>
<span class="k">FROM</span> <span class="nv">"sales_data"</span>
<span class="n">PIVOT</span> <span class="p">(</span>
<span class="k">SUM</span><span class="p">(</span><span class="nv">"2022"</span><span class="p">)</span> <span class="k">AS</span> <span class="n">sales_2022</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="nv">"2023"</span><span class="p">)</span> <span class="k">AS</span> <span class="n">sales_2023</span>
<span class="k">FOR</span> <span class="nv">"region"</span> <span class="k">IN</span> <span class="p">(</span><span class="s1">'East'</span> <span class="k">AS</span> <span class="n">east</span><span class="p">,</span> <span class="s1">'Central'</span> <span class="k">AS</span> <span class="n">central</span><span class="p">))</span>
</code></pre></div></div>
<h2 id="unpivot---transpose-columns-to-rows"><code class="language-plaintext highlighter-rouge">UNPIVOT</code> - transpose columns to rows</h2>
<p>To collect a list of columns into one, transposing the columns to rows, you can use <code class="language-plaintext highlighter-rouge">UNPIVOT</code>. Here is a query that creates a format that you would probably prefer for further analytical processing:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="nv">"sales_data"</span>
<span class="n">UNPIVOT</span> <span class="p">(</span> <span class="nv">"sales"</span> <span class="k">FOR</span> <span class="nv">"year"</span> <span class="k">IN</span> <span class="p">(</span><span class="nv">"2022"</span> <span class="k">AS</span> <span class="s1">'previous'</span><span class="p">,</span> <span class="nv">"2023"</span> <span class="k">AS</span> <span class="s1">'current'</span><span class="p">)</span> <span class="p">)</span>
</code></pre></div></div>
<p><img src="/assets/2024-01-15-04-unpivot.jpg" alt="UNPIVOT query" /></p>
<ul>
<li>An <code class="language-plaintext highlighter-rouge">UNPIVOT</code> query needs no aggregation since it only reorders the values.</li>
<li>You need to define two aliases:
<ul>
<li>the first one, <code class="language-plaintext highlighter-rouge">"sales"</code> in the example, is the column where the <em>values</em> end up;</li>
<li>the second one, <code class="language-plaintext highlighter-rouge">"year"</code>, is where the column names are collected, expressed as strings.</li>
</ul>
</li>
<li>Again, you can also define alias values for the column names.</li>
</ul>
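<p>With the sample data, the query above should return something like this (plus the constant dummy <code class="language-plaintext highlighter-rouge">__time</code> column; row order may differ):</p>
<pre><code class="language-csv">region,year,sales
Central,previous,215000
Central,current,240000
East,previous,350000
East,current,360000
West,previous,415000
West,current,450000
</code></pre>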
<h2 id="unpivot-during-ingestion"><code class="language-plaintext highlighter-rouge">UNPIVOT</code> during ingestion</h2>
<p>Back to the beginning of the story. As you may have noticed, the original table does not have a proper timestamp because the time information is in the column headers. Instead, we just let Druid fill in a constant dummy timestamp. This is not optimal, particularly since the input data is very obviously time-based!</p>
<p>Can we use our new knowledge to generate a proper timestamp?</p>
<p>Let’s see how to do this using SQL based ingestion. We’ll generate the timestamp column by <code class="language-plaintext highlighter-rouge">UNPIVOT</code>ing the year column headers into a single new column, and parsing that column as a timestamp:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"sales_data_unpivot"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"inline","data":"region,2022,2023</span><span class="se">\n</span><span class="s1">Central,215000,240000</span><span class="se">\n</span><span class="s1">East,350000,360000</span><span class="se">\n</span><span class="s1">West,415000,450000"}'</span><span class="p">,</span>
<span class="s1">'{"type":"csv","findColumnsFromHeader":true}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"region"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"2022"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"2023"</span> <span class="nb">BIGINT</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">TIME_PARSE</span><span class="p">(</span><span class="nv">"year"</span><span class="p">,</span> <span class="s1">'YYYY'</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"region"</span><span class="p">,</span>
<span class="nv">"sales"</span>
<span class="k">FROM</span> <span class="nv">"ext"</span>
<span class="n">UNPIVOT</span> <span class="p">(</span> <span class="nv">"sales"</span> <span class="k">FOR</span> <span class="nv">"year"</span> <span class="k">IN</span> <span class="p">(</span><span class="nv">"2022"</span><span class="p">,</span> <span class="nv">"2023"</span> <span class="p">)</span> <span class="p">)</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="nb">YEAR</span>
</code></pre></div></div>
<p><img src="/assets/2024-01-15-05-unpivot-ingest.jpg" alt="UNPIVOT ingestion" /></p>
<p>Let’s check the result:</p>
<p><img src="/assets/2024-01-15-06-select.jpg" alt="Query table with timestamp" /></p>
<p>We have a proper timestamp. (You can also check the <code class="language-plaintext highlighter-rouge">Segments</code> view to verify that the data is actually partitioned by year.)</p>
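<p>Alternatively, a query against the <code class="language-plaintext highlighter-rouge">sys.segments</code> system table shows the segment intervals directly (column names as in the documented metadata tables):</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT "start", "end", "num_rows"
FROM sys.segments
WHERE "datasource" = 'sales_data_unpivot'
</code></pre></div></div>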
<h2 id="conclusion">Conclusion</h2>
<ul>
<li><code class="language-plaintext highlighter-rouge">PIVOT</code> transposes rows to columns, aggregating values on the way.</li>
<li><code class="language-plaintext highlighter-rouge">UNPIVOT</code> transposes columns to rows.</li>
<li>The behavior of both functions can be fine-tuned by choosing suitable column aliases.</li>
<li>One case where this is especially handy is spreadsheet data that has the time axis across the columns.</li>
</ul>
<hr />
<p>“<a href="https://www.flickr.com/photos/mhlimages/48051262646/">This image is taken from Page 500 of Praktisches Kochbuch für die gewöhnliche und feinere Küche</a>” by <a href="https://www.flickr.com/photos/mhlimages/">Medical Heritage Library, Inc.</a> is licensed under <a target="_blank" rel="noopener noreferrer" href="https://creativecommons.org/licenses/by-nc-sa/2.0/">CC BY-NC-SA 2.0 <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/nc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /></a>.</p>Druid 29 Preview: Handling Nested Arrays2023-12-17T00:00:00+01:002023-12-17T00:00:00+01:00/2023/12/17/druid-29-preview-handling-nested-arrays<p><img src="/assets/2022-11-23-00-pizza.jpg" alt="Pizza" /></p>
<p>Imagine you have a data sample like this:</p>
<pre><code class="language-json-nd">{'id': 93, 'shop': 'Circular Pi Pizzeria', 'name': 'David Murillo', 'phoneNumber': '305-351-2631', 'address': '746 Chelsea Plains Suite 656\nNew Richard, MA 16940', 'pizzas': [{'pizzaName': 'Salami', 'additionalToppings': ['🥓 bacon']}], 'timestamp': 1702815411410}
{'id': 94, 'shop': 'Marios Pizza', 'name': 'Darius Roach', 'phoneNumber': '344.571.9608x0590', 'address': '58235 Robert Cliffs\nAguilarland, PR 76249', 'pizzas': [{'pizzaName': 'Diavola', 'additionalToppings': []}, {'pizzaName': 'Salami', 'additionalToppings': ['🧄 garlic']}, {'pizzaName': 'Peperoni', 'additionalToppings': ['🫒 olives', '🧅 onion', '🍅 tomato', '🍓 strawberry']}, {'pizzaName': 'Diavola', 'additionalToppings': ['🫒 olives', '🍌 banana', '🍍 pineapple']}, {'pizzaName': 'Margherita', 'additionalToppings': ['🍓 strawberry', '🍍 pineapple', '🥚 egg', '🐟 tuna', '🐟 tuna']}, {'pizzaName': 'Margherita', 'additionalToppings': ['🥚 egg']}, {'pizzaName': 'Margherita', 'additionalToppings': ['🫑 green peppers', '🥚 egg', '🥚 egg']}, {'pizzaName': 'Peperoni', 'additionalToppings': []}, {'pizzaName': 'Salami', 'additionalToppings': []}], 'timestamp': 1702815415518}
{'id': 95, 'shop': 'Mammamia Pizza', 'name': 'Ryan Juarez', 'phoneNumber': '(041)278-5690', 'address': '934 Melissa Lights\nPaulland, UT 40700', 'pizzas': [{'pizzaName': 'Marinara', 'additionalToppings': ['🫑 green peppers', '🧅 onion']}, {'pizzaName': 'Marinara', 'additionalToppings': ['🍅 tomato', '🥓 bacon', '🍌 banana', '🌶️ hot pepper']}, {'pizzaName': 'Peperoni', 'additionalToppings': ['🍓 strawberry', '🍌 banana', '🐟 tuna', '🧀 blue cheese']}, {'pizzaName': 'Marinara', 'additionalToppings': ['🐟 tuna', '🧅 onion', '🍍 pineapple', '🍓 strawberry']}, {'pizzaName': 'Mari & Monti', 'additionalToppings': ['🫒 olives', '🐟 tuna']}, {'pizzaName': 'Marinara', 'additionalToppings': ['🍍 pineapple', '🍅 tomato', '🍌 banana', '🧀 blue cheese', '🫒 olives']}, {'pizzaName': 'Marinara', 'additionalToppings': ['🍌 banana', '🫑 green peppers', '🧄 garlic', '🍅 tomato']}], 'timestamp': 1702815418643}
</code></pre>
<p>I created the data sample using <a href="https://github.com/Aiven-Labs/python-fake-data-producer-for-apache-kafka">Francesco’s pizza simulator</a>. The structure of these simulated pizza orders is quite deeply nested:</p>
<ul>
<li>Each order has a field <code class="language-plaintext highlighter-rouge">pizzas</code>, which is an array of JSON objects.</li>
<li>Each individual pizza item has
<ul>
<li>a <code class="language-plaintext highlighter-rouge">pizzaName</code> field, which is a string</li>
<li><code class="language-plaintext highlighter-rouge">additionalToppings</code>, an array of strings that may be empty.</li>
</ul>
</li>
</ul>
<p>Arrays of objects are a bit unwieldy, and I would like to create a data model that breaks down the orders so that each row in Druid represents a line item (a single pizza).
To that end, it would be nice to use some combination of JSON functions and <a href="/2023/04/08/druid-sneak-peek-timeseries-interpolation/"><code class="language-plaintext highlighter-rouge">UNNEST</code></a> during ingestion. But how exactly? Let’s find out!</p>
<h2 id="getting-set-up">Getting set up</h2>
<p>This is a sneak peek into Druid 29 functionality. In order to use the new functions, you can (as of the time of writing) <a href="https://druid.apache.org/docs/latest/development/build.html">build Druid</a> from the HEAD of the master branch:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/apache/druid.git
<span class="nb">cd </span>druid
mvn clean <span class="nb">install</span> <span class="nt">-Pdist</span> <span class="nt">-DskipTests</span>
</code></pre></div></div>
<p>Then follow the instructions to locate and install the tarball. Make sure you have <a href="https://druid.apache.org/docs/latest/multi-stage-query/#load-the-extension">the <code class="language-plaintext highlighter-rouge">druid-multi-stage-query</code> extension enabled</a>.</p>
<p>In this tutorial, you will</p>
<ul>
<li>examine how to model deeply nested JSON data with arrays in Druid and</li>
<li>break down a nested JSON array into individual rows using new functionality that is currently being built.</li>
</ul>
<p><em><strong>Disclaimer:</strong> This tutorial uses undocumented functionality and unreleased code. This blog is neither endorsed by Imply nor by the Apache Druid PMC. It merely collects the results of personal experiments. The features described here might, in the final release, work differently, or not at all. In addition, the entire build, or execution, may fail. Your mileage may vary.</em></p>
<h2 id="the-data">The data</h2>
<p>Right now, the technique we are looking at is limited to batch ingestion. So, we need to capture the simulator data in a file.</p>
<p>I assume you have a local Kafka service at <em>localhost:9092</em>.</p>
<p>Check out the <a href="https://github.com/Aiven-Labs/python-fake-data-producer-for-apache-kafka">pizza simulator</a> and run it like so:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python3 main.py <span class="nt">--security-protocol</span> PLAINTEXT <span class="nt">--host</span> localhost <span class="nt">--port</span> 9092 <span class="nt">--topic-name</span> pizza-orders <span class="nt">--nr-messages</span> 0 <span class="nt">--max-waiting-time</span> 5
</code></pre></div></div>
<p>Capture the output using <code class="language-plaintext highlighter-rouge">kcat</code> and redirect to a file:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kcat <span class="nt">-b</span> localhost:9092 <span class="nt">-t</span> pizza-orders <span class="o">>></span>./pizza-orders.json
</code></pre></div></div>
<p>You can stop the simulator after a while and use the <code class="language-plaintext highlighter-rouge">pizza-orders.json</code> file as input for the next steps.</p>
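<p>A quick sanity check on the captured file before moving on:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wc -l pizza-orders.json     # one order per line
head -n 1 pizza-orders.json # eyeball the nested pizzas array
</code></pre></div></div>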
<h2 id="basic-ingestion-the-pizza-orders-table">Basic ingestion: the <code class="language-plaintext highlighter-rouge">pizza-orders</code> table</h2>
<p>Let’s start by setting up a naïve data model using the <a href="https://druid.apache.org/docs/latest/operations/web-console#data-loader">web console wizard</a>. Note how in the SQL view, the type of the <code class="language-plaintext highlighter-rouge">pizzas</code> field is correctly recognized as <code class="language-plaintext highlighter-rouge">COMPLEX<json></code>, but the wizard does not know about the array structure:</p>
<p><img src="/assets/2023-12-17-01-ingest-orders.jpg" alt="Ingestion view for pizza-orders" /></p>
<p>Here is the ingestion query using MSQ:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"pizza-orders"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"local","baseDir":"/Users/hellmarbecker/meetup-talks/jsonarray","filter":"*json"}'</span><span class="p">,</span>
<span class="s1">'{"type":"json"}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"id"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"shop"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"name"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"phoneNumber"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"address"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"pizzas"</span> <span class="k">TYPE</span><span class="p">(</span><span class="s1">'COMPLEX<json>'</span><span class="p">),</span> <span class="nv">"timestamp"</span> <span class="nb">BIGINT</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">MILLIS_TO_TIMESTAMP</span><span class="p">(</span><span class="nv">"timestamp"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"id"</span><span class="p">,</span>
<span class="nv">"shop"</span><span class="p">,</span>
<span class="nv">"name"</span><span class="p">,</span>
<span class="nv">"phoneNumber"</span><span class="p">,</span>
<span class="nv">"address"</span><span class="p">,</span>
<span class="nv">"pizzas"</span>
<span class="k">FROM</span> <span class="nv">"ext"</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">DAY</span>
</code></pre></div></div>
<p>When we query this table, we see that we indeed have a general nested column here; it is not marked as an array:</p>
<p><img src="/assets/2023-12-17-02-select-orders.jpg" alt="Sample of a query over orders" /></p>
<p>We can look at the detailed values in the column:</p>
<p><img src="/assets/2023-12-17-03-orders-detail.jpg" alt="Detail view of a pizzas object" /></p>
<p>Again, what we would <em>like</em> is a table model where each row represents a <em>line item</em>, i.e. an individual pizza!</p>
<h2 id="first-attempt-at-breaking-down-the-line-items">First attempt at breaking down the line items</h2>
<p>Let’s try to craft a new ingestion query that breaks down the line items using <code class="language-plaintext highlighter-rouge">UNNEST</code>. We want to unnest the line items using something like <code class="language-plaintext highlighter-rouge">UNNEST(JSON_QUERY(pizzas, '$'))</code>, and then extract the individual fields into separate columns: <code class="language-plaintext highlighter-rouge">JSON_VALUE(p, '$.pizzaName') AS pizzaName</code> and so forth.</p>
<p>Here’s the first attempt at such a query:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"pizza-lineitems"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"local","baseDir":"/Users/hellmarbecker/meetup-talks/jsonarray","filter":"*json"}'</span><span class="p">,</span>
<span class="s1">'{"type":"json"}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"id"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"shop"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"name"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"phoneNumber"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"address"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"pizzas"</span> <span class="k">TYPE</span><span class="p">(</span><span class="s1">'COMPLEX<json>'</span><span class="p">),</span> <span class="nv">"timestamp"</span> <span class="nb">BIGINT</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">MILLIS_TO_TIMESTAMP</span><span class="p">(</span><span class="nv">"timestamp"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"id"</span><span class="p">,</span>
<span class="nv">"shop"</span><span class="p">,</span>
<span class="nv">"name"</span><span class="p">,</span>
<span class="nv">"phoneNumber"</span><span class="p">,</span>
<span class="nv">"address"</span><span class="p">,</span>
<span class="n">JSON_VALUE</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="s1">'$.pizzaName'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">pizzaName</span><span class="p">,</span>
<span class="n">JSON_QUERY</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="s1">'$.additionalToppings'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">additionalToppings</span>
<span class="k">FROM</span> <span class="nv">"ext"</span> <span class="k">CROSS</span> <span class="k">JOIN</span> <span class="k">UNNEST</span><span class="p">(</span><span class="n">JSON_QUERY</span><span class="p">(</span><span class="n">pizzas</span><span class="p">,</span> <span class="s1">'$'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">lineitems</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">DAY</span>
</code></pre></div></div>
<p>This, unfortunately, fails with a screaming error message:</p>
<p><img src="/assets/2023-12-17-04-error.jpg" width="50%" /></p>
<p>We cannot unnest arrays of objects the same way as arrays of primitives! But why is that? Look at the error message more closely: Druid thinks this is a call to <code class="language-plaintext highlighter-rouge">UNNEST(COMPLEX<JSON>)</code>. So <code class="language-plaintext highlighter-rouge">JSON_QUERY</code> doesn’t know about the array nature of its output. What now?</p>
<h2 id="a-new-function-json_query_array">A new function: <code class="language-plaintext highlighter-rouge">JSON_QUERY_ARRAY</code></h2>
<p>The Druid team has added a new function that does just the right thing for our case:</p>
<blockquote>
<p><code class="language-plaintext highlighter-rouge">JSON_QUERY_ARRAY(expr, path)</code></p>
<p>Extracts an <code class="language-plaintext highlighter-rouge">ARRAY<COMPLEX<json>></code> value from <code class="language-plaintext highlighter-rouge">expr</code> at the specified <code class="language-plaintext highlighter-rouge">path</code>. If value is not an <code class="language-plaintext highlighter-rouge">ARRAY</code>, it gets translated into a single element <code class="language-plaintext highlighter-rouge">ARRAY</code> containing the value at <code class="language-plaintext highlighter-rouge">path</code>. The primary use of this function is to extract arrays of objects to use as inputs to other array functions.</p>
</blockquote>
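<p>You can try the new function in isolation before touching the ingestion query; a minimal sketch using <code class="language-plaintext highlighter-rouge">PARSE_JSON</code>, with literal values chosen purely for illustration:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT
  JSON_QUERY_ARRAY(PARSE_JSON('{"a": [1, 2, 3]}'), '$.a') AS already_an_array,
  JSON_QUERY_ARRAY(PARSE_JSON('{"a": {"b": 1}}'), '$.a') AS wrapped_into_an_array
</code></pre></div></div>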
<p>Let’s rewrite the above query, substituting <code class="language-plaintext highlighter-rouge">JSON_QUERY_ARRAY</code> for <code class="language-plaintext highlighter-rouge">JSON_QUERY</code> in both cases:</p>
<p><img src="/assets/2023-12-17-05-ingest-lineitems.jpg" alt="Ingestion using JSON_QUERY_ARRAY" /></p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"pizza-lineitems"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"local","baseDir":"/Users/hellmarbecker/meetup-talks/jsonarray","filter":"*json"}'</span><span class="p">,</span>
<span class="s1">'{"type":"json"}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"id"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"shop"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"name"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"phoneNumber"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"address"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"pizzas"</span> <span class="k">TYPE</span><span class="p">(</span><span class="s1">'COMPLEX<json>'</span><span class="p">),</span> <span class="nv">"timestamp"</span> <span class="nb">BIGINT</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">MILLIS_TO_TIMESTAMP</span><span class="p">(</span><span class="nv">"timestamp"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"id"</span><span class="p">,</span>
<span class="nv">"shop"</span><span class="p">,</span>
<span class="nv">"name"</span><span class="p">,</span>
<span class="nv">"phoneNumber"</span><span class="p">,</span>
<span class="nv">"address"</span><span class="p">,</span>
<span class="n">JSON_VALUE</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="s1">'$.pizzaName'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">pizzaName</span><span class="p">,</span>
<span class="n">JSON_QUERY_ARRAY</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="s1">'$.additionalToppings'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">additionalToppings</span>
<span class="k">FROM</span> <span class="nv">"ext"</span> <span class="k">CROSS</span> <span class="k">JOIN</span> <span class="k">UNNEST</span><span class="p">(</span><span class="n">JSON_QUERY_ARRAY</span><span class="p">(</span><span class="n">pizzas</span><span class="p">,</span> <span class="s1">'$'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">lineitems</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">DAY</span>
</code></pre></div></div>
<p>That way, we can also be sure that the <code class="language-plaintext highlighter-rouge">additionalToppings</code> column will be represented as an array.</p>
<p>After the ingestion has finished, query the table and note how</p>
<ul>
<li>there is now one row per line item</li>
<li>the <code class="language-plaintext highlighter-rouge">additionalToppings</code> subcolumn is represented as an array, as you can see by the <code class="language-plaintext highlighter-rouge">[⋯]</code> instead of the tree symbol:</li>
</ul>
<p><img src="/assets/2023-12-17-06-select-lineitems.jpg" alt="Query on line items" /></p>
<p>You can actually run a query over the new table that shows how <code class="language-plaintext highlighter-rouge">JSON_QUERY</code> forgets about the “array-ness” of the array column, while <code class="language-plaintext highlighter-rouge">JSON_QUERY_ARRAY</code> enforces it:</p>
<p><img src="/assets/2023-12-17-07-compare.jpg" alt="Comparison query" /></p>
<p>It is, however, preferable to use <code class="language-plaintext highlighter-rouge">JSON_QUERY_ARRAY</code> at ingestion time and to represent the result in your data model. This is part of optimizing the data model to achieve those fast queries that Druid is known for!</p>
<h2 id="conclusion">Conclusion</h2>
<ul>
<li>We have seen how it is now possible to unnest even columns that contain arrays of objects. With this capability, Druid takes another big step in handling nested objects.</li>
<li>Using <code class="language-plaintext highlighter-rouge">JSON_QUERY_ARRAY</code> on an array retains the “array-ness” and passes it on to functions that require an array input.</li>
<li>Using <code class="language-plaintext highlighter-rouge">JSON_QUERY_ARRAY</code> on a single object wraps it into an array.</li>
<li>You should use <code class="language-plaintext highlighter-rouge">JSON_QUERY_ARRAY</code> at ingestion rather than query time.</li>
</ul>
<hr />
<p class="attribution">"<a target="_blank" rel="noopener noreferrer" href="https://www.flickr.com/photos/26242865@N04/5919366429">Pizza</a>" by <a target="_blank" rel="noopener noreferrer" href="https://www.flickr.com/photos/26242865@N04">Katrin Gilger</a> is licensed under <a target="_blank" rel="noopener noreferrer" href="https://creativecommons.org/licenses/by-sa/2.0/?ref=openverse">CC BY-SA 2.0 <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /></a>. </p>Druid Data Cookbook: Upserts in Druid SQL2023-11-25T00:00:00+01:002023-11-25T00:00:00+01:00/2023/11/25/druid-data-cookbook-upserts-in-druid-sql<p><img src="/assets/2021-12-21-elf.jpg" alt="Druid Cookbook" /></p>
<p>In <a href="/2023/03/07/selective-bulk-upserts-in-apache-druid/">an earlier blog</a>, I demonstrated a technique to combine existing and new data in Druid batch ingestion in a way that more or less emulates what is usually expressed in SQL as a <code class="language-plaintext highlighter-rouge">MERGE</code> or <code class="language-plaintext highlighter-rouge">UPSERT</code> statement. That technique involves a <code class="language-plaintext highlighter-rouge">combine</code> datasource and works only in JSON-based ingestion. Also, it works on bulk data, where you replace an entire range of data based on a time interval and key range.</p>
<p>Today I am going to look at a similar, albeit more surgical, construction that achieves the same <code class="language-plaintext highlighter-rouge">MERGE</code>/<code class="language-plaintext highlighter-rouge">UPSERT</code> behavior. I will be using <a href="https://druid.apache.org/docs/latest/multi-stage-query/">SQL-based ingestion</a>, which is available in newer versions of Druid.</p>
<p>The <code class="language-plaintext highlighter-rouge">MERGE</code> statement, in a simplified way, works like this:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">MERGE</span> <span class="k">INTO</span> <span class="n">druid_table</span>
<span class="p">(</span><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">external_table</span><span class="p">)</span>
<span class="k">ON</span> <span class="n">druid_table</span><span class="p">.</span><span class="n">keys</span> <span class="o">=</span> <span class="n">external_table</span><span class="p">.</span><span class="n">keys</span>
<span class="k">WHEN</span> <span class="n">MATCHED</span> <span class="k">THEN</span> <span class="k">UPDATE</span> <span class="p">...</span>
<span class="k">WHEN</span> <span class="k">NOT</span> <span class="n">MATCHED</span> <span class="k">THEN</span> <span class="k">INSERT</span> <span class="p">...</span>
</code></pre></div></div>
<p>So, you compare old <em>(druid_table)</em> and new data <em>(external_table)</em> with respect to a <em>matching condition</em>. This is typically a combination of timestamp and key fields, which in the above pseudocode is denoted by <em>keys</em>. There are three possible outcomes for any combination of <em>keys</em>:</p>
<ol>
<li>If <em>keys</em> exists only in <em>druid_table</em>, leave that data untouched.</li>
<li>If <em>keys</em> exists in both tables, replace the row(s) in <em>druid_table</em> with those in <em>external_table</em>.</li>
<li>If <em>keys</em> exists only in <em>external_table</em>, insert that data into <em>druid_table</em>.</li>
</ol>
<p>But Druid SQL does not offer a <code class="language-plaintext highlighter-rouge">MERGE</code> statement, at least not at the time of this writing. Can we do this in SQL anyway? Stay tuned if you want to know!</p>
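<p>As a preview of where this is going: in the same pseudocode spirit as the <code class="language-plaintext highlighter-rouge">MERGE</code> statement above, the three outcomes can be expressed as an anti-join on the old data, unioned with the new data. Table and key names are placeholders here, and this is not yet runnable Druid SQL:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>REPLACE INTO druid_table OVERWRITE ALL
SELECT * FROM druid_table t   -- outcome 1: keep rows whose keys do not occur in the new data
WHERE NOT EXISTS (SELECT 1 FROM external_table e WHERE e.keys = t.keys)
UNION ALL
SELECT * FROM external_table  -- outcomes 2 and 3: matched rows are replaced, new rows inserted
</code></pre></div></div>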
<p>This tutorial works with <a href="https://druid.apache.org/docs/latest/tutorials/">the Druid 28 quickstart</a>.</p>
<h2 id="recap-the-data">Recap: the data</h2>
<p>Let’s use the same data as in <a href="/2023/03/07/selective-bulk-upserts-in-apache-druid/">the bulk upsert blog</a>: daily aggregated viewership data from various ad networks.</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-01T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">2770</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">330.69</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-01T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"fakebook"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">9646</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">137.85</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-01T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"twottr"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">1139</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">493.73</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-02T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">9066</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">368.66</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-02T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"fakebook"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">4426</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">170.96</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-02T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"twottr"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">9110</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">452.2</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-03T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">3275</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">363.53</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-03T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"fakebook"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">9494</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">426.37</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-03T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"twottr"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">4325</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">107.44</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-04T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">8816</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">311.53</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-04T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"fakebook"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">8955</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">254.5</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-04T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"twottr"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">6905</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">211.74</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-05T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">3075</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">382.41</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-05T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"fakebook"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">4870</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">205.84</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-05T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"twottr"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">1418</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">282.21</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-06T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">7413</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">322.43</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-06T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"fakebook"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">1251</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">265.52</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-06T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"twottr"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">8055</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">394.56</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-07T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">4279</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">317.84</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-07T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"fakebook"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">5848</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">162.96</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-07T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"twottr"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">9449</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">379.21</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Save this file as <code class="language-plaintext highlighter-rouge">data1.json</code>. Also, save the “new data” bit:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-03T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">4521</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">378.65</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-04T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">4330</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">464.02</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-05T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">6088</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">320.57</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-06T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">3417</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">162.77</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-07T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">9762</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">76.27</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-08T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">1484</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">188.17</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-09T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">1845</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">287.5</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>as <code class="language-plaintext highlighter-rouge">data2.json</code>.</p>
<h2 id="initial-data-ingestion">Initial data ingestion</h2>
<p>Let’s ingest the first data set. We want to set the segment granularity to <code class="language-plaintext highlighter-rouge">month</code>, so the ingestion statement uses a <code class="language-plaintext highlighter-rouge">PARTITIONED BY MONTH</code> clause. Moreover, we enforce secondary partitioning by choosing <code class="language-plaintext highlighter-rouge">REPLACE</code> mode and by including a <code class="language-plaintext highlighter-rouge">CLUSTERED BY</code> clause. Here’s the complete statement (replace the path in <code class="language-plaintext highlighter-rouge">baseDir</code> with the path you saved the sample file to):</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"ad_data"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"local","baseDir":"/<my base path>","filter":"data1.json"}'</span><span class="p">,</span>
<span class="s1">'{"type":"json"}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"date"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"ad_network"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"ads_impressions"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"ads_revenue"</span> <span class="nb">DOUBLE</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">TIME_PARSE</span><span class="p">(</span><span class="nv">"date"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"ad_network"</span><span class="p">,</span>
<span class="nv">"ads_impressions"</span><span class="p">,</span>
<span class="nv">"ads_revenue"</span>
<span class="k">FROM</span> <span class="nv">"ext"</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">MONTH</span>
<span class="n">CLUSTERED</span> <span class="k">BY</span> <span class="nv">"ad_network"</span>
</code></pre></div></div>
<p>You can run this SQL from the <code class="language-plaintext highlighter-rouge">Query</code> tab in the Druid console:</p>
<p><img src="/assets/2023-11-25-01-ingest1.jpg" alt="Console running initial ingestion" /></p>
<p>Or you can use the Ingest wizard to enter the same code.</p>
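<p>Once the job has finished, a quick sanity check (a sketch, using the column names from the ingestion statement above) confirms that the data has landed:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT "ad_network", COUNT(*) AS "rows", SUM("ads_revenue") AS "total_revenue"
FROM "ad_data"
GROUP BY 1
</code></pre></div></div>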
<h2 id="the-merge-query">The merge query</h2>
<p>Many thanks to <a href="https://www.linkedin.com/in/jkowtko/">John Kowtko</a> for pointing out this approach. Since Druid doesn’t have a <code class="language-plaintext highlighter-rouge">MERGE</code> statement, let’s emulate it using a <code class="language-plaintext highlighter-rouge">FULL OUTER JOIN</code>. Druid’s <a href="https://druid.apache.org/docs/latest/multi-stage-query/concepts#multi-stage-query-task-engine">MSQ engine</a> supports sort/merge joins of tables of arbitrary size, so we can actually pull this off!</p>
<p>Important note: the new join algorithm needs to be explicitly requested by <a href="https://druid.apache.org/docs/latest/multi-stage-query/reference#joins">setting a query context parameter</a>. Open up the query engine menu next to the <code class="language-plaintext highlighter-rouge">Preview</code> button, and select <code class="language-plaintext highlighter-rouge">Edit context</code>:</p>
<p><img src="/assets/2023-11-25-02-ingest2.jpg" alt="Second ingestion with context" /></p>
<p>Add <code class="language-plaintext highlighter-rouge">{ "sqlJoinAlgorithm": "sortMerge" }</code> to the query context.</p>
<p><img src="/assets/2023-11-25-03-context.jpg" width="30%" /></p>
<p>Then run the ingestion query:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"ad_data"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"local","baseDir":"/<my base path>","filter":"data2.json"}'</span><span class="p">,</span>
<span class="s1">'{"type":"json"}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"date"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"ad_network"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"ads_impressions"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"ads_revenue"</span> <span class="nb">DOUBLE</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">COALESCE</span><span class="p">(</span><span class="nv">"new_data"</span><span class="p">.</span><span class="nv">"__time"</span><span class="p">,</span> <span class="nv">"ad_data"</span><span class="p">.</span><span class="nv">"__time"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="n">COALESCE</span><span class="p">(</span><span class="nv">"new_data"</span><span class="p">.</span><span class="nv">"ad_network"</span><span class="p">,</span> <span class="nv">"ad_data"</span><span class="p">.</span><span class="nv">"ad_network"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"ad_network"</span><span class="p">,</span>
<span class="k">CASE</span> <span class="k">WHEN</span> <span class="nv">"new_data"</span><span class="p">.</span><span class="nv">"ad_network"</span> <span class="k">IS</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="k">THEN</span> <span class="nv">"new_data"</span><span class="p">.</span><span class="nv">"ads_impressions"</span> <span class="k">ELSE</span> <span class="nv">"ad_data"</span><span class="p">.</span><span class="nv">"ads_impressions"</span> <span class="k">END</span> <span class="k">AS</span> <span class="nv">"ads_impressions"</span><span class="p">,</span>
<span class="k">CASE</span> <span class="k">WHEN</span> <span class="nv">"new_data"</span><span class="p">.</span><span class="nv">"ad_network"</span> <span class="k">IS</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="k">THEN</span> <span class="nv">"new_data"</span><span class="p">.</span><span class="nv">"ads_revenue"</span> <span class="k">ELSE</span> <span class="nv">"ad_data"</span><span class="p">.</span><span class="nv">"ads_revenue"</span> <span class="k">END</span> <span class="k">AS</span> <span class="nv">"ads_revenue"</span>
<span class="k">FROM</span>
<span class="nv">"ad_data"</span>
<span class="k">FULL</span> <span class="k">OUTER</span> <span class="k">JOIN</span>
<span class="p">(</span> <span class="k">SELECT</span>
<span class="n">TIME_PARSE</span><span class="p">(</span><span class="nv">"date"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"ad_network"</span><span class="p">,</span>
<span class="nv">"ads_impressions"</span><span class="p">,</span>
<span class="nv">"ads_revenue"</span>
<span class="k">FROM</span> <span class="nv">"ext"</span> <span class="p">)</span> <span class="nv">"new_data"</span>
<span class="k">ON</span> <span class="nv">"ad_data"</span><span class="p">.</span><span class="nv">"__time"</span> <span class="o">=</span> <span class="nv">"new_data"</span><span class="p">.</span><span class="nv">"__time"</span> <span class="k">AND</span> <span class="nv">"ad_data"</span><span class="p">.</span><span class="nv">"ad_network"</span> <span class="o">=</span> <span class="nv">"new_data"</span><span class="p">.</span><span class="nv">"ad_network"</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">MONTH</span>
<span class="n">CLUSTERED</span> <span class="k">BY</span> <span class="nv">"ad_network"</span>
</code></pre></div></div>
<h2 id="analysis-of-the-query">Analysis of the query</h2>
<p>What have we done here?</p>
<p>We are emulating the <code class="language-plaintext highlighter-rouge">MERGE</code> statement with a full outer join. The left side table is the data we already have in Druid; the right side is the new data. Our merge key is a combination of timestamp (daily granularity) and ad network.</p>
<p>For each key combination there are three possible outcomes:</p>
<ol>
<li>If the right hand side is <em>null</em>, leave the left hand side data as the result (leave old data untouched).</li>
<li>If neither side is <em>null</em>, replace the row(s) in the existing table with new data from the right hand side (update rows).</li>
<li>If the left hand side is <em>null</em>, insert the right hand side data into Druid.</li>
</ol>
<p>This is exactly what we wanted to happen.</p>
<p>In order to identify the correct data to be inserted, we look at the join key:</p>
<ul>
<li>Data rows that refer to <em>key fields</em> are modeled with a <code class="language-plaintext highlighter-rouge">COALESCE</code> expression: <code class="language-plaintext highlighter-rouge">COALESCE("new_data"."ad_network", "ad_data"."ad_network") AS "ad_network"</code> selects the key field from the right hand side, falling back to the left hand side if the right hand side is <em>null</em> (that is, no matching new row exists).</li>
<li>For <em>non-key fields</em> the statement is a bit more complex because we still have to select based on the <em>key field</em>. Otherwise some real <em>null</em> values in the data might create inconsistencies, where we would overwrite rows only partially. Hence an expression like <code class="language-plaintext highlighter-rouge">CASE WHEN "new_data"."ad_network" IS NOT NULL THEN "new_data"."ads_impressions" ELSE "ad_data"."ads_impressions" END AS "ads_impressions"</code>.</li>
</ul>
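<p>For readers who know the <code class="language-plaintext highlighter-rouge">MERGE</code> statement from other databases, the join corresponds roughly to the following statement. This is hypothetical and for illustration only - Druid SQL does not accept this syntax. Note how outcome 1 from the list above (leaving unmatched old rows untouched) needs no clause at all, because it is the default behavior of <code class="language-plaintext highlighter-rouge">MERGE</code>:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical ANSI MERGE equivalent, not valid Druid SQL.
-- "ext" stands for the same external input table as in the query above.
MERGE INTO "ad_data"
USING (
  SELECT TIME_PARSE("date") AS "__time", "ad_network", "ads_impressions", "ads_revenue"
  FROM "ext"
) AS "new_data"
ON "ad_data"."__time" = "new_data"."__time" AND "ad_data"."ad_network" = "new_data"."ad_network"
WHEN MATCHED THEN UPDATE SET
  "ads_impressions" = "new_data"."ads_impressions",
  "ads_revenue" = "new_data"."ads_revenue"
WHEN NOT MATCHED THEN INSERT ("__time", "ad_network", "ads_impressions", "ads_revenue")
  VALUES ("new_data"."__time", "new_data"."ad_network", "new_data"."ads_impressions", "new_data"."ads_revenue")
</code></pre></div></div>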
<h2 id="can-we-be-more-selective">Can we be more selective?</h2>
<p>You might be thinking that this approach entails rewriting all the data in the existing table, even if the range of new data is much more limited. And you would be right. Fortunately, it is possible to <a href="https://druid.apache.org/docs/latest/multi-stage-query/reference#replace-specific-time-ranges">limit the date range to be overwritten</a>.</p>
<p>Let’s try this. Apparently we can specify the date range like so:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"ad_data"</span> <span class="n">OVERWRITE</span> <span class="k">WHERE</span> <span class="n">__time</span> <span class="o">>=</span> <span class="nb">TIMESTAMP</span><span class="s1">'2023-01-03'</span> <span class="k">AND</span> <span class="n">__time</span> <span class="o"><</span> <span class="nb">TIMESTAMP</span><span class="s1">'2023-01-10'</span>
<span class="p">...</span>
</code></pre></div></div>
<p>Alas, this doesn’t work:</p>
<p><img src="/assets/2023-11-25-04-granularity-error.jpg" alt="Granularity error" /></p>
<p><strong>The date filter has to be aligned with the segment boundaries</strong>, otherwise Druid will refuse to run the query. This is actually a Good Thing: in JSON ingestion mode you would be able to overwrite a whole segment with data covering a smaller date range, potentially deleting data that you actually wanted to keep!</p>
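<p>If you are unsure where your segment boundaries lie, you can look them up in the <code class="language-plaintext highlighter-rouge">sys</code> schema. A minimal sketch (adjust the datasource name as needed):</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT "start", "end", "num_rows"
FROM sys.segments
WHERE "datasource" = 'ad_data' AND is_active = 1
ORDER BY "start"
</code></pre></div></div>
<p>With <code class="language-plaintext highlighter-rouge">PARTITIONED BY MONTH</code>, this should show segments starting at 2023-01-01 and ending at 2023-02-01.</p>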
<p>If we adjust the date range clause to match the segment boundaries:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"ad_data"</span> <span class="n">OVERWRITE</span> <span class="k">WHERE</span> <span class="n">__time</span> <span class="o">>=</span> <span class="nb">TIMESTAMP</span><span class="s1">'2023-01-01'</span> <span class="k">AND</span> <span class="n">__time</span> <span class="o"><</span> <span class="nb">TIMESTAMP</span><span class="s1">'2023-02-01'</span>
<span class="p">...</span>
</code></pre></div></div>
<p>the ingestion query works fine and we get the desired result:</p>
<p><img src="/assets/2023-11-25-05-query.jpg" alt="Query" /></p>
<p>Use the new <a href="/2023/07/30/druid-sneak-peek-graphical-data-exploration/">graphical exploration mode</a> of Druid to get an idea of the data:</p>
<p><img src="/assets/2023-11-25-06-explore.jpg" alt="Explore" /></p>
<h2 id="learnings">Learnings</h2>
<ul>
<li>You can emulate the effect of a <code class="language-plaintext highlighter-rouge">MERGE</code> statement in Druid with a full outer join.</li>
<li>Make sure to enable the sort/merge join algorithm in the query context.</li>
<li>Some care must be taken around <em>null</em> values in the outer join result.</li>
<li>You can limit the range of data for reprocessing using <code class="language-plaintext highlighter-rouge">OVERWRITE WHERE ...</code>, but take care to align the time filter with your segment granularity.</li>
</ul>
<hr />
<p>“<a href="https://www.flickr.com/photos/mhlimages/48051262646/">This image is taken from Page 500 of Praktisches Kochbuch für die gewöhnliche und feinere Küche</a>” by <a href="https://www.flickr.com/photos/mhlimages/">Medical Heritage Library, Inc.</a> is licensed under <a target="_blank" rel="noopener noreferrer" href="https://creativecommons.org/licenses/by-nc-sa/2.0/">CC BY-NC-SA 2.0 <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/nc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /></a>.</p>Druid SQL: BETWEEN considered harmful2023-11-03T00:00:00+01:002023-11-03T00:00:00+01:00/2023/11/03/druid-sql-between-considered-harmful<p><img src="/assets/2023-11-03-903932_platinumfusi0n_grug.png" width="50%" /></p>
<p>When querying data in Druid (or another analytical database), your query will in almost all cases include a filter on the primary timestamp. And this timestamp filter will usually take the form of an interval.</p>
<p>The easiest way to describe such an interval seems to be the SQL <code class="language-plaintext highlighter-rouge">BETWEEN</code> operator.</p>
<p>Advice from a <a href="https://grugbrain.dev/">grug brained developer</a>: <strong>Don’t do that.</strong></p>
<p>Here’s why.</p>
<h2 id="a-harmless-data-sample">A harmless data sample</h2>
<p>Imagine you have a table like this:</p>
<table>
<thead>
<tr>
<th>__time</th>
<th>val</th>
</tr>
</thead>
<tbody>
<tr>
<td>2023-01-01T01:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-02T00:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-02T06:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-03T00:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-03T01:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-04T00:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-04T07:00:00.000Z</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>You can populate such a table in Druid using SQL ingestion like so:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"sample"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"inline","data":"datetime,val</span><span class="se">\n</span><span class="s1">2023-01-01 01:00:00,1</span><span class="se">\n</span><span class="s1">2023-01-02 00:00:00,1</span><span class="se">\n</span><span class="s1">2023-01-02 06:00:00,1</span><span class="se">\n</span><span class="s1">2023-01-03 00:00:00,1</span><span class="se">\n</span><span class="s1">2023-01-03 01:00:00,1</span><span class="se">\n</span><span class="s1">2023-01-04 00:00:00,1</span><span class="se">\n</span><span class="s1">2023-01-04 07:00:00,1"}'</span><span class="p">,</span>
<span class="s1">'{"type":"csv","findColumnsFromHeader":true}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"datetime"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"val"</span> <span class="nb">BIGINT</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">TIME_PARSE</span><span class="p">(</span><span class="k">TRIM</span><span class="p">(</span><span class="nv">"datetime"</span><span class="p">))</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"val"</span>
<span class="k">FROM</span> <span class="nv">"ext"</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">DAY</span>
</code></pre></div></div>
<p>You want to list all rows for 2nd and 3rd January. You write:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="nv">"sample"</span>
<span class="k">WHERE</span> <span class="n">__time</span> <span class="k">BETWEEN</span> <span class="nb">TIMESTAMP</span><span class="s1">'2023-01-02'</span> <span class="k">AND</span> <span class="nb">TIMESTAMP</span><span class="s1">'2023-01-03'</span>
</code></pre></div></div>
<p>And here’s the result:</p>
<table>
<thead>
<tr>
<th>__time</th>
<th>val</th>
</tr>
</thead>
<tbody>
<tr>
<td>2023-01-02T00:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-02T06:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-03T00:00:00.000Z</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>You notice that all the rows for 2nd January are in the result, but only one row for 3rd January. What happened?</p>
<h2 id="the-solution">The solution</h2>
<p>We are being hit by the interplay of two perfectly documented features here, which together create a minor footgun.</p>
<ol>
<li>The <code class="language-plaintext highlighter-rouge">BETWEEN</code> operator creates a closed interval, that is, it includes both the left and right boundary value. This would by itself not be a problem, were it not for the second feature.</li>
<li>The literal <code class="language-plaintext highlighter-rouge">TIMESTAMP'2023-01-03'</code> does <em>not</em> mean “the entire day of 3rd January”, as one might naïvely think. It is equivalent to “3rd January, 00:00”.</li>
</ol>
<p>In effect, we have created a query that includes all of 2nd January, but only the data for exactly 00:00 on 3rd January!</p>
<p>You could fix this by writing something like <code class="language-plaintext highlighter-rouge">TIMESTAMP'2023-01-03 23:59:59'</code> for the right interval boundary. But does this really catch every last bit of the data for that day? What if you have fractional timestamps? Is your precision milliseconds, or even microseconds?</p>
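<p>For example, assuming millisecond precision (which is what Druid’s <code class="language-plaintext highlighter-rouge">__time</code> column uses), a row stamped one millisecond before midnight silently falls out of the closed interval:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- 23:59:59.999 is later than 23:59:59, so BETWEEN ... AND TIMESTAMP'2023-01-03 23:59:59' misses it
SELECT TIMESTAMP'2023-01-03 23:59:59.999' <= TIMESTAMP'2023-01-03 23:59:59'
-- returns false
</code></pre></div></div>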
<p>This is why I argue that the proper way to model such time filter conditions is to use a right-open interval, which includes the left boundary value <em>but not</em> the right boundary value. If you do that, you have to set the right boundary to the <em>next</em> day (4th January), in order to still catch all of 3rd January in your filter:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="nv">"sample"</span>
<span class="k">WHERE</span> <span class="n">__time</span> <span class="o">>=</span> <span class="nb">TIMESTAMP</span><span class="s1">'2023-01-02'</span> <span class="k">AND</span> <span class="n">__time</span> <span class="o"><</span> <span class="nb">TIMESTAMP</span><span class="s1">'2023-01-04'</span>
</code></pre></div></div>
<p>This query returns the correct result:</p>
<table>
<thead>
<tr>
<th>__time</th>
<th>val</th>
</tr>
</thead>
<tbody>
<tr>
<td>2023-01-02T00:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-02T06:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-03T00:00:00.000Z</td>
<td>1</td>
</tr>
<tr>
<td>2023-01-03T01:00:00.000Z</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>This way of filtering is also in line with the treatment of time intervals almost everywhere in Druid. Segment time chunks, for instance, are defined in terms of right open intervals, too.</p>
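<p>You can see this in the segments created by the ingestion above: each <code class="language-plaintext highlighter-rouge">DAY</code> time chunk has an inclusive start and an exclusive end. A quick way to check is a query against the <code class="language-plaintext highlighter-rouge">sys</code> schema:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT "start", "end"
FROM sys.segments
WHERE "datasource" = 'sample'
ORDER BY "start"
</code></pre></div></div>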
<p>Edit 2023-11-06: <a href="https://pmio.hashnode.dev/">Peter</a> pointed out that you can instead use the <a href="https://druid.apache.org/docs/latest/querying/sql-scalar/#date-and-time-functions"><code class="language-plaintext highlighter-rouge">TIME_IN_INTERVAL</code></a> function. This uses ISO interval notation and creates exactly the right-open intervals we want. So a more elegant way of rewriting the query is:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="nv">"sample"</span>
<span class="k">WHERE</span> <span class="n">TIME_IN_INTERVAL</span><span class="p">(</span><span class="n">__time</span><span class="p">,</span> <span class="s1">'2023-01-02/2023-01-04'</span><span class="p">)</span>
</code></pre></div></div>
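<p>Since the argument is any ISO 8601 interval, you can also combine a start date with a duration, which expresses the intent (“two days starting 2nd January”) even more directly:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT * FROM "sample"
WHERE TIME_IN_INTERVAL(__time, '2023-01-02/P2D')
</code></pre></div></div>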
<h2 id="learnings">Learnings</h2>
<ul>
<li>Don’t use the <code class="language-plaintext highlighter-rouge">BETWEEN</code> operator in SQL. Especially not for time intervals. Because the operator creates an inclusive (closed) interval, the result may not be what you expect.</li>
<li>Use a <code class="language-plaintext highlighter-rouge">WHERE</code> clause with simple comparison operators instead, to create a right open interval.</li>
</ul>
<hr />
<p class="attribution">"<a target="_blank" rel="noopener noreferrer" href="https://www.newgrounds.com/art/view/platinumfusi0n/grug">Grug</a>" by <a target="_blank" rel="noopener noreferrer" href="https://platinumfusi0n.newgrounds.com/">PlatinumFusi0n</a> is licensed under <a target="_blank" rel="noopener noreferrer" href="https://creativecommons.org/licenses/by/3.0/">CC BY 3.0 <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /></a>. </p>Druid 28 Sneak Peek: Ingesting Multiple Kafka Topics into One Datasource2023-10-29T00:00:00+02:002023-10-29T00:00:00+02:00/2023/10/29/druid-28-sneak-peek-ingesting-multiple-kafka-topics-into-one-datasource<p><img src="/assets/2022-11-23-00-pizza.jpg" alt="Pizza" /></p>
<p><a href="https://druid.apache.org/">Apache Druid</a> has the concept of <a href="https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion">supervisors</a> that orchestrate ingestion jobs and handle data handoff and failure recovery. Per datasource, you can have exactly one supervisor.</p>
<p>Until recently, that meant that one datasource could only ingest data from one stream. But many of my customers asked whether it would be possible to multiplex several streams into one datasource. With Druid 28, this becomes possible!</p>
<p>In this quick tutorial, you will learn how to utilize the new options in Kafka ingestion so as to stream multiple topics into one Druid datasource. You will need:</p>
<ul>
<li>a Druid 28 preview build (see below)</li>
<li>any Kafka installation</li>
<li>a test data generator: I am using <a href="https://github.com/Aiven-Labs/python-fake-data-producer-for-apache-kafka">Francesco’s pizza simulator</a>.</li>
</ul>
<h2 id="building-the-distribution">Building the distribution</h2>
<p>You can use the 30-day free trial of <a href="https://imply.io/download-imply/">Imply’s Druid release</a>, which already contains the new features. <a href="https://docs.imply.io/latest/druid/development/extensions-core/kafka-supervisor-reference/#ingesting-from-multiple-topics">Documentation is also available</a>.</p>
<p>But if you want to build the open source version:</p>
<p>Clone the Druid repository, check out the 28.0.0 branch, and build the tarball:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/apache/druid.git
<span class="nb">cd </span>druid
git checkout 28.0.0
mvn clean <span class="nb">install</span> <span class="nt">-Pdist</span> <span class="nt">-DskipTests</span>
</code></pre></div></div>
<p>Then follow the <a href="https://druid.apache.org/docs/latest/development/build.html">instructions</a> to locate and install the tarball, and start Druid. Make sure you are <a href="https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion#load-the-kafka-indexing-service">loading the Kafka indexing extension</a>. (It is included in the quickstart but not by default in the Docker image.)</p>
<h2 id="generating-test-data">Generating test data</h2>
<p>I am assuming that you are running Kafka locally on the standard port and that you have enabled auto topic creation.</p>
<p>Clone the simulator repository, change to the simulator directory and run three instances of pizza delivery:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python3 main.py <span class="nt">--host</span> localhost <span class="nt">--port</span> 9092 <span class="nt">--topic-name</span> pizza-mario <span class="nt">--max-waiting-time</span> 5 <span class="nt">--security-protocol</span> PLAINTEXT <span class="nt">--nr-messages</span> 0 <span class="o">></span>/dev/null &
python3 main.py <span class="nt">--host</span> localhost <span class="nt">--port</span> 9092 <span class="nt">--topic-name</span> pizza-luigi <span class="nt">--max-waiting-time</span> 5 <span class="nt">--security-protocol</span> PLAINTEXT <span class="nt">--nr-messages</span> 0 <span class="o">></span>/dev/null &
python3 main.py <span class="nt">--host</span> localhost <span class="nt">--port</span> 9092 <span class="nt">--topic-name</span> my-pizza <span class="nt">--max-waiting-time</span> 5 <span class="nt">--security-protocol</span> PLAINTEXT <span class="nt">--nr-messages</span> 0 <span class="o">></span>/dev/null &
</code></pre></div></div>
<p>If you have set up Kafka differently, you may have to modify these instructions.</p>
<h2 id="connecting-druid-to-the-streams">Connecting Druid to the streams</h2>
<p>Navigate your browser to the Druid GUI (in the quickstart, this is http://localhost:8888), and start configuring a streaming ingestion:</p>
<p><img src="/assets/2023-10-29-01-streaming.jpg" width="35%" /></p>
<p>Choose Kafka as the input source. Note how there is a new option <code class="language-plaintext highlighter-rouge">topicPattern</code> in the connection settings:</p>
<p><img src="/assets/2023-10-29-02-pattern-setting.jpg" alt="Connection screen" /></p>
<p>This is a <a href="https://en.wikipedia.org/wiki/Regular_expression">regular expression</a> that you can specify in place of the topic name. Let’s try to gobble up all our pizza-related topics by setting the pattern to <em>“pizza”</em>:</p>
<p><img src="/assets/2023-10-29-03-naive-pattern.jpg" alt="Naive attempt" /></p>
<p>Oh, this didn’t work as expected. But the documentation and the tooltip show us the solution: the topic pattern has to match <em>the entire topic name</em>. So the above expression actually behaves like the regular expression <code class="language-plaintext highlighter-rouge">^pizza$</code>.</p>
<p>Armed with this knowledge, let’s correct the pattern:</p>
<p><img src="/assets/2023-10-29-04-match-both.jpg" alt="Preview with prefix match" /></p>
<p>This matches all topic names that start with <em>“pizza-“</em>.</p>
<h2 id="building-the-data-model">Building the data model</h2>
<p>Let’s have a look at the <code class="language-plaintext highlighter-rouge">Parse data</code> screen. Among the <a href="/2022/11/23/processing-nested-json-data-and-kafka-metadata-in-apache-druid/">Kafka metadata</a>, there is a new field containing the source topic for each row of data. The default column name is <code class="language-plaintext highlighter-rouge">kafka.topic</code> but this is configurable in the Kafka metadata settings on the right hand side:</p>
<p><img src="/assets/2023-10-29-05-topic-field.jpg" alt="Parse screen with metadata settings" /></p>
<p>Proceed to the final data model - the topic name is automatically included as a <code class="language-plaintext highlighter-rouge">string</code> column:</p>
<p><img src="/assets/2023-10-29-06-data-model.jpg" alt="Data model" /></p>
<p>Before kicking off the ingestion job, you may want to review and edit the datasource name</p>
<p><img src="/assets/2023-10-29-07-rename-datasource.jpg" width="40%" /></p>
<p>because by default, the datasource name is derived from the topic pattern and may contain a lot of special characters.</p>
<p>Once the supervisor is running, you should see data coming in from both the <code class="language-plaintext highlighter-rouge">pizza-mario</code> and <code class="language-plaintext highlighter-rouge">pizza-luigi</code> topics:</p>
<p><img src="/assets/2023-10-29-08-query.jpg" alt="Query example" /></p>
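<p>For example, a query along the following lines shows how the ingested rows are distributed over the source topics. (The datasource name <em>pizza</em> is an assumption - use whatever name you chose in the wizard, and the metadata column name if you changed it from the default <code class="language-plaintext highlighter-rouge">kafka.topic</code>.)</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT "kafka.topic", COUNT(*) AS "rows"
FROM "pizza"
GROUP BY 1
ORDER BY 2 DESC
</code></pre></div></div>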
<p>What if you want to pick up all 3 topics? From the above, it should be clear - you need to pad the regular expression with <code class="language-plaintext highlighter-rouge">.*</code> on <em>both</em> sides:</p>
<p><img src="/assets/2023-10-29-09-open-pattern.jpg" width="30%" /></p>
<p>You can try it yourself!</p>
<h2 id="task-management">Task management</h2>
<p>Druid will pick up any topics that match the <code class="language-plaintext highlighter-rouge">topicPattern</code>, even if new topics are added during the ingestion.</p>
<p>How are partitions assigned to tasks?</p>
<p>The Supervisor fetches the list of all partitions from all topics and assigns these partitions in the same way as it assigns the partitions of a single topic. In detail this means (quote from the <a href="https://docs.imply.io/latest/druid/development/extensions-core/kafka-supervisor-reference/#ingesting-from-multiple-topics">documentation</a>):</p>
<blockquote>
<p>When ingesting data from multiple topics, partitions are assigned based on the hashcode of the topic name and the id of the partition within that topic. The partition assignment might not be uniform across all the tasks.</p>
</blockquote>
<p>And looking at the code, this boils down to</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Math.abs(31 * topic().hashCode() + partitionId) % taskCount
</code></pre></div></div>
<p>This heuristic should give a fairly uniform load, provided that the data volumes per <em>partition</em> are comparable.</p>
<h2 id="conclusion">Conclusion</h2>
<ul>
<li>You can use <code class="language-plaintext highlighter-rouge">topicPattern</code> instead of <code class="language-plaintext highlighter-rouge">topic</code> in a Kafka Supervisor spec, to enable ingesting from multiple topics.</li>
<li><code class="language-plaintext highlighter-rouge">topicPattern</code> is a regex, but it has to match the whole topic name.</li>
<li>You can have as many active ingestion tasks as the total number of partitions across all topics. Partitions are assigned to tasks using a hashing algorithm.</li>
</ul>
<hr />
<p class="attribution">"<a target="_blank" rel="noopener noreferrer" href="https://www.flickr.com/photos/26242865@N04/5919366429">Pizza</a>" by <a target="_blank" rel="noopener noreferrer" href="https://www.flickr.com/photos/26242865@N04">Katrin Gilger</a> is licensed under <a target="_blank" rel="noopener noreferrer" href="https://creativecommons.org/licenses/by-sa/2.0/?ref=openverse">CC BY-SA 2.0 <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /></a>. </p>New in Imply Polaris: Data Retention Policy2023-09-24T00:00:00+02:002023-09-24T00:00:00+02:00/2023/09/24/new-in-imply-polaris-data-retention-policy<p><a href="https://druid.apache.org/">Apache Druid</a> has always had built-in data lifecycle management by way of <a href="https://druid.apache.org/docs/latest/operations/rule-configuration/">retention rules</a>. Specifying fixed time intervals or relative periods, you would tell Druid to retain only data segments that are not older than <em>x</em> days.</p>
<p>The <a href="https://docs.imply.io/polaris/release#20230816">mid-August release</a> of Polaris brings retention management to Imply Polaris, the fully managed analytics service powered by Druid. You can set the retention policy by table. Here is how it’s done:</p>
<p>In the <em>Tables</em> view, select the <code class="language-plaintext highlighter-rouge">...</code> menu for the table that you want to set the retention policy for.</p>
<p><img src="/assets/2023-09-24-01.jpg" alt="Tables view with context menu" /></p>
<p>In the <em>Edit table</em> screen, find the barrel icon with <code class="language-plaintext highlighter-rouge">Data retention</code> next to it. Select <code class="language-plaintext highlighter-rouge">Specific</code>, and enter the desired period. The format is <a href="https://en.wikipedia.org/wiki/ISO_8601#Durations">ISO-8601 duration</a>, so for instance, <code class="language-plaintext highlighter-rouge">P7D</code> means 7 days (before the current date). Any data that is older (by primary timestamp) is dropped from the table and permanently deleted after 30 days.</p>
<p><img src="/assets/2023-09-24-02.jpg" alt="Table editor with retention menu" /></p>
<p>Then hit <code class="language-plaintext highlighter-rouge">Update</code> to apply the changes.</p>
<h2 id="conclusion">Conclusion</h2>
<ul>
<li>Data retention management is now available in Polaris.</li>
<li>Unlike Druid’s default behavior (which retains data in deep storage indefinitely), data dropped from Polaris will be deleted permanently after 30 days.</li>
</ul>New in Apache Druid 27: Querying Deep Storage2023-09-07T00:00:00+02:002023-09-07T00:00:00+02:00/2023/09/07/new-in-apache-druid-27-querying-deep-storage<p>In realtime analytics, a common scenario is that you want to retain a lot of (years of) historical data in order to run analytics over a longer period of time. But these analytical queries occur infrequently and their performance is usually not critical. The bulk of everyday queries, however, accesses only a limited set of relatively fresh data, typically 1 or 2 weeks worth.</p>
<p>In the standard configuration of Druid, until recently you would have to preload all data that you wanted to be queryable to your data servers. That would mean a lot of local storage would be required, most of which would be accessed very rarely. You could mitigate this problem to a certain extent using <a href="https://druid.apache.org/docs/latest/operations/mixed-workloads#historical-tiering">data tiering</a>, but the cost associated with just having that storage around would still be considerable.</p>
<p>Druid 27 comes with the ability to <a href="https://druid.apache.org/docs/latest/querying/query-deep-storage">query deep storage</a> directly, meaning in the above scenario you can actually keep only your 1-2 weeks of fresh data on local SSDs and retain all your historical data in deep storage only. Because of the higher latency of cloud storage, deep storage queries are generally executed asynchronously, and there is a new API endpoint just for deep storage queries.</p>
<p>Let’s run a small example to learn how deep storage query is configured and used!</p>
<p>This tutorial works with the Druid 27 quickstart.</p>
<h2 id="building-the-test-data-set">Building the test data set</h2>
<p>Ingest the <em>wikipedia</em> example data set. We want to have a bunch of segments so let’s partition by hour. You can configure the ingestion job using the wizard, or just use this SQL statement:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"wikipedia"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span><span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}'</span><span class="p">,</span>
<span class="s1">'{"type":"json"}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"isRobot"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"channel"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"timestamp"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"flags"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isUnpatrolled"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"page"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"diffUrl"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"added"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"comment"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"commentLength"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"isNew"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isMinor"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"delta"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"isAnonymous"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"user"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"deltaBucket"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"deleted"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"namespace"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"cityName"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"countryName"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"regionIsoCode"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"metroCode"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"countryIsoCode"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"regionName"</span> <span class="nb">VARCHAR</span><span class="p">))</span>
<span class="k">SELECT</span>
<span class="n">TIME_PARSE</span><span class="p">(</span><span class="nv">"timestamp"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"isRobot"</span><span class="p">,</span>
<span class="nv">"channel"</span><span class="p">,</span>
<span class="nv">"flags"</span><span class="p">,</span>
<span class="nv">"isUnpatrolled"</span><span class="p">,</span>
<span class="nv">"page"</span><span class="p">,</span>
<span class="nv">"diffUrl"</span><span class="p">,</span>
<span class="nv">"added"</span><span class="p">,</span>
<span class="nv">"comment"</span><span class="p">,</span>
<span class="nv">"commentLength"</span><span class="p">,</span>
<span class="nv">"isNew"</span><span class="p">,</span>
<span class="nv">"isMinor"</span><span class="p">,</span>
<span class="nv">"delta"</span><span class="p">,</span>
<span class="nv">"isAnonymous"</span><span class="p">,</span>
<span class="nv">"user"</span><span class="p">,</span>
<span class="nv">"deltaBucket"</span><span class="p">,</span>
<span class="nv">"deleted"</span><span class="p">,</span>
<span class="nv">"namespace"</span><span class="p">,</span>
<span class="nv">"cityName"</span><span class="p">,</span>
<span class="nv">"countryName"</span><span class="p">,</span>
<span class="nv">"regionIsoCode"</span><span class="p">,</span>
<span class="nv">"metroCode"</span><span class="p">,</span>
<span class="nv">"countryIsoCode"</span><span class="p">,</span>
<span class="nv">"regionName"</span>
<span class="k">FROM</span> <span class="nv">"ext"</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="n">HOUR</span>
</code></pre></div></div>
<p>You should end up with 22 segments, each spanning an hour.</p>
<h2 id="recap-retention-rules">Recap: Retention rules</h2>
<p>By default, Druid retains all data in deep storage that it has ever ingested. You have to run an explicit <a href="https://druid.apache.org/docs/latest/tutorials/tutorial-delete-data#run-a-kill-task">kill task</a> to delete data permanently.</p>
<p>However, standard Druid queries can only work with data segments that have been preloaded to the data servers. Preloading of data is configured using <a href="https://druid.apache.org/docs/latest/operations/rule-configuration">retention rules</a>, which you can configure on a per-datasource basis. Retention rules are evaluated for each segment, from top to bottom, until a rule is found that matches the segment in question. Each rule is either a <em>Load</em> rule (which tells the Coordinator to make that segment available for queries), or a <em>Drop</em> rule (which removes the segment from the list of available segments.) Rules specify either a time period (relative to the current time), or an absolute time interval.</p>
<p>In production setups you would usually find period rules (“retain only data for the last 2 weeks”), but for the tutorial we are going to use interval rules because we are working with a fixed dataset.</p>
<h2 id="first-attempt-to-configure-deep-storage-query">First attempt to configure deep storage query</h2>
<p>The data sample includes one day’s worth of data. Let’s <em>load</em> all data from noon onward, and <em>drop</em> all data from before noon, and see if we can query the data using the endpoint for deep storage.</p>
<p>Here is the first set of retention rules:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"interval"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2016-06-27T12:00:00.000Z/2020-01-01T00:00:00.000Z"</span><span class="p">,</span><span class="w">
</span><span class="nl">"tieredReplicants"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"_default_tier"</span><span class="p">:</span><span class="w"> </span><span class="mi">2</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"useDefaultTierForNull"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"loadByInterval"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"dropForever"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>
<p>If you run a standard query in the console, you see that the rules have been applied:</p>
<p><img src="/assets/2023-09-07-01-query-historical.jpg" alt="Query using standard engine, showing 10 segments" /></p>
<p>Using <code class="language-plaintext highlighter-rouge">curl</code>, I am sending the same query to <a href="https://druid.apache.org/docs/latest/api-reference/sql-api#query-from-deep-storage">the endpoint for deep storage query</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -L -H 'Content-Type: application/json' localhost:8888/druid/v2/sql/statements -d'{
"query": "SELECT DATE_TRUNC('\''hour'\'', __time), COUNT(*) FROM \"wikipedia\" GROUP BY 1 ORDER BY 1",
"context":{
"executionMode":"ASYNC"
}
}'
{"queryId":"query-db8b79ae-f28b-466e-b876-3f987d0e87fc","state":"ACCEPTED","createdAt":"2023-09-06T11:33:39.839Z","schema":[{"name":"EXPR$0","type":"TIMESTAMP","nativeType":"LONG"},{"name":"EXPR$1","type":"BIGINT","nativeType":"LONG"}],"durationMs":-1}
</code></pre></div></div>
<p>This is an asynchronous endpoint - it returns immediately and hands me back a query ID. I have to append the query ID to the URL in order to poll the status and eventually get the result:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -L -H 'Content-Type: application/json' localhost:8888/druid/v2/sql/statements/query-db8b79ae-f28b-466e-b876-3f987d0e87fc
{"queryId":"query-db8b79ae-f28b-466e-b876-3f987d0e87fc","state":"SUCCESS","createdAt":"2023-09-06T11:33:39.839Z","schema":[{"name":"EXPR$0","type":"TIMESTAMP","nativeType":"LONG"},{"name":"EXPR$1","type":"BIGINT","nativeType":"LONG"}],"durationMs":13944,"result":{"numTotalRows":10,"totalSizeInBytes":374,"dataSource":"__query_select","sampleRecords":[[1467028800000,1219],[1467032400000,1211],[1467036000000,1353],[1467039600000,1422],[1467043200000,1442],[1467046800000,1339],[1467050400000,1321],[1467054000000,1175],[1467057600000,1213],[1467061200000,603]],"pages":[{"id":0,"numRows":10,"sizeInBytes":374}]}}
</code></pre></div></div>
<p>Oops. We got the same ten rows as from the interactive query. The naïve approach of just dropping the segments didn’t work. Or rather, it worked as intended.</p>
<h2 id="doing-it-right">Doing it right</h2>
<p>Druid actually distinguishes whether a segment is <em>unavailable</em> (and exists in deep storage only) or whether it is <em>available but not preloaded</em>, which is a new thing in Druid 27. The latter case is expressed by configuring a <em>load</em> rule for that segment, <em>but with a replication factor of 0</em>.</p>
<p>Also worth noting is that at least one segment for the datasource in question has to be preloaded, or else Druid won’t be able to query it at all.</p>
<p>So instead of dropping the segments, let’s load them with a replication factor of 0:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"interval"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2016-06-27T12:00:00.000Z/2020-01-01T00:00:00.000Z"</span><span class="p">,</span><span class="w">
</span><span class="nl">"tieredReplicants"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"_default_tier"</span><span class="p">:</span><span class="w"> </span><span class="mi">2</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"useDefaultTierForNull"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"loadByInterval"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"interval"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2010-01-01T00:00:00.000Z/2016-06-27T12:00:00.000Z"</span><span class="p">,</span><span class="w">
</span><span class="nl">"tieredReplicants"</span><span class="p">:</span><span class="w"> </span><span class="p">{},</span><span class="w">
</span><span class="nl">"useDefaultTierForNull"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"loadByInterval"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>
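<p>You can set these rules in the console, or submit them programmatically through the Coordinator rules API. Here is a sketch, assuming the rules JSON above is saved as <code class="language-plaintext highlighter-rouge">rules.json</code> and the datasource is named <code class="language-plaintext highlighter-rouge">wikipedia</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># POST the retention rules; the Router on port 8888 proxies this to the Coordinator
curl -X POST -H 'Content-Type: application/json' \
  localhost:8888/druid/coordinator/v1/rules/wikipedia \
  -d @rules.json
</code></pre></div></div>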
<p>This is what the rules look like in the console view:</p>
<p><img src="/assets/2023-09-07-04-final-load-rules.jpg" width="75%" /></p>
<p>Use the <em>Mark as used all segments</em> function to force the Coordinator to reapply the retention rules:</p>
<p><img src="/assets/2023-09-07-02-reapply-coordinator-rules.jpg" width="60%" /></p>
<p>This forces the morning segments to be available for asynchronous query only. You will see this reflected in the <code class="language-plaintext highlighter-rouge">Datasources</code> view like this:</p>
<p><img src="/assets/2023-09-07-03-segments-preloaded.jpg" width="52%" /></p>
<p>Then run the same query again:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -L -H 'Content-Type: application/json' localhost:8888/druid/v2/sql/statements -d'{
"query": "SELECT DATE_TRUNC('\''hour'\'', __time), COUNT(*) FROM \"wikipedia\" GROUP BY 1 ORDER BY 1",
"context":{
"executionMode":"ASYNC"
}
}'
{"queryId":"query-7f972571-b26e-4206-a7a8-61503d386d4b","state":"ACCEPTED","createdAt":"2023-09-06T11:38:57.369Z","schema":[{"name":"EXPR$0","type":"TIMESTAMP","nativeType":"LONG"},{"name":"EXPR$1","type":"BIGINT","nativeType":"LONG"}],"durationMs":-1}
</code></pre></div></div>
<p>This time, the result has 22 rows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -L -H 'Content-Type: application/json' localhost:8888/druid/v2/sql/statements/query-7f972571-b26e-4206-a7a8-61503d386d4b
{"queryId":"query-7f972571-b26e-4206-a7a8-61503d386d4b","state":"SUCCESS","createdAt":"2023-09-06T11:38:57.369Z","schema":[{"name":"EXPR$0","type":"TIMESTAMP","nativeType":"LONG"},{"name":"EXPR$1","type":"BIGINT","nativeType":"LONG"}],"durationMs":14294,"result":{"numTotalRows":22,"totalSizeInBytes":782,"dataSource":"__query_select","sampleRecords":[[1466985600000,876],[1466989200000,870],[1466992800000,960],[1466996400000,1025],[1467000000000,936],[1467003600000,836],[1467007200000,969],[1467010800000,1135],[1467014400000,1141],[1467018000000,1137],[1467021600000,1135],[1467025200000,1115],[1467028800000,1219],[1467032400000,1211],[1467036000000,1353],[1467039600000,1422],[1467043200000,1442],[1467046800000,1339],[1467050400000,1321],[1467054000000,1175],[1467057600000,1213],[1467061200000,603]],"pages":[{"id":0,"numRows":22,"sizeInBytes":782}]}}
</code></pre></div></div>
<p>We have successfully queried data that partially exists in deep storage only!</p>
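<p>As an aside, the status response only contains a limited number of <code class="language-plaintext highlighter-rouge">sampleRecords</code>. To download the complete result set, append <code class="language-plaintext highlighter-rouge">/results</code> to the status URL - for instance, with the query ID from above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Retrieve the full result set of a finished deep storage query
curl -L localhost:8888/druid/v2/sql/statements/query-7f972571-b26e-4206-a7a8-61503d386d4b/results
</code></pre></div></div>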
<h2 id="learnings">Learnings</h2>
<ul>
<li>Deep storage query is a great new feature that helps organizations run Druid in a cost-effective way while retaining the ability to query large amounts of historical data.</li>
<li>There is a new API endpoint for queries that include segments from deep storage. These queries run asynchronously.</li>
<li>You have to configure a <em>load</em> rule with a replication factor of 0 in order to make segments available for deep storage queries.</li>
<li>At least one segment of a datasource needs to be preloaded on the historical servers in order to run deep storage queries.</li>
</ul>Using Druid with MinIO2023-08-29T00:00:00+02:002023-08-29T00:00:00+02:00/2023/08/29/using-druid-with-minio<p>With on premise setups, compute/storage separation is often implemented using a NAS or similar storage unit that exposes an S3 API endpoint.</p>
<p>I want to emulate S3-related behavior in a self-contained demo that I can run on my laptop without an internet connection. This is conveniently done using MinIO as my S3-compatible storage.</p>
<p>Let’s deploy MinIO using this docker compose file:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">version</span><span class="pi">:</span> <span class="s2">"</span><span class="s">3"</span>
<span class="na">services</span><span class="pi">:</span>
<span class="na">minio</span><span class="pi">:</span>
<span class="na">image</span><span class="pi">:</span> <span class="s">minio/minio</span>
<span class="na">container_name</span><span class="pi">:</span> <span class="s">minio</span>
<span class="na">environment</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">MINIO_ROOT_USER=admin</span>
<span class="pi">-</span> <span class="s">MINIO_ROOT_PASSWORD=password</span>
<span class="pi">-</span> <span class="s">MINIO_DOMAIN=minio</span>
<span class="na">networks</span><span class="pi">:</span>
<span class="na">minio_net</span><span class="pi">:</span>
<span class="na">aliases</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">druid.minio</span>
<span class="na">ports</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">9001:9001</span>
<span class="pi">-</span> <span class="s">9000:9000</span>
<span class="na">command</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">server"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">/data"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">--console-address"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">:9001"</span><span class="pi">]</span>
<span class="na">mc</span><span class="pi">:</span>
<span class="na">depends_on</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">minio</span>
<span class="na">image</span><span class="pi">:</span> <span class="s">minio/mc</span>
<span class="na">container_name</span><span class="pi">:</span> <span class="s">mc</span>
<span class="na">networks</span><span class="pi">:</span>
<span class="na">minio_net</span><span class="pi">:</span>
<span class="na">environment</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">AWS_ACCESS_KEY_ID=admin</span>
<span class="pi">-</span> <span class="s">AWS_SECRET_ACCESS_KEY=password</span>
<span class="pi">-</span> <span class="s">AWS_REGION=us-east-1</span>
<span class="na">entrypoint</span><span class="pi">:</span> <span class="pi">></span>
<span class="s">/bin/sh -c "</span>
<span class="s">until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;</span>
<span class="s">/usr/bin/mc rm -r --force minio/indata;</span>
<span class="s">/usr/bin/mc mb minio/indata;</span>
<span class="s">/usr/bin/mc policy set public minio/indata;</span>
<span class="s">/usr/bin/mc rm -r --force minio/deepstorage;</span>
<span class="s">/usr/bin/mc mb minio/deepstorage;</span>
<span class="s">/usr/bin/mc policy set public minio/deepstorage;</span>
<span class="s">tail -f /dev/null</span>
<span class="s">"</span>
<span class="na">networks</span><span class="pi">:</span>
<span class="na">minio_net</span><span class="pi">:</span>
</code></pre></div></div>
<p>Save this file as <code class="language-plaintext highlighter-rouge">docker-compose.yaml</code> to your work directory and run the command</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker compose up <span class="nt">-d</span>
</code></pre></div></div>
<p>This gives us a MinIO instance and the <code class="language-plaintext highlighter-rouge">mc</code> client. It will also automatically create two buckets in MinIO, named <code class="language-plaintext highlighter-rouge">indata</code> and <code class="language-plaintext highlighter-rouge">deepstorage</code>, that we will need for this tutorial. If you point your browser to the MinIO console at localhost:9001, you can verify that the buckets have been created:</p>
<p><img src="/assets/2023-08-29-01-minio-buckets.jpg" alt="MinIO Bucket Explorer screenshot" /></p>
<p>(Kudos to <a href="https://github.com/tabular-io/docker-spark-iceberg">Tabular</a> from whose GitHub repository I adapted the docker compose file.)</p>
<h2 id="configuring-minio-as-deep-storage-and-log-target">Configuring MinIO as deep storage and log target</h2>
<p>I am using the standard Druid 27.0 quickstart. If you want to start Druid using the new <code class="language-plaintext highlighter-rouge">start-druid</code> script, you will find the relevant configuration settings in <code class="language-plaintext highlighter-rouge">conf/druid/auto/_common/common.runtime.properties</code> under your Druid installation directory.</p>
<p>First of all, we need to load the S3 extension by adding it to the load list - it should look similar to this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>druid.extensions.loadList=["druid-s3-extensions", "druid-hdfs-storage", "druid-kafka-indexing-service", "druid-datasketches", "druid-multi-stage-query"]
</code></pre></div></div>
<p>Also configure the S3 default settings (endpoint, authentication):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>druid.s3.accessKey=admin
druid.s3.secretKey=password
druid.s3.protocol=http
druid.s3.enablePathStyleAccess=true
druid.s3.endpoint.signingRegion=us-east-1
druid.s3.endpoint.url=http://localhost:9000/
</code></pre></div></div>
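<p>Before going any further, you can do a quick sanity check that the endpoint is reachable - MinIO exposes a simple liveness probe:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Should return HTTP 200 while the MinIO server is up
curl -i http://localhost:9000/minio/health/live
</code></pre></div></div>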
<p>For using MinIO as deep storage, comment out the default settings for <code class="language-plaintext highlighter-rouge">druid.storage.*</code>, and insert this section instead:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>druid.storage.type=s3
druid.storage.bucket=deepstorage
druid.storage.baseKey=segments
</code></pre></div></div>
<p>Likewise, change the default configuration for the indexer logs to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=deepstorage
druid.indexer.logs.s3Prefix=indexing-logs
</code></pre></div></div>
<p>Then start Druid like this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bin/start-druid <span class="nt">-m5g</span>
</code></pre></div></div>
<h2 id="ingesting-data-from-minio">Ingesting data from MinIO</h2>
<p>By default, Druid uses the same settings in <code class="language-plaintext highlighter-rouge">common.runtime.properties</code> for ingestion from S3, too. So, for instance, you can upload the <code class="language-plaintext highlighter-rouge">wikipedia</code> data sample to the <code class="language-plaintext highlighter-rouge">indata</code> bucket in your MinIO instance, taking advantage of the same settings as for deep storage. Just use <code class="language-plaintext highlighter-rouge">s3://indata/</code> as the S3 prefix in the ingestion wizard, and it should work out of the box.</p>
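<p>Any S3 compatible client will do for the upload. A sketch using the AWS CLI, assuming you run it from your Druid install directory where the quickstart sample file lives:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Point the AWS CLI at MinIO and upload the sample data
export AWS_ACCESS_KEY_ID=admin
export AWS_SECRET_ACCESS_KEY=password
aws --endpoint-url http://localhost:9000 s3 cp \
  quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz s3://indata/
</code></pre></div></div>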
<p>Here is my example JSON ingestion spec:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"spec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"ioConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"inputSource"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"s3"</span><span class="p">,</span><span class="w">
</span><span class="nl">"prefixes"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"s3://indata/"</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"inputFormat"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"json"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"tuningConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"partitionsSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"dynamic"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dataSchema"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"dataSource"</span><span class="p">:</span><span class="w"> </span><span class="s2">"wikipedia_s3_2"</span><span class="p">,</span><span class="w">
</span><span class="nl">"timestampSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"column"</span><span class="p">:</span><span class="w"> </span><span class="s2">"time"</span><span class="p">,</span><span class="w">
</span><span class="nl">"format"</span><span class="p">:</span><span class="w"> </span><span class="s2">"iso"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"granularitySpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"queryGranularity"</span><span class="p">:</span><span class="w"> </span><span class="s2">"none"</span><span class="p">,</span><span class="w">
</span><span class="nl">"rollup"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"segmentGranularity"</span><span class="p">:</span><span class="w"> </span><span class="s2">"day"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dimensionsSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"dimensions"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"channel"</span><span class="p">,</span><span class="w">
</span><span class="s2">"cityName"</span><span class="p">,</span><span class="w">
</span><span class="s2">"comment"</span><span class="p">,</span><span class="w">
</span><span class="s2">"countryIsoCode"</span><span class="p">,</span><span class="w">
</span><span class="s2">"countryName"</span><span class="p">,</span><span class="w">
</span><span class="s2">"isAnonymous"</span><span class="p">,</span><span class="w">
</span><span class="s2">"isMinor"</span><span class="p">,</span><span class="w">
</span><span class="s2">"isNew"</span><span class="p">,</span><span class="w">
</span><span class="s2">"isRobot"</span><span class="p">,</span><span class="w">
</span><span class="s2">"isUnpatrolled"</span><span class="p">,</span><span class="w">
</span><span class="s2">"metroCode"</span><span class="p">,</span><span class="w">
</span><span class="s2">"namespace"</span><span class="p">,</span><span class="w">
</span><span class="s2">"page"</span><span class="p">,</span><span class="w">
</span><span class="s2">"regionIsoCode"</span><span class="p">,</span><span class="w">
</span><span class="s2">"regionName"</span><span class="p">,</span><span class="w">
</span><span class="s2">"user"</span><span class="p">,</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"long"</span><span class="p">,</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"delta"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"long"</span><span class="p">,</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"added"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"long"</span><span class="p">,</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"deleted"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Or in SQL (using the automatic conversion function):</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"wikipedia_s3_2"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"source"</span> <span class="k">AS</span> <span class="p">(</span><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"s3","prefixes":["s3://indata/"]}'</span><span class="p">,</span>
<span class="s1">'{"type":"json"}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"time"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"channel"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"cityName"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"comment"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"countryIsoCode"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"countryName"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isAnonymous"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isMinor"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isNew"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isRobot"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isUnpatrolled"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"metroCode"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"namespace"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"page"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"regionIsoCode"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"regionName"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"user"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"delta"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"added"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"deleted"</span> <span class="nb">BIGINT</span><span class="p">))</span>
<span class="k">SELECT</span>
<span class="n">TIME_PARSE</span><span class="p">(</span><span class="nv">"time"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"channel"</span><span class="p">,</span>
<span class="nv">"cityName"</span><span class="p">,</span>
<span class="nv">"comment"</span><span class="p">,</span>
<span class="nv">"countryIsoCode"</span><span class="p">,</span>
<span class="nv">"countryName"</span><span class="p">,</span>
<span class="nv">"isAnonymous"</span><span class="p">,</span>
<span class="nv">"isMinor"</span><span class="p">,</span>
<span class="nv">"isNew"</span><span class="p">,</span>
<span class="nv">"isRobot"</span><span class="p">,</span>
<span class="nv">"isUnpatrolled"</span><span class="p">,</span>
<span class="nv">"metroCode"</span><span class="p">,</span>
<span class="nv">"namespace"</span><span class="p">,</span>
<span class="nv">"page"</span><span class="p">,</span>
<span class="nv">"regionIsoCode"</span><span class="p">,</span>
<span class="nv">"regionName"</span><span class="p">,</span>
<span class="nv">"user"</span><span class="p">,</span>
<span class="nv">"delta"</span><span class="p">,</span>
<span class="nv">"added"</span><span class="p">,</span>
<span class="nv">"deleted"</span>
<span class="k">FROM</span> <span class="nv">"source"</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">DAY</span>
</code></pre></div></div>
<p>In either case, you can easily verify that both the segment files and the indexer logs end up in MinIO.</p>
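<p>For instance, a recursive listing of the deep storage bucket shows both the segment files and the indexing logs:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Recursively list everything Druid wrote to the deep storage bucket
docker exec mc /usr/bin/mc ls -r minio/deepstorage/
</code></pre></div></div>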
<h2 id="changing-the-endpoint-settings-in-the-ingestion-command">Changing the endpoint settings in the ingestion command</h2>
<p>Now let’s go back to local deep storage, so that we can no longer rely on endpoint settings baked into the service properties file. Instead, we need to establish those settings right in the ingestion spec.</p>
<p>Restore the common properties to their default values and restart Druid. (You still need the S3 extension loaded.)</p>
<h3 id="json-version">JSON version</h3>
<p>Start the wizard as for a standard S3 ingestion. Then switch to the JSON view and edit the S3 settings in the ingestion spec:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w"> </span><span class="nl">"inputSource"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"s3"</span><span class="p">,</span><span class="w">
</span><span class="nl">"prefixes"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"s3://indata/"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"accessKeyId"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"default"</span><span class="p">,</span><span class="w">
</span><span class="nl">"password"</span><span class="p">:</span><span class="w"> </span><span class="s2">"admin"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"secretAccessKey"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"default"</span><span class="p">,</span><span class="w">
</span><span class="nl">"password"</span><span class="p">:</span><span class="w"> </span><span class="s2">"password"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"endpointConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://localhost:9000"</span><span class="p">,</span><span class="w">
</span><span class="nl">"signingRegion"</span><span class="p">:</span><span class="w"> </span><span class="s2">"us-east-1"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"clientConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"disableChunkedEncoding"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"enablePathStyleAccess"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"forceGlobalBucketAccessEnabled"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Note: In this case, because we are using plain HTTP, we need to include the <code class="language-plaintext highlighter-rouge">http://</code> in the endpoint URL. If we put it in <code class="language-plaintext highlighter-rouge">clientConfig.protocol</code> instead, as the sample in the documentation might suggest, it is not recognized.</p>
<h3 id="sql-version">SQL version</h3>
<p>In the SQL version, we copy the same settings into the EXTERN statement, like so:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"wikipedia_s3_2"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"source"</span> <span class="k">AS</span> <span class="p">(</span><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{ "type": "s3", "prefixes": [ "s3://indata/" ], "properties": { "accessKeyId": { "type": "default", "password": "admin" }, "secretAccessKey": { "type": "default", "password": "password" } }, "endpointConfig": { "url": "http://localhost:9000", "signingRegion": "us-east-1" }, "clientConfig": { "disableChunkedEncoding": true, "enablePathStyleAccess": true, "forceGlobalBucketAccessEnabled": false } }'</span><span class="p">,</span>
<span class="s1">'{"type":"json"}'</span>
<span class="p">)</span>
<span class="p">)</span> <span class="n">EXTEND</span> <span class="p">(</span><span class="nv">"time"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"channel"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"cityName"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"comment"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"countryIsoCode"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"countryName"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isAnonymous"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isMinor"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isNew"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isRobot"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"isUnpatrolled"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"metroCode"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"namespace"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"page"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"regionIsoCode"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"regionName"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"user"</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="nv">"delta"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"added"</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="nv">"deleted"</span> <span class="nb">BIGINT</span><span class="p">))</span>
<span class="k">SELECT</span>
<span class="n">TIME_PARSE</span><span class="p">(</span><span class="nv">"time"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"channel"</span><span class="p">,</span>
<span class="nv">"cityName"</span><span class="p">,</span>
<span class="nv">"comment"</span><span class="p">,</span>
<span class="nv">"countryIsoCode"</span><span class="p">,</span>
<span class="nv">"countryName"</span><span class="p">,</span>
<span class="nv">"isAnonymous"</span><span class="p">,</span>
<span class="nv">"isMinor"</span><span class="p">,</span>
<span class="nv">"isNew"</span><span class="p">,</span>
<span class="nv">"isRobot"</span><span class="p">,</span>
<span class="nv">"isUnpatrolled"</span><span class="p">,</span>
<span class="nv">"metroCode"</span><span class="p">,</span>
<span class="nv">"namespace"</span><span class="p">,</span>
<span class="nv">"page"</span><span class="p">,</span>
<span class="nv">"regionIsoCode"</span><span class="p">,</span>
<span class="nv">"regionName"</span><span class="p">,</span>
<span class="nv">"user"</span><span class="p">,</span>
<span class="nv">"delta"</span><span class="p">,</span>
<span class="nv">"added"</span><span class="p">,</span>
<span class="nv">"deleted"</span>
<span class="k">FROM</span> <span class="nv">"source"</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">DAY</span>
</code></pre></div></div>
<p><img src="/assets/2023-08-29-02-druid-msq.jpg" alt="SQL ingestion from the Query tab" /></p>
<h2 id="conclusion">Conclusion</h2>
<ul>
<li>You can use MinIO or another S3 compatible storage with Druid. You configure the endpoint, protocol, and authentication settings in the common properties file.</li>
<li>If you need to ingest from a different MinIO instance, or you want to use MinIO for ingestion only, you can set or override the S3 settings in the ingestion spec. This works both in JSON and SQL mode.</li>
<li>Either way, make sure you have the S3 extension loaded.</li>
</ul>Druid Sneak Peek: Graphical Data Exploration2023-07-30T00:00:00+02:002023-07-30T00:00:00+02:00/2023/07/30/druid-sneak-peek-graphical-data-exploration<p><img src="/assets/2023-07-30-01-timechart.jpg" alt="Screenshot of time chart" /></p>
<p>Druid’s unified console is mostly directed at data management. Among other things, you can control your ingestion tasks, manage segments and their compaction settings, monitor services, and there is also a query manager GUI that understands both SQL and Druid native queries.</p>
<p>For data visualization, up until now you had to use external tools such as Superset or Tableau, or Imply’s own <a href="https://docs.imply.io/latest/pivot-overview/">Pivot</a> that comes bundled with the commercial distribution of the software.</p>
<p>But this is going to change. Druid 28 is going to add an exploration GUI that allows visual analysis of data!</p>
<p>This is a sneak peek into Druid 28 functionality. In order to use the new functions, you can (as of the time of writing) <a href="https://druid.apache.org/docs/latest/development/build.html">build Druid</a> from the HEAD of the master branch:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/apache/druid.git
<span class="nb">cd </span>druid
mvn clean <span class="nb">install</span> <span class="nt">-Pdist</span> <span class="nt">-DskipTests</span>
</code></pre></div></div>
<p>Then follow the instructions to locate and install the tarball.</p>
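<p>The build should leave the tarball under <code class="language-plaintext highlighter-rouge">distribution/target</code>. A sketch for unpacking it - the exact version string in the file name will vary:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Unpack the freshly built distribution and change into it
tar -xzf distribution/target/apache-druid-*-bin.tar.gz -C ~/
cd ~/apache-druid-*
</code></pre></div></div>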
<p><em><strong>Disclaimer:</strong> This tutorial uses undocumented functionality and unreleased code. This blog is neither endorsed by Imply nor by the Apache Druid PMC. It merely collects the results of personal experiments. The features described here might, in the final release, work differently, or not at all. In addition, the entire build, or execution, may fail. Your mileage may vary.</em></p>
<p>For this post, I ingested the Wikipedia sample data, as described in <a href="https://druid.apache.org/docs/latest/tutorials/tutorial-msq-extern.html">the quickstart tutorial</a>. You are of course encouraged to try out different data sets with the new explorer.</p>
<h2 id="how-to-access-the-explorer-view">How to access the Explorer view</h2>
<p>To access the data explorer, go to the three dots <code class="language-plaintext highlighter-rouge">...</code> right next to the Services tab, open the menu and click <code class="language-plaintext highlighter-rouge">Explore</code>:</p>
<p><img src="/assets/2023-07-30-02-select-explore.jpg" alt="Screenshot of console with Explore menu selected" /></p>
<p>You will be greeted with a canvas in the middle, and surrounding GUI controls:</p>
<ul>
<li>In the top left field you select the datasource (table) that you wish to explore.</li>
<li>As soon as a datasource is selected, the left panel shows a list of all fields as they occur in the datasource. The list does not distinguish between dimensions and metrics.</li>
<li>In the top bar you can set filters. Time filters come with an option of relative or absolute times. For character values, there is a regular expression filter as well as the ability to pick literal values.</li>
<li>In the right panel you choose one of the supported visualization types. Depending on your selection, different configuration options appear below. There’s also a <code class="language-plaintext highlighter-rouge">...</code> button, behind which you can find the query history list. This is handy if you want to know which SQL queries are generated by the Explorer.</li>
</ul>
<p>Let’s go through the list of visualization types.</p>
<h2 id="time-chart">Time chart</h2>
<p>The Time chart visualizes the development of a metric over time. This is an area chart, or optionally (if you select a dimension to stack by) a stacked area chart.</p>
<p>It is possible to limit the number of items to be displayed in the stacked dimension.</p>
<p><img src="/assets/2023-07-30-01-timechart.jpg" alt="Screenshot of time chart" /></p>
<p>This visualization allows selecting as metrics:</p>
<ul>
<li>total count</li>
<li>unique count of any column</li>
<li>minimum and maximum of timestamp</li>
<li>for numeric columns, moreover, the standard aggregators <em>sum</em>, <em>min</em>, <em>max</em>, and <em>98th percentile</em>.</li>
</ul>
<p><img src="/assets/2023-07-30-03-timechart-metrics.jpg" width="40%" /></p>
<p>This mechanism of selecting metrics is the same for all other visualizations, too.</p>
<h2 id="bar-chart">Bar chart</h2>
<p>The bar chart displays one bar column (dimension) and one metric. It is possible to sort by a metric other than the one displayed.</p>
<p><img src="/assets/2023-07-30-04-barchart.jpg" alt="Screenshort of bar chart" /></p>
<h2 id="table">Table</h2>
<p>The table chart has the most flexibility in selecting and arranging table fields. These are the options:</p>
<ul>
<li><em>Group by</em>: These are your regular BI dimensions, things to aggregate by. While discrete dimensions just create one row per value, <code class="language-plaintext highlighter-rouge">__time</code> has built-in intelligence: when you select it, you can choose the bucketing (granularity). You can select multiple dimension columns.</li>
<li><em>Show</em>: displays a column without aggregating by it. You could view this as interpreting a dimension as a metric where you pick either the latest value or the number of values. You can add multiple columns here, too.</li>
<li><em>Pivot</em>: This displays a dimension across instead of down. The query mechanism is a bit different: it currently uses filtered metrics with one expression per dimension value.</li>
<li><em>Aggregates</em>: These are the metrics, the selection is the same as for the time chart. But you can have multiple metrics.</li>
</ul>
<p><img src="/assets/2023-07-30-05-table-pivot.jpg" alt="Screenshot of pivot table view" /></p>
<ul>
<li><em>Compares</em>: compare by time interval. You can include multiple comparisons. But Compare and Pivot are for now mutually exclusive.</li>
</ul>
<p>You can sort by any column if you click on the column header.</p>
<p><img src="/assets/2023-07-30-06-table-compare.jpg" alt="Screenshot of time comparison table view" /></p>
<h2 id="pie-chart">Pie chart</h2>
<p>This displays one metric, broken down by one dimension. You can specify the number of named slices; the rest goes into Other.</p>
<p><img src="/assets/2023-07-30-07-piechart.jpg" alt="Screenshot of Pie chart" /></p>
<h2 id="multi-axis-chart">Multi-axis chart</h2>
<p>This is a variety of the time chart, but with many metrics. They are drawn as line charts, overlaid, each with its own scale. The first metric’s axis is displayed to the left; all others are displayed to the right of the chart.</p>
<p><img src="/assets/2023-07-30-08-multi-axis.jpg" alt="Screenshot of multi axis chart" /></p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post, I have shown a glimpse of the upcoming data exploration GUI that is built right into Druid. While this is currently not a replacement for a full BI suite, it is a valuable tool for the data engineer to get a better idea of what the data looks like. This can assist in understanding the distribution of the data and optimizing the data model inside Druid. It’s also valuable when an analyst asks the data team why a particular chart looks the way it does.</p>
<p>Note that the data explorer is not part of any official release (yet), and that it is likely going to change and evolve a lot. Feel free to experiment!</p>Merging Realtime Segments in Apache Druid2023-07-25T00:00:00+02:002023-07-25T00:00:00+02:00/2023/07/25/merging-realtime-segments-in-apache-druid<p>So, you want your realtime analytical queries to be really fast, and that’s why you selected <a href="https://druid.apache.org/">Apache Druid</a>! Today, let’s have a look at another aspect of how Druid achieves its amazing performance.</p>
<h2 id="data-layout-and-druid-performance">Data Layout and Druid Performance</h2>
<p>Druid’s query performance can be influenced by multiple factors in the data layout:</p>
<ul>
<li><strong>Segment size</strong>. The optimum size of a <a href="https://druid.apache.org/docs/latest/design/segments.html">data segment</a> is about 500 MB. If segments are much bigger than that, those segments need more resources for querying and also parallelism suffers. More often you encounter the opposite problem: there are too many small segments, which slows down query performance.</li>
<li><strong>Partitioning and sorting of data</strong>. Partitioning gives an extra performance boost when you can partition the segments according to the expected query pattern; also, inside a segment, data is sorted first by time and then by partitioning key, which further speeds up segment scans by increasing the compression ratio. For this to work you need to enable <a href="https://blog.hellmar-becker.de/2022/01/25/partitioning-in-druid-part-3-multi-dimension-range-partitioning/">range partitioning</a>.</li>
<li><strong>Rollup</strong>. This reduces both storage and query needs by pre-aggregating data. Ideally you want to have <a href="https://druid.apache.org/docs/latest/ingestion/rollup.html#perfect-rollup-vs-best-effort-rollup">perfect rollup</a> so that each unique combination of dimension values corresponds to exactly one aggregate row. For this to work, again one has to use range or hash partitioning. In fact, with range or hash partitioning, rollup is always perfect; with <a href="https://blog.hellmar-becker.de/2022/01/06/partitioning-in-druid-part-1-dynamic-and-hash-partitioning/">dynamic partitioning</a>, rollup is only best effort - the resulting table may be multiple times bigger than the optimum.</li>
</ul>
<p>Let’s find out how Druid optimizes these factors for streaming data - without any external processes!</p>
<h2 id="the-problem-with-streaming-data">The Problem with Streaming Data</h2>
<p>In batch processing, all the above factors can be addressed easily. Streaming data, however, usually does not come in neatly ordered. The point of streaming ingestion is to have these data available for analytics within a split second after an event occurs: and so, segments are built up in memory and persisted frequently. As a result, after <em>hand-off</em> (the process of persisting a segment to deep storage), streaming ingested segments are not optimal:</p>
<ul>
<li>Segments will usually be fragmented and <em>smaller than optimum</em> because we cannot wait long to initiate a handoff. In addition, we may have to juggle multiple time chunks simultaneously because of late arriving data.</li>
<li>Range partitioning requires multiple steps of mapping, shuffling, and merging. This is not possible during streaming ingestion, so <em>the only allowed partitioning scheme is dynamic</em>.</li>
<li>Because data can always be added incrementally, rollup is <em>best effort</em>.</li>
</ul>
<h2 id="managing-the-lifecycle">Managing the Lifecycle</h2>
<p>This is where many databases would add an external maintenance process that reorganizes data. It is the beauty of Druid that it handles this reorganization largely automatically by a process called <em>autocompaction</em>. Here are a few notes in passing about autocompaction and its capabilities.</p>
<p>I discussed autocompaction briefly in <a href="https://blog.hellmar-becker.de/2023/01/22/apache-druid-data-lifecycle-management/">my blog about data lifecycle management in Druid</a>. It is a data compaction process that:</p>
<ul>
<li>is done automatically by the Coordinator in the background</li>
<li>has a simple configuration, either through the Druid API or through the Unified Console GUI</li>
<li>is basically a reindexing job - it takes all the segments for a given time chunk and re-ingests them into the same datasource, creating a new version.</li>
</ul>
<p>Autocompaction can:</p>
<ul>
<li>make sure segments have a <strong>size close to the target value</strong>;</li>
<li>set/modify the <strong>partitioning scheme</strong>;</li>
<li>modify <strong>rollup</strong> settings;</li>
<li>modify <strong>segment granularity</strong>;</li>
<li>modify <strong>query granularity</strong>.</li>
</ul>
<p>It also has a setting to leave the newest data alone so as not to interfere with the ongoing ingestion.</p>
<h3 id="set-partitioning-scheme">Set partitioning scheme</h3>
<p>Because streaming ingestion always produces dynamic partitions, you have to use autocompaction to organize your data in a better scheme. While hash and range partitioning both achieve perfect rollup, range partitioning is recommended for most cases - particularly if you know typical query patterns in advance.</p>
<h3 id="modify-rollup-settings">Modify rollup settings</h3>
<p>You can go from a detail to a rollup table using autocompaction. There are some caveats though: this approach makes sense mostly if you are using the same aggregation functions in your queries and in rollup.</p>
<h3 id="modify-segment-granularity">Modify segment granularity</h3>
<p>Segment granularity defines the time period for each time chunk. If your data volume is low enough to have only one segment per time chunk, you might consider increasing segment granularity: if there is only one segment per time chunk, secondary partitioning will do essentially nothing, so you need to make the time chunks bigger in order to force secondary partitioning into effect.</p>
<p>Make sure segment granularities roll up into each other neatly (for instance, don’t do week to month), or else you are in <a href="https://blog.hellmar-becker.de/2023/01/22/apache-druid-data-lifecycle-management/">for some surprises</a>.</p>
<h3 id="modify-query-granularity">Modify query granularity</h3>
<p>Query granularity defines the aggregation level inside a segment. The primary timestamp will be truncated to the precision defined by the query granularity, and data is aggregated at that level.</p>
<p>You can define additional aggregation during autocompaction by making the query granularity coarser. This is a data lifecycle operation and some organizations use it to retain detail data up to a certain period, and aggregates for older data. When configuring a segment merge autocompaction, you would not usually do this.</p>
<h3 id="configure-grace-period-for-recent-data">Configure grace period for recent data</h3>
<p>Druid will soon have the ability to run ingestion and autocompaction over the same time chunk simultaneously. For now, there’s a setting <code class="language-plaintext highlighter-rouge">skipOffsetFromLatest</code>, which is by default set to <code class="language-plaintext highlighter-rouge">P1D</code> (one day). Its effect is that data younger than that period are left alone by autocompaction, because we anticipate more data to be ingested for that period. Increase this setting if you expect a lot of late arriving data.</p>
<p>This is an <a href="https://en.wikipedia.org/wiki/ISO_8601">ISO 8601</a> time period.</p>
<h2 id="configuring-it-in-the-wizard">Configuring it in the wizard</h2>
<p>Autocompaction can be configured using the unified console wizard or the <a href="https://druid.apache.org/docs/latest/data-management/automatic-compaction.html#compaction-configuration-api">API</a>.</p>
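<p>Via the API, a minimal configuration could look like the sketch below - the datasource name and the partitioning settings are just examples:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Create or update the autocompaction config for a datasource
curl -X POST -H 'Content-Type: application/json' \
  localhost:8888/druid/coordinator/v1/config/compaction -d'{
  "dataSource": "wikipedia",
  "skipOffsetFromLatest": "P1D",
  "tuningConfig": {
    "partitionsSpec": {
      "type": "range",
      "partitionDimensions": ["channel"],
      "targetRowsPerSegment": 5000000
    }
  }
}'
</code></pre></div></div>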
<p>In the console, autocompaction settings can be accessed from the <code class="language-plaintext highlighter-rouge">Datasources</code> tab. Clicking the compaction settings for a datasource opens a dialog for the basic settings like partitioning and recent data grace period:</p>
<p><img src="/assets/2023-07-25-01.jpg" alt="Screenshot of autocompaction wizard" /></p>
<p>For configuring rollup and granularity settings, you have to enter JSON mode and follow the reference in <a href="https://druid.apache.org/docs/latest/data-management/automatic-compaction.html#configure-automatic-compaction">the documentation</a>.</p>
<h2 id="outlook">Outlook</h2>
<p>Autocompaction has been with Druid <a href="https://druid.apache.org/docs/0.13.0-incubating/design/coordinator.html#compacting-segments">since version 0.13</a>, but it has seen a lot of improvement recently. Some notable changes that will (likely) be released in the near future:</p>
<ul>
<li>The algorithm that selects segments for compaction is being tuned to grab segments faster and to use free system resources more efficiently, resulting in a considerable speedup.</li>
<li>Fully concurrent ingestion and autocompaction - so data layout will be optimized on the fly!</li>
<li>A lot more options are available to fine tune autocompaction: refer to <a href="https://druid.apache.org/docs/latest/data-management/automatic-compaction.html">the documentation</a> for more detail!</li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>This article gave only a glimpse into the capabilities of Druid autocompaction. What we learned:</p>
<ul>
<li>Autocompaction is a process that merges and optimizes (among others) realtime ingested segments.</li>
<li>Autocompaction runs automatically in the background. It requires no extra program invocation or scheduler setup.</li>
<li>In addition to merging segments, autocompaction can also perform more advanced data lifecycle management tasks with minimal configuration.</li>
</ul>Analyzing GitHub Stars with Imply Polaris2023-07-12T00:00:00+02:002023-07-12T00:00:00+02:00/2023/07/12/analyzing-github-stars-with-imply-polaris<p><img src="/assets/2023-07-12-01-Ludwig_Richter-The_Star_Money-2-1862.jpg" alt="Sterntaler drawing" /></p>
<h2 id="why-all-this">Why all this?</h2>
<p>A while ago, <a href="https://twitter.com/whycaniuse">Will</a> asked if we could measure <a href="https://www.swyx.io/measuring-devrel">community engagement</a> in the <a href="https://druid.apache.org/">Apache Druid</a> community by analyzing the number of <a href="https://docs.github.com/en/rest/activity/starring">GitHub stars</a> that the <a href="https://github.com/apache/druid">Druid source repository</a> got over time. He wanted to compare that development with other repositories within the realtime analytics ecosystem, and possibly identify segments of GitHub users that had starred multiple repositories out of the list we are looking at.</p>
<p>This blog is <em>not</em> about the results of that endeavor. Instead, I am going to look at an interesting data/query modeling problem I encountered on the way.</p>
<h2 id="the-dataset">The dataset</h2>
<p>Let’s get the stargazers for various repos that are either competitive or complementary with Druid. This includes</p>
<ul>
<li>other realtime analytics datastores</li>
<li>streaming platforms</li>
<li>stream processors</li>
<li>frontend (business intelligence) tools.</li>
</ul>
<p>For each stargazer record, we store</p>
<ul>
<li>the user</li>
<li>the repository</li>
<li>date and time when it was starred; this will be the primary timestamp for the Druid data model.</li>
</ul>
<h3 id="how-to-get-the-data">How to get the data</h3>
<p>The data we are going to analyze comes from the <a href="https://docs.github.com/en/rest/activity/starring?apiVersion=2022-11-28#list-stargazers">GitHub stargazers API</a>. <a href="https://dev.to/vnarayaj/analysing-github-stars-extracting-and-analyzing-data-from-github-using-apache-nifir-apache-kafkar-and-apache-druidr-280">Vijay has written a great blog about this</a>; I am using a simpler approach with a Python script that runs once and tries to retrieve all the data.</p>
<p>This probably warrants another blog about the quirks of the GitHub API, so for now let a few remarks suffice.</p>
<ul>
<li>Surprise: <a href="https://twitter.com/elonmusk">Elon Musk</a> did not invent <a href="https://docs.github.com/en/rest/rate-limit/rate-limit?apiVersion=2022-11-28">API rate limiting</a>! Our first idea was to get <em>all the repositories</em> that Druid stargazers also starred. This approach is not viable.</li>
<li>There are primary (hard) and secondary rate limits. Either way, if you hit a limit, GitHub throws a 403 error at you. The required action depends on the type of rate limit that was applied, and this needs to be parsed from response headers.</li>
<li>The API imposes <a href="https://docs.github.com/en/rest/guides/using-pagination-in-the-rest-api?apiVersion=2022-11-28">pagination</a> with a maximum page size of 100 records (see the sketch after this list).</li>
<li>The maximum page index you can retrieve is 399.</li>
<li>As a consequence, <em>you will not get more than 40,000 stars for any one repository</em>, which will soon become important.</li>
</ul>
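<p>For reference, one page of stargazer data can be fetched like this - note the media type header, without which the API omits the <code class="language-plaintext highlighter-rouge">starred_at</code> timestamp. This sketch assumes a personal access token in the <code class="language-plaintext highlighter-rouge">GITHUB_TOKEN</code> environment variable:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Fetch one page of up to 100 stargazer records, including starred_at
curl -H "Accept: application/vnd.github.star+json" \
     -H "Authorization: Bearer $GITHUB_TOKEN" \
     "https://api.github.com/repos/apache/druid/stargazers?per_page=100&page=1"
</code></pre></div></div>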
<p>You can find the code that I used, as well as all the SQL samples from this post, in <a href="https://github.com/hellmarbecker/druid-stargazers">my GitHub repository</a>.</p>
<h3 id="loading-the-data-into-polaris">Loading the data into Polaris</h3>
<p>While the basic SQL analysis works just as well with open source Druid, I am using <a href="https://imply.io/imply-fully-managed-dbaas-polaris/">Imply Polaris</a> because of its ease of use and built-in visualization. Ingesting file data into Polaris is a streamlined process that is well described in <a href="https://docs.imply.io/polaris/quickstart/#upload-a-file-and-view-sample-data">the quickstart guide</a> - follow the instructions there.</p>
<p>Here are some sample records from my script:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{"starred_at": "2012-10-23T19:08:07Z", "user": {"login": "bennettandrews", "id": 1143, "node_id": "MDQ6VXNlcjExNDM=", "avatar_url": "https://avatars.githubusercontent.com/u/1143?v=4", "gravatar_id": "", "url": "https://api.github.com/users/bennettandrews", "html_url": "https://github.com/bennettandrews", "followers_url": "https://api.github.com/users/bennettandrews/followers", "following_url": "https://api.github.com/users/bennettandrews/following{/other_user}", "gists_url": "https://api.github.com/users/bennettandrews/gists{/gist_id}", "starred_url": "https://api.github.com/users/bennettandrews/starred{/owner}{/repo}", "subscriptions_url": "https://api.github.com/users/bennettandrews/subscriptions", "organizations_url": "https://api.github.com/users/bennettandrews/orgs", "repos_url": "https://api.github.com/users/bennettandrews/repos", "events_url": "https://api.github.com/users/bennettandrews/events{/privacy}", "received_events_url": "https://api.github.com/users/bennettandrews/received_events", "type": "User", "site_admin": false}, "starred_repo": "apache/druid"}
{"starred_at": "2012-10-23T19:08:07Z", "user": {"login": "xwmx", "id": 1246, "node_id": "MDQ6VXNlcjEyNDY=", "avatar_url": "https://avatars.githubusercontent.com/u/1246?v=4", "gravatar_id": "", "url": "https://api.github.com/users/xwmx", "html_url": "https://github.com/xwmx", "followers_url": "https://api.github.com/users/xwmx/followers", "following_url": "https://api.github.com/users/xwmx/following{/other_user}", "gists_url": "https://api.github.com/users/xwmx/gists{/gist_id}", "starred_url": "https://api.github.com/users/xwmx/starred{/owner}{/repo}", "subscriptions_url": "https://api.github.com/users/xwmx/subscriptions", "organizations_url": "https://api.github.com/users/xwmx/orgs", "repos_url": "https://api.github.com/users/xwmx/repos", "events_url": "https://api.github.com/users/xwmx/events{/privacy}", "received_events_url": "https://api.github.com/users/xwmx/received_events", "type": "User", "site_admin": false}, "starred_repo": "apache/druid"}
</code></pre></div></div>
<p>Upload the output file to Polaris and ingest only the <code class="language-plaintext highlighter-rouge">starred_at</code>, <code class="language-plaintext highlighter-rouge">user["login"]</code>, <code class="language-plaintext highlighter-rouge">user["id"]</code>, and <code class="language-plaintext highlighter-rouge">starred_repo</code> columns. (You will need to use <code class="language-plaintext highlighter-rouge">JSON_VALUE</code> to extract the nested fields.)</p>
<p>Create a <a href="https://docs.imply.io/polaris/managing-data-cubes/">data cube</a> with default settings. By default, you will get an event count measure, but you can add your own filtered or computed measures if you want.</p>
<h2 id="naïve-visualization">Naïve visualization</h2>
<p>This first data model shows only the new stars for every point in time. This looks a bit confusing, but there is one interesting fact to be gleaned already:</p>
<p><img src="/assets/2023-07-12-02-eventdata.jpg" alt="Visualization: New Star Events over Time" /></p>
<p>The new star data for the <code class="language-plaintext highlighter-rouge">superset</code> repository is gone after a certain date! Why is that?</p>
<p>Remember, we can only retrieve 40,000 stargazer records per repository. But Superset has more than 52,000 stars, so we cannot get them all.</p>
<p>This is a starting point, but what Will really wanted to see is the growth of stars over time. This is something you would normally address with a window function and a <code class="language-plaintext highlighter-rouge">ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</code> clause. But since <a href="/2023/03/26/druid-26-sneak-peek-window-functions/">window functions in Druid</a> are not quite production ready yet, we have to model these queries with a different syntax.</p>
<p>Let’s do this with monthly resolution so we can track the month over month growth curve for each repository.</p>
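<p>For reference, here is roughly what the window function formulation we cannot use yet would look like - a sketch only, assuming standard window syntax:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT
  DATE_TRUNC('MONTH', "__time") AS date_month,
  starred_repo,
  SUM(COUNT(*)) OVER (
    PARTITION BY starred_repo
    ORDER BY DATE_TRUNC('MONTH', "__time")
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS sum_cume
FROM "stargazers-ecosystem"
GROUP BY 1, 2
</code></pre></div></div>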
<h2 id="first-attempt-at-cumulative-sums-self-join">First attempt at cumulative sums: self join</h2>
<p>Last year, I wrote about <a href="/2022/11/05/druid-data-cookbook-cumulative-sums-in-druid-sql/">emulating window functions in Druid SQL</a>, and one of the techniques I used was to join a table with itself. To keep the intermediate result sets small, we roll the data up by month before joining. Since we use the same subquery on both sides of the join, let’s formulate it as a common table expression.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="n">cte</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="n">DATE_TRUNC</span><span class="p">(</span><span class="s1">'MONTH'</span><span class="p">,</span> <span class="nv">"__time"</span><span class="p">)</span> <span class="k">AS</span> <span class="n">date_month</span><span class="p">,</span> <span class="n">starred_repo</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">count_monthly</span>
<span class="k">FROM</span> <span class="nv">"stargazers-ecosystem"</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">cte</span><span class="p">.</span><span class="n">date_month</span><span class="p">,</span>
<span class="n">cte</span><span class="p">.</span><span class="n">starred_repo</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="n">t2</span><span class="p">.</span><span class="n">count_monthly</span><span class="p">)</span> <span class="k">AS</span> <span class="n">sum_cume</span>
<span class="k">FROM</span> <span class="n">cte</span> <span class="k">INNER</span> <span class="k">JOIN</span> <span class="n">cte</span> <span class="n">t2</span> <span class="k">ON</span> <span class="n">cte</span><span class="p">.</span><span class="n">starred_repo</span> <span class="o">=</span> <span class="n">t2</span><span class="p">.</span><span class="n">starred_repo</span>
<span class="k">WHERE</span> <span class="n">t2</span><span class="p">.</span><span class="n">date_month</span> <span class="o"><=</span> <span class="n">cte</span><span class="p">.</span><span class="n">date_month</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
</code></pre></div></div>
<p>The interesting measure in this data model is <code class="language-plaintext highlighter-rouge">sum_cume</code>: the sum of all stars from the beginning of the data up to the reference date, per repository. Let’s visualize this in Polaris over a time period of 10 years!</p>
<p><img src="/assets/2023-07-12-03-selfjoin.jpg" alt="Visualization: Cumulative Sums with Self Join" /></p>
<p>This is <em>almost</em> good, but did you notice how the superset line drops to zero? Why is that?</p>
<p>Well, remember the 40,000 stars limit? Because there are no new Superset entries after a certain date, the inner join finds nothing to match for later months - and those months vanish from the result.</p>
<p>We have been hit by a well known problem in data modeling, <a href="https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/factless-fact-table/"><em>factless facts</em></a>. Generally, this problem of “holes” in the data is addressed by creating a canvas table that provides a data point for each <em>possible</em> combination of dimension values, not only those that we have fact data for.</p>
<h2 id="so-lets-build-up-a-calendar-dimension-instead-shall-we">So let’s build up a calendar dimension instead, shall we</h2>
<p>The straightforward approach to this task is to create a <a href="https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/calendar-date-dimension/"><em>calendar dimension</em></a>. Fortunately, since Druid 26, we have the ability <a href="https://blog.hellmar-becker.de/2023/04/08/druid-sneak-peek-timeseries-interpolation/">to generate an array of equally spaced points in time (with <code class="language-plaintext highlighter-rouge">DATE_EXPAND</code>), and to transform such an array into a set of single value rows (with <code class="language-plaintext highlighter-rouge">UNNEST</code>)</a>. This is not quite a fully featured sequence generator, but it should work for our case.</p>
<p>Note that for all the sample queries you will need to set a query context flag to enable <code class="language-plaintext highlighter-rouge">UNNEST</code>:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"enableUnnest"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Let’s try to fill out the time dimension with one record per month, from the minimum to maximum timestamp that is in the data:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">t</span><span class="p">.</span><span class="n">dateByWeek</span>
<span class="k">FROM</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_FLOOR</span><span class="p">(</span><span class="k">MIN</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">minDate</span><span class="p">,</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_CEIL</span><span class="p">(</span><span class="k">MAX</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">maxDate</span>
<span class="k">FROM</span>
<span class="nv">"stargazers-ecosystem"</span>
<span class="p">),</span>
<span class="k">UNNEST</span><span class="p">(</span><span class="n">DATE_EXPAND</span><span class="p">(</span><span class="n">minDate</span><span class="p">,</span> <span class="n">maxDate</span><span class="p">,</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">t</span><span class="p">(</span><span class="n">dateByWeek</span><span class="p">)</span>
</code></pre></div></div>
<p>Unfortunately, the query fails. But the error message indicates clearly why:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Error: Unsupported operation
Cannot convert to Duration as this period contains months and months vary in length
</code></pre></div></div>
<p>So instead, let’s use the largest interval that does work with <code class="language-plaintext highlighter-rouge">DATE_EXPAND</code> - a week, which always has the same length. We then truncate the generated timestamps to months and deduplicate the values:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="k">DISTINCT</span> <span class="n">TIME_FLOOR</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">dateByWeek</span><span class="p">,</span> <span class="s1">'P1M'</span><span class="p">)</span>
<span class="k">FROM</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_FLOOR</span><span class="p">(</span><span class="k">MIN</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">minDate</span><span class="p">,</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_CEIL</span><span class="p">(</span><span class="k">MAX</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">maxDate</span>
<span class="k">FROM</span>
<span class="nv">"stargazers-ecosystem"</span>
<span class="p">),</span>
<span class="k">UNNEST</span><span class="p">(</span><span class="n">DATE_EXPAND</span><span class="p">(</span><span class="n">minDate</span><span class="p">,</span> <span class="n">maxDate</span><span class="p">,</span> <span class="s1">'P1W'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">t</span><span class="p">(</span><span class="n">dateByWeek</span><span class="p">)</span>
</code></pre></div></div>
<p>This works!</p>
<h2 id="join-up-against-the-fact-data">Join up against the fact data</h2>
<p>Let’s try to join the calendar dimension against the fact data. We know already that we can’t have a “less than or equal” condition in the <code class="language-plaintext highlighter-rouge">JOIN</code> clause. So let’s try and write a Cartesian join with a <code class="language-plaintext highlighter-rouge">WHERE</code> clause that does the time windowing:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span>
<span class="n">cte_calendar</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="k">DISTINCT</span> <span class="n">TIME_FLOOR</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">dateByWeek</span><span class="p">,</span> <span class="s1">'P1M'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">date_month</span>
<span class="k">FROM</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_FLOOR</span><span class="p">(</span><span class="k">MIN</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">minDate</span><span class="p">,</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_CEIL</span><span class="p">(</span><span class="k">MAX</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">maxDate</span>
<span class="k">FROM</span>
<span class="nv">"stargazers-ecosystem"</span>
<span class="p">),</span>
<span class="k">UNNEST</span><span class="p">(</span><span class="n">DATE_EXPAND</span><span class="p">(</span><span class="n">minDate</span><span class="p">,</span> <span class="n">maxDate</span><span class="p">,</span> <span class="s1">'P1W'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">t</span><span class="p">(</span><span class="n">dateByWeek</span><span class="p">)</span>
<span class="p">),</span>
<span class="n">cte_stars</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">DATE_TRUNC</span><span class="p">(</span><span class="s1">'MONTH'</span><span class="p">,</span> <span class="nv">"__time"</span><span class="p">)</span> <span class="k">AS</span> <span class="n">date_month</span><span class="p">,</span>
<span class="n">starred_repo</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">count_monthly</span>
<span class="k">FROM</span> <span class="nv">"stargazers-ecosystem"</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">cte_calendar</span><span class="p">.</span><span class="n">date_month</span><span class="p">,</span>
<span class="n">cte_stars</span><span class="p">.</span><span class="n">starred_repo</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="n">cte_stars</span><span class="p">.</span><span class="n">count_monthly</span><span class="p">)</span> <span class="k">AS</span> <span class="n">sum_cume</span>
<span class="k">FROM</span> <span class="n">cte_calendar</span><span class="p">,</span> <span class="n">cte_stars</span>
<span class="k">WHERE</span> <span class="n">cte_stars</span><span class="p">.</span><span class="n">date_month</span> <span class="o"><=</span> <span class="n">cte_calendar</span><span class="p">.</span><span class="n">date_month</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
</code></pre></div></div>
<p>Alas, this fails too - Druid’s query planner still treats this as a <code class="language-plaintext highlighter-rouge">JOIN</code> with a non-equality condition, and refuses to plan it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SQL requires a join with 'LESS_THAN_OR_EQUAL' condition that is not supported.
</code></pre></div></div>
<p>The message is clear: we need an equi-join. As a workaround, let’s add <code class="language-plaintext highlighter-rouge">starred_repo</code> to the calendar canvas as well, so we can use it as a join key. The canvas definition thus becomes a cross join between the monthly calendar we created above and the list of all unique repositories:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">SELECT</span>
<span class="n">TIME_FLOOR</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">dateByWeek</span><span class="p">,</span> <span class="s1">'P1M'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">date_month</span><span class="p">,</span>
<span class="n">starred_repo</span>
<span class="k">FROM</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_FLOOR</span><span class="p">(</span><span class="k">MIN</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">minDate</span><span class="p">,</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_CEIL</span><span class="p">(</span><span class="k">MAX</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">maxDate</span>
<span class="k">FROM</span>
<span class="nv">"stargazers-ecosystem"</span>
<span class="p">),</span>
<span class="k">UNNEST</span><span class="p">(</span><span class="n">DATE_EXPAND</span><span class="p">(</span><span class="n">minDate</span><span class="p">,</span> <span class="n">maxDate</span><span class="p">,</span> <span class="s1">'P1W'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">t</span><span class="p">(</span><span class="n">dateByWeek</span><span class="p">),</span>
<span class="p">(</span> <span class="k">SELECT</span> <span class="k">DISTINCT</span> <span class="n">starred_repo</span> <span class="k">FROM</span> <span class="nv">"stargazers-ecosystem"</span> <span class="p">)</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
</code></pre></div></div>
<p>Then define this as a CTE, join the facts on <code class="language-plaintext highlighter-rouge">starred_repo</code>, and tuck the unbounded preceding condition away into a <a href="https://druid.apache.org/docs/latest/querying/sql-aggregations.html">filtered metric</a>:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span>
<span class="n">cte_calendar</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">TIME_FLOOR</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">dateByWeek</span><span class="p">,</span> <span class="s1">'P1M'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">date_month</span><span class="p">,</span>
<span class="n">starred_repo</span>
<span class="k">FROM</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_FLOOR</span><span class="p">(</span><span class="k">MIN</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">minDate</span><span class="p">,</span>
<span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">TIME_CEIL</span><span class="p">(</span><span class="k">MAX</span><span class="p">(</span><span class="n">__time</span><span class="p">),</span> <span class="s1">'P1M'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">maxDate</span>
<span class="k">FROM</span>
<span class="nv">"stargazers-ecosystem"</span>
<span class="p">),</span>
<span class="k">UNNEST</span><span class="p">(</span><span class="n">DATE_EXPAND</span><span class="p">(</span><span class="n">minDate</span><span class="p">,</span> <span class="n">maxDate</span><span class="p">,</span> <span class="s1">'P1W'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">t</span><span class="p">(</span><span class="n">dateByWeek</span><span class="p">),</span>
<span class="p">(</span> <span class="k">SELECT</span> <span class="k">DISTINCT</span> <span class="n">starred_repo</span> <span class="k">FROM</span> <span class="nv">"stargazers-ecosystem"</span> <span class="p">)</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
<span class="p">),</span>
<span class="n">cte_stars</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">DATE_TRUNC</span><span class="p">(</span><span class="s1">'MONTH'</span><span class="p">,</span> <span class="nv">"__time"</span><span class="p">)</span> <span class="k">AS</span> <span class="n">date_month</span><span class="p">,</span>
<span class="n">starred_repo</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">count_monthly</span>
<span class="k">FROM</span> <span class="nv">"stargazers-ecosystem"</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">cte_calendar</span><span class="p">.</span><span class="n">date_month</span><span class="p">,</span>
<span class="n">cte_stars</span><span class="p">.</span><span class="n">starred_repo</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="n">cte_stars</span><span class="p">.</span><span class="n">count_monthly</span><span class="p">)</span> <span class="n">FILTER</span><span class="p">(</span><span class="k">WHERE</span> <span class="n">cte_stars</span><span class="p">.</span><span class="n">date_month</span> <span class="o"><=</span> <span class="n">cte_calendar</span><span class="p">.</span><span class="n">date_month</span><span class="p">)</span> <span class="k">AS</span> <span class="n">sum_cume</span>
<span class="k">FROM</span> <span class="n">cte_calendar</span> <span class="k">INNER</span> <span class="k">JOIN</span> <span class="n">cte_stars</span> <span class="k">ON</span> <span class="n">cte_calendar</span><span class="p">.</span><span class="n">starred_repo</span> <span class="o">=</span> <span class="n">cte_stars</span><span class="p">.</span><span class="n">starred_repo</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
</code></pre></div></div>
<p>Use this query to define a cube in the Polaris GUI, and see the result:</p>
<p><img src="/assets/2023-07-12-04-calendar-canvas.jpg" alt="Visualization: Cumulative Sums" /></p>
<p>And now the number of Superset stars maxes out at 40,000, but it no longer drops to zero!</p>
<h2 id="learnings">Learnings</h2>
<ul>
<li>The <a href="https://blog.hellmar-becker.de/2022/11/05/druid-data-cookbook-cumulative-sums-in-druid-sql/">self join approach to cumulative sums</a> fails when there are “holes” in the data (aka factless facts).</li>
<li>The best approach to counter this is building an explicit calendar dimension.</li>
<li><code class="language-plaintext highlighter-rouge">DATE_EXPAND</code> can be used to build a calendar canvas but has some limitations. We showed how to work around those.</li>
<li>We also learned how to work around the <code class="language-plaintext highlighter-rouge">JOIN</code> limitation in Druid SQL by adding a synthetic join key to the calendar dimension and using a filtered metric.</li>
</ul>
<hr />
<p>“Ludwig_Richter-The_Star_Money-2-1862” (via <a href="https://commons.wikimedia.org/wiki/File:Ludwig_Richter-The_Star_Money-2-1862.jpg">Wikimedia Commons</a>) is in the <b><a href="https://en.wikipedia.org/wiki/public_domain" class="extiw" title="en:public domain">public domain</a></b> in its country of origin and other countries and areas where the <a href="https://en.wikipedia.org/wiki/List_of_countries%27_copyright_lengths" class="extiw" title="w:List of countries' copyright lengths">copyright term</a> is the author’s <b>life plus 100 years or fewer</b>.</p>Indexes in Apache Druid2023-06-28T00:00:00+02:002023-06-28T00:00:00+02:00/2023/06/28/indexes-in-apache-druid<p>If you come from a traditional database background, you are probably used to creating and maintaining indexes on most of your data. In a relational database, indexes can speed up queries but at a cost of slower data insertion.</p>
<p>In Druid, on the other hand, you never see a <code class="language-plaintext highlighter-rouge">CREATE INDEX</code> statement. Instead, Druid automatically indexes all data, creating optimized storage segments that provide high performance for all data types - and you never need to select or manage indexes. Let’s look at some of these data organization features!</p>
<h2 id="druid-bitmap-indexes">Druid Bitmap Indexes</h2>
<p>Druid uses <strong><em><a href="https://en.wikipedia.org/wiki/Bitmap_index">bitmap indexes</a></em></strong>. These are created automatically on all string columns and on each subfield of a JSON column. Let’s look at this design choice in some more detail.</p>
<h3 id="types-of-indexes-in-a-relational-database">Types of indexes in a relational database</h3>
<p>Relational databases use a B-tree index as their primary index type. A relational table often has a primary key that can be used to uniquely identify a row in the table. A B-tree index maps individual keys to the rows that contain them. Its use cases are:</p>
<ul>
<li>enforcing uniqueness of a key during inserting</li>
<li>quickly looking up a single value for updates, inserts, and (sometimes) join queries.</li>
</ul>
<p>A B-tree index is not a good choice for analytical queries where you typically have many rows with the same value, and you want to retrieve and aggregate data in bulk. Note also that, due to the structure of a B-tree index, lookups have <em>O(log n)</em> complexity, which may be impractical for large tables.</p>
<h3 id="bitmap-indexes---why">Bitmap indexes - why?</h3>
<p><strong><em>Bitmap indexes</em></strong> came up as relational databases were enhanced with analytical features. A bitmap index stores, for each value, a bit array that has a <em>1</em> bit at the position of each row containing that value, and a <em>0</em> at all other positions. It can be thought of as an <strong><em>inverted index</em></strong>: it maps not a row number to a value, but a value to the collection of rows where the value occurs.</p>
<p>This has a number of advantages:</p>
<ul>
<li>Fast lookup of all rows for a value. Because the bitmap index is an array, such lookups are <em>O(1)</em>.</li>
<li>Even better, bitmap indexes are mergeable in any combination. To model logical conditions such as the union or intersection of filters, just apply bitwise logical <em>OR</em> and <em>AND</em> operations to the bitmaps - see the example after this list.</li>
<li>Bitmaps are always segment local and thus fast to maintain. If your data is partitioned or sharded, the bitmap index is partitioned in the same way.</li>
</ul>
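<p>For instance, a conjunctive filter like the one below - datasource and column names are made up for illustration - can be answered by bitwise ANDing the two per-value bitmaps before a single row is scanned:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Druid intersects the bitmap for country = 'DE'
-- with the bitmap for device = 'mobile'
SELECT COUNT(*) AS matching_rows
FROM "web_events"
WHERE country = 'DE' AND device = 'mobile'
</code></pre></div></div>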
<p>For high cardinality and sparse data, a forward index such as a B-tree may be faster but there are ways to get the best of both worlds. I’ll get to that in a moment.</p>
<p>Why doesn’t Druid use B-tree indexes as a general option? Unlike a bitmap index, a B-tree index has to be global to be fast. (A global index spans the whole table, disregarding any partitioning.) This makes insertion and index maintenance quite expensive.</p>
<h3 id="how-druid-implements-the-best-of-forward-and-inverted-index-druid-roaring-bitmaps">How Druid implements the best of forward and inverted index: Druid roaring bitmaps</h3>
<p>Let’s talk about <em>sparse indexes</em> for a moment. Contrary to a widespread belief, regular bitmaps are best for columns with medium cardinality. If the cardinality of a column is very low, the index is not very selective and you need to read a lot of data anyway. If the cardinality is very high, you have a different problem: Each value is only present in a small fraction of rows, so you would waste a lot of space storing zeroes for each value.</p>
<p>This is why Druid does not implement plain bitmap indexes. Instead, bitmap indexes are by default compressed using <a href="https://www.roaringbitmap.org/">Roaring bitmaps</a>. The Roaring algorithm cuts the bitmap up into pages of 2<sup>16</sup> rows; if a page has very few <em>1</em> bits, it stores a list of row IDs instead.</p>
<p>Roaring bitmaps also support run-length encoding of pages, which is particularly effective when indexing a dimension that is also used to pre-sort the data - more about this later.</p>
<h3 id="bitmap-indexes-and-multi-value-dimensions">Bitmap indexes and multi-value dimensions</h3>
<p><a href="/2021/08/07/multivalue-dimensions-in-apache-druid-part-1/">Multi-value dimensions</a> go nicely with bitmap indexes. A multi-value field would just have a bit set for every value that occurs in the cell. That is another reason to prefer bitmap indexes.</p>
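<p>As a quick illustration with a hypothetical datasource: filtering on a single value of a multi-value column matches every row whose value list contains that value, and the value’s bitmap answers the filter directly:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- matches all rows where ANY of the values in "tags" is 'news'
SELECT COUNT(*) AS tagged_articles
FROM "articles"
WHERE tags = 'news'
</code></pre></div></div>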
<h2 id="colocating-data-partitioning-and-clustering">Colocating Data: Partitioning and Clustering</h2>
<p>In relational data modeling, the main abstraction is that you look at the table as a whole. There is no implicit ordering in the way the data is laid out. It has long been known that this is not the best model for analytical queries. That is why there are options in Druid that inform the physical layout of the data.</p>
<h3 id="time-partitioning-granularities-and-sorting-by-time">Time partitioning, granularities and sorting by time</h3>
<p>All data in Druid is partitioned and sorted by time. Each row has a primary timestamp, and part of the data modeling process is to define a <em>segment granularity</em> and <em>query granularity</em>.</p>
<p><em>Segment granularity</em> is defined by the <code class="language-plaintext highlighter-rouge">PARTITIONED BY</code> clause in SQL based ingestion and it translates directly into the time chunks that define the segment timeline. (Within each time chunk, there may be multiple segments.) Within a segment, data is sorted by primary timestamp. This creates the equivalent of a <strong><em>timeseries index</em></strong>.</p>
<p><em>Query granularity</em> is defined by truncating the primary timestamp in the ingestion query. Druid uses query granularity to deliberately define the time resolution such that data can be rolled up efficiently. This can greatly improve query performance and storage use.</p>
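<p>As a sketch - datasource and column names are invented - a SQL based ingestion that sets daily segment granularity and rolls the data up to hourly query granularity might look like this:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>REPLACE INTO "web_events" OVERWRITE ALL
SELECT
  TIME_FLOOR("__time", 'PT1H') AS "__time",  -- query granularity: truncate to the hour
  country,
  device,
  COUNT(*) AS event_count                    -- rollup metric
FROM "web_events_raw"
GROUP BY 1, 2, 3
PARTITIONED BY DAY                           -- segment granularity: daily time chunks
</code></pre></div></div>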
<h3 id="special-case-multiple-time-granularities">Special case: Multiple time granularities</h3>
<p>If you want to achieve primary sorting by a column other than time, you should set segment and query granularity to the same value. If you still need detailed timestamps, you can define the detailed time as a <a href="https://druid.apache.org/docs/latest/ingestion/schema-design.html#secondary-timestamps">secondary timestamp</a>. The main criterion for this design decision is whether you expect to run predominantly analytical queries that do not have timeseries characteristics, while retaining the ability to run some timeseries queries. The number of timestamp fields is in principle not limited.</p>
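<p>A minimal sketch of this pattern, assuming raw input with a string timestamp column <code class="language-plaintext highlighter-rouge">ts</code>: the primary timestamp is truncated to the segment granularity, while the detailed time lives on as a regular column.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>REPLACE INTO "web_events_daily" OVERWRITE ALL
SELECT
  TIME_FLOOR(TIME_PARSE("ts"), 'P1D') AS "__time",  -- truncated primary timestamp
  TIME_PARSE("ts") AS exact_time,                   -- secondary timestamp
  country,
  device
FROM "web_events_raw"
PARTITIONED BY DAY
</code></pre></div></div>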
<h3 id="secondary-partitioning-pruning-and-range-queries">Secondary partitioning: Pruning and range queries</h3>
<p>Below the timestamp level, there is <em>secondary partitioning</em>, which is usually implemented as <a href="/partitioning-in-druid-part-3-multi-dimension-range-partitioning/">range partitioning</a>. This defines a list of dimension fields to partition by. In SQL based ingestion, this corresponds to the <code class="language-plaintext highlighter-rouge">CLUSTERED BY</code> clause. You want to order your partitioning columns first in the ingestion query, too. Then your data will be sorted according to the partitioning columns, and like values will be grouped together physically. If you filter by the partitioning key in a query, Druid uses this information to determine which data segments to look at, even before scanning any data. This is called <strong><em>partition pruning</em></strong> and is a great way to speed up queries.</p>
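<p>Continuing the invented example from above, the <code class="language-plaintext highlighter-rouge">CLUSTERED BY</code> clause defines the secondary partitioning at ingestion time, and a query that filters on the clustering key can then skip whole segments:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- at ingestion time
REPLACE INTO "web_events" OVERWRITE ALL
SELECT TIME_FLOOR("__time", 'PT1H') AS "__time", country, device, COUNT(*) AS event_count
FROM "web_events_raw"
GROUP BY 1, 2, 3
PARTITIONED BY DAY
CLUSTERED BY country, device

-- at query time: segments whose country range does not include 'DE' are pruned
SELECT device, COUNT(*) AS events
FROM "web_events"
WHERE country = 'DE'
GROUP BY 1
</code></pre></div></div>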
<h3 id="how-druid-implements-composite-index-functionality">How Druid implements composite index functionality</h3>
<p>With multi-dimension range partitioning, Druid achieves the same functionality as a <strong><em>composite index</em></strong>. In an RDBMS, you would use a composite index whenever you have a combination of columns that you use to filter or group by in most of the queries that you typically run.</p>
<p>That being said, because we use bitmap indexes on all columns, we also achieve composite index functionality by merging bitmap indexes across columns.</p>
<h3 id="how-druid-implements-range-index-functionality">How Druid implements range index functionality</h3>
<p>Another advantage of multi-dimension range partitioning is where you query for a range of values. Because the partitioning key also determines sort order, values within a range are grouped together. This achieves the functionality of a <strong><em>range index</em></strong>.</p>
<h3 id="be-extra-space-efficient-front-coding">Be extra space efficient: Front coding</h3>
<p>In addition to range sorting, Druid implements <em>front coding</em> for character data. All string data is represented by a dictionary (which can be thought of as a <strong><em>forward index</em></strong>), and common prefixes are shared between dictionary entries. That way, we optimize space usage without sacrificing speed.</p>
<h2 id="structured-data-nested-columns">Structured Data: Nested Columns</h2>
<p>For nested (JSON) columns, Druid creates a bitmap index <em>for each nested field</em>. With that, you get the functionality of a <strong><em>document (JSON) index</em></strong>. Again, Druid does the right thing automatically without requiring any explicit configuration.</p>
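<p>For example, a filter on a nested field can use that field’s own bitmap index directly (datasource and path are hypothetical):</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT COUNT(*) AS matching_rows
FROM "events_nested"
WHERE JSON_VALUE("payload", '$.user.country') = 'DE'
</code></pre></div></div>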
<h2 id="conclusion">Conclusion</h2>
<p>In this article, I gave a quick tour of data organization and indexing features in Apache Druid. What have we learned?</p>
<ul>
<li>You might be asking: where are the indexes? In Druid, indexes are created and maintained automatically. And a lot of index functionality is done with features that are not technically indexes, but achieve the same effect.</li>
<li>For analytical queries, bitmap indexes are the best choice for many scenarios. Druid creates bitmap indexes on all (string) columns by default.</li>
<li>Bitmap indexes allow merging and logical operations, and thus support arbitrary column combinations, superseding composite indexes.</li>
<li>Our implementation of Roaring bitmaps uses forward lookup for sparse columns: this optimizes both query speed and storage.</li>
<li>Time partitioning aids pruning in time based queries.</li>
<li>Time sorting is great for time series and time range queries.</li>
<li>Secondary partitioning replaces composite and range indexes.</li>
<li>Each field inside a nested column (document column) has its own bitmap index so JSON index functionality is achieved.</li>
</ul>If you come from a traditional database background, you are probably used to creating and maintaining indexes on most of your data. In a relational database, indexes can speed up queries but at a cost of slower data insertion.New in Druid 26: Data Provenance Tracking with Kafka Headers, Automatically2023-06-27T00:00:00+02:002023-06-27T00:00:00+02:00/2023/06/27/new-in-druid-26-data-provenance-tracking-with-kafka-metadata-automatically<p><img src="/assets/2023-06-27-00-airplane.jpg" alt="Lufthansa Airbus A350 XWB D-AIXP arrives SFO L1060413, by wbaiv (Bill Abbott)" /></p>
<p>I have previously written about <a href="https://blog.hellmar-becker.de/2022/08/30/processing-flight-radar-ads-b-data-with-decodable-and-imply/">processing</a> and <a href="https://blog.hellmar-becker.de/2023/02/01/street-level-maps-in-imply-pivot-with-flight-data-and-confluent-cloud/">visualizing</a> ADS-B flight radar data with Kafka and Druid. This time, let’s look at some new possibilities with ingesting those data in a bit more detail.</p>
<p>The story starts with a discussion within our DevRel team at <a href="https://imply.io/">Imply</a>. Wouldn’t it be nice to have multiple flight radar receivers in different locations, and have them all produce data into the same Kafka topic (which lives in Confluent Cloud)? But then, one should also be able to add a unique client ID (and possibly other metadata) to each message. In short, we need data provenance tracking. This is indeed of practical use: in any serious enterprise use case, <a href="https://en.wikipedia.org/wiki/Data_lineage">data lineage</a> tracking is indispensable!</p>
<p>In Kafka, data lineage is tracked with <a href="https://www.confluent.io/blog/5-things-every-kafka-developer-should-know/#tip-5-record-headers">message headers</a>. These are basically key-value pairs that can be defined freely. Inside Kafka, the header values are coded as binary bytes - their meaning and encoding is governed by your data contract, something to keep in mind for later.</p>
<p>Druid has been able to ingest Kafka metadata for a while, <a href="https://blog.hellmar-becker.de/2022/11/23/processing-nested-json-data-and-kafka-metadata-in-apache-druid/">and I have written about it before</a>. But before version 26, you had to edit the ingestion spec manually to enable this feature. Now, it is supported by the Druid console, making things a lot easier. Let’s see how this works for our flight radar data!</p>
<p>In this tutorial, you will</p>
<ul>
<li>generate Kafka messages with headers from flight radar data</li>
<li>ingest and model these data inside Druid</li>
<li>and show how these data can be queried just like any other table column using Druid SQL.</li>
</ul>
<p>For the tutorial, use at least Druid version 26.0. The Druid quickstart works fine.</p>
<h2 id="generating-the-data">Generating the data</h2>
<p>In my <a href="https://blog.hellmar-becker.de/2022/08/30/processing-flight-radar-ads-b-data-with-decodable-and-imply/">blog, I’ve previously described</a> how you can use a Raspberry Pi with a DVB-T stick to receive flight radar data. Let’s modify the Kafka connector script to generate some data with Kafka headers. <code class="language-plaintext highlighter-rouge">kcat</code> comes with a <code class="language-plaintext highlighter-rouge">-H</code> option to inject arbitrary headers into a Kafka message.</p>
<p>Edit the following script, entering a unique client ID of your choice and your geographical coordinates. Then follow the instructions in the blog above to install the script as a service on your Raspberry Pi.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="nv">CC_BOOTSTRAP</span><span class="o">=</span><span class="s2">"<confluent cloud bootstrap server>"</span>
<span class="nv">CC_APIKEY</span><span class="o">=</span><span class="s2">"<api key>"</span>
<span class="nv">CC_SECRET</span><span class="o">=</span><span class="s2">"<secret>"</span>
<span class="nv">CC_SECURE</span><span class="o">=</span><span class="s2">"-X security.protocol=SASL_SSL -X sasl.mechanism=PLAIN -X sasl.username=</span><span class="k">${</span><span class="nv">CC_APIKEY</span><span class="k">}</span><span class="s2"> -X sasl.password=</span><span class="k">${</span><span class="nv">CC_SECRET</span><span class="k">}</span><span class="s2">"</span>
<span class="nv">CLIENT_ID</span><span class="o">=</span><span class="s2">"<client id>"</span>
<span class="nv">LON</span><span class="o">=</span><span class="s2">"0.0"</span>
<span class="nv">LAT</span><span class="o">=</span><span class="s2">"0.0"</span>
<span class="nv">TOPIC_NAME</span><span class="o">=</span><span class="s2">"adsb-raw"</span>
nc localhost 30003 <span class="se">\</span>
| <span class="nb">awk</span> <span class="nt">-F</span> <span class="s2">","</span> <span class="s1">'{ print $5 "|" $0 }'</span> <span class="se">\</span>
| kafkacat <span class="nt">-P</span> <span class="se">\</span>
<span class="nt">-t</span> <span class="k">${</span><span class="nv">TOPIC_NAME</span><span class="k">}</span> <span class="se">\</span>
<span class="nt">-b</span> <span class="k">${</span><span class="nv">CC_BOOTSTRAP</span><span class="k">}</span> <span class="se">\</span>
<span class="nt">-H</span> <span class="s2">"ClientID=</span><span class="k">${</span><span class="nv">CLIENT_ID</span><span class="k">}</span><span class="s2">"</span> <span class="se">\</span>
<span class="nt">-H</span> <span class="s2">"ReceiverLon=</span><span class="k">${</span><span class="nv">LON</span><span class="k">}</span><span class="s2">"</span> <span class="se">\</span>
<span class="nt">-H</span> <span class="s2">"ReceiverLat=</span><span class="k">${</span><span class="nv">LAT</span><span class="k">}</span><span class="s2">"</span> <span class="se">\</span>
<span class="nt">-K</span> <span class="s2">"|"</span> <span class="se">\</span>
<span class="k">${</span><span class="nv">CC_SECURE</span><span class="k">}</span>
</code></pre></div></div>
<p>This adds a Kafka key (the aircraft hex ID), a unique ID for the radar receiver, and also the receiver coordinates, as Kafka headers.</p>
<h2 id="ingesting-the-data">Ingesting the data</h2>
<p>In Druid, create a Kafka connection. In my lab, I am using Confluent Cloud so I have to encode the credentials in the consumer properties as described <a href="https://blog.hellmar-becker.de/2021/10/19/reading-avro-streams-from-confluent-cloud-into-druid/">in another of my blog posts</a>. (If you are using a local, unsecured Kafka service, it is sufficient to enter the bootstrap server and Kafka topic.)</p>
<p>Note how the preview looks different from previous Druid versions:</p>
<p><img src="/assets/2023-06-27-01-preview.jpg" alt="Kafka topic preview with metadata" /></p>
<p>It now lists the Kafka metadata:</p>
<ul>
<li>timestamp</li>
<li>key</li>
<li>headers</li>
</ul>
<p>along with the payload.</p>
<p>In the <code class="language-plaintext highlighter-rouge">Parse data</code> wizard, enter the column headers for the flight data:</p>
<pre><code class="language-csv">MT,TT,SID,AID,Hex,FID,DMG,TMG,DML,TML,CS,Alt,GS,Trk,Lat,Lng,VR,Sq,Alrt,Emer,SPI,Gnd
</code></pre>
<p>Also make sure to enable the switch for parsing Kafka metadata (it should be on by default):</p>
<p><img src="/assets/2023-06-27-02-parse-kafka.jpg" alt="Kafka Parser with metadata" /></p>
<p>If you scroll down the right window pane, you will find a number of new options about handling the metadata.</p>
<p><img src="/assets/2023-06-27-03-kafka-metadata-options.jpg" alt="Kafka metadata options" /></p>
<p>Here you specify how the key is parsed. (You could in theory have a structured key, because the key is parsed into an input format just like the payload. In practice, you will usually have a single string that can be parsed using a regular expression or <a href="https://blog.hellmar-becker.de/2022/11/23/processing-nested-json-data-and-kafka-metadata-in-apache-druid/">a degenerate CSV parser</a>.)</p>
<p>Moreover, this is where you define the prefixes to be used for the metadata in your final data model. And last but not least, you define how to decode the header values. In most cases, UTF-8 is a good choice, but it really depends on what your producer puts in at the other end.</p>
<p>The Kafka timestamp is automatically suggested as the primary Druid timestamp:</p>
<p><img src="/assets/2023-06-27-04-kafka-timestamp.jpg" alt="Model with timestampt" /></p>
<p>So, with minimal configuration (as usual, you have to define your segment granularity and datasource name), you have your Kafka ingestion ready:</p>
<p><img src="/assets/2023-06-27-05-view-spec.jpg" alt="Ingestion spec" /></p>
<p>After submitting the spec, run a quick query to verify that indeed, the Kafka metadata has been parsed and ingested correctly:</p>
<p><img src="/assets/2023-06-27-06-query.jpg" alt="Example query" /></p>
<p>And that is how easily Kafka metadata goes into Apache Druid!</p>
<h2 id="conclusion">Conclusion</h2>
<ul>
<li>Data lineage can be tracked with Kafka headers.</li>
<li>Starting with Druid 26, Kafka metadata (timestamp, key, headers) are supported by the unified console wizard.</li>
<li>With this, we can easily build a distributed flight data service using only one Kafka topic.</li>
</ul>
<hr />
<p class="attribution">"<a target="_blank" rel="noopener noreferrer" href="https://www.flickr.com/photos/wbaiv/52202356360/">Lufthansa Airbus A350 XWB D-AIXP arrives SFO L1060413</a>" by <a target="_blank" rel="noopener noreferrer" href="https://www.flickr.com/photos/wbaiv">wbaiv</a> is licensed under <a target="_blank" rel="noopener noreferrer" href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0 <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /></a>. </p>Overlaying Multiple Metrics in Imply Pivot2023-05-31T00:00:00+02:002023-05-31T00:00:00+02:00/2023/05/31/overlaying-multiple-metrics-in-imply-pivot<p><img src="/assets/2023-05-31-01.jpg" alt="Screenshot with 3 metrics overlayed" /></p>
<p>Today we are going to look at a new enhancement for line chart graphs in <a href="https://docs.imply.io/latest/pivot-overview/">Imply’s Pivot</a>, such as timeseries curves. Up until recently, one chart would only show a single measure<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.
If you pulled in multiple metrics, you would get each in its own chart, like this:</p>
<p><img src="/assets/2023-05-31-02.jpg" alt="Screenshot with 3 metrics in rows" /></p>
<p>What analysts asked for was to have all curves overlaid in one chart, like in the screenshot at the beginning of this article.</p>
<p>This is possible now. But how do you go about it? Let’s have a look!</p>
<h2 id="two-or-more-measures-in-one-chart">Two (or more) measures in one chart</h2>
<p>Here is how to show multiple measures in one chart. In this example, we are looking at clickstream data and we want to show the total number of events, the number of clicks, and the number of sessions.</p>
<p>Drag all the measures you want to show into the show bar:</p>
<p><img src="/assets/2023-05-31-03.jpg" alt="Screenshots with 3 measures in rows, highlight the drag and drop from events, clicks, unique sessions" /></p>
<p>Select the paintbrush icon on the right sidebar and from the option menu, select “Show measures in” “Cell”:</p>
<p><img src="/assets/2023-05-31-04.jpg" alt="Screenshot with the menu options highlighted, and the curves overlaid" /></p>
<p>This looks quite good. But what if the measures are on vastly different scales?</p>
<h2 id="two-measures-with-separate-axis-scaling">Two measures with separate axis scaling</h2>
<p>Let’s stick to the clickstream data and say we have a conversion goal and we want to look at both the total traffic and the conversion rate. We follow the same steps as before, but this time we use the number of clicks and the conversion rate as measures.</p>
<p><img src="/assets/2023-05-31-05.jpg" alt="Screenshot with clicks and conversion rate, have a balloon on the curve to show the numbers at one point" /></p>
<p>As you can see, the scales are so vastly different that the conversion rate all but disappears. But there is a solution: if you have only two measures you can show them on different axes so that both curves fill the canvas.</p>
<p><img src="/assets/2023-05-31-06.jpg" alt="Screenshot with clicks and conversion rate, highlight dual axis menu" /></p>
<p>In the formatting options, choose whether you want to show horizontal grid lines for both axes or only for the first:</p>
<p><img src="/assets/2023-05-31-07.jpg" alt="Highlight show horizontal grid menu and lines for both axes" /></p>
<h2 id="learnings">Learnings</h2>
<ul>
<li>Pivot can now display multiple line graphs in one chart.</li>
<li>If you show more than two measures, they all share the same <em>y</em> axis scaling.</li>
<li>If you show only two measures, you can scale the <em>y</em> axis for each of them independently.</li>
</ul>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>You would be able to display a second measure as a dotted line using the comparison feature, but options are limited. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Druid Sneak Peek: Schema Inference and Arrays2023-05-01T00:00:00+02:002023-05-01T00:00:00+02:00/2023/05/01/druid-sneak-peek-schema-inference-and-arrays<p>One of the strong points of Druid has always been <a href="/2021/08/13/experiments-with-schema-evolution-in-apache-druid/">built-in schema evolution</a>. However, upon getting data of changing shape into Druid, you had two choices:</p>
<ul>
<li>either, specify each field with its type in the ingestion spec, which requires to know all the fields ahead of time</li>
<li>or pick up whatever comes in using <a href="https://druid.apache.org/docs/latest/ingestion/schema-design.html#schema-less-dimensions">schemaless ingestion</a>, with the downside that any dimension ingested that way would be interpreted as a string.</li>
</ul>
<p>The good news is that this is going to change. Druid 26 is going to come with the ability to infer its schema completely from the input data, and even ingest structured data automatically.</p>
<p><em><strong>Disclaimer:</strong> This tutorial uses undocumented functionality and unreleased code. This blog is neither endorsed by Imply nor by the Apache Druid PMC. It merely collects the results of personal experiments. The features described here might, in the final release, work differently, or not at all. In addition, the entire build, or execution, may fail. Your mileage may vary.</em></p>
<p>Druid 26 hasn’t been released yet, but you can <a href="https://druid.apache.org/docs/latest/development/build.html">build Druid</a> from the master branch of the repository and try out the new features.</p>
<p>I am going to pick up the <a href="/2023/04/23/multivalue-dimensions-in-apache-druid-part-5/">multi-value dimensions example from last week</a>, but this time I want you to get an idea how these types of scenarios are going to be handled in the future. We are going to:</p>
<ul>
<li>ingest data using the new schema discovery feature</li>
<li>ingest structured data into an SQL ARRAY</li>
<li>show how <code class="language-plaintext highlighter-rouge">GROUP BY</code> and lateral joins work with that array.</li>
</ul>
<h2 id="ingestion-schema-inference">Ingestion: Schema Inference</h2>
<p>We are using the <code class="language-plaintext highlighter-rouge">ristorante</code> dataset that you can find <a href="/2021/09/25/multivalue-dimensions-in-apache-druid-part-3/">here</a>, but with a little twist: On the <code class="language-plaintext highlighter-rouge">Configure schema</code> tab, uncheck <code class="language-plaintext highlighter-rouge">Explicitly specify dimension list</code>.</p>
<p><img src="/assets/2023-05-01-01-autodetect.jpg" alt="Set autodetect" /></p>
<p>Confirm the warning dialog that pops up, and continue modeling the data. When you proceed to the <code class="language-plaintext highlighter-rouge">Edit spec</code> stage, you can see a new setting that slipped in:</p>
<p><img src="/assets/2023-05-01-02-useSchemaDiscovery.jpg" alt="Autodetect" /></p>
<p>The <code class="language-plaintext highlighter-rouge">dimensionsSpec</code> has no dimension list now, but there is a new flag <code class="language-plaintext highlighter-rouge">useSchemaDiscovery</code>:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w"> </span><span class="nl">"dimensionsSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"useSchemaDiscovery"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"includeAllDimensions"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"dimensionExclusions"</span><span class="p">:</span><span class="w"> </span><span class="p">[]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<h2 id="querying-the-data">Querying the data</h2>
<p>Let’s look at the resulting data with a simple <code class="language-plaintext highlighter-rouge">SELECT *</code> query:</p>
<p><img src="/assets/2023-05-01-03-select-trueArray.jpg" alt="Select all" /></p>
<p>Notice how Druid has automatically detected that <code class="language-plaintext highlighter-rouge">orders</code> is an array of primitives (strings, in this case). You recognize this by the symbol next to the column’s name, which now looks like this: [··]. In older versions, this would have been ingested as a multi-value string. But now, Druid has true <code class="language-plaintext highlighter-rouge">ARRAY</code> columns!</p>
<p>(In the more general case of nested objects, Druid would have generated a nested JSON column.)</p>
<p>In order to take the arrays apart, we can once again make use of the <code class="language-plaintext highlighter-rouge">UNNEST</code> function. This has to be enabled using a query context flag. In the console, use the <code class="language-plaintext highlighter-rouge">Edit context</code> function inside the query engine menu</p>
<p><img src="/assets/2023-05-01-04-editcontext.jpg" width="40%" /></p>
<p>and enter the context:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"enableUnnest"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>In the REST API, you can pass the context directly.</p>
<p>Then, unnest and group the items:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">order_item</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">order_count</span>
<span class="k">FROM</span> <span class="nv">"ristorante_auto"</span><span class="p">,</span> <span class="k">UNNEST</span><span class="p">(</span><span class="n">orders</span><span class="p">)</span> <span class="k">AS</span> <span class="n">t</span><span class="p">(</span><span class="n">order_item</span><span class="p">)</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span>
</code></pre></div></div>
<p><img src="/assets/2023-05-01-05-groupby.jpg" alt="Select groupby" /></p>
<p>Once you have done this, you can filter by individual order items and you don’t have all the quirks that we talked about when doing multi-value dimensions:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">customer</span><span class="p">,</span>
<span class="n">order_item</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">order_count</span>
<span class="k">FROM</span> <span class="nv">"ristorante_auto"</span><span class="p">,</span> <span class="k">UNNEST</span><span class="p">(</span><span class="n">orders</span><span class="p">)</span> <span class="k">AS</span> <span class="n">t</span><span class="p">(</span><span class="n">order_item</span><span class="p">)</span>
<span class="k">WHERE</span> <span class="n">order_item</span> <span class="o">=</span> <span class="s1">'tiramisu'</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
</code></pre></div></div>
<p><img src="/assets/2023-05-01-06-filter.jpg" alt="Filtered groupby" /></p>
<h2 id="conclusion">Conclusion</h2>
<ul>
<li>Druid can now do schema inference.</li>
<li>It can automatically detect primitive types, but also nested objects and arrays of primitives.</li>
<li>Typical Druid queries that would use multi-value dimensions in the past can now be done in a more standard way using array columns and <code class="language-plaintext highlighter-rouge">UNNEST</code>.</li>
</ul>Multi-Value Dimensions in Apache Druid (Part 5)2023-04-23T00:00:00+02:002023-04-23T00:00:00+02:00/2023/04/23/multivalue-dimensions-in-apache-druid-part-5<p><img src="/assets/2023-04-23-07.jpg" alt="" /></p>
<p>An interesting discussion that I had with a Druid user prompts me to continue the loose miniseries about multi-value dimensions in Apache Druid. The previous posts can be found here:</p>
<ul>
<li><a href="/2021/08/07/multivalue-dimensions-in-apache-druid-part-1/">part 1</a></li>
<li><a href="/2021/08/29/multivalue-dimensions-in-apache-druid-part-2/">part 2</a></li>
<li><a href="/2021/09/25/multivalue-dimensions-in-apache-druid-part-3/">part 3</a></li>
<li><a href="/2021/10/03/multivalue-dimensions-in-apache-druid-part-4/">part 4</a></li>
</ul>
<p>In <a href="/2021/08/07/multivalue-dimensions-in-apache-druid-part-1/">part 1</a> I pointed out what multi-value dimensions (MVD) are, and how they behave with respect to <code class="language-plaintext highlighter-rouge">GROUP BY</code> (they do an implicit unnest or, if you will, a lateral join), and also with respect to filtering using a <code class="language-plaintext highlighter-rouge">WHERE</code> clause (you get all the rows that match the <code class="language-plaintext highlighter-rouge">WHERE</code> condition, but no unnesting happens.)</p>
<p>But what if you want to combine grouping and filtering? The behavior of Druid in this case can be a bit surprising. Let’s have a look!</p>
<p>I am using Imply’s version 2023.03.01 of Druid, because I am going to show a few things using Imply’s graphical frontend. If you want to run the SQL examples only, Druid 25 quickstart works fine.</p>
<p>We are using the <code class="language-plaintext highlighter-rouge">ristorante</code> datasource from <a href="/2021/09/25/multivalue-dimensions-in-apache-druid-part-3/">part 3</a>; to create the datasource, follow the instructions given there. (You can make your life a bit easier: by now, Druid allows you to specify the multi-value handling mode directly in the ingestion wizard.)</p>
<p>Start with a simple analysis, breaking down the count of items by item and customer:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">customer</span><span class="p">,</span> <span class="n">orders</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">numOrders</span>
<span class="k">FROM</span> <span class="nv">"ristorante"</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span><span class="mi">2</span>
</code></pre></div></div>
<p><img src="/assets/2023-04-23-01.jpg" alt="" /></p>
<p>No surprises here. The MVD is unnested and the counts are broken down by item, as expected.</p>
<h2 id="quirks-in-multi-value-filtering">Quirks in multi-value filtering</h2>
<p>Now let’s filter by one specific item.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">customer</span><span class="p">,</span> <span class="n">orders</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">numOrders</span>
<span class="k">FROM</span> <span class="nv">"ristorante"</span>
<span class="k">WHERE</span> <span class="n">orders</span> <span class="o">=</span> <span class="s1">'tiramisu'</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span><span class="mi">2</span>
</code></pre></div></div>
<p><img src="/assets/2023-04-23-02.jpg" alt="" /></p>
<p>The result contains a lot of items that are definitely not Tiramisu! We got the filtering behavior of the plain query (without <code class="language-plaintext highlighter-rouge">GROUP BY</code>), and only after that was the unnesting applied!</p>
<p>Maybe if we try to filter <em>after</em> the grouping step, it would work?</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">customer</span><span class="p">,</span> <span class="n">orders</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">numOrders</span>
<span class="k">FROM</span> <span class="nv">"ristorante"</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span><span class="mi">2</span>
<span class="k">HAVING</span> <span class="n">orders</span> <span class="o">=</span> <span class="s1">'tiramisu'</span>
</code></pre></div></div>
<p><img src="/assets/2023-04-23-03.jpg" alt="" /></p>
<p>Alas, the result is the same. No matter how you write the filter, the query plan always selects whole rows of data as they are in the datasource. This is a common trap for the unwary, although the behavior is documented <a href="https://docs.imply.io/latest/druid/querying/multi-value-dimensions/#filtering">here</a> for native queries, into which SQL queries are translated internally.</p>
<p>The same paragraph also mentions <a href="https://docs.imply.io/latest/druid/querying/sql-multivalue-string-functions/">SQL multi-value functions</a>. This is where the path to a solution lies.</p>
<h2 id="filtering-multi-value-strings-properly">Filtering multi-value strings, properly</h2>
<p>The core of the solution is the <code class="language-plaintext highlighter-rouge">MV_FILTER_ONLY</code> function, which is applied to a multi-value field in the <em>projection</em> clause of the <code class="language-plaintext highlighter-rouge">SELECT</code> statement. Its first argument is the field that you want to filter on; the second argument is an <em>array literal</em> of the values that you want to keep.</p>
<p>Arrays are currently the red-headed stepchild of Druid data modeling, although this is about to change soon and there will be a lot more support for them. For now, you cannot declare an <code class="language-plaintext highlighter-rouge">ARRAY</code> column (MVDs are of type string). But you can define an array literal with the <code class="language-plaintext highlighter-rouge">ARRAY</code> constructor. There is also a set of multi-value functions that manipulate such <code class="language-plaintext highlighter-rouge">ARRAY</code>s, but that is another story for another time.</p>
<p>(The complementary function to <code class="language-plaintext highlighter-rouge">MV_FILTER_ONLY</code>, <code class="language-plaintext highlighter-rouge">MV_FILTER_NONE</code>, keeps only the values that are <em>not</em> contained in the array that you pass as the second argument.)</p>
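<p>A minimal sketch of <code class="language-plaintext highlighter-rouge">MV_FILTER_NONE</code> against the same datasource - this is just an illustration, not part of the tutorial query:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- list, per customer, every order item except tiramisu
SELECT
  customer,
  MV_FILTER_NONE(orders, ARRAY['tiramisu']) AS otherItems
FROM "ristorante"
</code></pre></div></div>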
<p>Let’s put together the query:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">customer</span><span class="p">,</span>
<span class="n">MV_FILTER_ONLY</span><span class="p">(</span><span class="n">orders</span><span class="p">,</span> <span class="n">ARRAY</span><span class="p">[</span><span class="s1">'tiramisu'</span><span class="p">])</span> <span class="k">AS</span> <span class="n">orderItem</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">numOrders</span>
<span class="k">FROM</span> <span class="nv">"ristorante"</span>
<span class="k">WHERE</span> <span class="n">orders</span> <span class="o">=</span> <span class="s1">'tiramisu'</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span><span class="mi">2</span>
</code></pre></div></div>
<p><img src="/assets/2023-04-23-04.jpg" alt="" /></p>
<p>You might be thinking that we can do without the <code class="language-plaintext highlighter-rouge">WHERE</code> clause, now that the filter is applied in the projection. Let’s try it out:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">customer</span><span class="p">,</span>
<span class="n">MV_FILTER_ONLY</span><span class="p">(</span><span class="n">orders</span><span class="p">,</span> <span class="n">ARRAY</span><span class="p">[</span><span class="s1">'tiramisu'</span><span class="p">])</span> <span class="k">AS</span> <span class="n">orderItem</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">numOrders</span>
<span class="k">FROM</span> <span class="nv">"ristorante"</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span><span class="mi">2</span>
</code></pre></div></div>
<p><img src="/assets/2023-04-23-05.jpg" alt="" /></p>
<p>Unfortunately, now the result set has rows even for customers that didn’t order Tiramisu, and what is worse, they get a <code class="language-plaintext highlighter-rouge">numOrders</code> value of 1. You have to apply both filters in order to get the correct result.</p>
<h2 id="more-complex-filters">More complex filters</h2>
<p>What if we want to list the orders not for one, but for multiple items? Sure, you could write a query like</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">customer</span><span class="p">,</span>
<span class="n">MV_FILTER_ONLY</span><span class="p">(</span><span class="n">orders</span><span class="p">,</span> <span class="n">ARRAY</span><span class="p">[</span><span class="s1">'espresso'</span><span class="p">,</span> <span class="s1">'tiramisu'</span><span class="p">])</span> <span class="k">AS</span> <span class="n">orderItem</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">numOrders</span>
<span class="k">FROM</span> <span class="nv">"ristorante"</span>
<span class="k">WHERE</span> <span class="n">orders</span> <span class="o">=</span> <span class="s1">'tiramisu'</span> <span class="k">OR</span> <span class="n">orders</span> <span class="o">=</span> <span class="s1">'espresso'</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span><span class="mi">2</span>
</code></pre></div></div>
<p>with a boolean condition in the <code class="language-plaintext highlighter-rouge">WHERE</code> clause. But there is a more elegant way, and it involves more <code class="language-plaintext highlighter-rouge">MV_</code> functions. Instead of the <code class="language-plaintext highlighter-rouge">OR</code> condition, write this:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">customer</span><span class="p">,</span>
<span class="n">MV_FILTER_ONLY</span><span class="p">(</span><span class="n">orders</span><span class="p">,</span> <span class="n">ARRAY</span><span class="p">[</span><span class="s1">'espresso'</span><span class="p">,</span> <span class="s1">'tiramisu'</span><span class="p">])</span> <span class="k">AS</span> <span class="n">orderItem</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">numOrders</span>
<span class="k">FROM</span> <span class="nv">"ristorante"</span>
<span class="k">WHERE</span> <span class="n">MV_OVERLAP</span><span class="p">(</span><span class="n">orders</span><span class="p">,</span> <span class="n">ARRAY</span><span class="p">[</span><span class="s1">'espresso'</span><span class="p">,</span> <span class="s1">'tiramisu'</span><span class="p">])</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span><span class="mi">2</span>
</code></pre></div></div>
<p><img src="/assets/2023-04-23-06.jpg" alt="" /></p>
<ul>
<li><code class="language-plaintext highlighter-rouge">MV_OVERLAP</code> returns 1 when both array arguments have any elements in common, meaning it can be used to model an <code class="language-plaintext highlighter-rouge">OR</code> condition which is true if any of the filter elements is in the data column.</li>
<li>Likewise, <code class="language-plaintext highlighter-rouge">MV_CONTAINS</code> returns 1 if <em>all</em> elements of its second parameter array are contained within the first parameter, and can be used to model an <code class="language-plaintext highlighter-rouge">AND</code> condition (see the sketch after this list).</li>
</ul>
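<p>For instance, the <code class="language-plaintext highlighter-rouge">AND</code> variant - all customers whose order includes <em>both</em> espresso and tiramisu - could look like this (a minimal sketch, analogous to the query above):</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- keep only rows whose orders array contains both items
SELECT
  customer,
  COUNT(*) AS numOrders
FROM "ristorante"
WHERE MV_CONTAINS(orders, ARRAY['espresso', 'tiramisu'])
GROUP BY 1
</code></pre></div></div>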
<h2 id="visualizing-it-with-imply-pivot">Visualizing it with Imply Pivot</h2>
<p>Imply Pivot now has an option to enable this strict filtering. If you filter by an MVD, there is an additional checkbox “Hide filtered-out values” that enables the behavior we just built manually with <code class="language-plaintext highlighter-rouge">MV</code> functions.</p>
<p><img src="/assets/2023-04-23-07.jpg" alt="" /></p>
<p>With the checkbox checked, we get the correct result:</p>
<p><img src="/assets/2023-04-23-08.jpg" alt="" /></p>
<p>With the checkbox unchecked, we get the same result as in the beginning - all orders of all people that had Tiramisu:</p>
<p><img src="/assets/2023-04-23-09.jpg" alt="" /></p>
<h2 id="learnings">Learnings</h2>
<ul>
<li>Because of the way implicit unnesting works with Apache Druid, you may be surprised by the result when you filter and group by the same multi-value column.</li>
<li>Strict filtering can be enabled using SQL multi-value functions.</li>
<li><code class="language-plaintext highlighter-rouge">MV_FILTER_ONLY</code> and <code class="language-plaintext highlighter-rouge">MV_FILTER_NONE</code> are used in the <em>projection</em> clause to eliminate unwanted values.</li>
<li><code class="language-plaintext highlighter-rouge">MV_CONTAINS</code> and <code class="language-plaintext highlighter-rouge">MV_OVERLAP</code> are used in the <em>filter</em> clause to eliminate rows that have none of the wanted values at all, and would not be caught in the projection clause.</li>
<li>The two sets of functions usually have to be used together to obtain correct results.</li>
<li>Imply Pivot is able to apply this logic transparently when querying one of its data cubes.</li>
</ul>Druid Sneak Peek: Timeseries Interpolation2023-04-08T00:00:00+02:002023-04-08T00:00:00+02:00/2023/04/08/druid-sneak-peek-timeseries-interpolation<p><img src="/assets/2023-04-08-01-hotandcold.jpg" alt="Druid Cookbook" /></p>
<p>Today I am going to look at another new Druid feature.</p>
<p>This is currently only available in <a href="https://imply.io/download-imply/">Imply Enterprise</a>, which ships with all the features discussed today and comes with a free 30 day trial license. I sure hope it will come to open source Druid too.</p>
<p>In this tutorial, you will</p>
<ul>
<li>ingest a data sample and</li>
<li>run a query to fill in missing values at regular time intervals, using a simple linear interpolation scheme.</li>
</ul>
<p>Why is it cool? It uses</p>
<ul>
<li>the new <code class="language-plaintext highlighter-rouge">UNNEST</code> function, which takes a collection and joins it laterally against the main table</li>
<li>the new <code class="language-plaintext highlighter-rouge">DATE_EXPAND</code> function, which takes a start and end date and a step interval, and creates an array of timestamps, spaced by the step interval, between the start and end points</li>
<li><a href="/2023/03/26/druid-26-sneak-peek-window-functions/">window functions</a>, in our case the <code class="language-plaintext highlighter-rouge">LEAD</code> function to retrieve values from the succeeding row.</li>
</ul>
<h2 id="the-data-sample">The data sample</h2>
<p>Today’s data set is a simple time series of temperature measurements, taken every 6 hours:</p>
<pre><code class="language-csv">date_start,temperature
2023-04-07T00:00:00Z,5
2023-04-07T06:00:00Z,8
2023-04-07T12:00:00Z,14
2023-04-07T18:00:00Z,12
2023-04-08T00:00:00Z,3
2023-04-08T06:00:00Z,6
2023-04-08T12:00:00Z,11
2023-04-08T18:00:00Z,5
</code></pre>
<p>We would like to fill the gaps, interpolating temperature values for each hour between the measurements.</p>
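<p>For instance, between the 5 degree reading at midnight and the 8 degree reading at 06:00, linear interpolation should produce an hourly grid that rises by 0.5 degrees per hour:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2023-04-07T00:00:00Z  5.0   (measured)
2023-04-07T01:00:00Z  5.5
2023-04-07T02:00:00Z  6.0
2023-04-07T03:00:00Z  6.5
2023-04-07T04:00:00Z  7.0
2023-04-07T05:00:00Z  7.5
2023-04-07T06:00:00Z  8.0   (measured)
</code></pre></div></div>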
<p>Let’s ingest the data into Druid.</p>
<p>The ingestion spec is straightforward:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"spec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"ioConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"inputSource"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"inline"</span><span class="p">,</span><span class="w">
</span><span class="nl">"data"</span><span class="p">:</span><span class="w"> </span><span class="s2">"date_start,temperature</span><span class="se">\n</span><span class="s2">2023-04-07T00:00:00Z,5</span><span class="se">\n</span><span class="s2">2023-04-07T06:00:00Z,8</span><span class="se">\n</span><span class="s2">2023-04-07T12:00:00Z,14</span><span class="se">\n</span><span class="s2">2023-04-07T18:00:00Z,12</span><span class="se">\n</span><span class="s2">2023-04-08T00:00:00Z,3</span><span class="se">\n</span><span class="s2">2023-04-08T06:00:00Z,6</span><span class="se">\n</span><span class="s2">2023-04-08T12:00:00Z,11</span><span class="se">\n</span><span class="s2">2023-04-08T18:00:00Z,5"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"inputFormat"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"csv"</span><span class="p">,</span><span class="w">
</span><span class="nl">"findColumnsFromHeader"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"tuningConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"partitionsSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"dynamic"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dataSchema"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"dataSource"</span><span class="p">:</span><span class="w"> </span><span class="s2">"iot_data"</span><span class="p">,</span><span class="w">
</span><span class="nl">"timestampSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"column"</span><span class="p">:</span><span class="w"> </span><span class="s2">"date_start"</span><span class="p">,</span><span class="w">
</span><span class="nl">"format"</span><span class="p">:</span><span class="w"> </span><span class="s2">"iso"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"granularitySpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"queryGranularity"</span><span class="p">:</span><span class="w"> </span><span class="s2">"none"</span><span class="p">,</span><span class="w">
</span><span class="nl">"rollup"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"segmentGranularity"</span><span class="p">:</span><span class="w"> </span><span class="s2">"month"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dimensionsSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"dimensions"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"double"</span><span class="p">,</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"temperature"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<h2 id="query-the-data">Query the data</h2>
<p>Both the window functions and the <code class="language-plaintext highlighter-rouge">UNNEST</code> function are currently hidden behind context flags. Use the following query context:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"windowsAreForClosers"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"enableUnnest"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>With that, here is the query:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="n">cte</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">__time</span> <span class="k">AS</span> <span class="n">thisTime</span><span class="p">,</span>
<span class="n">temperature</span><span class="p">,</span>
<span class="n">LEAD</span><span class="p">(</span><span class="n">__time</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="n">__time</span><span class="p">)</span> <span class="n">nextTime</span><span class="p">,</span>
<span class="n">LEAD</span><span class="p">(</span><span class="n">temperature</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="n">__time</span><span class="p">)</span> <span class="n">nextTemperature</span>
<span class="k">FROM</span> <span class="nv">"iot_data"</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span><span class="mi">2</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">timeByHour</span><span class="p">,</span>
<span class="k">CASE</span> <span class="p">(</span><span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">nextTime</span><span class="p">)</span> <span class="o">-</span> <span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">thisTime</span><span class="p">))</span>
<span class="k">WHEN</span> <span class="mi">0</span> <span class="k">THEN</span> <span class="n">temperature</span>
<span class="k">ELSE</span> <span class="p">((</span><span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">nextTime</span><span class="p">)</span> <span class="o">-</span> <span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">timeByHour</span><span class="p">))</span> <span class="o">*</span> <span class="n">temperature</span>
<span class="o">+</span> <span class="p">(</span><span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">timeByHour</span><span class="p">)</span> <span class="o">-</span> <span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">thisTime</span><span class="p">))</span> <span class="o">*</span> <span class="n">nextTemperature</span><span class="p">)</span>
<span class="o">/</span> <span class="p">(</span><span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">nextTime</span><span class="p">)</span> <span class="o">-</span> <span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">thisTime</span><span class="p">))</span>
<span class="k">END</span> <span class="n">interpTemp</span>
<span class="k">FROM</span> <span class="n">cte</span><span class="p">,</span> <span class="k">UNNEST</span><span class="p">(</span><span class="n">DATE_EXPAND</span><span class="p">(</span><span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">thisTime</span><span class="p">),</span> <span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">NVL</span><span class="p">(</span><span class="n">nextTime</span><span class="p">,</span> <span class="n">thisTime</span><span class="p">)),</span> <span class="s1">'PT1H'</span><span class="p">))</span> <span class="k">AS</span> <span class="n">t</span><span class="p">(</span><span class="n">timeByHour</span><span class="p">)</span>
<span class="k">WHERE</span> <span class="n">timeByHour</span> <span class="o"><></span> <span class="n">nextTime</span>
</code></pre></div></div>
<p>It uses the common table expression technique that already came in handy last time.</p>
<p>Here is the result:</p>
<p><img src="/assets/2023-04-08-02.jpg" alt="query result" /></p>
<p>As you can see in the last column, the values have been neatly interpolated.</p>
<h2 id="side-quests">Side Quests</h2>
<p>It is worth looking at some details of the query. Some of these are common SQL techniques, others are due to quirks in the Druid query engine.</p>
<h3 id="the-date-expansion">The date expansion</h3>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DATE_EXPAND</span><span class="p">(</span><span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">thisTime</span><span class="p">),</span> <span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">NVL</span><span class="p">(</span><span class="n">nextTime</span><span class="p">,</span> <span class="n">thisTime</span><span class="p">)),</span> <span class="s1">'PT1H'</span><span class="p">)</span>
</code></pre></div></div>
<p>The general syntax would be <code class="language-plaintext highlighter-rouge">DATE_EXPAND(from, to, interval)</code>. But since we are using <code class="language-plaintext highlighter-rouge">LEAD()</code> to get the <code class="language-plaintext highlighter-rouge">to</code> value, the last row will have <em>null</em> in that place. Unfortunately, <code class="language-plaintext highlighter-rouge">DATE_EXPAND</code> doesn’t handle that situation well and the query fails. That’s why in the case of a <em>null</em> value, I use the row time instead, generating only one row.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WHERE</span> <span class="n">timeByHour</span> <span class="o"><></span> <span class="n">nextTime</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">DATE_EXPAND</code> considers its time interval as left and right inclusive. This means that the end values will be duplicated with the start values of the next interval. The <code class="language-plaintext highlighter-rouge">WHERE</code> clause filters out the duplicates.</p>
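<p>Schematically, for two adjacent measurement rows (the real function arguments are millisecond timestamps):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DATE_EXPAND(00:00, 06:00, 'PT1H')  ->  00:00 01:00 02:00 03:00 04:00 05:00 06:00
DATE_EXPAND(06:00, 12:00, 'PT1H')  ->  06:00 07:00 08:00 09:00 10:00 11:00 12:00
                                       ^^^^^ appears twice, removed by the WHERE clause
</code></pre></div></div>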
<h3 id="the-interpolation">The interpolation</h3>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">CASE</span> <span class="p">(</span><span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">nextTime</span><span class="p">)</span> <span class="o">-</span> <span class="n">TIMESTAMP_TO_MILLIS</span><span class="p">(</span><span class="n">thisTime</span><span class="p">))</span>
<span class="k">WHEN</span> <span class="mi">0</span> <span class="k">THEN</span> <span class="n">temperature</span> <span class="p">...</span>
</code></pre></div></div>
<p>The general formula for linear interpolation has to divide by the length of the time interval between the two measurements. If this interval is 0 - which happens for the last row, where the corner case treatment above substituted the row’s own time - just use the one temperature value that is provided.</p>
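<p>Spelled out with the column names from the CTE, the interpolated value at a grid point <code class="language-plaintext highlighter-rouge">timeByHour</code> is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>interpTemp = ( (nextTime - timeByHour) * temperature
             + (timeByHour - thisTime) * nextTemperature )
             / (nextTime - thisTime)
</code></pre></div></div>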
<h2 id="conclusion">Conclusion</h2>
<p>Druid’s timeseries capabilities are ever expanding.</p>
<ul>
<li>With <code class="language-plaintext highlighter-rouge">DATE_EXPAND</code> and <code class="language-plaintext highlighter-rouge">UNNEST</code>, it is possible to generate evenly spaced time series.</li>
<li>Using window functions and standard interpolation algorithms, this can be used to fill in missing values.</li>
<li>Currently this is only available in Imply’s release.</li>
</ul>
<hr />
<p class="attribution">"<a target="_blank" rel="noopener noreferrer" href="https://www.flickr.com/photos/53575715@N02/6620214217">Hot & Cold</a>" by <a target="_blank" rel="noopener noreferrer" href="https://www.flickr.com/photos/53575715@N02">astronomy_blog</a> is licensed under <a target="_blank" rel="noopener noreferrer" href="https://creativecommons.org/licenses/by-nc-sa/2.0/?ref=openverse">CC BY-NC-SA 2.0
<img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" />
<img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="height: 1em; margin-right: 0.125em; display: inline;" />
<img src="https://mirrors.creativecommons.org/presskit/icons/nc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" />
<img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /></a>. </p>Druid 26 Sneak Peek: Window Functions2023-03-26T00:00:00+01:002023-03-26T00:00:00+01:00/2023/03/26/druid-26-sneak-peek-window-functions<p><img src="/assets/2021-12-21-elf.jpg" alt="Druid Cookbook" /></p>
<p><a href="https://www.linkedin.com/feed/update/urn:li:activity:7043593237915148288/">Great changes have been announced for the upcoming Druid 26.0 release.</a> The one that excites me the most is the introduction of <a href="https://github.com/paul-rogers/druid/wiki/Window-Functions">window functions</a>.</p>
<p>Window functions allow a query to interrelate and aggregate rows beyond a simple <code class="language-plaintext highlighter-rouge">GROUP BY</code>. <a href="/2022/11/05/druid-data-cookbook-cumulative-sums-in-druid-sql/">Previously</a>, I have looked at ways to emulate such processing patterns using self joins or grouping sets in Druid. But now, we are close to getting window functions as first class citizens.</p>
<p>This is a sneak peek into Druid 26 functionality. In order to use the new functions, you can (as of the time of writing) <a href="https://druid.apache.org/docs/latest/development/build.html">build Druid</a> from the HEAD of the master branch:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/apache/druid.git
<span class="nb">cd </span>druid
mvn clean <span class="nb">install</span> <span class="nt">-Pdist</span> <span class="nt">-DskipTests</span>
</code></pre></div></div>
<p>Then follow the instructions to locate and install the tarball.</p>
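<p>A minimal sketch of that step - the tarball is produced under <code class="language-plaintext highlighter-rouge">distribution/target</code>, but the exact file name depends on the snapshot version being built:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># extract the freshly built distribution tarball and switch into it
tar -xzf distribution/target/apache-druid-*-bin.tar.gz
cd apache-druid-*/
</code></pre></div></div>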
<p>All this is still under development so it is undocumented, and hidden behind a secret query context option. (We will look at that in a moment.) Also notice that window functions only work within <code class="language-plaintext highlighter-rouge">GROUP BY</code> queries, and there are still some other limitations. But it is fast progressing work.</p>
<p>In this tutorial, you will</p>
<ul>
<li>ingest a data sample and</li>
<li>do a quick cumulative report using window functions.</li>
</ul>
<p><em><strong>Disclaimer:</strong> This tutorial uses undocumented functionality and unreleased code. This blog is neither endorsed by Imply nor by the Apache Druid PMC. It merely collects the results of personal experiments. The features described here might, in the final release, work differently, or not at all. In addition, the entire build, or execution, may fail. Your mileage may vary.</em></p>
<h2 id="lets-do-it-in-practice">Let’s do it in practice</h2>
<p>I am taking a data sample from <a href="https://www.tinybird.co/blog-posts/coming-soon-on-clickhouse-window-functions">the Tinybird blog</a> which is simulated data from an ecommerce store. The data is downloadable from <a href="https://storage.googleapis.com/tinybird-assets/datasets/guides/events_10K.csv">here</a> and has a straightforward format:</p>
<ul>
<li>a <em>timestamp</em></li>
<li>string fields for <em>product id, user id,</em> and <em>event type</em></li>
<li>an <em>extra data</em> field: this is a variable JSON object whose schema depends on the event type.</li>
</ul>
<p>Let’s see if we can do some interesting things with this!</p>
<h2 id="ingestion">Ingestion</h2>
<p>Ingest the data using <a href="https://druid.apache.org/docs/latest/multi-stage-query/index.html">SQL based ingestion</a>. In order to keep the <code class="language-plaintext highlighter-rouge">extra_data</code> column as nested JSON, apply the <code class="language-plaintext highlighter-rouge">PARSE_JSON</code> function in the ingestion query:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">REPLACE</span> <span class="k">INTO</span> <span class="nv">"events"</span> <span class="n">OVERWRITE</span> <span class="k">ALL</span>
<span class="k">WITH</span> <span class="nv">"ext"</span> <span class="k">AS</span> <span class="p">(</span><span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span>
<span class="n">EXTERN</span><span class="p">(</span>
<span class="s1">'{"type":"http","uris":["https://storage.googleapis.com/tinybird-assets/datasets/guides/events_10K.csv"]}'</span><span class="p">,</span>
<span class="s1">'{"type":"csv","findColumnsFromHeader":false,"columns":["date","product_id","user_id","event","extra_data"]}'</span><span class="p">,</span>
<span class="s1">'[{"name":"date","type":"string"},{"name":"product_id","type":"string"},{"name":"user_id","type":"long"},{"name":"event","type":"string"},{"name":"extra_data","type":"string"}]'</span>
<span class="p">)</span>
<span class="p">))</span>
<span class="k">SELECT</span>
<span class="n">TIME_PARSE</span><span class="p">(</span><span class="nv">"date"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"__time"</span><span class="p">,</span>
<span class="nv">"product_id"</span><span class="p">,</span>
<span class="nv">"user_id"</span><span class="p">,</span>
<span class="nv">"event"</span><span class="p">,</span>
<span class="n">PARSE_JSON</span><span class="p">(</span><span class="nv">"extra_data"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"extra_data"</span>
<span class="k">FROM</span> <span class="nv">"ext"</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="k">MONTH</span>
</code></pre></div></div>
<p>You can run this in the query tab of the Druid console like so:</p>
<p><img src="/assets/2023-03-26-01-ingest.jpg" alt="MSQ ingestion of data sample" /></p>
<p>or you can enter the same SQL in the SQL ingestion wizard and monitor progress in the ingestion tab.</p>
<h2 id="looking-at-the-data">Looking at the data</h2>
<p>Let’s get an idea of the amount of data in there. One of the neat things in the Druid console is that it has the queries for these basic aggregations in the context menu for each datasource in the query window:</p>
<p><img src="/assets/2023-03-26-02-selectminmaxtime.jpg" width="50%" /></p>
<p>This gives us a quick query for the date range of the sample</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="k">MIN</span><span class="p">(</span><span class="nv">"__time"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"min_time"</span><span class="p">,</span>
<span class="k">MAX</span><span class="p">(</span><span class="nv">"__time"</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"max_time"</span>
<span class="k">FROM</span> <span class="nv">"events"</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="p">()</span>
</code></pre></div></div>
<p>which shows that the data spans more than 3 years (2017-2020).</p>
<p>This is why I chose monthly time partitions - given the small size of the sample, yearly would also work well.</p>
<p>Look at the data with a <code class="language-plaintext highlighter-rouge">SELECT * FROM "events"</code> query:</p>
<p><img src="/assets/2023-03-26-03-selectstar.jpg" alt="Select all data" /></p>
<p>We are interested in <code class="language-plaintext highlighter-rouge">buy</code> events: for these, the amount of the purchase is in the <code class="language-plaintext highlighter-rouge">price</code> subfield that we can extract with <code class="language-plaintext highlighter-rouge">JSON_VALUE</code>. One of the latest additions in Druid is that you can specify the expected return type inside the function call like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>JSON_VALUE(extra_data, '$.price' RETURNING DOUBLE)
</code></pre></div></div>
<p>Thus we guarantee that we get only <code class="language-plaintext highlighter-rouge">DOUBLE</code> values.</p>
<h2 id="building-the-report">Building the report</h2>
<p>I would like to get a report like this: For each day, give me</p>
<ul>
<li>the number of purchase transactions for that day</li>
<li>the cumulative number of transactions from all history up to and including that day</li>
<li>the total revenue of that day</li>
<li>the total revenue up to and including that day.</li>
</ul>
<h3 id="using-a-cte-to-prepare-the-fields">Using a CTE to prepare the fields</h3>
<p>In order to prepare that report, let’s first collect the fields we need in a <em><a href="https://learnsql.com/blog/what-is-common-table-expression/">common table expression (CTE)</a>:</em></p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">SELECT</span>
<span class="n">FLOOR</span><span class="p">(</span><span class="n">__time</span> <span class="k">TO</span> <span class="k">DAY</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"date"</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">purchases</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="n">JSON_VALUE</span><span class="p">(</span><span class="n">extra_data</span><span class="p">,</span> <span class="s1">'$.price'</span> <span class="n">RETURNING</span> <span class="nb">DOUBLE</span><span class="p">))</span> <span class="k">AS</span> <span class="n">revenue</span>
<span class="k">FROM</span> <span class="nv">"events"</span>
<span class="k">WHERE</span> <span class="n">event</span> <span class="o">=</span> <span class="s1">'buy'</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span>
</code></pre></div></div>
<p>Here, we filter the data, extract the <code class="language-plaintext highlighter-rouge">price</code> field, and group everything by day. We will package that into a <code class="language-plaintext highlighter-rouge">WITH</code> clause that defines the input for the main query.</p>
<h3 id="setting-the-context-flag-to-enable-experimental-window-functions">Setting the context flag to enable experimental window functions</h3>
<p>From the menu next to the <code class="language-plaintext highlighter-rouge">Run</code> button, select <code class="language-plaintext highlighter-rouge">Edit Context</code></p>
<p><img src="/assets/2023-03-26-04-editcontext.jpg" width="50%" /></p>
<p>and enter the option <code class="language-plaintext highlighter-rouge">"windowsAreForClosers": true</code> to enable window functions:</p>
<p><img src="/assets/2023-03-26-05-contextoption.png" width="50%" /></p>
<p>You could also specify the context when running the query through the <a href="https://druid.apache.org/docs/latest/querying/sql-api.html">REST API endpoint</a> (unfortunately not yet through JDBC).</p>
<h3 id="putting-the-query-together">Putting the query together</h3>
<p>Now we have everything we need. The cumulative sums will be computed using a window clause like this:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SUM</span><span class="p">(</span><span class="n">purchases</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="nv">"date"</span> <span class="k">ASC</span> <span class="k">ROWS</span> <span class="k">BETWEEN</span> <span class="n">UNBOUNDED</span> <span class="k">PRECEDING</span> <span class="k">AND</span> <span class="k">CURRENT</span> <span class="k">ROW</span><span class="p">)</span>
</code></pre></div></div>
<p>where the daily sums have been computed by the <code class="language-plaintext highlighter-rouge">GROUP BY</code> in the CTE, and the window aggregation does the cumulative sums.</p>
<p>Here is the whole query:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="n">cte</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">FLOOR</span><span class="p">(</span><span class="n">__time</span> <span class="k">TO</span> <span class="k">DAY</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">"date"</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">purchases</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="n">JSON_VALUE</span><span class="p">(</span><span class="n">extra_data</span><span class="p">,</span> <span class="s1">'$.price'</span> <span class="n">RETURNING</span> <span class="nb">DOUBLE</span><span class="p">))</span> <span class="k">AS</span> <span class="n">revenue</span>
<span class="k">FROM</span> <span class="nv">"events"</span>
<span class="k">WHERE</span> <span class="n">event</span> <span class="o">=</span> <span class="s1">'buy'</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="nv">"date"</span><span class="p">,</span>
<span class="n">purchases</span> <span class="k">AS</span> <span class="n">daily_purchases</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="n">purchases</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="nv">"date"</span> <span class="k">ASC</span> <span class="k">ROWS</span> <span class="k">BETWEEN</span> <span class="n">UNBOUNDED</span> <span class="k">PRECEDING</span> <span class="k">AND</span> <span class="k">CURRENT</span> <span class="k">ROW</span><span class="p">)</span> <span class="k">AS</span> <span class="n">cume_purchases</span><span class="p">,</span>
<span class="n">revenue</span> <span class="k">AS</span> <span class="n">daily_revenue</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="n">revenue</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="nv">"date"</span> <span class="k">ASC</span> <span class="k">ROWS</span> <span class="k">BETWEEN</span> <span class="n">UNBOUNDED</span> <span class="k">PRECEDING</span> <span class="k">AND</span> <span class="k">CURRENT</span> <span class="k">ROW</span><span class="p">)</span> <span class="k">AS</span> <span class="n">cume_revenue</span>
<span class="k">FROM</span> <span class="n">cte</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="mi">1</span> <span class="k">ASC</span>
</code></pre></div></div>
<p>You can run it in the console:</p>
<p><img src="/assets/2023-03-26-06-query.jpg" alt="Window query in Druid console" /></p>
<p>The columns named <em>cume…</em> contain the result of the window aggregations.</p>
<p>And using the <code class="language-plaintext highlighter-rouge">Explain</code> function, notice that this SQL actually translates to a new native query type:</p>
<p><img src="/assets/2023-03-26-07-nativequery.jpg" width="70%" /></p>
<h2 id="conclusion">Conclusion</h2>
<ul>
<li>If you take a sneak peek at the public Druid repository, you can follow the work that is being done on window functions. While these are currently a bit rough around the edges, you can already do quite a bit with this new functionality.</li>
<li>Because it is work in progress, this is currently undocumented and hidden behind a feature flag that needs to be enabled in the query context for each query that uses it.</li>
<li>This is evolving rapidly and will likely see a lot of enhancements very soon.</li>
</ul>
<p><em>Edit 2023-03-27:</em> One of my readers pointed out a simplification of the query - the first version carried a redundant <code class="language-plaintext highlighter-rouge">GROUP BY</code> in the final query, but it turns out that Druid is smart enough to plan a grouped (timeseries) query based on the grouping in the CTE. This is reflected above now.</p>
<hr />
<p>“<a href="https://www.flickr.com/photos/mhlimages/48051262646/">This image is taken from Page 500 of Praktisches Kochbuch für die gewöhnliche und feinere Küche</a>” by <a href="https://www.flickr.com/photos/mhlimages/">Medical Heritage Library, Inc.</a> is licensed under <a target="_blank" rel="noopener noreferrer" href="https://creativecommons.org/licenses/by-nc-sa/2.0/">CC BY-NC-SA 2.0 <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/nc.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /><img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg" style="height: 1em; margin-right: 0.125em; display: inline;" /></a>.</p>Selective Bulk Upserts in Apache Druid2023-03-07T00:00:00+01:002023-03-07T00:00:00+01:00/2023/03/07/selective-bulk-upserts-in-apache-druid<p><a href="https://druid.apache.org/">Apache Druid</a> is designed for high query speed. The <a href="https://druid.apache.org/docs/latest/design/segments.html">data segments</a> that make up a Druid datasource (think: table) are generally immutable: You do not update or replace individual rows of data; however you can replace an entire segment with a new version of itself.</p>
<p>Sometimes in analytics, you have to update or insert rows of data in a segment. This may be due to a state change - such as an order being shipped, or canceled, or returned. Generally, you would have a <em>key</em> column in your data, and based on that key you would update a row if it exists in the table already, and insert it otherwise. This is called <code class="language-plaintext highlighter-rouge">upsert</code>, after the name of the command that is used in many SQL dialects.</p>
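<p>For reference, here is what the pattern looks like in a dialect that supports it natively - PostgreSQL in this sketch; the table and column names are made up for illustration:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- update the row with key 4711 if it exists, insert it otherwise
INSERT INTO orders (order_id, status, amount)
VALUES (4711, 'shipped', 42.50)
ON CONFLICT (order_id)
DO UPDATE SET status = EXCLUDED.status, amount = EXCLUDED.amount;
</code></pre></div></div>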
<p><a href="https://imply.io/blog/upserts-and-data-deduplication-with-druid/">This Imply blog</a> talks about the various strategies to handle such scenarios with Druid. But today, I want to look at a special case of Upsert, where you want to update or insert a bunch of rows based on a key and time interval.</p>
<h2 id="the-use-case">The use case</h2>
<p>I encountered this scenario with some of my AdTech customers. They obtain performance analytics data by issuing API calls to the ad network providers. These API calls have to cover certain predefined time ranges - data is downloaded in bulk. Moreover, depending on factors like late arriving conversion data or changes of the attribution model, metrics associated with the data rows may change over time.</p>
<p>If we want to make these data available in Druid, we will have to cut out existing data by key and interval, and transplant the new data instead, like in this diagram:</p>
<p><img src="/assets/2023-03-07-01.png" alt="Combining ingestion" /></p>
<h2 id="solution-outline">Solution outline</h2>
<p>In order to achieve this behavior in Druid, we will use a <a href="https://druid.apache.org/docs/latest/ingestion/native-batch-input-sources.html#combining-input-source"><code class="language-plaintext highlighter-rouge">combining</code> input source</a> in the ingestion spec. A combining input source contains a list of delegate input sources - we will use two, but you can actually have more than two.</p>
<p>The ingestion process will read data from all delegate input sources and ingest them, much like what a <code class="language-plaintext highlighter-rouge">union all</code> in SQL does. The nice thing is that this process is transactional - it will succeed either completely, or not at all.</p>
<p>We have to make sure that all input sources have the same schema and, where that applies, the same input format. In practice this means:</p>
<ul>
<li>you can combine multiple external sources only if they are all parsed in the same way</li>
<li>or you can combine external sources like above with any number of <code class="language-plaintext highlighter-rouge">druid</code> input sources (reindexing).</li>
</ul>
<p>The latter is what we are going to do.</p>
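<p>Before we build the real spec, here is a minimal sketch of the shape of such a combining <code class="language-plaintext highlighter-rouge">inputSource</code> fragment - all field values here are placeholders:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"inputSource": {
  "type": "combining",
  "delegates": [
    { "type": "druid", "dataSource": "ad_data", "interval": "2023-01-01/2023-01-08" },
    { "type": "local", "baseDir": "/path/to/data", "filter": "data2.json" }
  ]
}
</code></pre></div></div>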
<h2 id="tutorial-how-to-do-it-in-practise">Tutorial: How to do it in practice</h2>
<p>In this tutorial, we will set up a bulk upsert using the combining input source technique and two stripped down sample data sets.</p>
<p>We will:</p>
<ul>
<li>load an initial data sample for multiple ad networks</li>
<li>show the upsert technique by replacing data for one network and a specific date range.</li>
</ul>
<p>The tutorial can be done using the <a href="https://druid.apache.org/docs/latest/tutorials/index.html">Druid 25.0 quickstart</a>.</p>
<p>Note: Because the tutorial assumes that you are running all Druid processes on a single machine, it can work with local file system data. In a cluster setup, you would have to use a network mount or (more commonly) cloud storage, like S3.</p>
<h3 id="initial-load">Initial load</h3>
<p>The first data sample serves to populate the table. It has one week’s worth of data from three ad networks:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{"date": "2023-01-01T00:00:00Z", "ad_network": "gaagle", "ads_impressions": 2770, "ads_revenue": 330.69}
{"date": "2023-01-01T00:00:00Z", "ad_network": "fakebook", "ads_impressions": 9646, "ads_revenue": 137.85}
{"date": "2023-01-01T00:00:00Z", "ad_network": "twottr", "ads_impressions": 1139, "ads_revenue": 493.73}
{"date": "2023-01-02T00:00:00Z", "ad_network": "gaagle", "ads_impressions": 9066, "ads_revenue": 368.66}
{"date": "2023-01-02T00:00:00Z", "ad_network": "fakebook", "ads_impressions": 4426, "ads_revenue": 170.96}
{"date": "2023-01-02T00:00:00Z", "ad_network": "twottr", "ads_impressions": 9110, "ads_revenue": 452.2}
{"date": "2023-01-03T00:00:00Z", "ad_network": "gaagle", "ads_impressions": 3275, "ads_revenue": 363.53}
{"date": "2023-01-03T00:00:00Z", "ad_network": "fakebook", "ads_impressions": 9494, "ads_revenue": 426.37}
{"date": "2023-01-03T00:00:00Z", "ad_network": "twottr", "ads_impressions": 4325, "ads_revenue": 107.44}
{"date": "2023-01-04T00:00:00Z", "ad_network": "gaagle", "ads_impressions": 8816, "ads_revenue": 311.53}
{"date": "2023-01-04T00:00:00Z", "ad_network": "fakebook", "ads_impressions": 8955, "ads_revenue": 254.5}
{"date": "2023-01-04T00:00:00Z", "ad_network": "twottr", "ads_impressions": 6905, "ads_revenue": 211.74}
{"date": "2023-01-05T00:00:00Z", "ad_network": "gaagle", "ads_impressions": 3075, "ads_revenue": 382.41}
{"date": "2023-01-05T00:00:00Z", "ad_network": "fakebook", "ads_impressions": 4870, "ads_revenue": 205.84}
{"date": "2023-01-05T00:00:00Z", "ad_network": "twottr", "ads_impressions": 1418, "ads_revenue": 282.21}
{"date": "2023-01-06T00:00:00Z", "ad_network": "gaagle", "ads_impressions": 7413, "ads_revenue": 322.43}
{"date": "2023-01-06T00:00:00Z", "ad_network": "fakebook", "ads_impressions": 1251, "ads_revenue": 265.52}
{"date": "2023-01-06T00:00:00Z", "ad_network": "twottr", "ads_impressions": 8055, "ads_revenue": 394.56}
{"date": "2023-01-07T00:00:00Z", "ad_network": "gaagle", "ads_impressions": 4279, "ads_revenue": 317.84}
{"date": "2023-01-07T00:00:00Z", "ad_network": "fakebook", "ads_impressions": 5848, "ads_revenue": 162.96}
{"date": "2023-01-07T00:00:00Z", "ad_network": "twottr", "ads_impressions": 9449, "ads_revenue": 379.21}
</code></pre></div></div>
<p>Save this sample locally to a file named <code class="language-plaintext highlighter-rouge">data1.json</code> and ingest it using this ingestion spec (replace the path in <code class="language-plaintext highlighter-rouge">baseDir</code> with the path you saved the sample file to):</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"spec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"ioConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"inputSource"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"local"</span><span class="p">,</span><span class="w">
</span><span class="nl">"baseDir"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/<my base path>"</span><span class="p">,</span><span class="w">
</span><span class="nl">"filter"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data1.json"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"inputFormat"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"json"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"appendToExisting"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"tuningConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"partitionsSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"hashed"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"forceGuaranteedRollup"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dataSchema"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"dataSource"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ad_data"</span><span class="p">,</span><span class="w">
</span><span class="nl">"timestampSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"column"</span><span class="p">:</span><span class="w"> </span><span class="s2">"date"</span><span class="p">,</span><span class="w">
</span><span class="nl">"format"</span><span class="p">:</span><span class="w"> </span><span class="s2">"iso"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dimensionsSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"dimensions"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"ad_network"</span><span class="p">,</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"long"</span><span class="p">,</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ads_impressions"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ads_revenue"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"double"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"granularitySpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"queryGranularity"</span><span class="p">:</span><span class="w"> </span><span class="s2">"none"</span><span class="p">,</span><span class="w">
</span><span class="nl">"rollup"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"segmentGranularity"</span><span class="p">:</span><span class="w"> </span><span class="s2">"week"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
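<p>If you prefer the API over the console, you can submit this spec directly to the task endpoint. A minimal sketch, assuming the quickstart router listens on <code class="language-plaintext highlighter-rouge">localhost:8888</code> and the spec is saved as <code class="language-plaintext highlighter-rouge">ingest1.json</code> (the file name is my choice):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -X POST http://localhost:8888/druid/indexer/v1/task \
  -H "Content-Type: application/json" \
  -d @ingest1.json
</code></pre></div></div>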
<p>You can create this ingestion spec by clicking through the console wizard, too. There are a few notable settings here, though:</p>
<ul>
<li>I’ve used hash partitioning, which by default hashes over all dimensions. The default in the wizard is dynamic partitioning, but with batch data you would usually choose dynamic partitioning only if you want to append data to an existing data set. In all other cases, use hash or range partitioning (see the sketch after this list).</li>
<li>I’ve configured weekly segments. This is to show that the technique works even if the updated range does not align with segment boundaries.</li>
</ul>
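<p>For reference, here is a sketch of what a range <code class="language-plaintext highlighter-rouge">partitionsSpec</code> could look like; the partition dimension and the row target are illustrative values, not recommendations:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  "partitionsSpec": {
    "type": "range",
    "partitionDimensions": ["ad_network"],
    "targetRowsPerSegment": 5000000
  }
</code></pre></div></div>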
<h3 id="doing-the-upsert">Doing the upsert</h3>
<p>Now, let’s fast-forward two days in time. We have downloaded a bunch of new and updated data from the <code class="language-plaintext highlighter-rouge">gaagle</code> network. The new data looks like this:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-03T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">4521</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">378.65</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-04T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">4330</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">464.02</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-05T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">6088</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">320.57</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-06T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">3417</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">162.77</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-07T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">9762</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">76.27</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-08T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">1484</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">188.17</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-01-09T00:00:00Z"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ad_network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_impressions"</span><span class="p">:</span><span class="w"> </span><span class="mi">1845</span><span class="p">,</span><span class="w"> </span><span class="nl">"ads_revenue"</span><span class="p">:</span><span class="w"> </span><span class="mf">287.5</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Save this sample as <code class="language-plaintext highlighter-rouge">data2.json</code> and proceed to replace/insert the new data using this spec:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"spec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"ioConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"inputSource"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"combining"</span><span class="p">,</span><span class="w">
</span><span class="nl">"delegates"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"druid"</span><span class="p">,</span><span class="w">
</span><span class="nl">"dataSource"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ad_data"</span><span class="p">,</span><span class="w">
</span><span class="nl">"interval"</span><span class="p">:</span><span class="w"> </span><span class="s2">"1000/3000"</span><span class="p">,</span><span class="w">
</span><span class="nl">"filter"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"not"</span><span class="p">,</span><span class="w">
</span><span class="nl">"field"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"and"</span><span class="p">,</span><span class="w">
</span><span class="nl">"fields"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"selector"</span><span class="p">,</span><span class="w">
</span><span class="nl">"dimension"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ad_network"</span><span class="p">,</span><span class="w">
</span><span class="nl">"value"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gaagle"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"interval"</span><span class="p">,</span><span class="w">
</span><span class="nl">"dimension"</span><span class="p">:</span><span class="w"> </span><span class="s2">"__time"</span><span class="p">,</span><span class="w">
</span><span class="nl">"intervals"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"2023-01-03T00:00:00Z/2023-01-10T00:00:00Z"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"extractionFn"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"local"</span><span class="p">,</span><span class="w">
</span><span class="nl">"files"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"/<my base path>/data2.json"</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"inputFormat"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"json"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"tuningConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index_parallel"</span><span class="p">,</span><span class="w">
</span><span class="nl">"partitionsSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"hashed"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"forceGuaranteedRollup"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"maxNumConcurrentSubTasks"</span><span class="p">:</span><span class="w"> </span><span class="mi">2</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dataSchema"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"timestampSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"column"</span><span class="p">:</span><span class="w"> </span><span class="s2">"__time"</span><span class="p">,</span><span class="w">
</span><span class="nl">"missingValue"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2010-01-01T00:00:00Z"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"transformSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"transforms"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"__time"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"expression"</span><span class="p">,</span><span class="w">
</span><span class="nl">"expression"</span><span class="p">:</span><span class="w"> </span><span class="s2">"nvl(timestamp_parse(date), </span><span class="se">\"</span><span class="s2">__time</span><span class="se">\"</span><span class="s2">)"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"granularitySpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"rollup"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"queryGranularity"</span><span class="p">:</span><span class="w"> </span><span class="s2">"none"</span><span class="p">,</span><span class="w">
</span><span class="nl">"segmentGranularity"</span><span class="p">:</span><span class="w"> </span><span class="s2">"week"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dimensionsSpec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"dimensions"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ad_network"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ads_impressions"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"long"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ads_revenue"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"double"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dataSource"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ad_data"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Here’s the result of a <code class="language-plaintext highlighter-rouge">SELECT *</code> query after the ingestion finishes:</p>
<table>
<thead>
<tr>
<th>__time</th>
<th>ad_network</th>
<th>ads_impressions</th>
<th>ads_revenue</th>
</tr>
</thead>
<tbody>
<tr>
<td>2023-01-01T00:00:00.000Z</td>
<td>fakebook</td>
<td>9646</td>
<td>137.85</td>
</tr>
<tr>
<td>2023-01-01T00:00:00.000Z</td>
<td>gaagle</td>
<td>2770</td>
<td>330.69</td>
</tr>
<tr>
<td>2023-01-01T00:00:00.000Z</td>
<td>twottr</td>
<td>1139</td>
<td>493.73</td>
</tr>
<tr>
<td>2023-01-02T00:00:00.000Z</td>
<td>fakebook</td>
<td>4426</td>
<td>170.96</td>
</tr>
<tr>
<td>2023-01-02T00:00:00.000Z</td>
<td>gaagle</td>
<td>9066</td>
<td>368.66</td>
</tr>
<tr>
<td>2023-01-02T00:00:00.000Z</td>
<td>twottr</td>
<td>9110</td>
<td>452.2</td>
</tr>
<tr>
<td>2023-01-03T00:00:00.000Z</td>
<td>fakebook</td>
<td>9494</td>
<td>426.37</td>
</tr>
<tr>
<td><em>2023-01-03T00:00:00.000Z</em></td>
<td><em>gaagle</em></td>
<td><em>4521</em></td>
<td><em>378.65</em></td>
</tr>
<tr>
<td>2023-01-03T00:00:00.000Z</td>
<td>twottr</td>
<td>4325</td>
<td>107.44</td>
</tr>
<tr>
<td>2023-01-04T00:00:00.000Z</td>
<td>fakebook</td>
<td>8955</td>
<td>254.5</td>
</tr>
<tr>
<td><em>2023-01-04T00:00:00.000Z</em></td>
<td><em>gaagle</em></td>
<td><em>4330</em></td>
<td><em>464.02</em></td>
</tr>
<tr>
<td>2023-01-04T00:00:00.000Z</td>
<td>twottr</td>
<td>6905</td>
<td>211.74</td>
</tr>
<tr>
<td>2023-01-05T00:00:00.000Z</td>
<td>fakebook</td>
<td>4870</td>
<td>205.84</td>
</tr>
<tr>
<td><em>2023-01-05T00:00:00.000Z</em></td>
<td><em>gaagle</em></td>
<td><em>6088</em></td>
<td><em>320.57</em></td>
</tr>
<tr>
<td>2023-01-05T00:00:00.000Z</td>
<td>twottr</td>
<td>1418</td>
<td>282.21</td>
</tr>
<tr>
<td>2023-01-06T00:00:00.000Z</td>
<td>fakebook</td>
<td>1251</td>
<td>265.52</td>
</tr>
<tr>
<td><em>2023-01-06T00:00:00.000Z</em></td>
<td><em>gaagle</em></td>
<td><em>3417</em></td>
<td><em>162.77</em></td>
</tr>
<tr>
<td>2023-01-06T00:00:00.000Z</td>
<td>twottr</td>
<td>8055</td>
<td>394.56</td>
</tr>
<tr>
<td>2023-01-07T00:00:00.000Z</td>
<td>fakebook</td>
<td>5848</td>
<td>162.96</td>
</tr>
<tr>
<td><em>2023-01-07T00:00:00.000Z</em></td>
<td><em>gaagle</em></td>
<td><em>9762</em></td>
<td><em>76.27</em></td>
</tr>
<tr>
<td>2023-01-07T00:00:00.000Z</td>
<td>twottr</td>
<td>9449</td>
<td>379.21</td>
</tr>
<tr>
<td><em>2023-01-08T00:00:00.000Z</em></td>
<td><em>gaagle</em></td>
<td><em>1484</em></td>
<td><em>188.17</em></td>
</tr>
<tr>
<td><em>2023-01-09T00:00:00.000Z</em></td>
<td><em>gaagle</em></td>
<td><em>1845</em></td>
<td><em>287.5</em></td>
</tr>
</tbody>
</table>
<p>Note how all the rows in <em>italics</em> come from the second data set. They have either been newly inserted (the last two rows), or they have replaced previous rows for the same time interval and network.</p>
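<p>If you want to verify the result yourself, one way is to run the query through the SQL API. A minimal sketch, assuming the quickstart router listens on <code class="language-plaintext highlighter-rouge">localhost:8888</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -X POST http://localhost:8888/druid/v2/sql \
  -H "Content-Type: application/json" \
  -d "{\"query\": \"SELECT * FROM ad_data ORDER BY __time, ad_network\"}"
</code></pre></div></div>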
<h3 id="taking-a-closer-look">Taking a closer look</h3>
<p>Let’s go through some interesting points in the ingestion spec.</p>
<h4 id="the-input-sources">The input sources</h4>
<p>As mentioned above, the <code class="language-plaintext highlighter-rouge">combining</code> input source works like a <code class="language-plaintext highlighter-rouge">union all</code>. The members of the union are specified in the <code class="language-plaintext highlighter-rouge">delegates</code> array, and they are input source definitions themselves.</p>
<p>This tutorial uses only two input sources, but generally you could have more than two. A delegate input source can be any input source, but with one important restriction: all input sources that need an <code class="language-plaintext highlighter-rouge">inputFormat</code> have to share the same <code class="language-plaintext highlighter-rouge">inputFormat</code>.</p>
<p>This means that as soon as file-shaped input sources are involved, they all have to use the same format. But you can freely combine file-shaped input with Druid reindexing, and probably also with SQL input (although I haven’t tested that).</p>
<p>Here is the combine clause for our tutorial:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "inputSource": {
"type": "combining",
"delegates": [
{
"type": "druid",
"dataSource": "ad_data",
...
},
{
"type": "local",
"files": ["/<my base path>/data2.json"]
}
]
}
</code></pre></div></div>
<p>The first part pulls data from the existing Druid datasource. It applies a filter (left out above for brevity), which I cover below. The second part gets the new data from a file.</p>
<p>The file input source does not have the ability to specify a filter, but we don’t need one here because the file contains exactly the data we want to ingest.</p>
<p>The schemas of the two sources almost match, but not quite. We will come back to this when we look at the timestamp definition.</p>
<h4 id="druid-reindexing-interval-boundaries">Druid reindexing: Interval boundaries</h4>
<p>Any Druid reindexing job needs to define the interval that will be considered as the domain of reindexing. If you want to consider all data that exists in the datasource, specify an interval that is large enough to cover all possible timestamps:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "interval": "1000/3000",
</code></pre></div></div>
<p>This shorthand is actually a pair of <a href="https://en.wikipedia.org/wiki/ISO_8601">ISO 8601</a> timestamps with year granularity: a year number by itself is a perfectly legal ISO 8601 timestamp.</p>
<p>Why do we not specify the timestamp filter here? We cannot use the <code class="language-plaintext highlighter-rouge">"interval"</code> setting because we want to <em>cut out</em> an interval. I’ll come to this in the next section.</p>
<p>(What we <em>can</em> do with <code class="language-plaintext highlighter-rouge">"interval"</code>, though, is limit the amount of data that Druid needs to reindex. If you know that all the data you are going to touch is within a specific time range, this can speed things up. But make sure that your interval boundaries are aligned with the segment boundaries in Druid, otherwise you will lose data.)</p>
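<p>For instance, with the weekly segments used in this tutorial, segment boundaries fall on Mondays (Druid weeks follow the ISO calendar), so a narrower interval that still aligns with the segments covering our sample data could look like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  "interval": "2022-12-26/2023-01-16",
</code></pre></div></div>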
<h4 id="ingestion-filter-on-the-druid-reindexing-part">Ingestion filter on the Druid reindexing part</h4>
<p>This is where the cutting out of data happens. The Druid input source allows you to specify a set of filters that work the same way as filters inside the <code class="language-plaintext highlighter-rouge">transformSpec</code>, but, and this is important, are applied to that input source only.</p>
<p><a href="https://druid.apache.org/docs/latest/querying/filters.html">Filters</a> offer various ways to specify filter conditions, and to string them together using boolean operators in prefix notation. The condition tells us which rows to <em>keep</em>. Here is what the filter for our case looks like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "filter": {
"type": "not",
"field": {
"type": "and",
"fields": [
{
"type": "selector",
"dimension": "ad_network",
"value": "gaagle"
},
{
"type": "interval",
"dimension": "__time",
"intervals": [
"2023-01-03T00:00:00Z/2023-01-10T00:00:00Z"
],
"extractionFn": null
}
]
}
}
</code></pre></div></div>
<p>This filter keeps all rows that satisfy the condition <code class="language-plaintext highlighter-rouge">not(and(ad_network=gaagle, timestamp in [interval]))</code>. Or, in simpler words, it drops all rows that are from <code class="language-plaintext highlighter-rouge">gaagle</code> and fall within the time interval from 3 January (inclusive) to 10 January (exclusive).</p>
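<p>One way to sanity-check the filter logic before running the job is to express the same condition in SQL and inspect the rows that would be kept. A sketch, again assuming the quickstart setup:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -X POST http://localhost:8888/druid/v2/sql \
  -H "Content-Type: application/json" \
  -d "{\"query\": \"SELECT COUNT(*) FROM ad_data WHERE NOT (ad_network = 'gaagle' AND __time >= TIMESTAMP '2023-01-03' AND __time < TIMESTAMP '2023-01-10')\"}"
</code></pre></div></div>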
<h4 id="schema-alignment-timestamp-definition">Schema alignment: Timestamp definition</h4>
<p>Most of the fields in the Druid datasource and in the input file match by name and type, because we defined them that way. There is one notable exception, though:</p>
<p>The primary timestamp comes from a column <code class="language-plaintext highlighter-rouge">date</code> and is in ISO-8601 format, but in Druid the timestamp is a <code class="language-plaintext highlighter-rouge">long</code> value, expressed in milliseconds since Epoch, and is always named <code class="language-plaintext highlighter-rouge">__time</code>.</p>
<p><strong>If you do not reconcile these different timestamps, you will get confusing errors.</strong> Maybe Druid will not ingest fresh data at all; in another scenario, I saw an error complaining about a missing interval definition in the partition configuration. At any rate, watch out for your timestamps.</p>
<p>Luckily, it is easy to <a href="https://blog.hellmar-becker.de/2022/02/09/druid-data-cookbook-ingestion-transforms/#composite-timestamps">populate the timestamp using a Druid expression</a>. Here’s how it works:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "timestampSpec": {
"column": "__time",
"missingValue": "2010-01-01T00:00:00Z"
},
"transformSpec": {
"transforms": [
{
"name": "__time",
"type": "expression",
"expression": "nvl(timestamp_parse(date), \"__time\")"
}
]
}
</code></pre></div></div>
<ul>
<li>The default is to pick up the timestamp from the <code class="language-plaintext highlighter-rouge">__time</code> column, which works for the reindexing case. This is coded in <code class="language-plaintext highlighter-rouge">timestampSpec</code>.</li>
<li>A transform overrides the value, replacing it with whatever is found in the <code class="language-plaintext highlighter-rouge">date</code> column (the file case). If that value doesn’t exist, we fall back to <code class="language-plaintext highlighter-rouge">__time</code>.</li>
</ul>
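<p>The SQL functions <code class="language-plaintext highlighter-rouge">TIME_PARSE</code> and <code class="language-plaintext highlighter-rouge">NVL</code> mirror the native expressions used above, so you can get a feeling for the fallback behavior with a quick ad-hoc query: <code class="language-plaintext highlighter-rouge">TIME_PARSE</code> returns <code class="language-plaintext highlighter-rouge">NULL</code> for anything it cannot parse, and <code class="language-plaintext highlighter-rouge">NVL</code> then falls through to its second argument. A sketch:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -X POST http://localhost:8888/druid/v2/sql \
  -H "Content-Type: application/json" \
  -d "{\"query\": \"SELECT NVL(TIME_PARSE('not a date'), TIMESTAMP '2010-01-01') AS fallback_time\"}"
</code></pre></div></div>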
<h4 id="tuning-configuration">Tuning configuration</h4>
<p>The documentation mentions that</p>
<blockquote>
<p>The secondary partitioning method determines the requisite number of concurrent worker tasks that run in parallel to complete ingestion with the Combining input source. Set this value in <code class="language-plaintext highlighter-rouge">maxNumConcurrentSubTasks</code> in <code class="language-plaintext highlighter-rouge">tuningConfig</code> based on the secondary partitioning method:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">range</code> or <code class="language-plaintext highlighter-rouge">single_dim</code> partitioning: greater than or equal to 1</li>
<li><code class="language-plaintext highlighter-rouge">hashed</code> or <code class="language-plaintext highlighter-rouge">dynamic</code> partitioning: greater than or equal to 2</li>
</ul>
</blockquote>
<p><strong>This advice is to be taken seriously.</strong> If you try to run with an insufficient number of subtasks, you will get a highly misleading error message that looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>java.lang.UnsupportedOperationException: Implement this method properly if needsFormat() = true
</code></pre></div></div>
<p>Make sure you configure at least two concurrent subtasks if you are using <code class="language-plaintext highlighter-rouge">hashed</code> or <code class="language-plaintext highlighter-rouge">dynamic</code> partitioning.</p>
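<p>For reference, this is the relevant fragment from the spec above, with hashed partitioning and two concurrent subtasks:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  "tuningConfig": {
    "type": "index_parallel",
    "partitionsSpec": {
      "type": "hashed"
    },
    "forceGuaranteedRollup": true,
    "maxNumConcurrentSubTasks": 2
  }
</code></pre></div></div>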
<h2 id="conclusion">Conclusion</h2>
<p>This tutorial showed how to fold new and updated data into an existing datasource, to the effect of a <em>selective bulk upsert</em>. Let’s recap a few learnings:</p>
<ul>
<li>Selective bulk upserts are done using the <code class="language-plaintext highlighter-rouge">combining inputSource</code> idiom in Druid.</li>
<li>For reindexing Druid data, choose the <code class="language-plaintext highlighter-rouge">interval</code> to align with segment boundaries, or to be large enough to cover all data. You can apply fine grained date/time filters in the <code class="language-plaintext highlighter-rouge">filter</code> clause.</li>
<li>Ingestion filters are very expressive and allow a detailed specification of which data to retain or replace.</li>
<li>Make sure timestamp definitions are aligned between your Druid datasource and external data.</li>
<li>Configure a sufficient number of subtasks, according to the documentation.</li>
</ul>