Data Cookbook Kitchen

This is a question I encountered recently. A customer wanted to get started with ClickHouse but they had their historical data in CSV files in Google Drive.

Here’s how to get the job done. Main source is this StackOverflow answer.

Get the file ID:

  • Go to your Google Drive in your browser.
  • Right-click the file you want to download and click Get shareable link. The link looks like this: https://drive.google.com/file/d/XXX/view?usp=sharing. Make note of the file ID XXX; you will be needing it below.

Get an OAuth token:

  • Go to OAuth 2.0 Playground
  • In the Select & authorize APIs box, scroll down, expand Drive API v3, and select https://www.googleapis.com/auth/drive.readonly.
  • Click Authorize APIs and then Exchange authorization code for tokens. Copy the Access token YYY; you will be needing it below.

Find the table schema:

Using the values obtained in the previous steps, create the DESCRIBE command to read the table definition using the url table function. Use the setting schema_inference_make_columns_nullable = 0 to make all columns non nullable.

DESCRIBE url(
    'https://www.googleapis.com/drive/v3/files/XXX?alt=media', 
    'CSVWithNames', 
    headers('Authorization' = 'Bearer YYY')
)
SETTINGS schema_inference_make_columns_nullable = 0;

Choose the column(s) for your primary key from the displayed column list. I’ll assume you are going to use a column named ZZZ as the primary key.

Create and populate the table:

CREATE OR REPLACE TABLE t  
ENGINE = MergeTree
ORDER BY `ZZZ`
EMPTY AS 
SELECT * FROM url(
    'https://www.googleapis.com/drive/v3/files/XXX?alt=media', 
    'CSVWithNames', 
    headers('Authorization' = 'Bearer YYY')
)
SETTINGS schema_inference_make_columns_nullable = 0;
INSERT INTO t
SELECT * FROM url(
    'https://www.googleapis.com/drive/v3/files/XXX?alt=media', 
    'CSVWithNames', 
    headers('Authorization' = 'Bearer YYY')
)
SETTINGS schema_inference_make_columns_nullable = 0;

Conclusion

  • Files in Google Drive can be accessed programmatically through the GDrive API.
  • Use the url table function to ingest such files into ClickHouse.

This image is taken from Page 377 of Praktisches Kochbuch für die gewöhnliche und feinere Küche” by Medical Heritage Library, Inc. is licensed under CC BY-NC-SA 2.0 .