ClickHouse Array Storage: The Explosion Technique

Tags: clickhouse, database, arrays, storage, performance, analytics

Discover how ClickHouse efficiently stores arrays using values + offsets columns, enabling blazing-fast analytical queries and optimal compression.

The Discovery

I just discovered how ClickHouse stores Arrays efficiently using values + offsets - a technique that’s both elegant and performant.

ClickHouse has a reputation for being ridiculously fast at analytical queries. Part of that speed comes from its columnar storage design, but one detail that surprised me is how it stores arrays and nested data structures.

Instead of storing each row’s array as-is, ClickHouse explodes arrays into two separate flat columns:

  • Values column: a flat sequence of all elements from all arrays
  • Offsets column: an array of integers marking where each row’s array ends

The Problem with Naive Approaches

Imagine you have a table with arrays:

CREATE TABLE example (
  id UInt32,
  tags Array(String)
) ENGINE = MergeTree()
ORDER BY id;

And you insert some rows:

id  tags
1   ["db", "metrics"]
2   ["ai", "ml", "genai"]
3   ["infra"]

A naive database might:

  • Store each row’s array as a blob (["db", "metrics"]) → bad for compression
  • Store each row in a separate subtable → bad for query speed

The ClickHouse Way: Exploding Arrays

ClickHouse stores the same data in this format:

tags.values  = ["db", "metrics", "ai", "ml", "genai", "infra"]
tags.offsets = [2, 5, 6]

  • tags.values holds all elements in sequence, concatenated across rows
  • tags.offsets records the cumulative position where each row’s elements end, so each row’s length falls out by subtraction (see the query sketch below)
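
You can approximate this exploded view with an ordinary aggregate query; it’s only an illustration of the layout (the real explosion happens inside the storage layer, not at query time):

SELECT
  groupArrayArray(tags)                 AS flat_values,  -- ["db", "metrics", "ai", "ml", "genai", "infra"]
  arrayCumSum(groupArray(length(tags))) AS offsets       -- [2, 5, 6]
FROM (SELECT tags FROM example ORDER BY id);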

Reconstructing Original Arrays

So how do you reconstruct the original row arrays?

  • Row 1: take values from index 0 to 1 (offset 2) → ["db", "metrics"]
  • Row 2: take values from index 2 to 4 (offset 5) → ["ai", "ml", "genai"]
  • Row 3: take values from index 5 to 5 (offset 6) → ["infra"]

That’s it! Two flat columns, yet enough information to map every row back to its original nested array.
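
To make the index arithmetic concrete, the same slicing can be written with arraySlice (1-indexed, taking a start position and a length). This is just an illustration on a literal array, not how ClickHouse actually reads the data:

WITH ['db', 'metrics', 'ai', 'ml', 'genai', 'infra'] AS flat_values
SELECT
  arraySlice(flat_values, 1, 2) AS row1,  -- start after offset 0, length 2 - 0 → ["db", "metrics"]
  arraySlice(flat_values, 3, 3) AS row2,  -- start after offset 2, length 5 - 2 → ["ai", "ml", "genai"]
  arraySlice(flat_values, 6, 1) AS row3;  -- start after offset 5, length 6 - 5 → ["infra"]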

Why This Is Brilliant

1. Append-friendly

Adding a new row is just appending more values and a new offset. No reshuffling required.
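
For illustration, inserting one more (made-up) row into the example table would append two elements to tags.values and a single new offset to tags.offsets, leaving everything already written untouched:

INSERT INTO example VALUES (4, ['db', 'olap']);

-- tags.values  grows to ["db", "metrics", "ai", "ml", "genai", "infra", "db", "olap"]
-- tags.offsets grows to [2, 5, 6, 8]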

2. Cache-friendly scans

Columnar queries (like arrayJoin(tags)) only need to scan the flat tags.values, which is contiguous in memory.
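
For example, flattening tags into one (id, tag) row per element only has to read that flat data:

SELECT id, arrayJoin(tags) AS tag
FROM example;

-- 1  db
-- 1  metrics
-- 2  ai
-- 2  ml
-- 2  genai
-- 3  infra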

3. Compression wins

Flat columns compress better than row-level blobs because similar values (like repeated tags) sit next to each other.
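
If you want to check this on a real table, system.columns reports both compressed and uncompressed sizes per column, so you can compute the ratio directly:

SELECT
  name,
  formatReadableSize(data_compressed_bytes)   AS compressed,
  formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE database = currentDatabase() AND table = 'example';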

4. Vectorized execution

Queries operate on contiguous memory regions, enabling SIMD acceleration and fast parallel scans.

The Technical Implementation

This approach allows ClickHouse to:

  • Process arrays efficiently without complex nested data structures
  • Leverage columnar storage benefits even for nested data
  • Enable fast analytical queries on array data
  • Achieve better compression ratios compared to JSON-like storage
