Getting Started with DuckLake 1.0: A SQL-Based Data Lake Format
<h2 id="overview">Overview</h2>
<p>DuckLake 1.0 introduces a fresh approach to managing data lake metadata. Instead of scattering metadata across numerous files in object storage, it centralizes table metadata in a SQL database—making updates, sorting, and partitioning more efficient. Built as a DuckDB extension, DuckLake integrates seamlessly with existing workflows and offers compatibility with Iceberg-style features. This guide walks you through its setup, core operations, and common pitfalls.</p>
<h2 id="prerequisites">Prerequisites</h2>
<ul>
<li><strong>DuckDB</strong>: Version 1.3.0 or higher (command-line interface or Python binding).</li>
<li><strong>Object Storage</strong>: A bucket or directory (e.g., S3, MinIO, local filesystem) for storing parquet files.</li>
<li><strong>SQL Database</strong>: For the catalog. A plain DuckDB file works for local testing; production deployments typically use PostgreSQL or MySQL.</li>
<li><strong>DuckLake Extension</strong>: Install via <code>INSTALL ducklake; LOAD ducklake;</code>.</li>
</ul>
<h2 id="step-by-step">Step-by-Step Instructions</h2>
<h3 id="install-extension">1. Install and Load the DuckLake Extension</h3>
<p>Open DuckDB and run:</p>
<pre><code>INSTALL ducklake;
LOAD ducklake;</code></pre>
<p>This registers DuckLake's functions and types. Verify with <code>SELECT extension_name, loaded FROM duckdb_extensions() WHERE extension_name = 'ducklake';</code></p>
<h3 id="create-catalog">2. Create a DuckLake Catalog</h3>
<p>A catalog holds all table metadata. Attach one with DuckDB's <code>ATTACH</code> statement and the <code>ducklake:</code> prefix:</p>
<pre><code>-- Metadata in a local DuckDB file; Parquet data under ./sales_data
ATTACH 'ducklake:catalog.ducklake' AS my_catalog (DATA_PATH 'sales_data/');
-- Switch to the catalog
USE my_catalog;</code></pre>
<p><em>Tip</em>: For a shared catalog, point the <code>ducklake:</code> prefix at a PostgreSQL or MySQL database instead of a local file, as in the sketch below.</p>
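<p>A minimal sketch of a shared PostgreSQL-backed catalog; the host, database, and bucket names are illustrative:</p>
<pre><code>-- The catalog connection rides on DuckDB's postgres extension
INSTALL postgres;
LOAD postgres;
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=db.example.com user=lake'
    AS prod_catalog (DATA_PATH 's3://my-bucket/lake/');
USE prod_catalog;</code></pre>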
<h3 id="create-table">3. Create a DuckLake Table</h3>
<p>Define the table, then declare a partition key:</p>
<pre><code>CREATE TABLE sales (
    order_id   INTEGER,
    amount     DECIMAL(10,2),
    order_date DATE,
    region     VARCHAR
);
-- New data files will be split by region
ALTER TABLE sales SET PARTITIONED BY (region);</code></pre>
<p>This creates a logical table; the rows themselves are written as Parquet files under the catalog's <code>DATA_PATH</code>, split by the partition key. There is no separate sort clause, so insert data pre-sorted on columns you filter by often (such as <code>order_date</code>) to get tighter per-file statistics.</p>
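<p>If <code>DATA_PATH</code> points at S3 rather than local disk, DuckDB's <code>httpfs</code> extension plus a secret supplies the credentials; the key values below are placeholders, not working credentials:</p>
<pre><code>INSTALL httpfs;
LOAD httpfs;
-- Credentials for any s3:// paths, including the catalog's DATA_PATH
CREATE SECRET lake_s3 (
    TYPE S3,
    KEY_ID 'AKIAexample',
    SECRET 'example-secret',
    REGION 'us-east-1'
);</code></pre>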
<h3 id="write-data">4. Insert Data</h3>
<p>Insert directly or from a SELECT:</p>
<pre><code>INSERT INTO sales VALUES
(1, 150.00, '2025-01-15', 'East'),
(2, 200.50, '2025-01-16', 'West');</code></pre>
<p>DuckLake automatically writes new Parquet files per partition and updates the catalog.</p>
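<p>Bulk loads work the same way. For example, inserting from an existing Parquet file (the staging path is hypothetical) lets DuckLake split the rows across partitions as they land:</p>
<pre><code>INSERT INTO sales
SELECT order_id, amount, order_date, region
FROM read_parquet('staging/sales_2025_01.parquet');</code></pre>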
<h3 id="read-data">5. Query the Table</h3>
<p>Standard SQL works—DuckLake reads the catalog to locate files:</p>
<pre><code>SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE order_date >= '2025-01-01'
GROUP BY region;</code></pre>
<p>Partition pruning is applied automatically, and per-file min/max statistics narrow the scan further.</p>
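<p>To confirm pruning is happening, prefix the query with <code>EXPLAIN</code> and check how little of the table the scan plan touches:</p>
<pre><code>EXPLAIN
SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE region = 'East'
GROUP BY region;</code></pre>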
<h3 id="manage-partitions">6. Manage Partitions and Small Updates</h3>
<p>DuckLake supports incremental updates without rewriting whole partitions. Use <code>DELETE</code> or <code>MERGE</code> (note that <code>MERGE INTO</code> requires a recent DuckDB release):</p>
<pre><code>DELETE FROM sales WHERE order_id = 1;

MERGE INTO sales AS target
USING (VALUES (3, 300.00, DATE '2025-01-20', 'East'))
    AS src(order_id, amount, order_date, region)
ON target.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET amount = src.amount
WHEN NOT MATCHED THEN INSERT (order_id, amount, order_date, region)
    VALUES (src.order_id, src.amount, src.order_date, src.region);</code></pre>
<p>The catalog tracks these small changes efficiently.</p>
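<p>Every committed change becomes a snapshot in the catalog, which enables inspection and time travel; a minimal sketch, assuming the catalog is attached as <code>my_catalog</code>:</p>
<pre><code>-- List all snapshots recorded in the catalog
SELECT * FROM ducklake_snapshots('my_catalog');
-- Read the table as it was at snapshot version 1
SELECT * FROM sales AT (VERSION => 1);</code></pre>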
<h3 id="iceberg-compat">7. Iceberg Compatibility</h3>
<p>DuckLake itself does not read Iceberg metadata; instead, pair it with DuckDB's <code>iceberg</code> extension to query Iceberg tables alongside your DuckLake tables:</p>
<pre><code>INSTALL iceberg;
LOAD iceberg;
SELECT * FROM iceberg_scan('s3://bucket/iceberg_table');</code></pre>
<p>Write support is limited to DuckLake-native tables.</p>
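<p>One way to bring existing Iceberg data under DuckLake's catalog is a plain <code>CREATE TABLE ... AS SELECT</code> copy (the bucket path is a placeholder):</p>
<pre><code>CREATE TABLE my_catalog.sales_from_iceberg AS
SELECT * FROM iceberg_scan('s3://bucket/iceberg_table');</code></pre>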
<h2 id="common-mistakes">Common Mistakes</h2>
<ul>
<li><strong>Forgetting to load the extension</strong>: Always run <code>LOAD ducklake;</code> after installation.</li>
<li><strong>Wrong catalog connection string</strong>: Ensure the <code>ducklake:</code> path or database URL in your <code>ATTACH</code> statement is correct and accessible.</li>
<li><strong>Partition key mismatch</strong>: When inserting, always populate the partition column; NULL values end up in a catch-all partition and defeat pruning.</li>
<li><strong>Accumulating small files</strong>: DuckLake handles small updates efficiently, but avoid streams of tiny inserts; compact periodically with DuckLake's file-merging maintenance calls (see the sketch after this list).</li>
<li><strong>Ignoring data order</strong>: Insert data pre-sorted on the columns you filter by most, such as <code>order_date</code>, so per-file statistics can prune effectively; otherwise queries scan many more files.</li>
</ul>
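<p>A hedged sketch of the compaction workflow referenced above, assuming the catalog is attached as <code>my_catalog</code>; check your installed version's documentation for the exact function names and parameters:</p>
<pre><code>-- Merge many small Parquet files within a partition into fewer, larger ones
CALL ducklake_merge_adjacent_files('my_catalog');
-- Expire old snapshots so the files they reference can later be cleaned up
CALL ducklake_expire_snapshots('my_catalog', older_than => NOW() - INTERVAL '7 days');</code></pre>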
<h2 id="summary">Summary</h2>
<p>DuckLake 1.0 simplifies data lake management by storing metadata in SQL, enabling faster updates and smarter partitioning. With its DuckDB extension, you get a lightweight yet powerful alternative to Hive or Iceberg for analytical workloads. Start small, tune your partitions, and enjoy seamless SQL-driven data lakes.</p>