Code: github.com/KellerKev/ducklake-pixi — A pixi-managed DuckLake data lake with row-level access control, S3 storage, and local/production parity.
DuckLake is DuckDB’s answer to the lakehouse format — an open standard that uses a SQL database (PostgreSQL, MySQL, or SQLite) as the catalog and Parquet files on object storage as the data layer. No Hive Metastore, no Iceberg REST catalog, no heavyweight infrastructure. Just SQL and Parquet.
This project is inspired by Floyd Berndsen’s excellent ducklake-hetzner — which deploys a DuckLake stack on Hetzner using OpenTofu and PyInfra — and his ducklake-guard, which adds per-table access control with synchronized S3 bucket policies, PostgreSQL RLS, and an audit log. Great work that showed the potential of DuckLake as a governed, low-cost data lake.
I wanted to see if the same result could be achieved with pixi alone — managing all dependencies, tasks, and environments in a single reproducible pixi.toml. The result is ducklake-pixi: a complete DuckLake deployment with row-level access control, S3 bucket policies, and a local-to-production workflow — all through pixi. Total cost on Hetzner: under 10 euros a month.
## Architecture
┌─────────────────────────────────────────────────────────┐
│ DuckLake Stack │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ DuckDB │ │ PostgreSQL │ │ S3 Storage │ │
│ │ │ │ │ │ │ │
│ │ Query │ │ Catalog │ │ Data │ │
│ │ Engine │ │ Metadata │ │ (Parquet) │ │
│ │ │ │ Access │ │ │ │
│ │ SQL │ │ Control │ │ Hetzner / │ │
│ │ Analytics │ │ (RLS) │ │ MinIO │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └─────────────────┼─────────────────┘ │
│ │ │
│ pixi.toml │
│ (dependencies + tasks) │
└─────────────────────────────────────────────────────────┘

DuckDB is the query engine — SQL analytics directly on Parquet files in S3. No cluster to manage, no warehouse to size.
PostgreSQL is the catalog — stores table metadata, schema definitions, and row-level security policies. Multiple DuckDB clients can connect concurrently to the same catalog, enabling “multiplayer DuckDB.”
S3-compatible storage holds the actual data as Parquet files. Locally this is MinIO; in production it’s Hetzner Object Storage (or any S3-compatible service).
pixi manages everything — DuckDB, PostgreSQL, MinIO, Python, and all utility scripts. One pixi install and you’re ready.
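The wiring above can be sketched in a few lines of DuckDB SQL. The `ducklake` and `postgres` extensions are real, but the connection string, catalog name, and bucket path below are illustrative placeholders, not the exact values the repo's pixi tasks read from `.env`:

```sql
-- Illustrative only: host, database, and bucket names are placeholders.
INSTALL ducklake;
INSTALL postgres;
LOAD ducklake;

-- PostgreSQL holds the catalog; S3 (MinIO locally) holds the Parquet data.
ATTACH 'ducklake:postgres:dbname=ducklake host=localhost' AS my_lake
    (DATA_PATH 's3://ducklake-bucket/data');
USE my_lake;
```

From here, every `CREATE TABLE` and `INSERT` writes Parquet to the bucket while the catalog rows land in PostgreSQL.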
## Why Pixi?
The original ducklake-hetzner uses OpenTofu for infrastructure provisioning and PyInfra for server configuration — great choices for production infrastructure management. I personally wanted to explore whether pixi alone could cover this use case too, because I think it’s a perfect fit for data engineering workflows. Everything lives in a single pixi.toml:
- One file defines everything — dependencies (DuckDB, PostgreSQL, Python, MinIO), tasks (start, stop, shell, guard), and environment configuration. No Makefile, no Docker Compose, no requirements.txt.
- Reproducible across machines — pixi.lock pins every dependency to an exact version and hash. Clone the repo, run pixi install, and you get the identical stack on any machine.
- No system dependencies — pixi installs DuckDB, PostgreSQL, and MinIO from conda-forge. Nothing touches your system Python or package manager.
- Task runner built in — pixi run local-up, pixi run shell, pixi run guard-acl-grant. No need to remember which script lives where.
This is what makes pixi compelling for data engineering projects: it collapses the dependency management, environment setup, and task orchestration into a single, portable file.
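As a rough sketch of that single file (package names, versions, and task commands here are invented for illustration — see the repo for the actual manifest), a pixi.toml for this kind of stack looks like:

```toml
# Sketch only — not the repo's real pixi.toml.
[project]
name = "ducklake-pixi"
channels = ["conda-forge"]
platforms = ["linux-64", "osx-arm64"]

[dependencies]
python = ">=3.11"
python-duckdb = "*"
postgresql = "*"
minio = "*"

[tasks]
local-up = "bash scripts/local_up.sh"     # start PostgreSQL + MinIO
shell = "duckdb -init scripts/attach.sql" # open an attached DuckDB session
```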
## Why DuckLake?
Traditional lakehouse formats (Iceberg, Delta Lake, Hudi) use file-based metadata — JSON manifests, Avro snapshots, transaction logs stored alongside data files. This works, but it adds complexity: you need a catalog service (Hive Metastore, Nessie, Unity Catalog) to make sense of it.
DuckLake takes a different approach: the catalog is just a SQL database. Table definitions, column metadata, file references, snapshots — all stored in PostgreSQL (or MySQL/SQLite). This means:
- No additional catalog service to deploy or maintain
- Concurrent access via standard database connections
- Access control via standard SQL grants and row-level security
- Familiar tooling — any PostgreSQL client can inspect the catalog
The data itself stays in open Parquet format on object storage. You get the benefits of a lakehouse (schema evolution, time travel, concurrent access) without the operational overhead.
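Time travel, for instance, is just a catalog lookup. The syntax below follows my reading of the DuckLake documentation, and the snapshot number is illustrative:

```sql
-- List the snapshots recorded in the catalog.
SELECT * FROM my_lake.snapshots();

-- Query a table as of an earlier snapshot.
SELECT * FROM my_lake.main.sensors AT (VERSION => 1);
```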
## Guard — Row-Level Access Control
The access control approach is inspired by Floyd Berndsen’s ducklake-guard, which pioneered the idea of synchronizing PostgreSQL RLS, S3 bucket policies, and a grant database for DuckLake lakehouses. This project adapts that pattern into pixi-managed tasks with a dual-layer system that restricts what data different users can see:
Layer 1: PostgreSQL RLS — row-level security policies on the catalog database hide table metadata from unauthorized roles. A reader who doesn’t have access to the customer table won’t even see it in the catalog.
Layer 2: S3 Bucket Policies — reader credentials are scoped to specific S3 key prefixes, so even if someone bypasses the catalog, the storage layer blocks access to unauthorized data.
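Layer 1 can be pictured with a hedged sketch: assuming the DuckLake catalog keeps table metadata in a `ducklake_table` relation, and assuming grants live in a helper table (`reader_grants` and `lake_reader` are invented names, not what the repo's guard tasks actually create), the RLS policy would look roughly like:

```sql
-- Sketch only: reader_grants and lake_reader are illustrative names.
ALTER TABLE ducklake_table ENABLE ROW LEVEL SECURITY;

-- A reader role only sees catalog rows for tables it was granted.
CREATE POLICY reader_visible_tables ON ducklake_table
    FOR SELECT
    TO lake_reader
    USING (
        table_name IN (
            SELECT g.table_name
            FROM reader_grants g
            WHERE g.role_name = current_user
        )
    );
```

The effect is the one described above: an ungranted table simply never appears in the reader's catalog queries.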
┌──────────┐ catalog query ┌────────────────┐
│ Reader │ ─────────────────► │ PostgreSQL │
│ (DuckDB)│ │ RLS filters │
│ │ ◄───────────────── │ visible tables│
└────┬─────┘ only granted └────────────────┘
│
│ data query (only granted prefixes)
▼
┌────────────────┐
│ S3 Storage │
│ Bucket policy │
│ restricts to │
│ table prefix │
└────────────────┘

Grant and revoke access with simple commands:
# Grant a reader access to the customer table
SCHEMA=tpch TABLE=customer pixi run guard-acl-grant
# Revoke access
SCHEMA=tpch TABLE=customer pixi run guard-acl-revoke
# Show current grants
pixi run guard-acl-show

## Local Development
Everything runs locally with MinIO as the S3 replacement:
# Clone and start
git clone https://github.com/KellerKev/ducklake-pixi.git
cd ducklake-pixi
cp .env.local .env
pixi run local-up # starts PostgreSQL + MinIO, creates bucket
# Open a DuckDB shell connected to the lake
pixi run shell

-- Create a table — stored as Parquet in MinIO
CREATE TABLE my_lake.main.sensors (
ts TIMESTAMP, device_id VARCHAR, reading DOUBLE
);
INSERT INTO my_lake.main.sensors VALUES
(now(), 'temp-01', 22.5),
(now(), 'temp-02', 23.1);
SELECT * FROM my_lake.main.sensors;

Load the TPC-H sample dataset (~86K rows, 8 tables) for testing access control:
pixi run guard-load-sample # load TPC-H data
pixi run guard-pg-setup # create reader role + RLS policies
pixi run guard-acl-grant # grant access to specific tables
pixi run reader-shell        # open a restricted DuckDB session

## Production on Hetzner
Deploy to Hetzner for under 10 euros a month:
| Component | Hetzner Service | Cost |
|---|---|---|
| PostgreSQL + DuckDB | CX22 VPS (2 vCPU, 4 GB RAM) | ~5.50 EUR/mo |
| Object Storage | Hetzner S3 (1 TB included) | ~3.50 EUR/mo |
| Total | | ~9 EUR/mo |
# Configure production credentials
cp .env.prod.sample .env
# Edit .env with your Hetzner S3 credentials and PostgreSQL connection
# Create the bucket and initialize
pixi run prod-up
# Query production data
pixi run shell

The DuckLake catalog format is identical between local and production. You can develop locally with MinIO, then point to Hetzner S3 with a config change.
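That config change is just environment variables. The names below are illustrative, not necessarily those in the repo's .env.local and .env.prod.sample:

```
# .env.local (MinIO)                  # .env (Hetzner, from .env.prod.sample)
S3_ENDPOINT=http://localhost:9000     # S3_ENDPOINT=https://<region-endpoint>
S3_BUCKET=ducklake                    # S3_BUCKET=my-lake
PG_HOST=localhost                     # PG_HOST=<vps-address>
```

Nothing in the catalog or the Parquet layout changes between the two.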
## Pixi Task Reference
Everything is a pixi task — no Makefiles, no Docker Compose, no shell scripts to remember:
| Task | What It Does |
|---|---|
| pixi run local-up | Start PostgreSQL + MinIO, create bucket |
| pixi run local-down | Stop local services |
| pixi run shell | Open admin DuckDB session |
| pixi run reader-shell | Open restricted DuckDB session |
| pixi run status | Health check all services |
| pixi run prod-up | Create Hetzner S3 bucket |
| pixi run guard-load-sample | Load TPC-H dataset |
| pixi run guard-pg-setup | Create reader role + RLS |
| pixi run guard-acl-grant | Grant table access |
| pixi run guard-acl-revoke | Revoke table access |
| pixi run guard-acl-show | Show current grants |
## When to Use This
This setup makes sense when:
- You need a lightweight data lake without the overhead of Iceberg/Delta Lake infrastructure
- You want SQL-native catalog management rather than file-based metadata
- You need row-level access control on a data lake without a commercial platform
- You’re running on a budget — under 10 EUR/mo for a fully governed data lake
- You want local/production parity with the same catalog format
It’s not a replacement for Snowflake or Databricks for large-scale production workloads. It’s a practical, governed data lake for analytics teams, side projects, and environments where simplicity and cost matter more than scale.
