
DuckLake on Hetzner with Pixi — A Governed Data Lake for Under 10 Euros a Month

Code: github.com/KellerKev/ducklake-pixi — A pixi-managed DuckLake data lake with row-level access control, S3 storage, and local/production parity.

DuckLake is DuckDB’s answer to the lakehouse format — an open standard that uses a SQL database (PostgreSQL, MySQL, or SQLite) as the catalog and Parquet files on object storage as the data layer. No Hive Metastore, no Iceberg REST catalog, no heavyweight infrastructure. Just SQL and Parquet.

This project is inspired by Floyd Berndsen’s excellent ducklake-hetzner — which deploys a DuckLake stack on Hetzner using OpenTofu and PyInfra — and his ducklake-guard, which adds per-table access control with synchronized S3 bucket policies, PostgreSQL RLS, and an audit log. Great work that showed the potential of DuckLake as a governed, low-cost data lake.

I wanted to see if the same result could be achieved with pixi alone — managing all dependencies, tasks, and environments in a single reproducible pixi.toml. The result is ducklake-pixi: a complete DuckLake deployment with row-level access control, S3 bucket policies, and a local-to-production workflow — all through pixi. Total cost on Hetzner: under 10 euros a month.


Architecture

  ┌─────────────────────────────────────────────────────────┐
  │                    DuckLake Stack                        │
  │                                                         │
  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
  │  │   DuckDB     │  │  PostgreSQL  │  │  S3 Storage  │  │
  │  │              │  │              │  │              │  │
  │  │  Query       │  │  Catalog     │  │  Data        │  │
  │  │  Engine      │  │  Metadata    │  │  (Parquet)   │  │
  │  │              │  │  Access      │  │              │  │
  │  │  SQL         │  │  Control     │  │  Hetzner /   │  │
  │  │  Analytics   │  │  (RLS)       │  │  MinIO       │  │
  │  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  │
  │         │                 │                 │          │
  │         └─────────────────┼─────────────────┘          │
  │                           │                            │
  │                    pixi.toml                           │
  │              (dependencies + tasks)                    │
  └─────────────────────────────────────────────────────────┘

DuckDB is the query engine — SQL analytics directly on Parquet files in S3. No cluster to manage, no warehouse to size.

PostgreSQL is the catalog — stores table metadata, schema definitions, and row-level security policies. Multiple DuckDB clients can connect concurrently to the same catalog, enabling “multiplayer DuckDB.”

S3-compatible storage holds the actual data as Parquet files. Locally this is MinIO; in production it’s Hetzner Object Storage (or any S3-compatible service).

pixi manages everything — DuckDB, PostgreSQL, MinIO, Python, and all utility scripts. One pixi install and you’re ready.


Why Pixi?

The original ducklake-hetzner uses OpenTofu for infrastructure provisioning and PyInfra for server configuration — great choices for production infrastructure management. I personally wanted to explore whether pixi alone could cover this use case too, because I think it’s a perfect fit for data engineering workflows. Everything lives in a single pixi.toml:

  • One file defines everything — dependencies (DuckDB, PostgreSQL, Python, MinIO), tasks (start, stop, shell, guard), and environment configuration. No Makefile, no Docker Compose, no requirements.txt.
  • Reproducible across machines — pixi.lock pins every dependency to an exact version and hash. Clone the repo, run pixi install, and you get the identical stack on any machine.
  • No system dependencies — pixi installs DuckDB, PostgreSQL, and MinIO from conda-forge. Nothing touches your system Python or package manager.
  • Task runner built in — pixi run local-up, pixi run shell, pixi run guard-acl-grant. No need to remember which script lives where.

This is what makes pixi compelling for data engineering projects: it collapses the dependency management, environment setup, and task orchestration into a single, portable file.
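
To make that concrete, here is a rough sketch of what such a pixi.toml can look like. The task names match the ones used in this post, but the dependency names, versions, and script paths are illustrative assumptions, not the project's actual manifest:

```toml
[project]
name = "ducklake-pixi"
channels = ["conda-forge"]
platforms = ["linux-64", "osx-arm64"]

[dependencies]
# All from conda-forge — nothing touches the system package manager
python = "3.12.*"
python-duckdb = "*"
postgresql = "*"
minio = "*"        # package name illustrative; check conda-forge

[tasks]
# Task names as used in this post; script paths are placeholders
local-up = "bash scripts/local_up.sh"      # start PostgreSQL + MinIO, create bucket
local-down = "bash scripts/local_down.sh"  # stop local services
shell = "python scripts/shell.py"          # admin DuckDB session
```

Running pixi install then resolves everything into pixi.lock, which is what gives the "clone and go" reproducibility described above.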


Why DuckLake?

Traditional lakehouse formats (Iceberg, Delta Lake, Hudi) use file-based metadata — JSON manifests, Avro snapshots, transaction logs stored alongside data files. This works, but it adds complexity: you need a catalog service (Hive Metastore, Nessie, Unity Catalog) to make sense of it.

DuckLake takes a different approach: the catalog is just a SQL database. Table definitions, column metadata, file references, snapshots — all stored in PostgreSQL (or MySQL/SQLite). This means:

  • No additional catalog service to deploy or maintain
  • Concurrent access via standard database connections
  • Access control via standard SQL grants and row-level security
  • Familiar tooling — any PostgreSQL client can inspect the catalog

The data itself stays in open Parquet format on object storage. You get the benefits of a lakehouse (schema evolution, time travel, concurrent access) without the operational overhead.
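
From the DuckDB side, using a PostgreSQL-backed DuckLake looks roughly like the following sketch. Connection parameters, bucket, and table names are placeholders, and the snapshot query is illustrative of the time-travel metadata DuckLake keeps in the catalog:

```sql
INSTALL ducklake;
LOAD ducklake;

-- Catalog lives in PostgreSQL, data lands as Parquet on S3 (placeholder values)
ATTACH 'ducklake:postgres:dbname=ducklake host=localhost user=admin' AS my_lake
    (DATA_PATH 's3://my-bucket/lake/');

-- A regular table — metadata goes to PostgreSQL, rows to Parquet files
CREATE TABLE my_lake.main.events (id INTEGER, payload VARCHAR);

-- Snapshots are just rows in the catalog; any PostgreSQL client can see them too
SELECT snapshot_id, snapshot_time FROM ducklake_snapshots('my_lake');
```

Because the catalog is an ordinary database, concurrent clients simply open concurrent connections — no extra catalog service in between.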


Guard — Row-Level Access Control

The access control approach is inspired by Floyd Berndsen’s ducklake-guard, which pioneered the idea of synchronizing PostgreSQL RLS, S3 bucket policies, and a grant database for DuckLake lakehouses. This project adapts that pattern into pixi-managed tasks with a dual-layer system that restricts what data different users can see:

Layer 1: PostgreSQL RLS — row-level security policies on the catalog database hide table metadata from unauthorized roles. A reader who doesn’t have access to the customer table won’t even see it in the catalog.

Layer 2: S3 Bucket Policies — reader credentials are scoped to specific S3 key prefixes, so even if someone bypasses the catalog, the storage layer blocks access to unauthorized data.

  ┌──────────┐    catalog query    ┌────────────────┐
  │  Reader  │ ─────────────────► │  PostgreSQL    │
  │  (DuckDB)│                    │  RLS filters   │
  │          │ ◄───────────────── │  visible tables│
  └────┬─────┘    only granted    └────────────────┘
       │  data query (only granted prefixes)
  ┌────────────────┐
  │  S3 Storage    │
  │  Bucket policy │
  │  restricts to  │
  │  table prefix  │
  └────────────────┘
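
A minimal sketch of what Layer 1 can look like on the catalog side. The role, policy, and grants-table names here are illustrative assumptions, not the project's actual schema; the idea is that RLS on the catalog's table-metadata rows hides unauthorized tables entirely:

```sql
-- Reader role that DuckDB clients connect as (name illustrative)
CREATE ROLE lake_reader LOGIN;

-- Enable RLS on the catalog table holding table metadata
ALTER TABLE ducklake_table ENABLE ROW LEVEL SECURITY;

-- Illustrative policy: a reader only sees tables listed in a grants table
CREATE POLICY reader_visible_tables ON ducklake_table
    FOR SELECT TO lake_reader
    USING (table_id IN (
        SELECT table_id FROM acl_grants WHERE role_name = current_user
    ));
```

With a policy like this in place, a catalog query from the reader role simply returns no row for the customer table, so DuckDB never learns it exists.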

Grant and revoke access with simple commands:

# Grant a reader access to the customer table
SCHEMA=tpch TABLE=customer pixi run guard-acl-grant

# Revoke access
SCHEMA=tpch TABLE=customer pixi run guard-acl-revoke

# Show current grants
pixi run guard-acl-show
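
On the storage side, Layer 2 amounts to a bucket policy scoped to the granted table's key prefix. A hedged sketch — bucket name, prefix layout, and principal are placeholders, and the exact policy grammar depends on the S3 provider:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"AWS": ["arn:aws:iam:::user/lake-reader"]},
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::my-bucket/lake/tpch/customer/*"]
    }
  ]
}
```

Revoking a grant removes the matching statement, so even raw S3 access with the reader's credentials fails outside the granted prefixes.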

Local Development

Everything runs locally with MinIO as the S3 replacement:

# Clone and start
git clone https://github.com/KellerKev/ducklake-pixi.git
cd ducklake-pixi
cp .env.local .env
pixi run local-up    # starts PostgreSQL + MinIO, creates bucket

# Open a DuckDB shell connected to the lake
pixi run shell

-- Create a table — stored as Parquet in MinIO
CREATE TABLE my_lake.main.sensors (
    ts TIMESTAMP, device_id VARCHAR, reading DOUBLE
);
INSERT INTO my_lake.main.sensors VALUES
    (now(), 'temp-01', 22.5),
    (now(), 'temp-02', 23.1);
SELECT * FROM my_lake.main.sensors;

Load the TPC-H sample dataset (~86K rows, 8 tables) for testing access control:

pixi run guard-load-sample    # load TPC-H data
pixi run guard-pg-setup       # create reader role + RLS policies
pixi run guard-acl-grant      # grant access to specific tables
pixi run reader-shell         # open a restricted DuckDB session

Production on Hetzner

Deploy to Hetzner for under 10 euros a month:

| Component           | Hetzner Service              | Cost         |
|---------------------|------------------------------|--------------|
| PostgreSQL + DuckDB | CX22 VPS (2 vCPU, 4 GB RAM)  | ~5.50 EUR/mo |
| Object Storage      | Hetzner S3 (1 TB included)   | ~3.50 EUR/mo |
| **Total**           |                              | ~9 EUR/mo    |

# Configure production credentials
cp .env.prod.sample .env
# Edit .env with your Hetzner S3 credentials and PostgreSQL connection

# Create the bucket and initialize
pixi run prod-up

# Query production data
pixi run shell

The DuckLake catalog format is identical between local and production. You can develop locally with MinIO, then point to Hetzner S3 with a config change.
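
The "config change" is essentially the contents of .env. A sketch with placeholder values — variable names are illustrative, not necessarily the project's exact ones:

```shell
# .env for local development (MinIO)
S3_ENDPOINT=http://localhost:9000
S3_BUCKET=ducklake
S3_ACCESS_KEY=minioadmin

# .env for production (Hetzner Object Storage)
S3_ENDPOINT=https://fsn1.your-objectstorage.com
S3_BUCKET=ducklake
S3_ACCESS_KEY=<hetzner-access-key>
```

Everything else — the DuckLake catalog schema, the Parquet layout, the pixi tasks — stays the same.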


Pixi Task Reference

Everything is a pixi task — no Makefiles, no Docker Compose, no shell scripts to remember:

| Task                       | What It Does                           |
|----------------------------|----------------------------------------|
| pixi run local-up          | Start PostgreSQL + MinIO, create bucket |
| pixi run local-down        | Stop local services                    |
| pixi run shell             | Open admin DuckDB session              |
| pixi run reader-shell      | Open restricted DuckDB session         |
| pixi run status            | Health check all services              |
| pixi run prod-up           | Create Hetzner S3 bucket               |
| pixi run guard-load-sample | Load TPC-H dataset                     |
| pixi run guard-pg-setup    | Create reader role + RLS               |
| pixi run guard-acl-grant   | Grant table access                     |
| pixi run guard-acl-revoke  | Revoke table access                    |
| pixi run guard-acl-show    | Show current grants                    |

When to Use This

This setup makes sense when:

  • You need a lightweight data lake without the overhead of Iceberg/Delta Lake infrastructure
  • You want SQL-native catalog management rather than file-based metadata
  • You need row-level access control on a data lake without a commercial platform
  • You’re running on a budget — under 10 EUR/mo for a fully governed data lake
  • You want local/production parity with the same catalog format

It’s not a replacement for Snowflake or Databricks for large-scale production workloads. It’s a practical, governed data lake for analytics teams, side projects, and environments where simplicity and cost matter more than scale.


Kevin Keller (Author)
Personal blog about AI, Observability & Data Sovereignty. Snowflake-related articles explore the art of the possible and are not official Snowflake solutions or endorsed by Snowflake unless explicitly stated. Opinions are my own. Content is meant as educational inspiration, not production guidance.