I’ve recently been testing DuckLake, the new lakehouse product from the makers of DuckDB. DuckLake is a combination of:
- SQL metadata tables in PostgreSQL / SQLite / MySQL / DuckDB
- Parquet files stored either locally or in S3-compatible object storage.
I wanted to use DuckLake to track data from public sources and to keep even the rows which will eventually disappear from them. DuckLake has minimal time travel support too, so this looked like a nice tool for me to try out.
In this example we don’t actually use a public dataset, but you can replace the tables with a public JSON or CSV source.
In order to achieve this, let’s create a self-contained DuckLake with a bit of data in it. Start `duckdb` from the command line and run the following SQL statements:

```sql
INSTALL ducklake;
ATTACH 'ducklake:my_ducklake.ducklake' AS my_ducklake;

-- This table is stored in DuckLake
CREATE TABLE my_ducklake.my_timeseries (
    ts TIMESTAMP,
    id DOUBLE PRECISION,
    y DOUBLE PRECISION,
    is_deleted BOOLEAN
);

-- This table is stored in DuckDB's memory
CREATE TABLE my_timeseries_new AS
    FROM my_ducklake.my_timeseries LIMIT 0;

-- Let's insert some data into the tables
INSERT INTO my_ducklake.my_timeseries VALUES
    ('2025-09-14', 41, 39, FALSE),
    ('2025-09-15', 43, 39, FALSE),
    ('2025-09-16', 44, 40, FALSE);

INSERT INTO my_timeseries_new VALUES
    ('2025-09-14', 41, 39, FALSE),
    ('2025-09-15', 43, 20, FALSE),
    ('2025-09-17', 45, 20, FALSE);
```
The creators of DuckDB recommend using `MERGE INTO` statements for this, but as we will see below, naively merging everything into DuckLake creates a lot of duplicates in the Parquet files. That requires a lot of maintenance to compact the Parquet files afterwards, and I only wanted to store the changed rows instead of storing everything every single time.
You can read more about this in this GitHub issue.

Finding rows which don’t exist anymore in the new dataset
Let’s explore first how we can find and soft-delete the rows which have been deleted from the public data source, using an `ANTI JOIN`:

```sql
SELECT *
FROM my_ducklake.my_timeseries
ANTI JOIN my_timeseries_new USING (id);
```
```
┌─────────────────────┬────────┬────────┬────────────┐
│         ts          │   id   │   y    │ is_deleted │
│      timestamp      │ double │ double │  boolean   │
├─────────────────────┼────────┼────────┼────────────┤
│ 2025-09-16 00:00:00 │   44.0 │   40.0 │ false      │
└─────────────────────┴────────┴────────┴────────────┘
```
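The same rows could also be found with a plain `NOT IN` subquery. A minimal equivalent sketch (note that `NOT IN` behaves surprisingly when the subquery can return `NULL`s, which is one reason to prefer the `ANTI JOIN`):

```sql
-- Equivalent to the ANTI JOIN above, assuming id is never NULL
SELECT *
FROM my_ducklake.my_timeseries
WHERE id NOT IN (SELECT id FROM my_timeseries_new);
```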
In this scenario we don’t actually want to delete the rows completely; we have the separate `is_deleted` column for soft-deletes. Soft-deleting would then go like this:

```sql
MERGE INTO my_ducklake.my_timeseries USING (
    SELECT * REPLACE (TRUE AS is_deleted)
    FROM my_ducklake.my_timeseries
    ANTI JOIN my_timeseries_new USING (id)
) USING (id)
WHEN MATCHED THEN UPDATE;
```
See the important addition of `REPLACE (TRUE AS is_deleted)`. This will create a Parquet file whose rows duplicate the original values in full, with only the `is_deleted` column changed.
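If you want to sanity-check the source query before merging, you can run the inner `SELECT` on its own; on the example data it should return the `id = 44` row with `is_deleted` flipped to `true`:

```sql
-- Preview what the soft-delete merge will write
SELECT * REPLACE (TRUE AS is_deleted)
FROM my_ducklake.my_timeseries
ANTI JOIN my_timeseries_new USING (id);
```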
Updating changed rows and inserting new rows
It’s very easy to just write everything to the Parquet files if you don’t care about storage and duplicates (don’t run this):

```sql
MERGE INTO my_ducklake.my_timeseries USING my_timeseries_new USING (id)
WHEN MATCHED THEN UPDATE
WHEN NOT MATCHED THEN INSERT;
```
This creates rows in the new Parquet files even for the data which has not changed. See more in this DuckLake GitHub issue.
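The DuckLake extension also ships maintenance helpers for compacting small files afterwards. A hedged sketch from my reading of the DuckLake docs (verify the function name and signature against your version):

```sql
-- Compact small adjacent Parquet files into bigger ones
-- (assumed helper function; check the current DuckLake docs)
CALL ducklake_merge_adjacent_files('my_ducklake');
```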
DuckDB supports the useful SQL set operators `UNION`, `EXCEPT` and `INTERSECT`. Let’s use `EXCEPT` for removing the rows which are exactly the same in the old and the new data. Note that `SELECT *` is optional in DuckDB and one can just write `FROM` without it:

```sql
FROM my_timeseries_new;
```
```
┌─────────────────────┬────────┬────────┬────────────┐
│         ts          │   id   │   y    │ is_deleted │
│      timestamp      │ double │ double │  boolean   │
├─────────────────────┼────────┼────────┼────────────┤
│ 2025-09-14 00:00:00 │   41.0 │   39.0 │ false      │
│ 2025-09-15 00:00:00 │   43.0 │   20.0 │ false      │
│ 2025-09-17 00:00:00 │   45.0 │   20.0 │ false      │
└─────────────────────┴────────┴────────┴────────────┘
```

```sql
FROM my_ducklake.my_timeseries;
```
```
┌─────────────────────┬────────┬────────┬────────────┐
│         ts          │   id   │   y    │ is_deleted │
│      timestamp      │ double │ double │  boolean   │
├─────────────────────┼────────┼────────┼────────────┤
│ 2025-09-14 00:00:00 │   41.0 │   39.0 │ false      │
│ 2025-09-15 00:00:00 │   43.0 │   39.0 │ false      │
│ 2025-09-16 00:00:00 │   44.0 │   40.0 │ false      │
└─────────────────────┴────────┴────────┴────────────┘
```

```sql
FROM my_timeseries_new
EXCEPT
FROM my_ducklake.my_timeseries;
```
```
┌─────────────────────┬────────┬────────┬────────────┐
│         ts          │   id   │   y    │ is_deleted │
│      timestamp      │ double │ double │  boolean   │
├─────────────────────┼────────┼────────┼────────────┤
│ 2025-09-17 00:00:00 │   45.0 │   20.0 │ false      │
│ 2025-09-15 00:00:00 │   43.0 │   20.0 │ false      │
└─────────────────────┴────────┴────────┴────────────┘
```
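As a side note, `INTERSECT` gives the complement of that result, i.e. the rows which are identical in both tables; on this example data that would be only the `id = 41` row:

```sql
-- Rows that exist unchanged in both the old and the new data
FROM my_timeseries_new
INTERSECT
FROM my_ducklake.my_timeseries;
```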
So again we can feed the `EXCEPT` query into a `MERGE INTO` statement to update the data:

```sql
MERGE INTO my_ducklake.my_timeseries USING (
    FROM my_timeseries_new
    EXCEPT
    FROM my_ducklake.my_timeseries
) USING (id)
WHEN MATCHED THEN UPDATE
WHEN NOT MATCHED THEN INSERT;
```
Combining everything into one statement
When we run the two `MERGE INTO` statements separately, DuckLake writes two separate Parquet files. This is not ideal, so let’s combine them using a CTE and one more set operator:

```sql
WITH soft_delete_missing_rows AS (
    SELECT * REPLACE (TRUE AS is_deleted)
    FROM my_ducklake.my_timeseries
    ANTI JOIN my_timeseries_new USING (id)
), new_and_changed_rows AS (
    FROM my_timeseries_new
    EXCEPT
    FROM my_ducklake.my_timeseries
)
MERGE INTO my_ducklake.my_timeseries USING (
    FROM soft_delete_missing_rows
    UNION ALL
    FROM new_and_changed_rows
) USING (id)
WHEN MATCHED THEN UPDATE
WHEN NOT MATCHED THEN INSERT;
```
We have now run all of the updates and inserts we wanted. Running it like this we also avoid writing extra Parquet files to disk.
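To double-check the end result, we can query the table. Given the example data it should now contain the unchanged row, the updated row, the soft-deleted row and the newly inserted row:

```sql
FROM my_ducklake.my_timeseries ORDER BY id;
-- Expected result:
--   2025-09-14 | 41 | 39 | false   (unchanged)
--   2025-09-15 | 43 | 20 | false   (updated)
--   2025-09-16 | 44 | 40 | true    (soft-deleted)
--   2025-09-17 | 45 | 20 | false   (newly inserted)
```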
Let’s explore what kind of Parquet files it created:
```bash
$ tree -h my_ducklake.ducklake.files/
[  96]  my_ducklake.ducklake.files/
└── [  96]  main
    └── [ 192]  my_timeseries
        ├── [ 600]  ducklake-01998544-56e9-7ce1-ad74-d8c3ab612b93.parquet
        ├── [ 764]  ducklake-01998544-56f7-7370-8bbe-73753d691af4.parquet
        ├── [ 564]  ducklake-01998544-56f7-76c6-9cd8-e7dcfc0f19b6.parquet
        └── [ 822]  ducklake-01998544-56f8-7a8b-adbd-8e0d1e6f0b71-delete.parquet

3 directories, 4 files

$ cd my_ducklake.ducklake.files/main/my_timeseries

# This is the original data
$ duckdb -c "FROM 'ducklake-01998544-56e9-7ce1-ad74-d8c3ab612b93.parquet';"
┌─────────────────────┬───────┬─────────┬────────────┐
│         ts          │  id   │ another │ is_deleted │
│      timestamp      │ int64 │  int64  │  boolean   │
├─────────────────────┼───────┼─────────┼────────────┤
│ 2025-09-14 00:00:00 │    41 │      39 │ false      │
│ 2025-09-15 00:00:00 │    43 │      39 │ false      │
│ 2025-09-16 00:00:00 │    44 │      40 │ false      │
└─────────────────────┴───────┴─────────┴────────────┘

# These are all of the rows which changed in the original data
$ duckdb -c "FROM 'ducklake-01998544-56f7-7370-8bbe-73753d691af4.parquet';"
┌─────────────────────┬───────┬─────────┬────────────┬───────────────────────────┐
│         ts          │  id   │ another │ is_deleted │ _ducklake_internal_row_id │
│      timestamp      │ int64 │  int64  │  boolean   │           int64           │
├─────────────────────┼───────┼─────────┼────────────┼───────────────────────────┤
│ 2025-09-15 00:00:00 │    43 │      20 │ false      │                         1 │
│ 2025-09-16 00:00:00 │    44 │      40 │ true       │                         2 │
└─────────────────────┴───────┴─────────┴────────────┴───────────────────────────┘

# This is the newly inserted data
$ duckdb -c "FROM 'ducklake-01998544-56f7-76c6-9cd8-e7dcfc0f19b6.parquet';"
┌─────────────────────┬───────┬─────────┬────────────┐
│         ts          │  id   │ another │ is_deleted │
│      timestamp      │ int64 │  int64  │  boolean   │
├─────────────────────┼───────┼─────────┼────────────┤
│ 2025-09-17 00:00:00 │    45 │      20 │ false      │
└─────────────────────┴───────┴─────────┴────────────┘

# This contains the positions of the updated rows in the original Parquet file.
# Because we updated them, the original rows can be deleted
$ duckdb -c "FROM 'ducklake-01998544-56f8-7a8b-adbd-8e0d1e6f0b71-delete.parquet';"
┌────────────────────────────────────────────────────────────────────────────────────────────────┬───────┐
│                                            file_path                                            │  pos  │
│                                             varchar                                             │ int64 │
├────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
│ my_ducklake.ducklake.files/main/my_timeseries/ducklake-01998544-56e9-7ce1-ad74-d8c3ab612b93.…   │     1 │
│ my_ducklake.ducklake.files/main/my_timeseries/ducklake-01998544-56e9-7ce1-ad74-d8c3ab612b93.…   │     2 │
└────────────────────────────────────────────────────────────────────────────────────────────────┴───────┘
```
From this we can see that there are 3 types of Parquet files created by DuckLake:
- One for inserting new rows
- One for updating existing rows; this one has the extra `_ducklake_internal_row_id` column
- One for tracking the rows which have been deleted (with a `*-delete.parquet` suffix)
Pretty interesting at least for me!
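Since time travel was one of the reasons I picked DuckLake, it’s worth mentioning that you can also look at the table as it was before the merge. A quick sketch, assuming the `ducklake_snapshots()` function and the `AT (VERSION => ...)` syntax from the DuckLake docs (both may evolve between versions):

```sql
-- List the snapshots our statements created
FROM ducklake_snapshots('my_ducklake');

-- Read the table as of an earlier snapshot, e.g. the second one
FROM my_ducklake.my_timeseries AT (VERSION => 2);
```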
Conclusion
DuckLake seems to be a pretty nice tool for automatically storing things in Parquet files, where I as a developer don’t need to think too much about the storage; it just happens magically in the background.
Let me know if you find even better ways to build a similar workflow that only tracks the changes in the datasets 🦆!