I’ve recently been testing DuckLake, the new lakehouse product from the makers of DuckDB. DuckLake is a combination of:
- SQL metadata tables in PostgreSQL / SQLite / MySQL / DuckDB
- Parquet files stored either locally or in S3-compatible object storage.
I wanted to use DuckLake to track data from public sources and to keep even the rows which will eventually disappear from them. DuckLake has minimal time travel support too, so this looked like a nice tool for me to try out.
In this example we don’t actually use a public dataset, but you can replace the tables with a public JSON or CSV source.
In order to achieve this, let’s create a self-contained DuckLake with a bit of data in it. Start `duckdb` from the command line and run the following SQL statements:

```sql
INSTALL ducklake;
ATTACH 'ducklake:my_ducklake.ducklake' AS my_ducklake;

-- This table is stored in DuckLake
CREATE TABLE my_ducklake.my_timeseries (
    ts TIMESTAMP,
    id DOUBLE PRECISION,
    y DOUBLE PRECISION,
    is_deleted BOOLEAN
);

-- This table is stored in DuckDB's memory
CREATE TABLE my_timeseries_new AS
    FROM my_ducklake.my_timeseries LIMIT 0;

-- Let's insert some data into the tables
INSERT INTO my_ducklake.my_timeseries VALUES
    ('2025-09-14', 41, 39, FALSE),
    ('2025-09-15', 43, 39, FALSE),
    ('2025-09-16', 44, 40, FALSE);

INSERT INTO my_timeseries_new VALUES
    ('2025-09-14', 41, 39, FALSE),
    ('2025-09-15', 43, 20, FALSE),
    ('2025-09-17', 45, 20, FALSE);
```
The creators of DuckDB recommend using `MERGE INTO` statements for this, but as we will see below, naively merging everything into DuckLake creates a lot of duplicates in the Parquet files. That requires a lot of maintenance to compact the Parquet files afterwards, and I only wanted to store the changed rows instead of storing everything every single time.
You can read more about this in this GitHub issue.

Finding rows which don’t exist anymore in the new dataset
Let’s explore first how we can find and soft-delete the rows which have been deleted from the public data source, using an `ANTI JOIN`:

```sql
SELECT *
FROM my_ducklake.my_timeseries
ANTI JOIN my_timeseries_new USING (id);
```
```
┌─────────────────────┬────────┬────────┬────────────┐
│         ts          │   id   │   y    │ is_deleted │
│      timestamp      │ double │ double │  boolean   │
├─────────────────────┼────────┼────────┼────────────┤
│ 2025-09-16 00:00:00 │   44.0 │   40.0 │ false      │
└─────────────────────┴────────┴────────┴────────────┘
```
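The same rows could also be found with a plain `NOT IN` subquery. A minimal equivalent sketch (note that `NOT IN` behaves surprisingly when the subquery can return `NULL`s, which is one reason to prefer the `ANTI JOIN`):

```sql
-- Equivalent to the ANTI JOIN above, assuming id is never NULL
SELECT *
FROM my_ducklake.my_timeseries
WHERE id NOT IN (SELECT id FROM my_timeseries_new);
```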
In this scenario we don’t actually want to delete the rows completely; we have the separate `is_deleted` column for soft-deletes. Soft-deleting would then go like this:

```sql
MERGE INTO my_ducklake.my_timeseries USING (
    SELECT * REPLACE (TRUE AS is_deleted)
    FROM my_ducklake.my_timeseries
    ANTI JOIN my_timeseries_new USING (id)
) USING (id)
WHEN MATCHED THEN UPDATE;
```
See the important addition of `REPLACE (TRUE AS is_deleted)`. This will create a Parquet file whose rows duplicate the original values in full, with only the `is_deleted` column changed.
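If you want to sanity-check the source query before merging, you can run the inner `SELECT` on its own; on the example data it should return the `id = 44` row with `is_deleted` flipped to `true`:

```sql
-- Preview what the soft-delete merge will write
SELECT * REPLACE (TRUE AS is_deleted)
FROM my_ducklake.my_timeseries
ANTI JOIN my_timeseries_new USING (id);
```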
Updating changed rows and inserting new rows
It’s very easy to just write everything to the Parquet files if you don’t care about storage and duplicates (don’t run this):

```sql
MERGE INTO my_ducklake.my_timeseries USING my_timeseries_new USING (id)
WHEN MATCHED THEN UPDATE
WHEN NOT MATCHED THEN INSERT;
```
This creates rows in the new Parquet files even for the data which has not changed. See more in this DuckLake GitHub issue.
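The DuckLake extension also ships maintenance helpers for compacting small files afterwards. A hedged sketch from my reading of the DuckLake docs (verify the function name and signature against your version):

```sql
-- Compact small adjacent Parquet files into bigger ones
-- (assumed helper function; check the current DuckLake docs)
CALL ducklake_merge_adjacent_files('my_ducklake');
```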
DuckDB supports the useful SQL set operators `UNION`, `EXCEPT` and `INTERSECT`. Let’s use `EXCEPT` for removing the rows which are exactly the same in the old and the new data. Note that `SELECT *` is optional in DuckDB and one can just write `FROM` without it:

```sql
FROM my_timeseries_new;
```
```
┌─────────────────────┬────────┬────────┬────────────┐
│         ts          │   id   │   y    │ is_deleted │
│      timestamp      │ double │ double │  boolean   │
├─────────────────────┼────────┼────────┼────────────┤
│ 2025-09-14 00:00:00 │   41.0 │   39.0 │ false      │
│ 2025-09-15 00:00:00 │   43.0 │   20.0 │ false      │
│ 2025-09-17 00:00:00 │   45.0 │   20.0 │ false      │
└─────────────────────┴────────┴────────┴────────────┘
```

```sql
FROM my_ducklake.my_timeseries;
```
```
┌─────────────────────┬────────┬────────┬────────────┐
│         ts          │   id   │   y    │ is_deleted │
│      timestamp      │ double │ double │  boolean   │
├─────────────────────┼────────┼────────┼────────────┤
│ 2025-09-14 00:00:00 │   41.0 │   39.0 │ false      │
│ 2025-09-15 00:00:00 │   43.0 │   39.0 │ false      │
│ 2025-09-16 00:00:00 │   44.0 │   40.0 │ false      │
└─────────────────────┴────────┴────────┴────────────┘
```

```sql
FROM my_timeseries_new
EXCEPT
FROM my_ducklake.my_timeseries;
```
```
┌─────────────────────┬────────┬────────┬────────────┐
│         ts          │   id   │   y    │ is_deleted │
│      timestamp      │ double │ double │  boolean   │
├─────────────────────┼────────┼────────┼────────────┤
│ 2025-09-17 00:00:00 │   45.0 │   20.0 │ false      │
│ 2025-09-15 00:00:00 │   43.0 │   20.0 │ false      │
└─────────────────────┴────────┴────────┴────────────┘
```
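As a side note, `INTERSECT` gives the complement of that result, i.e. the rows which are identical in both tables; on this example data that would be only the `id = 41` row:

```sql
-- Rows that exist unchanged in both the old and the new data
FROM my_timeseries_new
INTERSECT
FROM my_ducklake.my_timeseries;
```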
So again we can feed the `EXCEPT` query into a `MERGE INTO` statement to update the data:

```sql
MERGE INTO my_ducklake.my_timeseries USING (
    FROM my_timeseries_new
    EXCEPT
    FROM my_ducklake.my_timeseries
) USING (id)
WHEN MATCHED THEN UPDATE
WHEN NOT MATCHED THEN INSERT;
```
Combining everything into one statement
When we run the two `MERGE INTO` statements separately, DuckLake writes two separate Parquet files. This is not ideal, so let’s combine them using a CTE and one more set operator:

```sql
WITH soft_delete_missing_rows AS (
    SELECT * REPLACE (TRUE AS is_deleted)
    FROM my_ducklake.my_timeseries
    ANTI JOIN my_timeseries_new USING (id)
), new_and_changed_rows AS (
    FROM my_timeseries_new
    EXCEPT
    FROM my_ducklake.my_timeseries
)
MERGE INTO my_ducklake.my_timeseries USING (
    FROM soft_delete_missing_rows
    UNION ALL
    FROM new_and_changed_rows
) USING (id)
WHEN MATCHED THEN UPDATE
WHEN NOT MATCHED THEN INSERT;
```
We have now run all of the updates and inserts we wanted. Running it like this we also avoid writing extra Parquet files to disk.
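To double-check the end result, we can query the table. Given the example data it should now contain the unchanged row, the updated row, the soft-deleted row and the newly inserted row:

```sql
FROM my_ducklake.my_timeseries ORDER BY id;
-- Expected result:
--   2025-09-14 | 41 | 39 | false   (unchanged)
--   2025-09-15 | 43 | 20 | false   (updated)
--   2025-09-16 | 44 | 40 | true    (soft-deleted)
--   2025-09-17 | 45 | 20 | false   (newly inserted)
```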
Let’s explore what kind of Parquet files it created:
```bash
$ tree -h my_ducklake.ducklake.files/
[  96]  my_ducklake.ducklake.files/
└── [  96]  main
    └── [ 192]  my_timeseries
        ├── [ 600]  ducklake-01998544-56e9-7ce1-ad74-d8c3ab612b93.parquet
        ├── [ 764]  ducklake-01998544-56f7-7370-8bbe-73753d691af4.parquet
        ├── [ 564]  ducklake-01998544-56f7-76c6-9cd8-e7dcfc0f19b6.parquet
        └── [ 822]  ducklake-01998544-56f8-7a8b-adbd-8e0d1e6f0b71-delete.parquet

3 directories, 4 files

$ cd my_ducklake.ducklake.files/main/my_timeseries

# This is the original data
$ duckdb -c "FROM 'ducklake-01998544-56e9-7ce1-ad74-d8c3ab612b93.parquet';"
┌─────────────────────┬───────┬─────────┬────────────┐
│         ts          │  id   │ another │ is_deleted │
│      timestamp      │ int64 │  int64  │  boolean   │
├─────────────────────┼───────┼─────────┼────────────┤
│ 2025-09-14 00:00:00 │    41 │      39 │ false      │
│ 2025-09-15 00:00:00 │    43 │      39 │ false      │
│ 2025-09-16 00:00:00 │    44 │      40 │ false      │
└─────────────────────┴───────┴─────────┴────────────┘

# These are all of the rows which changed in the original data
$ duckdb -c "FROM 'ducklake-01998544-56f7-7370-8bbe-73753d691af4.parquet';"
┌─────────────────────┬───────┬─────────┬────────────┬───────────────────────────┐
│         ts          │  id   │ another │ is_deleted │ _ducklake_internal_row_id │
│      timestamp      │ int64 │  int64  │  boolean   │           int64           │
├─────────────────────┼───────┼─────────┼────────────┼───────────────────────────┤
│ 2025-09-15 00:00:00 │    43 │      20 │ false      │                         1 │
│ 2025-09-16 00:00:00 │    44 │      40 │ true       │                         2 │
└─────────────────────┴───────┴─────────┴────────────┴───────────────────────────┘

# This is the newly inserted data
$ duckdb -c "FROM 'ducklake-01998544-56f7-76c6-9cd8-e7dcfc0f19b6.parquet';"
┌─────────────────────┬───────┬─────────┬────────────┐
│         ts          │  id   │ another │ is_deleted │
│      timestamp      │ int64 │  int64  │  boolean   │
├─────────────────────┼───────┼─────────┼────────────┤
│ 2025-09-17 00:00:00 │    45 │      20 │ false      │
└─────────────────────┴───────┴─────────┴────────────┘

# This contains the positions of the updated rows in the original Parquet file.
# Because we updated them, the original rows can be deleted
$ duckdb -c "FROM 'ducklake-01998544-56f8-7a8b-adbd-8e0d1e6f0b71-delete.parquet';"
┌────────────────────────────────────────────────────────────────────────────────────────────────┬───────┐
│                                            file_path                                            │  pos  │
│                                             varchar                                             │ int64 │
├────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
│ my_ducklake.ducklake.files/main/my_timeseries/ducklake-01998544-56e9-7ce1-ad74-d8c3ab612b93.…   │     1 │
│ my_ducklake.ducklake.files/main/my_timeseries/ducklake-01998544-56e9-7ce1-ad74-d8c3ab612b93.…   │     2 │
└────────────────────────────────────────────────────────────────────────────────────────────────┴───────┘
```
From this we can see that there are 3 types of Parquet files created by DuckLake:
- One for inserting new rows
- One for updating existing rows; this one has the extra `_ducklake_internal_row_id` column
- One for tracking the rows which have been deleted (with a `*-delete.parquet` suffix)
Pretty interesting at least for me!
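Since time travel was one of the reasons I picked DuckLake, it’s worth mentioning that you can also look at the table as it was before the merge. A quick sketch, assuming the `ducklake_snapshots()` function and the `AT (VERSION => ...)` syntax from the DuckLake docs (both may evolve between versions):

```sql
-- List the snapshots our statements created
FROM ducklake_snapshots('my_ducklake');

-- Read the table as of an earlier snapshot, e.g. the second one
FROM my_ducklake.my_timeseries AT (VERSION => 2);
```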
Conclusion
DuckLake seems to be a pretty nice tool for automatically storing things in Parquet files, where I as a developer don’t need to think too much about the storage; it just happens magically in the background.
Let me know if you find even better ways to build a similar workflow that only tracks the changes in the datasets 🦆!