Data engineering to find domains pointing to certain CNAMEs

Tags: Datasets, DuckDB, DNS
Published: March 6, 2025
Author: Onni Hakala
When you're using custom domains on web platforms you typically need to use CNAME records. For example, to use a custom domain on Vercel you need to create a CNAME record pointing to cname.vercel-dns.com. in your DNS. I recently wondered if there's a way to easily enumerate all sites pointing to cname.vercel-dns.com., and it turns out there is.
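For a single hostname you can of course just check its CNAME target with dig; a quick sketch, where www.example.com is only a placeholder:

# Show the CNAME target of one hostname (placeholder domain)
dig +short www.example.com CNAME

The interesting part is doing this in the other direction, for every domain at once.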

Merklemap DNS records database

Merklemap publishes a large dataset of DNS query results for a huge number of domains. It's available for non-commercial use to anyone who signs in to their platform.
Let's see how we could query this with my favorite data-mangling tool, DuckDB.

Downloading the dataset

This will require 500 GB of free disk space. Ensure you have enough space for the files:
if [ $(df --output=avail . | tail -1) -gt $((500 * 1024 * 1024)) ]; then
  echo "More than 500GB available"
else
  echo "Less than 500GB available"
fi
Then let's start downloading and decompressing the data simultaneously to save some time:
curl -L --continue-at - \
  "https://dns-records-database.merklemap.com/merklemap_dns_records_database_25_10_2024.jsonl.xz" \
  | xzcat > dataset.jsonl
I did this on a machine with a 1 Gbps connection and 12 threads available, and it took ~25 minutes.

Transforming the dataset for CNAMEs and Hostnames

Let's have a look at what the data looks like:
head dataset.jsonl -n1 | jq
{
  "hostname": "www.willylarsen.no",
  "results": [
    {
      "success": {
        "query": "www.willylarsen.no IN A",
        "query_timestamp": "2024-10-22T21:48:44.582445200Z",
        "records": {
          "A": [
            "194.63.248.52"
          ]
        }
      }
    },
    {
      "success": {
        "query": "www.willylarsen.no IN AAAA",
        "query_timestamp": "2024-10-22T21:48:44.582447640Z",
        "records": {
          "AAAA": [
            "2a01:5b40:0:248::52"
          ]
        }
      }
    }
  ]
}
The results contain either success or error values. It's pretty annoying to work with these nested JSON structures, and we want proper columns instead. Let's see what DuckDB and UNNEST can do. UNNEST is easiest to understand by feeding only a single JSON line into DuckDB:
head -n1 dataset.jsonl |\
  duckdb -c "SELECT UNNEST(results), hostname FROM read_json('/dev/stdin')"
What you should see in the output is that one line of JSONL produced two SQL rows:
unnest(results) | hostname
{'success': {'query': www.willylarsen.no IN A, 'query_timestamp': 2024-10-22T21:48:44.582445200Z, 'records': {'A': [194.63.248.52], 'AAAA': NULL}}} | www.willylarsen.no
{'success': {'query': www.willylarsen.no IN AAAA, 'query_timestamp': 2024-10-22T21:48:44.582447640Z, 'records': {'A': NULL, 'AAAA': [2a01:5b40:0:248::52]}}} | www.willylarsen.no
So UNNEST helps flatten arrays into multiple rows. This is very helpful because it simplifies our problem a lot, but it also means we need to create more subqueries. As you can see, even results→success→records→'A' holds its values as an array. But let's continue and learn the DuckDB syntax for JSON:
duckdb -c " SELECT result.success.records->'CNAME', hostname FROM ( SELECT UNNEST(results) as result, hostname FROM read_json('dataset.jsonl') ) WHERE result.success.records['CNAME'] IS NOT NULL LIMIT 1 " ┌─────────────────────────────────────┬───────────────────────┐ │ (result.success.records -> 'CNAME') │ hostname │ │ json │ varchar │ ├─────────────────────────────────────┼───────────────────────┤ │ ["77980.bodis.com."] │ all.visionexpress.pro │ └─────────────────────────────────────┴───────────────────────┘
So we have two almost similar ways to achieve the same thing: result.success.records->'CNAME' and result.success.records['CNAME'].
💡
The arrow operators -> and ->>, which are familiar from PostgreSQL, are supported in DuckDB, but they should be avoided because they convert the "native" DuckDB types into JSON, and then we need to use the json_* functions, which are more complicated and not as performant. So always use the DuckDB-native bracket retrieval syntax (['CNAME']) when you can.
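To make the difference concrete, here's a small sketch of my own that compares the two with typeof(); the bracket syntax should report a native list type while the arrow reports json:

duckdb -c "
SELECT
  typeof(result.success.records->'CNAME')  AS arrow_type,   -- expected: JSON
  typeof(result.success.records['CNAME'])  AS bracket_type  -- expected: a native list type such as VARCHAR[]
FROM (
  SELECT UNNEST(results) AS result
  FROM read_json('dataset.jsonl')
)
LIMIT 1
"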
The next thing we want to do is unnest the array inside the CNAME key as well.
duckdb -c " SELECT UNNEST(result.success.records['CNAME']) as cname, hostname FROM ( SELECT UNNEST(results) as result, hostname FROM read_json('dataset.jsonl') ) WHERE result.success.records['CNAME'] IS NOT NULL LIMIT 1 " ┌──────────────────┬───────────────────────┐ │ cname │ hostname │ │ varchar │ varchar │ ├──────────────────┼───────────────────────┤ │ 77980.bodis.com. │ all.visionexpress.pro │ └──────────────────┴───────────────────────┘
And BOOM, here we finally have a list of CNAMEs matched to domains. But this list still contains duplicates, so let's remove them by adding GROUP BY ALL. It can't be added to the query directly, so we need one more subquery or a common table expression. I'll use a CTE next:
duckdb -c " WITH cname_and_hostnames AS ( SELECT UNNEST(result.success.records['CNAME']) as cname, hostname FROM ( SELECT UNNEST(results) as result, hostname FROM read_json('dataset.jsonl') ) WHERE result.success.records['CNAME'] IS NOT NULL LIMIT 15 ) SELECT * FROM cname_and_hostnames GROUP BY ALL " ┌──────────────────────────────┬───────────────────────────────────┐ │ cname │ hostname │ │ varchar │ varchar │ ├──────────────────────────────┼───────────────────────────────────┤ │ ext-cust.squarespace.com. │ www.patrickcat.com │ │ pvriders.com. │ webmail.pvriders.com │ │ www.cti-lösungen.de. │ www6.xn--cti-lsungen-vfb.de │ │ proxy-ssl-geo.webflow.com. │ www.pixelgarden.ai │ │ cdn1.wixdns.net. │ www.randonneurssullylois.fr │ │ td-ccm-neg-87-45.wixdns.net. │ www.axecapital.bg │ │ cname.vercel-dns.com. │ www.usen.pro │ │ cdn1.wixdns.net. │ www.pleasantfinancialservices.com │ │ td-ccm-neg-87-45.wixdns.net. │ www.randonneurssullylois.fr │ │ proxy-ssl.webflow.com. │ www.pixelgarden.ai │ │ 77980.bodis.com. │ all.visionexpress.pro │ │ cdn1.wixdns.net. │ www.axecapital.bg │ │ td-ccm-neg-87-45.wixdns.net. │ www.pleasantfinancialservices.com │ ├──────────────────────────────┴───────────────────────────────────┤ │ 13 rows 2 columns │ └──────────────────────────────────────────────────────────────────┘
See how we limited the query to 15 results but got 13 back instead. This is because we are deduplicating the results. But now we are ready to start querying the data:
duckdb -c " WITH cname_and_hostnames AS ( SELECT UNNEST(result.success.records['CNAME']) as cname, hostname FROM ( SELECT UNNEST(results) as result, hostname FROM read_json('dataset.jsonl') LIMIT 500000 ) WHERE result.success.records['CNAME'] IS NOT NULL ) SELECT cname, hostname FROM cname_and_hostnames GROUP BY ALL "

Trying to read from the large dataset

But wait a minute! When we run this, it throws an error 😨:
Invalid Input Error: JSON transform error in file "dataset.jsonl", in line 39002: Object {"TLSA":["3 1 1 9ba436fb47163f51bcba94bbb9f9e5c38d... has unknown key "TLSA"
Try increasing 'sample_size', reducing 'maximum_depth', specifying 'columns', 'format' or 'records' manually, setting 'ignore_errors' to true, or setting 'union_by_name' to true when reading multiple files with a different structure.
This is because DuckDB tries to guess the typing of the JSON we are reading. The default is sample_size=20,480, which wasn't enough, and on row 39002 it encountered a new DNS record type, TLSA. Let's modify our input to read_json('dataset.jsonl', sample_size=100_000) and try again, but let's take a detour first.
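As a quick sanity check, something like DESCRIBE should show the schema DuckDB now infers without scanning the whole file, since it only binds the query against the sampled rows:

# Inspect the inferred schema using the larger sample
duckdb -c "DESCRIBE SELECT * FROM read_json('dataset.jsonl', sample_size=100_000)"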

Converting the JSONL to Parquet

Because the dataset is so large, we don't really want to scan through 418 GB every time we ask a question, so it's better to convert the whole thing into parquet, which will be much faster and more compact for the next rounds. In order to store the results in parquet we need to wrap our query in a COPY statement. There was also a nice comment on the DuckDB GitHub about speeding up JSONL-to-parquet conversion by writing one file per thread and discarding the order of the rows.
This time let's also time it to see how long it takes to process our 418 GB of data on my old Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz with 12 threads:
time duckdb -c " SET preserve_insertion_order=false; COPY ( WITH cname_and_hostnames AS ( SELECT UNNEST(result.success.records['CNAME']) as cname, hostname FROM ( SELECT UNNEST(results) as result,  hostname FROM read_json('dataset.jsonl', sample_size=100_000) ) WHERE result.success.records['CNAME'] IS NOT NULL ) SELECT * FROM cname_and_hostnames GROUP BY ALL ) TO 'hostnames-and-cnames' (FORMAT PARQUET, CODEC 'zstd', COMPRESSION_LEVEL 1, PER_THREAD_OUTPUT TRUE); " real 45m21.000s user 524m41.212s sys 13m31.101s
After this we have multiple files ready for our ad-hoc queries:
ls -Ss1pq --block-size=M hostnames-and-cnames
total 4200M
525M data_1.parquet
525M data_4.parquet
525M data_0.parquet
525M data_2.parquet
263M data_9.parquet
263M data_10.parquet
263M data_11.parquet
263M data_5.parquet
263M data_6.parquet
263M data_3.parquet
263M data_8.parquet
263M data_7.parquet
So let's check what our data looks like now:
duckdb -c "SELECT * FROM 'hostnames-and-cnames/*.parquet' LIMIT 1" ┌─────────────────────┬──────────────────────┐ │ cname │ hostname │ │ varchar[] │ varchar │ ├─────────────────────┼──────────────────────┤ │ [simple4marry.com.] │ www.simple4marry.com │ └─────────────────────┴──────────────────────┘
Oh man, we still need to UNNEST the data again 😞. Well, let's also combine the files together to have a simpler one-file setup:
time duckdb -c " COPY ( SELECT * FROM ( SELECT UNNEST(cname) as cname, hostname FROM 'hostnames-and-cnames/*.parquet' ) GROUP BY ALL ) TO 'hostnames-and-cnames.parquet' (FORMAT PARQUET, CODEC 'zstd'); " real 2m30.037s user 21m33.857s sys 0m31.440s
This results in a 4350M file that we can now query to answer our original question.

Querying the parquet to check CNAMEs

Now we finally have our data in a form we can query however we want. For example, how many domains are pointing to Vercel with a CNAME:
time duckdb -c " SELECT COUNT(*) FROM 'hostnames-and-cnames.parquet' WHERE cname = 'cname.vercel-dns.com.' " ┌──────────────┐ │ count_star() │ │ int64 │ ├──────────────┤ │ 778547 │ └──────────────┘ real 0m2.216s user 0m25.240s sys 0m0.758s
This still took quite a long time to answer, and I started to wonder whether I could somehow index the parquet file by cname. Binary search should be much faster, and when I asked my friend Aarni about this he pointed out that I should order the data before writing it into the parquet files. I hadn't ordered the rows at all previously, and this led me into the next rabbit hole.

Ordering the parquet to enhance queries and compression

So I added ORDER BY ALL to the parquet query above:
time duckdb -c " COPY ( SELECT * FROM ( SELECT UNNEST(cname) as cname, hostname FROM 'hostnames-and-cnames/*.parquet' ) GROUP BY ALL ORDER BY ALL ) TO 'hostnames-and-cnames-sorted.parquet' (FORMAT PARQUET, CODEC 'zstd'); " real 4m1.267s user 34m12.013s sys 2m52.093s
And when I queried it, it was 55x faster 🚀 📈:
time duckdb -c " SELECT COUNT(*) FROM 'hostnames-and-cnames-sorted.parquet' WHERE cname = 'cname.vercel-dns.com.' " ┌──────────────┐ │ count_star() │ │ int64 │ ├──────────────┤ │ 778547 │ └──────────────┘ real 0m0.040s user 0m0.053s sys 0m0.011s
What I also realized was that the file size was much smaller after ordering the data. Before ordering it was 4350M and after ordering 2341M.
💡
By ordering the data before exporting it to parquet you can significantly improve the compression, which can result in up to a 46% reduction in file size, and also speed up query times by up to 5500%.
This is because parquet splits the data into separate row groups, with a default size of 100_000 rows. Each group is compressed independently, so of course if many groups contain only one or two different cnames, compressing them is much more efficient. This is pretty great 🤩. After this I wondered what the ideal COMPRESSION_LEVEL for parquet would be, and I tested it with different values:
ZSTD Compression level | Parquet File size | Time spent with 12 threads
4                      | 2339M             | 4m 4.680s
7                      | 2099M             | 4m 32.578s
11                     | 2055M             | 5m 20.552s
16                     | 1897M             | 8m 32.919s
20                     | 1797M             | 14m 52.059s
💡
Conclusion: If you have excess compute time to spend, it's a good idea to use higher compression levels. Especially the jump from 4→7 seems like time well spent. If you're sharing your dataset with others on the internet, this will save a lot of bandwidth and storage for everyone involved.
You can also test altering the ROW_GROUP_SIZE parameter for parquet to get even smaller files. This depends on how your data is clustered. For this dataset the changes were only a few megabytes, so I didn't bother exploring it further.
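If you're curious why the sorted file answers point queries so quickly, DuckDB's parquet_metadata() table function should let you peek at the per-row-group min/max statistics that get used to skip row groups; something along these lines:

duckdb -c "
SELECT row_group_id,
       stats_min_value,
       stats_max_value,
       total_compressed_size
FROM parquet_metadata('hostnames-and-cnames-sorted.parquet')
WHERE path_in_schema = 'cname'
ORDER BY row_group_id
LIMIT 5
"

Because the file is sorted by cname, the min/max ranges of most row groups don't overlap, so a filter like cname = 'cname.vercel-dns.com.' only has to touch a handful of them.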

Finding most popular CNAME targets

time duckdb -c " SELECT cname, COUNT(*) as count FROM 'hostnames-and-cnames-sorted.parquet' GROUP BY cname ORDER BY count DESC LIMIT 30 " ┌──────────────────────────────────────────────────────────────────┬──────────┐ │ cname │ count │ │ varchar │ int64 │ ├──────────────────────────────────────────────────────────────────┼──────────┤ │ td-ccm-neg-87-45.wixdns.net. │ 10803597 │ │ cdn1.wixdns.net. │ 10778999 │ │ ext-sq.squarespace.com. │ 4728993 │ │ shops.myshopify.com. │ 3809256 │ │ 77980.bodis.com. │ 3117116 │ │ t1k-shield-92a6a36b6702b4ad.elb.us-east-1.amazonaws.com. │ 2729396 │ │ t1k.poynt.net. │ 2729396 │ │ paylinks.commerce.godaddy.com. │ 2729395 │ │ mps-mda-ic-openresty-prd-weu.trafficmanager.net. │ 1611935 │ │ 7450.bodis.com. │ 1451574 │ │ gcp-us-west1-1.origin.onrender.com.cdn.cloudflare.net. │ 1428675 │ │ gcp-us-west1-1.origin.onrender.com. │ 1428672 │ │ parkingpage.namecheap.com. │ 1407986 │ │ ghs.googlehosted.com. │ 1222992 │ │ ext-cust.squarespace.com. │ 1001407 │ │ cname.vercel-dns.com. │ 778547 │ │ 77026.bodis.com. │ 744565 │ │ pointing.wixdns.net. │ 633087 │ │ mps-mda-ic-openrestyadmindf-prd-weu.trafficmanager.net. │ 608328 │ │ mps-mda-ic-openrestydf-prd-weu.trafficmanager.net. │ 608325 │ │ d1-hitch-eu-nlb-e064e2845fd0c838.elb.eu-central-1.amazonaws.com. │ 599222 │ │ proxy-ssl-geo.webflow.com. │ 560207 │ │ proxy-ssl.webflow.com. │ 560199 │ │ expired.gname.net. │ 546273 │ │ prdp-1-nginxnode-main.trafficmanager.net. │ 521279 │ │ ukw-openresty.mcas.ms. │ 521279 │ │ shed.dual-low.s-part-0039.t-0009.t-msedge.net. │ 497971 │ │ global.multiscreensite.com. │ 463454 │ │ k8s-dp-prod-bl-1-ip.eastus.cloudapp.azure.com. │ 461731 │ │ k8connectatmeu.trafficmanager.net. │ 461731 │ ├──────────────────────────────────────────────────────────────────┴──────────┤ │ 30 rows 2 columns │ └─────────────────────────────────────────────────────────────────────────────┘ real 0m5.237s user 0m49.296s sys 0m3.995s
Well, this is pretty fantastic. We have 227 957 515 domains matched to CNAME records, and we can query them in about 5 seconds from a ~2 GB file.
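And to actually enumerate the domains behind a single target, which was the original question, a small COPY like this should do the trick (the CSV filename is just an example):

duckdb -c "
COPY (
  SELECT hostname
  FROM 'hostnames-and-cnames-sorted.parquet'
  WHERE cname = 'cname.vercel-dns.com.'
) TO 'vercel-hostnames.csv' (FORMAT CSV);
"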
This data is also pretty interesting. I knew that Wix & Squarespace are very popular, but I had no clue that there are 3.8 million different domains pointing to Shopify. Some of these domains are probably not working anymore, because the dataset also contains historical data and I didn't filter for only the latest successful results. Also, Render seems to be more popular than Vercel.

Using parquet for open datasets is awesome

I hope this was a good lesson on the powers of DuckDB and parquet. If the maintainers of Merklemap are reading this, please, please consider using something other than JSONL, because with a slightly cleverer design people could skip downloading most of the 41 GB and just query remote parquet files directly, which uses very little data.
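As a sketch of what that could look like, DuckDB's httpfs extension can query a remote parquet file over HTTP range requests and only fetch the row groups it needs; the URL here is of course made up:

duckdb -c "
INSTALL httpfs;
LOAD httpfs;
-- Hypothetical URL: only the needed row groups would be downloaded
SELECT COUNT(*)
FROM read_parquet('https://example.com/hostnames-and-cnames-sorted.parquet')
WHERE cname = 'cname.vercel-dns.com.';
"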