When you use a custom domain on a web platform, you typically need to set up a CNAME record. For example, to use a custom domain with Vercel you create a CNAME pointing to cname.vercel-dns.com. in your DNS. I recently wondered whether there’s an easy way to enumerate all sites pointing to cname.vercel-dns.com., and it turns out there is.

Merklemap DNS records database
Merklemap publishes a large dataset of DNS query results covering a huge number of domains. It’s available for non-commercial use to everybody who signs up on their platform.
Let’s see how we can query this with my favorite data mangling tool, duckdb.

Downloading the dataset
This will require about 500 GB of free disk space.
Ensure you have enough space for the files:
# df --output=avail reports 1K blocks, so this compares against 500 GiB
if [ $(df --output=avail . | tail -1) -gt $((500 * 1024 * 1024)) ]; then
  echo "More than 500GB available"
else
  echo "Less than 500GB available"
fi
Then let’s start downloading and decompressing the data simultaneously to save some time:
curl -L --continue-at - \
  "https://dns-records-database.merklemap.com/merklemap_dns_records_database_25_10_2024.jsonl.xz" \
  | xzcat > dataset.jsonl
I did this on a machine with a 1 Gbps connection and 12 threads available, and it took ~25 minutes.
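If decompression rather than the network turns out to be the bottleneck, xz can also decompress with multiple threads. This is just a sketch with a caveat: threaded decompression needs a reasonably recent xz and only helps if the archive was compressed in multiple blocks, which I haven’t verified for this file:

# same download, but let xz use all cores for decompression (falls back to single-threaded otherwise)
curl -L --continue-at - \
  "https://dns-records-database.merklemap.com/merklemap_dns_records_database_25_10_2024.jsonl.xz" \
  | xz --decompress --stdout --threads=0 > dataset.jsonl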
Transforming the dataset for CNAMEs and Hostnames
Let’s have a look at what the data looks like:
head dataset.jsonl -n1 | jq

{
  "hostname": "www.willylarsen.no",
  "results": [
    {
      "success": {
        "query": "www.willylarsen.no IN A",
        "query_timestamp": "2024-10-22T21:48:44.582445200Z",
        "records": {
          "A": [
            "194.63.248.52"
          ]
        }
      }
    },
    {
      "success": {
        "query": "www.willylarsen.no IN AAAA",
        "query_timestamp": "2024-10-22T21:48:44.582447640Z",
        "records": {
          "AAAA": [
            "2a01:5b40:0:248::52"
          ]
        }
      }
    }
  ]
}
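Before flattening anything, it can help to see what schema DuckDB infers from this. DESCRIBE works on any SELECT, so here is a quick sketch on a small sample (the inferred types will depend on which lines end up in the sample):

head -n1000 dataset.jsonl |\
  duckdb -c "DESCRIBE SELECT * FROM read_json('/dev/stdin')"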
There seems to be either a success or an error value in each entry of results. Nested JSON structures like this are pretty annoying to work with, and we want proper columns instead. Let’s see what duckdb and UNNEST can do. UNNEST is easiest to understand by feeding a single JSON line into DuckDB:

head -n1 dataset.jsonl |\
  duckdb -c "SELECT UNNEST(results), hostname FROM read_json('/dev/stdin')"
What you should see in the output is that the one line of JSONL became two SQL rows:

| unnest(results) | hostname |
| --- | --- |
| {'success': {'query': www.willylarsen.no IN A, 'query_timestamp': 2024-10-22T21:48:44.582445200Z, 'records': {'A': [194.63.248.52], 'AAAA': NULL}}} | www.willylarsen.no |
| {'success': {'query': www.willylarsen.no IN AAAA, 'query_timestamp': 2024-10-22T21:48:44.582447640Z, 'records': {'A': NULL, 'AAAA': [2a01:5b40:0:248::52]}}} | www.willylarsen.no |
So UNNEST flattens arrays into multiple rows. This is very helpful because it simplifies our problem a lot, but it also means we need more subqueries. As you can see, even results→success→records→’A’ contains its values as an array. But let’s continue learning DuckDB’s syntax for JSON:

duckdb -c "
  SELECT result.success.records->'CNAME', hostname
  FROM (
    SELECT UNNEST(results) as result, hostname
    FROM read_json('dataset.jsonl')
  )
  WHERE result.success.records['CNAME'] IS NOT NULL
  LIMIT 1
"

┌─────────────────────────────────────┬───────────────────────┐
│ (result.success.records -> 'CNAME') │       hostname        │
│                json                 │        varchar        │
├─────────────────────────────────────┼───────────────────────┤
│ ["77980.bodis.com."]                │ all.visionexpress.pro │
└─────────────────────────────────────┴───────────────────────┘
So we have two almost similar ways to achieve the same thing: result.success.records->'CNAME' and result.success.records['CNAME']. The arrow operators -> and ->>, familiar from PostgreSQL, are supported in DuckDB, but they should be avoided because they convert the “native” DuckDB types into json, after which we have to use the json_* functions, which are more complicated and not as performant. So always use the DuckDB-native bracket retrieval syntax when you can.

The next thing we want to do is to unnest the array stored under the CNAME key as well.

duckdb -c "
  SELECT UNNEST(result.success.records['CNAME']) as cname, hostname
  FROM (
    SELECT UNNEST(results) as result, hostname
    FROM read_json('dataset.jsonl')
  )
  WHERE result.success.records['CNAME'] IS NOT NULL
  LIMIT 1
"

┌──────────────────┬───────────────────────┐
│      cname       │       hostname        │
│     varchar      │        varchar        │
├──────────────────┼───────────────────────┤
│ 77980.bodis.com. │ all.visionexpress.pro │
└──────────────────┴───────────────────────┘
And BOOM, here we finally have a list of CNAMEs matched to hostnames. But this list still contains duplicates, so let’s remove them by adding GROUP BY ALL. It can’t be added to the query directly, so we need one more subquery or a common table expression. I’ll use a CTE next:

duckdb -c "
  WITH cname_and_hostnames AS (
    SELECT UNNEST(result.success.records['CNAME']) as cname, hostname
    FROM (
      SELECT UNNEST(results) as result, hostname
      FROM read_json('dataset.jsonl')
    )
    WHERE result.success.records['CNAME'] IS NOT NULL
    LIMIT 15
  )
  SELECT * FROM cname_and_hostnames
  GROUP BY ALL
"

┌──────────────────────────────┬───────────────────────────────────┐
│            cname             │             hostname              │
│           varchar            │              varchar              │
├──────────────────────────────┼───────────────────────────────────┤
│ ext-cust.squarespace.com.    │ www.patrickcat.com                │
│ pvriders.com.                │ webmail.pvriders.com              │
│ www.cti-lösungen.de.         │ www6.xn--cti-lsungen-vfb.de       │
│ proxy-ssl-geo.webflow.com.   │ www.pixelgarden.ai                │
│ cdn1.wixdns.net.             │ www.randonneurssullylois.fr       │
│ td-ccm-neg-87-45.wixdns.net. │ www.axecapital.bg                 │
│ cname.vercel-dns.com.        │ www.usen.pro                      │
│ cdn1.wixdns.net.             │ www.pleasantfinancialservices.com │
│ td-ccm-neg-87-45.wixdns.net. │ www.randonneurssullylois.fr       │
│ proxy-ssl.webflow.com.       │ www.pixelgarden.ai                │
│ 77980.bodis.com.             │ all.visionexpress.pro             │
│ cdn1.wixdns.net.             │ www.axecapital.bg                 │
│ td-ccm-neg-87-45.wixdns.net. │ www.pleasantfinancialservices.com │
├──────────────────────────────┴───────────────────────────────────┤
│ 13 rows                                                 2 columns │
└───────────────────────────────────────────────────────────────────┘
See how we limited the CTE to 15 results but only got 13 rows back? That’s because the duplicates were removed. Now we are ready to start querying the data:
duckdb -c " WITH cname_and_hostnames AS ( SELECT UNNEST(result.success.records['CNAME']) as cname, hostname FROM ( SELECT UNNEST(results) as result, hostname FROM read_json('dataset.jsonl') LIMIT 500000 ) WHERE result.success.records['CNAME'] IS NOT NULL ) SELECT cname, hostname FROM cname_and_hostnames GROUP BY ALL "
Trying to read from the large dataset
But wait a minute! When we run this, it throws an error 😨:
Invalid Input Error: JSON transform error in file "dataset.jsonl", in line 39002: Object {"TLSA":["3 1 1 9ba436fb47163f51bcba94bbb9f9e5c38d... has unknown key "TLSA"
Try increasing 'sample_size', reducing 'maximum_depth', specifying 'columns', 'format' or 'records' manually, setting 'ignore_errors' to true, or setting 'union_by_name' to true when reading multiple files with a different structure.
This is because DuckDB tries to guess the types of the JSON we are reading. The default is sample_size=20,480, which wasn’t enough here: on row 39002 it ran into the DNS record type TLSA for the first time. Let’s modify our input to read_json('dataset.jsonl', sample_size=100_000) and try again.
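Another way around this, instead of a bigger sample, would be to spell out the schema yourself so that record types DuckDB hasn’t sampled can’t break anything. This is only a sketch under a couple of assumptions: that read_json’s columns parameter accepts a MAP type for the records object, and that error is best treated as plain JSON (I haven’t inspected its shape):

duckdb -c "
  SELECT hostname
  FROM read_json('dataset.jsonl',
    columns={
      hostname: 'VARCHAR',
      results: 'STRUCT(
        success STRUCT(
          query VARCHAR,
          query_timestamp VARCHAR,
          records MAP(VARCHAR, VARCHAR[])
        ),
        error JSON
      )[]'
    })
  LIMIT 1
"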
But before rerunning anything, let’s take a small detour first.

Converting the JSONL to Parquet
Because the dataset is so large, we don’t really want to churn through 418 GB every time we ask a question, so it’s better to convert the whole thing into Parquet, which will be much faster and more compact for the next rounds. To store the results in Parquet we wrap our query in a COPY statement. There was also a nice comment on DuckDB’s GitHub about speeding up JSONL-to-Parquet conversion by allowing one output file per thread and discarding the order of the rows.

This time let’s also time it to see how long it takes to process our 418 GB of data on my old Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz with 12 threads:

time duckdb -c "
  SET preserve_insertion_order=false;
  COPY (
    WITH cname_and_hostnames AS (
      SELECT UNNEST(result.success.records['CNAME']) as cname, hostname
      FROM (
        SELECT UNNEST(results) as result, hostname
        FROM read_json('dataset.jsonl', sample_size=100_000)
      )
      WHERE result.success.records['CNAME'] IS NOT NULL
    )
    SELECT * FROM cname_and_hostnames
    GROUP BY ALL
  ) TO 'hostnames-and-cnames' (FORMAT PARQUET, CODEC 'zstd', COMPRESSION_LEVEL 1, PER_THREAD_OUTPUT TRUE);
"

real    45m21.000s
user    524m41.212s
sys     13m31.101s
After this we have multiple files ready for our ad-hoc queries:
ls -Ss1pq --block-size=M hostnames-and-cnames
total 4200M
525M data_1.parquet
525M data_4.parquet
525M data_0.parquet
525M data_2.parquet
263M data_9.parquet
263M data_10.parquet
263M data_11.parquet
263M data_5.parquet
263M data_6.parquet
263M data_3.parquet
263M data_8.parquet
263M data_7.parquet
So let’s check what our data looks like now:
duckdb -c "SELECT * FROM 'hostnames-and-cnames/*.parquet' LIMIT 1" ┌─────────────────────┬──────────────────────┐ │ cname │ hostname │ │ varchar[] │ varchar │ ├─────────────────────┼──────────────────────┤ │ [simple4marry.com.] │ www.simple4marry.com │ └─────────────────────┴──────────────────────┘
Oh man, we still need to UNNEST the data again 😞. Well, let’s also combine the files to get a simpler one-file setup:
time duckdb -c " COPY ( SELECT * FROM ( SELECT UNNEST(cname) as cname, hostname FROM 'hostnames-and-cnames/*.parquet' ) GROUP BY ALL ) TO 'hostnames-and-cnames.parquet' (FORMAT PARQUET, CODEC 'zstd'); " real 2m30.037s user 21m33.857s sys 0m31.440s
This results in a 4350 MB file we can now query to answer our original question.

Querying the parquet to check CNAMEs
Now we finally have our data in a shape where we can query whatever we want. For example, how many domains point to Vercel with a CNAME:
time duckdb -c " SELECT COUNT(*) FROM 'hostnames-and-cnames.parquet' WHERE cname = 'cname.vercel-dns.com.' " ┌──────────────┐ │ count_star() │ │ int64 │ ├──────────────┤ │ 778547 │ └──────────────┘ real 0m2.216s user 0m25.240s sys 0m0.758s
This still took quite a long time to answer, and I started to wonder whether I could somehow index the parquet file by cname. A binary search should be much faster, and when I asked my friend Aarni about this he pointed out that I should order the data before writing it into the parquet files. I hadn’t ordered it at all previously, and this led me into the next rabbit hole.

Ordering the parquet to enhance queries and compression
So I added ORDER BY ALL to the parquet query above:

time duckdb -c "
  COPY (
    SELECT * FROM (
      SELECT UNNEST(cname) as cname, hostname
      FROM 'hostnames-and-cnames/*.parquet'
    )
    GROUP BY ALL
    ORDER BY ALL
  ) TO 'hostnames-and-cnames-sorted.parquet' (FORMAT PARQUET, CODEC 'zstd');
"

real    4m1.267s
user    34m12.013s
sys     2m52.093s
And when I queried it, it was 55x faster 🚀 📈:

time duckdb -c "
  SELECT COUNT(*)
  FROM 'hostnames-and-cnames-sorted.parquet'
  WHERE cname = 'cname.vercel-dns.com.'
"

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│       778547 │
└──────────────┘

real    0m0.040s
user    0m0.053s
sys     0m0.011s
What I also realized was that the file size was much smaller after ordering the data: before ordering it was 4350 MB and after ordering 2341 MB.

By ordering the data before exporting it into Parquet you can significantly improve the compression, which here gave a 46% reduction in file size, and also speed up query times by up to 5500%.
This is because Parquet splits the data into separate row groups, around 100_000 rows each by default. Each group is compressed independently, so of course, if many groups contain only one or two distinct cnames, compressing them is much more efficient. This is pretty great 🤩.
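You can see this directly in the file’s own metadata: DuckDB’s parquet_metadata() function exposes per-row-group statistics, and on the sorted file the min/max of cname in each group should cover only a narrow slice of the data, which is what lets a filtered query skip most groups. A sketch (I’m assuming stats_min_value/stats_max_value are the readable columns here):

duckdb -c "
  SELECT row_group_id, row_group_num_rows, stats_min_value, stats_max_value
  FROM parquet_metadata('hostnames-and-cnames-sorted.parquet')
  WHERE path_in_schema = 'cname'
  LIMIT 5
"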
After this I wondered what the ideal COMPRESSION_LEVEL for the Parquet output would be, so I tested a few values:

| ZSTD compression level | Parquet file size | Time spent with 12 threads |
| --- | --- | --- |
| 4 | 2339M | 4m 4.680s |
| 7 | 2099M | 4m 32.578s |
| 11 | 2055M | 5m 20.552s |
| 16 | 1897M | 8m 32.919s |
| 20 | 1797M | 14m 52.059s |
Conclusion: if you have spare compute time, it’s a good idea to use higher compression levels. Especially the jump from 4→7 seems like time well spent. If you’re sharing your dataset with others on the internet, this saves a lot of bandwidth and storage for everyone involved.
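If you want to experiment with this yourself (including the row group size discussed next), both knobs are plain COPY options. A sketch only; the output filename and the values are arbitrary examples to tweak:

time duckdb -c "
  COPY (SELECT * FROM 'hostnames-and-cnames-sorted.parquet')
  TO 'hostnames-and-cnames-tuned.parquet'
  (FORMAT PARQUET, CODEC 'zstd', COMPRESSION_LEVEL 7, ROW_GROUP_SIZE 500000);
"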
You can also test altering the ROW_GROUP_SIZE parameter for Parquet (as in the sketch above) to get even smaller files. How much it helps depends on how your data is clustered; for this dataset the changes were only a few megabytes, so I didn’t bother exploring it further.

Finding most popular CNAME targets
time duckdb -c " SELECT cname, COUNT(*) as count FROM 'hostnames-and-cnames-sorted.parquet' GROUP BY cname ORDER BY count DESC LIMIT 30 " ┌──────────────────────────────────────────────────────────────────┬──────────┐ │ cname │ count │ │ varchar │ int64 │ ├──────────────────────────────────────────────────────────────────┼──────────┤ │ td-ccm-neg-87-45.wixdns.net. │ 10803597 │ │ cdn1.wixdns.net. │ 10778999 │ │ ext-sq.squarespace.com. │ 4728993 │ │ shops.myshopify.com. │ 3809256 │ │ 77980.bodis.com. │ 3117116 │ │ t1k-shield-92a6a36b6702b4ad.elb.us-east-1.amazonaws.com. │ 2729396 │ │ t1k.poynt.net. │ 2729396 │ │ paylinks.commerce.godaddy.com. │ 2729395 │ │ mps-mda-ic-openresty-prd-weu.trafficmanager.net. │ 1611935 │ │ 7450.bodis.com. │ 1451574 │ │ gcp-us-west1-1.origin.onrender.com.cdn.cloudflare.net. │ 1428675 │ │ gcp-us-west1-1.origin.onrender.com. │ 1428672 │ │ parkingpage.namecheap.com. │ 1407986 │ │ ghs.googlehosted.com. │ 1222992 │ │ ext-cust.squarespace.com. │ 1001407 │ │ cname.vercel-dns.com. │ 778547 │ │ 77026.bodis.com. │ 744565 │ │ pointing.wixdns.net. │ 633087 │ │ mps-mda-ic-openrestyadmindf-prd-weu.trafficmanager.net. │ 608328 │ │ mps-mda-ic-openrestydf-prd-weu.trafficmanager.net. │ 608325 │ │ d1-hitch-eu-nlb-e064e2845fd0c838.elb.eu-central-1.amazonaws.com. │ 599222 │ │ proxy-ssl-geo.webflow.com. │ 560207 │ │ proxy-ssl.webflow.com. │ 560199 │ │ expired.gname.net. │ 546273 │ │ prdp-1-nginxnode-main.trafficmanager.net. │ 521279 │ │ ukw-openresty.mcas.ms. │ 521279 │ │ shed.dual-low.s-part-0039.t-0009.t-msedge.net. │ 497971 │ │ global.multiscreensite.com. │ 463454 │ │ k8s-dp-prod-bl-1-ip.eastus.cloudapp.azure.com. │ 461731 │ │ k8connectatmeu.trafficmanager.net. │ 461731 │ ├──────────────────────────────────────────────────────────────────┴──────────┤ │ 30 rows 2 columns │ └─────────────────────────────────────────────────────────────────────────────┘ real 0m5.237s user 0m49.296s sys 0m3.995s
Well, this is pretty fantastic. We have 227 957 515 domains matched to CNAME records, and we can query them in 5 seconds from a ~2 GB file.

This data is also pretty interesting. I knew that Wix & Squarespace are very popular, but I had no clue that there are 3.8 million different domains pointing to Shopify. Some of these domains are probably not working anymore, because the dataset also contains historical data and I didn’t filter for the latest successful results. Also, Render seems to be more popular than Vercel.
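And to close the loop on the original question: actually enumerating every hostname behind a given CNAME is now a one-liner. Here is a sketch that dumps them into a CSV (the output filename is just an example):

duckdb -c "
  COPY (
    SELECT hostname
    FROM 'hostnames-and-cnames-sorted.parquet'
    WHERE cname = 'cname.vercel-dns.com.'
    ORDER BY hostname
  ) TO 'vercel-hostnames.csv' (FORMAT CSV);
"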
Using parquet for open datasets is awesome
I hope this was a good lesson on the powers of duckdb and parquet. If the maintainers of Merklemap are reading this: please, please consider publishing something other than JSONL, because with a slightly cleverer design people could skip downloading most of the 41 GB and instead query remote parquet files directly, which uses very little data.