When you use a custom domain on a web platform, you typically need to set up a CNAME record. For example, to use a custom domain with Vercel you create a CNAME pointing to cname.vercel-dns.com. in your DNS. I recently wondered whether there’s an easy way to enumerate all sites pointing to cname.vercel-dns.com., and it turns out there is.

Merklemap DNS records database
Merklemap publishes a large dataset of DNS query results covering a huge number of domains. It’s available for non-commercial use to everybody who signs up on their platform.
Let’s see how we can query this with my favorite data mangling tool, duckdb.

Downloading the dataset
This will require about 500 GB of free disk space.
Ensure you have enough space for the files:
# df --output=avail reports 1K blocks, so this compares against 500 GiB
if [ $(df --output=avail . | tail -1) -gt $((500 * 1024 * 1024)) ]; then
  echo "More than 500GB available"
else
  echo "Less than 500GB available"
fi
Then let’s start downloading and decompressing the data simultaneously to save some time:
curl -L --continue-at - \
  "https://dns-records-database.merklemap.com/merklemap_dns_records_database_25_10_2024.jsonl.xz" \
  | xzcat > dataset.jsonl
I did this on a machine with a 1 Gbps connection and 12 threads available, and it took ~25 minutes.
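If decompression rather than the network turns out to be the bottleneck, xz can also decompress with multiple threads. This is just a sketch with a caveat: threaded decompression needs a reasonably recent xz and only helps if the archive was compressed in multiple blocks, which I haven’t verified for this file:

# same download, but let xz use all cores for decompression (falls back to single-threaded otherwise)
curl -L --continue-at - \
  "https://dns-records-database.merklemap.com/merklemap_dns_records_database_25_10_2024.jsonl.xz" \
  | xz --decompress --stdout --threads=0 > dataset.jsonl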
Transforming the dataset for CNAMEs and Hostnames
Let’s have a look at what the data looks like:
head dataset.jsonl -n1 | jq

{
  "hostname": "www.willylarsen.no",
  "results": [
    {
      "success": {
        "query": "www.willylarsen.no IN A",
        "query_timestamp": "2024-10-22T21:48:44.582445200Z",
        "records": {
          "A": [
            "194.63.248.52"
          ]
        }
      }
    },
    {
      "success": {
        "query": "www.willylarsen.no IN AAAA",
        "query_timestamp": "2024-10-22T21:48:44.582447640Z",
        "records": {
          "AAAA": [
            "2a01:5b40:0:248::52"
          ]
        }
      }
    }
  ]
}
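Before flattening anything, it can help to see what schema DuckDB infers from this. DESCRIBE works on any SELECT, so here is a quick sketch on a small sample (the inferred types will depend on which lines end up in the sample):

head -n1000 dataset.jsonl |\
  duckdb -c "DESCRIBE SELECT * FROM read_json('/dev/stdin')"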
There seems to be either a success or an error value in each entry of results. Nested JSON structures like this are pretty annoying to work with, and we want proper columns instead. Let’s see what duckdb and UNNEST can do. UNNEST is easiest to understand by feeding a single JSON line into DuckDB:

head -n1 dataset.jsonl |\
  duckdb -c "SELECT UNNEST(results), hostname FROM read_json('/dev/stdin')"
What you should see in the output is that the one line of JSONL became two SQL rows:

| unnest(results) | hostname |
| --- | --- |
| {'success': {'query': www.willylarsen.no IN A, 'query_timestamp': 2024-10-22T21:48:44.582445200Z, 'records': {'A': [194.63.248.52], 'AAAA': NULL}}} | www.willylarsen.no |
| {'success': {'query': www.willylarsen.no IN AAAA, 'query_timestamp': 2024-10-22T21:48:44.582447640Z, 'records': {'A': NULL, 'AAAA': [2a01:5b40:0:248::52]}}} | www.willylarsen.no |
So UNNEST flattens arrays into multiple rows. This is very helpful because it simplifies our problem a lot, but it also means we need more subqueries. As you can see, even results→success→records→’A’ contains its values as an array. But let’s continue learning DuckDB’s syntax for JSON:

duckdb -c "
  SELECT result.success.records->'CNAME', hostname
  FROM (
    SELECT UNNEST(results) as result, hostname
    FROM read_json('dataset.jsonl')
  )
  WHERE result.success.records['CNAME'] IS NOT NULL
  LIMIT 1
"

┌─────────────────────────────────────┬───────────────────────┐
│ (result.success.records -> 'CNAME') │       hostname        │
│                json                 │        varchar        │
├─────────────────────────────────────┼───────────────────────┤
│ ["77980.bodis.com."]                │ all.visionexpress.pro │
└─────────────────────────────────────┴───────────────────────┘
So we have two almost similar ways to achieve the same thing: result.success.records->'CNAME' and result.success.records['CNAME']. The arrow operators -> and ->>, familiar from PostgreSQL, are supported in DuckDB, but they should be avoided because they convert the “native” DuckDB types into json, after which we have to use the json_* functions, which are more complicated and not as performant. So always use the DuckDB-native bracket retrieval syntax when you can.

The next thing we want to do is to unnest the array stored under the CNAME key as well.

duckdb -c "
  SELECT UNNEST(result.success.records['CNAME']) as cname, hostname
  FROM (
    SELECT UNNEST(results) as result, hostname
    FROM read_json('dataset.jsonl')
  )
  WHERE result.success.records['CNAME'] IS NOT NULL
  LIMIT 1
"

┌──────────────────┬───────────────────────┐
│      cname       │       hostname        │
│     varchar      │        varchar        │
├──────────────────┼───────────────────────┤
│ 77980.bodis.com. │ all.visionexpress.pro │
└──────────────────┴───────────────────────┘
And BOOM, here we finally have a list of CNAMEs matched to hostnames. But this list still contains duplicates, so let’s remove them by adding GROUP BY ALL. It can’t be added to the query directly, so we need one more subquery or a common table expression. I’ll use a CTE next:

duckdb -c "
  WITH cname_and_hostnames AS (
    SELECT UNNEST(result.success.records['CNAME']) as cname, hostname
    FROM (
      SELECT UNNEST(results) as result, hostname
      FROM read_json('dataset.jsonl')
    )
    WHERE result.success.records['CNAME'] IS NOT NULL
    LIMIT 15
  )
  SELECT * FROM cname_and_hostnames
  GROUP BY ALL
"

┌──────────────────────────────┬───────────────────────────────────┐
│            cname             │             hostname              │
│           varchar            │              varchar              │
├──────────────────────────────┼───────────────────────────────────┤
│ ext-cust.squarespace.com.    │ www.patrickcat.com                │
│ pvriders.com.                │ webmail.pvriders.com              │
│ www.cti-lösungen.de.         │ www6.xn--cti-lsungen-vfb.de       │
│ proxy-ssl-geo.webflow.com.   │ www.pixelgarden.ai                │
│ cdn1.wixdns.net.             │ www.randonneurssullylois.fr       │
│ td-ccm-neg-87-45.wixdns.net. │ www.axecapital.bg                 │
│ cname.vercel-dns.com.        │ www.usen.pro                      │
│ cdn1.wixdns.net.             │ www.pleasantfinancialservices.com │
│ td-ccm-neg-87-45.wixdns.net. │ www.randonneurssullylois.fr       │
│ proxy-ssl.webflow.com.       │ www.pixelgarden.ai                │
│ 77980.bodis.com.             │ all.visionexpress.pro             │
│ cdn1.wixdns.net.             │ www.axecapital.bg                 │
│ td-ccm-neg-87-45.wixdns.net. │ www.pleasantfinancialservices.com │
├──────────────────────────────┴───────────────────────────────────┤
│ 13 rows                                                 2 columns │
└───────────────────────────────────────────────────────────────────┘
See how we limited the CTE to 15 results but only got 13 rows back? That’s because the duplicates were removed. Now we are ready to start querying the data:
duckdb -c " WITH cname_and_hostnames AS ( SELECT UNNEST(result.success.records['CNAME']) as cname, hostname FROM ( SELECT UNNEST(results) as result, hostname FROM read_json('dataset.jsonl') LIMIT 500000 ) WHERE result.success.records['CNAME'] IS NOT NULL ) SELECT cname, hostname FROM cname_and_hostnames GROUP BY ALL "
Trying to read from the large dataset
But wait a minute! When we run this, it throws an error 😨:
Invalid Input Error: JSON transform error in file "dataset.jsonl", in line 39002: Object {"TLSA":["3 1 1 9ba436fb47163f51bcba94bbb9f9e5c38d... has unknown key "TLSA"
Try increasing 'sample_size', reducing 'maximum_depth', specifying 'columns', 'format' or 'records' manually, setting 'ignore_errors' to true, or setting 'union_by_name' to true when reading multiple files with a different structure.
This is because DuckDB tries to guess the types of the JSON we are reading. The default is sample_size=20,480, which wasn’t enough here: on row 39002 it ran into the DNS record type TLSA for the first time. Let’s modify our input to read_json('dataset.jsonl', sample_size=100_000) and try again.
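Another way around this, instead of a bigger sample, would be to spell out the schema yourself so that record types DuckDB hasn’t sampled can’t break anything. This is only a sketch under a couple of assumptions: that read_json’s columns parameter accepts a MAP type for the records object, and that error is best treated as plain JSON (I haven’t inspected its shape):

duckdb -c "
  SELECT hostname
  FROM read_json('dataset.jsonl',
    columns={
      hostname: 'VARCHAR',
      results: 'STRUCT(
        success STRUCT(
          query VARCHAR,
          query_timestamp VARCHAR,
          records MAP(VARCHAR, VARCHAR[])
        ),
        error JSON
      )[]'
    })
  LIMIT 1
"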
But before rerunning anything, let’s take a small detour first.

Converting the JSONL to Parquet
Because the dataset is so large, we don’t really want to churn through 418 GB every time we ask a question, so it’s better to convert the whole thing into Parquet, which will be much faster and more compact for the next rounds. To store the results in Parquet we wrap our query in a COPY statement. There was also a nice comment on DuckDB’s GitHub about speeding up JSONL-to-Parquet conversion by allowing one output file per thread and discarding the order of the rows.

This time let’s also time it to see how long it takes to process our 418 GB of data on my old Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz with 12 threads:

time duckdb -c "
  SET preserve_insertion_order=false;
  COPY (
    WITH cname_and_hostnames AS (
      SELECT UNNEST(result.success.records['CNAME']) as cname, hostname
      FROM (
        SELECT UNNEST(results) as result, hostname
        FROM read_json('dataset.jsonl', sample_size=100_000)
      )
      WHERE result.success.records['CNAME'] IS NOT NULL
    )
    SELECT * FROM cname_and_hostnames
    GROUP BY ALL
  ) TO 'hostnames-and-cnames' (FORMAT PARQUET, CODEC 'zstd', COMPRESSION_LEVEL 1, PER_THREAD_OUTPUT TRUE);
"

real    45m21.000s
user    524m41.212s
sys     13m31.101s
After this we have multiple files ready for our ad-hoc queries:
ls -Ss1pq --block-size=M hostnames-and-cnames
total 4200M
525M data_1.parquet
525M data_4.parquet
525M data_0.parquet
525M data_2.parquet
263M data_9.parquet
263M data_10.parquet
263M data_11.parquet
263M data_5.parquet
263M data_6.parquet
263M data_3.parquet
263M data_8.parquet
263M data_7.parquet
So let’s check what our data looks like now:
duckdb -c "SELECT * FROM 'hostnames-and-cnames/*.parquet' LIMIT 1" ┌─────────────────────┬──────────────────────┐ │ cname │ hostname │ │ varchar[] │ varchar │ ├─────────────────────┼──────────────────────┤ │ [simple4marry.com.] │ www.simple4marry.com │ └─────────────────────┴──────────────────────┘
Oh man, we still need to UNNEST the data again 😞. Well, let’s also combine the files to get a simpler one-file setup:
time duckdb -c " COPY ( SELECT * FROM ( SELECT UNNEST(cname) as cname, hostname FROM 'hostnames-and-cnames/*.parquet' ) GROUP BY ALL ) TO 'hostnames-and-cnames.parquet' (FORMAT PARQUET, CODEC 'zstd'); " real 2m30.037s user 21m33.857s sys 0m31.440s
This results in a 4350 MB file we can now query to answer our original question.

Querying the parquet to check CNAMEs
Now we finally have our data in a shape where we can query whatever we want. For example, how many domains point to Vercel with a CNAME:
time duckdb -c " SELECT COUNT(*) FROM 'hostnames-and-cnames.parquet' WHERE cname = 'cname.vercel-dns.com.' " ┌──────────────┐ │ count_star() │ │ int64 │ ├──────────────┤ │ 778547 │ └──────────────┘ real 0m2.216s user 0m25.240s sys 0m0.758s
This still took quite a long time to answer, and I started to wonder whether I could somehow index the parquet file by cname. A binary search should be much faster, and when I asked my friend Aarni about this he pointed out that I should order the data before writing it into the parquet files. I hadn’t ordered it at all previously, and this led me into the next rabbit hole.

Ordering the parquet to enhance queries and compression
So I added ORDER BY ALL to the parquet query above:

time duckdb -c "
  COPY (
    SELECT * FROM (
      SELECT UNNEST(cname) as cname, hostname
      FROM 'hostnames-and-cnames/*.parquet'
    )
    GROUP BY ALL
    ORDER BY ALL
  ) TO 'hostnames-and-cnames-sorted.parquet' (FORMAT PARQUET, CODEC 'zstd');
"

real    4m1.267s
user    34m12.013s
sys     2m52.093s
And when I queried it, it was 55x faster 🚀 📈:

time duckdb -c "
  SELECT COUNT(*)
  FROM 'hostnames-and-cnames-sorted.parquet'
  WHERE cname = 'cname.vercel-dns.com.'
"

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│       778547 │
└──────────────┘

real    0m0.040s
user    0m0.053s
sys     0m0.011s
What I also realized was that the file size was much smaller after ordering the data: before ordering it was 4350 MB and after ordering 2341 MB.

By ordering the data before exporting it into Parquet you can significantly improve the compression, which here gave a 46% reduction in file size, and also speed up query times by up to 5500%.
This is because Parquet splits the data into separate row groups, around 100_000 rows each by default. Each group is compressed independently, so of course, if many groups contain only one or two distinct cnames, compressing them is much more efficient. This is pretty great 🤩.
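You can see this directly in the file’s own metadata: DuckDB’s parquet_metadata() function exposes per-row-group statistics, and on the sorted file the min/max of cname in each group should cover only a narrow slice of the data, which is what lets a filtered query skip most groups. A sketch (I’m assuming stats_min_value/stats_max_value are the readable columns here):

duckdb -c "
  SELECT row_group_id, row_group_num_rows, stats_min_value, stats_max_value
  FROM parquet_metadata('hostnames-and-cnames-sorted.parquet')
  WHERE path_in_schema = 'cname'
  LIMIT 5
"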
After this I wondered what the ideal COMPRESSION_LEVEL for the Parquet output would be, so I tested a few values:

| ZSTD compression level | Parquet file size | Time spent with 12 threads |
| --- | --- | --- |
| 4 | 2339M | 4m 4.680s |
| 7 | 2099M | 4m 32.578s |
| 11 | 2055M | 5m 20.552s |
| 16 | 1897M | 8m 32.919s |
| 20 | 1797M | 14m 52.059s |
Conclusion: if you have spare compute time, it’s a good idea to use higher compression levels. Especially the jump from 4→7 seems like time well spent. If you’re sharing your dataset with others on the internet, this saves a lot of bandwidth and storage for everyone involved.
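If you want to experiment with this yourself (including the row group size discussed next), both knobs are plain COPY options. A sketch only; the output filename and the values are arbitrary examples to tweak:

time duckdb -c "
  COPY (SELECT * FROM 'hostnames-and-cnames-sorted.parquet')
  TO 'hostnames-and-cnames-tuned.parquet'
  (FORMAT PARQUET, CODEC 'zstd', COMPRESSION_LEVEL 7, ROW_GROUP_SIZE 500000);
"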
You can also test altering the ROW_GROUP_SIZE parameter for Parquet (as in the sketch above) to get even smaller files. How much it helps depends on how your data is clustered; for this dataset the changes were only a few megabytes, so I didn’t bother exploring it further.

Finding most popular CNAME targets
time duckdb -c " SELECT cname, COUNT(*) as count FROM 'hostnames-and-cnames-sorted.parquet' GROUP BY cname ORDER BY count DESC LIMIT 30 " ┌──────────────────────────────────────────────────────────────────┬──────────┐ │ cname │ count │ │ varchar │ int64 │ ├──────────────────────────────────────────────────────────────────┼──────────┤ │ td-ccm-neg-87-45.wixdns.net. │ 10803597 │ │ cdn1.wixdns.net. │ 10778999 │ │ ext-sq.squarespace.com. │ 4728993 │ │ shops.myshopify.com. │ 3809256 │ │ 77980.bodis.com. │ 3117116 │ │ t1k-shield-92a6a36b6702b4ad.elb.us-east-1.amazonaws.com. │ 2729396 │ │ t1k.poynt.net. │ 2729396 │ │ paylinks.commerce.godaddy.com. │ 2729395 │ │ mps-mda-ic-openresty-prd-weu.trafficmanager.net. │ 1611935 │ │ 7450.bodis.com. │ 1451574 │ │ gcp-us-west1-1.origin.onrender.com.cdn.cloudflare.net. │ 1428675 │ │ gcp-us-west1-1.origin.onrender.com. │ 1428672 │ │ parkingpage.namecheap.com. │ 1407986 │ │ ghs.googlehosted.com. │ 1222992 │ │ ext-cust.squarespace.com. │ 1001407 │ │ cname.vercel-dns.com. │ 778547 │ │ 77026.bodis.com. │ 744565 │ │ pointing.wixdns.net. │ 633087 │ │ mps-mda-ic-openrestyadmindf-prd-weu.trafficmanager.net. │ 608328 │ │ mps-mda-ic-openrestydf-prd-weu.trafficmanager.net. │ 608325 │ │ d1-hitch-eu-nlb-e064e2845fd0c838.elb.eu-central-1.amazonaws.com. │ 599222 │ │ proxy-ssl-geo.webflow.com. │ 560207 │ │ proxy-ssl.webflow.com. │ 560199 │ │ expired.gname.net. │ 546273 │ │ prdp-1-nginxnode-main.trafficmanager.net. │ 521279 │ │ ukw-openresty.mcas.ms. │ 521279 │ │ shed.dual-low.s-part-0039.t-0009.t-msedge.net. │ 497971 │ │ global.multiscreensite.com. │ 463454 │ │ k8s-dp-prod-bl-1-ip.eastus.cloudapp.azure.com. │ 461731 │ │ k8connectatmeu.trafficmanager.net. │ 461731 │ ├──────────────────────────────────────────────────────────────────┴──────────┤ │ 30 rows 2 columns │ └─────────────────────────────────────────────────────────────────────────────┘ real 0m5.237s user 0m49.296s sys 0m3.995s
Well, this is pretty fantastic. We have 227 957 515 domains matched to CNAME records, and we can query them in 5 seconds from a ~2 GB file.

This data is also pretty interesting. I knew that Wix & Squarespace are very popular, but I had no clue that there are 3.8 million different domains pointing to Shopify. Some of these domains are probably not working anymore, because the dataset also contains historical data and I didn’t filter for the latest successful results. Also, Render seems to be more popular than Vercel.
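And to close the loop on the original question: actually enumerating every hostname behind a given CNAME is now a one-liner. Here is a sketch that dumps them into a CSV (the output filename is just an example):

duckdb -c "
  COPY (
    SELECT hostname
    FROM 'hostnames-and-cnames-sorted.parquet'
    WHERE cname = 'cname.vercel-dns.com.'
    ORDER BY hostname
  ) TO 'vercel-hostnames.csv' (FORMAT CSV);
"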
Using parquet for open datasets is awesome
I hope this was a good lesson on the powers of duckdb and parquet. If the maintainers of Merklemap are reading this: please, please consider publishing something other than JSONL, because with a slightly cleverer design people could skip downloading most of the 41 GB and instead query remote parquet files directly, which uses very little data.