How to add 3 million real companies to your empty Cloudflare D1 database?

Tags: Startups, SaaS, Datasets
Published: April 23, 2024
Author: Onni Hakala

Getting started with an empty database for a new product

I have recently been building a new tool to help people find new workplaces. To make it easier to add current workplaces and to show related links for the jobs I have found, I needed a dataset that contains each company's name, main domain, and other interesting facts like social media links.
I found out that CrunchBase is perfect for this: they offer a very basic dataset for free, which is amazing. By importing this dataset into my application I can later build features like auto-complete and make it easier to match open jobs to actual companies. During this process I also learned that Cloudflare D1 doesn't support importing large (>500 MB) database dumps, but this can be worked around by splitting the dataset into smaller batches.
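
To give an idea of what the imported data makes possible, here is a minimal sketch of the kind of prefix query an auto-complete endpoint could run once the import is done. It assumes the crunchbase_companies table and the your-database name used in the steps below; the search term and LIMIT are only examples, not part of my actual application.

```bash
# Sketch only: prefix search over the imported companies for auto-complete
wrangler d1 execute your-database --remote \
  --command "SELECT name, domain FROM crunchbase_companies WHERE name LIKE 'Met%' ORDER BY name LIMIT 10"
```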

The steps I used with my MacBook

You probably need to change a few lines if you use something other than macOS.
  1. Create a CrunchBase user and obtain an API key under Integrations -> CrunchBase API. The basic API is free, which is amazing.
  2. You should now have a CrunchBase API key, which you can export on the command line:

     ```bash
     # Replace XXXXXX... with your key
     export CB_API_KEY=XXXXXX
     ```
  3. Download the organizations.csv and people.csv files from the CrunchBase CSV export:

     ```bash
     mkdir company-data
     cd company-data
     # Download the ODM export and extract it
     curl -o odm.tar.gz "https://api.crunchbase.com/odm/v4/odm.tar.gz?user_key=$CB_API_KEY"
     tar -xvf odm.tar.gz
     ```
  4. Install duckdb and jq, and run an example query against organizations.csv:

     ```bash
     brew install duckdb jq

     duckdb --json -c "SELECT * FROM 'organizations.csv' WHERE domain = 'meta.com'" | jq
     [
       {
         "uuid": "df662812-7f97-0b43-9d3e-12f64f504fbb",
         "name": "Meta",
         "type": "organization",
         "primary_role": "company",
         "cb_url": "https://www.crunchbase.com/organization/facebook?utm_source=crunchbase&utm_medium=export&utm_campaign=odm_csv",
         "domain": "meta.com",
         "homepage_url": "https://meta.com",
         "logo_url": "https://images.crunchbase.com/image/upload/t_cb-default-original/whm4ed1rrc8skbdi3biv",
         "facebook_url": "https://www.facebook.com/Meta",
         "twitter_url": "https://www.twitter.com/Meta",
         "linkedin_url": "https://www.linkedin.com/company/meta",
         "combined_stock_symbols": "nasdaq:META",
         "city": "Menlo Park",
         "region": "California",
         "country_code": "USA",
         "short_description": "Meta is a social technology company that enables people to connect, find communities, and grow businesses."
       }
     ]
     ```
  5. Optional: Remove extra details that are not needed and convert the large CSV into a ZSTD-compressed Parquet file:

     ```bash
     # Removes extra data to make this dataset smaller. You don't need to do this
     # and can just use `organizations.csv` directly as the source in step 6.
     duckdb -c "
       COPY (
           SELECT
               name,
               domain,
               short_description,
               combined_stock_symbols,
               REPLACE(SPLIT_PART(cb_url, '?', 1), 'https://www.crunchbase.com/organization/', '') AS crunchbase_id,
               REPLACE(logo_url, 'https://images.crunchbase.com/image/upload/t_cb-default-original/', '') AS crunchbase_logo_key,
               RTRIM(REGEXP_REPLACE(facebook_url, 'https?://www.facebook.com/', ''), '/') AS facebook_page_name,
               SPLIT_PART(REGEXP_REPLACE(linkedin_url, 'https?://www.linkedin.com/company/', ''), '/', 1) AS linkedin_company_page,
               REGEXP_REPLACE(twitter_url, 'https?://(www\.)?twitter.com/', '') AS twitter_username,
               city,
               region,
               country_code
           FROM 'organizations.csv'
           WHERE linkedin_url IS NOT NULL
           AND primary_role = 'company'
       ) TO 'companies-min.zstd.parquet'
       (FORMAT PARQUET, CODEC 'zstd');
     "
     ```
  6. Convert the Parquet file into an SQLite database and name the table crunchbase_companies:

     ```bash
     # Convert the dataset into a SQLite database
     duckdb -c "
       INSTALL sqlite;
       LOAD sqlite;
       ATTACH 'companies.db' (TYPE SQLITE);
       USE companies;
       CREATE TABLE crunchbase_companies AS FROM 'companies-min.zstd.parquet';
     "

     # Dump the SQLite file into plain SQL (the transaction and PRAGMA statements
     # it contains are stripped out in the next step)
     sqlite3 companies.db .dump > companies.sql
     ```
  7. Install GNU coreutils and split the file into multiple files. This also removes all SQL transactions, because Cloudflare D1 doesn't support them out of the box. I had a few problems with Cloudflare D1 when I tried to import the 1.1 GB dataset directly, and splitting it into separate files helped:

     ```bash
     brew install coreutils

     # Put all CREATE TABLE statements into a separate file
     grep -E '^CREATE TABLE' companies.sql > create-tables.sql

     # Split the INSERT INTO statements into multiple files of 100k lines each
     grep -E '^INSERT INTO' companies.sql | gsplit -l 100000 --additional-suffix=.sql -d - insert-into-
     ```
  8. Import the SQL that creates all of the tables first:

     ```bash
     wrangler d1 create your-database
     wrangler d1 execute your-database --yes --remote --file=create-tables.sql
     ```
  9. Import all of the created data into the database in separate processes using xargs (a small verification sketch follows after this list):

     ```bash
     find . -name 'insert-into-*.sql' | xargs -P $(sysctl -n hw.ncpu) -I{} wrangler d1 execute your-database --yes --remote --file={}
     ```
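
Once all the parallel imports have finished, a quick sanity check is to compare the number of INSERT statements generated locally (sqlite3 .dump writes one per row) with the row count that ends up in D1. The commands below are only a sketch: they reuse the database and table names from the steps above and wrangler's --command flag for running a one-off query.

```bash
# Sketch only: verify that every generated row ended up in D1
grep -c '^INSERT INTO' companies.sql
wrangler d1 execute your-database --remote \
  --command "SELECT COUNT(*) AS row_count FROM crunchbase_companies"
```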

Conclusion

I hope this is useful for others who want to import company data or who run into the import size limitations of Cloudflare D1 💪. If you know a better way to upload large datasets, or other great public datasets about companies, let me know.