Getting started with an empty database for a new product
I have recently been building a new tool that helps people find new workplaces. To make it easier to add current workplaces and show related links for the jobs I have found, I needed a dataset that contains each company's name, main domain, and other interesting facts such as social media links.
I found out that CrunchBase is perfect for this, and they offer a very basic data set for free, which is amazing. By importing this dataset into my application I can later build features like auto-complete and make it easier to match open jobs to the actual companies.
During this process I also learned that Cloudflare D1 doesn't support importing large (>500 MB) database dumps, but this can be worked around by splitting the dump into smaller batches.
The steps I used with my MacBook
You will probably need to change a few lines if you use something other than macOS.
- Create a CrunchBase user and obtain an API key under Integrations -> CrunchBase API. The basic API is free, which is amazing.
- You should now have a CrunchBase API key, and you can export it on the command line:
```sh
# Replace XXXXXX... with your key
export CB_API_KEY=XXXXXX
```
- Download the `organizations.csv` and `people.csv` files from the CrunchBase CSV export:

```sh
mkdir company-data
cd company-data
curl -O "https://api.crunchbase.com/odm/v4/odm.tar.gz?user_key=$CB_API_KEY"
tar -xvf odm.tar.gz
```
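If you want to sanity check the download before moving on, you can peek at the extracted files. This is just an optional inspection; the sizes and row counts depend on the current export:

```sh
# List the extracted files and their sizes
ls -lh organizations.csv people.csv

# Print the CSV header row to see which columns are available
head -n 1 organizations.csv
```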
- Install `duckdb` and `jq`, and query the `organizations.csv` for an example:

```sh
brew install duckdb jq

duckdb --json -c "SELECT * FROM 'organizations.csv' WHERE domain = 'meta.com'" | jq
```

```json
[
  {
    "uuid": "df662812-7f97-0b43-9d3e-12f64f504fbb",
    "name": "Meta",
    "type": "organization",
    "primary_role": "company",
    "cb_url": "https://www.crunchbase.com/organization/facebook?utm_source=crunchbase&utm_medium=export&utm_campaign=odm_csv",
    "domain": "meta.com",
    "homepage_url": "https://meta.com",
    "logo_url": "https://images.crunchbase.com/image/upload/t_cb-default-original/whm4ed1rrc8skbdi3biv",
    "facebook_url": "https://www.facebook.com/Meta",
    "twitter_url": "https://www.twitter.com/Meta",
    "linkedin_url": "https://www.linkedin.com/company/meta",
    "combined_stock_symbols": "nasdaq:META",
    "city": "Menlo Park",
    "region": "California",
    "country_code": "USA",
    "short_description": "Meta is a social technology company that enables people to connect, find communities, and grow businesses."
  }
]
```
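Before deciding which columns to keep in the next step, it can be handy to let DuckDB describe the schema it infers from the CSV. An optional inspection sketch:

```sh
# Show the inferred column names and types of the export
duckdb -c "DESCRIBE SELECT * FROM 'organizations.csv'"

# Count how many rows the export currently contains
duckdb -c "SELECT COUNT(*) AS organizations FROM 'organizations.csv'"
```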
- Optional: Remove extra details that are not needed and convert the large CSV into a ZSTD-compressed Parquet file:
```sh
# Removes extra data to make this data set smaller. You don't need to do this
# and can just directly use the `organizations.csv` as source in the step 6.
duckdb -c "
COPY (
  SELECT
    name,
    domain,
    short_description,
    combined_stock_symbols,
    REPLACE(SPLIT_PART(cb_url, '?', 1), 'https://www.crunchbase.com/organization/', '') AS crunchbase_id,
    REPLACE(logo_url, 'https://images.crunchbase.com/image/upload/t_cb-default-original/', '') AS crunchbase_logo_key,
    RTRIM(REGEXP_REPLACE(facebook_url, 'https?://www.facebook.com/', ''), '/') AS facebook_page_name,
    SPLIT_PART(REGEXP_REPLACE(linkedin_url, 'https?://www.linkedin.com/company/', ''), '/', 1) AS linkedin_company_page,
    REGEXP_REPLACE(twitter_url, 'https?://(www\.)?twitter.com/', '') AS twitter_username,
    city,
    region,
    country_code
  FROM 'organizations.csv'
  WHERE linkedin_url IS NOT NULL
    AND primary_role = 'company'
) TO 'companies-min.zstd.parquet' (FORMAT PARQUET, CODEC 'zstd');
"
```
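To check that the conversion worked, you can compare the file sizes and count the rows in the new Parquet file. A quick sanity check, assuming the file names used above:

```sh
# The compressed Parquet file should be considerably smaller than the CSV
ls -lh organizations.csv companies-min.zstd.parquet

# One row per company that has a LinkedIn URL
duckdb -c "SELECT COUNT(*) AS companies FROM 'companies-min.zstd.parquet'"
```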
- Convert the Parquet file into a SQLite database and name the table `crunchbase_companies`:
```sh
# Convert dataset into sqlite database
duckdb -c "
INSTALL sqlite;
LOAD sqlite;
ATTACH 'companies.db' (TYPE SQLITE);
USE companies;
CREATE TABLE crunchbase_companies AS FROM 'companies-min.zstd.parquet';
"

# Dump the sqlite database into a plain SQL file (the transactions and PRAGMA
# commands it contains get stripped out in the next step)
sqlite3 companies.db .dump > companies.sql
```
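Before splitting the dump, it is worth confirming that the table actually made it into the SQLite file. A minimal check using the sqlite3 CLI that ships with macOS:

```sh
# Confirm the table exists and has rows in the SQLite database
sqlite3 companies.db "SELECT COUNT(*) FROM crunchbase_companies;"

# Check the size of the generated SQL dump
ls -lh companies.sql
```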
- Install GNU coreutils and split the file into multiple files. This also removes all SQL transactions, because Cloudflare D1 doesn't support them out of the box. I had a few problems with Cloudflare D1 when I tried to import the 1.1 GB dataset directly, and splitting it into separate files helped:
```sh
brew install coreutils

# Take all table creating commands into a separate file
grep -E '^CREATE TABLE' companies.sql > create-tables.sql

# Split all INSERT INTO commands into multiple separate files with 100k lines each
grep -E '^INSERT INTO' companies.sql | gsplit -l 100000 --additional-suffix=.sql -d - insert-into-
```
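You can verify that the split did not drop anything by comparing the chunk line counts with the number of INSERT statements in the dump. A rough sanity check, assuming the chunk naming produced by gsplit above:

```sh
# List the generated chunk files (insert-into-00.sql, insert-into-01.sql, ...)
ls insert-into-*.sql

# The chunk line counts should add up to the number of INSERT statements
wc -l insert-into-*.sql | tail -n 1
grep -c '^INSERT INTO' companies.sql
```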
- Import the SQL that creates all of the tables first:
```sh
wrangler d1 create your-database
wrangler d1 execute your-database --yes --remote --file=create-tables.sql
```
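If you want to confirm that the schema landed in D1 before starting the big import, you can list the remote tables with an ad-hoc query. This assumes a recent Wrangler version that supports the `--command` flag, and the database name used above:

```sh
# List the tables that now exist in the remote D1 database
wrangler d1 execute your-database --remote --command "SELECT name FROM sqlite_master WHERE type='table'"
```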
- Import all of the generated data into the database in parallel processes using `xargs`:
```sh
find . -name 'insert-into-*.sql' | xargs -P $(sysctl -n hw.ncpu) -I{} wrangler d1 execute your-database --yes --remote --file={}
```
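Once all chunks have been imported, you can verify the row count and try the kind of prefix lookup that an auto-complete feature could run later. A sketch using the same database and table names as above:

```sh
# Check how many companies ended up in the remote database
wrangler d1 execute your-database --remote --command "SELECT COUNT(*) AS companies FROM crunchbase_companies"

# Example of the kind of prefix query an auto-complete endpoint could use
wrangler d1 execute your-database --remote --command "SELECT name, domain FROM crunchbase_companies WHERE name LIKE 'Meta%' LIMIT 10"
```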
Conclusion
I hope this will be useful for others who want to import company data or who run into the import size limitations of Cloudflare D1 💪. If you know a better way to upload a large dataset, or other great public datasets of company data, let me know.