This post is part of a series about Mass Layoff Statistics data collected by the Bureau of Labor Statistics. Check out the main post here.
Working with obscure federal government data can be tricky. There are always plenty of gotchas. This Mass Layoff Statistics data is certainly no exception. It’s downloadable as a series of tab-delimited text files. As you’ll see later, some of the tables have column names and some don’t. Luckily there’s some documentation included in the data release.
Downloading the data
The final data from this program resides at this url. At the time of writing, the files cannot be downloaded programmatically; the BLS appears to have some sort of protection against scripted requests to the locations where the tables live. For this reason, I downloaded each text file manually and put them in a folder. I’ve included a zip file here in case the original data source ever goes away.
Reading the data
This data is heavily normalized and designed to be used in a database. There are a bunch of coded columns and lookup tables, which is pretty typical of this kind of data: it was produced at a time when every kilobyte mattered for storage costs. It does add a bit of complexity to reading the data, though.
For the task of reading and storing this data, I’ve decided to use DuckDB. DuckDB is excellent at reading CSVs and other structured text files. It can even read them in place, without importing them into a database first, which is super powerful.
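As a minimal sketch of that in-place querying (the file is one of the MLS files listed later in this post, and the exact path is my assumption):

import duckdb

# query a tab-delimited text file directly, with no import step or database file needed
duckdb.sql("SELECT * FROM read_csv('./raw_mls/ml.series') LIMIT 5").show()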
Let’s start by importing DuckDB and connecting to a database.

import duckdb

# connect to a local DuckDB database
quack = duckdb.connect("mls.db")
Now let’s get an inventory of the text files.
import os

# folder containing the MLS text files
folder = "./raw_mls"
all_text_files = os.listdir(folder)
Now we can iterate through each of the text files and use DuckDB to read the data.
for text_file in all_text_files:
    print(text_file)

    # create a full path for the text file
    text_file_path = os.path.join(folder, text_file)

    # format a sql statement to read from the text file
    sql = f"""
    SELECT
        *
    FROM
        read_csv('{text_file_path}')
    """

    # execute the sql
    query_result = quack.sql(sql)

    # print the column names that DuckDB found
    print(query_result.columns)
ml.contacts
['Mass Layoff Statistics Contacts']
ml.data.0.Current
['series_id', 'year', 'period', 'value', 'footnote_codes']
ml.data.1.AllData
['series_id', 'year', 'period', 'value', 'footnote_codes']
ml.dataelement
['003', 'Layoff events']
ml.dataseries
['M', 'Mass Layoff']
ml.footnote
['footnote_code', 'footnote_text']
ml.industrybase
['N', 'Industry data from 1995 to present (NAICS)']
ml.irc
['irc_code', 'irc_text', 'display_level', 'selectable', 'sort_sequence']
ml.period
['Q01', 'QTR1', '1st Quarter']
ml.series
['series_id', 'dataseries_code', 'srd_code', 'industryb_code', 'irc_code', 'dataelement_code', 'footnote_codes', 'begin_year', 'begin_period', 'end_year', 'end_period']
ml.srd
['column0', 'column1', 'column2', 'column3', 'column4']
The query result object from DuckDB has a lot of interesting properties. One that we can use is the .columns property. This will let us know if each table has a header or not.
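If you want to double-check what DuckDB inferred for a single file, a DESCRIBE over the same read_csv call will show the detected column names and types. A quick sketch (the path follows the folder layout assumed above):

# inspect the inferred column names and types for one file
print(quack.sql("DESCRIBE SELECT * FROM read_csv('./raw_mls/ml.series')"))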
Writing to a database
As you probably noticed, some of the text files don’t have column names or headers. You can see that DuckDB read the first row and assumed it was the header. In those cases, we’ll have to specify column names ourselves. We can create a dictionary of known column names based on the documentation that comes with the MLS data. Note that the srd table actually has five columns, but only two are described in the documentation.
table_columns = {
    "dataelement": ["dataelement_code", "dataelement_text"],
    "dataseries": ["dataseries_code", "dataseries_text"],
    "industrybase": ["industrybase_code", "industrybase_text"],
    "period": ["period", "period_abbr", "period_name"],
    "srd": ["srd_code", "srd_text", "col1", "col2", "col3"]
}
Now that we’ve created this dictionary with table names and columns, we can pass those columns into our sql statement where necessary. We’ll also modify the sql statement to create a table in our database instead of just querying the data.
In the DuckDB read_csv function, we can use the following parameters (a short standalone example follows the list):

- sample_size - this is helpful for determining data types. The more rows the function can sample, the more likely it is to guess the right data type for each column.
- header - this is true by default. That’s great for tables that have a header and column names. For the tables that don’t, we’ll have to set this to False.
- names - when header is set to False, we can use this to provide column names.
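Here is that standalone sketch, reading one of the header-less files with the period columns from the dictionary above (the path is my assumption based on the folder layout):

# read a header-less file with explicit column names
preview = quack.sql("""
    SELECT *
    FROM read_csv(
        './raw_mls/ml.period',
        sample_size = 100_000_000,
        header = False,
        names = ['period', 'period_abbr', 'period_name']
    )
""")
print(preview.columns)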
We’ll pass those parameters into the sql statement as a string.
for text_file in all_text_files:
    print(text_file)

    # create a full path for the text file
    text_file_path = os.path.join(folder, text_file)

    # generate a table name based on the file name
    table_name = text_file.split('.')[-1]

    # use the table column dictionary
    if table_name in table_columns:
        params = f", header=False, names={table_columns[table_name]}"
    else:
        params = ""

    # format the sql string
    sql = f"""
    CREATE OR REPLACE TABLE {table_name} AS
    SELECT
        *
    FROM
        read_csv('{text_file_path}', sample_size = 100_000_000 {params})
    """

    # execute the sql statement
    query_results = quack.sql(sql)
ml.contacts
ml.data.0.Current
ml.data.1.AllData
ml.dataelement
ml.dataseries
ml.footnote
ml.industrybase
ml.irc
ml.period
ml.series
ml.srd
Now the data has been loaded into the DuckDB file and is ready for some analysis.
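As a quick sanity check, here’s a sketch of a couple of queries against the freshly loaded tables, using the table and column names created above:

# row count for the main data table
print(quack.sql("SELECT COUNT(*) AS row_count FROM AllData"))

# join the series metadata to the dataseries lookup table
print(quack.sql("""
    SELECT s.series_id, d.dataseries_text
    FROM series AS s
    JOIN dataseries AS d
      ON s.dataseries_code = d.dataseries_code
    LIMIT 5
"""))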