
Extracting programmer employment data from BLS

Chris Erin

This is the description of a technical journey to better understand the labor market for computer programmers using numbers from the Bureau of Labor Statistics (BLS). Because it is a government agency, I expected the BLS to provide a significant amount of readily available raw historical data, and while that is true, the data is provided in a number of ways, each of which is unsatisfying in its own way. The BLS has an API, a large cache of public flat files, and a set of spreadsheets for each year dating back to 1997. In this blog post I explore each of these methods of extracting data from the BLS.

The growth of the labor market for Computer Programmers

As a consultant who gets to see a number of different project environments, either firsthand or through my co-workers, I'm interested in how the rate of growth in the industry affects the ability to produce quality software. But how fast is the industry growing? The United States Bureau of Labor Statistics (BLS) publishes employment numbers by occupation and industry on a yearly basis. I should be able to use those numbers to estimate the general level of experience across the computer programming industry.

The BLS Text Files

https://download.bls.gov/pub/

The above address points to a file server where a number of text files are available for download. The particular set of files that I care about are the occupational employment (oe) files, located at https://download.bls.gov/pub/time.series/oe/.
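
If you want to follow along, a loop like this pulls down the handful of oe files referenced in this post (oe.txt, oe.series, and oe.data.1.AllData are the only names I use below; anything else you'll have to dig out of the directory listing yourself):

# Grab the oe files referenced in this post from the BLS file server.
# oe.txt documents the file layouts, oe.series maps series_ids to descriptions,
# and oe.data.1.AllData holds the actual observations.
for f in oe.txt oe.series oe.data.1.AllData; do
  wget https://download.bls.gov/pub/time.series/oe/$f
done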

To make sense of these files, you must learn their structure and cross-reference them as described in the oe.txt file. There you'll learn about the series_id, an identifier you can use to query the text files and, later, the API. It looks like this:

OEUN000000000000015113201

This series_id can be looked up in the oe.series file:

> cat oe.series | grep "OEUN000000000000015113201"
... Employment for Software Developers, Applications in All Industries in the United States ...

The important part of the series_id to me is the occupation_code, in this case: 151132.
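
Digging through oe.txt, here is roughly how I read that series_id; the field boundaries below are my interpretation of that file rather than anything official, so verify them against oe.txt before relying on them:

# Rough breakdown of OEUN000000000000015113201, per my reading of oe.txt:
#   OE       survey abbreviation (Occupational Employment Statistics)
#   U        seasonal adjustment code (unadjusted)
#   N        area type code (national)
#   0000000  area code
#   000000   industry code (all industries)
#   151132   occupation code (Software Developers, Applications)
#   01       data type code (employment)
echo OEUN000000000000015113201 | cut -c18-23   # => 151132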

The occupation codes for Computer occupations look like this (check them out here):

15-1100     Computer Occupations
15-1111     Computer and Information Research Scientists
15-1120     Computer and Information Analysts
15-1121     Computer Systems Analysts
15-1122     Information Security Analysts
15-1130     Software Developers and Programmers
15-1131     Computer Programmers
15-1132     Software Developers, Applications
15-1133     Software Developers, Systems Software
15-1134     Web Developers
15-1140     Database and Systems Administrators and Network Architects
15-1141     Database Administrators
15-1142     Network and Computer Systems Administrators
15-1143     Computer Network Architects
15-1150     Computer Support Specialists
15-1151     Computer User Support Specialists
15-1152     Computer Network Support Specialists
15-1199     Computer Occupations, All Other

There are some codes here that don't relate closely to what I do; the ones that do are all under 15-1130:

15-1130     Software Developers and Programmers
15-1131     Computer Programmers
15-1132     Software Developers, Applications
15-1133     Software Developers, Systems Software
15-1134     Web Developers

Using the occupation_code 15-1130 and its corresponding series_id OEUN000000000000015113001, we can grep through the oe.data.1.AllData text file to find the employment numbers.

> cat oe.data.1.AllData | grep OEUN000000000000015113001
OEUN000000000000015113001       2018    A01     1666270

And as of May 2018, the 15-1130 code and its subcategories accounted for the employment of 1,666,270 people!

This is just 2018. In fact, if you grep through the entire oe.data.1.AllData file for a year that isn't 2018, you won't find anything. For my goal of finding employment data for the last 20 years, this isn't good enough.

The BLS API

The documentation for the API is here. The first example is missing a couple of things. First, while it details an Example Payload, it doesn't show how to actually send that payload in a request. Second, the example says nothing about API keys or registration. Without registering, you will hit the daily rate limit within 10 requests. You can register for an API key here.
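
For what it's worth, the Example Payload can be sent as a JSON POST body, roughly like this (the start and end years are arbitrary, and the registration key is a placeholder):

curl -X POST https://api.bls.gov/publicAPI/v2/timeseries/data/ \
  -H 'Content-Type: application/json' \
  -d '{"seriesid": ["OEUN000000000000015113001"], "startyear": "2012", "endyear": "2018", "registrationkey": "<registration-key>"}' | jq

For a single series, though, a plain GET is simpler.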

Here is an example curl request using the Software Developers and Programmers series_id and a registration key:

curl "https://api.bls.gov/publicAPI/v2/timeseries/data/OEUN000000000000015113001?registrationkey=<registration-key>" | jq
{
  "status": "REQUEST_SUCCEEDED",
  "responseTime": 146,
  "message": [
    "No Data Available for Series OEUN000000000000015113001 Year: 2016",
    "No Data Available for Series OEUN000000000000015113001 Year: 2017"
  ],
  "Results": {
    "series": [
      {
        "seriesID": "OEUN000000000000015113001",
        "data": [
          {
            "year": "2018",
            "period": "A01",
            "periodName": "Annual",
            "latest": "true",
            "value": "1666270",
            "footnotes": [
              {}
            ]
          }
        ]
      }
    ]
  }
}

Again we get a value of 1,666,270 for this series, and again we get no data from before the year 2018.

The XLS files

There is one page on the BLS Occupational Employment Statistics site that clearly provides historical data.

https://www.bls.gov/oes/tables.htm

This page does have the data I'm looking for, in a series of inconsistently named zip files with inconsistent directory structures, containing a mix of xls and xlsx files. As a programmer, I can either write a scraper with a lot of special cases, or I can just start downloading and unzipping.

There are two naming conventions for the national files: oes<year>nat.zip or oesm<year>nat.zip. The m in the second convention was introduced in 2003 to accommodate producing the file in both May (m) and November (n). That only lasted two years, but the m persisted.

Here is a script to get all of the files (requires wget; brew install wget):

for x in '97' '98' '99' '00' '01' '02' 'm03' 'm04' 'm05' 'm06' 'm07' 'm08' 'm09' 'm10' 'm11' 'm12' 'm13' 'm14' 'm15' 'm16' 'm17' 'm18'; do
  wget https://www.bls.gov/oes/special.requests/oes${x}nat.zip
  yes | unzip oes${x}nat.zip
done

This next command will convert all of the files into csv files and place them into a directory called csvfiles (requires csvkit / pip install csvkit):

mkdir csvfiles && find . -name "*xls*" | grep -v field_descriptions |  xargs -I% sh -c "echo % && basename %" | xargs -L2 bash -c 'in2csv $0 > csvfiles/$1.csv'

Now, you can grep through all the csv files to find the employment numbers for the occupational code you are interested in (requires csvkit / pip install csvkit):

cat csvfiles/*.csv | grep '15-1130'  | csvcut -c 1,4

Which produces:

15-1130,1397780.0
15-1130,1442500.0
15-1130,1492040
15-1130,1554960
15-1130,1604570
15-1130,1617400
15-1130,1666270

Fortunately, all the files sort nicely, which produces output sorted by year. At the bottom you'll again see the number 1,666,270, and before it six numbers representing the six previous years. We had around 20 csv files covering every year back to 1997, so why did we only get data back to 2012?

Well, the codes keep changing:

1999 - 2009:

15-1021 Computer Programmers
15-1031 Computer Software Engineers, Applications
15-1032 Computer Software Engineers, Systems Software

2010 - 2011:

15-1131 Computer Programmers
15-1132 Software Developers, Applications
15-1133 Software Developers, Systems Software

2012 - 2018:

15-1130     Software Developers and Programmers
15-1131     Computer Programmers
15-1132     Software Developers, Applications
15-1133     Software Developers, Systems Software
15-1134     Web Developers
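
Knowing that, here is a rough sketch of how I'd pull every generation of these codes out of the converted csv files in one pass. It assumes the occupation code and total employment still sit in columns 1 and 4 of the older files, which is worth verifying, and note that from 2012 on you'd have to choose between the 15-1130 rollup and its components to avoid double counting:

# Sketch only: match every generation of the relevant occupation codes at once.
# 1999-2009: 15-1021/15-1031/15-1032; 2010-2011: 15-1131/15-1132/15-1133;
# 2012 onward adds the 15-1130 rollup (which also includes 15-1134, Web Developers).
cat csvfiles/*.csv | grep -E '15-1021|15-103[12]|15-113[0-3]' | csvcut -c 1,4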

Conclusion

Now that I understand how to get the data, and that I have to follow the codes through time, I have a better chance of creating a table of the changing levels of experience in the industry over time. This data was not straightforward to get or to understand, so I hope you can use some of this in your own BLS data journeys.


Photo by Armand Khoury on Unsplash
