Introduction

This repository contains code for an JSON API to the Buisness Dynamics Statistics. It is composed of Python code that downloads, transforms and deploys the CSV files hosted on the Census Bureau’s website to a Postgres database and a sails application that connects to that database and dynamically serves content. The Python code relies on the Fabric library which streamlines the use of SSH for application deployment and system administration tasks. The fabric script that deploys the database also contains code for deploying to a Docker data volume. This data volume can be used in conjunction with the docker-compose tool to launch local, disposable, development copies of the postgres database and the sails API.

Dynamic Routes

BDS data is aggregate yearly time series job birth and job death information derived from the Longitudinal Buisness Database. It is broken down by firms and establishments, and then by increasingly specific sub-groups such as state, age of firm, top level sic codes and metropolitan statistical areas. Data is provided in ~50 CSV files that begin with the Economy Wide statistics and at their most specific are broken down by Firm Age, Firm Size, Metro/Non-Metro and State.

Dynamic Routes in the bdsAPI allow clients to query this information based on increasingly specific subgroups given that the subgroup combination is made available by the census bureau. Routes return a JSON object. Unless otherwise specified by the flat parameter routes will return an object that groups row data in the order of the dynamic route. This means clients may request job birth/death broken down by firm age (/firm/age) and then based on user actions, request job birth/death broken down by firm age and size (/firm/age/sz). Conversely /firm/sz/age will provide breakdowns of job birth/deaths based on firm size, then firm age. This is supported by the fact that the census bureau provides an AGExSZ CSV file that contains this information. It is not, for instance, possible to request establishment breakdowns by state and SIC code (/establishment/st/sic) - The census bureau does not provide this information.

Here are the lists of the valid URI segments that may be combined along with their coorisponding database table and descriptions (keep in mind that URI segments may be combined in any order):

Establishment Data

Database Tabel URI Segments Description
EW ew Economy Wide
SIC sic Sector
SZ sz Establishment Size
ISZ isz Initial Establishment Size
AGE age Establishment Age
ST st State
AGExSZ age, sz Establishment Age by Establishment Size
AGExISZ age, isz Establishment Age by Initial Establishment Size
AGExSIC age, sic Establishment Age by Sector
SZxSIC sz, sic Establishment Size by Sector
ISZxSIC isz, sic Initial Establishment Size by Sector
AGExST age, st Establishment Age by State
SZxST sz, st Establishment Size by State
ISZxST isz, st Initial Establishment Size by State
AGExSZxSIC age, sz, sic Establishment Age by Establishment Size by Sector
AGExSZxST age, sz, st Establishment Age by Establishment Size by State
AGExISZxSIC age, isz, sic Establishment Age by Initial Establishment Size by Sector
AGExISZxST age, isz, st Establishment Age by Initial Establishment Size by State

Firm Data

Database Table URI Segments Description
EW ew Economy Wide
SIC sic Sector
SZ sz Firm Size
ISZ isz Initial Firm Size
AGE age Firm Age
ST st State
MET met Metro/Non-Metro
AGExSZ age, sz Firm Age by Firm Size
AGExISZ age, isz Firm Age by Initial Firm Size
AGExSIC age, sic Firm Age by Sector
AGExMET age, met Firm Age by Metro/Non-Metro
AGExMSA age, msa Firm Age by MSA
AGExST age, st Firm Age by State
SZxSIC sz, sic Firm Size by Sector
SZxMET sz, met Firm Size by Metro/Non-Metro
SZxMSA sz, msa Firm Size by MSA
SZxST sz, st Firm Size by State
ISZxSIC isz, sic Initial Firm Size by Sector
ISZxMET isz, met Initial Firm Size by Metro/Non-Metro
ISZxST isz, st Initial Firm Size by State
AGExSZxSIC age, sz, sic Firm Age by Firm Size by Sector
AGExSZxST age, sz, st Firm Age by Firm Size by State
AGExSZxMET age, sz, met Firm Age by Firm Size by Metro/Non-Metro
AGExSZxMSA age, sz, msa Firm Age by Firm Size by MSA
AGExISZxSIC age, isz, sic Firm Age by Initial Firm Size by Sector
AGExISZxST age, isz, st Firm Age by Initial Firm Size by State
AGExISZxMET age, isz, met Firm Age by Initial Firm Size by Metro/Non-Metro
AGExSZxMETxST age, sz, met, st Firm Age by Firm Size by Metro/Non-Metro by State
AGExISZxMETxST age, isz, met, st Firm Age by Initial Firm Size by Metro/Non-Metro by State

Additional Route Segments

BDS data is panel data beginning in 1977. By default the leafs of the hierarchical response object will be lists containing objects for each year. This may not be the desired functionality and so it is possible to treat ‘yr’ as a URI segment that may be added at any level to any of the previous dynamic routes. e.g. /firm/age/yr/sz will return a breakdown of firms by age, then by year, then by size.

The Census bureau does not provide an msa table for firms, though it does provide breakdowns for MSA by age, by size and by age and size. The fabric deployment script will aggregate the AGExMSA table values to create an MSA table which is available at /firm/msa. It is recommended that clients subject this route conditions (See Next Section).

Route Conditions

Endpoints may (and in most cases should) be subjected to conditions which reduce the over all size of the data returned (AGExSZxMSA for instance is over 100Mb of text). This can by done by including zero filled numbers to the end of any URI segment that correspond to the codes of that URI element (codes may be found in the Code section). For example the URI:

/firm/age01/msa0102301024/sz020310

will return an object of the form:

{"01": 
  {"01023":
    { "02": [ ROW_DATA ],
      "03": [ ROW_DATA ],
      "10": [ ROW_DATA ] },
   "01024":
    "02": [ ROW_DATA ],
    "03": [ ROW_DATA ],
    "10": [ ROW_DATA ]
  }
}

Where age is subjected to the condition ‘age = “01”’, msa is subject to the condition ‘msa in (“01023”, “01024”)’ and ‘size in (“02”, “03”, “10”).’ All condition values are fixed width according to their data type where, years (‘yr’) are of length 4, MSA (‘msa’) are of length 5, and all others (“age”, “sz”, “isz”, “sic”, “st”) are of length 2.

Codes

States follow FIPS two digit codes. MSA codes follow the numeric codes for metropolitan statistical areas as established by the Office of Management and Budget (OMB), version 2009. Codes may be pulled dynamically at the following routes:

  • /codes
  • /firm/codes
  • /establishment/codes
  • /firm/codes/:type
  • /establishment/codes:type

For reference codes are as follows:

Firm Age

code value
00 a) 0
01 b) 1
02 c) 2
03 d) 3
04 e) 4
05 f) 5
06 g) 6 to 10
07 h) 11 to 15
08 i) 16 to 20
09 j) 21 to 25
10 k) 26+
11 l) Left Censored

Firm Size & Initial Firm Size

code value
00 a) 1 to 4
01 b) 5 to 9
02 c) 10 to 19
03 d) 20 to 49
04 e) 50 to 99
05 f) 100 to 249
06 g) 250 to 499
07 h) 500 to 999
08 i) 1000 to 2499
09 j) 2500 to 4999
10 k) 5000 to 9999
11 l) 10000+

Establishment Age

code value
00 a) 0
01 b) 1
02 c) 2
03 d) 3
04 e) 4
05 f) 5
06 g) 6 to 10
07 h) 11 to 15
08 i) 16 to 20
09 j) 21 to 25
10 k) 26+
11 l) Left Censored

Establishment Size and Initial Establishment Size

code value
00 a) 1 to 4
01 b) 5 to 9
02 c) 10 to 19
03 d) 20 to 49
04 e) 50 to 99
05 f) 100 to 249
06 g) 250 to 499
07 h) 500 to 999
08 i) 1000+

Additional Parameters

If this hierarchical behavior is not desirable data may be returned as a list of rows by passing the GET argument ‘flat’ equal to any non falsy value. For example:

firm/age01/msa0102301024/sz020310?flat=true

will return an array of rows. By default each row contains all the columns available in that database. If a user only wants specific columns they may be selected with the ‘fields’ GET variable. For example:

firm/age01/msa0102301024/sz020310?fields=job_creation

Multiple fields may be selected by adding additional fields variables, e.g.

firm/age01/msa0102301024/sz020310?fields=job_creation&fields=job_death&fields=job_creation_rate

PLEASE NOTE: the current parse function does not support paging ( it is not clear the best way to page through a hierarchical object). This means it is possible to request data that takes a very long time to return and may CRASH the browser if loaded into memory. It is (currently) the client’s responsibility to request reasonable amounts of data!

Notes for Developers

The source code contains extensive documentation on each function and developers are encouraged to look through the source code for more information. Consider begining with the parse() function in the api/controllers/GenericController.js file.

Scattered through the comments are TODO and CONSIDER statements. TODO statements suggest a code area that should be improved, usually with minimal effort. These often include better error checking. CONSIDER statements are more design oriented and suggest ways in which the API or deployment code could be usefully improved or extended.

By convention functions that lead with an underscore are intended for within-module use. You are welcome to use them if you wish but their internal functionality and signatures are not intended to be stable.

The dynamic routing of field elements circumvents the traditional routing and model design of a sails application. The BDS data does not lend itself to traditional object hierarchies that map well to relational databases. Because of this models are used only for their generic SQL query() method. Generating this SQL is the router responsibility of the router, who converts all URI segments after the /firm and /establishment parts to this custom SQL query. The guts of this transformation have been abstracted into a Sails service FieldService.js. This file contains more documentation, but consider starting by looking at the route_table() and the route_query() methods. Once SQL rows have been returned the API will group them hierarchically based on the order of the URI segments. This code is relatively small, but is managed from a separate file GroupService.js.

Fabric deployment of the database relies heavily on the Pandas python library for data I/O and munging. Pandas provides methods for reading from CSV files and writing SQL to databases. Database connections are managed through SQLAlchemy engines. To ensure library support for the munging operations files, census bureau files will be downloaded into the temporary directory of the launching computer, rather than the remote host that is being deployed too.