"

Chapter 9: Collecting and Cleaning the Data

Getting the Data

Data are provided for the purpose of these exercises, but in this section, we will walk through what data we need and how to find it.

Data we will collect includes:

  • Income (source: US Census)
  • Housing characteristics (source: US Census)
  • Demographics (source: US Census)
  • Housing assessments (source: St. Louis County Open Data)
  • Zoning information (source: St. Louis County Open Data)
  • Schools (source: MSDIS)
  • Floodways (source: St. Louis County Open Data)
  • Sinkholes (source: MSDIS)
  • Voting results (source: St. Louis County Elections)
  • Bakeries (source: Data Axle)

Getting Census Data

Screenshot of the Census TIGER/Line site, used to get the 2010 Missouri census tracts

The Census provides two types of data we need: geospatial and tabular. The exercises in this book will use census tract-level data for St. Louis County.

Getting Census Tract TIGER/Line Files

Go to the Census TIGER/Line Shapefiles page.

  • Select year: 2010
  • Select layer type: Census Tract
  • Submit
  • State: Missouri
    • County: St. Louis County
    • Download
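
Once unzipped, you can quickly inspect the download in Python. This is a minimal sketch, assuming geopandas is installed; the filename below is an assumption, so use whatever name your unzipped TIGER/Line shapefile actually has.

    # Inspect the downloaded tract shapefile.
    import geopandas as gpd

    tracts = gpd.read_file("tl_2010_29189_tract10.shp")  # assumed filename
    print(len(tracts))      # number of census tracts in the file
    print(tracts.columns)   # attribute fields, including the GEOID field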

The Census tables can be retrieved from the Census’s Data and Maps page. Census.gov provides the decennial census (every 10 years) as well as the American Community Survey (estimates for the years between decennial censuses).

If you are unable to get historical data from the Census site, another useful tool is IPUMS NHGIS.

Using the advanced search in Census.gov, filter for the following.

1. Go to the Explore Census Data page and run a separate Advanced Search for each of the tables: housing, income, and race.

Screenshot of the filter dialog in the Census Data Portal
Geographies
  • Census Tracts
    • Missouri
      • St. Louis County
        • All Census Tracts in St. Louis County Missouri

I recommend gathering these topic tables separately.

Topics
  • Housing
    • Download 2010 table DP04 (zip file)
    • Remove topic filters, keep geography filters
Topics
  • Income and Poverty
    • Download 2010 table S1901 (zip file)
    • Remove topic filters, keep geography filters
Topics
  • Race
    • Download 2010 table B02001 (zip file)

2. Getting Floodways

These are made available through the St. Louis County Open Data Portal.

Search: flood

  • Select Flood Hazard Areas, 1% Annual Chance
    • Select FEMA Floodways
    • Download shapefile dataset

3. Getting Sinkhole Data and School Locations

MSDIS stands for the Missouri Spatial Data Information Service, and it houses a variety of data sources for the state.

You can find schools by searching for schools.

  • Download the .shp file

You can find sinkholes by searching for sinkholes.

  • Download the 2018 sinkholes .shp file.

4. Datasets can be found here.

Cleaning Data

The OpenRefine logo
Software for cleaning data:

First, go to the OpenRefine site and download the version for your operating system (e.g., OpenRefine 3.9.3 for macOS).

The cleaning we will need to do is on the Census tables. Our tables require a few basic actions:

  1. Prepare the data so it will successfully join.
  2. Ensure numerical columns will be read as numbers.
  3. Remove any unnecessary information.
  4. Document our changes.

You will remember from earlier chapters that we can join tabular data to geospatial data. However, we can only accomplish this if we have a field that meets all the criteria for joining; a sketch of the join itself follows the requirements list below.

Requirements for a successful join

Screenshot from QGIS application
  • Join fields must contain matching values (a unique identifier is an ideal value for joining)
  • Join fields must be the same data format (e.g., string, integer, etc.)
  • Field headers must not contain spaces or special characters
  • Field values are clean and standardized
  • Field headers should not use reserved words (e.g., “date,” “day,” “month,” “table,” “text,” “user,” “when,” “where,” “year,” or “zone”)
  • Table is in a compatible format
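
To make these requirements concrete, here is a minimal sketch of the join we are working toward, written in Python with geopandas and pandas. The file names and the GEOID10 field name are assumptions for illustration; check them against your own data.

    # A sketch of joining tabular Census data to geospatial data, assuming
    # these file names and that the tract layer's ID field is GEOID10.
    import geopandas as gpd
    import pandas as pd

    tracts = gpd.read_file("tracts_2010.shp")      # geospatial data (assumed name)
    income = pd.read_csv("S1901.csv", dtype=str)   # tabular data; IDs stay strings

    # The join fields must be the same type and contain matching values.
    joined = tracts.merge(income, left_on="GEOID10", right_on="GEO_ID_2", how="left")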

Identifier values match:

Our project will require us to join our Census tables to our TIGER/Line files. The Census uses a field called GEOID to accomplish this.

A TIGER/Line GEOID looks like this: 29510127600. The ID breaks down like this:

  • 29 is the Census ID for the State of Missouri
  • 510 is the county number (in this case it represents the county-equivalent of St. Louis City)
  • 127600 is the tract number
  • On their own, none of these numbers is unique, but combined they form a unique identifier.

Census table GEOIDs often look like this when downloaded: 1400000US29510127600

Screenshot of Census table geoid in Excel app

1400000US is a prefix the Census adds to table identifiers (it encodes the geographic summary level, here census tract). The rest of the ID is the same as the ID in the TIGER/Line file.
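
As an illustration (not a required cleaning step), the decomposition can be expressed with plain string slicing in Python:

    # Decomposing a Census table GEOID with string slicing.
    geoid = "1400000US29510127600"

    prefix = geoid[:9]      # "1400000US" -- prefix absent from the TIGER/Line file
    tract_id = geoid[9:]    # "29510127600" -- matches the TIGER/Line GEOID
    state = tract_id[:2]    # "29"  -- Missouri
    county = tract_id[2:5]  # "510" -- St. Louis City
    tract = tract_id[5:]    # "127600" -- tract number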

Steps to clean the data:

  1. Launch OpenRefine
  2. Create Project
  3. Upload, navigate to one of the downloaded Census tables
  4. Next
Screenshot of the OpenRefine upload screen

When prompted to parse the data, you will be asked to set the following options (a pandas equivalent is sketched after the list):

  • Columns are separated by: the column separation method, which is based on the type of file you are working with. Our data is in CSV format (comma-separated values), so our separation method is a comma.
  • Use character: because sometimes your data will contain the character used for separation, we need to tell OpenRefine how to know when that character is serving its separation function and when it is part of a value. We will accept the default (“).
  • Trim leading and trailing white space from strings: spaces are hard to clean out of a dataset because you can’t really see them when they’re at the end of a value, but they can cause big problems for analyses. OpenRefine can remove the whitespace when this is selected. We will select this box.
  • Ignore first line: we want our first line treated as a header, so do not choose this option.
  • Parse next: If our dataset has multiple headers we can tell OpenRefine how many lines are reserved as headers. In our case we will choose one.
  • Column names: if you have ignored the first line and want to provide column names you can list them here, for our purposes we will use the provided column names in the dataset, so we will not fill out this field.
  • Discard initial rows of data: this can be helpful if there are rows of information preceding the dataset. For our purposes we will not use this.
  • Load at most: if you don’t want to load all the rows of the dataset you can indicate how many to load. For our purposes we will not be using this.
    Screenshot of the OpenRefine parsing screen; select trim leading and trailing whitespace
  • Disable auto preview: if you don’t need to see the parsing preview, check this. We will not check this.
  • Attempt to parse all text into numbers: this can be useful if you know how all the numeric characters should be read. In our case, we want our variable columns to read as numbers, but our unique identifier, which is also numeric, should read as a string, so we will not check this.
  • Store blank rows: unless there is a good reason to keep them, I recommend removing blank rows because they will read as null values. We will deselect store blank rows.
  • Store blank columns: unless there is a good reason to keep them, I recommend removing blank columns because they will read as null values. We will deselect store blank columns.
  • Store file source: this stores the source file name in OpenRefine, which is not necessary, but you will want to document your source.
  • Store file archive: this keeps a copy of the uploaded file in OpenRefine, which is not necessary, but you will want to document your changes and keep a copy of the original file.
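
For readers who prefer a scripted route, the same parsing choices can be expressed in pandas. This is a sketch; the filename is a placeholder for whichever Census table you downloaded.

    # Roughly the same parsing choices in pandas.
    import pandas as pd

    df = pd.read_csv(
        "ACSST5Y2010_S1901_data.csv",  # placeholder filename
        sep=",",                # columns are separated by commas
        quotechar='"',          # character enclosing values that contain commas
        header=0,               # treat the first line as the header
        skipinitialspace=True,  # trim whitespace after separators
        dtype=str,              # read everything as strings; convert numbers later
    )
    df = df.dropna(how="all")   # drop fully blank rows instead of storing them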

Remove Row 2

To clean up this data we need to remove row 2, which holds the descriptive label for each column, so that we are working with number values only (a pandas equivalent is sketched after these steps). To do this we will:

  • Click the star symbol next to row 2.
  • Toggle down the “All” menu.
  • Choose “Facet by Star.”
    Screenshot of the OpenRefine create facet by star action
  • Select the values that are “true.”
  • Toggle down the “All” menu again.
  • Go to “Edit rows.”
  • Choose “Remove matching rows.”
    Screenshot of the OpenRefine remove matching rows action
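
In pandas, the equivalent is to skip the label row while reading the file. A sketch, using the same placeholder filename as above:

    # Skip the second physical line (the descriptive labels) while reading.
    import pandas as pd

    df = pd.read_csv("ACSST5Y2010_S1901_data.csv", dtype=str, skiprows=[1])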

Split Column

The next step is to remove the extra characters (1400000US) from the geoid.

  • Toggle down the geoid column and click “Edit column.”

Choose the option to “split into several columns”

  • How to split the column: by separator. We have to tell it what the separator is; in our case the separator will be “S” because that’s the last character of the prefix we want to remove.
  • Split into 2 columns
  • After splitting, deselect “guess cell type,” because OpenRefine may guess that our geoid is a number and we want it to remain a string.
    Screenshot of the OpenRefine split columns dialog

  • Deselect “remove this column,” because it’s good to have the original column available in case we need to redo the split.

If it parsed successfully, there are now three columns: the original “geoid” column; “geoid 1,” which holds the prefix we want to discard; and “geoid 2,” which holds the tract ID that will match our geospatial data.
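
The same split can be sketched in pandas, assuming the table is loaded as df with the GEO_ID column read as strings:

    # Mirror the OpenRefine split: the prefix lands in GEO_ID_1 and the
    # tract ID in GEO_ID_2.
    df[["GEO_ID_1", "GEO_ID_2"]] = df["GEO_ID"].str.split("S", n=1, expand=True)

    # An equivalent shortcut is to slice off the last 11 characters:
    # df["GEO_ID_2"] = df["GEO_ID"].str[-11:]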

Remove un-needed columns
Screenshot of the OpenRefine All menu with “Edit columns” and “Re-order/remove columns” selected

To remove one or two columns, you can toggle down to

  • “Edit column”
    • select “remove column”

To remove several columns, toggle down the “All” column, choose “Edit columns,” then “Re-order/remove columns.” All the columns will be listed in the left pane, and the right pane will be empty.

You can drag and drop columns one by one, OR you can select “remove all columns,” which moves all the columns to the right pane, and then drag the ones you want to keep back into the left pane.

Using the metadata table for each of our tables, you can identify which columns to keep (a pandas sketch of this step follows the lists below).

For the ACSST5Y2010_S1901, keep:

  • GEO_ID_2
  • S1901_C01_001E
  • S1901_C02_001E
  • S1901_C02_012E
  • S1901_C02_013E

For the ACSDP5Y2010_DP04, keep

  • GEO_ID_2
  • DP04_0001E
  • DP04_0045E
  • DP04_0045PE
  • DP04_0088E
  • DP04_0088PE

For the ACSDT5Y2010_B02001, keep:

  • GEO_ID_2
  • B02001_001E
  • B02001_002E
  • B02001_003E
  • B02001_004E
  • B02001_005E
  • B02001_006E
  • B02001_007E
  • B02001_008E
  • B02001_009E
  • B02001_010E
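
In pandas, this column selection is a single assignment. A sketch using the S1901 keep list above; the other two tables work the same way:

    # Keep only the join field and the variables we need.
    keep = ["GEO_ID_2", "S1901_C01_001E", "S1901_C02_001E",
            "S1901_C02_012E", "S1901_C02_013E"]
    df = df[keep]
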
Rename Columns

For each column, toggle down the column menu and rename it as follows (a pandas sketch of this step follows the lists below).

For the ACSST5Y2010_S1901, keep GEO_ID_2.

  • S1901_C01_001E, change to totpop
  • S1901_C02_001E, change to totfam
  • S1901_C02_012E, change to medfam
  • S1901_C02_013E, change to meanfam

For the ACSDP5Y2010_DP04, keep GEO_ID_2.

  • DP04_0001E, change to totoccs
  • DP04_0045E, change to estoo
  • DP04_0045PE, change to prctoo
  • DP04_0088E, change to medval

For the ACSDT5Y2010_B02001, keep GEO_ID_2.

  • B02001_001E, change to totpop
  • B02001_002E, change to ewa
  • B02001_003E, change to eba
  • B02001_004E, change to eaiana
  • B02001_005E, change to eaa
  • B02001_006E, change to enhopa
  • B02001_007E, change to esora
  • B02001_008E, change to e2mra
  • B02001_009E, change to e2mior
  • B02001_010E, change to em2reor3mr
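
The renames can also be sketched in pandas, shown here for S1901:

    # Rename the S1901 columns to short, readable names.
    df = df.rename(columns={
        "S1901_C01_001E": "totpop",
        "S1901_C02_001E": "totfam",
        "S1901_C02_012E": "medfam",
        "S1901_C02_013E": "meanfam",
    })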

Convert strings to Numbers

The variable columns are reading as strings because, before row 2 was removed, there was text in those columns. Because converting to numbers is a popular transform, one option is to toggle down a column, go to “Common Transforms,” and select “To number.”

After changing a few columns one by one, you might notice that there are A LOT of columns in this dataset. If you go to the “All” menu and choose “Transform,” then “To Number,” you will get a checked list of columns. Deselect “Geoid” and “Name” and select OK. (A pandas sketch of this bulk conversion follows below.)

Screenshot of the OpenRefine common transforms menu, with “To number” selected for multiple columns
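
The bulk conversion in pandas, leaving the identifier columns as strings:

    # Convert every column except the identifiers to numbers;
    # errors="coerce" turns any leftover text into missing values.
    import pandas as pd

    for col in df.columns:
        if col not in ("GEO_ID_2", "NAME"):
            df[col] = pd.to_numeric(df[col], errors="coerce")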

The final step for this dataset is to export it. You can do that using the Export button in the upper-right corner; choose to export to CSV. You will need to rename the OpenRefine output using the table number, because QGIS will adopt the name of the table as part of each variable name (a scripted rename is sketched after the list). For example, rename:

  • ACSST5Y2010_S1901 = S1901.csv
  • ACSDP5Y2010_DP04 = DP04.csv
  • ACSDT5Y2010_B02001  = B02001.csv
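
If you prefer to script the renaming, here is a minimal sketch; it assumes the OpenRefine exports use the project names shown above and sit in the current directory (adjust the names to match your files).

    # Rename the exported CSVs to their short table numbers.
    from pathlib import Path

    renames = {
        "ACSST5Y2010_S1901.csv": "S1901.csv",
        "ACSDP5Y2010_DP04.csv": "DP04.csv",
        "ACSDT5Y2010_B02001.csv": "B02001.csv",
    }
    for old, new in renames.items():
        Path(old).rename(new)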

Make sure you clean and rename all the Census tables.