"

Chapter 9: Collecting and Cleaning the Data

Getting the Data

Data are provided for the purpose of these exercises, but in this section, we will walk through what data we need and how to find it.

Data we will collect includes:

  • Income (source: US Census)
  • Housing characteristics (source: US Census)
  • Demographics (source: US Census)
  • Housing assessments (source: St. Louis County Open Data)
  • Zoning information (source: St. Louis County Open Data)
  • Schools (source: MSDIS)
  • Floodways (source: St. Louis County Open Data)
  • Sinkholes (source: MSDIS)
  • Voting results (source: St. Louis County Elections)
  • Bakeries (source: Data Axle)

Getting Census Data

Screenshot of the Census TIGER/Line site, used to get the 2010 Missouri census tracts

The Census provides two types of data we need: geospatial and tabular. The exercises in this book will use census tract-level data for St. Louis County.

Getting Census Tract TIGER/Line Files

Go to the Census TIGER/Line Shapefiles page.

  • Select year: 2010
  • Select layer type: Census Tract
  • Submit
  • State: Missouri
    • County: St. Louis County
    • Download
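
Once unzipped, you can quickly inspect the download in Python. This is a minimal sketch, assuming geopandas is installed; the filename below is an assumption, so use whatever name your unzipped TIGER/Line shapefile actually has.

    # Inspect the downloaded tract shapefile.
    import geopandas as gpd

    tracts = gpd.read_file("tl_2010_29189_tract10.shp")  # assumed filename
    print(len(tracts))      # number of census tracts in the file
    print(tracts.columns)   # attribute fields, including the GEOID field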

The Census tables can be retrieved from the Census’s Data and Maps page. Census.gov provides the decennial census (every 10 years) as well as the American Community Survey (estimates for the years between decennial censuses).

If you are unable to get historical data from the Census site, another useful tool is IPUMS NHGIS.

Using the advanced search in Census.gov, filter for the following.

1. Go to the Explore Census Data page and run a separate Advanced Search for each of the tables: housing, income, and race.

Screenshot of the filter dialog in the Census Data Portal
Geographies
  • Census Tracts
    • Missouri
      • St. Louis County
        • All Census Tracts in St. Louis County Missouri

I recommend gathering these topic tables separately.

Topics
  • Housing
    • Download 2010 table DP04 (zip file)
    • Remove topic filters, keep geography filters
Topics
  • Income and Poverty
    • Download 2010 table S1901 (zip file)
    • Remove topic filters, keep geography filters
Topics
  • Race
    • Download 2010 table B02001 (zip file)

2. Getting Floodways

These are made available through the St. Louis County Open Data Portal.

Search: flood

  • Select Flood Hazard Areas, 1% Annual Chance
    • Select FEMA Floodways
    • Download shapefile dataset

3. Getting Sinkhole Data and School Locations

MSDIS stands for the Missouri Spatial Data Information Service, and it houses a variety of data sources for the state.

You can find schools by searching for schools.

  • Download the .shp file

You can find sinkholes by searching for sinkholes.

  • Download the 2018 sinkholes .shp file.

4. Datasets can be found here.

Cleaning Data

The OpenRefine logo
Software for cleaning data:

First, go to the OpenRefine site and download the version for your operating system (e.g., OpenRefine 3.9.3 for macOS).

The cleaning we will need to do is on the Census tables. Our tables require a few basic actions:

  1. Prepare the data so it will successfully join.
  2. Ensure numerical columns will be read as numbers.
  3. Remove any unnecessary information.
  4. Document our changes.

You will remember from earlier chapters that we can join tabular data to geospatial data. However, we can only accomplish this if we have a field that meets all the criteria for joining; a sketch of the join itself follows the requirements list below.

Requirements for a successful join

Screenshot from QGIS application
  • Join fields must contain matching values (a unique identifier is an ideal value for joining)
  • Join fields must be the same data format (e.g., string, integer, etc.)
  • Field headers must not contain spaces or special characters
  • Field values are clean and standardized
  • Field headers should not use reserved words (e.g., “date,” “day,” “month,” “table,” “text,” “user,” “when,” “where,” “year,” or “zone”)
  • Table is in a compatible format
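
To make these requirements concrete, here is a minimal sketch of the join we are working toward, written in Python with geopandas and pandas. The file names and the GEOID10 field name are assumptions for illustration; check them against your own data.

    # A sketch of joining tabular Census data to geospatial data, assuming
    # these file names and that the tract layer's ID field is GEOID10.
    import geopandas as gpd
    import pandas as pd

    tracts = gpd.read_file("tracts_2010.shp")      # geospatial data (assumed name)
    income = pd.read_csv("S1901.csv", dtype=str)   # tabular data; IDs stay strings

    # The join fields must be the same type and contain matching values.
    joined = tracts.merge(income, left_on="GEOID10", right_on="GEO_ID_2", how="left")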

Identifier values match:

Our project will require us to join our Census tables to our TIGER/Line files. The Census uses a field called GEOID to accomplish this.

A TIGER/Line GEOID looks like this: 29510127600. The ID breaks down like this:

  • 29 is the Census ID for the State of Missouri
  • 510 is the county number (in this case it represents the county-equivalent of St. Louis City)
  • 127600 is the tract number
  • On their own, none of these numbers is unique, but combined they form a unique identifier.

Census table GEOIDs often look like this when downloaded: 1400000US29510127600

Screenshot of Census table geoid in Excel app

1400000US is a prefix the Census adds to table identifiers (it encodes the geographic summary level, here census tract). The rest of the ID is the same as the ID in the TIGER/Line file.
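
As an illustration (not a required cleaning step), the decomposition can be expressed with plain string slicing in Python:

    # Decomposing a Census table GEOID with string slicing.
    geoid = "1400000US29510127600"

    prefix = geoid[:9]      # "1400000US" -- prefix absent from the TIGER/Line file
    tract_id = geoid[9:]    # "29510127600" -- matches the TIGER/Line GEOID
    state = tract_id[:2]    # "29"  -- Missouri
    county = tract_id[2:5]  # "510" -- St. Louis City
    tract = tract_id[5:]    # "127600" -- tract number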

Steps to clean the data:

  1. Launch OpenRefine
  2. Create Project
  3. Upload, navigate to one of the downloaded Census tables
  4. Next
Screenshot of the OpenRefine upload screen

When prompted to parse the data, you will be asked to set the following options (a pandas equivalent is sketched after the list):

  • Columns are separated by: the column separation method, which is based on the type of file you are working with. Our data is in CSV format (comma-separated values), so our separation method is a comma.
  • Use character: because sometimes your data will contain the character used for separation, we need to tell OpenRefine how to know when that character is serving its separation function and when it is part of a value. We will accept the default (“).
  • Trim leading and trailing white space from strings: spaces are hard to clean out of a dataset because you can’t really see them when they’re at the end of a value, but they can cause big problems for analyses. OpenRefine can remove the whitespace when this is selected. We will select this box.
  • Ignore first line: we want our first line treated as a header, so do not choose this option.
  • Parse next: If our dataset has multiple headers we can tell OpenRefine how many lines are reserved as headers. In our case we will choose one.
  • Column names: if you have ignored the first line and want to provide column names you can list them here, for our purposes we will use the provided column names in the dataset, so we will not fill out this field.
  • Discard initial rows of data: this can be helpful if there are rows of information preceding the dataset. For our purposes we will not use this.
  • Load at most: if you don’t want to load all the rows of the dataset you can indicate how many to load. For our purposes we will not be using this.
    Screenshot of the OpenRefine parsing screen; select trim leading and trailing whitespace
  • Disable auto preview: if you don’t need to see the parsing preview, check this. We will not check this.
  • Attempt to parse all text into numbers: this can be useful if you know how all the numeric characters should be read. In our case, we want our variable columns to read as numbers, but our unique identifier, which is also numeric, should read as a string, so we will not check this.
  • Store blank rows: unless there is a good reason to keep them, I recommend removing blank rows because they will read as null values. We will deselect store blank rows.
  • Store blank columns: unless there is a good reason to keep them, I recommend removing blank columns because they will read as null values. We will deselect store blank columns.
  • Store file source: this stores the source file name in OpenRefine, which is not necessary, but you will want to document your source.
  • Store file archive: this keeps a copy of the uploaded file in OpenRefine, which is not necessary, but you will want to document your changes and keep a copy of the original file.
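
For readers who prefer a scripted route, the same parsing choices can be expressed in pandas. This is a sketch; the filename is a placeholder for whichever Census table you downloaded.

    # Roughly the same parsing choices in pandas.
    import pandas as pd

    df = pd.read_csv(
        "ACSST5Y2010_S1901_data.csv",  # placeholder filename
        sep=",",                # columns are separated by commas
        quotechar='"',          # character enclosing values that contain commas
        header=0,               # treat the first line as the header
        skipinitialspace=True,  # trim whitespace after separators
        dtype=str,              # read everything as strings; convert numbers later
    )
    df = df.dropna(how="all")   # drop fully blank rows instead of storing them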

Remove Row 2

To clean up this data we need to remove row 2, which holds the descriptive label for each column, so that we are working with number values only (a pandas equivalent is sketched after these steps). To do this we will:

  • Click the star symbol next to row 2.
  • Toggle down the “All” menu.
  • Choose “Facet by Star.”
    Screenshot of the OpenRefine create facet by star action
  • Select the values that are “true.”
  • Toggle down the “All” menu again.
  • Go to “Edit rows.”
  • Choose “Remove matching rows.”
    Screenshot of the OpenRefine remove matching rows action
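
In pandas, the equivalent is to skip the label row while reading the file. A sketch, using the same placeholder filename as above:

    # Skip the second physical line (the descriptive labels) while reading.
    import pandas as pd

    df = pd.read_csv("ACSST5Y2010_S1901_data.csv", dtype=str, skiprows=[1])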

Split Column

The next step is to remove the extra characters (1400000US) from the geoid.

  • Toggle down the geoid column and click “Edit column.”

Choose the option to “split into several columns”

  • How to split the column: by separator. We have to tell it what the separator is; in our case the separator will be “S” because that’s the last character of the prefix we want to remove.
  • Split into 2 columns
  • After splitting, deselect “guess cell type,” because OpenRefine may guess that our geoid is a number and we want it to remain a string.
    Screenshot of the OpenRefine split columns dialog

  • Deselect “remove this column,” because it’s good to have the original column available in case we need to redo the split.

If it parsed successfully, there are now three columns: the original “geoid” column; “geoid 1,” which holds the prefix we want to discard; and “geoid 2,” which holds the tract ID that will match our geospatial data.
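
The same split can be sketched in pandas, assuming the table is loaded as df with the GEO_ID column read as strings:

    # Mirror the OpenRefine split: the prefix lands in GEO_ID_1 and the
    # tract ID in GEO_ID_2.
    df[["GEO_ID_1", "GEO_ID_2"]] = df["GEO_ID"].str.split("S", n=1, expand=True)

    # An equivalent shortcut is to slice off the last 11 characters:
    # df["GEO_ID_2"] = df["GEO_ID"].str[-11:]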

Remove un-needed columns
Screenshot of the OpenRefine All menu with “Edit columns” and “Re-order/remove columns” selected

To remove one or two columns, you can toggle down to

  • “Edit column”
    • select “remove column”

To remove several columns, toggle down the “All” column, choose “Edit columns,” then “Re-order/remove columns.” All the columns will be listed in the left pane, and the right pane will be empty.

You can drag and drop columns one by one, OR you can select “remove all columns,” which moves all the columns to the right pane, and then drag the ones you want to keep back into the left pane.

Using the metadata table for each of our tables, you can identify which columns to keep (a pandas sketch of this step follows the lists below).

For the ACSST5Y2010_S1901, keep:

  • GEO_ID_2
  • S1901_C01_001E
  • S1901_C02_001E
  • S1901_C02_012E
  • S1901_C02_013E

For the ACSDP5Y2010_DP04, keep

  • GEO_ID_2
  • DP04_0001E
  • DP04_0045E
  • DP04_0045PE
  • DP04_0088E
  • DP04_0088PE

For the ACSDT5Y2010_B02001, keep:

  • GEO_ID_2
  • B02001_001E
  • B02001_002E
  • B02001_003E
  • B02001_004E
  • B02001_005E
  • B02001_006E
  • B02001_007E
  • B02001_008E
  • B02001_009E
  • B02001_010E
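
In pandas, this column selection is a single assignment. A sketch using the S1901 keep list above; the other two tables work the same way:

    # Keep only the join field and the variables we need.
    keep = ["GEO_ID_2", "S1901_C01_001E", "S1901_C02_001E",
            "S1901_C02_012E", "S1901_C02_013E"]
    df = df[keep]
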
Rename Columns

For each column, toggle down the column menu and rename it as follows (a pandas sketch of this step follows the lists below).

For the ACSST5Y2010_S1901, keep GEO_ID_2.

  • S1901_C01_001E, change to totpop
  • S1901_C02_001E, change to totfam
  • S1901_C02_012E, change to medfam
  • S1901_C02_013E, change to meanfam

For the ACSDP5Y2010_DP04, keep GEO_ID_2.

  • DP04_0001E, change to totoccs
  • DP04_0045E, change to estoo
  • DP04_0045PE, change to prctoo
  • DP04_0088E, change to medval

For the ACSDT5Y2010_B02001, keep GEO_ID_2.

  • B02001_001E, change to totpop
  • B02001_002E, change to ewa
  • B02001_003E, change to eba
  • B02001_004E, change to eaiana
  • B02001_005E, change to eaa
  • B02001_006E, change to enhopa
  • B02001_007E, change to esora
  • B02001_008E, change to e2mra
  • B02001_009E, change to e2mior
  • B02001_010E, change to em2reor3mr
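
The renames can also be sketched in pandas, shown here for S1901:

    # Rename the S1901 columns to short, readable names.
    df = df.rename(columns={
        "S1901_C01_001E": "totpop",
        "S1901_C02_001E": "totfam",
        "S1901_C02_012E": "medfam",
        "S1901_C02_013E": "meanfam",
    })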

Convert strings to Numbers

The variable columns are reading as strings because, before row 2 was removed, there was text in those columns. Because converting to numbers is a popular transform, one option is to toggle down a column, go to “Common Transforms,” and select “To number.”

After changing a few columns one by one, you might notice that there are A LOT of columns in this dataset. If you go to the “All” menu and choose “Transform,” then “To Number,” you will get a checked list of columns. Deselect “Geoid” and “Name” and select OK. (A pandas sketch of this bulk conversion follows below.)

Screenshot of the OpenRefine common transforms menu, with “To number” selected for multiple columns
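
The bulk conversion in pandas, leaving the identifier columns as strings:

    # Convert every column except the identifiers to numbers;
    # errors="coerce" turns any leftover text into missing values.
    import pandas as pd

    for col in df.columns:
        if col not in ("GEO_ID_2", "NAME"):
            df[col] = pd.to_numeric(df[col], errors="coerce")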

The final step for this dataset is to export it. You can do that using the Export button in the upper-right corner; choose to export to CSV. You will need to rename the OpenRefine output using the table number, because QGIS will adopt the name of the table as part of each variable name (a scripted rename is sketched after the list). For example, rename:

  • ACSST5Y2010_S1901 = S1901.csv
  • ACSDP5Y2010_DP04 = DP04.csv
  • ACSDT5Y2010_B02001  = B02001.csv
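
If you prefer to script the renaming, here is a minimal sketch; it assumes the OpenRefine exports use the project names shown above and sit in the current directory (adjust the names to match your files).

    # Rename the exported CSVs to their short table numbers.
    from pathlib import Path

    renames = {
        "ACSST5Y2010_S1901.csv": "S1901.csv",
        "ACSDP5Y2010_DP04.csv": "DP04.csv",
        "ACSDT5Y2010_B02001.csv": "B02001.csv",
    }
    for old, new in renames.items():
        Path(old).rename(new)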

Make sure you clean and rename all the Census tables.