Chapter 5 Getting OEPS Data from BigQuery

Opioid Environment Policy Scan (OEPS) data is also available on Google BigQuery. In this notebook, we'll go over how to interact with that data using bigrquery, covering two of the bigrquery APIs: one for readers familiar with SQL, and one for readers who want to avoid SQL. Readers who are already familiar with Google BigQuery will likely want to skip ahead to those API sections (5.3 and 5.4).

5.1 Overview

When making queries against a BigQuery dataset, we do not query the dataset directly. Instead, we connect to a BigQuery project and submit a job, which tells the project to run the query on our behalf and return the data. You can think of this like connecting to another computer that middlemans the exchange.

This setup allows users to work with multiple BigQuery datasets from a single project, and also allows billing to be separated, so that data providers only pay to store the data rather than also paying for every query made against it.

The OEPS data warehouse itself is named oeps-391119 on BigQuery, and is divided into two datasets: tabular and spatial. The tabular dataset consists of 16 tables of attribute data at the state, county, tract, and ZCTA scales from 1980 to 2020. The spatial dataset contains the 2010 TIGER/Line geometries for each of these scales. The primary key across both datasets is HEROP_ID. A full dataset schema can be found on the OEPS BigQuery reference linked here.
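
As a quick orientation, every table in the warehouse is addressed by a fully qualified project.dataset.table ID. A minimal sketch, using two tables we will query later in this chapter:

# BigQuery tables are addressed as <project>.<dataset>.<table>, e.g.:
county_1990  <- "oeps-391119.tabular.C_1990"       # 1990 county-level attributes
county_geoms <- "oeps-391119.spatial.counties2010" # 2010 county geometries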

5.2 Setting up BigQuery

You can set up BigQuery for usage in R in three broad steps:

  1. Enabling BigQuery on your Google Account

  2. Grabbing the name of your BigQuery resource

  3. Connecting everything to bigrquery

Let’s start at step one. Sign into a Google account in your browser of choice, then navigate to this link, where you will be prompted to “Enable BigQuery.” Doing so allows your account to access BigQuery and the data available through it.

Once BigQuery is enabled, you’ll be taken to the BigQuery studio page. This page is a hub for BigQuery interaction on the cloud, and technically also a place from which you can test out SQL queries and manage connections to external databases. In the BigQuery diagram, it’s the computer on the cloud that you submit jobs to.

For our purposes, we’re interested in the resources listed under the Explorer pane. Google automatically creates a default resource with a name made of random words and numbers. You can proceed using this resource, or create a new, more memorably named resource through the “+ ADD” button at the top of the Explorer pane.

Whichever route you take, we need to store the name of your BigQuery project in a variable for later use. As it’s this project that gets billed for queries, it’s conventional to refer to it as “billing.”

billing <- "oeps-tutorial" # replace this with your project name!

“Will I be charged money for using BigQuery?”

It’s unlikely outside of unreasonably intense usage of BigQuery.

As of September 2024, BigQuery’s free tier allows 1 TiB of query processing per month. The entire OEPS dataset is less than 1 GiB in size, so you would need to pull the entire dataset over 1,000 times in a month to leave the free tier. Unless you have an automated pipeline pulling from OEPS or are also querying other, larger datasets on BigQuery, this is a hard limit to reach!
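
To put that in perspective, a quick back-of-the-envelope calculation, generously assuming a full pull scans the whole ~1 GiB:

# How many full-dataset pulls fit in the monthly free tier?
free_tier_bytes <- 2^40 # 1 TiB
full_pull_bytes <- 2^30 # ~1 GiB, a generous upper bound for OEPS
free_tier_bytes / full_pull_bytes
#> [1] 1024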

For more on BigQuery billing, see the Google BigQuery pricing page.

Lastly, we need to establish that we actually have permission to create jobs on this account. To do that, we can use bigrquery::bq_auth(), and then grant the Tidyverse API a few permissions on our Google Account. Note that this command will prompt you to open a new window in your browser.

# Opens your browser to authenticate your account
bigrquery::bq_auth()
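
If you’re running your code non-interactively (say, in a scheduled script), the browser prompt can get in the way. bq_auth() also accepts an email argument to reuse a cached token, or a path argument pointing to a service account key file; the values below are placeholders:

# Reuse a cached token for a known account, skipping the browser prompt
# bigrquery::bq_auth(email = "you@example.com")
# Or authenticate with a service account key file (path is a placeholder)
# bigrquery::bq_auth(path = "path/to/service-account-key.json")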

Now that bigrquery is set up, we can explore two of its interfaces for interacting with BigQuery: a low-level API that uses SQL, and a higher-level API that uses dplyr.

5.3 The low-level API

The low-level API offers a series of functions that can be used to interact with BigQuery’s REST API. While bigrquery offers quite a few of these, two are usually sufficient: bq_project_query() and bq_table_download().

Using these commands, we can create and submit SQL queries to pull data tables from the OEPS data warehouse:

library(bigrquery)

# Our query; the backticks keep the hyphenated project name unambiguous
sql <- 'SELECT HEROP_ID, TotPop, PovP FROM `oeps-391119.tabular.C_1990`'

# Submit a job to grab the data
tb <- bq_project_query(billing, query=sql)
# Download the results of that query to our system
results <- bq_table_download(tb)

head(results)
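
Before running a large query, it can be useful to know how much data it will scan, and therefore how much of the free tier it will use. Recent versions of bigrquery include bq_perform_query_dry_run(), which asks BigQuery to estimate the bytes processed without actually running the query; a minimal sketch:

# Estimate the bytes a query would scan, without running (or billing) it
bq_perform_query_dry_run(sql, billing = billing)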

We can also use more complex queries:

sql <- '
    SELECT
      C_1990.HEROP_ID,
      (C_2000.PovP - C_1990.PovP) AS ChangeInPovP,
      (C_2000.TotPop - C_1990.TotPop) AS ChangeInPop
    FROM `oeps-391119.tabular.C_1990` INNER JOIN `oeps-391119.tabular.C_2000`
      ON C_1990.HEROP_ID=C_2000.HEROP_ID
'

tb <- bq_project_query(billing, sql)
results <- bq_table_download(tb)

head(results)

If we want to plot this data, we need to query the spatial dataset. This is doable, but interactive R environments are not always a fan of the raw result, so we’re forced to turn the results into an sf object before attempting to preview them. That is, for the following setup:

library(sf)

sql <- 'SELECT HEROP_ID, geom FROM `oeps-391119.spatial.counties2010`'
tb <- bq_project_query(billing, sql)

this breaks:

# bad
results <- bq_table_download(tb)
head(results) 

And this works:

# good
results <- bq_table_download(tb) |> 
  st_as_sf(wkt='geom', crs='EPSG:4326') # convert geom to sf
head(results)

5.3.1 A full low-level pipeline

Putting this all together, we can create a quick map of how county level poverty changed from 1990 to 2000:

library(tmap)
tmap_mode('view')
sql <- '
    SELECT
      C_1990.HEROP_ID,
      (C_2000.PovP - C_1990.PovP) AS ChangeInPovP,
      counties2010.name, counties2010.geom
    FROM `oeps-391119.tabular.C_1990` INNER JOIN `oeps-391119.tabular.C_2000`
      ON C_1990.HEROP_ID=C_2000.HEROP_ID
    INNER JOIN `oeps-391119.spatial.counties2010`
      ON C_1990.HEROP_ID=counties2010.HEROP_ID
'

tb <- bq_project_query(billing, sql)
results <- bq_table_download(tb) |> st_as_sf(wkt='geom', crs='EPSG:4326')

tm_shape(results) + 
  tm_fill('ChangeInPovP', 
          style='sd', 
          midpoint=0,
          title='Change in Poverty\n1990 to 2000',
          palette='-RdYlBu')
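
If you’d like to keep a local copy of the joined result rather than re-querying BigQuery each session, sf can write it straight to disk; the file name here is just an example:

# Cache the joined result locally as GeoJSON (file name is an example)
st_write(results, 'change_in_povp.geojson')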

5.4 The dplyr API

For users with less SQL familiarity, it’s also possible to use dplyr to interact with BigQuery. We’ll need the help of DBI, a library for interacting with databases in R. (dplyr also leans on the dbplyr package to translate our code into SQL behind the scenes; it needs to be installed, but not loaded.)

library(dplyr)
library(DBI)
library(bigrquery)

For this pipeline, we use DBI to connect to a given dataset (e.g. tabular), pick a table within that dataset, and then manipulate the table using dplyr.

# Connect to the tabular dataset
conn <- dbConnect(
  bigquery(),
  project = 'oeps-391119',
  dataset = 'tabular',
  billing = billing
)

# List off available tables
dbListTables(conn)
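
We can also peek at the columns of a single table before querying it, using DBI’s dbListFields():

# List the columns available on one table
dbListFields(conn, 'C_1990')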

We can then pick a table from the above and interact with it using dplyr.

C_1990 <- tbl(conn, 'C_1990')

C_1990 |> 
  select(HEROP_ID, TotPop, PovP) |>
  collect() |>
  head()
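
A note on what’s happening here: tbl() doesn’t download anything. It builds a lazy reference to the remote table, and dplyr translates each verb into SQL that only runs when we call collect(). If you’re curious about the generated SQL, show_query() will print it:

# Inspect the SQL that dplyr generates, without running it
C_1990 |>
  select(HEROP_ID, TotPop, PovP) |>
  show_query()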

As with the low-level API, we can also do more complex tasks, albeit with a few more lines of code:

C_1990 <- tbl(conn, 'C_1990') |>
  select(HEROP_ID, PovP1990=PovP, TotPop1990=TotPop) |>
  collect()
C_2000 <- tbl(conn, 'C_2000') |>
  select(HEROP_ID, PovP2000=PovP, TotPop2000=TotPop) |>
  collect()

changes <- merge(C_2000, C_1990, by='HEROP_ID') |>
  mutate(ChangeInPovP = PovP2000-PovP1990,
         ChangeInTotPop = TotPop2000-TotPop1990) |>
  select(HEROP_ID, ChangeInPovP, ChangeInTotPop)

head(changes)
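
The join above happens locally, after both tables have been downloaded. Because tbl() objects are lazy, an alternative is to let BigQuery do the join server-side and download only the final result; a sketch of that approach:

# Alternative: perform the join inside BigQuery, then download the result
changes_remote <- inner_join(
  tbl(conn, 'C_2000') |> select(HEROP_ID, PovP2000 = PovP, TotPop2000 = TotPop),
  tbl(conn, 'C_1990') |> select(HEROP_ID, PovP1990 = PovP, TotPop1990 = TotPop),
  by = 'HEROP_ID'
) |>
  mutate(ChangeInPovP = PovP2000 - PovP1990,
         ChangeInTotPop = TotPop2000 - TotPop1990) |>
  select(HEROP_ID, ChangeInPovP, ChangeInTotPop) |>
  collect()

head(changes_remote)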

We can interact with spatial data too. It requires the same hack as above: we cannot preview our results until after converting them to an sf object, at least within interactive R environments.

In other words, with this connection to the spatial dataset:

# Connect to the spatial dataset
spatial_conn <- dbConnect(
  bigquery(),
  project = 'oeps-391119',
  dataset = 'spatial',
  billing = billing
)

This breaks:

# breaks
counties2010 <- tbl(spatial_conn, 'counties2010') |> 
  collect()

head(counties2010)

and this works:

# works
counties2010 <- tbl(spatial_conn, 'counties2010') |>
  collect() |>
  st_as_sf(wkt='geom', crs='EPSG:4326')

head(counties2010)

5.4.1 A full dplyr pipeline

Putting all the pieces together, we can make our poverty map with the following code:


# Make database connections
spat_conn <- dbConnect(
  bigquery(),
  project = 'oeps-391119',
  dataset = 'spatial',
  billing = billing
)
tab_conn <- dbConnect(
  bigquery(),
  project = 'oeps-391119',
  dataset = 'tabular',
  billing = billing
)

# Grab tables
C_1990 <- tbl(tab_conn, "C_1990")
C_2000 <- tbl(tab_conn, "C_2000")
counties2010 <- tbl(spat_conn, "counties2010")

# Data wrangling
C_1990 <- C_1990 |>
  select(HEROP_ID, PovP1990=PovP) |>
  collect()
C_2000 <- C_2000 |>
  select(HEROP_ID, PovP2000=PovP) |>
  collect()

change_in_pov <- merge(C_2000, C_1990, by='HEROP_ID') |>
  mutate(ChangeInPovP=PovP2000-PovP1990) |>
  select(HEROP_ID, ChangeInPovP)

counties2010 <- counties2010 |> 
  collect() |> 
  st_as_sf(wkt='geom', crs='EPSG:4326')
  
change_in_pov <- merge(counties2010, change_in_pov, by='HEROP_ID')

tm_shape(change_in_pov) +
  tm_fill('ChangeInPovP', 
          style='sd', 
          midpoint=0,
          title='Change in Poverty\n1990 to 2000',
          palette='-RdYlBu')
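
Once you’re done, it’s good practice to close the database connections:

# Close the connections when finished
dbDisconnect(spat_conn)
dbDisconnect(tab_conn)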