BDEEP
Big Data in Environmental Economics and Policy
Research Group

Our team works on problems in public economics and public policy, with a particular focus on cities and the environment. Current projects cover climate change policy, urban growth and expansion, transportation policy, urban drinking water, housing discrimination, and environmental amenities in cities. We combine applied microeconomics with data science methods, and we have developed a software stack that enables the use of high-performance computing in observational research and experimental trials. We regularly partner with data and technology companies to bring new technological platforms and data sources into economic research. Please take a look at our current projects and our GitHub for examples of our research. Our work is made possible through generous funding support from the National Science Foundation, the US Environmental Protection Agency (EPA), the Sloan Foundation, the Russell Sage Foundation, and Uber Technologies.

You can find us on the 3rd floor of the National Center for Supercomputing Applications, at 1205 W Clark St, Urbana, IL 61801. We invite interested candidates to drop in on our weekly meetings to find out more about what we are working on.

Infrastructure Stack

We employ a set of customized tools to enable the acquisition and analysis of large datasets in observational research and experimental trials. Our infrastructure is designed to support a continuously integrated pipeline with the following primary steps:
  • Acquire large datasets.
  • Store large datasets in a fully secure system.
  • Pre-process and analyze data in virtual computing environments.
  • Host replicable and continuously updating datasets and applications.
  • Compile and continuously update documents for publication.

Here is a simple diagram of the BDEEP Infrastructure Pipeline.

Acquisition

Traffic delays, housing transactions, social media posts, pollution concentrations, and satellite observations are examples of data we regularly query. We build tools to facilitate the acquisition of these datasets. Different data sources are queried at different rates and using different protocols. For example, data are downloaded from monitoring websites at discrete intervals and, in some cases, in real time. We build scripts that acquire data by downloading, scraping, crawling, or directly engaging with users of applications.

These scripts run in Docker containers and are scheduled to execute in accordance with the research design of a project. Docker containers let developers and system administrators isolate applications so that they run in a consistent environment regardless of the host machine. BDEEP uses Docker for the vast majority of its applications.
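As a rough illustration of what an interval-based acquisition script might look like, here is a minimal Python sketch. It is not BDEEP's actual code: the `poll` function, the `fetch` callable, the `snapshot_*.json` file naming, and the `retrieved_at` field are all hypothetical names chosen for this example. The data source is passed in as a function so the same loop could wrap a download, a scraper, or an API call.

```python
import json
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, List, Optional


def poll(
    fetch: Callable[[], dict],
    out_dir: Path,
    interval_s: float,
    max_runs: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Path]:
    """Call fetch() repeatedly and write each result to its own JSON file.

    All names here are hypothetical. `fetch` stands in for whatever
    downloads or scrapes one snapshot from a data source.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    run = 0
    while max_runs is None or run < max_runs:
        # Stamp every snapshot with its retrieval time in UTC.
        record = {
            "retrieved_at": datetime.now(timezone.utc).isoformat(),
            "payload": fetch(),
        }
        path = out_dir / f"snapshot_{run:06d}.json"
        path.write_text(json.dumps(record))
        written.append(path)
        run += 1
        # Wait between runs, but not after the final one.
        if max_runs is None or run < max_runs:
            sleep(interval_s)
    return written
```

In a containerized setup, a loop like this (or an external scheduler that calls the script once per interval) would be the container's entry point, with the interval set by the project's research design.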

Store

In some cases, we store individual files (JSON, CSV, RDS, SHP, TIFF, etc.), but more often computational efficiency or other empirical protocols require database storage (e.g. MongoDB, SQL). Our infrastructure allows us to store large datasets and allows group members to access and query them. BDEEP uses Samba to host a shared network as well as an Active Directory server. The shared network allows members of our team to collaborate through a common network-mounted directory.

The Active Directory server also runs on Samba. It gives us a single credential server where we can add, delete, or modify user credentials in one place. Currently, it is only used to control access to the shared network, but it can support a range of other applications.

We recently added a Postgres database to our infrastructure. Because R holds data objects in memory, operations on large datasets often run into out-of-memory errors. Our Postgres database can handle the partitioning, subsetting, and merging operations for these larger datasets. This allows us to work more efficiently by targeting an analysis with database queries rather than loading entire datasets into R. We are developing a warehouse for the datasets stored in our Postgres database.
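The idea of pushing subsetting and aggregation into the database can be sketched in a few lines. The example below is an illustration under stated assumptions, not our production setup: it uses Python's built-in `sqlite3` module as a stand-in for Postgres so that it runs anywhere, and the `transactions` table, its columns, and the demo rows are invented for the example.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a small in-memory table (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions (city TEXT, year INTEGER, price REAL)"
    )
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?)",
        [
            ("Urbana", 2015, 100000.0),
            ("Urbana", 2015, 200000.0),
            ("Urbana", 2016, 150000.0),
            ("Chicago", 2015, 300000.0),
        ],
    )
    conn.commit()
    return conn


def mean_price_by_year(conn: sqlite3.Connection, city: str) -> list:
    """Filter and aggregate inside the database.

    Only the small summary result is returned to the client, so the
    full table never has to be loaded into memory.
    """
    query = """
        SELECT year, AVG(price) AS mean_price, COUNT(*) AS n
        FROM transactions
        WHERE city = ?
        GROUP BY year
        ORDER BY year
    """
    return conn.execute(query, (city,)).fetchall()


if __name__ == "__main__":
    for row in mean_price_by_year(build_demo_db(), "Urbana"):
        print(row)
```

The same pattern applies to a real Postgres server: the analysis script sends a parameterized query and receives only the rows it needs. The placeholder syntax depends on the driver (for instance, psycopg2 uses `%s` rather than `?`).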

Finally, we use AWS to back up our files on a weekly basis.

Analyze

All BDEEP team members can access BDEEP project files on our shared network. Members perform data analysis on an RStudio Server instance, which they access through a web browser. R is an increasingly widespread programming language in economics and data science. For information on how we set up RStudio Server on our cloud, see: Installing RStudioServer on Ubuntu 15.04.

Communicate

The results of our research are compiled and updated directly in for-publication documents and presentations using a system based on LaTeX. Key graphs, tables, maps, and other figures can also be hosted directly on our website in more interactive formats (e.g. d3.js, Shiny). These tools can be valuable for public engagement and transparency.
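One common way to keep a LaTeX document in sync with analysis output is to have the analysis write table fragments that the document pulls in with `\input{...}`. The sketch below shows that idea in Python; it is an assumption about how such a step could work, not a description of our actual system. The function name `to_latex_table` and the example variable names are hypothetical, and cell values are assumed to contain no characters that need LaTeX escaping.

```python
from typing import Sequence


def to_latex_table(
    header: Sequence[str],
    rows: Sequence[Sequence[object]],
    float_fmt: str = "{:.3f}",
) -> str:
    """Render rows as a LaTeX tabular fragment.

    The first column is left-aligned and the rest are right-aligned.
    Floats are formatted with float_fmt; other values use str().
    Cells are assumed to be LaTeX-safe already (no escaping is done).
    """
    col_spec = "l" + "r" * (len(header) - 1)
    lines = ["\\begin{tabular}{" + col_spec + "}", "\\hline"]
    lines.append(" & ".join(header) + " \\\\")
    lines.append("\\hline")
    for row in rows:
        cells = [
            float_fmt.format(c) if isinstance(c, float) else str(c)
            for c in row
        ]
        lines.append(" & ".join(cells) + " \\\\")
    lines.append("\\hline")
    lines.append("\\end{tabular}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical results, used only to show the output format.
    print(to_latex_table(["Variable", "Estimate", "N"],
                         [("Distance", 0.12345, 500)]))
```

Writing the returned string to a `.tex` file each time the analysis runs means the compiled document picks up new estimates on the next build, without copying numbers by hand.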

In keeping with standards for reproducible research, all BDEEP team members are expected to maintain their code in BDEEP’s GitHub repositories. Full documentation of our research methodologies is also available in the associated repositories.

Other Services/Requirements

We use a combination of cloud-based and cluster computing infrastructure to manage ongoing projects and push the computational frontier of empirical research. We make use of both cloud platforms (AWS and OpenStack, a cloud orchestration platform) and industry-standard computing clusters (the iForge/aForge Supercomputer Cluster at the National Center for Supercomputing Applications).

Here is an overview of all the platforms used in BDEEP infrastructure: Platforms