Open Data Resources (OLD)

Introduction

While free and publicly accessible datasets are increasingly abundant online, finding the data you need can still feel overwhelming. There is a lot available, but it can seem scattered all over the place. To make things easier, we can think of open data as coming from three major sources: Government, Organizations/Communities, and Academia/Research.

The US government has made a concerted effort to share the data it collects with the general public by publishing it online in easily accessible, machine-readable formats (see OPEN Government Data Act). The datasets cover widespread and diverse topics, making them an extremely useful resource for data-searchers from practically all disciplines. Major data areas of interest include:

  • Business
  • Education
  • Energy & Environment
  • Finance & Economics
  • Health
  • Human Services
  • Public Safety
  • Recreation
  • Transportation

>> Search by DATA.GOV

DATA.GOV, the US government's open data portal, currently contains over 200,000 datasets collected from its many federal agencies and departments. It may not include every government dataset available, but it does serve as a great starting point. The data hosted here also benefit from being typically well-maintained and curated.
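If you would rather search the catalog from a script than from the website, the idea can be sketched in Python as below. This is a minimal sketch under assumptions not stated in this guide: it assumes the DATA.GOV catalog exposes the standard CKAN "package_search" action at catalog.data.gov, and the helper name build_datagov_search_url is hypothetical. The code only builds the search URL and makes no network call, so check the portal's current API documentation before relying on it.

```python
from urllib.parse import urlencode

# Assumed endpoint: the standard CKAN "package_search" action on the
# DATA.GOV catalog. Verify it against the portal's API documentation.
CATALOG_SEARCH = "https://catalog.data.gov/api/3/action/package_search"


def build_datagov_search_url(keywords: str, rows: int = 10) -> str:
    """Return a catalog search URL for the given keywords (hypothetical helper)."""
    if rows < 1:
        raise ValueError("rows must be at least 1")
    # "q" is the CKAN free-text query; "rows" caps the number of results.
    query = urlencode({"q": keywords, "rows": rows})
    return f"{CATALOG_SEARCH}?{query}"


if __name__ == "__main__":
    print(build_datagov_search_url("public transit ridership", rows=5))
```

Pasting the printed URL into a browser should return a JSON listing of matching datasets if the endpoint assumption holds.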

>> Search by City/State-level data portals

Most city and state governments maintain their own data portals similar to DATA.GOV, hosting homegrown datasets from their local agencies.

opendata.cityofnewyork.us

The City of New York's open data portal offers many interesting datasets on the life of New Yorkers.

Interested in where Citi Bikers go? Check out the Citi Bike data.

Curious how restaurants score on their health inspections? Go to the New York City Restaurant Inspection Results dataset by the Department of Health and Mental Hygiene.

How about crimes and violations reported in NYC? There's the NYPD Complaint dataset by the New York Police Department.

data.ny.gov

New York State also maintains a data portal with data collected from state-level departments.

The NYS Department of Health's tracking of COVID-19 data, for example, has encouraged transparency around coronavirus testing and infection rates.

The above are New York examples, but most states and many major cities have their own data portals as well. You can find the complete listing here: US Local Government Open Data Portals

Want a few more good examples? Visit the Boston (data.boston.gov), San Francisco (datasf.org/opendata), and Seattle (data.seattle.gov) data portals.
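Many city and state portals can also be queried directly. The sketch below assumes the portal runs on the Socrata platform, which, as far as I know, is the case for both New York portals above, and that each dataset is served as JSON at /resource/<dataset-id>.json with "$limit" and "$where" query parameters. The API host data.cityofnewyork.us, the dataset ID "abcd-1234", the column name "boro", and the helper build_socrata_url are all placeholders or assumptions for illustration. Look up the real ID and column names on the dataset's page.

```python
import re
from typing import Optional
from urllib.parse import urlencode

# Socrata dataset IDs are assumed to follow a "four-by-four" pattern.
DATASET_ID_PATTERN = re.compile(r"^[a-z0-9]{4}-[a-z0-9]{4}$")


def build_socrata_url(domain: str, dataset_id: str, limit: int = 5,
                      where: Optional[str] = None) -> str:
    """Return a JSON query URL for a Socrata-hosted dataset (hypothetical helper)."""
    if not DATASET_ID_PATTERN.match(dataset_id):
        raise ValueError(f"unexpected dataset ID format: {dataset_id!r}")
    params = {"$limit": limit}
    if where is not None:
        # "$where" takes a filter expression written in Socrata's query language.
        params["$where"] = where
    # Keep "$" unescaped so the parameter names stay readable.
    query = urlencode(params, safe="$")
    return f"https://{domain}/resource/{dataset_id}.json?{query}"


if __name__ == "__main__":
    # "abcd-1234" and "boro" are placeholders, not a real dataset or column.
    print(build_socrata_url("data.cityofnewyork.us", "abcd-1234",
                            limit=3, where="boro = 'BROOKLYN'"))
```

Building the URL separately from fetching it makes it easy to inspect the query in a browser first. Portals that run on other platforms will need a different URL scheme.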


>> Search by Government Agency/Department data portals

This method may be useful if you already have a clear idea of the data you need. Consider whether a government agency or department is likely to collect data on your topic, then visit its website to see what data it has published online: Full listing of government agencies and departments

It may take a little more effort and detective work to discover, but the data found in these sources can be quite robust. Here are a few agencies that do an especially good job of sharing their data with the public.

census.gov/data

The Census Bureau focuses on collecting data on the nation's people and economy. If you're interested in demographics of the US population, this is the source for you.

fred.stlouisfed.org

The Federal Reserve Bank of St. Louis hosts over 500,000 financial and economic data series in its FRED portal, including datasets on economic indicators, banking & finance, labor markets, employment, and national and international accounts.

healthdata.gov

From the US Department of Health & Human Services, this data portal covers a wide range of health topics, such as environmental health, medical devices, Medicare & Medicaid, social services, community health, mental health, and substance abuse.

data.transportation.gov

The US Department of Transportation publishes data on all the ways we get around, from driving to public transit, and even bicycling and walking.

 

>> Major Organizations/Associations

who.int/gho/data

WHO is a worldwide leader in collecting health data. While they have many useful data products, the Global Health Observatory (GHO) project is especially good at providing data on major global health indicators like disease spread, vaccinations, and rates of other illnesses/conditions.

I linked their old GHO interface here since I feel it works better, but they do have a newer version available as well.

databank.worldbank.org

The World Bank publishes global development data on its DataBank platform. You can find data on major world development indicators such as GDP, population, life expectancy, and education levels, to name a few.

pewresearch.org

The Pew Research Center provides data from its polling and survey studies on social issues, attitudes, and trends. It covers topics in politics, media, culture, religion, and internet/tech.

[list of topics] [list of dataset categories]


>> Data Science Communities

Data scientists analyze datasets and build predictive models from them. As a result, the data science community has set up portals that host a wide variety of datasets for training and testing machine learning algorithms. These datasets can be quite interesting for other purposes as well.

kaggle.com/datasets

Kaggle is a popular data science competition site where users can upload datasets they have collected so that fellow community members can compete to build accurate predictive models.

The datasets are incredibly varied and interesting, but also highly specific. A few examples: NYC Airbnb data, YouTube trending videos data, video game sales data, and Netflix movies & shows data.

A word of caution: Be sure to check the dataset documentation to see how the owner collected the data. Sometimes it may simply be simulated data meant for practice. Since any user in the Kaggle community can submit datasets, the quality may not always be high.

archive.ics.uci.edu/ml/datasets

Hosted by the University of California, Irvine, the UCI Machine Learning Repository contains many famous datasets commonly used for practicing data analysis and machine learning techniques.

You may even have seen some of the datasets before in your Statistics courses, such as the famous Challenger Space Shuttle dataset and Wine Quality dataset.

The benefit of this resource is that the data is typically already prepared for analysis. Datasets are even categorized by the methods of analysis they are suited to.

reddit.com/r/datasets

The r/datasets subreddit is an active community where data users share interesting datasets or request data they are seeking. As with Kaggle, be mindful that content can be uploaded by anonymous members who have not been vetted.


>> Dataset Aggregators

datasetsearch.research.google.com

Google has a search engine designed specifically for finding datasets. Many of the government and association-provided datasets described elsewhere in this guide can be found through Google's Dataset Search platform as well.

github.com/awesomedata/awesome-public-datasets

There is a well-curated list of open data resources on GitHub (by the AwesomeData community), with entries categorized neatly by subject area. It includes some of the data portals listed in this guide.

 

Researchers commonly produce data as part of the research process. If they received government funding for their study (from NIH or NSF, for example), they may even be required to publish their datasets publicly. Many academic institutions have open repositories in which researchers can store their data and make it visible.

>> Scholarly Data Repositories

dataverse.harvard.edu

Harvard's Dataverse contains over 100,000 multidisciplinary datasets from published studies by researchers within and outside the Harvard community.

icpsr.umich.edu

Hosted by the University of Michigan, ICPSR is a consortium of over 750 academic institutions and research organizations.

It contains over 250,000 datasets from research in the social, political, and behavioral sciences, including topics such as education, voting, criminal justice, substance abuse, and terrorism.

[list of major topic areas]

datadryad.org

Dryad is a non-profit initiative organized by several academic institutions, research societies, and publishers to archive and share data. The collection policy is "subject-agnostic", intending to cover a broad range of research topics. However, it does seem to lean toward the natural and life sciences at the moment.

ieee-dataport.org

IEEE is a prominent publisher in the electrical engineering and technology space. Their DataPort platform serves as an interface to datasets from studies published in their many scholarly journals.


>> Directory of Data Repositories

The scholarly/research data repositories listed above are only a small subset that I chose to highlight as good examples. There are in fact many out there to choose from, and it's a selection that's growing rapidly. Below are excellent resources that carefully curate a list of high-quality open data repositories and serve as a directory to them. While they may not point to the datasets themselves, they direct you to data repositories that would host the data you're looking for.

re3data.org

Over 2,000 research data repositories are indexed in this directory, which includes a search interface to help you locate relevant repositories based on subject. It is a multidisciplinary collection, covering Humanities & Social Sciences, Natural Sciences, Life Sciences, and Engineering Sciences.

oad.simmons.edu/oadwiki/Data_repositories

The Open Access Directory (OAD) is an initiative maintained by a community of researchers and scholars to list open access scholarly resources. They also publish a list of open access data repositories organized by subject area.

If you're interested in browsing general datasets, start from the data repositories listed in their multidisciplinary section.

Suggest a dataset

Got an interesting dataset or data source to recommend? Let me know about it!

E-mail: charles.terng@baruch.cuny.edu