Open Data Resources

MAJOR SOURCES OF DATA

The US government shares much of the data it collects with the general public by publishing it online in accessible, machine-readable formats (OPEN Government Data Act). Government datasets cover a wide range of topics, including:

  • Business
  • Education
  • Energy & Environment
  • Finance & Economics
  • Health
  • Human Services
  • Public Safety
  • Recreation
  • Transportation

>> Search by DATA.GOV

DATA.GOV

  • The US government's open data portal. Contains over 200,000 datasets collected from many federal agencies and departments.
  • It may not include every government dataset available, but serves as a great starting point.
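Because the portal's data is machine-readable, the catalog can also be searched from a script. The following is a minimal sketch, not an official client: it assumes the catalog exposes a CKAN-style `package_search` endpoint at catalog.data.gov, and the function names (`build_search_url`, `extract_titles`, `search_datasets`) are hypothetical helpers introduced only for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the catalog offers a CKAN-style "package_search" action here.
CKAN_SEARCH_URL = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, rows=5):
    """Return a search URL for `query`, limited to `rows` results."""
    return CKAN_SEARCH_URL + "?" + urlencode({"q": query, "rows": rows})


def extract_titles(response):
    """Pull dataset titles out of a CKAN-style search response (a dict)."""
    if not response.get("success"):
        return []
    return [item.get("title", "") for item in response["result"]["results"]]


def search_datasets(query, rows=5):
    """Query the portal over the network and return matching dataset titles."""
    with urlopen(build_search_url(query, rows), timeout=30) as resp:
        return extract_titles(json.load(resp))
```

Under these assumptions, calling `search_datasets("public transit ridership")` would return a short list of dataset titles to follow up on in the portal itself. URL building and response parsing are kept separate from the network call so each piece can be checked on its own.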

>> Search by City/State-level data portals

Most city and state governments maintain their own data portals similar to DATA.GOV, hosting datasets from their local departments.

opendata.cityofnewyork.us

The City of New York's open data portal offers many interesting datasets on the life of New Yorkers.

data.ny.gov

  • New York State also maintains a data portal with data collected from its state-level departments.
  • Example: The NYS Department of Health's COVID-19 data tracking coronavirus testing and infection rates.

The above are examples for New York, but each state and many major cities have their own data portals as well. You can find the complete listing here: US Local Government Open Data Portals

Want a few more good examples? Visit the Boston (data.boston.gov), San Francisco (datasf.org/opendata), and Seattle (data.seattle.gov) data portals.


>> Search by Government Agency/Department data portals

This method is useful if you already have a clear idea of the data you need. Consider whether a government agency or department is likely to collect data on your topic, then visit its website to see what data it has published online: Full listing of government agencies and departments

These sources may take a little more effort and detective work to find, but the data in them can be quite robust. Here are a few good examples:

census.gov/data

  • The Census Bureau collects data on the nation's people and economy.
  • Excellent source of demographic data on the US population.

fred.stlouisfed.org

  • The Federal Reserve Bank of St. Louis hosts over 500,000 financial and economic data series in its FRED portal.
  • Includes datasets on economic indicators, banking & finance, labor markets, employment, and national and international accounts.

healthdata.gov

  • From the US Department of Health & Human Services, this data portal covers a wide range of health topics.
  • Datasets on environmental health, medical devices, Medicare & Medicaid, social services, community health, mental health, and substance abuse.

data.transportation.gov

  • The US Department of Transportation publishes data on all the ways we get around, from driving to public transit, and even bicycling and walking.

 

>> Major Organizations/Associations

who.int/gho/data

  • WHO's Global Health Observatory (GHO) project provides data on major global health indicators like disease spread, vaccinations, and rates of other illnesses/conditions.

databank.worldbank.org

  • The World Bank publishes global development data on its DataBank platform.
  • Data on major world development indicators like GDP, population, life expectancy, and education levels.

pewresearch.org

  • The Pew Research Center provides data from their polling and survey studies on social issues, attitudes, and trends.
  • Covers topics on politics, media, culture, religion, and internet/tech.  

[list of topics] [list of dataset categories]


>> Data Science Communities

Data science communities upload and share interesting datasets for the purposes of model building and analysis.

kaggle.com/datasets

  • Kaggle is a popular data science competition site, where community members compete in building predictive models from uploaded datasets.
  • The datasets are incredibly varied and interesting, but also highly specific. Examples: NYC Airbnb data, YouTube trending videos data, video game sales data, and Netflix movies & shows data.
  • A word of caution: Check the dataset documentation to see how the owner collected the data. Some datasets are simply simulated data created for practice. Because any user in the Kaggle community can submit datasets, quality varies.

archive.ics.uci.edu/ml/datasets

  • Hosted by the University of California, Irvine, the UCI Machine Learning Repository contains many famous datasets commonly used to demonstrate machine learning techniques.
  • Examples: Challenger Space Shuttle dataset and Wine Quality dataset.
  • Datasets are typically already cleaned and preprocessed.
  • Categorized by the analysis methods they were designed for.

reddit.com/r/datasets

  • The r/datasets subreddit is an active community where data users share interesting datasets or request data they are seeking.
  • As with Kaggle, be mindful that content can be uploaded by anonymous members who have not been vetted.

data.fivethirtyeight.com

  • FiveThirtyEight, an analytics-focused politics, economics, and sports website, provides the data behind their articles and visualizations.

>> Dataset Aggregators

datasetsearch.research.google.com

  • Google's search engine designed specifically for finding datasets.
  • Many of the government and association-provided datasets discussed in this guide can be found through this platform as well.

github.com/awesomedata/awesome-public-datasets

  • Curated list of open data resources on GitHub (by the AwesomeData community)
  • Entries categorized neatly by subject area.

 

>> Scholarly Data Repositories

Many academic institutions provide open repositories in which researchers can store data produced in their studies and make it accessible to the community.

dataverse.harvard.edu

  • Harvard's Dataverse contains over 100,000 multidisciplinary datasets from published studies by researchers within and outside the Harvard community.

icpsr.umich.edu

  • ICPSR is a consortium of over 750 academic institutions and research organizations.
  • Contains over 250,000 datasets from research in the social, political, and behavioral sciences.
  • Includes topics on education, voting, criminal justice, substance abuse, and terrorism.

[list of major topic areas]

datadryad.org

  • Dryad is a non-profit initiative organized by several academic institutions, research societies, and publishers to archive and share data.
  • Its policy is to cover a broad range of research topics, though the collection currently leans toward the natural and life sciences.

ieee-dataport.org

  • IEEE is a prominent publisher in the electrical engineering and technology space.
  • Their DataPort platform serves as an interface to datasets from studies published in their many scholarly journals.

>> Directory of Data Repositories

Open data directories curate and organize lists of high-quality scholarly/research data repositories by subject area.

re3data.org

  • Over 2,000 research data repositories indexed.
  • Multidisciplinary list covering Humanities & Social Sciences, Natural Sciences, Life Sciences, and Engineering Sciences.

[Breakdown of subject areas]

oad.simmons.edu/oadwiki/Data_repositories

  • Maintained by a community of researchers and scholars.
  • If you're interested in browsing general datasets, start from their multidisciplinary section.

Suggest a dataset

Got an interesting dataset or data source to recommend? Let me know about it!

E-mail: charles.terng@baruch.cuny.edu