Open Data Resources (OLD)

PUBLIC DATASETS

datasetsearch.research.google.com

Google has a search engine designed specifically for finding datasets. It is a curated collection of datasets submitted to Google, not the result of automatically crawling the web for datasets.

github.com/awesomedata/awesome-public-datasets

There is a well-curated list of open data resources on GitHub (by the AwesomeData community), with entries categorized neatly by subject area.

kaggle.com/datasets

Kaggle is a popular data science competition site, where users can upload their collected datasets for fellow community members to compete in building accurate predictive models.

The datasets are incredibly varied and interesting, but also highly specific. A few examples: NYC Airbnb data, YouTube trending videos data, video game sales data, and Netflix movies & shows data.

A word of caution: Be sure to check the dataset documentation to see how the owner collected the data. Sometimes it is simply simulated data created for the sake of practice. Since any user in the Kaggle community can submit datasets, the quality may not always be high.
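One way to act on this caution is to run a quick sanity check on a downloaded file before relying on it. The sketch below is only an illustration: the sample data, the column names, and the summarize_quality helper are hypothetical stand-ins, not part of Kaggle or of any particular dataset. It uses Python's standard csv module to count rows, exact duplicate rows, and empty cells per column.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded CSV file;
# the column names and values are invented for illustration.
SAMPLE_CSV = """id,neighbourhood,price
1,Harlem,150
2,Midtown,
3,Chelsea,200
3,Chelsea,200
"""


def summarize_quality(csv_text):
    """Count rows, exact duplicate rows, and empty cells per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames  # reads the header row
    seen = set()
    duplicates = 0
    row_count = 0
    missing = {name: 0 for name in columns}
    for row in reader:
        row_count += 1
        key = tuple(row[name] for name in columns)
        if key in seen:
            duplicates += 1  # same values as an earlier row
        seen.add(key)
        for name in columns:
            # Treat empty strings and absent values as missing.
            if not (row[name] or "").strip():
                missing[name] += 1
    return {"rows": row_count, "duplicates": duplicates, "missing": missing}


report = summarize_quality(SAMPLE_CSV)
print(report)
```

A check like this does not replace reading the documentation, since it cannot tell you how the data was collected, but many duplicates or empty cells are a signal to look more closely before using a dataset.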

reddit.com/r/datasets

The r/datasets subreddit is an active community where data users share interesting datasets or request data they are seeking. As with Kaggle, be mindful that content can be uploaded by anonymous members who have not been vetted.

RESEARCH DATASETS

re3data.org

Over 2,000 research data repositories are indexed in this directory, which includes a search interface to help you locate relevant repositories based on subject. Covers Humanities & Social Sciences, Natural Sciences, Life Sciences, and Engineering Sciences.

[Breakdown of subject areas]

oad.simmons.edu/oadwiki/Data_repositories

The Open Access Directory (OAD) is an initiative maintained by a community of researchers and scholars to list open access scholarly resources. They also publish a list of open access data repositories organized by subject area.

General data portals are listed in their multidisciplinary section.

dataverse.harvard.edu

Harvard's Dataverse contains over 100,000 multidisciplinary datasets from published studies by researchers within and outside the Harvard community.

archive.ics.uci.edu/ml/datasets

Hosted by the University of California, Irvine, the UCI Machine Learning Repository contains many famous datasets commonly used for practicing data analysis and machine learning techniques.

You may even have seen some of the datasets before in your statistics courses, such as the famous Challenger Space Shuttle dataset and Wine Quality dataset.

The benefit of this resource is that the data is typically already prepared for analysis. Datasets are even categorized by the analysis methods they are suited to.

GOVERNMENT DATASETS

data.gov

The US government's open data portal contains datasets from the city and state level as well as from its many federal departments and agencies. Some topics include business, demographics, education, energy & environment, finance & economics, health, human services, public safety, recreation, and transportation.


SUBJECT AREAS DATA
Businesses

General data on US businesses -- including types, finances, investments, profits, expenditures, sales, and inventory [Source: US Census Bureau]

Annual number of firms, establishments, employment, and payroll by US geographic location, industry, and company size [Source: Statistics of US Businesses (SUSB)]

Production & business activity across industries such as retail, manufacturing, services, and tech. [Source: Federal Reserve]

Consumer

Consumer Expenditure Surveys (CE) -- data on expenditures, income, and characteristics of US consumers [Source: US Bureau of Labor Statistics]

Consumer Price Index (CPI) -- Average change over time in prices paid by consumers for goods and services [Source: US Bureau of Labor Statistics]

Demographics of labor force -- gender, age, race, ethnic origin, education, etc. of the labor market [Source: US Bureau of Labor Statistics]

Economic indicators

GDP and other indicators by geographical location and industry [Source: US Bureau of Economic Analysis]

Employment

US employment data -- including occupational employment and wages, labor demand and turnover, and state of the labor market [Source: US Bureau of Labor Statistics]

Employment change by industry, company size, and geographic area [Source: Statistics of US Businesses (SUSB)]

International economic accounts

Data on international trade (exports and imports) in goods and services, investments, transactions (balance of payments), and activities of multinational enterprises (employment, sales, expenditures, R&D, etc.) [Source: US Bureau of Economic Analysis]

Money statistics

Data on national finances, interest rates, exchange rates, consumer credit, and loans [Source: US Federal Reserve]

Prices

Producer Price Index (PPI) data by industry and commodity [Source: US Bureau of Labor Statistics]

Commodity prices, house price indexes, health care indexes, PPI & CPI, etc. [Source: Federal Reserve]

Historical pricing data (up to 2012): Indexes of producer and consumer prices, actual prices for selected commodities, food costs, and energy/fuel prices [Source: see the "Prices" section of the Statistical Abstract of the United States]

Tax

Statistical tables for individual and corporate taxes. Tax data at the organization-level for charities and private foundations. [Source: Internal Revenue Service]


FROM THE BARUCH LIBRARY

Wharton Research Data Services (WRDS) is an excellent data source for many business areas including Accounting, Banking, Economics, ESG, Finance, Healthcare, Insurance, and Marketing.

It is NOT a source of open data like the others highlighted in this guide. However, Baruch students/faculty have access to this platform through the Library's subscription. You can access it here and more info on registering an account can be found here.


SUBJECT AREAS DATA
Arts & Culture

Research data (typically from surveys) on arts participation and opinions. [Source: National Archive of Data on Arts & Culture (by ICPSR)]

Crime & Justice

Research data produced by studies in criminology, including guns/weapon-related crimes, homicides, white-collar crimes, drugs, gangs, and sex crimes [Source: National Archive of Criminal Justice Data (by ICPSR)]

Minority

Data on US minority populations regarding discrimination, education, crime, housing, public opinion, poverty and income, political participation, and more [Source: Resource Center for Minority Data (by ICPSR)]

Politics & Voting

Public opinion survey data on political issues, presidential performance, and voting trends [Source: ICPSR]

Political opinion survey data [Source: Pew Research Center]

Social media / internet

Raw content data from social media networks (Facebook, Twitter, etc.), online communities (Reddit), media (YouTube), and e-commerce (Amazon) [Source: Stanford Network Analysis Project]

Social trends

Public opinion survey data on American cultural and social trends, media, political climate, technology, and current issues [Source: Pew Research Center]

Women in the labor force

Data on occupations, earnings, unemployment, and demographics for women [Source: US Department of Labor, Women's Bureau]


HIGHLIGHTED RESOURCES

icpsr.umich.edu

Hosted by the University of Michigan, ICPSR is a consortium of over 750 academic institutions and research organizations.

It contains over 250,000 datasets from research in the social, political, and behavioral sciences, covering topics such as education, voting, criminal justice, substance abuse, and terrorism.

[list of major topic areas]

pewresearch.org

The Pew Research Center provides data from its polling and survey studies on social issues, attitudes, and trends. It covers topics in politics, media, culture, religion, and internet/tech.

[list of topics] [list of dataset categories]


Suggest a dataset

Got an interesting dataset or data source to recommend? Let me know about it!

E-mail: charles.terng@baruch.cuny.edu