PUBLIC DATASETS | |
datasetsearch.research.google.com Google has a search engine designed specifically for finding datasets. It is a curated collection of datasets submitted to Google (not to be confused with an automatic crawl of the web for datasets). |
|
github.com/awesomedata/awesome-public-datasets There is a well-curated list of open data resources on GitHub (by the AwesomeData community), with entries categorized neatly by subject area. |
|
Kaggle is a popular data science competition site, where users can upload datasets they have collected so that fellow community members can compete in building accurate predictive models. The datasets are incredibly varied and interesting, but also highly specific. A few examples: NYC Airbnb data, YouTube trending videos data, video game sales data, and Netflix movies & shows data. A word of caution: Be sure to check the dataset documentation to see how the owner collected the data. Sometimes it may simply be simulated data created for the sake of practice. Since any user in the Kaggle community can submit datasets, the quality may not always be high. |
|
The r/datasets subreddit is an active community where data users share interesting datasets or request data they are seeking. As with Kaggle, be mindful that content can be uploaded by anonymous members who have not been vetted. |
RESEARCH DATASETS | |
Over 2,000 research data repositories are indexed in this directory, which includes a search interface to help you locate relevant repositories based on subject. Covers Humanities & Social Sciences, Natural Sciences, Life Sciences, and Engineering Sciences. |
|
oad.simmons.edu/oadwiki/Data_repositories The Open Access Directory (OAD) is an initiative maintained by a community of researchers and scholars to list open access scholarly resources. They also publish a list of open access data repositories organized by subject area. General data portals are listed in their multidisciplinary section. |
|
Harvard's Dataverse contains over 100,000 multidisciplinary datasets from published studies by researchers within and outside the Harvard community. |
|
archive.ics.uci.edu/ml/datasets Hosted by the University of California, Irvine, the UCI Machine Learning Repository contains many famous datasets commonly used for practicing data analysis and machine learning techniques. You may even have seen some of the datasets before in your statistics courses, such as the famous Challenger Space Shuttle dataset and Wine Quality dataset. The benefit of this resource is that the data is typically already prepared for analysis. Datasets are even categorized by the methods of analysis they are suited to. |
SUBJECT AREAS | DATA |
---|---|
Businesses | General data on US businesses -- including types, finances, investments, profits, expenditures, sales, and inventory [Source: US Census Bureau] Annual number of firms, establishments, employment, and payroll by US geographic location, industry, and company size [Source: Statistics of US Businesses (SUSB)] Production & business activity across industries such as retail, manufacturing, services, and tech [Source: Federal Reserve] |
Consumer | Consumer Expenditure Surveys (CE) -- data on expenditures, income, and characteristics of US consumers [Source: US Bureau of Labor Statistics] Consumer Price Index (CPI) -- average change over time in prices paid by consumers for goods and services [Source: US Bureau of Labor Statistics] |
Demographics of labor force | Gender, age, race, ethnic origin, education, etc. of the labor market [Source: US Bureau of Labor Statistics] |
Economic indicators | GDP and other indicators by geographical location and industry [Source: US Bureau of Economic Analysis] |
Employment | US employment data -- including occupational employment and wages, labor demand and turnover, and state of the labor market [Source: US Bureau of Labor Statistics] Employment change by industry, company size, and geographic area [Source: Statistics of US Businesses (SUSB)] |
International economic accounts | Data on international trade (exports and imports) in goods and services, investments, transactions (balance of payments), and activities of multinational enterprises (employment, sales, expenditures, R&D, etc.) [Source: US Bureau of Economic Analysis] |
Money statistics | Data on national finances, interest rates, exchange rates, consumer credit, and loans [Source: US Federal Reserve] |
Prices | Producer Price Index (PPI) data by industry and commodity [Source: US Bureau of Labor Statistics] Commodity prices, house price indexes, health care indexes, PPI & CPI, etc. [Source: Federal Reserve] Historical pricing data (up to 2012): indexes of producer and consumer prices, actual prices for selected commodities, food costs, and energy/fuel prices [Source: see "Prices" section of the Statistical Abstract of the United States] |
Tax | Statistical tables for individual and corporate taxes. Tax data at the organization level for charities and private foundations. [Source: Internal Revenue Service] |
FROM THE BARUCH LIBRARY | |
Wharton Research Data Services (WRDS) is an excellent data source for many business areas including Accounting, Banking, Economics, ESG, Finance, Healthcare, Insurance, and Marketing. It is NOT a source of open data like the others highlighted in this guide. However, Baruch students/faculty have access to this platform through the Library's subscription. You can access it here and more info on registering an account can be found here. |
SUBJECT AREAS | DATA |
---|---|
Arts & Culture | Research data (typically from surveys) on arts participation and opinions. [Source: National Archive of Data on Arts & Culture (by ICPSR)] |
Crime & Justice | Research data produced by studies in criminology, including gun- and weapon-related crimes, homicides, white-collar crimes, drugs, gangs, and sex crimes [Source: National Archive of Criminal Justice Data (by ICPSR)] |
Minority | Data on US minority populations regarding discrimination, education, crime, housing, public opinion, poverty and income, political participation, and more [Source: Resource Center for Minority Data (by ICPSR)] |
Politics & Voting | Public opinion survey data on political issues, presidential performance, and voting trends [Source: ICPSR] Political opinion survey data [Source: Pew Research Center] |
Social media / internet | Raw content data from social media networks (Facebook, Twitter, etc.), online communities (Reddit), media (YouTube), and e-commerce (Amazon) [Source: Stanford Network Analysis Project] |
Social trends | Public opinion survey data on American cultural and social trends, media, political climate, technology, and current issues [Source: Pew Research Center] |
Women in the labor force | Data on occupations, earnings, unemployment, and demographics for women [Source: US Department of Labor, Women's Bureau] |
Got an interesting dataset or data source to recommend? Let me know about it!
E-mail: charles.terng@baruch.cuny.edu