While free, publicly accessible datasets are increasingly abundant online, finding the data you need can still feel overwhelming. A lot is available to you out there, but it can seem scattered all over the place. To make things easier, we can think of open data as coming from three major sources: Government, Organizations/Communities, and Academia/Research.
The US government has made a concerted effort to share the data it collects with the general public by publishing it online in easily accessible, machine-readable formats (see OPEN Government Data Act). The datasets cover widespread and diverse topics, making them an extremely useful resource for data searchers from practically all disciplines. Major data areas of interest include:
DATA.GOV, the US government's open data portal, currently hosts more than 200,000 datasets collected from federal agencies and departments. It may not include every government dataset available, but it is a great starting point. The data hosted here are also typically well maintained and curated.
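If you prefer to search the catalog from a script rather than the website, something like the following may work. It is only a minimal sketch: it assumes the catalog exposes the standard CKAN `package_search` action at `catalog.data.gov`, and the function names are made up for illustration. Check the portal's current API documentation before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the catalog answers the standard CKAN "package_search" action here.
CATALOG_API = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(keywords, rows=5):
    """Return a catalog search URL for the given keywords (hypothetical helper)."""
    return CATALOG_API + "?" + urlencode({"q": keywords, "rows": rows})


def dataset_titles(response):
    """Pull dataset titles out of a decoded package_search response."""
    return [item["title"] for item in response["result"]["results"]]


def search_datasets(keywords, rows=5):
    """Query the catalog over the network and return the matching titles."""
    with urlopen(build_search_url(keywords, rows)) as reply:
        return dataset_titles(json.load(reply))
```

Calling `search_datasets("air quality")` would then return a short list of dataset titles to follow up on in the portal itself.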
>> Search by City/State-level data portals
Most city and state governments maintain their own data portals similar to DATA.GOV, hosting homegrown datasets from their local agencies.
The City of New York's open data portal offers many interesting datasets on the life of New Yorkers. Interested in where Citi Bikers go? Check out the Citi Bike data. Curious how restaurants score on their health inspections? Go to the New York City Restaurant Inspection Results dataset by the Department of Health and Mental Hygiene. How about crimes and violations reported in NYC? There's the NYPD Complaint dataset by the New York Police Department.
New York State also maintains a data portal with data collected from state-level departments. The NYS Department of Health's tracking of COVID-19 data, for example, has encouraged transparency of coronavirus testing and infection rates.
The above are New York examples, but each state and many major cities have their own data portals as well. You can find the complete listing here: US Local Government Open Data Portals
Want a few more good examples? Visit the Boston (data.boston.gov), San Francisco (datasf.org/opendata), and Seattle (data.seattle.gov) data portals.
>> Search by Government Agency/Department data portals
This method is useful if you already have a clear idea of the data you need. Consider whether a government agency or department might collect data on your topic, then visit its website to see what data it has published online: Full listing of government agencies and departments
It may take a little more effort and detective work to discover these sources, but the data found in them can be quite robust. Here are a few interesting examples that do an especially excellent job of sharing their data with the public.
>> Major Organizations/Associations
>> Data Science Communities
Data scientists analyze datasets and build predictive models from them. As such, the data science community has been creating data portals that host a wide variety of datasets for building or testing machine learning algorithms. These datasets can be quite interesting for other purposes as well.
Kaggle is a popular data science competition site where users can upload datasets for fellow community members to compete in building accurate predictive models. The datasets are incredibly varied and interesting, but also highly specific. A few examples: NYC Airbnb data, YouTube trending videos data, video game sales data, and Netflix movies & shows data. A word of caution: Be sure to check the dataset documentation to see how the owner collected the data. Sometimes it may simply be simulated data for the sake of practice. Since any user in the Kaggle community can submit datasets, the quality is not always high.
archive.ics.uci.edu/ml/datasets Hosted by the University of California, Irvine, the UCI Machine Learning Repository contains many famous datasets commonly used for practicing data analysis and machine learning techniques. You may even have seen some of them before in your statistics courses, such as the Challenger Space Shuttle dataset and the Wine Quality dataset. The benefit of this resource is that the data is typically already prepared for analysis. Datasets are even categorized by the methods of analysis they are suited for.
The r/datasets subreddit is an active community where data users share interesting datasets or request data they are seeking. As with Kaggle, be mindful that content can be uploaded by anonymous members who have not been vetted.
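Because community-contributed datasets vary in quality, it can help to take a quick look at a file before investing time in it. The sketch below is one simple way to do that, not a full quality audit: it counts rows and empty cells per column using only Python's standard library. The function name and the sample data are hypothetical stand-ins for a CSV you have downloaded.

```python
import csv
import io


def summarize_csv(text_stream):
    """Count rows and empty cells per column in a CSV stream."""
    reader = csv.DictReader(text_stream)
    missing = {name: 0 for name in (reader.fieldnames or [])}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in missing:
            # Treat absent, empty, or whitespace-only cells as missing.
            if not (row.get(name) or "").strip():
                missing[name] += 1
    return row_count, missing


# Hypothetical sample standing in for a downloaded file.
sample = io.StringIO("title,year,sales\nGame A,2010,1.5\nGame B,,\n")
rows, gaps = summarize_csv(sample)
print(rows, gaps)
```

For a real file, you could pass `open("your_file.csv", newline="")` in place of the sample stream. Columns with many empty cells are a hint to read the dataset documentation more closely.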
>> Dataset Aggregators
datasetsearch.research.google.com Google has a search engine designed specifically for finding datasets. Many of the government- and association-provided datasets described elsewhere in this guide can be found through Google's Dataset Search platform as well.
github.com/awesomedata/awesome-public-datasets There is a well-curated list of open data resources on GitHub (by the AwesomeData community), with entries categorized neatly by subject area. It includes some of the data portals listed in this guide.
Researchers commonly produce data as part of the research process. If they received government funding (NIH/NSF) for their study, they may even be required to make their datasets publicly available. Many academic institutions have open repositories in which researchers can store their data and make it visible.
>> Scholarly Data Repositories
>> Directory of Data Repositories
The scholarly/research data repositories listed above are only a small subset that I chose to highlight as good examples. There are in fact many more out there, and the selection is growing rapidly. Below are excellent resources that carefully curate lists of high-quality open data repositories and serve as directories to them. While they may not point to the datasets themselves, they direct you to data repositories that may host the data you're looking for.
Over 2,000 research data repositories are indexed in this directory, which includes a search interface to help you locate relevant repositories based on subject. It is a multidisciplinary collection, covering Humanities & Social Sciences, Natural Sciences, Life Sciences, and Engineering Sciences.
oad.simmons.edu/oadwiki/Data_repositories The Open Access Directory (OAD) is an initiative maintained by a community of researchers and scholars to list open access scholarly resources. They also publish a list of open access data repositories organized by subject area. If you're interested in browsing general datasets, start from the data repositories listed in their multidisciplinary section.
Got an interesting dataset or data source to recommend? Let me know about it!
E-mail: charles.terng@baruch.cuny.edu