Open Data Resources (OLD)

Services

Need help with any of the data-related issues below?
Feel free to reach out to me for further assistance: charles.terng@baruch.cuny.edu


DATA FILE FORMATS

Unable to open a data file? Typically, datasets are provided in commonly readable formats (like .csv, which is fairly straightforward to open with Excel). However, you may sometimes encounter files stored in formats only accessible by specific programs (like SPSS .sav files). In such cases, you'll want to convert those files into a format you can actually open.

Many statistical programs can make these conversions. For example, R, a statistical programming language, has packages for this very purpose.

The Pew Research Center is a good example: it provides its datasets only as .sav files, and it has documentation on how to convert those files into a more readable format via R. [Source: Pew Research Center]
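
For readers who work in Python rather than R, a similar conversion can be sketched as below. This is a minimal sketch, not Pew's documented method: it assumes the pandas and pyreadstat packages are installed, and the function names and the file name survey.sav are hypothetical.

```python
from pathlib import Path


def csv_path_for(sav_path):
    """Return the path of the .csv file to write next to the .sav file."""
    return Path(sav_path).with_suffix(".csv")


def convert_sav_to_csv(sav_path):
    """Read an SPSS .sav file and save its contents as a .csv file."""
    # pandas is imported here because it is only needed when a file is
    # actually converted; reading .sav files also requires pyreadstat.
    import pandas as pd

    data = pd.read_spss(sav_path)       # load the dataset into a DataFrame
    out_path = csv_path_for(sav_path)
    data.to_csv(out_path, index=False)  # index=False omits the row-number column
    return out_path


# Hypothetical usage; "survey.sav" is a placeholder file name:
# convert_sav_to_csv("survey.sav")
```

The resulting .csv file can then be opened directly in Excel.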


WEB SCRAPING

You may come upon interesting data online that isn't necessarily structured or provided to you in a readable file. For example, let's say you were looking to collect college rankings data from US News & World Report's annual online publication of Best National Universities [view it here].

The table contains data you may be interested in, such as college name, ranking, location, tuition, and enrollment. However, the site does not have the table available for download as a spreadsheet or file:

[Image: screenshot of the rankings table as displayed on the webpage]

Ideally, you may want this data stored as an Excel spreadsheet, like so:

[Image: the same data stored as an Excel spreadsheet]

So, what are your options for getting this data from the webpage into an Excel file? Recording it manually is time-consuming and error-prone. Fortunately, web scraping can automate this task.

Web scraping is the process of extracting data from a webpage's HTML and formatting it according to your preferences. It requires some proficiency in programming and an understanding of how websites are structured in HTML. There are quite a few guides online, but I will highlight this one as a good intro for beginners: Web scraping with Python using Beautiful Soup by Vik Paruchuri.
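
To make the idea concrete, here is a rough sketch of how a table is laid out in HTML and how its cells can be pulled out with nothing but Python's standard library. The HTML snippet is made up for illustration (it is not real rankings data), and the class name TableRowParser is hypothetical.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up HTML standing in for a page's source; not real rankings data.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Rank</th></tr>
  <tr><td>Example University</td><td>1</td></tr>
  <tr><td>Sample College</td><td>2</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows  # one list of cell strings per table row
```

Each `<tr>` tag marks a row and each `<th>` or `<td>` tag marks a cell, so the scraper's job is largely a matter of finding the right tags and collecting the text inside them.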

For those familiar with the Python programming language, there are some excellent web scraping libraries to get started with:

  • Requests - a simple library for making HTTP requests
  • Beautiful Soup (BS4) - for parsing HTML and navigating to the tags where the desired data are located
  • Selenium - automates a web browser by simulating text entry and mouse clicks, which is useful for websites that require logins
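
As a rough illustration of how Requests and Beautiful Soup fit together, the sketch below downloads a page, reads the first HTML table it finds, and saves the rows to a .csv file that Excel can open. It assumes the requests and beautifulsoup4 packages are installed and that the data sits in a plain `<table>` element, which is not true of every site. The function names, the URL, and the output file name are all hypothetical.

```python
import csv

import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return response.text


def extract_table_rows(html):
    """Return the first table on the page as a list of rows of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (<th>) and data cells (<td>) are treated alike.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


def save_rows_as_csv(rows, path):
    """Write the rows to a .csv file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


# Hypothetical usage; the URL and file name are placeholders:
# rows = extract_table_rows(fetch_html("https://example.com/rankings"))
# save_rows_as_csv(rows, "rankings.csv")
```

Keeping the download, parsing, and saving steps in separate functions makes it easier to test the parsing on a saved copy of the page before making repeated requests to a live site.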

Suggest a dataset

Got an interesting dataset or data source to recommend? Let me know about it!

E-mail: charles.terng@baruch.cuny.edu