(Originally posted on my old blog on Jan 9, 2014)
As part of the content analysis I am doing for my dissertation, I’ve started to look at using Python to scrape documents from the web, as well as clean them up for analysis. In theory this should save me lots of man-hours of work; in reality, well, we’ll see.
Anyway, I thought I would share my (very limited) experience and sources here. This is meant for total newbies, so if you know anything about coding, go hang out at Stack Overflow and make fun of me there.
For learning how to use Python there are several great resources out there. Specifically, Neal Caren and the Programming Historian have great how-to guides on data scraping, APIs, and the like. Neal Caren’s page is definitely more beginner-friendly.
Noob Note: One thing the online sources tend not to cover very well is the difference between using the terminal and the Python program itself. If you have a Mac, which I do, Python comes preinstalled and ready to run from the terminal. Any time you see $ in sample code, that means you should type that line directly into the terminal, not into Python itself. Go learn about the terminal. It is your friend. See here for some more info on the terminal, as well as how to use Python in Windows.
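To make that concrete, here is a tiny sketch of what terminal commands look like. The $ marks the prompt the terminal draws for you; you never type the $ itself. (The filenames below are just made-up examples.)

```shell
# The $ in tutorials stands for the terminal prompt; don't type it.
# Check which Python your terminal runs:
python3 --version
# Run a one-line Python program straight from the terminal,
# without opening the interactive Python interpreter:
python3 -c 'print("hello from the terminal")'
# Run a script file (hypothetical filename) the same way:
# python3 myscraper.py
```

If you type one of these lines inside the Python interpreter instead, you will get a syntax error, which is the usual sign you have mixed up the two.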
Another good resource is Codecademy, where the Sunlight Foundation has a class on how to use its Capitol Words API. This is a pretty good quick-and-dirty introduction to how to pull info from the Congressional Record, but there are a few issues with it.
- First, the code is not meant for the newest versions of Python (the print function changed). In Python 3.x, you need to put parentheses around what you plan to print.
- A second problem is with the response.json line. While the simulator will run this, in real life you are going to have to put parentheses after it, so it reads response.json().
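Here is a small sketch of both fixes. To keep it self-contained (no live API call), I use the standard-library json module to stand in for what response.json() would give you back from the requests library; the sample JSON string is made up:

```python
import json

# Fix 1: in Python 3, print is a function, so parentheses are required.
print("hello")        # works in Python 3
# print "hello"       # Python 2 style; a SyntaxError in Python 3

# Fix 2: in the requests library, .json is a method, so it needs
# parentheses too: data = response.json(), not data = response.json.
# Simulated here with a made-up API response:
raw = '{"results": [{"phrase": "budget", "count": 42}]}'
data = json.loads(raw)  # roughly what response.json() would return
print(data["results"][0]["phrase"])
```

Without the parentheses, response.json just hands you the method object itself instead of calling it, which is why the code silently stops working outside the simulator.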
Hope that helps. If not, there’s always getting tenure and then forcing your graduate students to do this for you.