Team Creates Open Source Data-Scraping Toolkit for Journalists
When online investigative journal ProPublica wanted to figure out just how much doctors are being paid by pharmaceutical companies to promote their drugs, reporters Dan Nguyen, Charles Ornstein, and Tracy Weber naturally turned to the internet for research. They quickly realized that, though the data exists on the Web, fashioning it into a comprehensive and useable format was nothing short of headache-inducing. Rather than give up, the journalists created their own data-scraping software using open source tools.
Now the team has generously published "Scraping for Journalism: A Guide for Collecting Data," a guidebook that explains how to create your own data-scraping tools for research and number crunching. "If you are a complete novice and have no short-term plan to learn how to code, it may still be worth your time to find out about what it takes to gather data by scraping web sites -- so you know what you’re asking for if you end up hiring someone to do the technical work for you," writes Nguyen.
The software is coded in Ruby, and the team used four additional open source tools to create it: Google Refine for data cleaning, the developer toolkit Firebug, the Ruby library Nokogiri, and Tesseract, Google's optical character recognition software, which turns scanned text into something searchable. Commercial software Adobe Acrobat was used to convert PDFs into HTML when the need arose.
Five separate guides take you through the step-by-step process of everything from how to read data from Flash sites to using Ruby code to scrape HTML. Using the documents as a how-to manual, it's possible to build your own tool to pull data from public records and collect it for your own research and study. In addition to supplying a huge amount of technical information, the guides also document the difficulties and obstacles the team encountered while developing the tool. Kudos to the team at ProPublica for their hard work and their generosity in sharing what they learned so other investigative journalists can benefit as well.