Thing 21: Tools of the trade

Dig in to dirty data.  What is it?  Why should we care? Try your hand at using an open source data cleansing tool.

  • Getting started: Data horror stories! Getting down and dirty with data
  • Learn more: Turn a PDF ‘tabula rasa’ into usable data with Tabula
  • Challenge me: OpenRefine is a powerful tool for cleaning up lots of  dirty data

Getting started

Dirty data stories

Data horror stories: how did it happen?

Why is “clean” data important? Public policy, changes to medical protocols and economic decisions all depend on accurate and complete data. This thing looks at the why and what of “dirty data.”

  1. Read some real data horror stories about dirty data.
  2. How does data get dirty in the first place? This short video (5 min) from the ASPCA (American Society for Prevention of Cruelty to Animals) shows simply how incomplete, inaccurate data can occur. You can imagine how the resultant problems would multiply exponentially the bigger the dataset.
  3. Browse down the Bad Data Guide’s list of commonly encountered data quality issues (with possible solutions). This list is aimed at journalists but it shows who is responsible for cleaning up dirty data. The Bad Data Handbook also contains views of experts and practitioners in this area.

Click into a few of the causes and solutions to dirty data - many of us contribute information to reports or do our home accounts in spreadsheets, and maybe it’s time to think about how clean our own data is.

If you have time: for a quick guide to working with spreadsheets, check out one of the School of Data’s Data Fundamentals course. The modules use real data from e.g. the World Bank.

Consider: the wide ranging implications of how dirty data can impact on your life.

Learn more

Extracting and scraping data

How often have you found some data that looks interesting, but they’re in PDFs or on a webpage… how do you get the data into spreadsheets so you can work with them?

The School of Data has fantastic, easy to follow tutorials working with real data.

Option 1:

Let’s start extracting tabular data from text-based PDFs. The Extracting Data From PDFs module provides a brief overview of the different techniques used to extract data from PDFs, with a focus on introducing Tabula, a free open-source tool build for this specific task.

  1. Get ready: go to Extracting Data From PDFs and download the correct version of Tabula for your operating system, and java runtime if required.

    (Note: this tutorial doesn’t work on scanned pdfs)

  2. Work through as much of the Tabula tutorial as you can and remember this tutorial for the next time you get a PDF with valuable (and hard-to-extract) data.

Option 2:

As much as we wish everything was available in CSV or the format of our choice – most data on the web is published in different forms. How do you extract data from HTML? Use a Scraper.

  1. Got to Making data on the web useful: scraping and follow the two ‘recipes’ to learn code-free Scraping in 5-10 minutes using Google Spreadsheets & Google Chrome

(Note: Use the Google Chrome Extension “Scraper, by dvhtn”)

A warning about web scraping: some websites have strict terms and conditions on the use of their content. In case of copyrighted materials, a user must obtain permission from copyright holders to reuse materials.

If you have time or just love data dabbling:

Extracting data from PDFs will inevitably result in some dirty data creeping into your dataset. The School of Data have some really interesting Data Cleansing modules.

Consider: strategies for encouraging data to be published in more re-usable formats rather than PDF.

Challenge me

Roll up your sleeves: clean up dirty data

OpenRefine (formerly Google Refine) is a valuable open source tool that is similar to Excel but more powerful. You can use it to: record data; manipulate data; clean up dirty data; and to transform datasets.

Option 1: If you are new to OpenRefine

  1. Start by watching introduction to OpenRefine (6.48mins) to learn how it can be used to clean up messy data.
  2. Now get hands on!  You will need to download OpenRefine and the dataset for this activity. Work your way through as much as this tutorial as you can. The sections covering Facets and Transforming Values will give you a flavour for what the OpenRefine tool offers.

Option 2: if you’re already familiar with OpenRefine and might be ready to share what you know

Take a look through the lesson materials used to teach OpenRefine in either:

  1. OpenRefine in Library Carpentry
  2. OpenRefine in Data Carpentry for Ecology

Consider: Carpentry course Trainers and Helpers are often in high demand. Would you consider becoming a trainer to teach tools such as Open Refine?

Do you have a question? Want to share a resource?

Keep on going to the next thing: What's in a name? or return to all the things