Tuesday, 8 August 2017

How to Automatically Clean Up Spreadsheet Data with OpenRefine

Ever had to manually edit dusty, messy, years-old information from some obsolete software?

I once worked for a company that stored paperwork offsite for 60 years. Materials were indexed in a document table. Most records had a box number, storage date, storage vendor receipt number, and a rough idea of the contents. Most, mind you.

Over 60 years the list got … messy. Storage contracts changed several times—so the box codes and vendor receipts varied over time. Add in the random mistakes that added up over time, and you had quite a mess.

My job was transferring everything to yet another contractor—which meant cleaning up thousands of records to play nice with the new vendor’s fancy online inventory. It was quite the chore—a chore many of us face when trying to organize data.

The good news is, if you can get your messy data into a spreadsheet, you can clean up and reformat it. My favourite tool for this is called OpenRefine, and its specialty is “reconciling” or “normalizing”—making it easy to find typos, variations on phrases, formatting errors, extra spaces, and other things that are hard to spot in rows upon rows of information.

What is OpenRefine?

OpenRefine bills itself, simply, as “a powerful tool for working with messy data.” Originally released 2010 as “Freebase Gridworks,” it was later called “Google Refine” after being acquired by the search giant. Today it's a community-run, open-source project to, well, refine your data.

To you, this could mean a number of things. You sales team could want to export old store data, reorganize it, and import it into a new eCommerce app. Your accounting staff might have legacy data floating around from years ago. Your PR staff could have multiple email lists from campaigns past you want to merge, modify, or de-duplicate.

Maybe your survey results are messy, your app exports are confusing, or your analytics data needs combined from multiple sources.

OpenRefine was built especially with those types of bulk operations in mind. It may just be what you need to finally finish that data project you’ve been putting off.

Getting Started With OpenRefine

Getting started is easy. Just download OpenRefine—it works on Windows, Mac, and Linux—and start the program. It’ll open up a browser tab that looks much like other Google Apps, and will ask you to create a project, or open a project you’ve already started.

You'll need some data for OpenRefine to work with—and it open any data in a spreadsheet format: CSV, XLS, or even a Google Sheets spreadsheet online. It can also take XML and JSON files, if that’s your jam.

OpenRefine Import Data
OpenRefine can directly import your spreadsheet files from the web

Let's start a new project. This exercise is going to use a set of publicly available data from the Government of Ontario—which, like much public data, is a bit messy. Let’s go with a subject near and dear to my heart: Beer. Copy the link to the XLSX file, which includes details about Ontario microbrewers and brands. Switch to your OpenRefine tab, start a new project, select the Web Address option, and paste in your spreadsheet link.

As soon as you input a dataset, OpenRefine generates a preview to ensure it’s displayed properly. You can do some preliminary cleanup—remove empty rows, set the first row as a header with column names, or convert columns into specific data types (dates, integers, and so on).

Click “Create Project” when you’ve made sure the data is displaying correctly, and you’ll be brought to the screen where all the magic happens.

Data imported into OpenRefine

The first thing you’ll notice is that OpenRefine doesn’t display your data like a spreadsheet with a long list of rows. Instead, it shows a maximum of 50 rows at a time, essentially just enough of a preview for you to think about what you’re working with. You can page through your data if you need to, but I think you’ll soon get comfortable with being less overwhelmed.

Clean Up Data with OpenRefine Facets

The first step is to learn about facets. These show precisely which values are used in a column, so you can find typos or variations in things that are supposed to be identical. Let’s start with the manufacturer’s name. Click the dropdown button next to the header, select Facet, then Text Facet. You’ll be presented with a column like this, showing a count of the times each item appears in the dataset:

OpenRefine Facet

We can see, for example, that Big Rig Brewery has 13 different beers; Big Rock Brewery, 6 different beers. We can already see some messy data here—“Black Swan Brewing Company” and “BLACK SWAN BREWING COMPANY INC.” are the same company, but with slightly different names in this spreadsheet.

Change name in OpenRefine

To fix this, hover your mouse over the name you want to change, click “edit,” and type in the new name. Click Apply and it automatically edits all the matching entries in the dataset.

Let’s speed up the process by automatically identifying all of the facets that are similar and merging them—without any typing—by clustering the data. Click the Cluster button at the top of the facet display, and you’ll see all of the similar entries identified by OpenRefine:

Cluster in OpenRefine

For some of these, it’s just an extra space (as at the end of “Square Timber Brewing Company”) or an extra comma (as in Blood Brothers Brewing), or liberal use of capslock. As you can see in the “Bevin Palmateer” entry, OpenRefine also identifies words that are out of order.

Check the Merge boxes for anything you want to fix. If you don’t like the suggested new value—for example, the capitalized name suggested for NITA BEER—you can just click the lowercase option and it will change that field. If you don’t like any of the options, just type in your preferred name.

Click Merge Selected & Re-Cluster to do another check. When the check finds no results, try another clustering method to look for more (you should find “Walkervile” and “Walkerville”).

It's data-mining, but you don’t have to learn advanced data-mining theory to get results: Just click through all the options. You’ll start to see false positives (for example, “Bell City” isn’t “River City”), which you can just ignore.

Common transforms in OpenRefine

There are also some common transform tools you can use to clean stuff up, like eliminating all the spaces before and after text. Let’s also get rid of all the uppercase brewery names by transforming the whole column to Titlecase. Click again on the dropdown menu for the column, go to Edit cells, and read through all the possibilities.

Categorize Data Automatically in OpenRefine

The next step is to do clever things with all this data. Let’s pretend these beers are our product data, and we want to add categories of beer to our catalogue. We don’t want to manually label each entry, so let’s save some time by identifying beer types from the beers’ names.

We can do a quick check for one type of beer using a Custom Text Facet. We’ll look for all cell values that contain “Porter” (this is also case-sensitive, but now that we’ve put everything in titlecase, the capital P should catch everything). A Custom Text Facet on the Manufacturer’s Brand column brings up this window, into which we enter a filter:

value.contains("Porter")

Use Custom Text Facets in OpenRefine

This function returns true and false—and true here means 25 beers are porters in the list. (There are also 79 breweries without any actual beers available—the (blank) category—but let’s ignore that for now.)

These filters are great when you want to manipulate a subset of your spreadsheet without having to delete the rest, or keep your focus rows selected. You can apply a filter, do a bunch of operations, and then remove it later. OpenRefine even includes some common recipes to format data, such as standarizing date formats or transforming “Firstname Lastname” into “Lastname, Firstname.”

Let's use that to transform our data into something useful. We'll add a new column based on the “Manufacturer’s Brand” column, using text analysis to guess what type of beer it is. It won’t work on all entries, but for beers that have “IPA”, “lager”, “stout”, “lime”, “red”, “wheat”, and so on right in their name, we’ll have some success.

if statements in OpenRefine

Start by clicking on “Manufacturer’s Brand.” Select Edit Column then choose Create column based on this column. To look for “lager” and replace the entirety of the Beer types value with “lager” where applicable, we use an if statement:

if(value.contains(“Lager”),”lager”,value)

If statements here are straightforward: If the first part is true, transform the whole value to “lager;” otherwise, replace the cell value with itself (or, do nothing).

If we want to categorize a big set of beer types at once, we nest a series of if statements inside each other. It looks a bit silly, but gets the job done:

if(value.contains("Lager"),"Lager",if(value.contains("IPA"),"IPA",if(value.contains("Wheat"),"Wheat",if(value.contains("Pilsner"),"Pilsner",if(value.contains("Brown"),"Brown",if(value.contains("Kolsch"),"Kolsch",if(value.contains("Light"),"Light",if(value.contains("Red"),"Red",if(value.contains("English"),"English",if(value.contains("Stout"),"Stout",if(value.contains("Porter"),"Porter",value)))))))))))

Essentially, if “Lager” wasn’t found, then try “IPA,” then try “Wheat,” then try “Pilsner,” etc., etc. It's not standard programming syntax, but gets the job done.

Apply that transformation, then check the facets of the column to see our progress.

OpenRefine automated facet categories

While we’re at it, let’s clean the results up. Reconcile “I.P.A.” and “India Pale Ale” to “IPA” with the steps you learned above. Also keep in mind that the operations work in order: You’ll want to convert “India Pale Ale” before you reformat “Pale Ale.” Because these transformations are also case-sensitive, transforming to lowercase “India pale ale” would also protect your work when you search for “Pale Ale” later on.

With a bit of categorizing, we can start to see the spread of beer types in Ontario. (Try them all today!) This is definitely faster than labelling them all by hand, and it should give you an idea how to make OpenRefine filters work for you.

If this was a product list for our online store, we’d want to export our cleaned-up and value-added spreadsheet from OpenRefine and import it into our eCommerce store. The Export button's your friend. You can export your data as a spreadsheet with a range of options and data forms. You can also upload the data directly to a new Google Sheets spreadsheet or Google Fusion table.


Do More with OpenRefine

There are a few other useful OpenRefine tools. The Undo/Redo option gives you detailed information about all your activities instead of just undoing your mistakes—which is super helpful in learning how to get more out of OpenRefine. Also remember: OpenRefine is designed around databases so you can use its records and rows seperately to organize your data.

Now it's your turn to try it out. Have messy data from an app export, or an old spreadsheet full of confusing data?

One great way is to use OpenRefine to organize your contacts: Find typos and formatting errors in email addresses, phone numbers, or company names before importing the data into a new app. I’ve used it to reformat old Mailchimp data when we changed the designs of our signup forms—super handy.

Don't spend hours formatting your data again. OpenRefine can do it for you in minutes.

We'd love to hear how you use OpenRefine to clean up data in the comments below!

Continue Reading

Header photo via Pexels.



source https://zapier.com/blog/openrefine-guide/

No comments:

Post a Comment