LibGuides: Digital Tools for Research: Data Mining & Discovery

Data Mining & Discovery

Text Filter

The text filter option works like the “Find” function in Excel, allowing you to search a column for values containing a specific string.

To display the text filter function:

Go to the column you would like to search and click the arrow button on the column header.
Choose “Text filter.”
A window with a search box will appear in the pane to the left side of the grid view.

In addition to the basic text filter function, OpenRefine 3.3 includes an "invert" function with will return all rows/records that DO NOT include the term in the text filter box. To invert a text filter:

Follow steps 1-3 above.
Enter your term in the text box and select "invert" in the top right corner of the search box.
The top bar of the search box will turn orange and the results will change to display rows/records without the specified search term

Tips

To narrow your search further, check the “case sensitive” box below the search box to return only results that exactly match your search term.
You can combine a text filter with a facet to narrow results. Either create a facet and then use the text filter to search within a desired column, or narrow results with a text filter and then facet a desired column.

Adapted from OpenRefine LibGuide (2023). University of Illinois Urbana-Champaign.

Text Facet

Faceting allows you to quickly view unique values in a column, make edits to those values, and narrow your display to show results containing a specific facet.

Go to the column you would like to analyse and click the arrow button on the column header.
Choose “Facet” from the drop-down menu, and then select “Text facet.”
A facet window will appear in the pane to the left side of the grid view.

Tips

Text facets will automatically display in alpha-numeric order, but you can sort them by count instead. Just select "Count" at the top of the facet window where it says "Sort by."
If you click on a facet, the grid view will only display rows or records that match this facet. Any number of facets and/or filters can be combined to narrow your data. To combine facets, hover over the facet you would like to add and select “include” at the right.
You can edit values using the facet window by hovering over the facet you want to edit and selecting “edit” at the right. Any edits made will be automatically applied to all rows with that value.
You can export a list of facets and their counts by selecting "[#] choices" at the top of the facet window. Simply copy the text in the pop-up window and paste into Excel or a text editor.

Custom Text Facets

There are two types of custom facets that are applicable to text specifically: word and text length.

Go to the column you would like to facet and click the arrow button on the column header.
Choose “Facet” and then select “Customized facets.”
Under “Customized facets,” select either “Word facet” or “Text length facet” and a window will appear in the pane to the left side of the grid view

Tips

The “Word facet” option functions like the “Text facet” but with the individual words within the column values.
The “Text length facet” allows you to narrow results based on the length of a value using a sliding scale.
Both of these can be helpful in identifying missing or misspelled values.

Facet by Blank

The facet by blank function allows you to narrow your data based on whether or not the value in a particular column is blank.

Go to the column you would like to facet and click the arrow button on the column header.
Choose “Facet” and then select “Customized facets.”
Under “Customized facets,” select “Facet by blank (null or empty string)” and a window will appear in the pane to the left side of the grid view.

Tips

The “Facet by blank” function returns true or false values where “true” means the value is blank and “false” means the value is not blank.
This function is useful for identifying potentially missing sections of data.

Duplicate Facet

The duplicates facet allows you to narrow your data based on whether or not the value in a particular column is unique.

Go to the column you would like to facet and click the arrow button on the column header.
Choose “Facet” and then select “Customized facets.”
Under “Customized facets,” select “Duplicates facet” and a window will appear in the pane to the left side of the grid view.

Tips

The “Duplicates facet” function returns true or false values where “true” means the value is not unique and “false” means the value is unique.
This function is useful for isolating data with unique values.

Adapted from OpenRefine LibGuide (2023). University of Illinois Urbana-Champaign.

Changing Cell Format

When you import a project into OpenRefine, the cells will automatically be given a format: text, number, or date. To change this format:

Go to the column you would like to change the format of and click the arrow button on the column header.
Choose “Edit cells” and then select “Common transformations.”
Under “Common transformation,” choose the desired format.

NB! Columns with values in green are either in date or number format. This makes it easier to identify what types of facets and filters you can use.

Numeric Facet

The Numeric Facet allows you to sort columns with numeric values and to use a sliding scale to adjust the range of number values displayed in the grid view. To display this facet:

Go to the column you would like to facet and click the arrow button on the column header.
Choose “Facet” and then select “Numeric facet.”
A window with a sliding scale will appear in the pane to the left side of the grid view.

Tips

The “Numeric Facet” function is useful for narrowing data to exclude outlying numeric values.
This function can also be used to isolate high or low values.

Timeline Facet

The Timeline Facet allows you to sort columns with date values and to use a sliding scale to adjust the range of date values to displayed in the grid view. To display this facet:

Go to the column you would like to facet and click the arrow button on the column header.
If the data is not read as "dates", it needs to covert this column to dates.
Choose “Facet” and then select “Timeline facet.”
A window with a sliding scale will appear in the pane to the left side of the grid view.

Tips

The “Timeline facet” function is useful for narrowing data to exclude outlying date values.
This function can also be used to isolate specific dates, times, or date ranges.

Adapted from OpenRefine LibGuide (2023). University of Illinois Urbana-Champaign.

What is Clustering?

One of OpenRefine’s most powerful features is the “Clustering” function. With the support of several types of key collision and nearest neighbor algorithms, the Clustering function can help you to identify inconsistencies in your data from misspellings, to non-standardized value formatting, or input error.

Clustering works by using what is called “fuzzy matching” on the values within a chosen column using the algorithm of your choice to determine if possible cell values “look similar” enough to be possible matches. The algorithms supported by OpenRefine are of two types:

Key collision
Nearest neighbor

For more information on the specific types of algorithms you can choose from, see the OpenRefine documentation on Clustering In Depth.

How to Cluster

There are two ways to open the clustering window:
1. On the column of your choice, perform a “Text facet.” At the top of the facet window, select the “Cluster” option. OR
2. Go to the column you would like to cluster and click the arrow button on the column header, then select the “Edit cells” option and choose “Cluster and edit.”
In the Clustering window, you will see several options:
1. At the top of the window is where you can choose the type of algorithm to run.
2. In the center of the window is a list of the suggested clusters, the current values, and suggested new value.
3. On the right are several sliding scales which can be used to narrow the criteria for clustering: by number of choices in the cluster, number of total rows in the cluster, average length of the values in a cluster, or the variation in length of the values in a cluster.

Choose the algorithm which best suits your needs and then consider the suggested clusters.
1. For clusters you would like to keep, select the “Merge?” box and confirm that the text box in the “New Cell Value” column is consistent with the value you want to change the clustered values to.
2. Repeat for each suggested cluster.

When you are ready, select either “Merge Selected & Re-Cluster” or “Merge Selected & Close.”
1. “Merge Selected & Re-Cluster” edits the selected values and then automatically re-runs the clustering algorithm on the same column.
2. “Merge Selected & Close” edits the selected values and then closes out of the Clustering window.

Tips

It can be helpful to have a subject specialist assist in this part of the data cleaning to account for possible errors. For example:

A data set includes a “Location” column which has the values “Savoy Hotel” and “Hotel Savoy.” A clustering algorithm might suggest merging these two values, but a subject specialist would be able to identify that these values actually refer to two different establishments, Hotel Savoy in New York and Savoy Hotel in London.

Adapted from OpenRefine LibGuide (2023). University of Illinois Urbana-Champaign.

Digital Tools for Research

Data Mining & Discovery

Text Filter

Text Facet

Custom Text Facets

Facet by Blank

Duplicate Facet

Changing Cell Format

Numeric Facet

Timeline Facet

What is Clustering?

How to Cluster

Library

CONTACT

Connect