The text filter option works like the “Find” function in Excel, allowing you to search a column for values containing a specific string.
To display the text filter function:
In addition to the basic text filter function, OpenRefine 3.3 includes an "invert" function, which returns all rows/records that do NOT include the term in the text filter box. To invert a text filter:
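Outside OpenRefine, the behavior of a text filter and its inverted form can be illustrated with a short Python sketch. This is not OpenRefine's implementation: the function name `text_filter`, the sample rows, and the choice of case-insensitive matching are assumptions made for illustration.

```python
def text_filter(rows, column, term, invert=False):
    """Keep rows whose value in `column` contains `term`.

    Matching is case-insensitive here (an assumption of this sketch).
    With invert=True, keep the rows that do NOT contain the term.
    """
    needle = term.lower()
    kept = []
    for row in rows:
        value = str(row.get(column) or "")
        matches = needle in value.lower()
        # A row is kept when it matches (normal) or fails to match (inverted).
        if matches != invert:
            kept.append(row)
    return kept


# Hypothetical sample data.
rows = [
    {"title": "Annual Report 2020"},
    {"title": "Budget summary"},
    {"title": "report on staffing"},
]

matching = text_filter(rows, "title", "report")
inverted = text_filter(rows, "title", "report", invert=True)
```

Every row lands in exactly one of the two results, so the inverted filter is a quick way to check what a filter leaves out.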
Adapted from OpenRefine LibGuide (2023). University of Illinois Urbana-Champaign.
Faceting allows you to quickly view unique values in a column, make edits to those values, and narrow your display to show results containing a specific facet.
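As a rough illustration of what a text facet computes, the Python sketch below counts the unique values in a column, narrows the rows to one of them, and edits that value. The names `text_facet` and `city` and the sample data are hypothetical, and the sketch does not reproduce OpenRefine's own code.

```python
from collections import Counter


def text_facet(rows, column):
    """Return (value, count) pairs for a column, most frequent first."""
    counts = Counter(row.get(column) for row in rows)
    return counts.most_common()


# Hypothetical sample data with one inconsistent value.
rows = [{"city": c} for c in
        ["Galway", "Dublin", "Galway", "galway", "Dublin", "Galway"]]

facet = text_facet(rows, "city")

# Narrow the display to a single facet value.
selected = [row for row in rows if row["city"] == "galway"]

# Edit that value everywhere it occurs, then facet again.
for row in selected:
    row["city"] = "Galway"
facet_after_edit = text_facet(rows, "city")
```

Listing the counts first makes the odd value ("galway", seen once) stand out before any edit is made.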
There are two types of custom facets that apply specifically to text: word and text length.
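The two text facets can be sketched in Python as follows. The sketch assumes that a word facet splits each value on single spaces and that a text length facet counts characters; the function names and sample values are hypothetical.

```python
from collections import Counter


def word_facet(values):
    """Count how often each space-separated word appears across all values."""
    counts = Counter()
    for value in values:
        counts.update(value.split(" "))
    return counts


def text_length_facet(values):
    """Count how many values have each character length."""
    return Counter(len(value) for value in values)


# Hypothetical sample data.
values = ["open data", "open source", "data"]

words = word_facet(values)
lengths = text_length_facet(values)
```

A word facet groups cells by the words they share, while a text length facet is useful for spotting values that are unusually short or long.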
The facet by blank function allows you to narrow your data based on whether or not the value in a particular column is blank.
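A minimal Python sketch of the same idea splits rows into a blank group and a non-blank group. It assumes that a missing value and an empty string count as blank, while a cell holding only spaces does not; the names and sample rows are hypothetical.

```python
def is_blank(value):
    """Treat None and the empty string as blank (assumption of this sketch)."""
    return value is None or value == ""


def facet_by_blank(rows, column):
    """Group rows under True (blank) or False (not blank) for one column."""
    groups = {True: [], False: []}
    for row in rows:
        groups[is_blank(row.get(column))].append(row)
    return groups


# Hypothetical sample data.
rows = [
    {"id": 1, "email": "a@example.org"},
    {"id": 2, "email": ""},
    {"id": 3, "email": None},
    {"id": 4},
]

groups = facet_by_blank(rows, "email")
blank_ids = [row["id"] for row in groups[True]]
filled_ids = [row["id"] for row in groups[False]]
```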
The duplicates facet allows you to narrow your data based on whether or not the value in a particular column is unique.
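The logic behind a duplicates facet can be sketched in Python by counting each value and flagging the rows whose value occurs more than once. The function name, column name, and sample rows are hypothetical, and the comparison is an exact match (so "A1" and "a1" would not count as duplicates here).

```python
from collections import Counter


def duplicates_facet(rows, column):
    """Group rows under True (value is repeated) or False (value is unique)."""
    counts = Counter(row.get(column) for row in rows)
    groups = {True: [], False: []}
    for row in rows:
        groups[counts[row.get(column)] > 1].append(row)
    return groups


# Hypothetical sample data.
rows = [
    {"row": 1, "code": "A1"},
    {"row": 2, "code": "B2"},
    {"row": 3, "code": "A1"},
    {"row": 4, "code": "C3"},
]

groups = duplicates_facet(rows, "code")
duplicate_rows = [row["row"] for row in groups[True]]
unique_rows = [row["row"] for row in groups[False]]
```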
When you import a project into OpenRefine, the cells will automatically be given a format: text, number, or date. To change this format:
NB! Columns with values in green are in either date or number format. This makes it easier to identify what types of facets and filters you can use.
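Why the cell format matters can be illustrated in Python: a value stored as text cannot be compared as a number or a date until it is converted. The helpers `to_number` and `to_date` below are hypothetical and only accept plain numerals and ISO dates (YYYY-MM-DD); they are a sketch of the idea, not of OpenRefine's conversion rules.

```python
from datetime import date


def to_number(text):
    """Convert text to int or float; return the text unchanged if it is not numeric."""
    try:
        return int(text)
    except ValueError:
        try:
            return float(text)
        except ValueError:
            return text


def to_date(text):
    """Convert an ISO date string to a date; return the text unchanged otherwise."""
    try:
        return date.fromisoformat(text)
    except ValueError:
        return text


# Hypothetical cell values, all imported as text.
numbers = [to_number(cell) for cell in ["42", "3.5", "n/a"]]
dates = [to_date(cell) for cell in ["2023-05-01", "May 2023"]]
```

Values that cannot be converted stay as text, which mirrors the way a column can end up holding a mix of formats.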
The Numeric Facet allows you to sort columns with numeric values and to use a sliding scale to adjust the range of number values displayed in the grid view. To display this facet:
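As an illustration of what the sliding scale does, the Python sketch below keeps only the rows whose number falls inside a chosen range and then sorts them. The function name, the sample rows, and the use of inclusive bounds at both ends are assumptions of this sketch.

```python
def numeric_facet(rows, column, low, high):
    """Keep rows whose numeric value in `column` lies between low and high."""
    selected = []
    for row in rows:
        value = row.get(column)
        # Text cells such as "n/a" are skipped: only real numbers are compared.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and low <= value <= high:
            selected.append(row)
    return selected


# Hypothetical sample data; the last row holds text, not a number.
rows = [
    {"item": "pen", "price": 1.5},
    {"item": "book", "price": 12},
    {"item": "lamp", "price": 30},
    {"item": "desk", "price": "n/a"},
]

in_range = numeric_facet(rows, "price", 1, 15)
by_price = sorted(in_range, key=lambda row: row["price"], reverse=True)
```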
The Timeline Facet allows you to sort columns with date values and to use a sliding scale to adjust the range of date values displayed in the grid view. To display this facet:
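The same range idea applies to dates. The Python sketch below keeps the rows whose date falls between a start and an end date; the names, the sample rows, and the inclusive bounds are assumptions made for illustration.

```python
from datetime import date


def timeline_facet(rows, column, start, end):
    """Keep rows whose date value in `column` lies between start and end."""
    return [
        row for row in rows
        # Cells that are still text (not dates) are left out of the range.
        if isinstance(row.get(column), date) and start <= row[column] <= end
    ]


# Hypothetical sample data; the last row holds text, not a date.
rows = [
    {"event": "launch", "when": date(2021, 3, 15)},
    {"event": "review", "when": date(2022, 7, 1)},
    {"event": "close", "when": date(2023, 11, 30)},
    {"event": "draft", "when": "unknown"},
]

selected = timeline_facet(rows, "when", date(2022, 1, 1), date(2023, 12, 31))
```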
One of OpenRefine’s most powerful features is the “Clustering” function. Supported by several key collision and nearest neighbor algorithms, it can help you identify inconsistencies in your data, such as misspellings, non-standardized value formatting, and input errors.
Clustering applies “fuzzy matching” to the values in a chosen column: the algorithm you select decides whether different cell values “look similar” enough to be possible matches. The algorithms supported by OpenRefine are of two types: key collision and nearest neighbor.
For more information on the specific types of algorithms you can choose from, see the OpenRefine documentation on Clustering In Depth.
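To give a feel for the nearest neighbor approach, the Python sketch below measures the edit (Levenshtein) distance between values and reports the pairs that fall within a small radius. It is a simplified illustration: it compares every pair of unique values, and the function names, the radius of 1, and the sample values are assumptions rather than OpenRefine's own settings.

```python
from itertools import combinations


def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def near_pairs(values, radius=1):
    """Return pairs of distinct values whose edit distance is within radius."""
    unique = sorted(set(values))
    return [(a, b) for a, b in combinations(unique, 2)
            if levenshtein(a, b) <= radius]


# Hypothetical sample data with one misspelling.
values = ["Galway", "Gallway", "Dublin", "Dublin", "Cork"]
pairs = near_pairs(values)
```

A larger radius catches more misspellings but also suggests more false matches, which is why the suggested clusters still need to be reviewed.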
Tips
It can be helpful to have a subject specialist assist in this part of the data cleaning to account for possible errors. For example:
A data set includes a “Location” column which has the values “Savoy Hotel” and “Hotel Savoy.” A clustering algorithm might suggest merging these two values, but a subject specialist would be able to identify that these values actually refer to two different establishments, Hotel Savoy in New York and Savoy Hotel in London.
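The example above can be reproduced with a simplified key collision sketch in Python. It builds a "fingerprint" key for each value by trimming, lowercasing, removing punctuation, and sorting the unique words, then groups values that share a key. The steps are a reduced version of the fingerprint idea (character normalization is left out), and the function names and sample values are hypothetical.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Build a key that ignores case, punctuation, and word order."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def key_collision_clusters(values):
    """Group values that share a fingerprint; keep groups with 2+ members."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(members) for members in groups.values() if len(members) > 1]


# Hypothetical sample data based on the example above.
values = ["Savoy Hotel", "Hotel Savoy", "hotel savoy.", "Ritz Hotel"]
clusters = key_collision_clusters(values)
```

Both Savoy values receive the key "hotel savoy", so the algorithm proposes them as one cluster. Nothing in the key records that they are different establishments, which is why the decision to merge is best left to someone who knows the data.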