Skip to Main Content

Digital Tools for Research

This guide provides information about digital tools that can be useful for research data management and analysis.

Data Preparation & Normalisation

Common Transformations

OpenRefine features several functions for performing commonly needed transformations to your data. Many of these transformations are intended for data cleaning purposes and can also be done by using regular expressions. All these transformations are logged in your tracked history.

  1. Go to the column you would like to make edits to and click the arrow button on the column header.
  2. Select the “Edit cells” and then “Common transforms” options.
  3. Depending upon your needs, select the relevant transformation from the options listed.

For quick reference, consult the list of transformations and their function below.

Transformation Function
Trim leading and trailing whitespace Removes extra spaces before and after the value in each cell in the selected column.
Collapse consecutive whitespace Removes extra spaces between individual strings within a cell value in the selected column.
Unescape HTML entities Removes HTML strings indicating particular characters (e.g., “&” vs “&”).
Replace Smart quotes with ascii Replaces Smart quotes to their ASCII (e.g. ' and ")
To titlecase Changes the first letter of all words within a cell value to an uppercase letter.
To uppercase Changes all letters of all words within a cell value to uppercase letters.
To lowercase Changes all letters of all words within a cell value to lowercase letters.
To number Changes the cell format to number.
To date Changes the cell format to date.
To text Changes the cell format to text.
To null Changes the cell format to null.
To empty string Changes the cell format to empty string.

Adapted from OpenRefine LibGuide (2023). University of Illinois Urbana-Champaign.

Removing Duplicates

Your data may include duplicate information which is unnecessary to keep as part of the data set.

  1. Run the sort function on the column which contains duplicates. For instructions on sorting, see the Sorting Data section.
  2. After you have sorted the column, choose "Sort" and then "reorder rows permanently." 
  3. Go to the column with duplicates and click on the arrow button in the column header.
  4. Choose “Edit cells” and then select “Blank down.” “Blank down” will detect if two rows following each other have the same content. If they do, the second row will be “blanked out” and the cell values removed.
  5. After you have used the “Blank down” function, use the “Facet by blank” to identify rows with blank cell values for that column.
  6. From the facet window, select the “true” option.
  7. Go to the column labeled “All” and click on the arrow button, then select “Edit rows” and choose “Remove all matching rows.”
  8. All rows with the identified duplicates will be removed. To restore the full data view, simply reset the facets

Tips

  • Make sure that you are in ROW mode for best results.
  • It is important to make sure that all cells in the column you are applying this transformation to have values. If there are cells that were originally blank before applying the “Blank down” function, they will be deleted along with the duplicate rows.
  • It is important to be aware that the entire row of values will be deleted along with the duplicate value in a given column. Be careful and check before deleting rows or applying the "Blank down" function to make sure that data meant to be kept is not accidentally deleted.

Adapted from OpenRefine LibGuide (2023). University of Illinois Urbana-Champaign.

Splitting Cell Values

Sometimes you will have columns with lots of pieces of data all bunched together. There are several ways OpenRefine allows you to separate these pieces of data into more useful divisions, either into new columns or into a multi-row record.

Creating a New Row Based on a Separator

  1. Go to the column you would like to split and click the arrow button on the column header.
  2. Select the “Edit cells” option and choose “Split multi-valued cells.”
  3. In the pop-up window, type in the separator (often a semi-colon, comma, or some other special character), and select “OK.”
  4. Cells with the values indicated as separators will be split into different rows based on those separators.

NB! The new rows will still be part of the same record. For more information on rows and records see the Layout section

The pop-up window allows for several options and control when splitting cells. The options are:

  1. “By separator” allows you to input the separator value (often a semi-colon, comma, or other special character).
  2. “By field lengths” allows you to input a string of integers to indicate where in the original column value you would like OpenRefine to split the values.

Example

A column contains dates in the YYYY/MM/DD format. You wish to divide this into three columns (Year, Month, Day). Under “by field lengths,” use the list “4, 1, 2, 1, 2” indicating “year, slash, month, slash, day.” The result will be five columns, two with only slashes which you can remove, and three with the Year, Month, and Day respectively.

Tips

  • This method of splitting cell values is most useful for data you wish to remain as one column, such as multiple authors, for future faceting or transformation purposes.
  • There are many characters and strings that can be used as "separators," including: spaces, semi-colons, pipes, slashes, commas, dashes, and full words.
  • It is important to reset all facets on the column you wish to work with before beginning this process.

Creating a New Column Based on a Separator or Field Length

  1. Go to the column you would like to split and click the arrow button on the column header.
  2. Select the “Edit column” option and choose “Split into several columns.”
  3. In the pop-up window, select the method by which you would like to split the column:
    1. “By separator” allows you to input the separator value (often a semi-colon, comma, or other special character) as well as to limit the number of additional columns are created.
    2. “By field lengths” allows you to input a string of integers to indicate where in the original column value you would like OpenRefine to split the values.
  4. When you have selected and identified all your criteria, select “OK.”

Example

A column contains dates in the YYYY/MM/DD format. You wish to divide this into three columns (Year, Month, Day). Under “by field lengths,” use the list “4, 1, 2, 1, 2” indicating “year, slash, month, slash, day.” The result will be five columns, two with only slashes which you can remove, and three with the Year, Month, and Day respectively.

Tips

  • The new columns will automatically be named after the column they were created from. You can change the column titles by going to the column header, clicking the arrow button, selecting “Edit column,” and then choosing “Rename this column.”
  • This transformation is useful for dividing out data you wish to analyze is separate sections. This may include address information, dates, and names.

Adapted from OpenRefine LibGuide (2023). University of Illinois Urbana-Champaign.

Combining Cell Values

OpenRefine allows to combine information from multiple columns. This method allows you to combine the contents of two columns, add a specific string or characters to a column's values, or a combination of both. There are several ways to combine contents from multiple columns.

Joining Columns

  1. Go to the one of columns you would like to join.
  2. Select "Edit column" and click "Join columns."
  3. Select columns you would like to join and click "OK."
    1. You can add separator between the contents of each column.
    2. You can replace null values.
    3. You can overwrite combining information into the original column or create a new column for the combining contents.

The new column will be placed after the column you chose. You can reorder the columns after.

Concatenation

NB! This function requires the use of regular expressions. The relevant expressions are provided here, but for more information, please refer to the section on Google Refine Expression Language (GREL).

  1. Go to the column you would like alter and click the arrow button on the column header.

  2. Select the “Edit columns” and then choose “Add Column based on this column.” OR

  3. Select the "Edit cells" and then chose "Transform."

  • If selecting 2, you can create a new column for merging data. See an example of the first picture below.

  • If selecting 3, you can overwrite merging data in the column you chose. See an example of the second picture below. 

  1. A pop-up window will appear with several sections, including a GREL expression input box and a preview panel. For more information on this window, refer to the Google Refine Expression Language (GREL) section.

  2. The expression for combining column information is:

value + cells[‘Column’].value OR

cells[‘Column1’].value + cells[‘Colmn2’].value

  • "Value" indicates the values in the current column.

  • "Column" should match the name of the column whose values you would like to combine with the current column.

  • If a data type in a column is number, the result will return to sum of two values in each row. To avoid this, add string as following step 5.

  1. To combine column information AND additional strings, use a “+.” For example:

value + “-” + cells[‘Column’].value OR

cells[‘Column1’].value + “-“ + cells[‘Colmn2’].value

  •  Adds a dash between the values of the current column and the values of the column to be combined with the current column.

  1. Once you have typed your expression into the expression box, check the preview window to confirm that your transformation looks the way it should.There is also a section to the right of the expression box which will tell you if there are any syntax errors in your expression.

  2. When you are satisfied with the preview, select “OK.”

NB! When inputting the name of the column you wish to combine with the current column, remember that regular expressions are CASE SENSITIVE. Nothing will happen if the column name is not exact.

Avoiding NULL values

If some sells contain null values, null cells will return when merging columns using these techniques. To avoid null values, follow a few simple steps:

  1. Go to "Text facet" and select "Customized facets."
  2. Chose "Facet by null" from the drop-down menu.
  3. A facet window will appear in the pane to the left side of the grid view. "True" means null cells.
  4. Go to "Edit cells" where you faceted and select "Common transforms." For more details, see "Common Transformations."
  5. Chose "To empty string" from the drop-down menu. 


Adapted from OpenRefine LibGuide (2023). University of Illinois Urbana-Champaign.

Transposing Data

OpenRefine includes functions which allow you to reshape your data, transposing information between rows and columns to visualize the information in a different arrangement.

Rows into Columns

  1. Go to the column with the data you would like to separate into different columns.
  2. Click the arrow button in the column header and select “Transpose.”
  3. Select “Transpose cells in rows into columns
  4. In the pop-up window, enter the number of rows you would like to transpose and click OK.

NB! If you enter “2,” OpenRefine will create two columns, pulling cell values from rows in sets of two. For example: (Row 1, Column A) and (Row 2, Column A) will become (Row 1, Column A) and (Row 1, Column B) respectively.

Tips

  • This function works best when you have multiple values in a single cell which have been split following the instructions for Creating a New Row Based on a Separator.
  • This function can be useful for visualizing and separating out compound pieces of information, similarly to Creating a New Column Based a Separator or Field Length.
  • All transformations of rows into columns are tracked in the Undo/Redo tab, so you can experiment with transposing rows into columns until you are comfortable with the result.

Columns into Rows

  1. Go to the column you would like to begin transposing into rows.
  2. Click the arrow button in the column header and select “Transpose.”
  3. Select “Transpose cells across columns into rows.”
  4. In the pop-up window: 
    1. Select the first column you would like affected by this transformation from  the “From Column” list.
    2. Select the last column you would like affected by this transformation form the "To Column" list.
    3. Pick to transpose into either “One column” or “Two new columns.”
      • “One column” will replace the current column, merging the current column title and the current column values into a single value using a separator of your choosing.
      • “Two new columns” will replace the current column with two new columns:
        • One using the column names of the current column and all columns to the right in the grid view
        • One using the current column values
    4. Select “Fill down in other columns” to automatically fill in the associated data in the new rows in all the columns appearing to the left of the current column in the grid view.
    5. Click “Transpose.”

Tips

  • If there are specific columns you would like to transpose into rows, consider moving them next to each other by following the instructions for Reordering Columns.
  • Since transposing columns to rows does not preserve record associations (see Records and Rows), make sure you have a unique identifier column not being included in the transformation. This way, when you select “Fill down in other columns.” the unique identifier can be used to associate related data.
  • All transformations of columns into rows are tracked in the Undo/Redo tab, so you can experiment with transposing columns into rows until you are comfortable with the result.

Adapted from OpenRefine LibGuide (2023). University of Illinois Urbana-Champaign.