Digital Tools for Research

This guide provides information about digital tools that can be useful for research data management and analysis.

Voyant Tools

Voyant Tools is a free, open-source application that combines many basic tools for text analysis: frequency lists, keyword extraction, topic modelling, collocation detection, etc. Its accessibility and flexibility make Voyant one of the most popular tools for Digital Humanities research, and you can find dozens of examples, such as researching trends in science fiction literature.

Upload Screen

Voyant Tools offers a web version at https://voyant-tools.org/ and a desktop version that can be downloaded from the developers' GitHub. To start exploring, either Open one of the sample corpora (Shakespeare's plays, Jane Austen's novels, or Mary Shelley's Frankenstein) or Upload your own corpus on the tool's home page, and then click Reveal.

The following settings are located in the top right corner of the upload box:

  1. Interface language. At the moment, the Voyant Tools interface is available in Arabic, Bosnian, Croatian, Czech, English, French, German, Hebrew, Italian, Japanese, Portuguese, Russian, Serbian, and Spanish.
  2. Options. Here, you can specify your document format if you are not happy with auto-detect. In the Text tab, you can configure which part of the text document should be analysed, i.e. where this passage begins and ends. In the Processing tab, you can configure text language, text encoding and tokenisation method.
  3. Help. The question mark takes you to the relevant section of the documentation. You can find one in the top right corner of every tool.

Analysis Screen

All the main tools — word clouds, trends, KWIC, document statistics — are located on the main screen, divided into panels.

Panel Menu

Each panel can be customised and changed. Many more tools are available than are visible at first. To export, change, or learn more about a panel, hover your mouse over the grey line at the top and the menu items will appear.

  1. Export options for the current view
  2. Change the panel to a different analysis tool
  3. Change options (not always available)
  4. Hover over the question mark and a brief description of the current tool will appear.

Search

The search window in every panel supports the same search query syntax.

Frequency Lists

Compiling word frequency lists is one of the main features of Voyant Tools. For each lexeme, or term, Voyant Tools calculates its absolute and relative frequency in each document and in the entire corpus. Word clouds and trends (graphs showing the change in word frequency from document to document) are built from these figures.

By default, a frequency list for your corpus is located in the second tab of the top left panel. The Trend column shows the use of each term in different documents within the corpus as a little graph, where documents lie on the X axis (in the Shakespeare corpus, they are ordered chronologically), and the Y axis is the absolute term frequency.
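
The difference between absolute and relative frequency can be sketched in a few lines of Python. This is only an illustration, not Voyant's own code: the tokenisation rule, the two-document corpus and the function names are assumptions made for the example, and Voyant may scale relative frequencies differently.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())

def term_frequencies(text):
    """Return {term: (absolute count, share of all tokens)} for one document."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return {term: (n, n / total) for term, n in counts.items()}

# Hypothetical two-document corpus, used only for illustration.
corpus = {
    "doc1": "The king loves the queen",
    "doc2": "The queen loves the crown and the king",
}

# One value per document: this is the kind of series a trend graph plots.
for name, text in corpus.items():
    absolute, relative = term_frequencies(text)["king"]
    print(name, absolute, relative)
```

Here "king" occurs once in each document, so its absolute frequency is flat, while its relative frequency drops in the longer second document.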

Categories

Each term can have a category label. There are two built-in ones, @positive and @negative, highlighted in green and red respectively.

Place the cursor in the search field at the bottom of the tool above and try each of the following searches, one at a time (remove the previous search term before entering a new one):

  • positive: this matches each occurrence of the word "positive" in the text (only 3 in the Shakespeare plays corpus)
  • @positive: this is the aggregate number of occurrences for all words in the positive categories group (37,763)
  • ^@positive: this shows the frequencies for each word in the positive categories group (925)

You can create your own categories and colour schemes for them by clicking on the Options icon. This opens a Categories control, a box into which you can copy and paste values (categories are transferable between corpora), along with an Edit button that lets you edit the specified list.

Stop Words

Voyant Tools automatically filters stop words (conjunctions, prepositions, particles, etc.) for some languages, including English and Irish. The built-in list can be updated through the Options tab of the Cirrus panel by clicking the Edit List button. You can also create your own stop list from scratch. The Options button looks like a slider and is located between the question mark and the Windows icon.
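
To see what stop-word filtering does to a frequency list, consider the minimal Python sketch below. The six-word stop list and the function name are assumptions for illustration; Voyant's built-in lists are far longer and language-specific.

```python
from collections import Counter

# A tiny illustrative stop list (hypothetical, not Voyant's own list).
STOP_WORDS = {"the", "and", "of", "to", "a", "in"}

def content_word_counts(text, stop_words=STOP_WORDS):
    """Count the words in the text, skipping anything on the stop list."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t not in stop_words)

counts = content_word_counts("the cat and the hat in the house of the cat")
print(counts.most_common())
```

Without the filter, "the" would top the list. With it, only content words such as "cat" remain.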

Cirrus

The Cirrus panel is a visualisation of the most frequent words in a document in the form of a word cloud, where the size of a word is determined by its frequency in the document. Using Scale, you can choose to display the cloud for the entire corpus or for individual documents. Using the Terms slider, you can adjust the number of words in the cloud. You can also customise the appearance of your word cloud, selecting font family and palette in the Options menu.

You can also export your visualisation by clicking the Export menu button. This option is available for every tool!

   

Reader

  • The Reader panel allows you to view the corpus as continuous text. The corpus is divided into documents, and each document is divided into separate sections for viewing. You can navigate between documents/sections using the slider at the bottom.
  • The location of the section you are viewing in relation to the entire corpus is shown at the bottom as a multi-coloured bar plot. Each column is a separate document in the corpus, and the height and width of the column reflect the length of the document.
  • When you hover over a word, you can see its frequency in the document.

Trends

The Trends panel shows the frequency of words in each document. It allows you to visualise several words for comparison. From the Display menu you can choose a convenient graph view.

Relationships between Words

In addition to individual word frequencies, you can explore the relationships between words.

  • The Collocates tool shows stable combinations of two words in the corpus and in individual documents.
  • Links is a network graph showing keywords (in blue) and their frequent collocates (in orange).
  • TermsBerry is a variation of the word cloud. When you hover over a word, you can see how often it occurs next to other words.
  • The Correlations tool shows to what extent a change in the frequency of one word correlates with a change in the frequency of another.
  • The Contexts tool lets you view the context of a word with a custom window size in the KWIC format.

Links

The Links tool is located in the third tab of the top left panel by default. It shows a network graph of higher-frequency terms that appear in proximity to each other. Keywords are shown in blue and collocates (words in proximity) are shown in orange. Features include:

  • hovering over keywords shows their frequency in the corpus
  • hovering over collocates shows their frequency in proximity (not their total frequency)
  • double-clicking on any word fetches more results
  • a search box for queries (hover over the magnifying icon for help with the syntax)

TermsBerry

The TermsBerry tool is located in the second tab of the top central panel, beside Reader. Like Cirrus, it visualises the most frequent words, but it is more useful for exploring collocates. Hovering over a word highlights the words that occur near it. The Context slider sets how close a word must be to the selected term to count as a neighbour. The Strategy tab lets you switch between frequent words and “significant” words that may be rare overall, but appear much more frequently in certain documents than in others.

Contexts

By default, this panel is located in the first tab in the bottom right corner. This tool lets you view the context of a word with a custom window size in the KWIC (key word in context) format, common for linguistic corpora.
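
The KWIC layout can be reproduced with a short Python sketch. It is only an illustration under simple assumptions (whitespace tokenisation, a hypothetical function name `kwic`), not a description of how Voyant builds its Contexts table.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence."""
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
    return rows

text = "To be or not to be that is the question"
rows = kwic(text.split(), "be", window=2)

# Right-align the left context so the keyword lines up in one column.
for left, keyword, right in rows:
    print(f"{left:>10} | {keyword} | {right}")
```

Increasing `window` has the same effect as widening the context in the panel: more words are shown on each side of the keyword.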

Collocates

The Collocates tool is located in the second tab in the bottom right corner. It shows stable combinations of two words in the whole corpus and in individual documents.
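
One simple way to find collocates is to count the words that fall within a fixed window around a keyword. The Python sketch below illustrates that idea. The window size, the sample sentence and the function name are assumptions, and Voyant's own counting may differ in detail.

```python
from collections import Counter

def collocates(tokens, keyword, window=2):
    """Count the words that appear within `window` tokens of `keyword`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

tokens = "sweet love is sweet and love is blind".split()
love_collocates = collocates(tokens, "love", window=1)
print(love_collocates.most_common())
```

Note that the counts are frequencies in proximity to the keyword, not total frequencies: "sweet" occurs twice in the sentence but only once next to "love".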

Correlations

To open the Correlations tool, go to the bottom right panel, click the Windows icon and select Correlations in the dropdown menu.

The tool shows words whose frequencies are correlated. A positive coefficient means that the frequencies of two words tend to rise and fall together; a negative coefficient means that when the frequency of one word increases, the frequency of the other tends to decrease, and vice versa. The closer the coefficient is to 1 or -1, the stronger the relationship. In the Scale tab, you can choose to display statistics for the entire corpus or only for individual documents.
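
As a rough illustration of what such a coefficient measures, the Python sketch below computes Pearson's correlation coefficient for per-document word counts. The choice of Pearson's formula and the counts themselves are assumptions made for the example, not values taken from any real corpus.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical counts of three words across four documents.
love = [2, 4, 6, 8]
heart = [1, 2, 3, 4]
war = [8, 6, 4, 2]

print(pearson(love, heart))  # rises and falls together: coefficient of 1
print(pearson(love, war))    # moves in opposite directions: coefficient of -1
```

Real corpora rarely produce values this extreme. Coefficients near 0 indicate that the two frequencies vary independently of each other.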

Summary

The Summary panel, located in the first tab of the bottom left corner, provides general information about the corpus and all documents in it.

  • The number of word forms and lexemes in the entire corpus and in individual documents;
  • Vocabulary density: the ratio of the number of unique words to the total number of words in the document;
  • Average sentence length;
  • The most frequent words in the corpus;
  • Distinctive words: words that occur more often in a specific document than in the corpus as a whole.
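
Some of these statistics are easy to reproduce by hand. The Python sketch below assumes that vocabulary density is the number of unique words (types) divided by the total number of words (tokens), matching the Ratio column of the Documents tab. The sentence-splitting rule and the function name are simplifications chosen for illustration.

```python
import re

def summary_stats(text):
    """Compute token count, type count, vocabulary density and sentence length."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    types = set(tokens)
    return {
        "tokens": len(tokens),
        "types": len(types),
        "vocabulary_density": len(types) / len(tokens),
        "average_sentence_length": len(tokens) / len(sentences),
    }

stats = summary_stats("The cat sat. The cat ran away!")
print(stats)
```

In this example there are 7 tokens but only 5 types, because "the" and "cat" are repeated. A density closer to 1 means fewer repeated words.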

Documents

The Documents tab in the bottom left corner shows document length statistics, the number of unique words, or types, in every document, the ratio of unique words to the total number of words (Ratio column) and the average sentence length.

Phrases

The Phrases tab in the bottom left corner provides information about N-gram frequency. N-grams are sequences of N consecutive words (N = 1, 2, 3, 4, etc.). You can set the length of the N-grams using the Length slider.
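
For readers new to the concept, N-grams can be extracted with a sliding window over the token list. The Python sketch below is a minimal illustration: the function name and sample text are assumptions, and it is not meant to mirror how the Phrases tool works internally.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(1))
```

Changing `n` plays the same role as moving the Length slider: in this text the bigram "to be" occurs twice, while every trigram occurs only once.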