LibGuides: Research Data Management: Data Creation/Collection

Data Creation and Collection

"Information in digital, computer-readable format or paper-based that is collected, generated or obtained during the course of or as a result of undertaking research, which is subsequently used by the Researcher as a basis for making calculations or drawing conclusions to develop, support or revise theories, practices and findings."

See Research Data Management Policy at University of Galway

Examples of research data

Documents (text, MS Word), spreadsheets
Scanned laboratory notebooks, field notebooks, diaries
Online questionnaires, transcripts, surveys or codebooks
Digital audiotapes, videotapes and other digital recording media
Scanned photographs or films
Transcribed test responses
Database contents (video, audio, text, images)
Digital models, algorithms, scripts
Contents of an application (input, output, logfiles for analysis software, simulations)
Documented methodologies and workflows
Records of standard operating procedures and protocols
Historical documents
Physical objects e.g. blood samples

A Guide for Researchers from OpenAIRE: Data formats for preservation

Research Data Lifecycle

"The notion of a data lifecycle is one that has gained popularity as the culture of data sharing becomes part of our everyday research language. The data lifecyle extends the typical research cycle"

Corti, Louise, Van den Eynden, Veerle, Bishop, Libby, & Woolard, Matthew. (2014). Managing and sharing research data: a guide to good practice. London: Sage. p.17

CC BY-ND see https://www.jisc.ac.uk/guides/rdm-toolkit

Typical activities undertaken in the research data lifecycle

Sensible file names and a well-organised folder structure make it easier to find and keep track of data files. Links to relevant advice and resources provided by the UK Data Service are outlined below.

UK Data Service table of file formats recommended and accepted by them for data sharing, reuse and preservation.

UK Data Service guidelines for organising and formatting your data

UK Data Service guidelines on file and folder structures
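
As an illustration of a consistent file naming convention, here is a minimal Python sketch. The pattern used (project, description, ISO date, two-digit version) and the function name build_file_name are assumptions made for this example, not a scheme prescribed by the UK Data Service; adapt the elements to whatever your project agrees on.

```python
import re
from datetime import date


def build_file_name(project: str, description: str, created: date,
                    version: int, extension: str) -> str:
    """Return a file name of the form project_description_YYYY-MM-DD_vNN.ext."""

    def slug(text: str) -> str:
        # Lowercase the text and collapse every run of characters that is
        # not a letter or digit into a single hyphen.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    clean_extension = extension.lower().lstrip(".")
    return (f"{slug(project)}_{slug(description)}_"
            f"{created.isoformat()}_v{version:02d}.{clean_extension}")


# Hypothetical project and file, for illustration only.
print(build_file_name("SoilSurvey", "Site A pH", date(2024, 3, 5), 2, "CSV"))
# prints: soilsurvey_site-a-ph_2024-03-05_v02.csv
```

Putting the date in ISO format (year first) means that an alphabetical listing of the folder is also a chronological one.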

It is important to ensure that different copies or versions of files, files held in different formats or locations, and information that is cross-referenced between files are all subject to version control. Guidance on version control and authenticity is available from the UK Data Archive.
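
One simple form of version control is a version number in each file name. The sketch below assumes a hypothetical convention in which names end in _vNN before the extension (for example interview_v02.txt) and picks out the most recent version of each file. It is an illustration only and does not replace a dedicated version control system.

```python
import re

# Assumed convention: <stem>_v<number>.<extension>
VERSION_PATTERN = re.compile(r"^(?P<stem>.+)_v(?P<version>\d+)\.(?P<ext>[^.]+)$")


def latest_versions(file_names):
    """Map each (stem, extension) pair to the file name with the highest version."""
    latest = {}
    for name in file_names:
        match = VERSION_PATTERN.match(name)
        if match is None:
            continue  # Files without a version suffix are ignored in this sketch.
        key = (match["stem"], match["ext"])
        version = int(match["version"])  # Compare as numbers, not as text.
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, name)
    return {key: name for key, (_, name) in latest.items()}


files = ["interview_v01.txt", "interview_v02.txt", "interview_v10.txt",
         "notes.txt", "survey_v03.csv"]
print(latest_versions(files))
```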

Documentation is the contextual and explanatory information required to make sense of the dataset. It is a user's guide to your data, making it understandable, verifiable and reusable.

Document your data so that ...

You remember the details later
Others can understand your research
Your findings can be verified
Your results can be replicated
You can avoid misinterpretation
Your data can be archived for access and re-use

Research data should be documented at various levels

Study level

Describes the research project, the data creation processes, rights and general context. Good study-level documentation should include information about research design and context, data collection methods, structure of data files, secondary data sources used, data validation procedures and conditions of use.

Data level

Describes how all the files (or tables in a database) that make up the dataset relate to each other; what format they are in; and whether they supersede or are superseded by previous files. A readme.txt file is the classic way of accounting for all the files and folders in a project.
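
A readme.txt file can be drafted by hand, but a short script helps make sure no file is left out. The following Python sketch walks a project folder and lists every file next to a description that you supply. The function name build_readme, the descriptions dictionary and the "NOT YET DOCUMENTED" placeholder are assumptions made for this example.

```python
from pathlib import Path


def build_readme(project_dir: Path, descriptions: dict[str, str]) -> str:
    """Return readme text listing every file under project_dir with a description."""
    files = sorted(
        path.relative_to(project_dir).as_posix()
        for path in project_dir.rglob("*")
        if path.is_file()
    )
    lines = [f"Contents of {project_dir.name}", ""]
    for relative in files:
        if relative == "readme.txt":
            continue  # Do not list the readme inside itself.
        # Flag files that nobody has described yet so they stand out.
        note = descriptions.get(relative, "NOT YET DOCUMENTED")
        lines.append(f"{relative}: {note}")
    return "\n".join(lines) + "\n"
```

The returned text could then be saved with, for example, (project_dir / "readme.txt").write_text(text), and the flagged entries filled in by hand.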

Examples of data documentation

Database schema
Information about equipment settings and instrument calibration
Laboratory notebooks and experimental protocols
Methodology reports
Provenance information about sources of derived or digitised data
Questionnaires, codebooks, data dictionaries
Software syntax and output files

Learn more ...

Advice about good practice relating to documentation and metadata is available from the UK Data Service.

A Guide for Researchers from OpenAIRE: Electronic Lab Notebooks - should you go “e”?

Metadata is similar to documentation but is more structured, conforms to set standards and is machine readable. It is required to facilitate archiving, discovery and citation of the dataset.

Metadata is a formal, structured description of a dataset, used by archives to create catalogue records. There are three categories of metadata:

Descriptive metadata includes author, title, keywords and abstract, and enables users to find resources online.

Administrative metadata includes information about when and how a resource was created as well as file type, technical information and access rights.

Structural metadata provides information about the relationship between the parts that make up a compound object, e.g. relating articles, issues and volumes of serial publications, or the pages and chapters of a book.

Metadata describes the content, quality, condition, and other characteristics of a dataset. It enables data to be preserved, minimizes duplication of effort in the collection of expensive digital data and fosters the sharing of digital data resources.

Who created the data?
What is the content of the data?
When were the data created?
Where are the data located geographically?
How were the data developed?
Why were the data developed?

Why is metadata essential?

Metadata enables data developers to:

Avoid data duplication by checking whether the data already exist
Share reliable information about a dataset by creating metadata for it
Reuse a dataset with confidence about its origins and quality as well as having valuable information about it
Publicize the data they have created by making the metadata available in repositories
Cite their datasets and increase the visibility of the data.

Metadata enables users to:

Search for and get access to data from a variety of sources
Restrict searches to a geographic region
Determine whether the data will be applicable for use in a particular study
Acquire a dataset
Know the restrictions on how a dataset may be used

Metadata enables organizations to:

Safeguard their investment in their data by retaining information about how it was collected, processed, quality controlled, used and restricted
Create a permanent record of the dataset which is critical institutional memory
Ensure that datasets “live on” for the organization after researchers leave or retire
Re-use a dataset in another research project if appropriate, with future researchers knowing how the dataset was created
Advertise its research and enable new partnerships and collaborations by data sharing

Essential fields

Title: Name of dataset or research project that produced it. (Include both if applicable.)

Creator(s): Names and addresses of the group that created the data.

Identifier: Unique identifier or number that is used to identify the data. This could be an internal project number or code to reference the data.

Abstract/Description: A brief synopsis of the project or data that another researcher can review quickly to see the relevance of the project to what they are seeking.

Dates: All the dates associated with the project. The most important is probably the release date of the data, but you'll eventually want to include:

start and end date of the project
time period covered by the data or project
maintenance cycle of the data
update schedule of the data
any other important dates that will help document the process and aid in preservation

Rights: Any known intellectual property rights held for the data or project.
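
The essential fields above can be held in a simple machine-readable record. The Python sketch below checks a record for missing essential fields before it is deposited. The key names (title, creators, identifier, description, dates, rights) are assumptions chosen to mirror the list above; they are not taken from any particular metadata standard, and all values are made up.

```python
import json

# Assumed key names, mirroring the essential fields listed above.
ESSENTIAL_FIELDS = ("title", "creators", "identifier", "description", "dates", "rights")


def missing_essential_fields(record: dict) -> list[str]:
    """Return the essential fields that are absent or empty in the record."""
    return [field for field in ESSENTIAL_FIELDS if not record.get(field)]


# Hypothetical record with made-up values; "rights" is deliberately left empty.
record = {
    "title": "Example survey dataset",
    "creators": ["Example Research Group"],
    "identifier": "PROJ-0001",
    "description": "Short synopsis of what the dataset contains.",
    "dates": {"released": "2024-01-31"},
    "rights": "",
}

print(missing_essential_fields(record))  # prints: ['rights']
print(json.dumps(record, indent=2))      # the record in machine-readable form
```

A repository will normally expect the record in its own schema, so treat this as a completeness check rather than a deposit format.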

Recommended fields

Contributor(s): Names and addresses of additional individuals that contributed to the project.

Subject: Keywords, phrases, or subject headings that will describe the subject or content of the data. (In adding these, think of how you would search for the materials.)

Funders: Organizations or agencies that funded the research or project.

Access Information: The location of the data and how the researcher can access the materials. (Confidentiality can be addressed here as well.)

Language: The language(s) of the content.

Location: If the data relates to a physical location, the spatial coverage should be documented.

Methodology: The process of how the data was generated, including:

the equipment and software used, including the version
the experimental protocol
data validation and quality assurance of the data
any other relevant information

Data Processing: Documenting the alterations made to the data will aid in preservation of the data and record who made changes and for what reasons at specific times.

Sources: Citations for the sources that were used during the project. (Include where the other data or material was stored and how it was accessed when appropriate.)

List of File Names: List all of the data files associated with the project and include the file extensions. (e.g., stone.mov)

File Formats: Format(s) of the data and any software that is required to read the data including the version. (e.g., TIFF, FITS, JPEG, HTML)

File Structure: Organization of the data file(s) (and the layout of the variables when applicable).

Variable List: List of variables in the data files, when applicable.

Code Lists: Explanation of codes or abbreviations used in the file names, variables of the data, or the project over all that will help the user understand the project. (e.g., "999" indicates a missing value in the data)

Versions: Add a date/time stamp to each file and use a separate identifier for each version.

Checksums: Used to test if your file has changed over time. (This will aid in the long term preservation of the data and help make it secure by tracking alterations.)

Related Materials: Links or location of materials that are related to the project. (e.g., articles, presentations, papers)

Citation: The recommended way to cite the data or the information needed.
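
For the Checksums field described above, a checksum can be computed with Python's standard hashlib module. The sketch below uses SHA-256, which is one common choice and not a requirement of this guide. The helper names sha256_of and changed_files, and the idea of keeping a manifest that maps file names to recorded checksums, are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so that large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(manifest: dict[str, str], base_dir: Path) -> list[str]:
    """Return names of files that are missing or whose checksum no longer matches."""
    changed = []
    for name, recorded in sorted(manifest.items()):
        path = base_dir / name
        if not path.is_file() or sha256_of(path) != recorded:
            changed.append(name)
    return changed
```

Record the checksums when the dataset is finalised, store the manifest with the metadata, and run changed_files at intervals; an empty list means that no recorded file has changed.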

What is a metadata standard?

A Standard provides a structure to describe data with:

Common terms for consistency between records
Common definitions for easier interpretation
Common language for ease of communication
Common structure to quickly locate information

Standards provide a uniform summary description of a dataset.

The Research Data Alliance Standards Directory contains widely used metadata standards in the Arts and Humanities, Engineering, Life Sciences, Physical Sciences and Mathematics, Social and behavioural Sciences and General Research Data.

The Digital Curation Centre provides links to information about discipline specific metadata standards, including profiles, tools to implement the standards, and use cases of data repositories currently implementing them.

Biosharing is an educational resource on inter-related data standards, databases and policies in the life, environmental and biomedical sciences.

The Data Documentation Initiative (DDI) is an international standard for describing the data produced by surveys and other observational methods in the social, behavioural, economic, and health sciences. DDI is a free standard that can be used to document and manage different stages in the research data lifecycle, such as conceptualization, collection, processing, distribution, discovery, and archiving. Documenting data with DDI facilitates understanding, interpretation, and use by people, software systems, and computer networks.

CEDAR (Center for Expanded Data Annotation and Retrieval) is a repository of community-defined metadata templates. Its goal is to improve metadata and its use in the biomedical sciences. The CEDAR metadata tools can be used to create, annotate, analyze, validate and search metadata based on the fields and relations defined in the metadata templates.

Fairsharing.org is a good place to start to find metadata standards for your discipline.

Find datasets by using the Data Citation Index (available via the Web of Science), an index to research data from repositories across disciplines and around the world. You can access it from the Library catalogue. It indexes data and provides links to the repositories where the data are stored. Click here for a short tutorial on the Data Citation Index.

Google Dataset Search is a good place to start your search for datasets related to your discipline. It is important to note that it is not a comprehensive index to datasets available in repositories.

Data repositories or archives allow researchers to upload and publish their data, thereby making the data available for other researchers to re-use. Similarly, a data archive allows users to deposit and publish data but will generally offer greater levels of curation to community standards, have specific guidelines on what data can be deposited and is more likely to offer long-term preservation as a service. Sometimes the terms data repositories and data archives are used interchangeably.

Learn more about data repositories and archives

Data Journals
- Data journals offer a platform for publication of "data articles" or "dataset papers" that are typically short articles providing a technical description of a dataset.
- Some data journals also publish (i.e. host) the dataset themselves. Others link to datasets hosted on dedicated data repositories.
- Conventional journals may link to datasets (e.g. Nature) or embed research data within the structure of the scientific article.

Note: Minimum requirements for the third-party hosting may be specified e.g. Geoscience Data Journal specifies that the host repository must be able to mint a DOI. New data journals that are peer reviewed and citable include Scientific Data (Nature) and the Geoscience Data Journal (Wiley).

Reference: Ware, Mark, & Mabe, Michael. (2015). The STM report: An overview of scientific and scholarly journal publishing. STM: International Association of Scientific, Technical and Medical Publishers. pp. 141-2
