Data Management General Guidance


Table of Contents

Introduction

What is a data management plan?

A data management plan is a formal document that outlines what you will do with your data during and after a research project. Most researchers collect data with some form of plan in mind, but it is often inadequately documented and incompletely thought out. Many data management issues can be handled easily or avoided entirely by planning ahead. With the right process and framework, planning doesn't take long and can pay off enormously in the long run.

Who requires a plan?

In February 2013, the White House Office of Science and Technology Policy (OSTP) issued a memorandum directing Federal agencies that provide significant research funding to develop a plan to expand public access to research. Among other requirements, the plans must

Ensure that all extramural researchers receiving Federal grants and contracts for scientific research and intramural researchers develop data management plans, as appropriate, describing how they will provide for long-term preservation of, and access to, scientific data in digital formats resulting from federally funded research, or explaining why long-term preservation and access cannot be justified

The National Science Foundation (NSF) already requires a 2-page plan as part of the funding proposal process. Soon most or all US Federally funded grants will require some form of data management plan.

We can help

We have been working with internal and external partners to make data management plan development less complicated. By getting to know your research and data, we can match your specific needs with data management best practices in your field to develop a data management plan that works for you. If you do this work at the beginning of your research process, you will have a far easier time following through and complying with funding agency and publisher requirements.

We recommend that those applying for funding from US Federal agencies, such as the NSF, use the DMPTool. The DMPTool provides guidance for many of the NSF Directorate and Division requirements, along with links to UC resources, services, and help. Contact UC3 if you have questions and would like feedback on your plan.

Types of Data

Research projects generate and collect countless varieties of data. To formulate a data management plan, it's useful to categorize your data in four ways: by source, format, stability, and volume.

What's the source of the data?

Data come from many different sources, but they can be grouped into four main categories. The category (or categories) your data come from will affect the choices you make throughout your data management plan.

Observational

Experimental

Simulation

Derived / Compiled

What's the form of the data?

Data can come in many forms, including text, numeric data, images, audio, video, models, and software.

How stable is the data?

Data can also be fixed or changing over the course of the project (and perhaps beyond the project's end). Do the data ever change? Do they grow? Is previously recorded data subject to correction? Will you need to keep track of data versions? With respect to time, the common categories of dataset are fixed (complete once collected and never changed), growing (new data are added over time, but existing data are never altered or deleted), and revisable (previously recorded data may be corrected or updated).

The answer to this question affects how you organize the data as well as the level of versioning you will need to undertake. Keeping track of rapidly changing datasets can be a challenge, so it is imperative that you begin with a plan to carry you through the entire data management process.

How much data will the project produce?

Anticipate the volume of data the project will produce. Image data, for instance, typically requires a lot of storage space, so you'll want to decide whether to retain all your images (and, if not, how you will decide which to discard) and where such large data can be housed. Be sure to know your archiving organization's capacity for storage and backups.

To avoid being under-prepared, estimate the growth rate of your data: consider how much data each experiment, survey, or collection event produces, how often data will be collected, and how long the project will run.
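
As a rough illustration, a back-of-the-envelope estimate might look like the following Python sketch; every number (images per day, file size, project length, number of copies) is a hypothetical placeholder.

```python
# Back-of-the-envelope storage estimate; all numbers here are hypothetical.
images_per_day = 200      # images captured per field day
mb_per_image = 25         # average size of one raw image, in MB
field_days = 120          # planned collection days over the project
copies = 3                # working copy plus two backups

total_gb = images_per_day * mb_per_image * field_days * copies / 1024
print(f"Estimated storage needed: {total_gb:,.0f} GB")
```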

File Formats

The file format you choose for your data is a primary factor in someone else's ability to access it in the future. Think carefully about which file format will be best to manage, share, and preserve your data. Technology changes continually, and all contemporary hardware and software should be expected to become obsolete. Consider how your data will be read if the software used to produce it becomes unavailable. Although any file format you choose today may become unreadable in the future, some formats are more likely to remain readable than others.

Formats likely to be accessible in the future are non-proprietary, based on open and documented standards, in common use by the research community, unencrypted, and uncompressed.

Examples of preferred format choices include plain text (.txt), comma-separated values (.csv), PDF/A, TIFF, and MPEG-4.

Examples of discouraged format choices and better alternatives:

| Discouraged format | Alternative format |
| --- | --- |
| Excel (.xls, .xlsx) | Comma-separated values (.csv) |
| Word (.doc, .docx) | Plain text (.txt) or, if formatting is needed, PDF/A (.pdf) |
| PowerPoint (.ppt, .pptx) | PDF/A (.pdf) |
| Photoshop (.psd) | TIFF (.tif, .tiff) |
| QuickTime (.mov) | MPEG-4 (.mp4) |

If you find it necessary or convenient to work with data in a proprietary/discouraged file format, do so, but consider saving your work in a more archival format when you are finished.

For more information on recommended formats, see the CDL Digital File Format Recommendations.

Tabular data

Tabular data warrants special mention because it is so common across disciplines, most often in the form of Excel spreadsheets. If you do your analysis in Excel, use the "Save As..." command to export your work to .csv format when you are done (a scripted alternative is sketched after the list below). Your spreadsheets will be easier to understand and to export if you follow best practices when you set them up, such as:

- Put each variable in its own column and each observation in its own row, with a single header row of column names.
- Keep one piece of information per cell, and avoid merged cells, embedded charts, and comments.
- Don't rely on formatting (color, bold, borders) to carry meaning, since it is lost on export.
- Use a single, documented convention for missing values (e.g., a blank cell or "NA") rather than zeros or ad hoc notes.
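
If you have many workbooks or sheets to convert, the export can be scripted. Below is a minimal sketch using the pandas library (an assumption, not a tool this guide requires); the file name is hypothetical.

```python
# Export every sheet of a workbook to its own .csv file.
# Requires pandas (and an Excel reader such as openpyxl) to be installed.
import pandas as pd

sheets = pd.read_excel("field_measurements.xlsx", sheet_name=None)  # dict of DataFrames, one per sheet
for name, frame in sheets.items():
    frame.to_csv(f"field_measurements_{name}.csv", index=False)
```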

Other risks to accessibility

Organizing Files

Basic Directory and File Naming Conventions

These are rough guidelines to help you manage your data files if you don't already have internal conventions of your own. When organizing files, the name of the top-level directory/folder should include the project title or acronym, a unique identifier (such as a grant number), and a date (e.g., in yyyymmdd format).

The sub-directory structure should have clear, documented naming conventions. Separate files or directories could apply, for example, to each run of an experiment, each version of a dataset, and/or each person in the group.

File Renaming

There are a number of utilities that can batch-rename files for you; alternatively, a short script can do the same job, as sketched below.
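
A minimal renaming sketch in Python (the directory name, date, and project code are hypothetical; adapt the pattern to your own convention):

```python
# Prepend a date and project code to every .csv file in a directory.
from pathlib import Path

project = "YOSE"              # hypothetical project abbreviation
data_dir = Path("raw_data")   # hypothetical directory of files to rename

for path in sorted(data_dir.glob("*.csv")):
    new_name = f"2024-05-01_{project}_{path.name}"
    path.rename(path.with_name(new_name))
    print(f"{path.name} -> {new_name}")
```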

File Naming Conventions for Specific Disciplines

Many disciplines have recommendations, for example:

Metadata: Data Documentation

Why document data?

Clear and detailed documentation is essential for data to be understood, interpreted, and used. Data documentation describes the content, formats, and internal relationships of your data in detail and will enable other researchers to find, use and properly cite your data.

Begin to document your data at the very beginning of your research project and continue throughout the project. Doing so will make the process much easier. If you have to construct the documentation at the end of the project, the process will be painful and important details will have been lost or forgotten. Don't wait to document your data!

What to document?

Research Project Documentation

Dataset documentation

How will you document your data?

Data documentation is commonly called metadata – "data about data". Researchers can document their data according to various metadata standards. Some metadata standards are designed for documenting the contents of files, others for documenting the technical characteristics of files, and still others for expressing relationships between files within a dataset. If you want to be able to share or publish your data, the DataCite metadata standard is of particular significance. It is important to establish a metadata strategy that can describe your data and satisfy your data management needs. For assistance in defining an adequate metadata strategy, please contact uc3@ucop.edu.

Below are some general aspects of your data that you should document, regardless of your discipline. At minimum, store this documentation in a "readme.txt" file, or the equivalent, with the data itself. You can also reference a published article that may contain some of this information.

General Overview

| Element | Description |
| --- | --- |
| Title | Name of the dataset or the research project that produced it |
| Creator | Names and addresses of the organizations or people who created the data; preferred format for personal names is surname first (e.g., Smith, Jane) |
| Identifier | Unique number used to identify the data, even if it is just an internal project reference number |
| Date | Key dates associated with the data, including project start and end dates, release date, the time period covered by the data, and other dates in the data lifespan such as maintenance cycles or update schedules; preferred format is yyyy-mm-dd, or yyyy.mm.dd-yyyy.mm.dd for a range |
| Method | How the data were generated, listing equipment and software used (including model and version numbers), formulae, algorithms, experimental protocols, and other things one might include in a lab notebook |
| Processing | How the data have been altered or processed (e.g., normalized) |
| Source | Citations to data derived from other sources, including details of where the source data are held and how they were accessed |
| Funder | Organizations or agencies that funded the research |

Content Description

| Element | Description |
| --- | --- |
| Subject | Keywords or phrases describing the subject or content of the data |
| Place | All applicable physical locations |
| Language | All languages used in the dataset |
| Variable list | All variables in the data files, where applicable |
| Code list | Explanation of codes or abbreviations used in either the file names or the variables in the data files (e.g., '999 indicates a missing value in the data') |

Technical Description

| Element | Description |
| --- | --- |
| File inventory | All files associated with the project, including extensions (e.g., 'NWPalaceTR.WRL', 'stone.mov') |
| File formats | Formats of the data, e.g., FITS, SPSS, HTML, JPEG |
| File structure | Organization of the data file(s) and layout of the variables, where applicable |
| Version | Unique date/time stamp and identifier for each version |
| Checksum | A digest value computed for each file that can be used to detect changes; if a recomputed digest differs from the stored digest, the file must have changed (a sketch for computing checksums follows these tables) |
| Necessary software | Names of any special-purpose software packages required to create, view, analyze, or otherwise use the data |

Access

| Element | Description |
| --- | --- |
| Rights | Any known intellectual property rights, statutory rights, licenses, or restrictions on use of the data |
| Access information | Where and how your data can be accessed by other researchers |
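
For the checksum element, a minimal Python sketch (the directory name and choice of algorithm are illustrative assumptions):

```python
# Compute and print a SHA-256 digest for every file in a dataset directory.
import hashlib
from pathlib import Path

def checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

for path in sorted(Path("dataset").glob("**/*")):
    if path.is_file():
        print(f"{checksum(path)}  {path}")
```

Storing the output alongside the data (for example, in a manifest file) lets you or an archive recompute the digests later and detect any change or corruption.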

Persistent Identifiers

If you want to be able to share or cite your dataset, you'll want to assign a public, persistent, unique identifier to it. There are a variety of public identifier schemes, but good schemes share some common properties: the identifiers are globally unique, they are actionable (they resolve to something useful on the web), and they are managed with a commitment to persistence.

Today, this means data identifiers should fit inside a web address (URL) and be well-enough managed to remain actionable over the long-term. The most important factors in long-term data sharing are stable data storage and well-managed identifier redirection.

Any URL can be thought of as "resolving" either directly to its target (via the URL's hostname) or indirectly through one or more "redirects" to a final target URL. If your dataset is moved (e.g., from one archive to another) redirection allows the identifier to continue to resolve to the dataset, now at its new location.

An important factor to consider is choice of URL hostname. This is the domain name at the beginning of a URL (right after the "http://") that determines where URL resolution starts (for example, daac.ornl.gov). An identifier that doesn't contain a hostname may implicitly use a well-known hostname as the starting point for resolution. For example, dx.doi.org is the hostname for DOIs, so the document identified by doi:10.1000/182 can be found by typing "http://dx.doi.org/10.1000/182" in a web browser.
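
To see redirection in action, you can follow the redirect chain for the DOI mentioned above; this sketch assumes the third-party requests library is installed.

```python
# Resolve a DOI and show each redirect hop on the way to the final location.
import requests

response = requests.get("http://dx.doi.org/10.1000/182", allow_redirects=True)

for hop in response.history:                     # intermediate redirect responses
    print(hop.status_code, hop.headers.get("Location"))
print("Final location:", response.url)
```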

Because persistent (long-term) identifiers tend to be opaque (e.g., a string of digits) and reveal little or nothing about the nature of the identified object, it is also important for you to maintain metadata associated with the object. Among the most important pieces of metadata for you to maintain is the target URL that ensures that the identifier remains actionable. Whatever identifier scheme you choose, if you don't update the target URL when your data is moved, the identifier will break.

Commonly used identifier schemes for research data include Digital Object Identifiers (DOIs) and Archival Resource Keys (ARKs), both of which are available through EZID (described below).

EZID: Identifiers Made Easy

CDL provides an identifier service called EZID that offers several choices of identifier. EZID enables you to take control of the management and distribution of your datasets, share and get credit for your datasets, and build your reputation for the collection and documentation of research. By making data resources easier to access, re-use, and verify, EZID helps you to build on previous work, conduct new research, and avoid duplicating previous work.

Security and Storage

Data Security

Data security is the protection of data from unauthorized access, use, change, disclosure, and destruction. Make sure your data are safe with regard to network and computer security, physical security of storage media and facilities, and control over who can access or modify the files.

Encryption and Compression

Unencrypted data will be more easily read by you and others in the future, but you may need to encrypt sensitive data.

Uncompressed data will also be easier to read in the future, but you may need to compress files to conserve disk space.
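
If you do compress, prefer widely supported formats. A minimal sketch using Python's standard-library gzip module (file names are hypothetical):

```python
# Compress a data file with gzip and later recover the original.
import gzip
import shutil

# Compress survey_2024.csv -> survey_2024.csv.gz
with open("survey_2024.csv", "rb") as source, gzip.open("survey_2024.csv.gz", "wb") as target:
    shutil.copyfileobj(source, target)

# Decompress the archive back into a readable copy.
with gzip.open("survey_2024.csv.gz", "rb") as source, open("survey_2024_restored.csv", "wb") as target:
    shutil.copyfileobj(source, target)
```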

Backups and storage

Making regular backups is an integral part of data management. You can back up data to your personal computer, external hard drives, or departmental or university servers. Software that makes backups automatically can simplify this process considerably. CDs or DVDs are not recommended because they are easily lost, decay rapidly, and fail frequently. The UK Data Archive provides additional guidelines on data storage, backup, and security.

Backup Your Data

Data Backup Options

Test your backup system

To be sure that your backup system is working, periodically retrieve your data files and confirm that you can read them. You should do this when you initially set up the system and on a regular schedule thereafter.
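
One way to automate part of this check is to compare checksums between the working copy and the backup; the directory paths below are hypothetical.

```python
# Verify that every file in the working copy exists in the backup
# and that both copies have identical SHA-256 digests.
import hashlib
from pathlib import Path

def digest(path):
    return hashlib.sha256(path.read_bytes()).hexdigest()

working = Path("data")                 # hypothetical working directory
backup = Path("/mnt/backup/data")      # hypothetical backup location

for original in sorted(working.rglob("*")):
    if not original.is_file():
        continue
    copy = backup / original.relative_to(working)
    if not copy.exists():
        print(f"MISSING from backup: {original}")
    elif digest(original) != digest(copy):
        print(f"MISMATCH between copies: {original}")
```

A checksum comparison confirms the bytes are intact; you should still open a sample of files to confirm they remain readable in their applications.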

Other data preservation considerations

Who is responsible for managing and controlling the data?

Who controls the data (e.g., the PI, a student, your lab, your university, your funder)? Before you spend a lot of time figuring out how to store, share, or name the data, make sure you have the authority to do so.

For what or whom are the data intended?

Who is your intended audience for the data? How do you expect they will use the data? The answers to these questions will help inform how you structure and distribute the data.

How long should the data be retained?

Is there any requirement that the data be retained? If so, for how long? 3-5 years, 10-20 years, permanently? Not all data need to be retained, and some data required to be retained need not be retained indefinitely. Have a good understanding of your obligation for the data's retention.

Beyond any externally imposed requirements, think about the long-term usefulness of the data. If the data come from an experiment that you anticipate will be repeatable more quickly, cheaply, and accurately as technology progresses, you may want to store them for only a relatively brief period. If the data consist of observations made outside the laboratory that can never be repeated, you may wish to store them indefinitely.

Sharing and Archiving

Why share your data?

Considerations when preparing to share data

Ways to share your data

While the first three options above are valid ways to share data, a repository is much more able to provide long-term access. Data deposited in a repository can be supplemented with a "data paper"—a relatively new type of publication that describes a dataset, but does not analyze it or draw any conclusions—published in a journal such as Nature Scientific Data or [Geoscience Data Journal](http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)2049-6060).

Finding a data repository

You should select a repository or archive for your data based on the long-term security offered and the ease of discovery and access by colleagues in your field. There are two common types of repository to look for: discipline-specific repositories, which serve a particular research community and make data easy for peers to find, and institutional repositories, which are operated by a university or library for the researchers it supports.

A searchable and browsable list of repositories can be found at these websites:

Citing Data

Citing data is important in order to give credit to the researchers who created the data, make it easier for others to find and reuse the data, support the verification and reproduction of published results, and allow the impact of a dataset to be tracked.

Citation Elements

A dataset should be cited formally in an article's reference list, not just informally in the text. Many data repositories and publishers provide explicit instructions for citing their contents. If no citation information is provided, you can still construct a citation following generally agreed-upon guidelines from sources such as the CODATA Report on data citation, the ESIP Federation Data Citation Guidelines, and the current DataCite Metadata Schema.

Core elements

There are five core elements usually included in a dataset citation: Creator, Publication Year, Title, Publisher (the archive or repository holding the data), and Identifier. Additional elements can be added as appropriate.

Creator names in non-Roman scripts should be transliterated using the ALA-LC Romanization Tables.

Common additional elements

Although the core elements are sufficient in the simplest case – citation to the entirety of a static dataset – additional elements may be needed if you wish to cite a dynamic dataset or a subset of a larger dataset.

Example citations

Sharing data that you produced/collected yourself

Sharing data that you have collected from other sources

UC researchers who are uncertain about their rights to disseminate data can consult their campus Office of General Counsel. Note: laws about data vary outside the U.S.

For a general discussion about publishing your data, applicable to many disciplines, see the ICPSR Guide to Social Science Data Preparation and Archiving.

Confidentiality and Ethical Concerns

It is vital to maintain the confidentiality of research subjects both as an ethical matter and to ensure continuing participation in research. Researchers need to understand and manage tensions between confidentiality requirements and the potential benefits of archiving and publishing the data.

To ethically share confidential data, you may be able to obtain informed consent from participants that explicitly covers data sharing, anonymize or de-identify the data so that individuals cannot be identified, and/or restrict access to the data through a repository that supports data use agreements.