A data management plan is a formal document that outlines what you will do with your data during and after a research project. Most researchers collect data with some form of plan in mind, but it's often inadequately documented and incompletely thought out. Many data management issues can be handled easily or avoided entirely by planning ahead. With the right process and framework, planning doesn't take long and can pay off enormously in the long run.
In February of 2013, the White House Office of Science and Technology Policy (OSTP) issued a memorandum directing Federal agencies that provide significant research funding to develop a plan to expand public access to research. Among other requirements, the plans must
Ensure that all extramural researchers receiving Federal grants and contracts for scientific research and intramural researchers develop data management plans, as appropriate, describing how they will provide for long-term preservation of, and access to, scientific data in digital formats resulting from federally funded research, or explaining why long-term preservation and access cannot be justified
The National Science Foundation (NSF) already requires a 2-page plan as part of the funding proposal process. Soon most or all US Federally funded grants will require some form of data management plan.
We have been working with internal and external partners to make data management plan development less complicated. By getting to know your research and data, we can match your specific needs with data management best practices in your field to develop a data management plan that works for you. If you do this work at the beginning of your research process, you will have a far easier time following through and complying with funding agency and publisher requirements.
We recommend that those applying for funding from US Federal agencies, such as the NSF, use the DMPTool. The DMPTool provides guidance for many of the NSF Directorate and Division requirements, along with links to UC resources, services, and help. Contact UC3 if you have questions and would like feedback on your plan.
Research projects generate and collect countless varieties of data. To formulate a data management plan, it's useful to categorize your data in four ways: by source, format, stability, and volume.
Although data come from many different sources, they can be grouped into four main categories. The category or categories your data come from will affect the choices that you make throughout your data management plan.
Data can come in many forms, including
Data can also be fixed or changing over the course of the project (and perhaps beyond the project's end). Do the data ever change? Do they grow? Is previously recorded data subject to correction? Will you need to keep track of data versions? With respect to time, the common categories of dataset are
The answer to this question affects how you organize the data as well as the level of versioning you will need to undertake. Keeping track of rapidly changing datasets can be a challenge, so it is imperative that you begin with a plan to carry you through the entire data management process.
For instance, image data typically requires a lot of storage space, so you'll want to decide whether to retain all your images (and, if not, how you will decide which to discard) and where such large data can be housed. Be sure to know your archiving organization's capacity for storage and backups.
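As a rough illustration of sizing storage for large data such as images, the sketch below multiplies a collection rate by an average file size, a project length, and a number of copies. The function name, the example figures, and the choice of three copies are all assumptions made for illustration; substitute numbers from your own project and your archiving organization.

```python
def estimate_storage_gb(files_per_month, avg_file_mb, months, copies=3):
    """Estimate total storage in decimal gigabytes (1 GB = 1000 MB).

    `copies` counts the working copy plus backups; three is an assumed
    default for this sketch, not a requirement from any funder.
    """
    total_mb = files_per_month * avg_file_mb * months * copies
    return total_mb / 1000


# Hypothetical project: 2,000 images a month, 25 MB each, for 36 months.
estimate = estimate_storage_gb(2000, 25, 36)
print(f"Estimated storage needed: {estimate:.0f} GB")
```

Rerunning the estimate with a smaller number of retained images is a quick way to see how much a discard policy would save.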
To avoid being under-prepared, estimate the growth rate of your data. Some questions to consider are
The file format you choose for your data is a primary factor in someone else's ability to access it in the future. Think carefully about what file format will be best to manage, share, and preserve your data. Technology continually changes, and all contemporary hardware and software should be expected to become obsolete. Consider how your data will be read if the software used to produce it becomes unavailable. Although any file format you choose today may become unreadable in the future, some formats are more likely to be readable than others.
|Discouraged Format||Alternative Format|
|Excel (.xls, .xlsx)||Comma Separated Values (.csv)|
|Word (.doc, .docx)||plain text (.txt), or if formatting is needed, PDF/A (.pdf)|
|PowerPoint (.ppt, .pptx)||PDF/A (.pdf)|
|Photoshop (.psd)||TIFF (.tif, .tiff)|
|Quicktime (.mov)||MPEG-4 (.mp4)|
If you find it necessary or convenient to work with data in a proprietary/discouraged file format, do so, but consider saving your work in a more archival format when you are finished.
For more information on recommended formats, see the CDL Digital File Format Recommendations.
Tabular data warrants special mention because it is so common across disciplines, mostly as Excel spreadsheets. If you do your analysis in Excel, you should use the "Save As..." command to export your work to .csv format when you are done. Your spreadsheets will be easier to understand and to export if you follow best practices when you set them up, such as:
These are rough guidelines to follow to help manage your data files in case you don't already have your own internal conventions. When organizing files, the top-level directory/folder should include:
The sub-directory structure should have clear, documented naming conventions. Separate files or directories could apply, for example, to each run of an experiment, each version of a dataset, and/or each person in the group.
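One way to keep a naming convention consistent is to generate paths from a single function rather than typing them by hand. The sketch below is only an example of such a convention: the pattern (project, run number, ISO date, version) and the function name `dataset_path` are hypothetical, and your group may prefer different parts or a different order.

```python
from datetime import date
from pathlib import PurePosixPath


def dataset_path(project, run, version, day, extension="csv"):
    """Build a relative path such as project/run-007/project_run-007_2013-02-22_v02.csv.

    The layout is an illustrative convention, not a standard.
    """
    run_label = f"run-{run:03d}"
    filename = f"{project}_{run_label}_{day.isoformat()}_v{version:02d}.{extension}"
    return PurePosixPath(project) / run_label / filename


# Hypothetical project name and date, used only to show the pattern.
example = dataset_path("soilmoisture", 7, 2, date(2013, 2, 22))
print(example)
```

Zero-padded run and version numbers and yyyy-mm-dd dates make the files sort in a sensible order in any file browser.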
Tools to help you:
Many disciplines have recommendations, for example:
Clear and detailed documentation is essential for data to be understood, interpreted, and used. Data documentation describes the content, formats, and internal relationships of your data in detail and will enable other researchers to find, use and properly cite your data.
Begin to document your data at the very beginning of your research project and continue throughout the project. Doing so will make the process much easier. If you have to construct the documentation at the end of the project, the process will be painful and important details will have been lost or forgotten. Don't wait to document your data!
Data documentation is commonly called metadata – "data about data". Researchers can document their data according to various metadata standards. Some metadata standards are designed for the purpose of documenting the contents of files, others for documenting the technical characteristics of files, and yet others for expressing relationships between files within a set of data. If you want to be able to share or publish your data, the DataCite metadata standard is of particular significance. It is important to establish a metadata strategy that is capable of describing your data and satisfying your data management needs. For assistance in defining an adequate metadata strategy, please contact email@example.com.
Below are some general aspects of your data that you should document, regardless of your discipline. At minimum, store this documentation in a "readme.txt" file, or the equivalent, with the data itself. You can also reference a published article that may contain some of this information.
|Title||Name of the dataset or research project that produced it|
|Creator||Names and addresses of the organizations or people who created the data; preferred format for personal names is surname first (e.g., Smith, Jane).|
|Identifier||Unique number used to identify the data, even if it is just an internal project reference number|
|Date||Key dates associated with the data, including: project start and end date; release date; time period covered by the data; and other dates associated with the data lifespan, such as maintenance cycle, update schedule; preferred format is yyyy-mm-dd, or yyyy.mm.dd-yyyy.mm.dd for a range|
|Method||How the data were generated, listing equipment and software used (including model and version numbers), formulae, algorithms, experimental protocols, and other things one might include in a lab notebook|
|Processing||How the data have been altered or processed (e.g., normalized)|
|Source||Citations to data derived from other sources, including details of where the source data is held and how it was accessed|
|Funder||Organizations or agencies who funded the research|
|Subject||Keywords or phrases describing the subject or content of the data|
|Place||All applicable physical locations|
|Language||All languages used in the dataset|
|Variable list||All variables in the data files, where applicable|
|Code list||Explanation of codes or abbreviations used in either the file names or the variables in the data files (e.g. '999 indicates a missing value in the data')|
|File inventory||All files associated with the project, including extensions (e.g. 'NWPalaceTR.WRL', 'stone.mov')|
|File Formats||Formats of the data, e.g., FITS, SPSS, HTML, JPEG, etc.|
|File structure||Organization of the data file(s) and layout of the variables, where applicable|
|Version||Unique date/time stamp and identifier for each version|
|Checksum||A digest value computed for each file that can be used to detect changes; if a recomputed digest differs from the stored digest, the file must have changed|
|Necessary software||Names of any special-purpose software packages required to create, view, analyze, or otherwise use the data|
|Rights||Any known intellectual property rights, statutory rights, licenses, or restrictions on use of the data|
|Access information||Where and how your data can be accessed by other researchers|
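The elements in the table above can be kept in a simple structure and written out as a "readme.txt" file. The following is a minimal sketch, assuming a plain "Name: value" layout; the function name `build_readme` and every value in the example are hypothetical, and the layout is not a formal metadata standard.

```python
from datetime import date


def build_readme(fields):
    """Render documentation fields as 'Name: value' lines.

    Dates are written as yyyy-mm-dd; lists are joined with semicolons.
    """
    lines = []
    for name, value in fields.items():
        if isinstance(value, date):
            value = value.isoformat()
        elif isinstance(value, (list, tuple)):
            value = "; ".join(str(item) for item in value)
        lines.append(f"{name}: {value}")
    return "\n".join(lines) + "\n"


# Illustrative values only; replace them with your own project's details.
fields = {
    "Title": "Hypothetical stream temperature survey",
    "Creator": "Smith, Jane",
    "Date": date(2013, 2, 22),
    "File Formats": ["CSV", "TIFF"],
    "Code list": "999 indicates a missing value in the data",
}
readme_text = build_readme(fields)
print(readme_text)
```

Saving `readme_text` to a file named readme.txt in the top-level data directory keeps the documentation with the data itself.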
If you want to be able to share or cite your dataset, you'll want to assign a public persistent unique identifier to it. There are a variety of public identifier schemes, but common properties of good schemes are that they are:
Today, this means data identifiers should fit inside a web address (URL) and be well-enough managed to remain actionable over the long-term. The most important factors in long-term data sharing are stable data storage and well-managed identifier redirection.
Any URL can be thought of as "resolving" either directly to its target (via the URL's hostname) or indirectly through one or more "redirects" to a final target URL. If your dataset is moved (e.g., from one archive to another) redirection allows the identifier to continue to resolve to the dataset, now at its new location.
An important factor to consider is choice of URL hostname. This is the domain name at the beginning of a URL (right after the "http://") that determines where URL resolution starts (for example, daac.ornl.gov). An identifier that doesn't contain a hostname may implicitly use a well-known hostname as the starting point for resolution. For example, dx.doi.org is the hostname for DOIs, so the document identified by doi:10.1000/182 can be found by typing "http://dx.doi.org/10.1000/182" in a web browser.
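The DOI example above can be expressed as a small helper that prepends the resolver hostname to the identifier. This is a sketch only: the function name `doi_to_url` is hypothetical, it handles just the "doi:" prefix form shown in the text, and it does not contact the resolver or follow any redirects.

```python
def doi_to_url(identifier, resolver="http://dx.doi.org"):
    """Turn an identifier like 'doi:10.1000/182' into a resolvable URL.

    The default resolver is the hostname named in the text above.
    """
    prefix = "doi:"
    if not identifier.lower().startswith(prefix):
        raise ValueError(f"not a DOI in 'doi:' form: {identifier!r}")
    name = identifier[len(prefix):]
    return f"{resolver}/{name}"


print(doi_to_url("doi:10.1000/182"))
```

Keeping the resolver as a parameter reflects the point that the hostname is a choice: the identifier itself stays the same even if the starting point for resolution changes.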
Because persistent (long-term) identifiers tend to be opaque (e.g., a string of digits) and reveal little or nothing about the nature of the identified object, it is also important for you to maintain metadata associated with the object. Among the most important pieces of metadata for you to maintain is the target URL that ensures that the identifier remains actionable. Whatever identifier scheme you choose, if you don't update the target URL when your data is moved, the identifier will break.
CDL provides an identifier service called EZID that offers several choices of identifier. EZID enables you to take control of the management and distribution of your datasets, share and get credit for your datasets, and build your reputation for the collection and documentation of research. By making data resources easier to access, re-use, and verify, EZID helps you to build on previous work, conduct new research, and avoid duplicating previous work.
Data security is the protection of data from unauthorized access, use, change, disclosure, and destruction. Make sure your data is safe with regard to:
Unencrypted data will be more easily read by you and others in the future, but you may need to encrypt sensitive data.
Uncompressed data will also be easier to read in the future, but you may need to compress files to conserve disk space.
Making regular backups is an integral part of data management. You can back up data to your personal computer, external hard drives, or departmental or university servers. Software that makes backups for you automatically can simplify this process considerably. CDs or DVDs are not recommended because they are easily lost, decay rapidly, and fail frequently. The UK Data Archive provides additional guidelines on data storage, backup and security.
To be sure that your backup system is working, periodically retrieve your data files and confirm that you can read them. You should do this when you initially set up the system and on a regular schedule thereafter.
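Part of this check can be automated by comparing each retrieved file against a checksum recorded when the backup was made. The sketch below assumes SHA-256 digests and a manifest held as a dictionary of relative path to digest; the function names and the demonstration file are hypothetical. A matching digest shows the bytes are unchanged, but it does not replace opening the files to confirm you can still read them.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(manifest, root):
    """Return the relative paths that are missing or whose digest differs."""
    problems = []
    for relative, stored in sorted(manifest.items()):
        candidate = Path(root) / relative
        if not candidate.is_file() or sha256_of(candidate) != stored:
            problems.append(relative)
    return problems


# Demonstration with a throwaway directory standing in for a restored backup.
with tempfile.TemporaryDirectory() as root:
    data_file = Path(root) / "observations.csv"
    data_file.write_text("site,count\nA,3\n")
    manifest = {"observations.csv": sha256_of(data_file)}
    before = changed_files(manifest, root)   # file is intact
    data_file.write_text("site,count\nA,4\n")
    after = changed_files(manifest, root)    # file was altered
print(before, after)
```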
Who controls the data (e.g., the PI, a student, your lab, your university, your funder)? Before you spend a lot of time figuring out how to store, share, and name the data, you should make sure you have the authority to do so.
Is there any requirement that the data be retained? If so, for how long? 3-5 years, 10-20 years, permanently? Not all data need to be retained, and some data required to be retained need not be retained indefinitely. Have a good understanding of your obligation for the data's retention.
Beyond any externally imposed requirements, think about the long-term usefulness of the data. If the data is from an experiment that you anticipate will be repeatable more quickly, inexpensively, and accurately as technology progresses, you may want to store it for a relatively brief period. If the data consists of observations made outside the laboratory that can never be repeated, you may wish to store it indefinitely.
While the first three options above are valid ways to share data, a repository is much more able to provide long-term access. Data deposited in a repository can be supplemented with a "data paper"—a relatively new type of publication that describes a dataset, but does not analyze it or draw any conclusions—published in a journal such as Nature Scientific Data or [Geoscience Data Journal](http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)2049-6060).
You should select a repository or archive for your data based on the long-term security offered and the ease of discovery and access by colleagues in your field. There are two common types of repository to look for:
A searchable and browsable list of repositories can be found at these websites:
Citing data is important in order to:
A dataset should be cited formally in an article's reference list, not just informally in the text. Many data repositories and publishers provide explicit instructions for citing their contents. If no citation information is provided, you can still construct a citation following generally agreed-upon guidelines from sources such as the CODATA Report on data citation, the ESIP Federation Data Citation Guidelines, and the current DataCite Metadata Schema.
There are 5 core elements usually included in a dataset citation, with additional elements added as appropriate.
Creator names in non-Roman scripts should be transliterated using the ALA-LC Romanization Tables.
Although the core elements are sufficient in the simplest case – citation to the entirety of a static dataset – additional elements may be needed if you wish to cite a dynamic dataset or a subset of a larger dataset.
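A citation string can be assembled from its elements with a short helper. The sketch below assumes the pattern recommended by DataCite (creator, publication year, title, publisher, identifier, with an optional version), which may differ from the style your repository or publisher asks for. The function name and all example values are hypothetical.

```python
def format_citation(creator, year, title, publisher, identifier, version=None):
    """Format a dataset citation as 'Creator (Year): Title. [Version.] Publisher. Identifier'.

    The element order follows the DataCite-style pattern assumed for this sketch.
    """
    parts = [f"{creator} ({year}): {title}."]
    if version is not None:
        parts.append(f"Version {version}.")
    parts.append(f"{publisher}.")
    parts.append(identifier)
    return " ".join(parts)


# Hypothetical dataset; the identifier reuses the example DOI from the text.
citation = format_citation(
    "Smith, Jane",
    2013,
    "Hypothetical stream temperature survey",
    "Example Data Repository",
    "doi:10.1000/182",
)
print(citation)
```

The optional `version` argument is one simple way to distinguish releases of a dataset that changes over time.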
If you are a UC researcher and are uncertain about your rights to disseminate data, you can consult your campus Office of General Counsel. Note: Laws about data vary outside the U.S.
For a general discussion about publishing your data, applicable to many disciplines, see the ICPSR Guide to Social Science Data Preparation and Archiving.
It is vital to maintain the confidentiality of research subjects both as an ethical matter and to ensure continuing participation in research. Researchers need to understand and manage tensions between confidentiality requirements and the potential benefits of archiving and publishing the data.
To ethically share confidential data, you may be able to