docs/00_intro/10_fair.mdx
Lines changed: 15 additions & 15 deletions
@@ -28,13 +28,13 @@ In the following, we answer the questions: What makes data FAIR? What do researc
Researchers, and the computers working on their behalf, must be able to find datasets before they can reuse them. Therefore, the first guideline of the FAIR Data Principles outlines methods to ensure a dataset’s discovery.
-### F1. (meta)data are assigned a globally unique and persistent identifier
+### F1. (meta)data are assigned a globally unique and persistent identifier{#f1}
A globally unique and [persistent identifier (PID)](/docs/pid) helps both machines and humans find the data in the first place. These PIDs are essential for research as they guarantee the availability of the associated resource, in this case a dataset. The registry services that make these identifiers available work to maintain the link to the resource, thus avoiding dead links. This ensures the resource remains findable and may be referenced simply by the use of its PID.
A common example of a citable PID is the Digital Object Identifier, or [DOI](https://doi.org/10.1000/182). As with many journals, scientific data repositories often assign a DOI automatically. The Registry of Research Data Repositories, [re3data](https://www.re3data.org/), indicates whether a given repository assigns an identifier, along with the PID type. For example, both the [Cambridge Structural Database (CSD)](https://www.ccdc.cam.ac.uk/solutions/csd-system/components/csd/) and the [Chemotion Repository](https://www.chemotion-repository.net/) assign a DOI to each deposited dataset. Researchers should check for this option when searching for a suitable repository, while repositories should offer this service.
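As a minimal sketch of how a dataset can be referenced simply by its PID, the following Python helper turns the common written forms of a DOI into a resolvable URL. The function name `doi_to_url` is hypothetical, and the check for the `10.` prefix is a deliberately loose assumption rather than a full DOI syntax validation.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI, a 'doi:' form, or a doi.org URL into an https URL."""
    cleaned = doi.strip()
    # Strip any known prefix so that only the bare DOI remains.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    # Loose sanity check (assumption): every DOI starts with the "10." directory.
    if not cleaned.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode unusual characters but keep the prefix/suffix slash.
    return DOI_RESOLVER + quote(cleaned, safe="/")


print(doi_to_url("doi:10.1000/182"))  # https://doi.org/10.1000/182
```

Storing only the bare DOI and building the URL on demand keeps references valid even if the preferred resolver address changes.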
-### F2. data are described with rich metadata (defined by R1 below)
+### F2. data are described with rich metadata (defined by R1 below){#f2}
Data need to be sufficiently described in order to make them both findable and reusable. Hence, the specific focus here lies on making the (meta)data findable by using rich discovery [metadata](/docs/metadata) in a standardized format and allowing computers and humans to quickly understand the dataset’s contents. This is an essential component in the plurality of metadata described by [R1](#r1-metadata-are-richly-described-with-a-plurality-of-accurate-and-relevant-attributes) below. This information may include, but is not limited to:
@@ -46,34 +46,34 @@ Data need to be sufficiently described in order to make them both findable and r
Repositories should provide researchers with a fillable [application profile](https://en.wikipedia.org/wiki/Application_profile) that allows them to give extensive and precise information on their deposited datasets. For example, the Chemotion Repository uses, among others, the [DataCite Metadata Schema](http://doi.org/10.5438/0012), a schema specifically created for the publication and citation of research data, to build its application profile. [RADAR](https://radar.products.fiz-karlsruhe.de/en), including the variant [RADAR4Chem](https://www.nfdi4chem.de/index.php/2650-2/), has also built [its metadata schema](https://radar.products.fiz-karlsruhe.de/en/radarfeatures/radar-metadatenschema) on DataCite. These schemas include an assortment of mandatory, recommended, and optional metadata properties, allowing for a rich description of the deposited dataset. For those publishing data, always keep in mind: the more information provided, the better.
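To illustrate how a repository might enforce the mandatory part of such a schema, here is a small Python sketch. It lists the six properties that the DataCite Metadata Schema marks as mandatory, but the key names and the flat dictionary layout are simplified assumptions for illustration, not the exact DataCite serialization, and the record itself is made up.

```python
# The six DataCite mandatory properties, written with simplified,
# hypothetical key names (not the exact DataCite JSON/XML layout).
MANDATORY = (
    "identifier",
    "creators",
    "titles",
    "publisher",
    "publication_year",
    "resource_type",
)


def missing_mandatory(metadata: dict) -> list[str]:
    """Return the mandatory properties that are absent or empty."""
    return [key for key in MANDATORY if not metadata.get(key)]


record = {
    "identifier": "10.0000/example-dataset",  # made-up DOI for illustration
    "creators": ["Doe, Jane"],
    "titles": ["NMR spectra of compound X"],
    "publisher": "Example Repository",
    "publication_year": 2024,
}
print(missing_mandatory(record))  # ['resource_type']
```

A deposit form could run such a check before submission and then encourage the researcher to fill in recommended and optional properties as well.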
-### F3. metadata clearly and explicitly include the identifier of the data it describes
+### F3. metadata clearly and explicitly include the identifier of the data it describes{#f3}
While [F1](#f1-metadata-are-assigned-a-globally-unique-and-persistent-identifier) stipulates the assignment of an identifier, F3 underlines the importance of including this identifier in the metadata itself. The metadata and the dataset it describes are typically separate files. Including the identifier in the metadata directly links the information to the associated dataset.
Furthermore, the dataset may not be published alongside the metadata. For example, in the case of unpublished archived datasets, the PID can lead to a method (e.g. a landing page) to contact those responsible for the data instead of to the dataset itself.
Researchers must be aware of this importance, while repositories must not only assign a PID as described in [F1](#f1-metadata-are-assigned-a-globally-unique-and-persistent-identifier) above, but should also ensure that this PID is a required property of the metadata.
-### F4. (meta)data are registered or indexed in a searchable resource
+### F4. (meta)data are registered or indexed in a searchable resource{#f4}
Metadata are used to set up indices, enabling machines to efficiently search for and find datasets. For this process to work successfully, metadata must be complete as outlined [above](#f2-data-are-described-with-rich-metadata-defined-by-r1-below). Repositories should ensure the metadata entered for a deposited dataset is available in a machine-readable format to facilitate the assignment of indices.
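The idea of building a search index from machine-readable metadata can be sketched in a few lines of Python. This is a toy inverted index under simplifying assumptions: the record fields (`identifier`, `title`, `keywords`) and the DOIs are hypothetical, and real search services use far more elaborate tokenization and ranking.

```python
from collections import defaultdict


def build_index(records: list[dict]) -> dict[str, set[str]]:
    """Map each lower-cased word in title and keywords to dataset PIDs."""
    index: dict[str, set[str]] = defaultdict(set)
    for record in records:
        words = " ".join([record["title"], *record.get("keywords", [])])
        for token in words.lower().split():
            index[token].add(record["identifier"])
    return index


# Made-up metadata records for illustration.
records = [
    {"identifier": "10.0000/ex-1", "title": "Crystal structure of salt A",
     "keywords": ["XRD"]},
    {"identifier": "10.0000/ex-2", "title": "NMR spectra of salt A",
     "keywords": ["NMR"]},
]
index = build_index(records)
print(sorted(index["salt"]))  # ['10.0000/ex-1', '10.0000/ex-2']
print(sorted(index["xrd"]))   # ['10.0000/ex-1']
```

The sketch also shows why completeness matters: a dataset whose record lacks keywords can only be found through the words in its title.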
## Accessible
Accessible means that humans and machines receive instructions on how to obtain the data. It should be noted that FAIR does not equate to open, as further explained in [A1.2](#a12-the-protocol-allows-for-an-authentication-and-authorization-procedure-where-necessary).
-### A1. (meta)data are retrievable by their identifier using a standardized communications protocol
+### A1. (meta)data are retrievable by their identifier using a standardized communications protocol{#a1}
To guarantee access to datasets, persistent identifiers such as DOIs are recommended, as they are resolved by standard methods. Common protocols include HTTP(S) and (S)FTP.
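As an illustration of retrieval by identifier over a standard protocol, the Python sketch below prepares an HTTPS request that asks the DOI resolver for machine-readable metadata via content negotiation. The helper name `metadata_request` and the DOI are hypothetical. The `Accept` media type shown is the one DataCite documents for its JSON format, and it is an assumption that the DOI in question is registered with an agency that honours it.

```python
from urllib.request import Request


def metadata_request(doi: str) -> Request:
    """Prepare (but do not send) an HTTPS request for a DOI's metadata."""
    return Request(
        f"https://doi.org/{doi}",
        # Ask for metadata instead of the human-readable landing page.
        headers={"Accept": "application/vnd.datacite.datacite+json"},
    )


request = metadata_request("10.0000/example-dataset")  # made-up DOI
print(request.full_url)              # https://doi.org/10.0000/example-dataset
print(request.get_header("Accept"))  # application/vnd.datacite.datacite+json
```

Passing the prepared request to `urllib.request.urlopen` would perform the actual lookup. It is left out here so the example runs without network access.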
-#### A1.1 the protocol is open, free, and universally implementable
+#### A1.1 the protocol is open, free, and universally implementable{#a1_1}
Repositories should only use protocols that allow any computer to access at least the metadata. This refers not only to the use of standard communication protocols, as stated in [A1](#a1-metadata-are-retrievable-by-their-identifier-using-a-standardized-communications-protocol), but also to these protocols being open and free to implement. Therefore, proprietary or non-standard protocols should be avoided.
-#### A1.2 the protocol allows for an authentication and authorization procedure, where necessary
+#### A1.2 the protocol allows for an authentication and authorization procedure, where necessary{#a1_2}
Where necessary, the protocol must signal in a machine-readable way that an action (such as a login) is required to access the data. FAIR data and open data are not synonymous: FAIR requires a clear statement of how the data can be accessed, as opposed to granting anyone and everyone full access. In manuscripts of scientific articles, this information should be included in a [data availability statement](/docs/data_availability_statement). This can be especially important for sensitive data, where, for example, personal data and/or medical information may be disclosed. Hence, repositories should also provide a way for users (and their computers) to identify themselves, enabling access permission to be granted.
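With HTTP(S), such machine-readable signals are carried by standard status codes. The following Python sketch maps a few of them to the action a client would need to take. The function name `access_hint` and the wording of the hints are assumptions for illustration. Only the meaning of the status codes themselves comes from the HTTP standard.

```python
def access_hint(status_code: int) -> str:
    """Translate an HTTP status code into a hint about data access."""
    hints = {
        200: "accessible: the data can be downloaded directly",
        401: "authentication required: log in or send credentials",
        403: "not authorized: request permission from the data owner",
    }
    return hints.get(status_code, "unknown: consult the landing page")


print(access_hint(401))  # authentication required: log in or send credentials
```

A harvesting script can use such signals to tell restricted datasets apart from missing ones, while the metadata stays openly readable.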
-### A2. metadata are accessible, even when the data are no longer available
+### A2. metadata are accessible, even when the data are no longer available{#a2}
The metadata that describe a dataset should be stored in a separate file so that they remain available even if the dataset itself can no longer be accessed. Problems with dataset availability are usually due to 1) the cost of maintaining and storing full datasets and 2) file format deprecation as technologies evolve. Maintaining metadata files is cheaper and simpler and ensures that, at a minimum, details such as contact information remain available. These files should thus be archived indefinitely.
@@ -83,26 +83,26 @@ A repository should clearly state a contingency plan for metadata storage should
Data need to be integrated with and/or compared to other datasets, while computers must be able to interpret and exchange the information. Ideally, they are compatible with standard applications and can thus be integrated into (automated) processing and analysis workflows. Interoperability often functions as a precursor to [reusability](#reusable), as it ensures the compatibility across systems.
-### I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
+### I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation{#i1}
Machines need to be able to understand how to exchange and interpret information. As with humans, a uniform and standard language aids this understanding. In chemistry, a typical example of such an information exchange standard is the [Crystallographic Information File (CIF)](https://doi.org/10.1107/97809553602060000107). This standard also adheres to the aspects described in [I2](#i2-metadata-use-vocabularies-that-follow-fair-principles) and [R1.3](#r13-metadata-meet-domain-relevant-community-standards) below. Simply put, [standard file formats](/docs/format_standards) for a given analytical method ensure that the data and the associated metadata, which typically include measurement details, follow a prescribed format. This ensures both humans and machines receive the information required to interpret the data.
Especially when looking at metadata, effective and efficient machine readability greatly depends on being able to reduce ambiguity. Metadata provides context to datasets. However, machines need to be able to interpret this context. Therefore, the structured schemas chosen by the repositories should include universally applied [ontologies](https://terminology.nfdi4chem.de/ts/ontologies) and controlled vocabularies to define relationships and avoid ambiguity. For example, chemistry-specific repositories should be designed to include [ontologies](/docs/ontology) such as the [Chemical Methods Ontology](https://terminology.nfdi4chem.de/ts/ontologies/chmo) (CHMO) or the [Chemical Information Ontology](https://terminology.nfdi4chem.de/ts/ontologies/cheminf) (CHEMINF) to accurately describe the (meta)data provided. Such ontologies should be based on widely-applied data models, for instance, the
-### I2. (meta)data use vocabularies that follow FAIR principles
+### I2. (meta)data use vocabularies that follow FAIR principles{#i2}
The applied vocabularies or ontologies should be well-documented and resolvable using a PID. For instance, CHMO, mentioned [above](#i1-metadata-use-a-formal-accessible-shared-and-broadly-applicable-language-for-knowledge-representation), uses a [persistent URL (PURL)](https://en.wikipedia.org/wiki/Persistent_uniform_resource_locator), resolvable using a standard web browser through `http`, while the [documentation](https://github.com/rsc-ontologies/rsc-cmo) is publicly available on GitHub.
-### I3. (meta)data include qualified references to other (meta)data
+### I3. (meta)data include qualified references to other (meta)data{#i3}
Related datasets should be linked in a reliable manner, preferably via their PIDs. This includes any previous versions, datasets required to fully use and comprehend the current dataset, or datasets that the dataset builds upon. This relationship should also be described in a meaningful manner. For example, if dataset X is a previous version of dataset Y, it would be described as such rather than simply being described as a related or an associated dataset. Repositories should include a method of referring to other datasets in their metadata form.
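A qualified reference of this kind can be sketched in Python using the field names of the DataCite Metadata Schema (`relatedIdentifier`, `relatedIdentifierType`, `relationType`). The helper `related_identifier` is hypothetical, the set of relation types is only a small subset of the DataCite controlled list, and the DOIs are made up.

```python
# A small subset of the relation types defined by DataCite.
RELATION_TYPES = {
    "IsNewVersionOf",
    "IsPreviousVersionOf",
    "IsDerivedFrom",
    "IsSupplementTo",
}


def related_identifier(pid: str, relation_type: str) -> dict:
    """Build one qualified reference to another dataset via its DOI."""
    if relation_type not in RELATION_TYPES:
        raise ValueError(f"unsupported relation type: {relation_type}")
    return {
        "relatedIdentifier": pid,
        "relatedIdentifierType": "DOI",
        "relationType": relation_type,
    }


# In the metadata of dataset Y: Y is a new version of dataset X.
link = related_identifier("10.0000/dataset-x", "IsNewVersionOf")
print(link["relationType"])  # IsNewVersionOf
```

The metadata of dataset X would carry the inverse statement, `IsPreviousVersionOf`, pointing at the PID of dataset Y, so the relationship can be followed in both directions.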
## Reusable
Many of the previous points lead to one key aspect of data sharing: data reusability. Datasets must be described in a manner that allows the user to easily determine how and under which conditions the data can be reused.
-### R1. (meta)data are richly described with a plurality of accurate and relevant attributes
+### R1. (meta)data are richly described with a plurality of accurate and relevant attributes{#r1}
Related to [F2](#f2-data-are-described-with-rich-metadata-defined-by-r1-below) above, the focus here lies on whether the data, once found, are usable to the person or computer searching. It also stresses giving the data as many attributes as possible. Researchers should not assume that the person (or that person’s computer) looking to (re)use their data is completely familiar with the discipline. Examples of information to assign here include (non-exhaustive list):
@@ -122,15 +122,15 @@ An important piece of information for chemical data are [machine-readable chemic
Repositories should provide data publishers with the opportunity to include a plurality of information in their metadata. This includes giving a wide range of optional and free-fill fields for data publishers to complete.
-#### R1.1. (meta)data are released with a clear and accessible data usage license
+#### R1.1. (meta)data are released with a clear and accessible data usage license{#r1_1}
The metadata should include human- and machine-readable use conditions, such as a [licence](/docs/licences). [Creative Commons](https://creativecommons.org/) licences are commonly used for scientific data. [re3data](https://www.re3data.org/) lists whether a repository allows researchers to directly select a licence or terms of use agreement when depositing data. At a minimum, repositories should allow researchers to add a licence file.
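What a human- and machine-readable licence statement can look like is sketched below in Python, using the rights fields of the DataCite Metadata Schema together with an SPDX licence identifier. The helper `is_machine_readable` and its acceptance rule are assumptions for illustration. A repository may apply stricter criteria.

```python
# A rights entry for CC BY 4.0: the name serves human readers,
# while the URI and SPDX identifier serve machines.
rights = {
    "rights": "Creative Commons Attribution 4.0 International",
    "rightsUri": "https://creativecommons.org/licenses/by/4.0/",
    "rightsIdentifier": "CC-BY-4.0",
    "rightsIdentifierScheme": "SPDX",
}


def is_machine_readable(entry: dict) -> bool:
    """Treat an entry as machine-readable if it has a URI or an identifier."""
    return bool(entry.get("rightsUri") or entry.get("rightsIdentifier"))


print(is_machine_readable(rights))                           # True
print(is_machine_readable({"rights": "Ask the author"}))     # False
```

A free-text statement alone, as in the second call, can still inform a human reader but cannot be evaluated reliably by software.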
-#### R1.2. (meta)data are associated with detailed provenance
+#### R1.2. (meta)data are associated with detailed provenance{#r1_2}
In simple terms: metadata include any relevant history. If the dataset is related to other datasets or based on another researcher’s data, these should be linked via their PID as described in I3. This includes citing or acknowledging others for their work, which also takes their licensing or use agreements into consideration (see [R1.1](#r11-metadata-are-released-with-a-clear-and-accessible-data-usage-license)). Furthermore, metadata should contain machine-readable information on how the data was generated or processed.
-#### R1.3. (meta)data meet domain-relevant community standards
+#### R1.3. (meta)data meet domain-relevant community standards{#r1_3}
As research data management and, with it, [data publishing](/docs/data_publishing) become more and more prevalent across research areas, [best practices](/docs/best_practice) in the individual communities will arise. These should encompass metadata templates for proper documentation of datasets, how the data should be [organized](/docs/data_organisation), which vocabularies or [ontologies](/docs/ontology) to use, and [file formats](/docs/format_standards). NFDI4Chem is working to establish [metadata and data standards](https://www.nfdi4chem.de/index.php/task-areas/) for the various communities in chemistry.