
Introduction to Data Repositories

EDS-213: Metadata, Data Modeling and Data Semantics

Julien Brun

NCEAS, UCSB

Fall 2021

1 / 26

Motivation

2 / 26

The Need for Data Management -- Big Data

3 / 26

Data Deluge

4 / 26

Why Manage Data? Advancement of Science

  • Data is a valuable asset – it is expensive and time consuming to collect

  • Data should be managed to:

    • maximize the effective use and value of data and information assets

    • continually improve data quality, including accuracy, integrity, integration, and timeliness of data capture

    • ensure appropriate use of data and information

    • ensure sustainability and accessibility in the long term for re-use in science

5 / 26

The Need for Data Management -- Public Perception

6 / 26

"The climate scientists at the center of a media storm over leaked emails were yesterday cleared of accusations that they fudged their results and silenced critics, but a review found they had failed to be open enough about their work."

Why Manage Data? Researcher Perspective

  • Keep yourself organized – be able to find your files

  • Track your science processes for reproducibility

  • Quality control your data more efficiently

  • Avoid data loss (e.g. by making backups)

  • Gain credibility and recognition for your science efforts through data sharing!

7 / 26

The Need for Data Management: Data Entropy

Michener et al. 1997; Vines et al. 2014

8 / 26

The Data Life Cycle

9 / 26

Data Reuse

10 / 26

Barriers to Data Reuse

  • Data not preserved

    • Only a tiny proportion of ecological data is readily available

  • Dispersed, isolated repositories

    • Each community has its own; they are disconnected and underutilized

  • Lack of software interoperability

    • Metacat, DSpace, Mercury, iRODS, XMCat, OPeNDAP, ...

  • Heterogeneous data

    • Many data formats, metadata formats, and varying semantics

11 / 26

Solutions

  • Preserve data

  • Adopt standards (e.g. metadata, APIs, ...)

  • Create networks

  • Use interoperable formats and data models

12 / 26

Preserving Data

  • Datasets are preserved with long-term commitment

  • Datasets are versioned and citable

  • Datasets are searchable and discoverable

13 / 26

FAIR

14 / 26

FAIR Data Guiding Principles

Concise and measurable set of principles to enhance the reusability of data

  • Data should be Findable

  • Data should be Accessible

  • Data should be Interoperable

  • Data should be Re-usable

15 / 26

To be FINDABLE

F1. (meta)data are assigned a globally unique and eternally persistent identifier

F2. data are described with rich metadata

F3. (meta)data are registered or indexed in a searchable resource

F4. metadata specify the data identifier
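
The four Findable principles can be illustrated with a toy in-memory catalog. This is only a minimal sketch, not how any real repository works: the `MetadataRecord` class, the `meta:`/`data:` identifiers, and the keyword index are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    identifier: str       # F1: identifier of the metadata record (hypothetical scheme)
    data_identifier: str  # F4: the metadata names the data it describes
    title: str            # F2: descriptive metadata
    keywords: list = field(default_factory=list)  # F2: more descriptive metadata


def build_index(records):
    """F3: register records in a searchable resource (here, a keyword index)."""
    index = {}
    for record in records:
        for keyword in record.keywords:
            index.setdefault(keyword.lower(), []).append(record.identifier)
    return index


def search(index, keyword):
    """Return the identifiers of all records indexed under this keyword."""
    return index.get(keyword.lower(), [])


# Illustrative records only; these are not real datasets.
records = [
    MetadataRecord("meta:0001", "data:0001", "Kelp forest surveys", ["kelp", "Ecology"]),
    MetadataRecord("meta:0002", "data:0002", "Stream temperature logs", ["hydrology", "ecology"]),
]
index = build_index(records)
print(search(index, "ecology"))  # ['meta:0001', 'meta:0002']
```

A real repository would replace the made-up identifiers with persistent ones such as DOIs and the dictionary with a proper search service.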

16 / 26

To Be ACCESSIBLE

A1. (meta)data are retrievable by their identifier using a standardized communications protocol

A1.1. the protocol is open, free, and universally implementable

A1.2. the protocol allows for an authentication and authorization procedure, where necessary

A2. metadata are accessible, even when the data are no longer available
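
As one possible illustration of A1, a DOI can be retrieved over HTTPS through the public resolver at https://doi.org/. The sketch below only builds the request and does not send it. The DOI `10.1234/example-dataset` is a made-up placeholder, the helper names are hypothetical, and the `Accept` header assumes the DOI's registration agency supports content negotiation for citation metadata.

```python
from urllib.parse import quote
from urllib.request import Request

DOI_RESOLVER = "https://doi.org/"


def normalize_doi(identifier):
    """Strip common prefixes such as 'doi:' or 'https://doi.org/'."""
    identifier = identifier.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if identifier.lower().startswith(prefix):
            return identifier[len(prefix):]
    return identifier


def doi_request(identifier, want_metadata=False):
    """Build (but do not send) an HTTPS request that resolves a DOI."""
    url = DOI_RESOLVER + quote(normalize_doi(identifier), safe="/")
    headers = {}
    if want_metadata:
        # Assumption: the registration agency honors this media type.
        headers["Accept"] = "application/vnd.citationstyles.csl+json"
    return Request(url, headers=headers)


# Placeholder DOI, not a real dataset.
req = doi_request("doi:10.1234/example-dataset", want_metadata=True)
print(req.full_url)  # https://doi.org/10.1234/example-dataset
```

Passing `req` to `urllib.request.urlopen` would perform the actual retrieval; HTTPS is open, free, and widely implemented, and it supports authentication where a repository needs it.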

17 / 26

To Be INTEROPERABLE

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation

I2. (meta)data use vocabularies that follow FAIR principles

I3. (meta)data include qualified references to other (meta)data

18 / 26

To Be RE-USABLE

R1. (meta)data have a plurality of accurate and relevant attributes

R1.1. (meta)data are released with a clear and accessible data usage license

R1.2. (meta)data are associated with their provenance

R1.3. (meta)data meet domain-relevant community standards

19 / 26

Data Repositories

20 / 26

What is a data repository?



System             Long-Term  Versioned  Citable  Discoverable
Google Drive       maybe      maybe      no       no
GitHub             yes        yes        no       no
University Server  maybe      no         no       maybe
KNB                yes        yes        yes      yes
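
The comparison above can be read as a simple checklist. The sketch below encodes the same ratings and keeps only the systems that meet every criterion; treating "maybe" as "not met" is an assumption made for this example, and the function names are hypothetical.

```python
CRITERIA = ("long_term", "versioned", "citable", "discoverable")

# Ratings copied from the table, in the same column order as CRITERIA.
SYSTEMS = {
    "Google Drive": ("maybe", "maybe", "no", "no"),
    "GitHub": ("yes", "yes", "no", "no"),
    "University Server": ("maybe", "no", "no", "maybe"),
    "KNB": ("yes", "yes", "yes", "yes"),
}


def qualifies_as_repository(ratings):
    """Assumption: only a firm 'yes' on every criterion counts."""
    return all(rating == "yes" for rating in ratings)


def unmet_criteria(name):
    """List the criteria a system does not clearly satisfy."""
    return [c for c, rating in zip(CRITERIA, SYSTEMS[name]) if rating != "yes"]


repositories = [name for name, ratings in SYSTEMS.items()
                if qualifies_as_repository(ratings)]
print(repositories)              # ['KNB']
print(unmet_criteria("GitHub"))  # ['citable', 'discoverable']
```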
21 / 26

Data Repositories

22 / 26

DataONE

Federation of data repositories https://www.dataone.org/

23 / 26

Google Dataset Search

https://datasetsearch.research.google.com/

24 / 26

Finding the Right Repository

Search engine for data repositories https://www.re3data.org/

25 / 26

Acknowledgement

This presentation has been adapted from the CRESCYNT training course organized by NCEAS and CRESCYNT. Credit goes to Matt Jones, Amber Budden, Jeanette Clark and many more.

26 / 26
