Documenting things

https://twitter.com/JenMsft/status/1557218211971489792

The power of README

README files are not a new thing; they have accompanied computer projects since the early days. One great thing about the widespread support for Markdown syntax and its web rendering in most code repositories is that you can move beyond a simple text file and present a compelling entry point to your project, one that links to its various parts and to external resources.

Good types of information to have in a README:

  • Title capturing the essence of the project
  • List of current contributors
  • A short explanation of the goal / purpose
  • How to install / where to start
  • A quick demo on how to use the content (can be a link to another document as well)
  • What to do if a bug is spotted
  • How to contribute
  • Licensing
  • Acknowledgements of authors, contributors, sponsors or other related work
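The sections above can be sketched as a minimal README skeleton (a generic template, not tied to any specific project — adapt the headings to your needs):

```markdown
# Project Title

A short explanation of the goal / purpose of the project.

## Installation

## Usage

A quick demo, or a link to a longer document.

## Reporting bugs

## Contributing

## License

## Acknowledgements
```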

Adding images and short videos / animations can make a README more engaging.

Need some inspiration?

  • Here is an interesting template: https://github.com/navendu-pottekkat/awesome-readme/tree/master

  • When you start an R package with the usethis package, a README is created for you with all the relevant sections for this type of project.

  • Pick a package you like and inspect its README.

Making your code readable

https://twitter.com/cjm4189/status/1557346489613094914

It is important to make your code easy to read if you hope that others will reuse it. This starts with using a consistent style within your scripts (at least within a project).

In Python, running import this prints the Zen of Python:

import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

There is also the visual aspect of the code, which should not be neglected. Like prose, if you receive a long text without any paragraphs, you might not be very excited about reading it. Indentation, spaces, and empty lines should be leveraged to make a script visually inviting and easy to read. The good news is that most Integrated Development Environments (IDEs) will help you do so by auto-formatting your scripts according to conventions. Note also that many IDEs, such as RStudio, rely on conventions to ease the navigation of scripts and notebooks. For example, in an R script, try ending a comment line with four or more - or # characters: RStudio will treat it as a named section in the document outline!

Comments

"Real Programmers don't comment their code. If it was hard to write, it should be hard to understand."
— Tom Van Vleck, based on people he knew (https://multicians.org/thvv/realprogs.html)

Joking aside, it is really hard to comment your code too much, because even steps that seem trivial today might not be so in a few weeks or months from now. In addition, well-commented code is more likely to be read by others. Note also that comments should complement the code; they should not be seen as a workaround for vague naming of variables or functions.

x <- 9.81            # gravitational acceleration (comment compensates for a vague name)

gravity_acc <- 9.81  # gravitational acceleration (descriptive name; comment adds context)

Inline

Whether you are using a script or a notebook, it is important to provide comments alongside your code to complement it by:

  • explaining what the code does
  • capturing decisions that were made on the analytical side. For example, why a specific value was used for a threshold.
  • specifying when some code was added to handle an edge case, such as an unexpected value in the data (so a new user does not have to guess what those lines do, and is not tempted to delete them as unnecessary)

Other thoughts:

  • It is OK to state (what seems) the obvious (some might disagree with this statement)
  • Try to keep comments to the point and short
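As a sketch of these ideas, here is what such comments might look like in practice. The function, threshold, and sentinel value below are made up for illustration:

```python
# Hypothetical example: cleaning water temperature readings.

def filter_temperatures(readings):
    """Keep plausible water temperature readings (in degrees C)."""
    cleaned = []
    for temp in readings:
        # Analytical decision: values above 40 C are treated as sensor
        # errors (threshold chosen after inspecting the field data).
        if temp > 40:
            continue
        # Edge case: the logger writes -9999 for missing values, so
        # these are not real measurements and must be dropped.
        if temp == -9999:
            continue
        cleaned.append(temp)
    return cleaned

print(filter_temperatures([12.5, -9999, 55.0, 18.2]))  # → [12.5, 18.2]
```

Without the comments, a new user might not know whether the `40` and `-9999` checks are essential or safe to delete.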

Functions

Both Python and R have conventions on how to document functions. Adopting those conventions will help you make your code readable, and will also let you automate part of the documentation development.

Roxygen2

The goal of roxygen2 is to make documenting your code as easy as possible. It can dynamically inspect the objects that it’s documenting, so it can automatically add data that you’d otherwise have to write by hand.

How do we insert it? Make sure your cursor is inside the function you want to document, then use the RStudio menu: Code -> Insert Roxygen Skeleton.

Example:

#' Add together two numbers
#'
#' @param x A number
#' @param y A number
#' @return The sum of \code{x} and \code{y}
#' @examples
#' add(1, 1)
#' add(10, 1)
add <- function(x, y) {
  x + y
}

Try it!

  • Copy the function (without the documentation) into a new script
  • Add a third parameter to the function so that it sums 3 numbers
  • Add the Roxygen skeleton
  • Fill it in to best describe your function

Note that when you are developing an R package, the Roxygen skeleton is used to generate the help pages of your package, so you only have one place to update and the help pages stay in sync automatically.

Python Docstring

A docstring is a string literal that occurs as the first statement in a module, function, class, or method definition. Such a docstring becomes the __doc__ special attribute of that object.

def complex(real=0.0, imag=0.0):
    """Form a complex number.

    Keyword arguments:
    real -- the real part (default 0.0)
    imag -- the imaginary part (default 0.0)
    """
    if imag == 0.0 and real == 0.0:
        return complex_zero

See here for more: https://www.python.org/dev/peps/pep-0257/
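To see the __doc__ attribute in action, here is a minimal sketch (the function name and docstring text are illustrative):

```python
def add_numbers(x, y):
    """Return the sum of x and y."""
    return x + y

# The docstring literal becomes the __doc__ attribute of the function;
# it is also what help(add_numbers) displays.
print(add_numbers.__doc__)  # → Return the sum of x and y.
```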

Leveraging Notebooks

We have discussed and experimented with notebooks during the week because notebooks provide space to further develop content, such as methodology, around the code of your analysis. Notebooks also enable you to integrate the outputs of your scientific research with the code that was used to produce them. Finally, notebooks can be rendered into various formats that let you share them with a broad audience.

Notebooks are not only used within the scientific community; see here for some thoughts from the Airbnb data science team.


Hands-on

Documenting

getPercent <- function( value, pct ) {
    result <- value * ( pct / 100 )
    return( result )
}

Try adding the Roxygen skeleton to this function and fill in all the information you think is necessary to document it.

Commenting

Let’s try to improve the readability and documentation of this repository: https://github.com/brunj7/better-comments. Follow the instructions on the README

For inspiration, you can check out the NASA code for Apollo 11, dating from 1969: https://github.com/chrislgarry/Apollo-11


Metadata

This topic will be the focus of our Fall course EDS-213. It is a very important topic for scientific reproducibility; today we will only provide a partial overview of this broader subject.

Data life cycle, DataONE

Metadata (data about data) is an important part of the data life cycle because it enables data reuse long after the original collection. The goal is to have enough information for the researcher to understand the data, interpret the data, and then re-use the data in another study.

Here are good questions to answer with your metadata:

  • What was measured?
  • Who measured it?
  • When was it measured?
  • Where was it measured?
  • How was it measured?
  • How is the data structured?
  • Why was the data collected?
  • Who should get credit for this data (researcher AND funding agency)?
  • How can this data be reused (licensing)?

How do you organize all this information? You could use a free-form format, like a README file or a spreadsheet. But there is also great advantage to using a more standardized format that makes the content not only human readable but also machine readable. This enhances data discovery, since specific information can be tagged or attributed to specific aspects of your data (e.g., spatial or temporal coverage, taxonomy, …).
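As a toy illustration of what "machine readable" buys you, here is a tiny structured metadata record. The field names and values are made up and far simpler than a real standard such as EML, which defines a much richer, validated schema:

```python
import json

# Hypothetical, simplified metadata record for a data set.
metadata = {
    "title": "Lake temperature survey",
    "creator": "Jane Doe",
    "temporal_coverage": {"start": "2021-06-01", "end": "2021-08-31"},
    "spatial_coverage": "Lake Tahoe, CA/NV, USA",
    "license": "CC-BY-4.0",
}

# Because the record is structured, a data catalog can extract, say,
# the temporal coverage without a human reading a free-form README.
print(json.dumps(metadata, indent=2))
```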

There are a number of environmental metadata standards (think, templates) that you could use, including the Ecological Metadata Language (EML), Geospatial Metadata Standards like ISO 19115 and ISO 19139, the Biological Data Profile (BDP), Dublin Core, Darwin Core, PREMIS, the Metadata Encoding and Transmission Standard (METS), and the list goes on and on.

Data provenance & semantics

Data provenance is tracing the origin of a data set back to the raw data that were used as input to the processing / analysis that led to its creation. It can be done more or less formally, and this is an active area of research. Today we will focus on capturing information about the data you are collecting. Here is a set of good questions to help you in that process:

  • Source / owner (person, institution, website, …)
  • When was it acquired?
  • By whom in the working group?
  • Where is it currently located (Google Drive, server, …)?
  • Short description of the data
  • Whether it is used in your analysis

Here is a template of a data log that could help to store this information.
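For illustration, such a data log can be as simple as one table row per data set (the columns and the example entry below are hypothetical):

```markdown
| Dataset        | Source / owner | Acquired   | By   | Location     | Description        | Used in analysis? |
|----------------|----------------|------------|------|--------------|--------------------|-------------------|
| lake_temps.csv | Agency website | 2022-08-10 | J.D. | Google Drive | Daily lake temps   | yes               |
```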

Another important and related aspect, and also an active field of research, is data semantics. Data sets often store complex information and concepts that can be described more or less accurately. Let's take an example: you have received a CSV file storing a table with several variables from a fish stock assessment. One of the variables is named "length". However, there are many ways to measure the length of a fish. Which one is it?

Data semantics aims at clearly identifying those concepts by relying on vocabularies and ontologies, such as ENVO in the environmental sciences. In addition, it enables leveraging the relations between those concepts to help with (data) discovery.

Licensing

It is good practice to add a license to a repository / project. It helps clarify the expectations around using, and potentially contributing to, the work.

Here is a good website to choose a license:

Here is also a good set of instructions on how to make this happen on a GitHub repository: https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/creating-a-repository-on-github/licensing-a-repository

Note that for content (such as this course), there is also another type of licensing that can be used: https://creativecommons.org/licenses/

Further Reading

Acknowledgements

Large portions of this material have been adapted from the NCEAS Reproducible Research Techniques for Synthesis course: https://learning.nceas.ucsb.edu/2021-02-RRCourse/



Bren School logo

The original parts of this work are licensed under a Creative Commons Attribution 4.0 International License.

This website was made with quarto by Posit.