class: center, middle, inverse, title-slide # Acessing Data in a Programmatic Way ## APIs and more ### Julien Brun ### MEDS-213 ### 2021/10 --- class: center, middle, inverse # the Internet --- # Basic networking * Computers networked on the Internet are called **hosts** * Host computers connect via networking equipment * Can send messages to each other over communication protocols * Client: the host *initiating* the request * Server: the host *responding* to a request ![](images/tcp_ports.jpg) --- # Routed networks ![](images/routers_archs.gif) * Routing is essential to connecting networks together via forwarding data along the best **routing paths** * Routers are special pieces of equipment (computers in themselves) that sit between two or more networks. They either pass on ("forward"), or block information (in Internet parlance these bits of information are packaged together in **packets**, from one network to another --- # IP Adress * Hosts are assigned a unique IP number used for all communication and routing - 128.111.220.7 - 128.111 **is** UCSB - 128.111.220 **is** NCEAS - above is a simplification, but is generally the case * Each IP Address can be used to communicate over various "ports" - Ports allow multiple applications to communicate with a host without mixing up traffic - the standard Web port for HTTP is port 80 - the standard port for sending email is 25 - the File Transfer Protocol FTP is 22 --- # Host Names IP numbers can be difficult to remember, so the Internet also has a way of assigning hostnames to IP numbers - Handled through the global, decentralized, and hierarchical **Domain Name System** (DNS) - Clients first look up a hostname in DNS to find the IP Number - `taylor.bren.ucsb.edu` -> 128.111.??.?? (Can you check!?) - Then they open a connection to the IP Number - The special computers handling the mapping of IP Numbers to Hostnames are called **Nameservers** --- class: center, middle, inverse # The World Wide Web --- # HTTP: Hypertext Transfer Protocol At the heart of web communications is the request message, which is sent via *U*niform *R*esource *L*ocators (URLs). Basic `URL` structure: ![credits: https://code.tutsplus.com/tutorials/http-the-protocol-every-web-developer-must-know-part-1--net-31177](images/http1-url-structure.png) The protocol is typically http or https for secure communications. The default port is 80, but one can be set explicitly, as illustrated in the above image. The resource path is the local path to the resource on the server. --- # Request ![credits: https://code.tutsplus.com/tutorials/http-the-protocol-every-web-developer-must-know-part-1--net-31177](./images/http1-req-res-details.png) The actions that should be performed on the host are specified via HTTP verbs. Today we are going to focus on two actions that are often used in web forms: - `GET`: fetch an existing resource. The URL contains all the necessary information the server needs to locate and return the resource. - `POST`: create a new resource. POST requests usually carry a payload that specifies the data for the new resource. --- # Response Status codes: - `1xx`: Informational Messages - `2xx`: Successful; most known is 200: OK, request was successfully processed - `3xx`: Redirection - `4xx`: Client Error; the famous 404: resource not found - `5xx`: Server Error --- # Document Types -- HTML The *H*yper*T*ext *M*arkup *L*anguage (`HTML`) describes and defines the content of a webpage. Other technologies besides HTML are generally used to describe a webpage's appearance/presentation (CSS) or functionality (JavaScript). "Hyper Text" in HTML refers to links that connect webpages to one another, either within a single website or between websites. Links are a fundamental aspect of the Web. HTML uses "markup" to annotate text, images, and other content for display in a Web browser. HTML markup includes special "elements" such as `<head>`, `<title>`, `<body>`, `<header>`, `<footer>`, `<article>`, `<section>`, `<p>`, `<div>`, `<span>`, `<img>`, and many others. Using you web browser, you can inspect the HTML content of any webpage of the World Wide Web. --- class: center, middle, inverse # Application Programming Interfaces --- # APIs Constructs built into third-party platforms (e.g. Twitter, Facebook, data repositories) that allow you to use some of those platform's functionality in your own script. They abstract more complex code away from you, providing some easier syntax to use in its place. For example, get the latest Tweets on a specific topic or list all the issues on a GitHub repository --- # REST APIs A lot of data provider use this type of API. REST stands for _Representational State Transfer_. Those APIs a some similarities: - Use HTTP methods explicitly. These are GET, PUT, POST, DELETE - Expose directory structure-like URIs - Transfer XML, JavaScript Object Notation (JSON), or both - REST APIs are well-suited to build libraries on top of them --- # Authentification <br> ### You often need to use some of kind of authentication (often token) to use an API. <br> ### This is mostly because data providers do not want one user using all the server resources. --- # What type of objects do you get back? APIs often return xml or JSON documents either for the data themselves but also for some kind of metadata <img src="images/api-plug-socket.png" width="80%" style="display: block; margin: auto;" /> --- # Document Types -- XML The e**X**tensible **M**arkup **L**anguage (XML) provides a general approach for representing all types of information, such as data sets containing numerical and categorical variables. XML provides the basic, common, and quite simple structure and syntax. It is the building block of `HTML`, `SVG` and `EML` are specific vocabularies of XML. ```xml <?xml version="1.0" encoding="ISO-8859-1"?> <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note> ``` --- # Document Types -- JSON **J**ava**S**cript **O**bject **N**otation (JSON) is an open data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values) --- # Document Types -- JSON <img src="https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/logo.png" style="float:right" width="20%"> <img/> ```json [ { "gender": ["male"], "first_name": ["Mumble"], "measurements": [ { "date": ["2007-11-11"], "body_mass": [4801], "flipper_length": [181] }, { "date": ["2007-11-16"], "body_mass": [5699], "flipper_length": [182] }, { "date": ["2007-11-19"], "body_mass": [5743], "flipper_length": [181] } ] } ] ``` --- class: center, middle, inverse # Interacting with APIs --- # The R_aw way `httr` (httr2 is around the corner!) is an R package that let's you send requests to web services. Let's try to use the GitHub API to list all my repositories ```r library(httr) library(purrr) # Querying the the GitHub REST api r <- GET("https://api.github.com/users/brunj7/repos") # Extract the content from the response my_repos_list <- content(r) # Extract what we want from the list my_repos <- map_chr(my_repos_list, "full_name") my_repos[1:2] ``` ``` ## [1] "brunj7/adv-r" "brunj7/aquaculture-crop-feed" ``` -- .large[ **Note that since xml and JSON are often nested formats, you will more likely end up with a `list` as an R Object.** ] --- # Your Turn! List all your repos! --- # The Bette_R way The good news is that there is often an R package for that!! There are many R packages out there that will let you interact with APIs in a very convenient way. - RopenSci: https://ropensci.org/packages/data-access/ - Check Data Provider on GitHub! e.g. https://github.com/USGS-R/dataRetrieval --- # Exercise Plot the discharge time-series for the Ventura River from 2019-10-01 to 2020-09-30 Webpage: https://waterdata.usgs.gov/nwis/uv?site_no=11118500 Tutorial for `dataRetrieval`: https://cran.r-project.org/web/packages/dataRetrieval/vignettes/dataRetrieval.html#daily-data --- # Debugging code using API ### Note that begugging code relying on an API is hard!! <br> Since you are using a web service, you have now a few more moving parts that can make your code fail, espcially if there is connectivity issues (including paywall) and potential server issues (or limitations). **A good sign of an unreliable API is when your code seems to not consistently fail or provide variable results when ran several times in a row.** --- # What if there is no API ? <br> ### Developing and maintaining an API requires resources and not all data provider can afford it! <br> ### If no APIs are available, then you can try to use web scraping to collect data. Quick intro to webscrapping [here](https://r-meetup-sb.github.io/web-scraping-rvest/index.html#4_Web_scraping_workflow) Please always check that the data you are scraping is publicly available and that there is no personal or confidential information gathered. Also please do not overload the web server you are scraping