3 Computing environment setup

The source code published on GitHub alongside the original study combines three programming languages: Python, R and Julia. Although the code is generally clearly written and includes some explanatory comments, several adjustments are required to run it on a different machine. This chapter focuses on the configuration (operating system, statistical software and specific libraries) required to run the program.

3.1 Procedure

Scientific journals' guidelines on reproducibility recommend that authors publish their code with a file named README that specifies the configuration of the computing environment and the procedure to follow to obtain the same results (for guidance on code publication, see the guidance for social science journals or for Nature journals). This source code only includes an empty README file, so we made several attempts to guess a working configuration. We tested the code on the following platforms:

A Windows personal computer with 32 GB of RAM and an 8-core processor: the file 001 - dl_GEE.py ran successfully, but the subsequent code files failed because they call Linux commands (e.g. wget) or rely on libraries that seem to be Unix-native.
A Linux (Ubuntu 22.04) computer with 8 GB of RAM and a 4-core processor: the code failed to run due to a lack of memory.
A Linux pod (Ubuntu 22.04) on a Kubernetes server (Onyxia/SSP Cloud) running a Docker image with R and RStudio, on which we installed a Python distribution (miniconda) with the R package {reticulate} and a Julia distribution (version 18.5) with the R package {JuliaCall}: the script files 002 - dl_IUCN.R, 003 - species_richness.py and 004 - join_rasters.R ran successfully (after some trial and error and some corrections in the code, documented in the following chapters). However, the script 005 - covariate_matching.jl failed to start, apparently because Julia did not identify dependencies required by the ArchGDAL package (CURL_4), although these dependencies are present on the system.
A Linux pod (Ubuntu 22.04) on a Kubernetes server (Onyxia/SSP Cloud) running a Docker image with Python and Julia, on which we installed R and spatial dependencies with the Linux apt package manager (following a procedure documented by ThinkR on RTasks): the script files 002 - dl_IUCN.R, 003 - species_richness.py and 004 - join_rasters.R ran successfully and the script 005 - covariate_matching.jl started running, but failed further down in the execution process. Because the failure happens in parallel processing, the error messages are not meaningful and we need to contact the authors to help us identify the cause of the error.

Working with Kubernetes pods makes it possible to mobilize large memory and processing resources, and to adapt the configuration flexibly. However, pods must be deleted after use and re-created for each new use, which is time-consuming. With the help of the Onyxia administrators, we are preparing a Docker image that includes R, Python and Julia with all the required packages. This should speed up the re-creation process.
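
The platforms tested above differ mainly in operating system, memory and number of cores. As a minimal sketch (not part of the original code), the following Python functions report these three characteristics before a run. The names total_memory_gb and platform_summary are hypothetical, and memory is read through os.sysconf, which is assumed to be usable only on Unix-like systems; elsewhere the value is None.

```python
import os
import platform


def total_memory_gb():
    """Return physical memory in GB, or None where os.sysconf cannot report it."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing on Windows, and some names are not
        # defined on every Unix-like system
        return None
    return pages * page_size / 1024**3


def platform_summary():
    """Collect the characteristics that distinguished the tested platforms."""
    return {
        "system": platform.system(),   # e.g. "Linux" or "Windows"
        "cores": os.cpu_count(),       # logical cores, may be None
        "memory_gb": total_memory_gb(),
    }


if __name__ == "__main__":
    print(platform_summary())
```

Printing this summary at the top of a run makes it easier to relate a failure, such as the lack of memory reported above, to the machine on which it occurred.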

3.2 Technical environment and prerequisites

We document the execution process by including all package installations, code modifications and script executions in a literate programming format. Different programming languages can be combined in literate programming platforms such as Jupyter or R Markdown, or its successor Quarto. We decided to use Quarto because of its versatility and our familiarity with this tool. Quarto can be obtained at www.quarto.org.
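
Since the published README is empty, one way to document the configuration is to record it at execution time. The following Python sketch is an illustration under our own assumptions, not something taken from the original repository: the function name describe_environment is hypothetical, and the package names in the usage example are simply taken from the requirements of scripts 001 and 003.

```python
import platform
import sys
from importlib import metadata


def describe_environment(packages):
    """Return README-style lines describing the current computing environment."""
    lines = [
        f"Operating system: {platform.system()} {platform.release()}",
        f"Python: {sys.version.split()[0]}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = "not installed"
        lines.append(f"{name}: {version}")
    return lines


if __name__ == "__main__":
    # Package names taken from the requirements of scripts 001 and 003
    print("\n".join(describe_environment(["rasterio", "pandas", "fiona"])))
```

The resulting lines can be pasted into a README or printed in a Quarto chunk, so that each rendered document carries a description of the environment that produced it.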

The code also requires Linux, or a Windows machine running Windows Subsystem for Linux (WSL); otherwise a substantial rewriting of some scripts is needed. We ran it on a Windows personal computer. We have not tried, but it might be possible to run it on macOS, which is similar to Linux in several respects (both are Unix-like systems).

The first script, 001 - dl_GEE.py, requires a Gmail account to access Google services, and registration on Google Earth Engine.

3.3 Set up R

The following script installs the R dependencies called in the code. We are not certain that all dependencies are actually used in the scripts.

Code

# Install from GitHub ----------------------------------------------------------

# Some packages need to be installed from developer sources because they are
# not available from official sources, or the official versions have issues.

# remotes is needed for the GitHub installations below
if (system.file(package = "remotes") == "") {
  install.packages("remotes", repos = "https://cran.irsn.fr/")
}
# Installing version 1.3.5, which is apparently the last working version
if (system.file(package = "doMC") == "") {
  remotes::install_github("https://github.com/cran/doMC/tree/fbea362b96cc4469deb6065ff9fbd5d4794ccac1")
}
if (system.file(package = "gdalUtils") == "") {
  remotes::install_github("https://github.com/cran/gdalUtils", upgrade = FALSE)
} 
if (system.file(package = "ggeasy") == "") {
  remotes::install_github("jonocarroll/ggeasy")
} 
if (system.file(package = "velox") == "") {
  remotes::install_github("https://github.com/hunzikp/velox", upgrade = FALSE)
} 
if (system.file(package = "rnaturalearth") == "") {
  remotes::install_github("https://github.com/ropensci/rnaturalearth")
} 
# Fallback source: only reached if the gdalUtils installation above failed
if (system.file(package = "gdalUtils") == "") {
  remotes::install_github("gearslaboratory/gdalUtils")
}
if (system.file(package = "geoarrow") == "") {
  remotes::install_github("paleolimbot/geoarrow")
} 


# Install from CRAN ------------------------------------------------------------

# These packages are available from the usual source from R
required_packages <- c( # List all required R packages
  "reticulate", # To interact with python (normally installed with Quarto)
  "JuliaCall", # To interact with Julia 
  "tidyverse", # To facilitate data manipulation
  "aws.s3", # to interact with S3
  # All packages below are used in the code files of Wolf et al.:
  "countrycode",
  "cowplot",
  "data.table",
  "dtplyr",
  "fasterize",
  "foreach",
  "foreign",
  "ggforce",
  "ggplot2",
  "ggrepel",
  "GpGp",
  "grid",
  "jsonlite",
  "landscapetools",
  "lme4",
  "MCMC.OTU",
  "ncdf4",
  "parallel",
  "pbapply",
  "plyr",
  "raster",
  "rasterVis",
  "rbounds",
  "RColorBrewer",
  "RCurl",
  "readr",
  "reshape2",
  "rgdal",
  "rjson",
  "rnaturalearth",
  "scales",
  "sf",
  "smoothr",
  "spaMM",
  "spgwr",
  "spmoran",
  "spNNGP",
  "stars",
  "stringr",
  "tidyverse",
  # "unix",
  "velox",
  "viridis",
  "wbstats",
  "wdpar") 
# Compare against package names only (installed.packages() returns a matrix)
missing <- !(required_packages %in% rownames(installed.packages()))

# Install 
if(any(missing)) install.packages(required_packages[missing], 
                                  repos = "https://cran.irsn.fr/")

3.4 Set up python

The following scripts install a Python distribution (miniconda) with the R package reticulate, together with several Python packages. Note that several of these packages are not actually called within the code.

Code

library(reticulate) # to run python from R

# Variables to modify
my_envname <- "replication-wolf"
scripts_to_run <- c("003") # or c("001", "003") or "001"

# Install python if not already present
if (!dir.exists(miniconda_path())) {
  install_miniconda()
}
# conda_remove(my_envname) # for debugging purposes
# Create the environment if it does not already exist
if (!my_envname %in% conda_list()$name) {
  conda_create(my_envname)
}
# Packages needed for each script
requirements <- list(
  "001" = c("earthengine-api", "rasterio", "pandas", "pydrive"),
  "003" = c("fiona", "rasterio", "ray[default]", "dbfread", "pandas"))
# Combined depending on the variable defined at beginning of code chunk
required <- unique(unlist(requirements[scripts_to_run]))
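# Identify which ones are missing; extras such as "[default]" are stripped
# first, on the assumption that installed packages are listed by base name
py_installed_packages <- py_list_packages(my_envname)$package
missing_packages <- required[!sub("\\[.*\\]$", "", required) %in%
                               py_installed_packages]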
# Install those
if (length(missing_packages) > 0) {
  conda_install(envname = my_envname, 
              packages = missing_packages,
              pip = TRUE)
}

# Adding a dependency required on Linux platforms
if ((Sys.info()["sysname"] == "Linux") &&
    (!"libstdcxx-ng" %in% py_installed_packages)) {
  conda_install(envname = my_envname,
                packages = "libstdcxx-ng")
}

# Activate the corresponding environment
use_condaenv(my_envname)
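
Requirement names written with extras, such as ray[default], do not match the base name (ray) under which a package is presumably listed once installed, so a direct comparison would always report them as missing. The following Python sketch illustrates a comparison on base names. The helpers base_name and missing_requirements are hypothetical, and the sketch deliberately ignores other naming subtleties such as hyphens versus underscores.

```python
import re
from importlib import metadata


def base_name(requirement):
    """Strip a trailing extras marker and lowercase: 'ray[default]' -> 'ray'."""
    return re.sub(r"\[.*\]$", "", requirement).strip().lower()


def missing_requirements(requirements, installed=None):
    """Return the requirements whose base name is not among installed packages.

    `installed` defaults to the distributions visible to the running
    interpreter; passing an explicit list makes the function easy to check.
    """
    if installed is None:
        installed = [dist.metadata["Name"] for dist in metadata.distributions()]
    installed_names = {name.lower() for name in installed if name}
    return [req for req in requirements if base_name(req) not in installed_names]


if __name__ == "__main__":
    # Requirements of script 003, as listed in the chunk above
    wanted = ["fiona", "rasterio", "ray[default]", "dbfread", "pandas"]
    print(missing_requirements(wanted))
```

Only the comparison uses the base name; the original requirement string, extras included, is what gets returned and would be passed on to the installer.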

Code

library(reticulate) # to run python from R

# Variables to modify
my_envname <- "replication-wolf2"

# py_list_packages(my_envname)
# system("strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX")
# system('conda install -c "conda-forge/label/gcc7" libstdcxx-ng')
# 
# conda_install()

# Install python if not already present
if (!dir.exists(miniconda_path())) {
  install_miniconda()
}

# Recreate the environment from scratch: remove it first if it already exists
if (my_envname %in% conda_list()$name) {
  conda_remove(my_envname)
} 
conda_create(my_envname)

requirements_pip <- c("fiona", "ray[default]", "rasterio", "ray", "dbfread", "pandas")
requirements_conda <- c("libstdcxx-ng")

# identify which ones are missing
py_installed_packages <- py_list_packages(my_envname)$package
missing_pip <- requirements_pip[!requirements_pip %in% py_installed_packages]
missing_conda <- requirements_conda[!requirements_conda %in%
                                      py_installed_packages]

# Install those
if (length(missing_pip) > 0) {
  conda_install(envname = my_envname, 
              packages = missing_pip,
              pip = TRUE)
}
if (length(missing_conda) > 0) {
  conda_install(envname = my_envname, 
              packages = missing_conda)
}

# Activate the corresponding environment
use_condaenv(my_envname)

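The following Python code creates an authorization for Google Drive with PyDrive, reusing saved credentials when they exist.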

Code

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

gauth = GoogleAuth()
# Try to load saved client credentials
gauth.LoadCredentialsFile("mycreds.txt")
if gauth.credentials is None:
    # Authenticate if they're not there
    gauth.LocalWebserverAuth()
elif gauth.access_token_expired:
    # Refresh them if expired
    gauth.Refresh()
else:
    # Initialize the saved creds
    gauth.Authorize()
# Save the current credentials to a file
gauth.SaveCredentialsFile("mycreds.txt")

3.5 Set up Julia

First, we install or set up Julia from R if needed, then install the required Julia packages.

Code

library(JuliaCall)

if (!dir.exists(paste0(rappdirs::user_data_dir(), "/R/JuliaCall/julia"))) {
  install_julia()
}
if (!exists("my_julia")) {
  my_julia <- julia_setup()
}
julia_install_package_if_needed("ArchGDAL")
julia_install_package_if_needed("DataFrames")
julia_install_package_if_needed("Discretizers")
julia_install_package_if_needed("Shapefile")
julia_install_package_if_needed("FreqTables")
julia_install_package_if_needed("Plots")
julia_install_package_if_needed("StatsBase")
julia_install_package_if_needed("CSV")
julia_install_package_if_needed("LibGEOS")

3.6 Adapt to Windows system (optional)

The script 002 - dl_IUCN.R includes a system command that calls wget, a downloading tool commonly available on Unix platforms (Mac and Linux).

On Linux, if wget is not available, the user must run the command sudo apt-get update followed by the command sudo apt-get install parallel.

Code

if (Sys.info()["sysname"] == "Linux") {
  system("sudo apt update")
  system("sudo apt install -y parallel")
}
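
Before calling the package manager, it can be useful to check whether the required command-line tools are already on the PATH. A minimal Python sketch, using the standard-library function shutil.which; the helper name missing_tools is hypothetical.

```python
import shutil


def missing_tools(names):
    """Return the tools from `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # wget is the command called by script 002; parallel is the package
    # installed in the chunk above
    print(missing_tools(["wget", "parallel"]))
```

An empty list means that nothing needs to be installed, which avoids an unnecessary sudo call.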

It is possible, however, to run it on Windows, if and only if the Windows system includes Windows Subsystem for Linux (WSL). In that case, if wget is not already available in WSL, the user must first install parallel, which includes wget, by running the command wsl sudo apt-get update followed by the command wsl sudo apt-get install parallel.

Code

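# Replace all wget calls by wsl wget
if (Sys.info()["sysname"] == "Windows") {
  script_path <- "002 - dl_IUCN.R"
  script_lines <- readLines(script_path)
  # fixed = TRUE treats "wget" as a literal string, not a regular expression
  writeLines(gsub("wget", "wsl wget", script_lines, fixed = TRUE), script_path)
}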
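
A plain substitution of wget by wsl wget, if run twice on the same file, would produce wsl wsl wget. The following Python sketch shows one way to make the substitution safe to repeat, using a regular expression with a negative lookbehind. The names WGET_PATTERN, prefix_wget_with_wsl and patch_script are hypothetical, and the sketch is not part of the original code.

```python
import re
from pathlib import Path

# Matches "wget" as a whole word unless it is already preceded by "wsl "
WGET_PATTERN = re.compile(r"(?<!wsl )\bwget\b")


def prefix_wget_with_wsl(text):
    """Prefix every bare wget call with wsl, leaving existing wsl wget alone."""
    return WGET_PATTERN.sub("wsl wget", text)


def patch_script(path):
    """Rewrite the script at `path` in place with the substitution applied."""
    script = Path(path)
    original = script.read_text(encoding="utf-8")
    script.write_text(prefix_wget_with_wsl(original), encoding="utf-8")
```

Calling patch_script("002 - dl_IUCN.R") on a Windows machine would then have the same effect whether the chunk is rendered once or several times.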