2020_Day2_Marius_Appel/tutorial_02.Rmd at master · OpenGeoHub/2020_Day2_Marius_Appel · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
---
title: "Creating and Analyzing Multi-Variable Earth Observation Data Cubes in R"
author: "Marius Appel"
date: "*Aug 18, 2020*"
output:
  html_document:
    theme: flatly
    highlight: default
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
    toc_depth: 2
bibliography: references.bib
link-citations: yes
csl: american-statistical-association.csl
---

```{r setup, include=FALSE}
figtrim <- function(path) {
  img <- magick::image_trim(magick::image_read(path))
  #img <- magick::image_extent(img,magick::geometry_size_percent(width = 100, height = 110))
  magick::image_write(img, path)
  path
}
knitr::opts_chunk$set(echo = TRUE)
#knitr::opts_chunk$set(eval = FALSE)
knitr::opts_chunk$set(out.width = "100%")
knitr::opts_chunk$set(dev = "jpeg")
knitr::opts_chunk$set(fig.process = figtrim)
knitr::opts_chunk$set(fig.width = 10, fig.height = 10)
```

**OpenGeoHub Summer School 2020, Wageningen**


# Outline


- Part I: Introduction to Earth observation (EO) data cubes and the `gdalcubes` R package

- **Part II: Examples on using data cubes to combine imagery from different satellite-based EO missions**

- Part III: Summary, discussion, and practical exercises


All the material of this tutorial is online at [GitHub](https://github.com/appelmar/opengeohub_summerschool2020), including R markdown sources, rendered HTML output, and solutions to the practical exercises.


# Part II: Mulit-variable data cubes from different EO products

_Please notice that the data used in this part of the tutorial is not available online._

## First steps

At first, we make sure to load the needed packages again.

```{r setup_gdalcubes}
library(gdalcubes)
library(magrittr)
library(colorspace)
```


## Example 1: Building a combined NDVI data cube from Sentinel-2, Landsat 8, and MODIS data


In this example, we are interested in monitoring the vegetation with NDVI as target variable. To get as much information as possible, we will use observations from Sentinel-2, Landsat 8, and MODIS (product [MOD09GA](https://lpdaac.usgs.gov/products/mod09gav006/)). These datasets have very different properties:

| Dataset | Spatial Resolution (NDVI bands) | Temporal Resolution | File Format
|------------------------------|-----------------|-----------------|-----------------|
| Sentinel-2 Level 2A | 10m  | 5 days | .jp2
| Landsat 8 surface reflectance |  30m  | 16 days | GeoTIFF
| MODIS MOD09GA | 500m | daily | HDF4

Our study area covers a region around Münster, Germany with mostly agricultural and some smaller forest areas.

We start with an overview of available data:


```{r sc1_01}
MOD.files = list.files("/media/marius/Samsung_T5/eodata/MOD09GA_MS", pattern = ".hdf",full.names = TRUE, recursive = TRUE)
head(MOD.files, n = 3)
length(MOD.files)
sum(file.size(MOD.files)) / 1024^3 # GiB

L8.files = list.files("/media/marius/Samsung_T5/eodata/L8_MS2019", pattern = ".tif",full.names = TRUE, recursive = TRUE)
head(L8.files, n = 3)
length(L8.files)
sum(file.size(L8.files)) / 1024^3 # GiB

S2.files = list.files("/media/marius/Samsung_T5/eodata/S2L2A_MS", pattern = ".zip",full.names = TRUE, recursive = TRUE)
head(S2.files, n = 3)
length(S2.files)
sum(file.size(S2.files)) / 1024^3 # GiB
```

Notice that the Sentinel-2 images come from three different tiles covering two UTM zones. As in the first part, we now need to create separate image collections, which requires a bit of research to find the correct collection formats (e.g., using `collection_formats()`).


```{r sc1_02}
if (!file.exists("MOD09GA.db")) {
  MOD.col = create_image_collection(MOD.files, "MxD09GA", "MOD09GA.db")
}
MOD.col = image_collection("MOD09GA.db")
MOD.col

if (!file.exists("L8_MS.db")) {
  L8.col = create_image_collection(L8.files, "L8_SR", "L8_MS.db")
}
L8.col = image_collection("L8_MS.db")
L8.col

if (!file.exists("S2_MS.db")) {
  S2.col = create_image_collection(S2.files, "Sentinel2_L2A", "S2_MS.db")
}
S2.col = image_collection("S2_MS.db")
S2.col
```

After carefully reading the product information from the data providers, we define separate masks to consider high-quality pixels only.


```{r sc1_03}
MOD.mask = image_mask("state_1km", values=0, bits = 0:1, invert  = TRUE)
L8.mask =  image_mask("PIXEL_QA", values=c(322, 386, 834, 898, 1346, 324, 388, 836, 900, 1348), invert = TRUE)
S2.mask = image_mask("SCL", values=c(1,3,8,9,10))
```


We first build three data cubes with separate cube views to get an initial overview of the collections.


```{r sc1_04}
MOD.v1 = cube_view(extent = MOD.col, dx = 5000, dt = "P1M", resampling = "average", srs = "EPSG:3857", aggregation = "median")
raster_cube(MOD.col, MOD.v1, mask = MOD.mask) %>%
  select_bands(c("sur_refl_b01", "sur_refl_b04", "sur_refl_b03")) %>%
  plot(rgb = 1:3, zlim=c(0, 2000), ncol = 4)

L8.v1 = cube_view(extent = L8.col, dx = 500, dt = "P1M", resampling = "average", srs = "EPSG:3857", aggregation = "median")
raster_cube(L8.col, L8.v1, mask = L8.mask) %>%
  select_bands(c("B02", "B03", "B04")) %>%
  plot(rgb = 3:1, zlim=c(0,2000), ncol=4)

S2.v1 = cube_view(extent = S2.col, dx = 500, dt = "P1M", resampling = "average", srs = "EPSG:3857", aggregation = "median")
raster_cube(S2.col, S2.v1, mask = S2.mask) %>%
  select_bands(c("B02", "B03", "B04")) %>%
  plot(rgb = 3:1, zlim=c(0,2000), ncol=4)
```


Now, we define our spatiotemporal area of interest, desired resolution, and build separate data cubes with NDVI measurements.


```{r sc1_05}
srs = "EPSG:32632"
aoi = list(left = 383695, right = 399749, bottom = 5751683, top = 5760444, t0="2019-04-01", t1 = "2019-10-31")
v.aoi = cube_view(srs = srs, extent = aoi, dx = 200, dt = "P1D", resampling = "average", aggregation = "median")

raster_cube(L8.col, v.aoi, mask = L8.mask) %>%
  select_bands(c("B04", "B05")) %>%
  apply_pixel("(B05-B04)/(B05+B04)", "NDVI")  -> L8.cube

raster_cube(S2.col, v.aoi, mask = S2.mask) %>%
  select_bands(c("B04", "B08")) %>%
  apply_pixel("(B08-B04)/(B08+B04)", "NDVI") -> S2.cube

raster_cube(MOD.col, cube_view(v.aoi, resampling = "bilinear"), mask = MOD.mask) %>%
  select_bands(c("sur_refl_b01", "sur_refl_b02")) %>%
  apply_pixel("(sur_refl_b02 - sur_refl_b01)/(sur_refl_b02 + sur_refl_b01)", "NDVI") -> MOD.cube
```


Now, we can combine all data cubes into a single data cube with three variables using the `join_bands()` function. The `cube_names` argument here is used to add prefixes and ensure unique band names of the output data cube. As a first step, we count the number of observations of pixel time series from the different datasets.

```{r sc1_06}
combined.cube = join_bands(list(L8.cube, S2.cube, MOD.cube), cube_names = c("L8", "S2", "MOD"))
combined.cube

combined.cube %>%
  reduce_time("count(L8.NDVI)", "count(S2.NDVI)", "count(MOD.NDVI)") %>%
  plot(key.pos=1, zlim=c(0, 80), col = viridis::viridis, nbreaks=19, ncol = 3)
```


Having all data available in one data cube for example allows to create a simple _best available NDVI_ image time series.


```{r sc1_07}
ndvi.col = function(n) {
  rev(sequential_hcl(n, "Green-Yellow"))
}
join_bands(list(L8.cube, S2.cube, MOD.cube), cube_names = c("L8", "S2", "MOD")) %>%
  apply_time(names="NDVI_BEST", FUN = function(x) {
    out = x["MOD.NDVI",]
    L8.idx = which(!is.na(x["L8.NDVI",]))
    out[L8.idx] = x["L8.NDVI",L8.idx]
    S2.idx = which(!is.na(x["S2.NDVI",]))
    out[S2.idx] = x["S2.NDVI",S2.idx]
    out
  }) %>%
  plot(key.pos = 1, t = 150:169, col = ndvi.col)
```

This is of course a very optimistic approach, assuming that NDVI values of different sensors are comparable. Another (equally optimistic) approach, which completely fills the NDVI pixel time series, might be to use MODIS observations only to identify the temporal trend and then apply the trend to Sentinel-2 observations to fill gaps, as in the following.


```{r sc1_08}
join_bands(list(S2.cube, MOD.cube), cube_names = c("S2", "MOD")) %>%
  apply_time(names="NDVI", FUN = function(x) {
    library(zoo)
    mod.ndvi = na.approx(x["MOD.NDVI",], na.rm = FALSE)
    mod.ndvi.diff = c(0, diff(mod.ndvi))

    out = rep(NA, ncol(x))

    S2.idx = which(!is.na(x["S2.NDVI",]))
    if (length(S2.idx) > 1) {
      for (i in S2.idx[1]:ncol(x)){
        if (!is.na(x["S2.NDVI",i])) {
         out[i] = x["S2.NDVI",i]
        }
        else {
          out[i] =  out[i-1] + mod.ndvi.diff[i]
        }
      }
    }
    out
  }) %>%
  plot(key.pos = 1, t = 150:169, col = ndvi.col)
```

There are of course a lot of issues here, too. For example, the result contains NDVI values > 1. However, it is easy to think about improvements to this method (e.g., learning from days when observations from all sensors are available).


# Example 2: Joint analysis of satellite derived vegetation, precipitation, and soil moisture observations


Instead of combining data from different sensors but the same target variable, we may use data cubes to analyze interactions or correlations between different variables. In the example below, we combine NDVI vegetation index observations from MODIS ([MOD13A2](https://lpdaac.usgs.gov/products/mod13a2v006)), daily precipitation from the Global precipitation measurement mission ([GPM_3IMERGDF](https://disc.gsfc.nasa.gov/datasets/GPM_3IMERGDF_06/summary), @huffman2019gpm), and daily soil moisture from ESA CCI [ESA CCI](https://www.esa-soilmoisture-cci.org/) [@gruber2019evolution, @dorigo2017esa, @gruber2017triple].

As in the previous example, we start by creating image collections for all different datasets.


```{r sc2_01}
MOD13A2.files = list.files("/media/marius/Samsung_T5/eodata/MOD13A2", recursive = TRUE, full.names = TRUE, pattern=".hdf")
head(MOD13A2.files)
if (!file.exists("MOD13A2.db")) {
  MOD13A2.collection = create_image_collection(MOD13A2.files, "MxD13A2", "MOD13A2.db")
}
MOD13A2.collection = image_collection("MOD13A2.db")
MOD13A2.collection

GPM.files = list.files("/media/marius/Samsung_T5/eodata/GPM", recursive = TRUE, full.names = TRUE, pattern=".tif")
head(GPM.files)
if (!file.exists("GPM.db")) {
  GPM.collection = create_image_collection(GPM.files, "GPM_IMERG_3B_DAY_GIS_V06A", "GPM.db")
}
GPM.collection = image_collection("GPM.db")
GPM.collection

SM.files = list.files("/media/marius/Samsung_T5/eodata/ESACCI-SOILMOISTURE-PASSIVE",
                      recursive = TRUE, full.names = TRUE, pattern=".*nc")
head(SM.files)
if (!file.exists("SM.db")) {
  SM.collection = create_image_collection(SM.files, "ESA_CCI_SM_PASSIVE.json", "SM.db")
}
SM.collection = image_collection("SM.db")
SM.collection
```


We now define our area of interest, select the variables of interest, and plot separate data cubes (using the same data cube view) for one year to get an initial overview.


```{r sc2_04}
view_de = cube_view(srs = "EPSG:3035", extent = list(left=4039313, bottom=2775429,right=4670348, top=3482115, t0 = "2018-01", t1="2018-12"), dx = 5000, dt = "P1M", aggregation = "mean", resampling = "bilinear")

ndvi.col = function(n) {
  rev(sequential_hcl(n, "Green-Yellow"))
}

prcp.col = function(n) {
  rev(sequential_hcl(n, "Purple-Blue"))
}

raster_cube(MOD13A2.collection, view_de) %>%
  select_bands("NDVI") %>%
  plot(key.pos = 1, col = ndvi.col)

raster_cube(GPM.collection, view_de) %>%
  select_bands("total_accum") %>%
  plot(key.pos = 1, col = prcp.col, zlim = c(0, 100), nbreaks = 100)

raster_cube(SM.collection, view_de) %>%
  select_bands("sm") %>%
  plot(key.pos = 1, zlim = c(0,1), col = prcp.col, nbreaks=100)
```

Next, we preprocess our data cubes separately as described below:

* For the NDVI cube, we decompose the pixel time series by removing seasonal and trend components using the R function `stl()`. We refer to the remainder as _anomalies_. This can be done with the `apply_time()` function, which expects a user-defined R function over pixel time series but does no reduction (compared to `reduce_time()`. In the provided function, we first convert time series to `ts` objects, fill NA values by linear interpolation using `zoo::na_approx()` and afterwards use the `stl()` function to decompose the time series.

* For the soil moisture cube, we remove seasonal and trend components but compute a rolling mean of the last two months before.

* We calculate the monthly standard precipitation index from the precipitation cube using the R function `spi()` from package `precintcon`.

Notice that to keep this example simple and computation times in the order of minutes, we still aggregate our observations monthly and limit our analysis to four years.


```{r sc2_05}
view_de = cube_view(view_de, extent = list(t0 = "2015-01", t1 = "2018-12"))

raster_cube(MOD13A2.collection, view_de) %>%
  select_bands("NDVI") %>%
  apply_time(name = "NDVI_anomaly", FUN = function(x) {
    library(zoo)
    y = na.approx(x["NDVI",], na.rm = FALSE)
    if (any(is.na(y))) return(rep(NA, ncol(x)))
    y = ts(y, frequency = 12,  start = c(2015, 1))
    y_decomposed = stl(y, "periodic")
    res = as.vector(y_decomposed$time.series[,"remainder"])
    return(res)
  }) -> ndvi.anom

raster_cube(GPM.collection, view_de) %>%
  select_bands("total_accum") %>%
  apply_time(names = "spi", FUN = function(x) {
    library(precintcon)
    year = as.numeric(substr(colnames(x), 1, 4))
    month = as.numeric(substr(colnames(x), 5, 6))
    y = data.frame(year = year, months = month, precipitation = x["total_accum",])
    class(y) <- c("data.frame", "precintcon.monthly")
    if (any(is.na(y$precipitation))) return(rep(1, ncol(x)))
    spi_monthly = spi(y, period = 1)$spi
    return(spi_monthly)
  }) -> prcp.spi

raster_cube(SM.collection, view_de) %>%
  select_bands("sm") %>%
  apply_time(names = "sm_anomaly", FUN = function(x) {
    library(zoo)
    y = na.approx(x["sm",], na.rm = FALSE)
    if (any(is.na(y))) return(rep(NA, ncol(x)))
    y = rollmean(y, 2, fill = NA, align = "right")
    y = ts(y[-1], frequency = 12,  start = c(2015, 2))
    y_decomposed = stl(y, "periodic")
    res = c(0, as.vector(y_decomposed$time.series[,"remainder"]))
    return(res)
  }) -> sm.anom
```

To demonstrate how we can combine the three data cubes anyway, we apply a simple statistical test for zero correlation (Kendall's $\tau$) on $(NDVI_t, SPI_{t-0..2})$ and $(NDVI_t, SM_{t-0..2})$ for each time series below.

```{r sc2_07}
out_bandnames = c("NDVI_SPI_LAG_0_tau", "NDVI_SPI_LAG_0_p",
                  "NDVI_SPI_LAG_1_tau", "NDVI_SPI_LAG_1_p",
                  "NDVI_SPI_LAG_2_tau", "NDVI_SPI_LAG_2_p",
                  "NDVI_SM_LAG_0_tau", "NDVI_SM_LAG_0_p",
                  "NDVI_SM_LAG_1_tau", "NDVI_SM_LAG_1_p",
                  "NDVI_SM_LAG_2_tau", "NDVI_SM_LAG_2_p")
join_bands(list(ndvi.anom, prcp.spi, sm.anom), cube_names = c("X1","X2","X3")) %>%
  reduce_time(FUN = function(x) {
    v = data.frame(ndvi = x["X1.NDVI_anomaly", ], spi = x["X2.spi", ], sm = x["X3.sm_anomaly", ])
    # use only April to October
    #month = as.numeric(substr(colnames(x), 5,6))
    #v = v[which(month >= 4 & month <= 10),]
    if (sum(complete.cases(v)) < 6) return(rep(NA, 12))
    n = nrow(v)

    res = rep(NA, 12)
    ct = cor.test(v$ndvi, v$spi, use = "complete.obs", method = "kendall")
    res[c(1,2)] = c(ct$estimate, ct$p.value)
    ct = cor.test(v$ndvi[2:n], v$spi[1:(n-1)], use = "complete.obs", method = "kendall")
    res[c(3,4)] = c(ct$estimate, ct$p.value)
    ct = cor.test(v$ndvi[3:n], v$spi[1:(n-2)], use = "complete.obs", method = "kendall")
    res[c(5,6)] = c(ct$estimate, ct$p.value)
    ct = cor.test(v$ndvi, v$sm, use = "complete.obs", method = "kendall")
    res[c(7,8)] = c(ct$estimate, ct$p.value)
    ct = cor.test(v$ndvi[2:n], v$sm[1:(n-1)], use = "complete.obs", method = "kendall")
    res[c(9,10)] = c(ct$estimate, ct$p.value)
    ct = cor.test(v$ndvi[3:n], v$sm[1:(n-2)], use = "complete.obs", method = "kendall", na.action = na.exclude)
    res[c(11,12)] = c(ct$estimate, ct$p.value)

    return(res)

  }, names = out_bandnames) -> result.cube
  result.cube
  plot(result.cube, key.pos = 1, zlim=c(-1,1), col = viridis::viridis, ncol = 4, nbreaks=50)
```

Is this meaningful? Probably not, but it shows _how_ complex calculations can be applied on data cubes in R.
In applications, one would of course use much longer time series to derive anomalies and furthermore, the interactions of our variables are much more complex in reality (and may need more information on soil, vegetation type, elevation, and similar).
The result contains $p$ values and $\tau$ estimates for both pairs of variables at temporal lags 0, 1, and 2.


# References