css/sequence-analysis.qmd at main · GDSL-UL/css · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
# Sequence Analysis {#sec-chp4}

This chapter illustrates the use of sequence analysis and [WorldPop](https://www.worldpop.org) raster data to identify trajectories of population decline in Ukraine. Developed in Biology for the analysis of DNA sequencing, sequence analysis offers a novel approach to generate a more holistic understanding of population decline trajectories capturing differences in the ordering, frequency, timing and occurrence of population decline. The *longitudinal* and *categorical* nature is a key feature of the data that can be analysed using sequence analysis. In this chapter, We first show how to manipulate gridded, raster population to create a spatial data frame, and explore national and sub-national population trends and spatial structure of population change. We describe the process of implementing sequence analysis to identify trajectories of population decline in a four-stage process. We first define different levels of population change. Second, we apply measure the difference or similarity between individual area-level trajectories. Third, we use an unsupervised machine learning clustering algorithm to identify representative trajectories. Fourth, we use different visualisation tools to understand key features encoded in trajectories.

The chapter is based on the following references:

@gabadinho2011 describe the functionalities of the `TraMineR` package to visualise and analysis categorical sequence data;

@newsham2022a, @gonzález-leonardo2023 provide empirical applications to identify and study trajectories of population decline across Europe and in Spain;

@tatem2017 provide an overview of the *WorldPop* data project;

@patias2021, @patias2021a provide good examples of the use of sequence analysis to define trajectories of inequality and urban development.

## Dependencies

We use the libraries below. Note that to use the `theme_tufte2()` used for `ggplot()` objects in this chapter, you need to call the file `data-viz-themes.R` in the repository.

```{r}
#| include = FALSE
# clear memory and adds tufte templates for plots and maps
rm(list=ls())
source("data-viz-themes.R")
```

```{r}
#| warning: false

# data manipulation
library(tidyverse)
# spatial data manipulation
library(stars)
library(sf)
# download world pop data
library(wpgpDownloadR) # you may need to install this package running `install.packages("devtools")` `devtools::install_github("wpgp/wpgpDownloadR")'
# data visualisation
library(viridis)
library(RColorBrewer)
library(patchwork)
library(ggseqplot) # may need to install by running `devtools::install_github("maraab23/ggseqplot")`
# sequence analysis
library(TraMineR)
# cluster analysis
library(cluster)
```

Key packages to this chapter are `TraMineR`,`stars` and `ggseqplot`. `TraMineR` is the go-to package in social sciences for exploring, analysing and rendering sequences based on categorical data. `stars` is designed to handle spatio-temporal data in the form of dense arrays, with space and time as dimensions. `stars` provides classes and methods for reading, manipulating, plotting and writing data cubes. It is a very powerful package. It interacts nicely with `sf` and is suggested to be superior to `raster` and `terra`, which are also known for their capacity to work with multilayer rasters. `stars` is suggested to [deal with more complex data types](https://r-spatial.github.io/stars/) and [be faster](https://www.r-bloggers.com/2021/05/a-comparison-of-terra-and-stars-packages/) than `raster` and `terra`. `ggseqplot` provides functionality to visualise categorical sequence data based on `ggplot` capabilities. This differs from `TraMineR` which is based on the base function `plot`. We prefer `ggseqplot` for the wide usage of `ggplot` as a data visualisation tool in R.

## Data

The key aim of this chapter is to define representative trajectories of population decline using sequence analysis and WorldPop data. We use WorldPop data for the period extending from 2000 to 2020. WorldPop offers open access gridded population estimates at a high spatial resolution for all countries in the world. WoldPop produces these gridded datasets using a top-down (i.e. dissagregating administrative area counts into smaller grid cells) or bottom-up (i.e. interpolating data from counts from sample locations into grid cells) approach. You can learn about about these approaches and the data available from [WorldPop](https://www.worldpop.org).

WorldPop population data are available in various formats:

-   Two spatial resolutions: 100m and 1km;

-   Constrained and unconstrained counts to built settlement areas;

-   Adjusted or unadjusted to United Nations' (UN) national population counts;

-   Two formats i.e. `tiff` and `csv` formats.

We use annual 1km gridded, UN adjusted, unconstrained population count data for Ukraine during 2000-2021 in tiff format. We use tiff formats to illustrate the manipulation of raster data. Such skills will come handy if you ever decide to work with satellite imagery or image data in general.

Before calling the data, let's see how we can use `wpgpDownloadR` package. Let's browse the data catalogue.

```{r}
wpgpListCountries() %>%
  head()
```

By using the ISO3 country code, let's look for the available datasets for Ukraine.

```{r}
wpgpListCountryDatasets(ISO3 = "UKR") %>%
  head()
```

The `wpgpDownloadR` package includes 100m resolution data. To keep things efficient, we use 1km gridded population counts from the [WorldPop data page](https://hub.worldpop.org/geodata/listing?id=75). Obtain population data for Ukraine 2000-2020. We start by reading the set of tiff files using the `read_stars` function from the `star` package.

```{r}
#| warning: false

# create a list of file names
file_list <- fs::dir_ls("./data/sequence-analysis/raster")
file_list
# read a list of raster data
pop_raster <- read_stars(file_list, quiet = TRUE)
```

We map the data for 2000 to get a quick understanding of the data.

```{r}
plot(pop_raster[1], col = inferno(100))
```

Next we read shapefile of administrative boundaries in the form of polygons. We obtain these data from the [GADM website](https://gadm.org). GADM provides maps and spatial data for individuals countries at the national and sub-national administrative divisions. In this chapter, we will work with these data as they come directly from the website which provides a more realistic and similar context to which you will probably come across in the "real-world".

```{r}
# read spatial data frame
ukr_shp <- st_read("./data/sequence-analysis/ukr_shp/gadm41_UKR_2.shp") %>%
  st_simplify(., # simplify boundaries for efficiency
              preserveTopology = T,
              dTolerance = 1000) %>%  # 1km
  sf::st_make_valid(.) %>%
  fortify(.) %>%  # turns maps into a data frame so they can more easily be plotted with ggplot2
  st_transform(., "EPSG:4326") # set projection system

ukr_shp
```

Let's have a quick look at the resolution of the administrative areas we will be working. The areas below represent areas at the administrative area level 2 in the spatial data frame `ukr_shp`.

```{r}
plot(ukr_shp$geometry)
```

We ensure that the `pop_raster` object is in the same projection system as `ukr_shp`. So we can make both objects to work together.

```{r}
pop_raster <- st_transform(pop_raster, st_crs(ukr_shp))
```

### Data wrangling

For our application, we want to work with administrative areas for three reasons. First, public policy and planning decisions are often made based on administrative areas. These are the areas local governments have jurisdiction, represent and can exert power. Second, migration is a key component of population change and hence directly determines population decline. At a small area, residential mobility may also impact patterns of population potentially adding more complexity and variability to the process. Third, WorldPop data are modelled population estimates with potentially high levels of uncertainty or errors in certain locations. Our aim is to mitigate the potential impacts of these errors.

We therefore recommend working with aggregated data. We aggregate the 1km gridded population data to administrative areas in Ukraine. We use `system.time` to time the duration of the proccess of aggregation which could take some time depending on your local computational environment.

```{r}

system.time({

popbyarea_df = aggregate(x = pop_raster,
                                   by = ukr_shp,
                                   FUN = sum,
                                   na.rm = TRUE)
})

```

**Sub-national population**

The chunk code above returns a list of raster data. We want to create a spatial data frame containing population counts for individual sub-national areas and years. We achieve this by running the following code:

```{r}
# create a function to bind the population data frame to the shapefile
add_population <- function(x) mutate(ukr_shp,
                      population = x)

# obtain sub-national population counts
ukr_eshp <- lapply(popbyarea_df, add_population)

# create a dataframe with sub-national populations
select_pop <- function(x) dplyr::select(x, GID_2, NAME_2, population)
population_df <- lapply(ukr_eshp, select_pop) %>%
  do.call(rbind, .)
population_df$year <- rep(seq(2000, 2020), times = 1, each = nrow(ukr_shp))
rownames(population_df) <- rep(seq(1, nrow(population_df), by=1), times = 1)

# sub-national spatial data frame
population_df
```

**National population**

We also create a data frame providing population counts at the national level.

```{r}
# obtain national population counts
population_count <- map_dbl(ukr_eshp, ~.x %>%
          pull(population) %>%
          sum(na.rm = TRUE)
        ) %>%
  as.data.frame()

# change labels
colnames(population_count) <-  c("population")
rownames(population_count) <- rep(seq(2000, 2020, by=1), times = 1)
population_count$year <- rep(seq(2000, 2020, by=1), times = 1)

# national annual population counts
population_count
```

### Exploratory data analysis

Now we are ready to start analysing the data. Before building complexity on our analysis, conducting some exploratory data analysis to understand the data is generally a good starting point, particularly given the multi-layer nature of the data at hand - capturing space, time and population levels.

**National patterns**

We first analyse national population trends. We want to know to what extent the population of Ukraine has declined over time over the last 20 years. An effective way to do this is to compute summary statistics and visualise the data. Below we look at year-to-year changes in population levels and as a percentage change. By using `patchwork`, we combine two plots into a single figure.

```{r}
# visualise national population trends
pop_level_p <- ggplot(population_count,
       aes(x = year, y = population/1000000 )) +
  geom_line(size = 1) +
  theme_tufte2() +
  ylim(0, 48) +
  labs(y = "Population \n(million)",
       x = "Year")

# visualise percentage change in population
pop_percent_p <- population_count %>%
  mutate(
    pct_change = ( ( population - 47955683) / 47955683) * 100
  ) %>%
ggplot(aes(x = year, y = pct_change )) +
  geom_line(size = 1) +
  theme_tufte2() +
  labs(y = "Population \npercentage change (%)",
       x = "Year")

pop_level_p | pop_percent_p
```

**Sub-national**

Population losses are likely vary across the country. From previous research we know that rural and less well connected areas tend to lose population through the internal migration of young individuals as they move for work and job opportunities [@rowe2016]. We also know that they tend to move to large, densely populated cities where these opportunities are concentrated and because they also offer a wide variety of amenities and activities. Cities tend to work as accelarators enabling fast career development and occupational progression [@fielding1992]. Though, we have also seen the shrinkage of populations in cities, particularly in eastern European countries [@turok2007].

To examine the patterns of sun-national population losses, we compute two summary measures: (1) annual percentage change in population; and, (2) overall percentage change in population between 2000 and 2021. We start by looking the overall percentage change as it is easier to visualise. To this end, we categorise our measure of overall percentage change into seven different classes. Based on previous work by @gonzález-leonardo2023, we classify changes into high decline ($\leq$ -3), decline (\> -3 and $\leq$ -1.5), moderate decline (\> -1.5 and $\leq$ -0.3), stable (\< -0.3 and \< 0.3), moderate growth ($\geq$ 0.3 and \< 1.5), growth ($\geq$ 1.5 and \< 3) and high growth ($\geq$ 3). Let's first create the measures of population change.

```{r}
# compute population change metrics
population_df <- population_df %>%
  dplyr::group_by(GID_2) %>%
  arrange(-year, .by_group = TRUE ) %>%
  mutate(
  pct_change = ( population / lead(population) - 1) * 100, # rate of population change
  pct_change_2000_21 = ( population[year == "2020"] / population[year == "2000"] - 1) * 100, # overall rate of change
  ave_pct_change_2000_21 = mean(pct_change, na.rm = TRUE)
) %>%
  ungroup()

```

Let's map the overall percentage change in population between 2000 and 2020. We see a wide spread pattern of population decline across Ukraine. We observe a large spatial cluster of high population decline across the country with moderate population decline in some areas. Administrative areas containing large cities seem to record overall population growth between 2000 and 2020, potentially absorbing population movements from the rest of the country. What else do you think may be driving population growth in cities? And in contrast, what do you think is contributing to population decline in most of Ukraine?

```{r}

# set colours
cols <- c("#7f3b08", "#b35806", "#e08214", "#faf0e6", "#8073ac", "#542788", "#2d004b")
# reverse order
cols <- rev(cols)

population_df %>% dplyr::filter( year == 2020) %>%
  drop_na(pct_change_2000_21) %>%
    mutate(
    ove_pop_class = case_when( pct_change_2000_21 <= -3 ~ 'high_decline',
                           pct_change_2000_21 <= -1.5 & pct_change_2000_21 > -3 ~ 'decline',
                           pct_change_2000_21 <= -.3 & pct_change_2000_21 > -1.5 ~ 'moderate_decline',
                           pct_change_2000_21 > -0.3 & pct_change_2000_21 < 0.3 ~ 'stable',
                           pct_change_2000_21 >= 0.3 & pct_change_2000_21 < 1.5 ~ 'moderate_growth',
                           pct_change_2000_21 >= 1.5 & pct_change_2000_21 < 3 ~ 'growth',
                           pct_change_2000_21 >= 3 ~ 'high_growth'),
    ove_pop_class = factor(ove_pop_class,
         levels = c("high_decline", "decline", "moderate_decline", "stable", "moderate_growth", "growth", "high_growth") )
    ) %>%
  ggplot(aes(fill = ove_pop_class)) +
  geom_sf(col = "white", size = .1) +
  scale_fill_manual(values = cols,
                    name = "Population change") +
  theme_map_tufte()


```

Now that we have understanding of population changes over the whole 2000-2020 period. Let's try to understand how different places arrive to different outcomes. A way to do this is to look at the evolution of population changes. Different trajectories of population change could underpin the outcomes of population change that we observe today. Current outcomes could be the result of a consistent pattern of population decline over the last 20 years. They could be the result of acceleration in population loss after a major natural or war event, or they could reflect a gradual process of erosion. We visualise way to get an understanding of this is to analyse annual percentage population changes across individual areas. We use a Hovmöller Plot as illustrated by @rowe2022e for the analysis of spatio-temporal data.

```{r}
population_df %>% dplyr::filter( ave_pct_change_2000_21 < 0) %>%
  tail(., 40*21) %>%
  ggplot(data = .,
           mapping = aes(x= year, y= reorder(NAME_2, pct_change), fill= pct_change)) +
  geom_tile() +
  scale_fill_viridis(name="Population", option ="plasma", begin = .2, end = .8, direction = 1) +
  theme_tufte2() +
  labs(title= paste(" "), x="Year", y="Area") +
  theme(text = element_text(size=14)) +
  theme(axis.text.y = element_text(size=8))
```

The Hovmöller Plot shows that most of the selected areas tend to experience annual population decline, with varying spells of population growth. Percentage population changes range between 1 and -2.5. We also observe areas with consistent trajectories of annual population decline, like Zolochivs'kyi and Barvinkivs'kyi, and areas with strong decline in the first few years between 2000 and 2005 but moderate decline later on, such as Novovorontsovk'kyi. Yet, Hovmöller Plots provide a limited understanding of the annual population changes for a handful of areas at the time and it is therefore difficult to identify systematic representative patterns. Here we have selected 40 areas of a total of 629. Displaying the total number of areas in a Hovmöller Plot will not produce readable results. Even if that was the case, it would be difficult to identify systematic patterns. As we will seek to persuade you below, sequence analysis provides a very novel way to define representative trajectories in the data, identify systematic patterns and extract distinctive features characterising those trajectories.

## Application

Next, we focus on the application of sequence analysis to identify representative trajectories of population decline at the sub-national level between 2000 and 2020 in Ukraine. Intuitively, sequence analysis can be seen as a four-stage process. First, it requires the definition of longitudinal categorical outcome. Second, it measures the dissimilarity of individual sequences via a process known as optimal matching (OM). Third, it uses these dissimilarity measures to define a typology of representative trajectories using unsupervised machine learning clustering techniques. Fourth, trajectories can be visualised and their distinctive features can be measured. Below we describe the implementation of each stage to identify representative trajectories of population decline.

### Defining outcome process

Sequence analysis requires longitudinal categorical data as an input. We therefore classify our population count data into distinct categorical categories, henceforth referred to as *states* of population change. We compute the annual percentage rate of population change for individual areas and use these rates to measure the extent and pace of population change. The annual rate of population change is computed as follows:

$$
{p(t1) - p(t0) \over p(t0)}*100
$$

where: $p(t0)$ is the population at year t0 and $p(t1)$ is the population at t + 1.

As previously, we differentiate areas of high decline, decline, moderate decline, stable, moderate growth, growth and high growth. For the analysis, we focus on areas recording population losses between 2000 and 2020. The histogram shows the magnitude and distribution of population decline over this period. We observe that most occurrences of decline are moderate around zero, while very few exceed 5%.

```{r}
# select areas reporting losses between 2000 and 2020
population_loss_df <- population_df %>%
  dplyr::filter( pct_change_2000_21 < 0)
# plot distribution of percentage change
population_loss_df %>%
  dplyr::filter(pct_change  < 0) %>%
  ggplot(data =  ) +
  geom_density(alpha=0.8, colour="black", fill="lightblue", aes(x = pct_change)) +
  theme_tufte2()
```

Next we classify the annual percentage of population change into our seven states.

```{r}
# remove 2000 as it has no observations of population change
population_loss_df <- population_loss_df %>%
  dplyr::filter( year != 2000)
# clasify data
population_loss_df <- population_loss_df %>%
  mutate(
    pop_class = case_when( pct_change <= -3 ~ 'high_decline',
                           pct_change <= -1.5 & pct_change > -3 ~ 'decline',
                           pct_change <= -.3 & pct_change > -1.5 ~ 'moderate_decline',
                           pct_change > -0.3 & pct_change < 0.3 ~ 'stable',
                           pct_change >= 0.3 & pct_change < 1.5 ~ 'moderate_growth',
                           pct_change >= 1.5 & pct_change < 3 ~ 'growth',
                           pct_change >= 3 ~ 'high_growth')
)

```

### Optimal matching

We measure the extent of dissimilarity between individual sequence of population decline. To this end, we used a sequence analysis technique, OM, which computes distances between sequences as a function of the number of transformations required to make sequences identical. Two sets of operations are generally used: (1) insertion/deletion (known as indel) and (2) substitution operations. Both of these operations represent the cost of transforming one sequence into another. These costs are challenging to define and below we discuss what is generally used in empirical work. Intuitively, the idea of OM is to estimate the cost of transforming one sequence into another so that the greater the cost to make two sequences identical, the greater the dissimilarity and *vice versa*.

*Indel operations* involve the addition or removal of an element within the sequence and substitution operations are the replacement of one element for another. Each of these operations is assigned a cost, and the distance between two sequences is defined as the minimum cost to transform one sequence to another [@abbott2000sequence]. By default, indel costs are set to 1. To illustrate indel operations, let's consider an example of sequences of annual population change for three areas during 2000 and 2003. The sequences are identical, except for 2003. In this case, indel operations involve the cost of transforming the status *stable* in the sequence for area 1 to *high decline* in the sequence for area 2, and thus this operation would return a cost is 2. Why 2? It is 2 because you would need to delete *stable* and add *high decline*. Now, let's try the cost of transforming the status *stable* in the sequence for area 1 to the status in the sequence for area 3 using indel operations. What is the cost? The answer is 1 because we only need to delete *stable* to make it identical.

| Area | 2000    | 2001    | 2002    | 2003         |
|------|---------|---------|---------|--------------|
| 1    | decline | decline | decline | stable       |
| 2    | decline | decline | decline | high decline |
| 3    | decline | decline | decline | \-           |

*Substitution operations* or costs represent transition costs; that is, the cost for substituting each state with another. Substitution costs are defined in one of two ways [@salmela-aro2011]. One approach is the theory-driven approach. In such approach, substitution costs are grounded in theory suggesting that, for example, transforming state 1 to state 2 should have a greater cost than transforming state 1 to state 3, or performing the opposite operation i.e. transforming state 2 to state 1. An example could be that it is more financially costly to transition from *full-time employment* to *full-time education* than transition from *full-time education* to *full-time employment*.

A second approach and most commonly used in empirical work is a data-driven approach. In this approach, substitution costs are empirically derived from transition rates between states. The cost of substitution is inversely related to the frequency of observed transitions within the data. This means that infrequent transitions between states have a higher substitution cost. For example, as we will see, transitions from the state of high decline to high growth are rarer than from high growth to high decline in Ukraine. The transition rate between state $i$ and state $j$ is the probability of observing state $j$ at time $t1$ given that the state $i$ is observed at time $t$ for $i \neq j$. The substitution cost between states $i$ and $j$ is computed as:

$$
2 - {p(i | j) - p(j | i)}
$$ where $p(i | j)$ is the transition rate between state $i$ and $j$.

To implement OM, we first need to rearrange the structure of our data from long to wide format. You can now see now that individual rows represent areas (column 1) and columns from 2 to 21 represent years.

-   see @rowe2022e for a description on different spatio-temporal data structures and their manipulation using tidyverse principles.

```{r}
# transform from long to wide format
wide_population_loss_df <- population_loss_df %>%
  as_tibble() %>%
  group_by(GID_2) %>%
  arrange(year, .by_group = TRUE ) %>%
  ungroup() %>%
  tidyr::pivot_wider(
  id_cols =  GID_2,
  names_from = "year",
  values_from = "pop_class"
)

wide_population_loss_df
```

Once the data frame has been reshaped into a wide format, we define the data as a state sequence object using the R package `TraMineR`. Key here is to appropriately define the labels and an appropriate palette of colours. Depending on the patterns you are seeking to capture a diverging, sequential or qualitative colour palette may be more appropriate. For this chapter, we use a diverging colour palette as we want to effectively represent areas experiencing diverging patterns of population decline or growth.

> Note: various types of sequence data representation exist in `TraMineR`. These representations vary in the way they capture states or events. Chapter 4 in @gabadinho2009mining describes the various representations that `TraMineR` can handle. In any case, the state sequence representation used in this chapter is the most commonly used and internal format used by `TraMineR`. Hence we focus on it.

```{r}
# alphabet
seq.alphab <- c("high_growth", "growth", "moderate_growth", "stable", "moderate_decline", "decline", "high_decline")
# labels
seq.lab <- c("High growth", "Growth", "Moderate growth", "Stable", "Moderate decline", "Decline", "High decline")
# define state sequence object
seq.cl <- seqdef(wide_population_loss_df,
                 2:21,
                 alphabet = seq.alphab,
                 labels = seq.lab,
                 cnames = c("2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2020"),
                 cpal =c("1" = "#7f3b08",
                         "2" = "#b35806",
                         "3" = "#e08214",
                         "4" = "#faf0e6",
                         "5" = "#8073ac",
                         "6" = "#542788",
                         "7" = "#2d004b"))
```

Using the sequence data object, we create a state distribution plot to get an understanding of the data. The plot shows the distribution of areas across status of population change in individual years. The overall picture emerging from the plot is an overall pattern of population decline between 2000 and 2020, predominantly moderate decline and limited spells of high population decline or growth. This aligns with the predominant trajectory of population decline observed in Ukrain based on more aggregate data at the regional level [@newsham2022].

```{r}
#| warning: false

seqplot(seq.cl,
        title="State distribution plot",
        type = "d",
        with.legend = "right",
        border = NA)
```

We now move on to compute the substitution costs for our population states. From the equation above, you may have realised that transition costs vary between 0 and 2, with the former indicating zero cost. The latter indicates the maximum cost of converting one state into another. The option `TRATE` in method states to derive costs from observed transition rates. That is the data-driven approach discussed above. From the matrix below, you can see that it is more costly to convert a status "high growth" to "moderate decline" than from "growth" to "moderate". This makes sense. We expect gradual changes in population along a given trajectory if they were to occur due to natural causes.

> Note we are considering a fixed measure of transition rates. That means that we are using the whole dataset to compute an average transition rate between states. That assumes that the rate of change between states does not change over time. Yet, there may be good reasons to believe they do as areas move across different states. In empirical work, time varying transition rates are more often considered. That means we use temporal slices of the data to compute transition rates; for example, using data from 2001, 2002 and so on. In this way, we end up with potentially different transition rates for every year.

```{r}
#| warning: false

# Calculate transition rates
subs_costs <- seqsubm(seq.cl,
                      method = "TRATE",
                      #time.varying = TRUE
                      )

subs_costs
```

To understand better the idea of substitution costs, we can have direct look at transition rates underpinning these costs. Transition rates can be computed via `seqtrate` . By definition, transition rates vary between 0 and 1, with zero indicating no probability of a transition occurring. One indicates a 100% probability of a transition taking place. Thus, for example, the matrix below tell us that there is a 5% probability of observing a transition from "high growth" to "moderate decline" in our sample. Examining transition rates could provide very valuable information about the process in analysis.

```{r}
seq.trate <- seqtrate(seq.cl)
round(seq.trate, 2)
```

Now we focus on the probably most important component of sequence analysis; that is, the calculation of dissimilarity. Recall our aim is to identify representative trajectories. To this end, we need a way to measure how similar or different sequences are - which is known as OM. Above, we described that we can use indel and substitution operations to measure the dissimilarity or costs between individual sequences. The code chunk implements OM based on indel and substitution operations. The algorithm takes an individual sequence and compares it with all of the sequences in the dataset, and identifies the sequence with the minimum cost i.e. the most similar sequence. The result of this computing intensive process is a distance matrix encoding the similarity or dissimilarity between individual sequences.

> For indel, `auto` sets the indel as `max(sm)/2` when sm is a matrix. For more details, run `?seqdist` on your console

```{r}
#| warning: false

# Calculate a distance matrix
seq.om <- seqdist(seq.cl,
                  method = "OM", # specify the method
                  indel = "auto", # specify indel costs
                  sm = subs_costs) # specify substitution costs
```

As highlighted above, if you would like to apply varying substitution costs, you can do this directly here by using the option `method = DHD` .

### Clustering

The resulting distance matrix from OM `seq.om` indicates the degree of similarity between individual sequences. To identify representative trajectories, we then need to a way to group together similar sequences to produce a typology, in this case of population decline trajectories. Unsupervised cluster analysis is generally used for this task. Trusting you have built an understanding of cluster analysis from the previous chapter, we will not provide an elaborate description here. If you would like to know more about cluster analysis, we recommend the introductory book by @kaufman2009finding. We use a clustering method called `k-meloids` . This methods is known to be more robust to noise and outliers than the conventional k-means procedure [@backman2020]. This is because the medoid algorithm clusters the data by minimising a sum of pair-wise dissimilarities [@kaufman2009finding], rather than a sum of squared Euclidean distances. We run cluster analyses at different numbers of `k` starting from 2 to 20.

```{r}
# run PAMs
for (k in 2:20)
  pam_sol <- pam(seq.om, k)
```

We then seek to determine the optimal number of clusters `k`. We use silhouette scores, but as we noted @sec-chp3, the optimal number of clusters is better determined by the user given the context and use case. It is an art. There is no wrong or right answer. As can be seen from the results below from the average silhouette score, two clusters is suggested as the optimal solution. However, we could argue that we gain very little from such coarse partition of the data. We suggest to take this as guidance and a starting point to look to identify an appropriate data partition. We suggest to visualise different solution and gain an understanding of what data get split and decide on whether the resulting patterns contribute to the understanding of the process at hand.

```{r}
# compute average silhouette scores for all 20 cluster solutions
asw <- numeric(20)
for (k in 2:20)
  asw[k] <- pam(seq.om, k) $ silinfo $ avg.width
  k.best <- which.max(asw)
  cat("silhouette-optimal number of clusters:", k.best, "\n")
  asw
```

We rerun and save the results for a 7k cluster partition. If you inspect the resulting data frame, it provides an identifier for each cluster. Each individual area is attributed to a cluster. Next the question that we seek to answer is what sort of pattern do these clusters capture?

```{r}
# rerun pam for k=7
pam_optimal <- pam(seq.om, 7)
```

### Visualising

To understand the representative patterns captured in our data partition, we use visualisation. There is a battery of different visualisation tools to extract information and identify distinctive features of the identified trajectories. We start by using *individual sequence plots* by trajectory type. They provide a visual representation of how individual areas in each trajectory type moves between states. Recall that we are capturing representative trajectories; hence, there is still quite a bit of variability in terms of the patterns encapsulated in each representative trajectory. Back to the individual sequence plots, each line in these plots represents an area. Time is displayed horizontally and colours encode different states - in our case of population change. Numbers on the y-axis display the number of areas in each cluster. The figure immediate below relies on the base `plot` library, and by default, it is not very visually appealing.

```{r}
# create individual sequence plots
par(mar = c(4, 4, 2, 2))  # set margins
seqplot(seq.cl,
        group = pam_optimal$clustering,
        type = "I",
        border = NA,
        cex.axis = 1.5,
        cex.lab = 1.5,
        sortv = seq.om)
```

We therefore switch to the R library `ggseqplot` which enables visualisation of sequence data based on `ggplot` functionalities. This package may provide more flexibility if we are more familiar with `ggplot`.

The figure below offers a clear representation of the systematic sequencing of states that each trajectory captures. It provides information on two key features of trajectories: sequencing and size. For example, trajectory 1 seems to capture a sequencing pattern of transitions from moderate population decline to stability and back to moderate decline. Trajectory 2 shows a pattern high population decline during the first few years and then consistent moderate decline. Trajectory 3 displays a predominant pattern of moderate population decline. Trajectory 4 represents patterns of areas experiencing decline with spells of high population decline. Trajectory 5 shows a pattern of decline in the first few years followed by moderate decline and decline again. Trajectory 6 shows a similar pattern with more prevalent spells of population decline across the entire period. Trajectory 7 displays a trend of temporary decline, with spells of population growth and stability. From these plots, you can also identify which trajectories tend to be more common. In our example, trajectory and 3 accounts the largest number of areas: 189.

```{r}
# create individual sequence plots based on ggplot
ggseqiplot(seq.cl,
        group = pam_optimal$clustering,
        sortv = seq.om,
        facet_ncol = 4) +
  scale_fill_manual(values = rev(cols)) +
  scale_color_manual(values = rev(cols))
```

We can also get a better understanding of the resulting trajectories by analysing *state frequency plots*. They show the number of occurrences of a given state in individual years. These plots examine the data from a vertical perspective i.e. looking at individual years across areas, rather than at individual areas over time. State frequency plots reveal that predominant states in each year and changes in their prevalence. Focusing on trajectory 1, for example, we observe that moderate decline was the predominant state between 2000 and 2007 and stability became equally prevalent during 2008 and 2015.

```{r}
# create state frequency plots based on ggplot
ggseqdplot(seq.cl,
        group = pam_optimal$clustering,
        facet_ncol = 4) +
  scale_fill_manual(values = cols) +
  scale_color_manual(values = cols)
```

We can also examine time spent in individual states in each trajectory. *Time spent plots* report the average time spent in each state. The measure of time depends on the original data used in the analysis. We use years so the y-axis refers to the average number of years that a given status appears in a representative trajectory type. For example, a score over 5 for stable in trajectory 1 indicates that the average number of years that areas in that typology are classified in that category is over 5.

```{r}
# create time spent plots based on ggplot
ggseqmtplot(seq.cl,
        group = pam_optimal$clustering,
        facet_ncol = 4) +
  scale_fill_manual(values = rev(cols)) +
  scale_color_manual(values = rev(cols)) +
  scale_x_discrete(labels=c("high_growth" = "HG", "growth" = "G",
                              "moderate_growth" = "MG", "stable" = "S", "moderate_decline" = "MD", "decline" = "D", "high_decline" = "HD" ))
```

Finally we analyse *entropy index plots*. Entropy is a measure of diversity. The greater the score, the greater the entropy or diversity of states. The plot below displays the entropy index computed for individual trajectories each year. Each line represents the entropy index for a trajectory in each year. The top yellow line in 2001 indicates that in 2001 areas following a trajectory 7 type were distributed across a larger number of states than any other trajectory, reflecting the fact that areas experiencing this trajectory moves between states of decline, stability and growth. In other word, it indicates that there was more diversity of states.

```{r}
# create entropy index plots based on ggplot
ggseqeplot(seq.cl,
        group = pam_optimal$clustering) +
  scale_colour_viridis_d()
```

## Questions

For the first assignment, we will continue to focus on London as our area of analysis. We will use population count estimates from the Office of National Statistics (ONS). The dataset provides information on area, population numbers and population density at national, regional and smaller sub-national area level, including Unitary Authority, Metropolitan County, Metropolitan District, County, Non-metropolitan District, London Borough, Council Area and Local Government District for the period from 2001 to 2020.

```{r}
#| warning: false
pop_df <- read_csv("./data/sequence-analysis/population_uk/population-uk-2011_20.csv")
pop_df
```

For the assignment, you should only work with smaller sub-national areas. Filter out country and regional area. You should address the following questions:

1.  Use sequence analysis to identify representative trajectories of population change and discuss the type of trajectories identified in London Boroughs.

2.  Use individual sequence plot to identify distinctive features in the resulting trajectories.

For the analysis, aim to focus on the area of London so you can link your narrative to the rest of analyses you will be conducting.

Ensure you justify the number of optimal clusters you will use in your analysis and provide a brief description of the trajectories identified. Describe how they are unique.