SolarPulse/SunspotAnalysis.Rmd at main · Git-With-Chris/SolarPulse · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Sunspot Analysis"
author: "Chris John"
header-includes:
  \usepackage{fvextra}
  \DefineVerbatimEnvironment{Highlighting}{Verbatim}{breaklines,commandchars=\\\{\}}
output:
  pdf_document:
    number_sections: true
fontsize: 11pt
geometry:
  [left=2cm,
  right=2cm,
  top=2.5cm,
  bottom=2cm]
link-citations: yes
linkcolor: blue
classoption: a4paper
---

\newpage

\tableofcontents

\newpage

# **Problem Definition**
The goal of this project is to analyze the yearly mean sunspot numbers dataset from 1700 to 2023, obtained from the Solar Influences Data Analysis Center of the Royal Observatory of Belgium (available [here](https://www.sidc.be/SILSO/infosnytot)). This dataset represents the yearly mean total sunspot number, calculated as the arithmetic mean of the daily total sunspot number for each year.

The project aims to provide insights into trends and develop a reliable model for the data. Sunspots, which are temporary phenomena on the Sun's photosphere appearing as darker spots, are crucial for understanding solar cycles with implications for space weather and climate. Understanding their patterns and periodicity is essential for these analyses.

# **Aim**
The primary objectives are as follows:

1. Conduct an initial exploratory data analysis to understand the characteristics, trends, and patterns present in the dataset. This analysis will involve examining summary statistics, visualizations such as time series plots, and identifying any notable features or anomalies.

2. Propose a set of possible models using various model specification tools such as Autocorrelation Function (ACF), Partial Autocorrelation Function (PACF), Extended Autocorrelation Function (EACF), and Bayesian Information Criterion (BIC) table.

3. Fit all the proposed models to the dataset to obtain parameter estimates. This step involves interpreting the estimated coefficients and assessing their significance.

4. Using the goodness-of-fit metrics such as Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Mean Squared Error (MSE), etc., to compare, select the best-fitted model among the set of proposed models and utilizing that best model to forecast for next 10 years.

# **Setup**
This section ensures that the necessary setup steps are performed before proceeding with the analysis, including setting the working directory and importing required packages for data manipulation, visualization, and time series analysis.This preparation step is crucial for conducting a smooth and effective analysis process.

## Setting Working Directory

```{r message=FALSE, warning=FALSE}
# Add user working directory path.
setwd("~/Documents/RMIT/TimeSeriesAnalysis/TSA-A3")

# Verify present working directory.
getwd()
```

## Dependencies

```{r message=FALSE}
# Code to install dependencies.
# install.packages(c('tidyverse', 'TSA', 'forecast', 'lmtest', 'tseries', 'urca'))

# Import Dependencies.
library(tidyverse)
library(TSA)
library(forecast)
library(lmtest)
library(urca)
library(tseries)
library(patchwork)

# Source Utility Script
source("./utilities.R")
```

Details on the functions used from the Utility Script (utilities.R) can be found in the Appendix section of this report.

# **Data**

The data set under analysis in this assignment represents yearly mean Sunspot Numbers, calculated as the arithmetic mean of the daily total sunspot number for each year from 1700-2023. Sourced from Solar Influences Data Analysis Center of the Royal Observatory of Belgium (available [here](https://www.sidc.be/SILSO/infosnytot)), the data set offers insights into long-term climate trends and variations.

```{r}
# Reading Data into R environment.
sunspot_data <- read.csv('./Resources/YearlySunspotData.csv', sep = ';', header = FALSE)

# Extract the first two columns
sunspot_data <- sunspot_data[, 1:2]

# Assign column names
colnames(sunspot_data) <- c("Year","MeanSunspotNumber")

# Display Sunspot Data
sunspot_data %>% head(15)
```

The data set consists of observations starting from 1700 to 2023 (constituting 324 years inclusive of the start and end terms). It is essential to check if all years are accounted for in the data set without inconsistent or null values as this might be an indication to an incomplete sequence.

```{r}
# Calculate number of years between 1700 - 2023 (324 years)
true_years = (2023 - 1700) + 1

# Verify year count with true year count
cat("Number of years between 1700 - 2023 (inclusive): ", true_years,
    "\nNumber of years accounted for out of 324 years: ", sunspot_data %>% nrow(),
    "\nCount of null values in the data set: ", sum(is.na(sunspot_data$MeanSunspotNumber)))
```


## Convert to TS Object

Converting CSV (Comma-Separated Values) data to a ts() object in R for time series analysis serves the purpose of enabling efficient manipulation, exploration, and modeling of time-dependent data using specialized functions and tools designed for such analyses.

Time series data inherently includes temporal information, such as time stamps or time intervals between observations. The conversion to a `ts()` object ensures that R can effectively recognize and leverage this temporal aspect of the data for conducting time-based analyses, thus avoiding the risk of overlooking or disregarding important temporal characteristics.

**Note**: The frequency for the observed series is set to 11 for two primary reasons. Firstly, the ACF plot (Figure 4.1) of the raw series reveals periodic cycles occurring roughly every 11 years. Secondly, detailed [domain analysis of solar cycles](https://www.space.com/solar-cycle-frequency-prediction-facts) corroborates this pattern, as each solar cycle is approximately 11 years in duration.

```{r}
# Convert data to time series object
sunspot_TS <- sunspot_data$MeanSunspotNumber %>% ts(frequency = 11)

# View first 20 years of the series
sunspot_TS %>% head(20)
```

# **EDA**

The Exploratory Data Analysis (EDA) section initiates the exploration and summarization of the time series data being analyzed. This involves verifying the data type of the time series object (sunspot_TS) to ensure its suitability for further analysis. Furthermore, summary statistics are provided to offer a comprehensive overview of the distribution and features of the mean sunspot numbers data.

## Exploration

```{r}
# Verify Data type
sunspot_TS %>% class()

# Summarize the time series object
sunspot_TS %>% summary()
```

```{r fig.align='center', fig.width=8, fig.height=3}
# Create box plot to visualize sunspot number
sunspot_TS %>% boxplot(horizontal = TRUE,
                       main = "Figure 1: Yearly Mean Sunspot Number (1700-2023)",
                       xlab = "Mean Sunspot Number",
                       border = "black")
# Add grid
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")
```

The summary statistics offer insight into the distribution of sunspots spanning from 1700 to 2023. The minimum value of 0.00 and a median value of 65.55 indicating an almost near even split where almost half of the recorded sunspots are below this value and the other half are above. These statistics provide a glimpse into the variability and range of mean sunspots over the analyzed time frame.

Moreover, the box plot (Figure 1) demonstrates a relatively uniform distribution of number of sunspot observations without many outliers. There are only 3 outliers observed which exhibits the largest sunspot areas within the series on the years 1778, 1957 and 1958.

## Visualizing data

In this section, a time series plot is created to depict the trajectory of sunspots over time. The goal is to extract meaningful insights regarding the data’s characteristics, including trends, seasonality, variance changes, potential change points, and overall data behavior. This step is crucial for informing the model selection process and guiding decision-making effectively.

```{r fig.align='center', fig.width=15, fig.height=7}
# Plot sunspot series across time
sunspot_TS %>% plot(type = 'o',
                    main = "Figure 2: Yearly Mean Sunspot Number (1700-2023)",
                    xlab = "Year",
                    ylab = "Mean Sunspot Number")

# Add grid
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")
```


## Plot Analysis

1. **Trend**:

    The plot indicates that there is minimal trend present within the sunspot number data. This initial visual assessment suggests that the series exhibits stationarity, a crucial characteristic of time series data. The lack of a clear trend and the apparent stability of the statistical properties in the plot support the hypothesis that the series is stationary or trendless (based on visual assessment). However, we do acknowledge that there is a possibility that the underlying seasonality of the data might be masking the trend.

2. **Seasonality**:

    The time series plot clearly exhibits seasonality in the data. Seasonality refers to patterns that repeat over specific periods, such as a year, quarter, or month. In this case, there is a distinct pattern of ups and downs recurring across the years, as evident in the graph. These regular fluctuations indicate that the data has a seasonal component, which is an important aspect to consider in time series analysis and forecasting. Recognizing seasonality helps in building more accurate models by accounting for these predictable changes.

3. **Changing Variance**:

    Visual inspection of the time series reveals no severe fluctuations in the variability of sunspot observations. Some fluctuations manifest as periodic bursts of high variability, followed by intervals with significantly lower variance. During these lower variance periods, the data points exhibit minimal dispersion. This pattern indicates that the sunspot observations experience cycles of high and low activity, which is crucial for understanding the underlying dynamics of the series.

4. **Behavior**:

    Due to the presence of seasonality, it is highly likely that the underlying pattern cannot be inferred solely from a visual assessment. The time series plot indicates strong and predominant autoregressive (AR) behavior, as most points exhibit a pattern where past observations influence future ones, suggesting a relationship between successive observations. However, there are minimal regions that display jumps and gaps, particularly near the peaks and tails of each cycle, indicating a possible moving average (MA) influence. This combination of AR behavior and occasional MA characteristics highlights the complex dynamics of the sunspot observations, where both past values and random shocks play a role in shaping the time series.

5. **Change/Intervention Point**:

    A change point can be defined as a juncture in the series where there is an abrupt and significant alteration in either the series behavior or trend. Based on this definition, it could be argued that no distinct change point is discernible in the series. The overall behavior of the series remains relatively consistent, without any abrupt shifts or notable deviations that would indicate a **significant** change in the underlying pattern or trend. This suggests that the series maintains its general characteristics throughout the observed period.

## Assessing Auto-Correlation

As observed in the previous section [Visualizing Data](#visualizing-data), the series exhibits indications of autocorrelation. To further explore and gain deeper insights into the data, we will conduct an autocorrelation analysis by examining the correlation of the series with its first and second lag.

```{r}
# Creating first and second lags for data
series <- sunspot_TS
first_lag <- zlag(sunspot_TS)
second_lag <- zlag(zlag(sunspot_TS))

# Setting index for correlation test
index1 <- 2:length(first_lag)
index2 <- 3:length(second_lag)

# Pearson's Correlation Test for first and second lag's
cor_first_lag <- cor(series[index1], first_lag[index1])
cor_second_lag <- cor(series[index2], second_lag[index2])

# Results of Pearson's correlation coefficient test
cat("Pearson's Correlation between series and first lag: ", cor_first_lag %>% round(4),
    "\nPearson's Correlation between series and second lag: ", cor_second_lag %>% round(4))

```

```{r fig.height=5, fig.width=15}
# Set up a 1x2 layout for plots
par(mfrow=c(1,2))

# Visualization of correlation between series and first lag
series[index1] %>% plot(first_lag[index1],
                        xlab = "First Lag of Sunspot's Series",
                        ylab = "Shares Series",
                        main = "Figure 3.1: Scatterplot of Sunspot's series and it's First Lag") %>%
                   text(x=40, y=260, col='blue',
                        labels=paste0("Correlation value: ", cor_first_lag %>% round(4)))

# Add grid
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")

# Visualization of correlation between series and second lag
series[index2] %>% plot(second_lag[index2],
                        xlab = "Second Lag of Sunspot's Series",
                        ylab = "Share's Series",
                        main = "Figure 3.2:Scatterplot of Sunspot's series and it's Second Lag") %>%
                   text(x=40, y=260, col='blue',
                        labels=paste0("Correlation value: ", cor_second_lag %>% round(4)))

# Add grid
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")
```

The values obtained from the Pearson's Correlation Test indicate strong positive correlations between the series and its first lag, and a moderately positive correlation with its second lag. Additionally, the scatter plots (Figure 3.1 & 3.2) visualize this correlation between the series and its first and second lags. Both plots depict a clear positive linear relationship, with the first lag showing a more dominant linear relationship. This suggests that the current values of the series are heavily influenced by their immediate past values, reinforcing the presence of auto-regressive behavior in the data.

## Assessing Stationarity

From the visual assessment of the sunspot series in the time series plot (Figure 2), there was no indication of a clear trend being present in the series. However, to confirm this observation, this section will include further analysis using various tests and tools. Specifically, we will analyze the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots, and conduct statistical tests for stationarity. These methods will provide a more rigorous evaluation of the series' properties and help determine if the initial visual assessment holds true.

### ACF and PACF plot

```{r fig.height=6, fig.width=17}
# Use utility script to plot ACF and PACF plot
sunspot_TS %>% plot_acf_pacf(acf_main = "Figure 4.1: Autocorrelation Function (ACF) of Yearly Sunspot Numbers",
                             pacf_main = "Figure 4.2: Partial Autocorrelation Function (PACF) of Yearly Sunspot Numbers",
                             max_lag = 70)
```

From the ACF plot (Figure 4.1), we do not observe a strict slowly decaying pattern, which is usually expected in a non-stationary series with a distinct trend. However, the ACF plot does show a decaying wave pattern, which strongly indicates seasonality or seasonal-trend in the series, aligning with our conclusion from the time series plot (Figure 2) that the series exhibits seasonality. The PACF plot (Figure 4.2) shows a very high first lag value, which could possibly indicate non-stationarity, but this high value could also be due to the underlying seasonality. Due to the lack of conclusive evidence, we need to employ statistical tests to confirm the presence or absence of stationarity in the series, providing a more definitive assessment.

**Note:** The above plots are generated using a utility script written specifically for the analysis to avoid any repetetive code chunks. Please refer to the Appendix section for further details.

### Statistical Tests for Sationarity

```{r warning=FALSE}
# Augmented Dickey-Fuller Test for stationarity
sunspot_TS %>% adf.test()

# Phillips-Perron Unit Root Test for stationarity
sunspot_TS %>% pp.test()
```

Both the Augmented Dickey-Fuller Test and Phillips-Perron Test yield p-values of lower than 0.01, well below the conventional threshold of 0.05, confirming that the yearly sunspot series is stationary. Given that the time series plot (Figure 2), the ACF plot (Figure 4.1), and both the Augmented Dickey-Fuller Test and Phillips-Perron Test show evidence of stationarity in the series, this collective evidence strongly indicates stationarity in our series.

## Assessing Normality

This section will investigate the presence of normality in the series, as normality is an underlying assumption of modeling methods like Maximum Likelihood Estimation (MLE), which will be employed in the modeling sections of this analysis. Assessing normality is crucial to ensure the validity of the modeling approach and to make accurate inferences from the data. To visually and statistically assess normality, we can use tools such as histograms, Q-Q plots, and normality tests like the Shapiro-Wilk test.

```{r fig.align='center', fig.width=16, fig.height=5}
# Set up a 1x2 layout for plots
par(mfrow=c(1,2))

# Plotting a Histogram for sunspot_TS
sunspot_TS %>% hist(main = "Figure 5.1: Histogram of Sunspot Series",
                    xlab = "Sunspot Values")

# Add grid
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")

# Plotting a QQ plot for sunspot_TS
sunspot_TS %>% qqnorm(main = "Figure 5.2: Normal QQ Plot of Sunspot Series")
sunspot_TS %>% qqline(col='blue', lty=2)

# Add grid
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")
```

```{r}
# Perform Shapiro-Wilks test for normality.
rawTS_shapiro <- sunspot_TS %>% shapiro.test()
rawTS_shapiro
```

The above histogram (Figure 5.1) does not show a clear normal distribution of the sunspot numbers. Similarly, the QQ-Plot (Figure 5.2) does not indicate normality due to the significant deviation of the observations from the reference line. The Shapiro-Wilk test further supports this, with a p-value of 4.853×10-12, which is significantly lower than the conventional threshold of 0.05. This strongly suggests that the series does not exhibit normality. Given this lack of normality, appropriate transformations of the data may need to be considered to improve the normality of the series.

# **Data Transformation**

In order to improve the normality of the series, we will be using the Box-Cox transformation to find an optimal transformation for the series. This process will include determining an optimal lambda value and transforming the data based on this value.

## Box-Cox Transformation

```{r fig.height=6, fig.width=8, fig.align='center', warning=FALSE}
# Adding a small positive value to ensure non-zero data
positive_sunspot_TS <- sunspot_TS + (abs(min(sunspot_TS)) + 0.01)

# Applying Box-Cox transformation
BC <- positive_sunspot_TS %>% BoxCox.ar(lambda = seq(-2, 2, 0.01), method = 'yule-walker')

# Add grid
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")

# Add main title separately
title(main = 'Figure 6: Optimal Lambda value', adj = 0.5, line = 1.4)
```

```{r}
# Use Log-likelihood estimation to find optimal lambda values
lambda <- BC$lambda[which(max(BC$loglike) == BC$loglike)]

# Print results
cat("Lower bound Confidence Interval: ", BC$ci[1],
    "\nUpper bound Confidence Interval: ", BC$ci[2],
    "\n\nOptimal Lambda value: ", lambda)
```

The optimal lambda value for the Box-Cox transformation is determined using log-likelihood estimation. The Box-Cox transformation function generates a range of lambda values from -2 to 2 with an increment of 0.01. For each lambda value, the log-likelihood of the transformed data is computed. The lambda value corresponding to the highest log-likelihood is considered optimal as it indicates the transformation that best fits the data.

Regarding the confidence intervals, they provide insights into the reliability of the transformation. The confidence intervals, typically derived from statistical tests, indicate a range of lambda values within which the transformation is deemed statistically valid or effective.

These values suggest that lambda values between approximately 0.44 and 0.56 are likely to produce meaningful transformations of the data. Therefore, for the subsequent analysis phases, the optimal lambda value of approximately 0.49 should be considered for applying the Box-Cox transformation.

```{r fig.height=15, fig.width=19, fig.align='center'}
# Set up a 1x2 layout for plots
par(mfrow=c(3,2))

# Plot original sunspot series across time
sunspot_TS %>% plot(type = 'o',
                    main = "Figure 7.1: Original Sunspot series (1850-2023)",
                    xlab = "Year",
                    ylab = "Temperature Anomalies (°C)")

# Add grid
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")

# Plot Box-Cox Transformed sunspot series across time
BC_sunspot_TS <- ((positive_sunspot_TS^lambda) - 1) / lambda

BC_sunspot_TS %>% plot(type = 'o',
                       main = 'Figure 7.2: Box-Cox Transformed Sunspot series (1850-2023)',
                       xlab = 'Year',
                       ylab = 'Transformed Temperature anomalies (°C)')

# Add grid
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")

# Test for normality with Box-Cox transformed series.
BC_TS_Stest <- BC_sunspot_TS %>% shapiro.test()

# Plotting a Histogram for sunspot_TS
sunspot_TS %>% hist(main = "Figure 7.3: Histogram of Original Sunspot Series",
                    xlab = "Sunspot Values")

# Plotting a Histogram for sunspot_TS
BC_sunspot_TS %>% hist(main = "Figure 7.4: Histogram of Transformed Sunspot Series",
                       xlab = "Sunspot Values")

# Plotting a QQ plot for sunspot_TS
sunspot_TS %>% qqnorm(main = "Figure 7.5: Normal QQ Plot of Original Sunspot series")
sunspot_TS %>% qqline(col='blue', lty=2) %>%
text(x = 1.4, y = 3,
     labels = paste0("Shapiro-Wilks (p-value): ", rawTS_shapiro[2]),
     col='blue', cex=1.5)
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")

# QQ-Plot Box-Cox transformed series
BC_sunspot_TS %>% qqnorm(main = "Figure 7.6: Normal QQ Plot of Transformed Sunspot series")
BC_sunspot_TS %>% qqline(lty=2, col='blue') %>%
text(x = 1.2, y = -1.7,
     labels = paste0("Shapiro-Wilks (p-value): ", BC_TS_Stest[2]),
     col='blue', cex=1.5)
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")
```

Although the Box-Cox transformation does not fully achieve normality for the data, it does result in a significant improvement in the normality of the series. The improved normality of the series can be observed through the QQ-Plot of the original and the transformed series (Figure 7.3 & 7.4) and also from the histogram of the respective series. Initially, the Shapiro-Wilk test produced a p-value of 4.853e-12, which is extremely low and indicates a strong deviation from normality.

After applying the Box-Cox transformation, the Shapiro-Wilk p-value increased to 0.0004. This substantial increase in the p-value reflects a notable enhancement in the normality of the series, making it more suitable for modeling methods that assume normality, such as Maximum Likelihood Estimation (MLE). Given this information we will proceed using the `BC_sunspot_TS` for the successive part of the analysis.

**Note**: We do not conduct a stationarity test on the transformed series because we expect the stationarity to remain unaffected by the transformation. The transformation is performed primarily for two reasons: first, to address any variability in our series, and second, to improve the normality of the series. Since the series is already stationary and no ordinary differencing is performed, the stationarity will remain unchanged.

# **Model Specification**

Through the initial exploratory data analysis, we identified a strong seasonal component in the series. The ACF and PACF plots of the non-transformed data further supported this observation. Given the presence of seasonality, we have several options for modeling the sunspot number series. For this analysis, we will proceed with a Seasonal AutoRegressive Integrated Moving Average (SARIMA) modeling approach. This approach allows us to account for both the seasonal and non-seasonal patterns in the data, providing a comprehensive framework for accurate modeling and forecasting of the sunspot numbers.

The SARIMA model can be expressed as follows: **\( \text{SARIMA}(p,d,q)(P,D,Q)_s \)**

where:

  - \( p \) is the order of the non-seasonal autoregressive (AR) component,
  - \( d \) is the order of non-seasonal differencing,
  - \( q \) is the order of the non-seasonal moving average (MA) component,
  - \( P \) is the order of the seasonal autoregressive (SAR) component,
  - \( D \) is the order of seasonal differencing,
  - \( Q \) is the order of the seasonal moving average (SMA) component, and
  - \( s \) is the length of the seasonal cycle i.e. periodicity.

## Exclusion of Trend Models: Rationale

In the analysis of the sunspot number series, we have opted not to use trend models for several reasons. Instead, we chose the Seasonal AutoRegressive Integrated Moving Average (SARIMA) model, which is better suited for capturing the underlying patterns in the data.

Through initial exploratory data analysis and visual inspection of the time series plot, it became evident that the sunspot numbers exhibit a strong seasonal pattern. The Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots of the non-transformed data further confirmed the presence of seasonality, displaying a decaying wave pattern typical of seasonal data. In contrast, trend models primarily focus on capturing long-term trends and are not designed to account for seasonality effectively.

The sunspot series, known for its periodic fluctuations corresponding to solar cycles, exhibits cycles **approximately** every 11 years, though this period can vary between 9 and 14 years. This variability in the natural phenomenon necessitates a model that can explicitly account for these changing cycles. Traditional trend models lack the capability to capture such cyclical patterns, resulting in inadequate representations and potentially misleading forecasts.

The SARIMA model is specifically designed to handle time series data with both seasonal and non-seasonal components. It incorporates terms to address autoregressive behavior, moving averages, and differencing for both seasonal and non-seasonal parts of the series. This makes it a comprehensive and flexible approach for modeling the complex dynamics observed in the sunspot data. By using SARIMA, we can accurately capture the periodicity, autoregressive influences, and any random shocks present in the data.

## Addressing Seasonality

In this section, we aim to determine the optimal seasonal orders for the SARIMA model, which include the seasonal autoregressive (AR) order (P), seasonal differencing (D), seasonal moving average (MA) order (Q), and the periodicity (s). Accurately identifying these parameters is crucial for capturing the seasonal patterns in the time series data.

To achieve this, we will employ a **residuals-based approach**. This method involves fitting a series of SARIMA models with different combinations of seasonal orders to the time series data. Each model will be evaluated based on how well it captures the underlying seasonal patterns. After fitting each model, we will analyze the residuals for any signs of remaining seasonality. Ideally, the residuals will not show any seasonal pattern, indicating that the model has successfully captured the **seasonal** structure of the data.

We will use tools such as Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots to examine the residuals. Through this iterative process, we aim to select the most appropriate seasonal orders that provide the best fit for the data. This ensures that the final SARIMA model is robust and capable of accurately capturing the seasonal dynamics present in the time series.


```{r fig.height=6, fig.width=19, fig.align='center'}
# Plot ACF and PACF plots BC_sunspot_TS
BC_sunspot_TS %>%
  plot_acf_pacf(acf_main = "Figure 8.1: Autocorrelation Function (ACF) of Transformed Sunspot Numbers",
                pacf_main = "Figure 8.2: Partial Autocorrelation Function (PACF) of Transformed Sunspot Numbers",
                max_lag = 60)
```

The above ACF (Figure 8.1) of the transformed sunspot series further confirms the data's periodicity, with the most significant lags predominantly occurring around the 11-year mark.

### Seasonal Differencing (D)

In this section, we will perform seasonal differencing on the time series data using a SARIMA(0,0,0)(0,1,0)[11] model. This model has all parameters set to zero except for the seasonal differencing order (D), which is set to 1. The purpose of this configuration is to remove any seasonal pattern with a periodicity of 11 units from the data.

By setting the seasonal differencing component (D) to 1, we expect the residuals of the model to show no remaining seasonal patterns, indicating that the seasonal effects have been effectively removed. This process helps in isolating the underlying trends and irregular components of the time series.

It is crucial to note that setting higher orders for the seasonal differencing parameter (D) can lead to data loss and may result in over-differencing. Over-differencing can distort the data and obscure meaningful patterns, making the time series analysis less reliable. Therefore, careful consideration is required when selecting the order of differencing to ensure that the seasonal effects are adequately removed without compromising the integrity of the data.

```{r fig.align='center', fig.height=7, fig.width=15, message=FALSE, warning=FALSE}
# Checking residuals of seasonally differenced model - SARIMA(0,0,0)(0,1,0)[11]
BC_sunspot_TS %>%
  check_residuals(order = c(0,0,0),
                  seasonal_order = c(0,1,0),
                  period = 11,
                  acf_lag = 50,
                  pacf_lag = 50,
                  res_main = 'Figure 9.1: Residuals of model: SARIMA(0,0,0)(0,1,0)[11]',
                  hist_main = 'Figure 9.2: Histogram of residuals',
                  qq_main = 'Figure 9.3: QQ-Plot of residuals',
                  acf_main = 'Figure 9.4: ACF plot of residuals',
                  pacf_main = 'Figure 9.5: PACF plot of residuals')
```

We observe that seasonality in the series is significantly reduced, as evidenced by the residuals, which exhibit minimal signs of seasonality. However, it's important to note that not all stochastic components have been fully addressed as evidenced by the time series plot (Figure 9.1) and Ljung-Box Plot. This is also indicated by the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots, which show significant autocorrelation at the seasonal lags. To address this remaining seasonality, we need to determine the appropriate seasonal Autoregressive (AR) and Moving Average (MA) orders for the series.

From the ACF plot, we observe three significant seasonal lags at the first and second seasonal periods. Given this information, the potential orders for the seasonal MA component (Q) is likely to be 2. The PACF plot suggests that the seasonal AR order (P) is likely to be 0 or 1.

### Seasonal AR and MA orders

```{r fig.align='center', fig.height=7, fig.width=15, message=FALSE, warning=FALSE}
# Checking residuals of seasonally differenced model with P and Q orders - SARIMA(0,0,0)(1,1,2)[11]
BC_sunspot_TS %>% check_residuals(order = c(0,0,0),
                                  seasonal_order = c(1,1,2),
                                  period = 11,
                                  acf_lag = 50,
                                  pacf_lag = 50,
                                  res_main = 'Figure 10.1: Residuals of model: SARIMA(0,0,0)(1,1,2)[11]',
                                  hist_main = 'Figure 10.2: Histogram of residuals',
                                  qq_main = 'Figure 10.3: QQ-Plot of residuals',
                                  acf_main = 'Figure 10.4: ACF plot of residuals',
                                  pacf_main = 'Figure 10.5: PACF plot of residuals')
```

We can observe from the ACF (Figure 10.4) and PACF plot (Figure 10.5) there are no more significant spikes at the seasonal lags. The residuals plot also shows significant reduction in the seasonality when compared to the original series. After experimenting with possible combinations of the P and Q derived from the previous section of the analysis. The SARIMA(0,0,0)(1,1,2)[11] model shows the best results in addressing the seasonal component underlying in the data.

**Given this information we, set the seasonal orders (P,D,Q) as (1,1,2)**

## Ordinary Differencing (d)

From the above [section](#seasonal-ar-and-ma-orders), once we remove the predominant seasonal behaviour inherent in the data, we see that the series does show signs of a trend which is assessed using the ACF (Figure 10.4) and PACF (Figure 10.5) plots. The ACF plot shows a set of gradually decaying lags while the PACF show a significantly high first lag, both indications of a possible pattern carrying over in the residuals which is addressed through ordinary differencing below.

```{r fig.align='center', fig.height=7, fig.width=15, message=FALSE, warning=FALSE}
# Checking residuals of seasonally differenced model with P and Q orders - SARIMA(0,1,0)(1,1,2)[11]
BC_sunspot_TS %>%
  check_residuals(order = c(0,1,0),
                  seasonal_order = c(1,1,2),
                  period = 11,
                  acf_lag = 70,
                  pacf_lag = 70,
                  res_main = 'Figure 11.1: Residuals of model: SARIMA(0,1,0)(1,1,2)[11]',
                  hist_main = 'Figure 11.2: Histogram of residuals',
                  acf_main = 'Figure 11.3: ACF plot of residuals',
                  pacf_main = 'Figure 11.4: PACF plot of residuals')
```

After introducing ordinary differencing (d=1) we can see a significant improvement of the residuals. The time series plot (Figure 11.1) does not show any significant seasonal patterns and is well centered around the zero line. We can also see that the ACF no longer features a deacying pattern and the PACF no longer has a significant first lag. Additionally, we see can also observe some improvement in the Ljung box plot which was not observed earlier in any of the plots.

Two interesting insights:

1. We can see in the histogram (Figure 11.2) that the standardized residuals do not conform to the expected -3 to 3 range, which likely indicates outliers in the residuals. From the initial [EDA Exploration section](#exploration) we can see that these residuals are resultant from the outliers seen in our series for the years 1778, 1957 and 1958.

2. We can see a significant lag that appears at the first season after applying ordinary differencing, given that we have addressed the seasonality effectively with optimal seasonal order we assume that this shows up due to the trend of the series rather an indication of leftover seasonality. We expect the successive steps to give us more information on if this is a result of an underlying stochastic component other than seasonality.

**In conclusion we will now proceed to find the ordinary Auto-Regressive (p) and Moving-Average (q) order with the above set orders which are as follows -> SARIMA(p,1,q)(1,1,2)[11]**

**Note:** The residuals from this model (SARIMA(0,1,0)(1,1,2)[11]) will hereafter be referred to as **base residuals** in this report.

## Determining Potential Models

We will now conduct a thorough analysis to determine potential models to accurately forecast future sunspot numbers. To find the sets of possible models we will use tools such as the ACF plot, PACF Plot, EACF Plot and the BIC Table. These test/tools will be applied over the base residuals we determined from the SARIMA(0,1,0)(1,1,2)[11] model.

### ACF and PACF Analysis of residuals

```{r fig.height=6, fig.width=19, fig.align='center'}
# Store the residuals of the SARIMA(0,1,0)(1,1,2)[11] model
base_residuals <- BC_sunspot_TS %>%
  Arima(order = c(0,1,0),
        seasonal = list(order=c(1,1,2),
                        period = 11)) %>% rstandard()

# Plot ACF and PACF of the base residuals
base_residuals %>%
  plot_acf_pacf(acf_main = 'Figure 12.1: ACF of SARIMA(0,1,0)(1,1,2)[11] residuals',
                pacf_main = 'Figure 12.2: PACF of SARIMA(0,1,0)(1,1,2)[11] residuals')

```
From the analysis of the ACF plot (Figure 12.1) we can observe 7 significant lags and 9 significant lags from the PACF plot (Figure 12.2).

**Set of possible models:**

- SARIMA(9,1,7)(1,1,2)[11]

### EACF Analysis

```{r}
# Plot EACF
base_residuals %>% eacf(ar.max = 15, ma.max = 15)
```

In the EACF table, each cell represents a combination of AR and MA orders. The "**x**" symbol indicates that the corresponding combination of orders is not suitable for modeling the series, while the "**o**" symbol suggests that the combination may be viable.

When interpreting the EACF table, attention is typically focused on patterns of "o" symbols. Successive "o" symbols near a particular cell suggest that the corresponding combination of AR and MA orders is supported by the data. Conversely, the presence of "x" symbols nearby indicates potential inadequacies in the corresponding model orders.

In the provided EACF table, the "o" symbols suggest possible combinations of AR and MA orders that are worth considering for modeling the series. In this case, the top model orders identified as promising are 0 and 10, indicating that a model with zero AR terms and ten MA terms may be suitable. Additionally, neighboring cells with "o" symbols provide further options to explore for extending the set of possible models.

**Set of possible models**:

- SARIMA(0,1,10)(1,1,2)[11]

- SARIMA(0,1,11)(1,1,2)[11]

- SARIMA(1,1,11)(1,1,2)[11]

### BIC Table

```{r fig.align='center', fig.height=6, fig.width=10, message=FALSE, warning=FALSE}
# Create BIC table
bic_table <- base_residuals %>% armasubsets(nar =13, nma = 13,
                                            y.name = 'p',
                                            ar.method = 'ols')

# Plot BIC Table without main title
bic_table %>% plot()

# Add main title separately
title(main = 'Figure 13: BIC Table of possible model orders', adj = 0.5, line = 5.7)
```

The BIC Table (Figure 13) presents the outcomes of fitting various ARMA (AutoRegressive Moving Average) models to the time series data. It displays different combinations of AutoRegressive (AR) and Moving Average (MA) orders alongside their respective BIC values.

The selection of `nar` and `nma` parameters profoundly influences the ARMA model's goodness-of-fit to the data. A balance must be struck: too few terms may lead to underfitting, missing crucial data patterns, while an excess of terms may cause overfitting, capturing noise instead of genuine patterns.

In our analysis, based on the BIC Table (Figure 13), the "p-lag" column signifies the AR(p) orders, while the "error-lag" column represents the MA(q) orders. To avoid excessively large models, we capped `nar` and `nma` at 10, considering the highest orders estimated from the ACF, PACF, and EACF, which was 9 (EACF).

Based on the BIC table, the optimal model indicates an order of p=9 and q=10. However, it is noteworthy that the AR(3) and AR(6) terms consistently appear in at least four of the top models. Similarly, the MA(2) and MA(5) terms also feature in at least four of the top models. This consistency suggests that these specific AR and MA terms may play a significant role in accurately capturing the underlying structure of the time series data.

Hence, we have four potential AR orders: (3, 5, 6, 11) and four possible MA orders: (2, 5, 8, 10).

**Set of possible models (p, d, q)**: SARIMA(3,1,2)(1,1,2)[11], SARIMA(3,1,5)(1,1,2)[11], SARIMA(3,1,8)(1,1,2)[11], SARIMA(3,1,10)(1,1,2)[11], SARIMA(5,1,2)(1,1,2)[11], SARIMA(5,1,5)(1,1,2)[11], SARIMA(5,1,8)(1,1,2)[11], SARIMA(5,1,10)(1,1,2)[11], SARIMA(6,1,2)(1,1,2)[11], SARIMA(6,1,5)(1,1,2)[11], SARIMA(6,1,8)(1,1,2)[11], SARIMA(6,1,10)(1,1,2)[11], SARIMA(11,1,2)(1,1,2)[11], SARIMA(11,1,5)(1,1,2)[11], SARIMA(11,1,8)(1,1,2)[11], SARIMA(11,1,10)(1,1,2)[11]

## Set of possible model SARIMA(p,d,q)(P,D,Q)[s]

Based on the comprehensive analysis conducted in the [model specification](#model-specification) section, we have identified a set of 20 possible models. These models, along with their corresponding ARIMA specifications, are as follows:

- SARIMA(9,1,7)(1,1,2)[11]

- SARIMA(0,1,10)(1,1,2)[11]

- SARIMA(0,1,11)(1,1,2)[11]

- SARIMA(1,1,11)(1,1,2)[11]

- SARIMA(3,1,2)(1,1,2)[11]

- SARIMA(3,1,5)(1,1,2)[11]

- SARIMA(3,1,8)(1,1,2)[11]

- SARIMA(3,1,10)(1,1,2)[11]

- SARIMA(5,1,2)(1,1,2)[11]

- SARIMA(5,1,5)(1,1,2)[11]

- SARIMA(5,1,8)(1,1,2)[11]

- SARIMA(5,1,10)(1,1,2)[11]

- SARIMA(6,1,2)(1,1,2)[11]

- SARIMA(6,1,5)(1,1,2)[11]

- SARIMA(6,1,8)(1,1,2)[11]

- SARIMA(6,1,10)(1,1,2)[11]

- SARIMA(11,1,2)(1,1,2)[11]

- SARIMA(11,1,5)(1,1,2)[11]

- SARIMA(11,1,8)(1,1,2)[11]

- SARIMA(11,1,10)(1,1,2)[11]

# **Parameter Estimation**

This section evaluates 20 candidate models to assess the significance of their coefficients. We anticipate the most suitable model to exhibit consistently significant coefficients. Three parameter estimation methods are employed: Least Squares (CSS), Maximum Likelihood (ML), and a combined approach (CSS-ML).

The analysis leverages functions within the utilities.R script. Refer to the [Appendix](#appendix) for details on these functions and their implementation.

## Model 1: SARIMA(9,1,7)(1,1,2)[11]

```{r fig.align='center', fig.height=7, fig.width=15, message=FALSE, warning=FALSE}
model_917 <- BC_sunspot_TS %>% arima_summary(pdq_order = c(9,1,7),
                                             seasonal_order = c(1,1,2),
                                             period = 11)

BC_sunspot_TS %>%
  check_residuals(order = c(9,1,7),
                  seasonal_order = c(1,1,2),
                  period = 11,
                  acf_lag = 70,
                  pacf_lag = 70,
                  res_main = 'Figure 14.1: Residuals of model: SARIMA(9,1,7)(1,1,2)[11]',
                  hist_main = 'Figure 14.2: Histogram of residuals',
                  acf_main = 'Figure 14.3: ACF plot of residuals',
                  pacf_main = 'Figure 14.4: PACF plot of residuals')
```

The SARIMA(9,1,7)(1,1,2)[11] model was inferred from the ACF and PACF plots. The parameter estimation was conducted using three methods: CSS, ML, and CSS-ML. In the CSS and ML methods, several parameters showed NaN values, indicating issues with parameter estimation. However, the CSS-ML method provided complete parameter estimates, with most parameters being significant. Notably, higher-order AR and MA terms were significant, suggesting their necessity to capture the underlying series dynamics.

The residuals analysis showed a near-normal distribution, with a histogram indicating one outlier and a QQ-Plot confirming the near-normality of residuals. The ACF and PACF plots revealed one significant lag near the 5th season, but the Ljung-Box test results indicated that the residuals follow a white noise pattern. This suggests the model adequately captures the data's underlying patterns. Despite the presence of some `NaN` values in individual methods, the overall model, particularly through the CSS-ML method, is robust, with significant higher-order terms and residuals behaving as expected.

## Model 2: SARIMA(0,1,10)(1,1,2)[11]

```{r fig.align='center', fig.height=7, fig.width=15, message=FALSE, warning=FALSE}
model_0110 <- BC_sunspot_TS %>% arima_summary(pdq_order = c(0,1,10),
                                             seasonal_order = c(1,1,2),
                                             period = 11)

BC_sunspot_TS %>%
  check_residuals(order = c(0,1,10),
                  seasonal_order = c(1,1,2),
                  period = 11,
                  acf_lag = 70,
                  pacf_lag = 70,
                  res_main = 'Figure 15.1: Residuals of model: SARIMA(0,1,10)(1,1,2)[11]',
                  hist_main = 'Figure 15.2: Histogram of residuals',
                  acf_main = 'Figure 15.3: ACF plot of residuals',
                  pacf_main = 'Figure 15.4: PACF plot of residuals')
```

In the SARIMA(0,1,10)(1,1,2)[11] model, the majority of the parameters are highly significant. Similar to the previous model, the insignificant parameters tend to be lower-order MA terms, which is consistent across all three estimation methods (CSS, ML, and CSS-ML).

The residual diagnostics plots show a near-normal distribution of the standardized residuals, with a persistent outlier. While the ACF and PACF plots indicate some slightly significant lags, the Ljung-Box test results show all p-values above the conventional 0.05 threshold, indicating that the residuals follow a white noise pattern. This suggests the model adequately captures the underlying data patterns.

## Model 3: SARIMA(0,1,11)(1,1,2)[11]

```{r fig.align='center', fig.height=7, fig.width=15, message=FALSE, warning=FALSE}
model_0111 <- BC_sunspot_TS %>% arima_summary(pdq_order = c(0,1,11),
                                             seasonal_order = c(1,1,2),
                                             period = 11)

BC_sunspot_TS %>%
  check_residuals(order = c(0,1,11),
                  seasonal_order = c(1,1,2),
                  period = 11,
                  acf_lag = 70,
                  pacf_lag = 70,
                  res_main = 'Figure 16.1: Residuals of model: SARIMA(0,1,11)(1,1,2)[11]',
                  hist_main = 'Figure 16.2: Histogram of residuals',
                  acf_main = 'Figure 16.3: ACF plot of residuals',
                  pacf_main = 'Figure 16.4: PACF plot of residuals')
```

In the SARIMA(0,1,11)(1,1,2)[11] model, across all three estimation methods (CSS, ML, and CSS-ML), the ma7 and ma8 terms are consistently insignificant, whereas all other MA terms, including higher-order terms like ma9, ma10, and ma11, are highly significant. The ACF and PACF plots show minimal significant lags, which can be neglected as the Ljung-Box test confirms that the series follows a white noise pattern at all lags, with p-values above the conventional 0.05 threshold.

The residuals' normality is not perfect, likely due to the outlier observed in the data. This is further supported by the Shapiro-Wilk test, which fails to accept the normality of the standardized residuals.

## Model 4: SARIMA(1,1,11)(1,1,2)[11]

```{r fig.align='center', fig.height=7, fig.width=15, message=FALSE, warning=FALSE}
model_1111 <- BC_sunspot_TS %>% arima_summary(pdq_order = c(1,1,11),
                                             seasonal_order = c(1,1,2),
                                             period = 11)

BC_sunspot_TS %>%
  check_residuals(order = c(1,1,11),
                  seasonal_order = c(1,1,2),
                  period = 11,
                  acf_lag = 70,
                  pacf_lag = 70,
                  res_main = 'Figure 17.1: Residuals of model: SARIMA(1,1,11)(1,1,2)[11]',
                  hist_main = 'Figure 17.2: Histogram of residuals',
                  acf_main = 'Figure 17.3: ACF plot of residuals',
                  pacf_main = 'Figure 17.4: PACF plot of residuals')
```

Similar to the previous models, the SARIMA(1,1,11)(1,1,2)[11] model shows highly significant higher-order parameters. This model also does not exhibit normal behavior in the residuals. Only one significant lag is observed in the ACF and PACF plots, which is neglected as it does not indicate significant correlation, as confirmed by the Ljung-Box test.

## Model 5: SARIMA(3,1,2)(1,1,2)[11]

```{r fig.align='center', fig.height=7, fig.width=15, message=FALSE, warning=FALSE}
model_312 <- BC_sunspot_TS %>% arima_summary(pdq_order = c(3,1,2),
                                             seasonal_order = c(1,1,2),
                                             period = 11)

BC_sunspot_TS %>%
  check_residuals(order = c(3,1,2),
                  seasonal_order = c(1,1,2),
                  period = 11,
                  acf_lag = 70,
                  pacf_lag = 70,
                  res_main = 'Figure 18.1: Residuals of model: SARIMA(3,1,2)(1,1,2)[11]',
                  hist_main = 'Figure 18.2: Histogram of residuals',
                  acf_main = 'Figure 18.3: ACF plot of residuals',
                  pacf_main = 'Figure 18.4: PACF plot of residuals')
```

The SARIMA(3,1,2)(1,1,2)[11] model is comparatively smaller than its predecessors. Visual assessment of the histogram and QQ plot shows some signs of normality, but this is not supported by the Shapiro-Wilk test. The ACF and PACF plots reveal many significant lags, which cannot be neglected, as the Ljung-Box test indicates a significant number of lags fall below the conventional 0.05 value. This suggests that there is autocorrelation left in the residuals that the model fails to capture.

## Model 6: SARIMA(3,1,5)(1,1,2)[11]

```{r fig.align='center', fig.height=7, fig.width=15, message=FALSE, warning=FALSE}
model_315 <- BC_sunspot_TS %>% arima_summary(pdq_order = c(3,1,5),
                                             seasonal_order = c(1,1,2),
                                             period = 11)

BC_sunspot_TS %>%
  check_residuals(order = c(3,1,5),
                  seasonal_order = c(1,1,2),
                  period = 11,
                  acf_lag = 70,
                  pacf_lag = 70,
                  res_main = 'Figure 19.1: Residuals of model: SARIMA(3,1,5)(1,1,2)[11]',
                  hist_main = 'Figure 19.2: Histogram of residuals',
                  acf_main = 'Figure 19.3: ACF plot of residuals',
                  pacf_main = 'Figure 19.4: PACF plot of residuals')
```

The SARIMA(3,1,5)(1,1,2)[11] model has very few significant parameters. Additionally, the ACF and PACF plots reveal multiple significant lags. The Ljung-Box test confirms that these lags indicate signs of autocorrelation due to their low p-values, suggesting that the model does not adequately capture all patterns in the data. Furthermore, the residuals do not exhibit normality, as supported by both visual assessments and formal tests. This indicates that the model may not be the best fit for the series, as it leaves behind significant autocorrelation and fails to produce normally distributed residuals.

## Model 7: SARIMA(3,1,8)(1,1,2)[11]

```{r fig.align='center', fig.height=7, fig.width=15, message=FALSE, warning=FALSE}
model_318 <- BC_sunspot_TS %>% arima_summary(pdq_order = c(3,1,8),
                                             seasonal_order = c(1,1,2),
                                             period = 11)

BC_sunspot_TS %>%
  check_residuals(order = c(3,1,8),
                  seasonal_order = c(1,1,2),
                  period = 11,
                  acf_lag = 70,
                  pacf_lag = 70,
                  res_main = 'Figure 20.1: Residuals of model: SARIMA(3,1,8)(1,1,2)[11]',
                  hist_main = 'Figure 20.2: Histogram of residuals',
                  acf_main = 'Figure 20.3: ACF plot of residuals',
                  pacf_main = 'Figure 20.4: PACF plot of residuals')
```

The SARIMA(3,1,8)(1,1,2)[11] model demonstrates a regression in normality, as indicated by both visual assessments and formal tests. The histogram and QQ plot suggest deviations from normality, which is confirmed by the Shapiro-Wilk test. Additionally, the Ljung-Box test results show several lags with p-values falling below the 0.05 threshold, indicating the presence of autocorrelation that the model fails to capture. Despite these issues, the model does have almost all significant parameters, suggesting that it is able to capture some of the underlying patterns in the series. However, the remaining autocorrelation and non-normal residuals suggest that the model might not be fully adequate in capturing the complexities of the data.

## Model 8: SARIMA(3,1,10)(1,1,2)[11]

```{r fig.align='center', fig.height=7, fig.width=15, message=FALSE, warning=FALSE}
model_3110 <- BC_sunspot_TS %>% arima_summary(pdq_order = c(3,1,10),
                                             seasonal_order = c(1,1,2),
                                             period = 11)

BC_sunspot_TS %>%
  check_residuals(order = c(3,1,10),
                  seasonal_order = c(1,1,2),
                  period = 11,
                  acf_lag = 70,
                  pacf_lag = 70,
                  res_main = 'Figure 21.1: Residuals of model: SARIMA(3,1,10)(1,1,2)[11]',
                  hist_main = 'Figure 21.2: Histogram of residuals',
                  acf_main = 'Figure 21.3: ACF plot of residuals',
                  pacf_main = 'Figure 21.4: PACF plot of residuals')
```

The SARIMA(3,1,10)(1,1,2)[11] model exhibits a notable improvement in the normality of residuals, although it still falls slightly below the conventional threshold in the Shapiro-Wilk test. Visual assessments, including the histogram and QQ plot, indicate closer adherence to normality. The ACF and PACF plots reveal only two near-significant lags, which can be reasonably neglected. The Ljung-Box test confirms that there is minimal autocorrelation left uncaptured by the model.

This model demonstrates that higher-order terms are highly significant in effectively capturing the series' behavior. The improved normality of residuals and the reduction in significant autocorrelation suggest that the SARIMA(3,1,10)(1,1,2)[11] model is more adept at fitting the data compared to previous models, despite minor deviations in normality.

## Model 9: SARIMA(5,1,2)(1,1,2)[11]

```{r fig.align='center', fig.height=7, fig.width=15, message=FALSE, warning=FALSE}
model_512 <- BC_sunspot_TS %>% arima_summary(pdq_order = c(5,1,2),
                                             seasonal_order = c(1,1,2),
                                             period = 11)

BC_sunspot_TS %>%
  check_residuals(order = c(5,1,2),
                  seasonal_order = c(1,1,2),
                  period = 11,
                  acf_lag = 70,
                  pacf_lag = 70,
                  res_main = 'Figure 22.1: Residuals of model: SARIMA(5,1,2)(1,1,2)[11]',
                  hist_main = 'Figure 22.2: Histogram of residuals',
                  acf_main = 'Figure 22.3: ACF plot of residuals',
                  pacf_main = 'Figure 22.4: PACF plot of residuals')
```

The SARIMA(5,1,2)(1,1,2)[11] model, similar to its predecessor, demonstrates high significance in higher-order AR and MA terms. The model exhibits improved normality of residuals, as indicated by the Shapiro-Wilk test. However, despite this improvement, the ACF and PACF plots reveal a few significant lags. Notably, two of these lags fall below the conventional 0.05 threshold in the Ljung-Box test, indicating the presence of residual autocorrelation.

This suggests that while the SARIMA(5,1,2)(1,1,2)[11] model effectively captures much of the series' behavior, there are still some aspects that it might be overlooking. The significant lags in the ACF and PACF plots, combined with the Ljung-Box test results, imply that the model may not fully account for all underlying patterns in the data. Despite these minor shortcomings, the overall performance of the model, particularly in terms of normality and significance of parameters, marks a improvement over previous models.

## Model 10: SARIMA(5,1,5)(1,1,2)[11]

```{r fig.align='center', fig.height=7, fig.width=15, message=FALSE, warning=FALSE}
model_515 <- BC_sunspot_TS %>% arima_summary(pdq_order = c(5,1,5),
                                             seasonal_order = c(1,1,2),
                                             period = 11)

BC_sunspot_TS %>%
  check_residuals(order = c(5,1,5),
                  seasonal_order = c(1,1,2),
                  period = 11,
                  acf_lag = 70,
                  pacf_lag = 70,
                  res_main = 'Figure 23.1: Residuals of model: SARIMA(5,1,5)(1,1,2)[11]',
                  hist_main = 'Figure 23.2: Histogram of residuals',
                  acf_main = 'Figure 23.3: ACF plot of residuals',
                  pacf_main = 'Figure 23.4: PACF plot of residuals')
```

The SARIMA(5,1,5)(1,1,2)[11] model demonstrates significant improvement across multiple fronts, including parameter significance and residual diagnostics. Almost all terms are significant, with only the lowest AR order and the lowest SMA order showing insignificance. Both visual assessments (histogram and QQ plot) and formal tests indicate a close alignment to normality, although the residuals are not perfectly normal, as the Shapiro-Wilk p-value remains slightly below the conventional 0.05 threshold.

The ACF and PACF plots reveal minimal significant lags, which can be considered negligible. The Ljung-Box test supports this, as all lags have p-values above the 0.05 threshold, indicating that the model captures the autocorrelation structure well.

In summary, the SARIMA(5,1,5)(1,1,2)[11] model effectively captures the series' behavior, with high parameter significance and residuals that closely approximate normality. The minimal significant lags in the ACF and PACF plots, corroborated by the Ljung-Box test, suggest that the model leaves little autocorrelation unaccounted for, marking a substantial improvement over previous models.


## Model 11: SARIMA(6,1,2)(1,1,2)[11]

```{r fig.align='center', fig.height=7, fig.width=15, message=FALSE, warning=FALSE}
model_612<- BC_sunspot_TS %>% arima_summary(pdq_order = c(6,1,2),
                                             seasonal_order = c(1,1,2),
                                             period = 11)

BC_sunspot_TS %>%
  check_residuals(order = c(6,1,2),
                  seasonal_order = c(1,1,2),
                  period = 11,
                  acf_lag = 70,
                  pacf_lag = 70,
                  res_main = 'Figure 24.1: Residuals of model: SARIMA(6,1,2)(1,1,2)[11]',
                  hist_main = 'Figure 24.2: Histogram of residuals',
                  acf_main = 'Figure 24.3: ACF plot of residuals',
                  pacf_main = 'Figure 24.4: PACF plot of residuals')
```

The SARIMA(6,1,2)(1,1,2)[11] model displays several significant parameters, indicating that it captures some key aspects of the series. However, the residual diagnostics reveal shortcomings. The Ljung-Box test results indicate a significant number of lags with p-values falling below the conventional threshold of 0.05. This suggests that the model fails to fully account for all autocorrelation in the data, leaving some underlying patterns uncaptured. Despite the significance of individual parameters, the presence of these significant lags points to residual autocorrelation, indicating that the model may not be entirely adequate in capturing the series' dynamics.

## Model 12: SARIMA(6,1,5)(1,1,2)[11]

```{r fig.align='center', fig.height=7, fig.width=15, message=FALSE, warning=FALSE}
model_615 <- BC_sunspot_TS %>% arima_summary(pdq_order = c(6,1,5),
                                             seasonal_order = c(1,1,2),
                                             period = 11)

BC_sunspot_TS %>%
  check_residuals(order = c(6,1,5),
                  seasonal_order = c(1,1,2),
                  period = 11,
                  acf_lag = 70,
                  pacf_lag = 70,
                  res_main = 'Figure 25.1: Residuals of model: SARIMA(6,1,5)(1,1,2)[11]',
                  hist_main = 'Figure 25.2: Histogram of residuals',
                  acf_main = 'Figure 25.3: ACF plot of residuals',
                  pacf_main = 'Figure 25.4: PACF plot of residuals')
```

Compared to other models, the SARIMA(6,1,5)(1,1,2)[11] model has a reduced number of significant parameters. Despite this, it effectively captures almost all autocorrelation in the series, as evidenced by the Ljung-Box test, where nearly all lags have high p-values, indicating minimal residual autocorrelation. Additionally, the residuals exhibit some normality, supported by both visual assessments (histogram and QQ plot) and the Shapiro-Wilk test. This suggests that the model, while having fewer significant parameters, manages to adequately capture the underlying patterns in the data, ensuring that residuals are close to normal and free from significant autocorrelation.

## Model 13: SARIMA(6,1,8)(1,1,2)[11]

```{r fig.align='center', fig.height=7, fig.width=15, message=FALSE, warning=FALSE}
model_618 <- BC_sunspot_TS %>% arima_summary(pdq_order = c(6,1,8),
                                             seasonal_order = c(1,1,2),
                                             period = 11)

BC_sunspot_TS %>%
  check_residuals(order = c(6,1,8),
                  seasonal_order = c(1,1,2),
                  period = 11,
                  acf_lag = 70,
                  pacf_lag = 70,
                  res_main = 'Figure 26.1: Residuals of model: SARIMA(6,1,8)(1,1,2)[11]',
                  hist_main = 'Figure 26.2: Histogram of residuals',
                  acf_main = 'Figure 26.3: ACF plot of residuals',
                  pacf_main = 'Figure 26.4: PACF plot of residuals')
```

At this stage of the analysis, the SARIMA(6,1,8)(1,1,2)[11] model demonstrates the best performance in terms of residual diagnostics. This model shows the highest normality of residuals, as indicated by visual assessments and the Shapiro-Wilk test. Only one significant lag appears in the ACF and PACF plots, which can be considered negligible. The Ljung-Box test supports this, with almost all p-values around 1.0, indicating that virtually no autocorrelation remains uncaptured. Thus, this model provides a promising fit, effectively capturing the underlying patterns in the series and ensuring minimal residual autocorrelation.

## Model 14: SARIMA(6,1,10)(1,1,2)[11]

```{r fig.align='center', fig.height=7, fig.width=15, message=FALSE, warning=FALSE}
model_6110 <- BC_sunspot_TS %>% arima_summary(pdq_order = c(6,1,10),
                                             seasonal_order = c(1,1,2),
                                             period = 11)

BC_sunspot_TS %>%
  check_residuals(order = c(6,1,10),
                  seasonal_order = c(1,1,2),
                  period = 11,
                  acf_lag = 70,
                  pacf_lag = 70,
                  res_main = 'Figure 27.1: Residuals of model: SARIMA(6,1,10)(1,1,2)[11]',
                  hist_main = 'Figure 27.2: Histogram of residuals',
                  acf_main = 'Figure 27.3: ACF plot of residuals',
                  pacf_main = 'Figure 27.4: PACF plot of residuals')
```

The SARIMA(6,1,10)(1,1,2)[11] model is one of the largest models fitted in the parameter estimation section of this report, featuring a total of 19 parameters. Of these, 12 parameters are significant. Interestingly, some of the significant terms are lower-order MA terms, while the higher-order terms remain insignificant.

In terms of residual normality, this model approximates a normal distribution quite closely, as suggested by the p-value from the Shapiro-Wilk test. The residual diagnostics further reinforce the model's adequacy, with the Ljung-Box test indicating almost no autocorrelation, as nearly all lags have high p-values. This suggests that the model effectively captures the series' underlying patterns, ensuring minimal residual autocorrelation and a distribution of residuals that closely aligns with normality.

## Model 15: SARIMA(11,1,2)(1,1,2)[11]

```{r fig.align='center', fig.height=7, fig.width=15, message=FALSE, warning=FALSE}
model_1112 <- BC_sunspot_TS %>% arima_summary(pdq_order = c(11,1,2),
                                             seasonal_order = c(1,1,2),
                                             period = 11)

BC_sunspot_TS %>%
  check_residuals(order = c(11,1,2),
                  seasonal_order = c(1,1,2),
                  period = 11,
                  acf_lag = 70,
                  pacf_lag = 70,
                  res_main = 'Figure 28.1: Residuals of model: SARIMA(11,1,2)(1,1,2)[11]',
                  hist_main = 'Figure 28.2: Histogram of residuals',
                  acf_main = 'Figure 28.3: ACF plot of residuals',
                  pacf_main = 'Figure 28.4: PACF plot of residuals')
```

The SARIMA(11,1,2)(1,1,2)[11] model, like its predecessor, demonstrates a near close alignment to normality and captures almost all autocorrelation. Almost all ordinary AR and MA terms are significant, indicating their importance in modeling the series. However, the seasonal AR and MA terms are found to be insignificant. Despite this, the model effectively captures the underlying patterns in the data, with residual diagnostics showing minimal autocorrelation and a near-normal distribution, as evidenced by the Shapiro-Wilk test and the Ljung-Box test results. This suggests that while the seasonal terms may not be significant, the model still provides a robust fit to the series.

# **Model Selection**

This segment will delve into comparing the AIC and BIC scores of the models. In comparing the AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) scores of the models, we aim to discern the most suitable model for our dataset. Both criteria provide valuable insights into model selection, but they differ slightly in their approach.

Error measure scores provide a quantitative assessment of how well a model fits the data by evaluating the discrepancies between the observed values and the model's predictions.

**Note:** Any models not included in the parameter estimation section were excluded due to errors encountered during the estimation process. These errors prevented the models from identifying the optimal coefficients, making them unsuitable for further analysis.

## AIC and BIC Scores

This section will look into analysis the AIC and BIC scores of the set of possible models. While AIC tends to select more complex models when sample sizes are large, BIC tends to favor simpler models. Therefore, comparing both AIC and BIC scores provides a comprehensive perspective on model performance, allowing us to strike a balance between goodness-of-fit and model simplicity.

```{r paged.print=FALSE}
# AIC Scores
sort.score(AIC(model_917[["model_917_cssml"]], model_0110[["model_0110_cssml"]],
               model_0111[["model_0111_cssml"]], model_1111[["model_1111_cssml"]],
               model_1112[["model_1112_cssml"]], model_312[["model_312_cssml"]],
               model_315[["model_315_cssml"]], model_318[["model_318_cssml"]],
               model_3110[["model_3110_cssml"]], model_512[["model_512_cssml"]],
               model_515[["model_515_cssml"]], model_612[["model_612_cssml"]],
               model_615[["model_615_cssml"]], model_618[["model_618_cssml"]],
               model_6110[["model_6110_cssml"]]), score='aic')

```

```{r paged.print=FALSE}
# BIC Scores