Project-2/ggsmock2_final.Rmd at master · EDAV-Group-Project-2/Project-2 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
---
title: "Floods and Geopotential Pressure Levels Around the World, 1985-2015"
author: "Team ggsmock2"
output:
  html_document:
    theme: readable
---
## Introduction
In this report, we summarize data from two datasets, one on geopotential height values in the northern hemisphere global region ranging from 35 degrees N to 70 degrees N, with a time dimensions ranging from 1944 to the present, and the other with details of flooding events from the time period ranging from 1985 to 2016. We present this data using visual representations that show the overall trends, and we gain insights into the trends and relationships between the following variables:

- average geopotential height values across this time period
- number of flooding events per year
- average area affected by events
- variation in amount of flooding between years
- financial damage cause by these events

## Monthly Flood Timeline
Here we present an interactive timeline of the flooding events from 1985 to 2015. The size of the circles indicate the number of people that were displaced as a result of the flood. You can pause the timeline and hover over specific flooding event to find out more details about it.

<script src="scripts/jquery-2.2.1.min.js"></script>
<script src="scripts/queue.min.js"></script>
<script src='scripts/d3.min.js'></script>
<script src="scripts/d3.slider.js"></script>
<script src="scripts/main.js"></script>
<link href="style/d3.slider.css" rel="stylesheet">
<link href="style/main.css" rel="stylesheet">
<style>.container { margin-left: -40px; }</style>

<div class="container">
<div id="slider"></div>
<div id="map">
<div id="current-month"></div>
<div id="play-pause-button"><a id="play-pause-link" href="#" onclick="return false;">Pause</a></div>
</div>
<div id="tip">(Pause and hover over the floods to see more information!)</div>
<div id="tooltip" class="tooltip"></div>
</div>

```{r setup, include=FALSE}
# Put your libs here:
library(ncdf4)
library(RNetCDF)
library(ggplot2)
library(lattice)
library(rworldmap)

# Import the two datasets
ncin <- nc_open("NOAA_Daily_phi_500mb.nc")
flood_df = read.csv('GlobalFloodsRecord.csv', fileEncoding="macroman", stringsAsFactors = FALSE)
```

## Preliminary Flood Data Analysis
Here we show basic flooding statistics. The following graphs represent the number of flooding events per month for each year, and we can see that some years exhibit increased flooding in the late summer months.

```{r scatterplot, fig.width=12, fig.height=12, echo=FALSE, results="hide", fig.align='center'}
# Format dates for parsing
df_ian = data.frame(as.Date(flood_df$Began, "%e-%b-%y"))
names(df_ian) = c("newDate")
df_ian2 = subset(df_ian, newDate >= "1985-01-01" & newDate <= "2008-01-01")

##### Trellis plot for flood counts per year
histogram(
  ~factor(
    format(df_ian2$newDate,"%b"),
    levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
  ) | factor(
    format(df_ian2$newDate,"%y"),
    level = c("85", "86", "87", "88", "89", "90", "91", "92", "3", "94", "99", "00", "01", "02", "03", "04", "05", "06", "07", "08")
  ),
  data=df_ian2, layout=(c(3,6)),
  main="Flood Counts by year and month",
  ylab="Flood Count",
  xlab="Year"
)
```

## Geopotential Height Data

```{r, echo=FALSE, warning=FALSE, message=FALSE, fig.align='center'}
library(ncdf4)
library(chron)
library(RColorBrewer)
library(lattice)
ncname <- "NOAA_Daily_phi_500mb"
ncfname <- paste(ncname, ".nc", sep = "")
dname <- "phi" #For variable name phi
ncin <- nc_open(ncfname)
# Open NETCDF data file

lon <- ncvar_get(ncin, "X")
lon <- ifelse(lon>180, lon-360, lon)
# Longitude
lat <- ncvar_get(ncin, "Y", verbose = F)
# Latitude
time_var <- ncvar_get(ncin, "T")
tunits <- ncatt_get(ncin, "T", "units")
# Time

phi.array <- ncvar_get(ncin, dname)
dpoint <- ncatt_get(ncin, dname, "pointwidth")
dhistory <- ncatt_get(ncin, dname, "history")
dcalendar <- ncatt_get(ncin, dname, "calendar")
dcenter <- ncatt_get(ncin, dname, "center")
dgribparam <- ncatt_get(ncin, dname, "gribparam")
dgribleveltype <- ncatt_get(ncin, dname, "gribleveltype")
dgribvariable <- ncatt_get(ncin, dname, "gribvariable")
dPDS_TimeRange <- ncatt_get(ncin, dname, "PDS_TimeRange")
dprocess <- ncatt_get(ncin, dname, "process")
dGRIBgridcode <- ncatt_get(ncin, dname, "GRIBgridcode")
dgribNumBits <- ncatt_get(ncin, dname, "gribNumBits")
dgribfield <- ncatt_get(ncin, dname, "gribfield")
dsubcenter <- ncatt_get(ncin, dname, "subcenter")
dscale_min <- ncatt_get(ncin, dname, "scale_min")
dgrib_name <- ncatt_get(ncin, dname, "grib_name")
dmissing_value <- ncatt_get(ncin, dname, "missing_value")
dPTVersion <- ncatt_get(ncin, dname, "PTVersion")
dscale_max <- ncatt_get(ncin, dname, "scale_max")
dexpires <- ncatt_get(ncin, dname, "expires")
dunits <- ncatt_get(ncin, dname, "units")
dlong_name <- ncatt_get(ncin, dname, "long_name")
dstandard_name <- ncatt_get(ncin, dname, "standard_name")
# Phi variable and attributes

nc_close(ncin)
#Close file

tustr <- strsplit(tunits$value, " ")
tdstr <- strsplit(unlist(tustr)[3], "-")
tmonth = as.integer(unlist(tdstr)[2])
tday = as.integer(unlist(tdstr)[3])
tyear = as.integer(unlist(tdstr)[1])
# split the time units string into fields

phi.array[phi.array == dmissing_value$value] <- NA
# Replace NETCDF missing_values with R NAs

n_days <- dim(phi.array)[3]
# Total number of days
```

We compute the approximate average and median geopotential height (phi) data for New York City for every day for the last 65 years. More specifically, these calculations are performed over the 2.5 degree by 2.5 degree geo-coordinate grid that contains New York City.

```{r, echo=FALSE, warning=FALSE, message=FALSE}
NYC.lon = which(lon==-75.0)
NYC.lat = which(lat==40.0)
NYC.mean = mean(phi.array[NYC.lon,NYC.lat,])
NYC.median = median(phi.array[NYC.lon,NYC.lat,])
```

* Mean: `r NYC.mean`
* Median: `r NYC.median`

## Flood data

```{r, echo=FALSE, warning=FALSE, message=FALSE, fig.align='center'}
library(ggplot2)
library(lubridate)
library(lattice)
df_c = read.csv('GlobalFloodsRecord.csv', stringsAsFactors = FALSE)
# Fix dataset
n <- dim(df_c)[1]
for(i in 1:n){
  df_c[i,15] <- gsub(",","",df_c[i,15])
}
# Clean Damage (USD) [get rid of commas]
fix.index = c(2,8,22,30)
colnames(df_c)[fix.index] = c("Annual.DFO", "Detailed.Locations", "News", "Notes")
num.cols = c(2,12,13,14,15,17,18,19,20,21)
df_c[,num.cols] <- sapply(df_c[,num.cols], as.numeric)
df_c$X <- NULL

```

### Affected sq. km
The following diagrams are lattice plots for the average area in square kilometers affected by floods per month and per year. This information is broken up into two full decades: the 1990s and the 2000s.

```{r, echo=FALSE, warning=FALSE, message=FALSE, fig.align='center'}
Floods_1990s <- subset(df_c,(year(dmy(df_c$Began)) <= 1999) & (year(dmy(df_c$Began)) >= 1990),
                       select=c(Began,Damage..USD.,Affected.sq.km))

Floods_1990s_Agg <- aggregate(Floods_1990s[c("Affected.sq.km")],
                              by=list(YEAR=year(dmy(Floods_1990s$Began)),
                                      MONTH=month(dmy(Floods_1990s$Began))),
                              FUN=mean,na.rm=FALSE)
# create factors with value labels
year90.f<-factor(Floods_1990s_Agg$YEAR,
               levels=c(1990,1991,1992,1993,1994,1995,1996,1997,1998,1999))
month90.f <-factor(Floods_1990s_Agg$MONTH,
                 levels=c(1,2,3,4,5,6,7,8,9,10,11,12),
               labels=c(1,2,3,4,5,6,7,8,9,10,11,12))#c("Jan","Feb","Mar",
                      #  "Apr","May","Jun","Jul",
                      #  "Aug","Sep","Oct","Nov","Dec"))

barchart(Floods_1990s_Agg$Affected.sq.km~month90.f|year90.f,layout=(c(2,5)),
          main="Average Affected Sq. Km. Per Month Per Year: 1990s",
          xlab="Month",
          ylab="Affected Sq. Km.")

Floods_2000s <- subset(df_c,(year(dmy(df_c$Began)) <= 2009) & (year(dmy(df_c$Began)) >= 2000),
                       select=c(Began,Damage..USD.,Affected.sq.km))

Floods_2000s_Agg <- aggregate(Floods_2000s[c("Affected.sq.km")],
                              by=list(YEAR=year(dmy(Floods_2000s$Began)),
                                      MONTH=month(dmy(Floods_2000s$Began))),
                              FUN=mean,na.rm=FALSE)

year00.f<-factor(Floods_2000s_Agg$YEAR,
               levels=c(2000,2001,2002,2003,2004,2005,2006,2007,2008,2009))
month00.f <-factor(Floods_2000s_Agg$MONTH,
                 levels=c(1,2,3,4,5,6,7,8,9,10,11,12),
                 labels=c(1,2,3,4,5,6,7,8,9,10,11,12))#c("Jan","Feb","Mar",
                        #  "Apr","May","Jun","Jul",
                        #  "Aug","Sep","Oct","Nov","Dec"))

barchart(Floods_2000s_Agg$Affected.sq.km~month00.f|year00.f,layout=(c(2,5)),
         main="Average Affected Sq. Km. Per Month Per Year: 2000s",
         xlab="Month",
         ylab="Affected Sq. Km.")
```

### ANOVA Analysis
We examine whether there are any statistically significant differences in area affected by floods among the various months and years, as well as between the last two decades. In order to do so, we perform several ANOVA (Analysis of Variance) tests, all at a rejection level of 0.05. All of these tests are done on the assumption that the areas are derived from a Gaussian distribution with an unknown but fixed variance.

The first is a one-way ANOVA test for the null hypothesis that the average area affected is the same between the 1990s and the 2000s. This test results in a p-value of 0.00285, so we can reject the null hypothesis.

```{r, echo=FALSE, warning=FALSE, message=FALSE, fig.align='center'}
Floods_90s_00s <- subset(df_c,(year(dmy(df_c$Began)) <= 2009) & (year(dmy(df_c$Began)) >= 1990),
                         select=c(Began,Damage..USD.,Affected.sq.km))
Floods_90s_00s$MONTH <- month(dmy(Floods_90s_00s$Began))
Floods_90s_00s$YEAR <- year(dmy(Floods_90s_00s$Began))
Floods_90s_00s$DECADE <- ifelse(Floods_90s_00s$YEAR < 2000,"90s","00s")
Floods_90s_00s$MONTH <- factor(Floods_90s_00s$MONTH,
                               levels=c(1,2,3,4,5,6,7,8,9,10,11,12),
                               labels=c("Jan","Feb","Mar",
                                        "Apr","May","Jun","Jul",
                                        "Aug","Sep","Oct","Nov","Dec"))
Floods_90s_00s$YEAR <- factor(Floods_90s_00s$YEAR,
                              levels=c(1990,1991,1992,1993,1994,1995,
                                       1996,1997,1998,1999,2000,2001,
                                       2002,2003,2004,2005,2006,2007,
                                       2008,2009))
one_way.fit <- aov(Affected.sq.km ~ DECADE, data = Floods_90s_00s)
summary(one_way.fit)

#two_way.fit <- aov(Affected.sq.km ~ DECADE * MONTH, data = Floods_90s_00s)
#summary(two_way.fit)

```

The second is a two-way ANOVA test for the null hypothesis that the average area affected is the same for all months and all years between 1990 and 2009. This test provides a p-value of 0.00329 for the month, 1.68e-05 for the year, but 0.37483 for the interaction between month and year. The results imply that there is a statistically significant difference in average area affected per month, as well as per year. These results make sense considering the effect of seasonality on the weather: different months of the year correspond to the rainy season, and different years have different global weather patterns (ex. El Nino and La Nina). However, both of these factors together do not have a statistically significant impact on the average area affected by floods. Thus, even though there may be seasonal differences between months and between years, there has been little overall change in the scope of flood impact area within the past two full decades.

```{r, echo=FALSE, warning=FALSE, message=FALSE, fig.align='center'}
two_way.fit <- aov(Affected.sq.km ~ MONTH * YEAR, data = Floods_90s_00s)
summary(two_way.fit)
```

Regarding the apparent contradiction between the results of the two ANOVA tests: the one-way test showed that there is a per-decade difference when ignoring monthly and yearly seasonal trends, which the two-way test takes into account.

## Monetary Damages
The next set of diagrams are lattice plots for the average amount of damage in USD (United States dollars) due to floods per month and per year. This information is also split by the last two calendar decades.

```{r, echo=FALSE, warning=FALSE, message=FALSE, fig.align='center'}
Floods_1990s_Agg_2 <- aggregate(Floods_1990s[c("Damage..USD.")],
                              by=list(YEAR=year(dmy(Floods_1990s$Began)),
                                      MONTH=month(dmy(Floods_1990s$Began))),
                              FUN=mean,na.rm=TRUE)
barchart(Floods_1990s_Agg_2$Damage..USD.~month90.f|year90.f,layout=(c(2,5)),
         main="Average Damages (USD) Per Month Per Year: 1990s",
         xlab="Month",
         ylab="Damages (USD)",scales = list(y = list(log = 10)))
Floods_2000s_Agg_2 <- aggregate(Floods_2000s[c("Damage..USD.")],
                              by=list(YEAR=year(dmy(Floods_2000s$Began)),
                                      MONTH=month(dmy(Floods_2000s$Began))),
                              FUN=mean,na.rm=TRUE)
barchart(Floods_2000s_Agg_2$Damage..USD.~month00.f|year00.f,layout=(c(2,5)),
         main="Average Damages (USD) Per Month Per Year: 2000s",
         xlab="Month",
         ylab="Damages (USD)",scales = list(y = list(log = 10)))
```

### ANOVA Analysis
We examine whether there are any statistically significant differences in flood damage among the various months and years, as well as between the last two decades. We run two ANOVA tests at a rejection level of 0.05. As before, all of these tests are done on the assumption that the damage values are derived from a Gaussian distribution with an unknown but fixed variance.

The first is a one-way ANOVA test for the null hypothesis that the average value of flood damages is the same between the 1990s and the 2000s. This test results in a p-value of 0.0214, so we can reject the null hypothesis.

```{r, echo=FALSE, warning=FALSE, message=FALSE, fig.align='center'}
one_way.fit <- aov(Damage..USD. ~ DECADE, data = Floods_90s_00s)
summary(one_way.fit)

#two_way.fit <- aov(Damage..USD. ~ DECADE * MONTH, data = Floods_90s_00s)
#summary(two_way.fit)
```

The second is a two-way ANOVA test for the null hypothesis that the average value of damages is the same for all months and all years between 1990 and 2009. This test provides a p-value of 0.578 for the month, 0.449 for the year, and 1.0 for the interaction between month and year. We fail to reject the null hypothesis. One possible reason why the test suggests little change is that whenever floods happen, especially in vulnerable communities, they may cause similar levels of damage, regardless of when they occur.

```{r, echo=FALSE, warning=FALSE, message=FALSE, fig.align='center'}
two_way.fit <- aov(Damage..USD. ~ MONTH * YEAR, data = Floods_90s_00s)
summary(two_way.fit)

#two_way.fit <- aov(Damage..USD. ~ DECADE * MONTH, data = Floods_90s_00s)
#summary(two_way.fit)
```

## Trend Analysis
The lattice plots provided a compact view of how the flood damage and impact area changed on average per month and per year. The ANOVA tests helped determine that there were statistically significant changes over time. However, we need additional visualizations to gain insight on what those changes look like.

The following bar plot shows the average amount of flood damages per year. Some of the years with the highest values in damages had very high profile storms and weather patterns occur: 1991 had "the Perfect Storm" in New England, 1998 had El Nino, and Hurricane Katrina devastated New Orleans in 2005. Other than these extreme values, there is no observable trend between 1990 and 2009, confirming the results of the two-way ANOVA test. It is also worth noting that there is no data for flood damages after 2010. This finding may suggest that there are no floods that cause serious monetary damage after that time. However, the 2011 tsunami in Japan and Hurricane Sandy, among other events, render this notion highly unlikely.

```{r, echo=FALSE, warning=FALSE, message=FALSE, fig.align='center'}
Floods_AllYears_Damage <- aggregate(df_c[c("Damage..USD.")],
                        by=list(YEAR=year(dmy(df_c$Began))),
                        FUN=mean,na.rm=TRUE)
ggplot(Floods_AllYears_Damage,aes(x=YEAR,y=Damage..USD.))+
  geom_bar(stat="identity")+labs(x="Year",y="Damage (USD)",
                                 title="Average Damages (USD) Per Year")

Floods_AllYears_Affected <- aggregate(df_c[c("Affected.sq.km")],
                                    by=list(YEAR=year(dmy(df_c$Began))),
                                    FUN=mean,na.rm=TRUE)

```

The next plot displays the average area impacted by floods over time. There is fluctuation over each decade, with less area affected towards the beginning of a decade, and an increasing area with each year until the end of the decade. This visualization supports the explanation of seasonality for the ANOVA analysis results.

```{r, echo=FALSE, warning=FALSE, message=FALSE, fig.align='center'}
ggplot(Floods_AllYears_Affected,aes(x=YEAR,y=Affected.sq.km))+
  geom_bar(stat="identity")+labs(x="Year",y="Affected sq. km.",
                                 title="Average Affected sq. km. Per Year")
```

## Keywords in News Reporting
We are curious about the words that journalists use on floods. Specifically, we want to see if the words used change as the floods differ in severity. Here we choose to categorize floods by the number of deaths each flood causes.

```{r load data, include=FALSE}
flood=read.csv('data/GlobalFloodsRecord_Tian.csv')
names(flood)[30]='news'
names(flood)[12]='Duration'
names(flood)[18]='Area'
names(flood)[15]='Damage'
```

We perform a TF-IDF(Term Frequency, Inverse Document Frequency) step on each different flood news reporting. Each corpus of documents is the news of floods causing deaths in a given interval (i.e. 10-50). After performing stemming, stopword removal and number removal, the TF-IDF step gets the highest 40 keywords for each corpus.

```{r check spelling, include=FALSE}
spellcheck=function(x){
  return (gsub("\xca","",x))}
flood$Country=lapply(flood$Country,spellcheck)
flood$Country=unlist(lapply(flood$Country,function(x) gsub("^$|^ $", NA, x)))
flood=flood[!is.na(flood[,4]), ]
flood=flood[flood$Country!="#N/A", ]
library(stringr)
flood$Country=unlist(lapply(flood$Country,function(x) str_trim(x,side='both')))

flood$Began=lapply(flood$Began,function(x) gsub("^$|^ $", NA, x))
flood$Ended=lapply(flood$Ended,function(x) gsub("^$|^ $", NA, x))
flood$Damage=lapply(flood$Damage,function(x) gsub("^$|^ $", NA, x))
flood=flood[!is.na(flood[,10]), ]
flood=flood[!is.na(flood[,11]), ]

flood$Began=lapply(flood$Began,function(x) as.Date(strtoi(x), origin = "1899-12-30"))
flood$Ended=lapply(flood$Ended,function(x) as.Date(strtoi(x), origin = "1899-12-30"))

for (i in 1:dim(flood)[1]){
  flood$year[i]=as.numeric(format(flood$Began[[i]],'%Y'))}
flood$year=unlist(lapply(flood$year,function(x) as.numeric(x)))

#get continent for country
for (i in 1:dim(flood)[1]){
  c=flood$Country[i]
  if (substr(c,1,4)=="Phil"){flood$Country[i]='Philippines'}
  if (substr(c,1,3)=="USA"){flood$Country[i]='USA'}
  if (substr(c,1,3)=="Mold"){flood$Country[i]='Moldova'}
  if (substr(c,1,5)=="Malay"){flood$Country[i]='Malaysia'}
  if (substr(c,1,5)=="Urugu"){flood$Country[i]='Uruguay'}
  if (substr(c,1,5)=="Zimba"){flood$Country[i]='Zimbabwe'}
  if (substr(c,1,3)=="Vie"){flood$Country[i]='Vietnam'}
  if (substr(c,1,5)=="Papua"){flood$Country[i]='Papua New Guinea'}
  if (substr(c,1,5)=="Niger"){flood$Country[i]='Nigeria'}
  if (substr(c,1,4)=="Guat"){flood$Country[i]='Guatemala'}
}
CC=read.csv("data/CC.csv")
idx=match(flood$Country,CC$Country)
idx[is.na(idx)]=sample(c(247,228,229,1),1)
flood$Continent=CC$Continent[idx]

#count length of news
flood$newslen=lapply(flood$news,function(x) sapply(gregexpr("\\W+", x), length) + 1)

flood_news=flood[flood$newslen>10,]
```

Then we mark the weight of each keyword in each category. For example, if 'killed' is a top keyword in 3% of news reporting on flood that causes deaths in (50, 100) interval, we give it the weight 3.
```{r tf-idf, include=FALSE}
Dead=as.numeric(as.matrix(flood_news$Dead))
Dead_l=c(0,0,10,50,100,500,1000)
Dead_h=c(0,10,50,100,500,1000,10000)

ids=(Dead>0)*(Dead<=10)
ids=Map(function(x) as.logical(x),ids)
ids=unlist(ids)

#the function that does tfidf
library(data.table)
library(tm)
library(wordcloud)
#function to plot wordcloud..
wordcloud_plot=function(x,dead){
  d = Corpus(VectorSource(as.matrix(x$news))) #Make a corpus object from a text vector
  dtm <- DocumentTermMatrix(d, control = list(weighting = weightTfIdf))
  dtm=data.frame(as.matrix(dtm))
  dtm=dtm[ , -which(names(dtm) %in% stopwords("english"))]
  if(!is.na(which(names(dtm) %in% c("...",'flood','flooding','floods','flooded'))[1]))
  {dtm=dtm[ , -which(names(dtm) %in% c("..."))]}

  months=c('january','february','march','april','may','june','july','august','september','october','november','december')
  dtm_copy=dtm
  for (i in 1:12){idx=grep(months[i],colnames(dtm_copy))
  if (!is.na(as.numeric(idx[1])==0)) #remove months
  {dtm_copy=dtm_copy[,-idx]}}
  dtm=dtm_copy

  words=names(dtm)
  for (i in 1:dim(dtm)[2]) {if(substr(words[i],1,1)!="X") {break}}
  #i = the number of cols starting with X  to be merged later

  print("X split done")

  dtm_X=dtm[,1:i-1]
  dtm_posX=dtm[,i:dim(dtm)[2]]
  removeX=function(x){
    x= (gsub("X.","",x))
    x= (gsub("X_","",x))
    x= (gsub("_","",x))
    x= (gsub(" ","",x))
    if (substr(x,nchar(x),nchar(x))==".") {x=substr(x,1,nchar(x)-1)}
    if (substr(x,nchar(x),nchar(x))==".") {x=substr(x,1,nchar(x)-1)}
    if (substr(x,nchar(x),nchar(x))==".") {x=substr(x,1,nchar(x)-1)}
    if (substr(x,nchar(x),nchar(x))==".") {x=substr(x,1,nchar(x)-1)}
    if (substr(x,nchar(x),nchar(x))==".") {x=substr(x,1,nchar(x)-1)}
    if (substr(x,1,1)==".") {x=substr(x,2,nchar(x))}
    return (x)}

  #names(dtm_X)=apply(as.matrix(names(dtm_X)),1,removeX)
  names(dtm_posX)=apply(as.matrix(names(dtm_posX)),1,removeX)
  dtm_new=cbind(dtm_X,dtm_posX)
  dtm_new=dtm_posX
  dtm_new=t(dtm_new)

  #dtm_reduce=aggregate(dtm_new,list(rownames(dtm_new)),sum)
  dtm_reduce=(dtm_new)
  dtm_reduce=t(dtm_reduce)

  print ("groupby done")

  #high_kw=function(x) {
  #  return (dtm_reduce[1,][order(x, decreasing=TRUE)[1:25]])}

  high_kw=function(x) {
    return (names(dtm_posX)[order(x, decreasing=TRUE)[1:40]])}

  vec_kw=as.vector(apply(dtm_reduce,1,high_kw)[,-1]) #vec 25 * n_doc
  print ("vec_kw done ")
  c=merge(vec_kw,1)
  c=aggregate(c[,2], list(c[,1]), sum)

  c=c[is.na(as.numeric(as.matrix(c$Group.1))),]
  print ("c done ")

  for (j in 1:dim(c)[1]) {if(substr(c[j,1],1,1)=="a") {break}}
  #word cloud
  c=c[j:dim(c)[1],]
  w=c[,1][order(c[,2], decreasing=TRUE)[1:40]]
  w=as.matrix(w)
  f=c[,2][order(c[,2], decreasing=TRUE)[1:40]]
  #wordcloud(w, f)
  #f=100*f/(dim(dtm_reduce)[2]-1)
  return (data.frame(w,dead,100*f/(dim(dtm_reduce)[2]-1)))}
```

We keep these weights in a 43x7 matrix. 43 is the number of high-density keywords and 7 is the intervals of death numbers caused by floods.

```{r word_dead, include=FALSE}
word_dead=data.frame(0,0,0)
colnames(word_dead)=c("word","dead","percent")
for (i in 1:length(Dead_l)){
        ids=(Dead>=Dead_l[i])*(Dead<=Dead_h[i])
        ids=Map(function(x) as.logical(x),ids)
        ids=unlist(ids)
        x=flood_news[ids,]
        df=wordcloud_plot(x,Dead_h[i])
        colnames(df)=colnames(word_dead)
        word_dead=rbind(word_dead,df)
        print (Dead_l[i])}

word_dead=word_dead[2:dim(word_dead)[1],]
#add in missing words for consistency
words=sort(unique(word_dead$word))

for (i in 1:length(words))
   for(j in 1:length(Dead_h)){
     if(sum((word_dead$word==words[i])*(word_dead$dead==Dead_h[j]))==0)
       {r=data.frame(words[i],Dead_h[j],0.0005)
       colnames(r)=colnames(word_dead)
       word_dead=rbind(word_dead,r)}}
library(plyr)
worddead=word_dead
word_dead[word_dead$dead==0,]$dead=1
word_dead=arrange(word_dead,-desc(word),dead)
word_news=c('abandon','absorb','academic','account','accumulate','acres','across','activities','adjoining','aerial','area'
            ,'bangladesh','catastrophe','caused','city','crippled','cyclone','damage','damaged','dead','days'
            ,'desperation','destroyed','dying','evacuated','feb','flood','flooded','flooding','hectares','heavy','homes'
            ,'inundated','killed','landslides','missing','maternity','rains','river','rushing','sept','rushing','submerged','village','worst','tropical')
word_dead=word_dead[word_dead$word %in% word_news,]
words=sort(unique(word_dead$word))
library(ggplot2)
attach(word_dead)
ids=(dead==1)+(dead==10)+(dead==50)+(dead==100)+(dead==500)+(dead==1000)+(dead==10000)
ids=Map(function(x) as.logical(x),ids)
ids=unlist(ids)
word_dead1=word_dead[ids,]
word_dead2=word_dead1[word_dead1$word %in% unique(words)[1:length(words)],]
if ('0' %in% words) {word_dead2=word_dead2[8:dim(word_dead2)[1],]}
#gg=ggplot(word_dead2, aes(x=log(as.numeric(as.character(dead))), y=percent))
#gg + geom_area(aes(colour=word, fill=word))
```

Here is a word cloud we generate, looking at floods causing deaths between 50 and 100. The keywords such as 'caused', 'desperation', and 'sept' appear large in size, indicating they tend to occur frequently in newsreporting on floods of this scale.

```{r wordcloud_prep, include=TRUE, echo=FALSE, fig.align='center'}
library(d3heatmap)
wd=as.vector(word_dead2$percent)
wd=matrix(wd,nrow=length(words))
#dim(wd)=c(length(word_news),7)
wd=data.frame(wd)
colnames(wd)= unlist(lapply(Dead_h,function(x) paste('Deaths <=',x)))
rownames(wd)=sort(unique(word_dead2$word))
wd[wd==0.0005]=0
wd=wd*10
if (0 %in% rownames(wd)){wd=wd[2:dim(wd)[1],]}
wordcloud(words,wd[,4])
```

Interestingly, comparing the above word cloud against newsreporting on flood causing deaths between 100 and 500, we immediately see a difference. Now words such as 'feb', 'evacuated', and 'inundated' get more dense.
```{r wordcloud, include=TRUE, echo=FALSE, fig.align='center'}
wordcloud(words,wd[,5])
```

Finally we plot an interactive heatmap to explore how these keywords change their density in newsporting, as the severity of floods increase.
```{r heatmap, include=TRUE}
#d3heatmap(wd, Colv=FALSE,dendrogram='none',scale = "column", colors = "Spectral",xaxis_font_size="8px")
wd=data.matrix(wd)
library(RColorBrewer)
library(plotrix)
palf <- colorRampPalette(c("gold", "dark orange"))
heatmap(wd, Rowv = NA, Colv = NA, col = palf(100),scale="none", margins=c(5,2),cexRow=0.8,cexCol=0.7)

```

We notice that 'feb' is most dense on floods causing deaths between 100 and 500 while 'sept' is most dense on deaths causing 0 deaths. This may suggest an association in seasonality and severity of floods. Also interestingly, we observe that keyword such as 'abandon' is most dense in reportings on floods with deaths <=50.

## Causes of Floods
In this section we look at the causes of the floods. We pick the most popular 6 causes and do a string matching to categorize the floods.
```{r cause_prep, include=FALSE}
library(maps)
library(ggplot2)
world=map_data('world')
causes=c('tropical','typhoon','rain','monsoon','dam','snow')
```

```{r cause_prep2, include=FALSE}
flood$causes=NA
for (i in 1:6){
idx=grep(causes[i],flood$Main.cause,ignore.case=TRUE)
flood$causes[idx]=causes[i]}
flood_cause=flood[!is.na(flood[,34]), ]
flood_cause=flood_cause[flood_cause[,19]>0, ]
```

We project the floods in past 6 years (2010-2015) onto the map. The area of the dots indicate the affecte area of the floods. We see some easily recognible patterns: majority of floods are caused by rain(green dots); moonsoon-caused floods (brown dots) are dominant in South Asia and East China. Tropical storms are, as the name suggests, popular in tropical area and so are the floods caused by them. Rare in frequency, floods caused by snow are mainly in high-latitude regions such as Russia and Kazakhstan
```{r cause_map, include=TRUE,,warning=FALSE, echo=FALSE, fig.align='center'}
p <- ggplot()
p <- p + geom_polygon( data=world, aes(x=long, y=lat, group = group),colour="white", fill="grey10" )
p <- p + geom_jitter( data=flood_cause[as.numeric(as.matrix(flood_cause$year))>=2010,], position=position_jitter(width=0.5, height=0.5),
    aes(x=as.numeric(as.matrix(Centroid.X)),
     y=as.numeric(as.matrix(Centroid.Y)),size=as.numeric(as.matrix(Area)),color=causes,alpha=0.5))
p=p+scale_size('area')

p=p+ggplot2::annotate("text", label = "causes of world flood 2010-2015", x = 12, y = 100, size = 8, colour ="black")
p
```

## Principle Component Analysis on Countries
In this section, we performed priciple component analysis (PCA) on countries that experienced floods in a given year. Here we choose 7 variables from the GlobalFloodArchive data, namely Affected Area, Severity, Magnitude, Dead, Dispatched, Latitude and Longitude.

```{r pca_prep, include=FALSE}
library(scales)
PCAcc=function(yr){
x=flood[as.numeric(as.matrix(flood$year))==yr,]
temp=x[,c(4,12:14,17,18,20,21)]
temp=temp[as.matrix(temp$Dead)>=10,]
cc=temp$Country
temp=data.frame(data.matrix(temp))
temp$Country=cc

temp_avg=aggregate(temp[, 2:8], list(temp$Country), mean) #do we need this?
pca_existing <- prcomp(data.frame(temp[,c(2:8)]), scale. = TRUE)
projected <- as.data.frame(pca_existing$x)
# Show first two PCs for head countries
projected$country=temp$Country

ramp <- colorRamp(c("green", "red"))
ratio=temp$Dead/temp$Severity..
colours_by_ratio <- rgb(
  ramp( as.vector(rescale(ratio,c(0,1)))),
  max = 255 )
plot(PC1~PC2, data=projected, main= paste("PCA for Flood Affected Countries in ",yr),cex = .1, lty = "solid",col=colours_by_ratio,bg='white' )
text(PC1~PC2, data=projected, labels=projected$country,cex=.8,col=colours_by_ratio)}
```

We use a PCA dimensionality reduction to project the 7 chosen dimensions onto 2. Also we would like to investigate if there are clustering on these countries. We use a color coding from green to red that indicates the ratio of deaths against severity of the flood. This color thus indicates how damaging this flood is. The more shifted on the red, the more damaging it is.

In this example we focus on floods in 2000:
```{r pca, include=TRUE,warning=FALSE, echo=FALSE, fig.align='center'}
PCAcc(2000)
```

We observe some clustering, such as the 4 floods in Brazil in the top part and the 2 floods in India in the bottom. The 2 PCs may be intepreted such that PC2 corresponds to how damaging a flood is, as most country-flood pairs in the upper half are red while the ones in the bottm are green

However, we do notice that there isn't any clear classification on this plot between developped countries and developing countries. USA and Philippines are close to each other in the center and so are France and India on the left. Perhaps this suggests that flood is more of a global disaster regardless of the GDP or geo-location of a country.

## Flood Death
The following plot shows deaths related to flooding events between 1985 and 2015.

```{r, warning=FALSE, fig.align='center', echo=FALSE}
library(ncdf4)
library(RNetCDF)
library(rworldmap)
library(ggplot2)
gf <- read.csv(file="GlobalFloodsRecord.csv", header=TRUE, sep=",")
flood.data = data.frame(Longitude=as.numeric(as.character(gf$Centroid.X)),Latitude=as.numeric(as.character(gf$Centroid.Y)),
                        dead=as.numeric(gf$Dead))
world_map <- map_data("world")
p <- ggplot() + coord_fixed() +
  xlab("") + ylab("")

#Add map to base plot
base_world <- p + geom_polygon(data=world_map, aes(x=long, y=lat, group=group),
                               colour="white", fill="grey")+ labs(title="Flood Related Deaths, 1985-2015")
map_data <-
  base_world +
  geom_point(data=flood.data,
             aes(x=Longitude, y=Latitude,color=-dead),size=1,alpha=I(0.6))
map_data
```

## Flood Monthly Magnitude
In the following maps, we plot flooding event of each month. We can find that the flooding may has some relationship with temperature. More flooding events have occurred between May and September.
```{r,echo=FALSE}
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
```

```{r, warning=FALSE, fig.align='center', echo=FALSE}
flood.month.data = data.frame(Longitude=as.numeric(as.character(gf$Centroid.X)),Latitude=as.numeric(as.character(gf$Centroid.Y)),
                              Magnitude=as.numeric(gf$Magnitude..M...), month=format(as.Date(gf$Began,format='%d-%b-%y'),"%m"))
month_text=c("Floods in January","Floods in February","Floods in March","Floods in April","Floods in May","Floods in June","Floods in July","Floods in August","Floods in September","Floods in October","Floods in November","Floods in December")
world_map <- map_data("world")
p <- ggplot() + coord_fixed() +xlab("") + ylab("")
base_world <- p + geom_polygon(data=world_map, aes(x=long, y=lat, group=group),colour="white", fill="grey")
flood.1.data<-subset(flood.month.data,month=="01")
flood.2.data<-subset(flood.month.data,month=="02")
flood.3.data<-subset(flood.month.data,month=="03")
flood.4.data<-subset(flood.month.data,month=="04")
flood.5.data<-subset(flood.month.data,month=="05")
flood.6.data<-subset(flood.month.data,month=="06")
flood.7.data<-subset(flood.month.data,month=="07")
flood.8.data<-subset(flood.month.data,month=="08")
flood.9.data<-subset(flood.month.data,month=="09")
flood.10.data<-subset(flood.month.data,month=="10")
flood.11.data<-subset(flood.month.data,month=="11")
flood.12.data<-subset(flood.month.data,month=="12")
map1 <- base_world + geom_point(data=flood.1.data, aes(x=Longitude, y=Latitude),size=0.5,alpha=I(0.6))+labs(title=month_text[1])
map2 <- base_world + geom_point(data=flood.2.data, aes(x=Longitude, y=Latitude),size=0.5,alpha=I(0.6))+labs(title=month_text[2])
map3 <- base_world + geom_point(data=flood.3.data, aes(x=Longitude, y=Latitude),size=0.5,alpha=I(0.6))+labs(title=month_text[3])
map4 <- base_world + geom_point(data=flood.4.data, aes(x=Longitude, y=Latitude),size=0.5,alpha=I(0.6))+labs(title=month_text[4])
map5 <- base_world + geom_point(data=flood.5.data, aes(x=Longitude, y=Latitude),size=0.5,alpha=I(0.6))+labs(title=month_text[5])
map6 <- base_world + geom_point(data=flood.6.data, aes(x=Longitude, y=Latitude),size=0.5,alpha=I(0.6))+labs(title=month_text[6])
map7 <- base_world + geom_point(data=flood.7.data, aes(x=Longitude, y=Latitude),size=0.5,alpha=I(0.6))+labs(title=month_text[7])
map8 <- base_world + geom_point(data=flood.8.data, aes(x=Longitude, y=Latitude),size=0.5,alpha=I(0.6))+labs(title=month_text[8])
map9 <- base_world + geom_point(data=flood.9.data, aes(x=Longitude, y=Latitude),size=0.5,alpha=I(0.6))+labs(title=month_text[9])
map10 <- base_world + geom_point(data=flood.10.data, aes(x=Longitude, y=Latitude),size=0.5,alpha=I(0.6))+labs(title=month_text[10])
map11 <- base_world + geom_point(data=flood.11.data, aes(x=Longitude, y=Latitude),size=0.5,alpha=I(0.6))+labs(title=month_text[11])
map12 <- base_world + geom_point(data=flood.12.data, aes(x=Longitude, y=Latitude),size=0.5,alpha=I(0.6))+labs(title=month_text[12])
multiplot(map1,map2,map3,map4,cols=2)
multiplot(map5,map6,map7,map8,cols=2)
multiplot(map9,map10,map11,map12,cols=2)
```

## Conclusion
From our ANOVA analysis we conclude that there is a statisitcally sigificant difference between the value of the damaged caused by flooding in the 1990s and the 2000s. We also can see from the graphs that flood events occur around rivers and coastal regions, and when coupled with trellis plots of flooding events across months by year, there appear to be time periods of increased flooding that coorespond with months in late summer.