---
title: "ExpoPath Public Code"
author: "Michael A. Zurek-Ost, PhD"
date: "Latest Update: 2024-11-04"
output:
  html_document:
    number_sections: yes
---

# Setup

```{r R Markdown global options setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE, include = TRUE, warning = FALSE)
```

Before executing any of the following code, designate a folder as your working directory and make sure it contains the following items:
<p>- "input" folder<br>
- "output" folder<br>
- "ExpoPath_Public_Code.Rmd"</p>

The "input" and "output" folders are available at the following DOI: [https://doi.org/10.23645/epacomptox.27696612.v1](https://doi.org/10.23645/epacomptox.27696612.v1).

The first step in running this project is to install and load the necessary packages. `ComplexHeatmap` is distributed through Bioconductor and is installed with `BiocManager`, while the rest can be found on the Comprehensive R Archive Network (CRAN), accessible through RStudio's "Packages" window.

## Manually Installing ComplexHeatmap

If you do not have `BiocManager` and `ComplexHeatmap`, then run the following code chunk to manually install the packages.

```{r installing ComplexHeatmap}
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("ComplexHeatmap")
```


## Load Libraries

Make sure the following packages are installed before running this chunk.

```{r load libraries, eval = TRUE, message = FALSE}
library(readxl)
library(dplyr)
library(sna)
library(igraph)
library(scales)
library(ComplexHeatmap)
library(ctxR)
library(networkD3)
library(future.apply)
```


## ctxR specifications

This section includes a code chunk specifying a key for connecting to EPA APIs via the `ctxR` package.

```{r ctxR setup}
my_key <- "" # Provide your API Key here for ctxR functionality
```
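
If you would rather not store the key in the file itself, one option is to read it from an environment variable. This is only a sketch: the variable name `CTX_API_KEY` is an assumption made for illustration and is not defined by `ctxR` or by this project.

```{r ctxR key from environment sketch}
# CTX_API_KEY is a hypothetical variable name; define it in your .Renviron or shell
my_key <- Sys.getenv("CTX_API_KEY", unset = "")

# nzchar() is FALSE for the empty string, i.e. when the variable is not set
if (!nzchar(my_key)) {
  message("No API key found; set my_key before calling ctxR functions.")
}
```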

## Set custom function to scale values between 0 and 1

The `range01()` function specified here is solely used for tweaking plotting parameters in network visualizations.

```{r custom scale function}
range01 <- function(x){(x - min(x))/(max(x) - min(x))}
```
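
As a quick illustration of what `range01()` returns, the sketch below applies it to two small made-up vectors. Note that a constant vector produces `NaN`, because the denominator `max(x) - min(x)` is zero, so it may be worth checking inputs before using the result as a plotting parameter.

```{r range01 usage sketch}
range01 <- function(x){(x - min(x))/(max(x) - min(x))}

range01(c(2, 4, 6))   # 0.0 0.5 1.0
range01(c(3, 3, 3))   # NaN NaN NaN, because max(x) - min(x) is 0
```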

# Data Management

Set your working directory to your designated project folder containing this markdown file as well as the associated "input" and "output" folders and their respective contents.

```{r set working directory}
setwd() # specify your working directory here but make sure to preserve the file structure of this project
```


## Multimedia Monitoring Database

The EPA's Multimedia Monitoring Database (MMDB) records chemical presence in harmonized media categories covering environmental, ecological, and biological media of concern. These data are currently binary (*detected* = 1 or 0).

```{r load MMDB}
mmdb <- na.omit(
  read.csv("./input/data/MMDB Monitoring Data/mmdb-expanded-model-full-targets.csv")
  )
mmdb_edgelist <- mmdb[which(mmdb$detected == 1),colnames(mmdb) %in% c("media", "dtxsid")]

rm(mmdb)
```
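
To make the filtering step concrete, the following sketch applies the same subsetting to a small invented table. The column names match those used above, but the rows are made up and stand in for the real MMDB file.

```{r MMDB edgelist sketch}
# toy stand-in for the MMDB table; all values are invented
toy_mmdb <- data.frame(
  media    = c("surface_water", "soil", "surface_water"),
  dtxsid   = c("DTXSID_A", "DTXSID_A", "DTXSID_B"),
  detected = c(1, 0, 1)
)

# keep detected records only, and only the two edgelist columns
toy_edgelist <- toy_mmdb[which(toy_mmdb$detected == 1),
                         colnames(toy_mmdb) %in% c("media", "dtxsid")]
toy_edgelist
```

Each remaining row is one chemical-to-media tie, which is the format every other edgelist in this document follows.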


## Chemical Data Reporting Database

These data house information about industrial use of chemicals reported under the Toxic Substances Control Act (TSCA) rule, which requires industries that produce chemicals above a set threshold to document their production, including their industrial sector/purpose. The dataset was modified by taking the CASRN IDs in the Chemical Data Reporting (CDR) database, batch searching the CompTox Dashboard for the corresponding DTXSIDs, and appending those DTXSIDs to the CDR data.

```{r load CDR}
cdr <- as.data.frame(
  read_excel("./input/data/CDR/2020 CDR Public Excel Data/2020 CDR Industrial Processing and Use Information.xlsx")[,c("CHEMICAL ID", "INDUSTRIAL SECTOR", "IND SECTOR OTHER DESC", "INDUSTRIAL FUNCTION CATEGORY", "IND FUNCT CAT OTHER DESC", "JOINT FUNCTION CATEGORY", "JOINT FUNCT CAT OTHER DESC")]
)

# batch search results ----
cdr_DTXSID_CASRN <- read.csv("./input/data/CDR/2020 CDR Public Excel Data/CCD-Batch-Search_2023-08-02_05_18_00_CDR-DTXSIDs-CASRN.csv",
                             header = TRUE)[,c("DTXSID", "PREFERRED_NAME", "CASRN")]
# joining CDR with DTXSIDs ----
cdr_expanded <- left_join(cdr,
                          unique(cdr_DTXSID_CASRN[,c("DTXSID", "CASRN")]),
                          by = join_by("CHEMICAL ID" == "CASRN"),
                          relationship = "many-to-one")
cdr_edgelist <- unique(
  na.omit(cdr_expanded[!cdr_expanded$`INDUSTRIAL SECTOR` %in% c("Not Known or Reasonably Ascertainable", "Other (requires additional information)", "Carbon Black Manufacturing"),c("INDUSTRIAL SECTOR", "DTXSID")])
  )
colnames(cdr_edgelist) <- c("media", "dtxsid")
cdr_edgelist$media <- gsub(",",
                           "",
                           cdr_edgelist$media)
cdr_edgelist$media <- gsub("\\-",
                           "_",
                           cdr_edgelist$media)
cdr_edgelist$media <- gsub(r"{\s*\([^\)]+\)}",
                           "",
                           cdr_edgelist$media)
cdr_edgelist$media <- tolower(cdr_edgelist$media)
cdr_edgelist$media <- gsub(" ",
                           "_",
                           cdr_edgelist$media)
cdr_edgelist$media <- paste0("CDR_",
                             cdr_edgelist$media)

rm(cdr,
   cdr_DTXSID_CASRN,
   cdr_expanded)
```
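
The label clean-up above is repeated, with small variations, for each data source. The sketch below bundles the CDR version into a single function to show the order of operations. `clean_media_label()` is a hypothetical helper written for illustration, not part of the project code, and the input strings are examples only.

```{r label cleaning sketch}
# clean_media_label() is a hypothetical helper, not part of the project code
clean_media_label <- function(x, prefix) {
  x <- gsub(",", "", x)                    # drop commas
  x <- gsub("-", "_", x)                   # hyphens to underscores
  x <- gsub("\\s*\\([^\\)]+\\)", "", x)    # drop parenthetical notes
  x <- tolower(x)
  x <- gsub(" ", "_", x)                   # spaces to underscores
  paste0(prefix, x)
}

clean_media_label("Paint and Coating Manufacturing", "CDR_")
clean_media_label("Utilities (example note)", "CDR_")
```

The escaped pattern here is equivalent to the raw string `r"{\s*\([^\)]+\)}"` used in the chunk above; raw strings require R 4.0 or later.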


After filtering and cleaning, 401 unique chemicals in CDR data overlap with MMDB (~5.7%) with the remaining 6642 not found in MMDB (~94.3%).
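
The code that produced these overlap counts is not shown in this document. One way such a figure could be computed is sketched below on toy identifiers; in practice the inputs would be the `dtxsid` columns of a source edgelist and `mmdb_edgelist`.

```{r overlap share sketch}
# toy identifier sets; all values are invented
source_ids <- c("A", "B", "C", "D")
mmdb_ids   <- c("C", "D", "E")

n_overlap <- length(intersect(unique(source_ids), unique(mmdb_ids)))
n_source  <- length(unique(source_ids))

# percent of source chemicals that also appear in MMDB
pct_overlap <- round(100 * n_overlap / n_source, 1)
pct_overlap
```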

## Consumer Products Database

Data related to consumer product formulations and ingredients can be downloaded directly [here](https://comptox.epa.gov/chemexpo/get_data/) and are also provided in the input folder.

```{r load CPDat}
chemexpo <- read.csv("./input/data/ChemExpo Use data/ChemExpo_bulk_composition_chemicals_20230727/ChemExpo_bulk_composition_chemicals.csv",
                     header = TRUE)[,c("Product.Name", "PUC.General.Category", "PUC.Product.Family", "PUC.Product.Type", "DTXSID")]
chemexpo$PUC.General.Category <- chemexpo$PUC.General.Category %>% na_if("")
chemexpo$DTXSID <- chemexpo$DTXSID %>% na_if("")
chemexpo_filtered <- unique(
  na.omit(chemexpo[!chemexpo$PUC.General.Category %in% c("Unknown or Indeterminate", "Other vehicles/mass transit"),c(2:ncol(chemexpo))])
  )

# designate PUC level ----
# creates an edgelist containing only general PUC level
chemexpo_edgelist <- unique(
  data.frame("media" = chemexpo_filtered[,"PUC.General.Category"],
             "dtxsid" = chemexpo_filtered$DTXSID)
  )
chemexpo_edgelist$media <- gsub(",",
                                "",
                                chemexpo_edgelist$media)
chemexpo_edgelist$media <- gsub("\\.",
                                "",
                                chemexpo_edgelist$media)
chemexpo_edgelist$media <- gsub("\\'",
                                "",
                                chemexpo_edgelist$media)
chemexpo_edgelist$media <- gsub("\\/",
                                "_",
                                chemexpo_edgelist$media)
chemexpo_edgelist$media <- gsub(r"{\s*\([^\)]+\)}",
                                "",
                                chemexpo_edgelist$media)
chemexpo_edgelist$media <- tolower(chemexpo_edgelist$media)
chemexpo_edgelist$media <- gsub(" ",
                                "_",
                                chemexpo_edgelist$media)
chemexpo_edgelist$media <- paste0("ChEx_",
                                  chemexpo_edgelist$media)

# additional filtering ----
# these 2 lines are useful if using the ALL PUC categories regardless of level
chemexpo_edgelist$media <- chemexpo_edgelist$media %>% na_if("")
chemexpo_edgelist <- na.omit(chemexpo_edgelist)

rm(chemexpo,
   chemexpo_filtered)
```

1068 unique chemicals in CPDat overlap with MMDB (~9.7%) with the remaining 9997 not found in MMDB (~90.3%).

## Drugbank and Orangebook

This section uses a curated list of chemicals from the Drugbank dataset, made available by the University of Alberta. FDA's Orangebook contains similar data, and the chemicals from both sources are linked to a single pharmaceutical media category.

```{r load pharmaceutical data}
# Drugbank ----
drugbank <- read_excel("./input/data/Drugbank/drugbank_list_chemicals-2023-11-13-08-33-09.xls")[,"DTXSID"]
drugbank_edgelist <- data.frame("media" = rep("pharmaceuticals",
                                              length(drugbank)),
                                "dtxsid" = c(drugbank))
colnames(drugbank_edgelist)[2] <- "dtxsid"

# Orangebook ----
ob_products_dtxsids <- read.csv("./input/data/FDA/Orange Book/orange-book-products-Batch-Search-chemical-list-2023-11-14.csv",
                                header = T)[,c("DTXSID", "INPUT")]
ob_edgelist <- data.frame("media" = rep("pharmaceuticals",
                                        nrow(ob_products_dtxsids)),
                          "dtxsid" = ob_products_dtxsids$DTXSID)

# joining Drugbank and Orangebook ----
pharm_edgelist <- unique(
  rbind(drugbank_edgelist,
        ob_edgelist)
  )
pharm_edgelist$media <- paste0("PHARM_",
                               pharm_edgelist$media)

rm(drugbank,
   drugbank_edgelist,
   ob_products_dtxsids,
   ob_edgelist)
```

474 unique chemicals in Drugbank and Orangebook overlap with MMDB (~5.7%) with the remaining 7775 not found in MMDB (~94.3%).

## Food Additives and Contacts

```{r load food additive and contact data}
# food additives ----
contact_categories <- read.csv("./input/data/FDA/21CFR_Food_categories.csv",
                               header = T)
food_add <- read.csv("./input/data/FDA/FoodSubstances_additive_edit_2023-11-15.csv",
                     header = T)[,c(1, 11:29)]
food_add$CAS.Reg.No..or.other.ID. <- gsub(" ",
                                          "",
                                          food_add$CAS.Reg.No..or.other.ID.)
food_add$CAS.Reg.No..or.other.ID. <- food_add$CAS.Reg.No..or.other.ID. %>% na_if("")
food_add <- food_add[which(!is.na(food_add$CAS.Reg.No..or.other.ID.)),]
colnames(food_add)[1] <- "CASRN"
food_add <- reshape(data = food_add,
                    varying = list(colnames(food_add[,2:ncol(food_add)])),
                    idvar = colnames(food_add)[1],
                    direction = "long")
food_add <- na.omit(food_add[,which(!colnames(food_add) %in% "time")])

# appending DTXSIDs from CompTox batch search
food_add_dtxsids <- read.csv("./input/data/FDA/fda_additive_CCD-Batch-Search_2023-11-15.csv",
                             header = T)[,c("DTXSID", "INPUT")]
food_add_dtxsids$DTXSID[food_add_dtxsids$DTXSID == "N/A"] <- NA
food_add <- merge(food_add,
                  food_add_dtxsids,
                  by.x = "CASRN",
                  by.y = "INPUT")
food_add <- na.omit(food_add)

# applying 21CFR categories
for (i in seq(nrow(contact_categories))){
  food_add$Reg.add01[which(food_add$Reg.add01 >= contact_categories[i,]$start &
                             food_add$Reg.add01 <= contact_categories[i,]$end)] <- contact_categories[i,]$part
}

food_add$Reg.add01 <- gsub(",",
                           "",
                           food_add$Reg.add01)
food_add$Reg.add01 <- gsub(" ",
                           "_",
                           food_add$Reg.add01)
food_add$Reg.add01 <- gsub("-",
                           "_",
                           food_add$Reg.add01)
food_add$Reg.add01 <- gsub(":",
                           "",
                           food_add$Reg.add01)
food_add$Reg.add01 <- tolower(food_add$Reg.add01)
food_add_edgelist <- data.frame("media" = food_add$Reg.add01,
                                "dtxsid" = food_add$DTXSID)
# dropping rows whose media label still contains the digit "1";
# grepl() is used because negative indexing with an empty grep() result would drop every row
food_add_edgelist <- food_add_edgelist[!grepl("1",
                                              food_add_edgelist$media),]

# removing media categories that occur only once
food_add_edgelist <- food_add_edgelist[which(!food_add_edgelist$media %in% names(table(food_add_edgelist$media)[which(table(food_add_edgelist$media) == 1)])),]
food_add_edgelist$media <- paste0("FDAa_",
                                  food_add_edgelist$media)

# food contact ----
food_con <- read.csv("./input/data/FDA/FoodSubstances_contact_edit_2023-11-15.csv", header = T)[,c(1, 9:29)]
colnames(food_con)[1] <- "CASRN"
food_con <- reshape(data = food_con,
                    varying = list(colnames(food_con[,2:ncol(food_con)])),
                    idvar = colnames(food_con)[1],
                    direction = "long")
food_con <- na.omit(food_con[,which(!colnames(food_con) %in% "time")])

# applying 21CFR categories
for (i in seq(nrow(contact_categories))){
  food_con$Reg01[which(food_con$Reg01 >= contact_categories[i,]$start &
                         food_con$Reg01 <= contact_categories[i,]$end)] <- contact_categories[i,]$part
}

food_con$Reg01 <- gsub(",",
                       "",
                       food_con$Reg01)
food_con$Reg01 <- gsub(" ",
                       "_",
                       food_con$Reg01)
food_con$Reg01 <- gsub("-",
                       "_",
                       food_con$Reg01)
food_con$Reg01 <- gsub(":",
                       "",
                       food_con$Reg01)

# appending DTXSIDs from CompTox batch search
food_con_dtxsids <- read.csv("./input/data/FDA/fda_contact_CCD-Batch-Search_2023-11-15.csv",
                             header = T)[,c("DTXSID", "INPUT")]
food_con_dtxsids$DTXSID[food_con_dtxsids$DTXSID == "N/A"] <- NA
food_con$CASRN <- trimws(food_con$CASRN)
food_con <- merge(food_con,
                  food_con_dtxsids,
                  by.x = "CASRN",
                  by.y = "INPUT")
food_con <- na.omit(food_con[,c("Reg01", "DTXSID")])
colnames(food_con) <- c("media", "dtxsid")
# dropping rows whose media label contains the digits 1, 5, or 7;
# grepl() avoids dropping every row when a pattern has no matches
food_con <- food_con[!grepl("[157]",
                            food_con$media),]
food_con_edgelist <- food_con
food_con_edgelist$media <- paste0("FDAc_",
                                  food_con_edgelist$media)
fda_edgelist <- unique(
  rbind(food_add_edgelist[which(!food_add_edgelist$media %in% c("FDAa_substances_generally_recognized_as_safe", "FDAa_direct_food_substances_affirmed_as_generally_recognized_as_safe", "FDAa_prior_sanctioned_food_ingredients", "FDAa_food_additives_permitted_in_food_or_in_contact_with_food_on_an_interim_basis_pending_additional_study", "FDAa_indirect_food_substances_affirmed_as_generally_recognized_as_safe")),],
                             food_con_edgelist[which(!food_con_edgelist$media %in% c("FDAc_Food_Additives_Permitted_for_Direct_Addition_to_Food_for_Human_Consumption", "FDAc_Prior_Sanctioned_Food_Ingredients", "FDAc_Direct_Food_Substances_Affirmed_as_Generally_Recognized_as_Safe", "FDAc_Substances_Generally_Recognized_as_Safe", "FDAc_Food_Additives_Permitted_in_Food_or_in_Contact_with_Food_on_an_Interim_Basis_Pending_Additional_Study", "FDAc_Indirect_Food_Substances_Affirmed_as_Generally_Recognized_as_Safe")),])
  )

rm(contact_categories,
   food_add,
   food_add_dtxsids,
   food_add_edgelist,
   food_con,
   food_con_dtxsids,
   food_con_edgelist)
```
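
The "applying 21CFR categories" loops in the chunk above replace a regulation number with the name of the category whose range contains it. The sketch below shows the same idea on invented values. The `start`, `end`, and `part` columns mirror the names used above, but the ranges and labels are made up, and the result is written to a separate vector rather than over the original column.

```{r CFR category mapping sketch}
# hypothetical lookup table; the real one is read from 21CFR_Food_categories.csv
toy_categories <- data.frame(
  start = c(172, 175),
  end   = c(173, 178),
  part  = c("category_one", "category_two")
)
toy_regs <- c(172.5, 176.1, 181.2)

# assign each regulation number the label of the range it falls in
toy_labels <- rep(NA_character_, length(toy_regs))
for (i in seq_len(nrow(toy_categories))) {
  hit <- toy_regs >= toy_categories$start[i] & toy_regs <= toy_categories$end[i]
  toy_labels[hit] <- toy_categories$part[i]
}
toy_labels   # the last value stays NA because no range contains 181.2
```

Keeping the labels in their own vector avoids mixing numbers and text in one column while the loop is still comparing against numeric bounds.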

103 unique chemicals in FDA Additives data overlap with MMDB (~6.7%) with the remaining 1444 not found in MMDB (~93.3%). Additionally, 171 unique chemicals in FDA Contact data overlap with MMDB (~6.4%) with the remaining 2501 not found in MMDB (~93.6%).

## Chemical Transformations Database

The Chemical Transformations (CheT) Database is an internal EPA dataset that links known parent compounds to their degradation/breakdown products. Only one-way traversal is assumed for this network: from sources to sinks.

```{r load CheT data}
breakdown_edgelist <- read.csv("./input/data/CheT/breakdown_edgelist_all.csv",
                               header = F)
colnames(breakdown_edgelist) <- c("parent", "product")
breakdown_edgelist <- unique(breakdown_edgelist)
```

## Aggregate Data

### Sources

For filtering purposes later in the analysis, the source edgelists are combined to compile the unique chemical-to-media relationships at points of origin.

```{r combine source data}
sources <- unique(
  rbind(cdr_edgelist,
        chemexpo_edgelist,
        pharm_edgelist,
        fda_edgelist)
  )
```

### Complete Edgelist

This edgelist contains all ties available from datasets so far and will be used in connecting edges from Parent chemicals to their Breakdown Products in MMDB later in this section.

```{r complete edgelist}
complete_edgelist <- unique(
  rbind(mmdb_edgelist,
        cdr_edgelist,
        chemexpo_edgelist,
        pharm_edgelist,
        fda_edgelist)
  )
```

### Overlapping Edgelist

An inner join between source and sink media is used to filter out chemicals that have ties to only one of the two types. The idea is to create a network containing all reported or supported connections between sources and sinks.

```{r inner edgelist}
inner_edgelist <- unique(
  rbind(mmdb_edgelist[which(mmdb_edgelist$dtxsid %in% sources$dtxsid),],
        sources[which(sources$dtxsid %in% mmdb_edgelist$dtxsid),])
  )
```
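
The following sketch runs the same overlap filter on two invented edgelists, to show which rows survive. Only chemical `A` appears on both the source and the sink side, so only its ties are kept.

```{r overlapping edgelist sketch}
# toy sink (MMDB-like) and source edgelists; all values are invented
toy_sinks   <- data.frame(media = c("soil", "air"),    dtxsid = c("A", "B"))
toy_sources <- data.frame(media = c("CDR_x", "CDR_y"), dtxsid = c("A", "C"))

toy_inner <- unique(
  rbind(toy_sinks[which(toy_sinks$dtxsid %in% toy_sources$dtxsid),],
        toy_sources[which(toy_sources$dtxsid %in% toy_sinks$dtxsid),])
  )
toy_inner
```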

### Breakdown Edges

The next step creates and appends edges that tie parent chemicals to the MMDB media in which their associated breakdown products are found. This assumes one-way traversal, from source to sink, and does not create ties where a parent chemical is found in an MMDB category and its associated breakdown products appear in any source.

```{r breakdown edges}
breakdown_edges <- unique(
  na.omit(
    merge(complete_edgelist,
          breakdown_edgelist,
          by.x = "dtxsid",
          by.y = "product",
          all = T)[,c("media", "parent")]
    )
  )
breakdown_edges <- breakdown_edges[breakdown_edges$media %in% unique(mmdb_edgelist$media),]
colnames(breakdown_edges)[colnames(breakdown_edges) %in% "parent"] <- "dtxsid"
new_edges <- breakdown_edges[which(!paste0(breakdown_edges[,1],
                                           breakdown_edges[,2]) %in% paste0(inner_edgelist[,1],
                                                                            inner_edgelist[,2])),]
```
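
To show what the merge in this chunk produces, the sketch below uses one invented parent-product pair. Product `P1` is detected in a sink medium, so its parent `A` inherits a tie to that medium. The row for `A` itself has no parent and is removed by `na.omit()`.

```{r breakdown edge sketch}
# toy inputs; all identifiers and media are invented
toy_complete  <- data.frame(media  = c("surface_water", "CDR_x"),
                            dtxsid = c("P1", "A"))
toy_breakdown <- data.frame(parent = "A", product = "P1")

toy_breakdown_edges <- unique(
  na.omit(
    merge(toy_complete,
          toy_breakdown,
          by.x = "dtxsid",
          by.y = "product",
          all = TRUE)[,c("media", "parent")]
    )
  )
toy_breakdown_edges
```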

Breakdown products create 469 connections to sources from MMDB for 171 parent chemicals not found in `inner_edgelist`.

### Create the Network

Appending these breakdown edges to the overlapping source and sink data will finalize the data management phase for the initial network data. The following code will create a network object using the `igraph` package.

```{r create the network}
inner_edgelist_new <- rbind(inner_edgelist,
                            new_edges[which(new_edges$dtxsid %in% mmdb_edgelist$dtxsid & new_edges$dtxsid %in% sources$dtxsid),])
inner_net_new <- graph_from_edgelist(
  as.matrix(inner_edgelist_new)
  )
V(inner_net_new)$type <- bipartite_mapping(inner_net_new)$type
inner_net_new_media <- bipartite_projection(inner_net_new)$proj1
inner_net_new_chem <- bipartite_projection(inner_net_new)$proj2
```
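
A small example may help show what the two projections contain. The sketch below builds a three-edge toy network with invented names and projects it. It uses an undirected graph for simplicity, which is an assumption of this sketch rather than a setting taken from the chunk above.

```{r bipartite projection sketch}
library(igraph)

# toy edgelist; all names are invented
toy_edges <- data.frame(media  = c("water", "water", "soil"),
                        dtxsid = c("A", "B", "B"))
g <- graph_from_edgelist(as.matrix(toy_edges), directed = FALSE)
V(g)$type <- bipartite_mapping(g)$type
proj <- bipartite_projection(g)

# list the vertex names held by each projection
lapply(proj, function(p) V(p)$name)
```

Chemicals `A` and `B` share the medium `water`, so the chemical projection has a single edge between them. Media `water` and `soil` share chemical `B`, so the media projection also has a single edge.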

`inner_net_new` is a two-mode network object where chemicals connect to media categories. The corresponding one-mode, chemical-to-chemical projection of this network contained in the object `inner_net_new_chem` depicts chemical co-occurrence based on shared media where those chemicals are found. At this stage there are 1348 unique DTXSIDs and nearly 800,000 edges between them.

## OPERA Data

Using the `ctxR` package to connect to EPA APIs allows us to collect predicted OPERA properties such as 'water-solubility', 'boiling point', and so on. These also allow us to filter out inorganics, as these compounds won't have OPERA properties associated with them.

```{r opera data with ctxR}
# connect to APIs ----
chem_cluster <- data.frame(
  "dtxsid" = unique(inner_edgelist_new$dtxsid[1:10])
  )
chem_info_df <- get_chem_info_batch(
  DTXSID = chem_cluster$dtxsid,
  API_key = my_key,
  type = "predicted"
  )

# extract opera properties ----
tmp <- data.frame("dtxsid" = NA)
for(i in unique(chem_info_df[which(chem_info_df$source %in% "OPERA"),]$propertyId)){
  tmp <- merge(tmp, as.data.frame(chem_info_df[which(chem_info_df$source %in% "OPERA" & chem_info_df$propertyId %in% i), c("value", "dtxsid")]), by = "dtxsid", all = T)
  colnames(tmp)[ncol(tmp)] <- i
}
chem_opera <- tmp

opera_data <- left_join(
  chem_cluster,
  chem_opera,
  by = "dtxsid"
  )
opera_data <-
  na.omit(
    opera_data[opera_data$dtxsid %in% V(inner_net_new_chem)$name,]
  )

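# NOTE: the query above uses only the first 10 edgelist rows (`[1:10]`), so its result
# is saved under a `_test` name; the precomputed OPERA table is then loaded instead
save(opera_data, file = "./output/data/opera_20240518_test.RData")
load(file = "./output/data/opera_20240518.RData")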

rm(tmp)
```
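
The "extract opera properties" loop turns a long table (one row per chemical and property) into a wide one (one row per chemical, one column per property). The sketch below reproduces that pattern without an API call, on an invented table whose column names follow those used above. The identifiers, property names, and values are all made up.

```{r OPERA reshape sketch}
# toy stand-in for the ctxR result; identifiers, property names, and values are invented
toy_info <- data.frame(
  dtxsid     = c("D1", "D1", "D2", "D2"),
  propertyId = c("boiling_point", "water_solubility", "boiling_point", "water_solubility"),
  value      = c(100, 0.5, 80, 1.2),
  source     = "OPERA"
)

# seed with a placeholder row, then merge in one property column at a time
wide <- data.frame("dtxsid" = NA)
for (i in unique(toy_info$propertyId)) {
  wide <- merge(wide,
                toy_info[toy_info$propertyId %in% i, c("value", "dtxsid")],
                by = "dtxsid", all = TRUE)
  colnames(wide)[ncol(wide)] <- i
}
wide <- wide[!is.na(wide$dtxsid),]   # drop the placeholder row used to seed the merge
wide
```

In the chunk above, the placeholder row is not dropped explicitly; it disappears when the result is left-joined to `chem_cluster` and passed through `na.omit()`.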


## Quadratic Assignment Procedure (QAP)

The intuition behind this chunk of code stems from our initial assumptions for constructing our chemical co-occurrence network. Every edge in our network is built from *any* instance where two chemicals share at least one medium. This means that an edge between two chemicals found together across 100 media is equivalent to an edge between chemicals that share only one medium. Because this conflates chemicals that share many media with those that share only one, we developed a way of "de-noising" our chemical co-occurrence network by removing edges that aren't as robust or statistically significant as the rest.

The tool for the job is a hypothesis-testing method called the Quadratic Assignment Procedure (QAP), which tests an observed relationship between two or more matrices against randomized alternatives to determine whether the initial observation could be due to random chance. QAPs leverage Monte Carlo simulations to shuffle one of the matrices a specified number of times, building a distribution against which the observed relationship is compared, and then return a *t*-statistic for each variable (independent-variable matrix) included in the model. There are linear and generalized linear versions of these models, for linear and logistic regression respectively; this analysis implements the latter because our data are dichotomous.

Our goal is to utilize this model to test the relationships between every unique pair of chemicals in our dataset by constructing chemical *ego-networks*, which contain a chemical's shared presence between any two media categories, modeling two matrices between each pair of chemicals, and removing insignificant results or inversely related occurrences.

<p> Key considerations should be noted: <br>
- These models are computationally expensive and take a long time to run, depending on model parameters. <br>
- Constructing and storing lists of networks/adjacency matrices in the global environment requires a large amount of memory. <br>
- For-loops, which would be a go-to framework for repeated procedures, are sequential and not optimized for the operations this analysis requires. </p>

To remedy these obstacles, the following chunk nests a for-loop solution for our analyses within a `future_mapply` statement, where data management, network construction, and QAP modeling all take place within a local environment by referencing global environment data via indices denoting which chemical-specific data should be pulled. Additionally, this function allows us to parallelize these QAP runs, leveraging more computing power as needed. It should be noted that the following chunk still takes a very long time to run.
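
The index-driven pattern described above can be sketched in base R as follows. The lower triangle of a square matrix enumerates every unordered pair exactly once, and `mapply()` then visits each pair. The identifiers are invented, and the function body is a placeholder for the per-pair model. `future_mapply()` from `future.apply` is designed to accept the same kind of call.

```{r pair index sketch}
# toy identifiers; in the analysis these would be DTXSIDs
ids <- c("A", "B", "C", "D")
m <- matrix(0, nrow = length(ids), ncol = length(ids),
            dimnames = list(ids, ids))

# one (row, col) index per unordered pair of identifiers
toy_pairs <- as.data.frame(which(lower.tri(m, diag = FALSE), arr.ind = TRUE))

# placeholder for the per-pair work; here it only labels the pair
toy_res <- mapply(function(a, b) paste(ids[a], ids[b], sep = "-"),
                  toy_pairs$row,
                  toy_pairs$col)
toy_res
```

With four identifiers there are `choose(4, 2)`, or six, pairs; the number of models grows quadratically with the number of chemicals, which is why run time is a concern.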

The relevant output from these analyses is the *t*-statistic, which denotes the direction, strength, and significance of the relationship. Values between -2 and 2 are treated as insignificant, and the statistic has no upper or lower bound. Coefficients and p-values from these models are not appropriate for interpreting relationships, but they are retained in the model output objects for legacy purposes.
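
For orientation, a single model of this kind can be sketched with `netlogit()` from the `sna` package on two random matrices. The matrices here are generated with `rgraph()` purely for illustration and carry no meaning; the arguments mirror those used in the chunk below, with fewer repetitions.

```{r single netlogit sketch}
library(sna)

set.seed(1)
x <- rgraph(10)   # random binary 10 x 10 adjacency matrix (illustrative only)
y <- rgraph(10)

fit <- netlogit(y = y,
                x = x,
                nullhyp = "classical",
                reps = 100,
                diag = FALSE)

# first element: intercept; second element: the x matrix
fit$tstat
```

Under the rule described above, a pair would be kept only if the second element of `fit$tstat` is 2 or greater.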

```{r parallelized qaps}
### All matrix construction and QAP operations are placed within the future_mapply function
# create reference material ----
dat <- as.data.frame(
  t(
    as_biadjacency_matrix(inner_net_new)
    )
  )
dat$dtxsid <- rownames(dat)
dat <- dat[which(rownames(dat) %in% opera_data$dtxsid),]

ref <- as.matrix(
  as_adjacency_matrix(inner_net_new_chem,
                attr = "weight")
  )
ref <- ref[which(rownames(ref) %in% opera_data$dtxsid),which(colnames(ref) %in% opera_data$dtxsid)]

ref_mat <- matrix(0,
                  nrow = nrow(ref),
                  ncol = ncol(ref))
colnames(ref_mat) <- rownames(ref_mat) <- dat$dtxsid
ind <- as.data.frame(
  which(
    lower.tri(ref_mat,
              diag = FALSE) == TRUE,
    arr.ind = TRUE
    )
  )

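# initiate QAP models ----
# note: the multicore backend relies on forking, which is unavailable on Windows and
# disabled by default inside RStudio; future falls back to sequential processing there
plan(multicore,
     workers = parallelly::availableCores()-1)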
start.time <- Sys.time()
qaps <- future_mapply(function(a, b){
  # create IV matrix
  x <- graph_from_data_frame(
    data.frame(
      "dtxsid" = rep(rownames(get("dat"))[unlist(a)],
                     length(get("dat")[unlist(a),which(!colnames(get("dat")) %in% "dtxsid")])),
      "media" = colnames(get("dat"))[which(!colnames(get("dat")) %in% "dtxsid")],
      "weight" = c(t(get("dat")[unlist(a),which(!colnames(get("dat")) %in% "dtxsid")]))
      )
    )
  V(x)$type <- bipartite_mapping(x)$type
  x <- as.matrix(
    as_adjacency_matrix(
      bipartite_projection(
        graph_from_biadjacency_matrix(
          as_biadjacency_matrix(x,
                        attr = "weight")
          )
        )$proj2,
      attr = "weight")
    )

  # create DV matrix
  y <- graph_from_data_frame(
    data.frame(
      "dtxsid" = rep(rownames(get("dat"))[unlist(b)],
                     length(get("dat")[unlist(b),which(!colnames(get("dat")) %in% "dtxsid")])),
      "media" = colnames(get("dat"))[which(!colnames(get("dat"))%in%"dtxsid")],
      "weight" = c(t(get("dat")[unlist(b),which(!colnames(get("dat")) %in% "dtxsid")]))
      )
    )
  V(y)$type <- bipartite_mapping(y)$type
  y <- as.matrix(
    as_adjacency_matrix(
      bipartite_projection(
        graph_from_biadjacency_matrix(
          as_biadjacency_matrix(y,
                        attr = "weight")
          )
        )$proj2,
      attr = "weight")
    )

  # run QAP
  tmp <- netlogit(y = y,
                  x = x,
                  nullhyp = "classical",
                  reps = 1000,
                  diag = FALSE)

  # extract values to append to 'qaps' object
  list(tstat = tmp$tstat[2],
       coef = tmp$coefficients[2],
       pval = tmp$pgreqabs[2])

},
as.list(ind$row),
as.list(ind$col)
)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

# write.csv() always writes column names and ignores (with a warning) any col.names argument
write.csv(
  as.matrix(
    cbind(ind,
          t(
            as.matrix(qaps)
            )
          )
    ),
  "./output/results/qaps_classical_1000reps_20240517.csv",
  row.names = F
  )
```

The *classical* null hypothesis corresponds to the original assumptions of the QAP models (see [Krackhardt, 1987](https://www.sciencedirect.com/science/article/pii/0378873387900128)), and yields identical *t*-statistics to the often preferred Dekker's "semi-partialling plus" procedure (see [Dekker 2007](https://link.springer.com/article/10.1007/s11336-007-9016-1)). With the QAP results in hand and our outputs recorded, we now filter out insignificant edges from our network in the following chunk.

```{r filtering insignificant edges from QAPs}
qaps_df <- as.matrix(
    cbind(ind,
          t(
            as.matrix(qaps)
            )
          )
    )

# should you wish to call in the data, use this and following line
qaps_sig <- read.csv("./output/results/qaps_classical_1000reps_20240517.csv", header = T)
qaps_df <- data.frame("tstat" = qaps_sig$tstat)

qaps_df <- cbind(ind,
                 qaps_df)
qaps_df$to <- rownames(ref_mat)[qaps_df$row]
qaps_df$from <- colnames(ref_mat)[qaps_df$col]
qaps_df <- qaps_df[,c("from", "to", "tstat")]

fin <- as_data_frame(inner_net_new_chem)
fin <- fin[which(fin$from %in% opera_data$dtxsid),]
fin <- fin[which(fin$to %in% opera_data$dtxsid),]
fin <- left_join(fin,
                 qaps_df,
                 by = c("from", "to"))
fin_net <- graph_from_data_frame(fin,
                                 directed = F)
fin_net <- delete_edges(fin_net,
                        which(E(fin_net)$tstat < 2))
fin_net <- delete_vertices(fin_net,
                           names(which(igraph::degree(fin_net,
                                                      v = V(fin_net)) == 0)))
rm(qaps_sig)
```
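
The thresholding step can be illustrated on a toy graph. The sketch below is not part of the pipeline: the edge list, node names, and `toy_*` objects are invented for illustration, and the cutoff of 2 simply mirrors the chunk above. It also shows a detail worth keeping in mind: `which()` drops `NA` comparisons, so an edge with a missing *t*-statistic (for example, one left unmatched by a join) is not removed by this kind of filter.

```r
library(igraph)

# Hypothetical edge list; the tstat values are made up for illustration
toy_edges <- data.frame(
  from  = c("A", "A", "B", "C", "D"),
  to    = c("B", "C", "C", "D", "E"),
  tstat = c(3.1, 0.4, 2.5, NA, 1.2)
)
toy_net <- graph_from_data_frame(toy_edges, directed = FALSE)

# Drop edges below the cutoff; which() skips NA, so the C-D edge survives
toy_net <- delete_edges(toy_net, which(E(toy_net)$tstat < 2))

# Drop vertices that are left without any edge
isolates <- names(which(igraph::degree(toy_net) == 0))
toy_net <- delete_vertices(toy_net, isolates)

kept <- igraph::as_data_frame(toy_net, what = "edges")
print(kept)
```

If edges without a *t*-statistic should also be dropped, one option is to add `is.na(E(toy_net)$tstat)` to the filter condition; whether that is desirable depends on how unmatched edges ought to be interpreted.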

For this part of the project, we retain only the strongest significant links for each chemical to de-noise the overall network, because many edges remain even after insignificant edges are filtered out. The following chunk keeps the top 2 links (ranked by *t*-statistic) for each chemical. Two was chosen because it is the smallest number of links per chemical that produces a single connected component, which the overlapping community detection algorithms require.

```{r preparing top-2 links network}
net <- data.frame(
  "tstat" = NA,
  "from" = NA,
  "to" = NA
)
for(i in 1:length(unique(c(qaps_df$from, qaps_df$to)))){
  tmp <- qaps_df[which(qaps_df$from == unique(c(qaps_df$from, qaps_df$to))[i] |
                         qaps_df$to == unique(c(qaps_df$from, qaps_df$to))[i]), c("from", "to", "tstat")]
  z <- tmp[which(tmp$from == unique(c(qaps_df$from, qaps_df$to))[i]),]
  z$from <- tmp[which(tmp$from == unique(c(qaps_df$from, qaps_df$to))[i]),]$to
  z$to <- tmp[which(tmp$from == unique(c(qaps_df$from, qaps_df$to))[i]),]$from
  tmp[which(tmp$from == unique(c(qaps_df$from, qaps_df$to))[i]),] <- z
  tmp <- tmp[,c(2,1,3)]
  colnames(tmp)[1:2] <- c("from", "to")

  tmp <- tmp[order(tmp$tstat, decreasing = T),]
  net <- rbind(net, tmp[1:2,]) # designates number and order of edges
}
z <- graph_from_data_frame(na.omit(net[,c("from", "to", "tstat")]), directed = T)
z <- delete_edges(z, which(E(z)$tstat < 2))
z.1 <- delete_vertices(z, names(which(igraph::degree(z, v = V(z)) == 0)))
```
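
The top-2 selection can be written more compactly. The following is a minimal base R sketch of the same idea under stated assumptions: the edge list and the helper `top_k_links()` are hypothetical, and the function is an illustration rather than a drop-in replacement for the loop above.

```r
# Hypothetical undirected edge list with made-up t-statistics
edges <- data.frame(
  from  = c("A", "A", "A", "B", "B", "C"),
  to    = c("B", "C", "D", "C", "D", "D"),
  tstat = c(5.0, 3.0, 1.0, 4.0, 2.5, 6.0),
  stringsAsFactors = FALSE
)

# Illustrative helper: keep the k strongest links incident to each node
top_k_links <- function(edges, k = 2) {
  nodes <- unique(c(edges$from, edges$to))
  picked <- lapply(nodes, function(n) {
    # All edges touching node n, oriented so that n is always in 'from'
    hit <- edges[edges$from == n | edges$to == n, ]
    partner <- ifelse(hit$from == n, hit$to, hit$from)
    out <- data.frame(from = n, to = partner, tstat = hit$tstat,
                      stringsAsFactors = FALSE)
    # Sort by strength and keep the first k rows
    head(out[order(out$tstat, decreasing = TRUE), ], k)
  })
  do.call(rbind, picked)
}

top2 <- top_k_links(edges, k = 2)
print(top2)
```

Because every node contributes its own top links, the same pair can appear twice in opposite orientations (here both A to B and B to A), and a weak link can still be kept when it is among the two strongest for one of its endpoints.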

## Visualizing the Network Pre/Post-Filtering

Here we can visualize the network filtration steps to see how the denoising process has shaped the chemical co-occurrence network.

```{r network visualizations}
pre <- inner_net_new_chem
post <- graph_from_data_frame(get.data.frame(z.1), directed = F)
p <- plot(pre,
     vertex.size = 2,
     vertex.color = rgb(0.9,0.9,0.9,0.2),
     vertex.frame.color = "grey50",
     vertex.label = NA,
     edge.color = rgb(0.9,0.9,0.9,0.2),
     layout = layout_with_kk(pre)
     )
png(filename = "./output/figures/chem-chem_network_pre-filter_layout-kk.png", height = 1000, width = 1000)
set.seed(123)
plot(pre,
     vertex.size = 2,
     vertex.color = rgb(0.9,0.9,0.9,0.9),
     vertex.frame.color = "grey50",
     vertex.label = NA,
     edge.color = rgb(0.9,0.9,0.9,0.9),
     layout = layout_with_kk(pre)
     )
dev.off()

png(filename = "./output/figures/chem-chem_network_post-filter_layout-kk.png", height = 1000, width = 1000)
set.seed(123)
plot(fin_net,
     vertex.size = 2,
     vertex.color = rgb(0.9,0.9,0.9,0.9),
     vertex.frame.color = "grey50",
     vertex.label = NA,
     edge.color = rgb(0.9,0.9,0.9,0.9),
     layout = layout_with_kk(fin_net) # layout must be computed on the graph being plotted
     )
dev.off()

png(filename = "./output/figures/chem-chem_network_strongestlink2_layout-kk.png", height = 1000, width = 1000)
set.seed(123)
plot(post,
     vertex.size = 2,
     vertex.color = rgb(0.9,0.9,0.9,0.9),
     vertex.frame.color = "grey50",
     vertex.label = NA,
     edge.color = rgb(0.9,0.9,0.9,0.9),
     layout = layout_with_kk(post)
     )
dev.off()
```

## BIGCLAM Data Preparation

Now that our network is filtered, the next step is to shape our data into a format that the C++ implementation of the BIGCLAM model from the Stanford Network Analysis Project ([SNAP](https://snap.stanford.edu/)) can read. We do this by mapping our IDs to integers and exporting two files: an edgelist of significant co-occurrence relationships and a list of node labels, which is later used to attach the DTXSIDs to the model's output files. These files mimic those found in SNAP's `agmfit` example C++ code: *football.edgelist* and *football.labels*.

```{r exporting network data for BIGCLAM}
# creating edgelist and labels for SNAP 'agmfit' c++ code from https://snap.stanford.edu/snap/index.html
# mirrors example data files in SNAP-6.0/examples/agmfit: 'football.edgelist' & 'football.labels'
a <- get.data.frame(z.1)[ , 1:2 ]
b <- data.frame("num" = 1:length(unique(V(z.1)$name)),
                "id" = unique(V(z.1)$name))
a$i <- b[match(a$from, b$id),]$num
a$j <- b[match(a$to, b$id),]$num
c <- a[,c("j", "i")]
colnames(c) <- c("i", "j") # renaming is essential: 'rbind()' matches data frame columns by name, which would otherwise undo the swap
d <- rbind(a[,c("i", "j")], c) # creates duplicate edge for undirected relationship
write.table( d, file = "./output/data/SNAP/bigclam/undirected.qap.classical.1000.opera.stronglink2.20240715.edgelist", row.names = FALSE, col.names = FALSE, sep="\t" )
write.table( b, file = "./output/data/SNAP/bigclam/undirected.qap.classical.1000.opera.stronglink2.20240715.labels", row.names = FALSE, col.names = FALSE, sep="\t" )

rm(a,b,c,d)
```
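
The relabeling and mirroring logic can be checked on a small example. This is a sketch only: the IDs are placeholders rather than real DTXSIDs, the object names are hypothetical, and the file is written to a temporary path instead of the project's output folder.

```r
# Hypothetical edge list keyed by string IDs (placeholders, not real DTXSIDs)
edges <- data.frame(
  from = c("chem_x", "chem_x", "chem_y"),
  to   = c("chem_y", "chem_z", "chem_z"),
  stringsAsFactors = FALSE
)

# Lookup table: one integer per unique node ID
labels <- data.frame(id = unique(c(edges$from, edges$to)),
                     stringsAsFactors = FALSE)
labels$num <- seq_len(nrow(labels))

# Translate both endpoints to integers
i <- labels$num[match(edges$from, labels$id)]
j <- labels$num[match(edges$to, labels$id)]

# Write each undirected edge in both directions
edgelist <- rbind(data.frame(i = i, j = j),
                  data.frame(i = j, j = i))

# Tab-separated export without headers, as in the chunk above
tmp <- tempfile(fileext = ".edgelist")
write.table(edgelist, file = tmp, row.names = FALSE, col.names = FALSE, sep = "\t")

# Round trip: integers map back to the original IDs
back <- labels$id[match(edgelist$i, labels$num)]
print(readLines(tmp))
```

Keeping the label table alongside the edgelist is what makes the round trip possible once the model has returned integer IDs.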

## Media Data

The final data preparation step is to subset our media data for subsequent modeling. The following chunk creates the required object.

```{r subset media data}
media.data <- as.data.frame(t(get.incidence(inner_net_new)))[which(rownames(as.data.frame(t(get.incidence(inner_net_new)))) %in% V(z.1)$name),]
```

## Functional Use Category Data

This chunk loads Quantitative Structure-Use Relationship (QSUR) predictions from "ccd_qsur_table_Sep-27-2023.csv". The predictions were obtained from the CompTox Chemicals Dashboard by batch searching the list of unique DTXSIDs in our assembled network.

```{r functional use categories}
## QSUR predictions 2023-09-27
qsur <- read.csv("./input/data/ccd_qsur_table_Sep-27-2023.csv", header = T)
qsur <- qsur[which(qsur$dtxsid %in% rownames(media.data)),]
qsur$harmonized_functional_use <- paste0("FU_", qsur$harmonized_functional_use)
qsur_net <- graph_from_data_frame(qsur, directed = F)
V(qsur_net)$type <- bipartite.mapping(qsur_net)$type
qsur_df <- as.data.frame(get.incidence(qsur_net, attr = "probability"))
```

# Data Analysis

## WALKTRAP Community Detection Algorithm

The [Walktrap](https://igraph.org/r/doc/cluster_walktrap.html) algorithm simulates short random "walks" through the network, moving from node to node along the edges. Walkers tend to stay within densely connected regions, so nodes whose walks end in the same places with similar probabilities are grouped into communities. The *t*-statistics are passed to the "weights" argument. Since networks vary in size and scope, the "steps" argument sets the length of the walks; its default value is 4.

The following chunk begins by running the algorithm once for each walk length from 4 to 200 steps, then calculates and stores the [modularity](https://igraph.org/r/doc/modularity.igraph.html) of the resulting community membership.

A final run of the algorithm is conducted using the steps value which produced the highest modularity score. A colored-network figure is generated and community-specific lists of DTXSIDs are produced to aid in the manual review of the chemical composition of these groups.

The last part of this chunk transforms the community membership information into a format that will be used in subsequent modeling phases to examine enrichment patterns of these communities.

```{r walktrap algorithm}
modu <- list()
for(i in 1:197){ # checking variable step sizes from 4 (default) to 200
  wc <- cluster_walktrap(z.1,
                         weights = E(z.1)$tstat,
                         steps = 3+i)
  modu[[i]] <- modularity(z.1, wc$membership)
}
wc <- cluster_walktrap(z.1,
                       weights = E(z.1)$tstat,
                       steps = 3+which.max(unlist(modu)) # first maximum, so ties cannot pass several step values
                       )
Membership <- hue_pal()(max(wc$membership))

png(filename = "./output/figures/chem-chem_network_strongestlink2_layout-kk_Walktrap.png", height = 1000, width = 1000)
set.seed(123)
plot(post,
     vertex.size = 2,
     vertex.color = Membership[wc$membership],
     vertex.frame.color = Membership[wc$membership],
     vertex.label = NA,
     edge.color = rgb(0.9,0.9,0.9,0.9),
     layout = layout_with_kk(post)
     )
dev.off()

# pull DTXSIDs for manual review
for(i in 1:max(wc$membership)){
  write.table(wc$names[wc$membership==i],
              file = paste0("./output/results/walktrap/membership-lists/walktrap-weighted_comm_", i, ".txt"),
              col.names = F,
              row.names = F,
              quote = F
  )
}

## WALKTRAP
wt_comm_net <- graph_from_edgelist(as.matrix(data.frame("from" = wc$names, "to" = wc$membership)), directed = F)
V(wt_comm_net)$type <- bipartite.mapping(wt_comm_net)$type
wt_comm_mat <- as.matrix(get.incidence(wt_comm_net))
wt_comm_mat <- as.data.frame(wt_comm_mat)
wt_comm_mat[,1:ncol(wt_comm_mat)] <- lapply(wt_comm_mat[,1:ncol(wt_comm_mat)], as.factor)
colnames(wt_comm_mat) <- trimws(colnames(wt_comm_mat))
wt_comm_mat <- wt_comm_mat[,order(as.numeric(colnames(wt_comm_mat)))]
wt_comm_mat$dtxsid <- rownames(wt_comm_mat)

rm(modu)
```
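
The step-length scan can be tried on a small built-in graph. The sketch below is illustrative only: it uses Zachary's karate club network from igraph as a stand-in for the chemical network, an arbitrary and much shorter range of step lengths (`steps_grid`), and no edge weights.

```r
library(igraph)

# Zachary's karate club ships with igraph; it stands in for the chemical network
g <- make_graph("Zachary")

steps_grid <- 4:20  # hypothetical, shorter range than the 4 to 200 used above
mod_by_steps <- sapply(steps_grid, function(s) {
  wc_s <- cluster_walktrap(g, steps = s)
  modularity(g, membership(wc_s))
})

# which.max() returns the first maximum, so ties yield a single step value
best_steps <- steps_grid[which.max(mod_by_steps)]
best <- cluster_walktrap(g, steps = best_steps)

print(data.frame(steps = steps_grid, modularity = round(mod_by_steps, 3)))
print(sizes(best))
```

Printing the whole modularity profile, rather than only the maximum, makes it easier to see whether the best step length is a clear optimum or one of several near-equivalent choices.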


## Overlapping Community Detection

This portion of the analysis requires running SNAP's Cluster Affiliation Model for Big Networks (BIGCLAM) through either the [C++](https://snap.stanford.edu/snap/index.html) implementation (detailed in this document) or the [Python](https://snap.stanford.edu/snappy/index.html) implementation. The output files from the BIGCLAM run (provided in the "output/results/" folder) are loaded and transformed for subsequent review and modeling.

### SNAP BIGCLAM

```{r load BIGCLAM outputs}
## BIGCLAM
bigclam.communities <- read.csv("./output/results/SNAP/bigclam/stronglink2/bigclam.undirected.qap.classical.1000.strongest.link.2.20240715.cmtyvv.txt", sep = "\t", header = F)

bigclam.communities.long <- data.frame()
for( i in 1:nrow(bigclam.communities)){
  for (j in 1:length(bigclam.communities[i,])){
    bigclam.communities.long <- rbind(bigclam.communities.long, cbind(as.data.frame(i), bigclam.communities[i,][[j]]))
  }
}
bigclam.comm.long <- bigclam.communities.long
colnames(bigclam.comm.long) <- c("cluster", "dtxsid")
bigclam.comm.long$dtxsid <- bigclam.comm.long$dtxsid %>% na_if("")
bigclam.comm.long <- na.omit(bigclam.comm.long)
bigclam.comm.long <- bigclam.comm.long[order(bigclam.comm.long$cluster),]

# pull DTXSIDs for manual review
for(i in 1:max(bigclam.comm.long$cluster)){
  write.table(bigclam.comm.long[bigclam.comm.long$cluster==i,]$dtxsid, # specify which community to extract here
              file = paste0("./output/results/SNAP/bigclam/stronglink2/membership-lists/community_", i, ".txt"),
              col.names = F,
              row.names = F,
              quote = F
  )
}
write.table(V(z.1)$name[which(!V(z.1)$name %in% bigclam.comm.long$dtxsid)],
            file = "./output/results/SNAP/bigclam/stronglink2/membership-lists/unassigned_nodes.txt",
              col.names = F,
              row.names = F,
              quote = F
            )

## BIGCLAM
bigclam_comm_net <- graph_from_edgelist(as.matrix(bigclam.comm.long), directed = F)
V(bigclam_comm_net)$type <- bipartite.mapping(bigclam_comm_net)$type
bigclam_comm_mat <- as.matrix(get.incidence(bigclam_comm_net))
bc_comm_mat <- as.data.frame(t(bigclam_comm_mat))
bc_comm_mat[,1:ncol(bc_comm_mat)] <- lapply(bc_comm_mat[,1:ncol(bc_comm_mat)], as.factor)
bc_comm_mat$dtxsid <- rownames(bc_comm_mat)
colnames(bc_comm_mat) <- trimws(colnames(bc_comm_mat))
```
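
The reshaping from one-community-per-line output to a long membership table can also be done without nested loops. This sketch assumes the layout that the chunk above reads (one community per line, tab-separated member IDs, lines of unequal length); the IDs and the temporary file are invented for illustration.

```r
# Hypothetical stand-in for a *.cmtyvv.txt file: one community per line,
# member IDs separated by tabs, lines of unequal length
raw <- c("DTXSID_a\tDTXSID_b\tDTXSID_c",
         "DTXSID_b\tDTXSID_d",
         "DTXSID_e")
path <- tempfile(fileext = ".txt")
writeLines(raw, path)

# Split each line on tabs instead of reading a padded rectangular table
members <- strsplit(readLines(path), "\t", fixed = TRUE)

# One row per (community, member) pair
comm_long <- data.frame(
  cluster = rep(seq_along(members), lengths(members)),
  dtxsid  = unlist(members),
  stringsAsFactors = FALSE
)
comm_long <- comm_long[comm_long$dtxsid != "", ]

# IDs listed in more than one community are the overlapping members
overlap <- names(which(table(comm_long$dtxsid) > 1))
print(comm_long)
print(overlap)
```

Building the table in one step avoids growing a data frame with `rbind()` inside a loop, which becomes slow as the number of communities and members increases.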

### Visualizing Overlap

To compare the degree of overlap between these communities, an UpSet plot shows how many chemicals are shared by each combination of communities. Because the number of possible combinations is large, the plot is limited to combinations of at least two communities that share at least `filt` chemicals.

```{r UpSet plot}
# create overlapping data ----
m1 <- as.data.frame(t(bigclam_comm_mat))
colnames(m1) <- trimws(colnames(bc_comm_mat[,which(!colnames(bc_comm_mat)%in%"dtxsid")]))
m1 <- make_comb_mat(m1, mode = "intersect")

# create UpSet plot ----
filt <- 1 # keep only combinations with at least this many chemicals in common

png(filename = "./output/figures/BIGCLAM_upset-plot.png", height = 500, width = 800)
UpSet(m1[comb_size(m1) >= filt &
           comb_degree(m1) >= 2],
      comb_order = order(comb_size(m1[comb_size(m1) >= filt &
                                        comb_degree(m1) >= 2])),
      top_annotation = upset_top_annotation(m1[comb_size(m1) >= filt &
                                                 comb_degree(m1) >= 2],
                                            add_numbers = TRUE),
      right_annotation = upset_right_annotation(m1[comb_size(m1) >= filt &
                                                     comb_degree(m1) >= 2],
                                                add_numbers = TRUE)
      )
dev.off()
rm(m1)
```
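
The pairwise overlap counts behind such a plot can be computed directly with a matrix product. The sketch below uses an invented membership matrix and covers only combinations of exactly two communities; to our understanding these counts correspond to the sizes reported for degree-2 combinations in "intersect" mode, but that correspondence is an assumption rather than something verified here.

```r
# Hypothetical 0/1 membership matrix: rows are chemicals, columns are communities
m <- matrix(c(1, 1, 0,
              1, 1, 1,
              0, 1, 1,
              1, 0, 0,
              0, 1, 0),
            ncol = 3, byrow = TRUE,
            dimnames = list(paste0("chem", 1:5), c("C1", "C2", "C3")))

# crossprod(m) = t(m) %*% m: the diagonal holds community sizes and the
# off-diagonal entries hold the number of chemicals shared by each pair
shared <- crossprod(m)

# List each pair once, keeping only pairs that share at least `filt` chemicals
filt <- 1
pairs <- which(upper.tri(shared) & shared >= filt, arr.ind = TRUE)
pair_sizes <- data.frame(
  a = rownames(shared)[pairs[, "row"]],
  b = colnames(shared)[pairs[, "col"]],
  n_shared = shared[pairs]
)
print(pair_sizes[order(pair_sizes$n_shared), ])
```

Raising `filt` trims the table to the larger overlaps in the same way that the size filter trims the plot.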

## Sankey Diagram of Communities Between Algorithms

A Sankey diagram is used here to show how chemicals are shared between the groups produced by the Walktrap and BIGCLAM algorithms. Each connection indicates the number of chemicals shared by a BIGCLAM community and a Walktrap community. Note that the `sankey diagram` chunk below manually specifies a "*group*" variable for each node.

```{r sankey diagram}
# Sankey Data between Comms
c <- data.frame("source" = NA,
                "target" = NA,
                "value" = NA)
for(i in 1:max(bigclam.comm.long$cluster)){
  for(j in 1:max(membership(wc))){
    d <- bc_comm_mat[which(bc_comm_mat[,i] == 1),]$dtxsid[which(bc_comm_mat[which(bc_comm_mat[,i] == 1),]$dtxsid %in% wt_comm_mat[which(wt_comm_mat[,j] == 1),]$dtxsid)]
    c <- rbind(
      c,
      data.frame(
        "source" = i,
        "target" = j,
        "value" = length(d) # number of shared chemicals; 0 when there is no overlap
          )
      )
  }
}
sankey.data <- na.omit(c[which(c$value > 0),])
sankey.data$source <- paste0("BC.", sankey.data$source)
sankey.data$target <- paste0("WT.", sankey.data$target)

# A connection data frame is a list of flows with intensity for each flow
links <- sankey.data

# From these flows we need to create a node data frame that lists every entity involved in a flow
nodes <- data.frame(
  name=c(as.character(links$source),
         as.character(links$target)) %>% unique()
)

# Manually adding in hyper-community group IDs
nodes$group <- as.factor(c("Pharmaceutical",
                           "Pharmaceutical",
                           "Persistent",
                           "Pharmaceutical",
                           "Persistent",
                           "Persistent",
                           "Pesticides",
                           "Consumer",
                           "Pesticides",
                           "Consumer",
                           "Consumer",
                           "Consumer",
                           "Persistent",
                           "Pharmaceutical",

                           "Pharmaceutical",
                           "Pharmaceutical",
                           "Consumer",
                           "Pesticides",
                           "Consumer",
                           "Pharmaceutical",
                           "Pesticides",
                           "Other",
                           "Other",
                           "Other",
                           "Pharmaceutical",
                           "Other",
                           "Other",
                           "Consumer",
                           "Consumer",
                           "Other",
                           "Other",
                           "Persistent",
                           "Persistent",
                           "Persistent",
                           "Other",
                           "Other",
                           "Persistent",
                           "Other",
                           "Pesticides",
                           "Other",
                           "Pesticides",
                           "Persistent",
                           "Other",
                           "Other",
                           "Other"
                           ))

my_color <- 'd3.scaleOrdinal() .domain(["Pharmaceutical", "Persistent", "Consumer", "Pesticides", "Other"]) .range(["#F4B084", "#9BC2E6", "#FFFF00", "#A9D08E", "grey"])'

# With networkD3, connections must be provided using zero-based ids rather than the names in the links data frame, so we reformat them.
links$IDsource <- match(links$source, nodes$name)-1
links$IDtarget <- match(links$target, nodes$name)-1

# Make the Network
p <- sankeyNetwork(Links = links,
                   Nodes = nodes,
                   Source = "IDsource",