---
title: "Creating Analytic Sample Dataframes from IES DAACS Mathematics Assessment (174 items), May 2022 - May 2023, umgc1-and-ua2 Combined Sample, n = 4460"
author: "Oxana Rosca"
date: "`r Sys.Date()`"
output:
  word_document:
    toc: true
    toc_depth: 6
    reference_docx: "C:/Users/orosc/OneDrive - University at Albany - SUNY/My DAACS/WordDocMarkdownTemplate.docx"
  html_document:
    toc: true
    toc_depth: 6
    theme: readable
---

# Purpose: Data Organization
The purpose of this document is to organize the data from the DAACS math assessment (174 items) collected between May 2022 and May 2023 at two colleges: UMGC1 and UA2. The data were cleaned and organized into respondent-level and item-level dataframes:

  1. Item-level data collected by DAACS (Analytic Sample 1, AnSamp1).
  2. Institution-provided respondent data: demographic information (Analytic Sample 2, AnSamp2).

The two colleges' dataframes were combined into a single dataframe for further analysis.

# Assessment:  DAACS Mathematics
The DAACS Math Assessment was designed to measure students' math skills and knowledge, including algebra, geometry, and statistics. The assessment draws on a pool of 174 math items (k = 174). Each student completed either 18 items (k_admin_min = 18) or 24 items (k_admin_max = 24) during their first attempt. For the 2022–2023 data collection, we implemented an adaptive multistage testing design, with item difficulty levels assigned by state standards and expert evaluations. The assessment was administered online, and students were allowed to use a calculator and other resources. The assessment was untimed, but the time taken by each respondent was recorded.

# Participants
A total of 5447 respondents completed at least one DAACS assessment: 4152 (76.2%) from UMGC1 and 1295 from UA2. All respondents were newly registered undergraduate students: each had both a DAACS-assigned ID (DAACS_ID) and an institution-assigned ID.
For the math assessment specifically, 4621 students participated: 3869 (83.7%) from UMGC1 and 752 from UA2.

## Analytic Samples
Two analytic samples were created for the DAACS math assessment:

### Analytic Sample 1 (AnSamp1), n = 4460
  Purpose: For IRT analyses.
  Composition: Includes first-attempt scores from 3713 (83.3%) UMGC1 and 747 (16.7%) UA2 non-speedy respondents.
  Data: Collected between May 2022 and May 2023, including all non-speedy respondents' math scores from their first attempts.
The dataset "math.itemsONLY_AnSamp1" includes 174 items' scores (Q001–Q174) but excludes student IDs and other variables.
A detailed dataframe "math.items_AnSamp1" includes 194 columns: 174 item scores, 2 ID variables, and 18 personal variables, such as math total scores and demographic variables (e.g., gender, age, and college [UMGC1 or UA2]).

### Would-be-sample 2022 (sampW22), n = 2614, Numgc1 = 1887, Nua2 = 727
Purpose: For missing-values analyses.
sampW22 is a subset of Analytic Sample 1: the first-attempt scores from all non-speedy treatment students at UMGC1 and UA2 who enrolled between August 2022 and May 2023 and completed the DAACS math assessment within the first month of their first semester, regardless of whether the college provided personal data for them. These students would form Analytic Sample 2 of this study if all the year-2022 students had personal information provided by the colleges.
  Data: The dataset "math.items_sampW22" includes all eligible students' data.

### Sample 2022 with Demographics (samp22D), n = 1606; 907 (56.5%) UMGC1 and 699 UA2
  Composition: A subset of AnSamp1.
  Criteria: Students took the math assessment in 2022, during the first month of their first semester, and had at least one demographic data point ("Age," "gender," "Military," "Pell," or "transfer").
  Data: The dataset "math.items_samp22D" includes only eligible students' data.

### Analytic Sample 2 (AnSamp2), n = 1603; 907 (56.6%) UMGC1 and 696 UA2 students
  Purpose: For DIF and age-group comparisons.
  Composition: A subset of Sample 22D (samp22D), including first-attempt scores from all non-speedy treatment students who took the math assessment in 2022, during their first semester, and had non-missing values on all five demographic variables: age, gender, SES, transfer, and military.
  Data: The dataset "math.items_AnSamp2" represents this sample.
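
The college shares quoted in the sample descriptions can be recomputed from the reported counts. The sketch below is only an arithmetic check on the numbers stated in this section; the object name `sample_counts_demo` is illustrative and is not used elsewhere in the analysis.

```r
# Counts as reported in the sample descriptions (UMGC1 and UA2 per sample)
sample_counts_demo <- data.frame(
  sample = c("AnSamp1", "sampW22", "samp22D", "AnSamp2"),
  umgc1  = c(3713, 1887, 907, 907),
  ua2    = c(747, 727, 699, 696),
  stringsAsFactors = FALSE
)

# Total n and the percentage share of each college, rounded to one decimal
sample_counts_demo$n <- sample_counts_demo$umgc1 + sample_counts_demo$ua2
sample_counts_demo$pct_umgc1 <-
  round(100 * sample_counts_demo$umgc1 / sample_counts_demo$n, 1)
sample_counts_demo$pct_ua2 <-
  round(100 * sample_counts_demo$ua2 / sample_counts_demo$n, 1)

sample_counts_demo
```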

# R-packages
```{r}
library(data.table)
#library(plyr)
library(dplyr)
library(flextable)
library(ggplot2)
library(gridExtra)
library(knitr)
library(officer)
library(psych)
library(summarytools)
library(tidyverse)
library(janitor)
```

# Data
Note: Speedy respondents (those who took ≤ 180 seconds for 3 six-item sets out of the 30-set pool) were included in the original data.
```{r}
## load the umgc matched (m) clean data
umgc_m_clean <- new.env()
## load the anonymized (a) UAlbany (ua) clean data
ua_a_clean<-new.env()
#load("D:/Dropbox/DAACS-Validity/Analyses/dataPrep/dataClean-UMGC1_matched_alor.RData",envir = umgc_m_clean)
#load("D:/Dropbox/DAACS-Validity/Analyses/dataPrep/dataClean-UA2_anonymized_alor.RData",envir = ua_a_clean)
#load("E:/OneDrive - University at Albany - SUNY/My DAACS/dataClean-UMGC1_matched_alor.RData",envir = umgc_m_clean)
#load("E:/OneDrive - University at Albany - SUNY/My DAACS/dataClean-UA2_anonymized_alor.RData",envir = ua_a_clean)
load("C:/Users/orosc/OneDrive - University at Albany - SUNY/My DAACS/dataClean-UMGC1_matched_alor.RData",envir = umgc_m_clean)
load("C:/Users/orosc/OneDrive - University at Albany - SUNY/My DAACS/dataClean-UA2_anonymized_alor.RData",envir = ua_a_clean)
```

## UMGC1: Aug 2022 - May 2023 Data-collection (n = 3869). Two IDs per student:
DAACS-assigned ID (DAACS_ID) and institution ID (unique_id)

### Data Collected by DAACS

#### 4152 students took at least one DAACS Assessment
```{r}
daacs_umgc1 <- umgc_m_clean$daacs_clean
dim(daacs_umgc1) # 4152
```

####  3869 students completed DAACS Math assessment in Aug 2022 - May 2023
```{r}
daacs_math_1stAtt_umgc1 <- umgc_m_clean$daacs_clean [
    !is.na (umgc_m_clean$daacs_clean$math_attempt),]# 3869
dim(daacs_math_1stAtt_umgc1) # number of the rows and columns
min(daacs_math_1stAtt_umgc1$mathCompletionDate, na.rm = TRUE)
max(daacs_math_1stAtt_umgc1$mathCompletionDate, na.rm = TRUE)
```

#### mathTotal scores Frequency Table from all 3869 umgc1 students in 2022-23
```{r}
#library(knitr)
# Create a frequency table and convert it to a data frame
table_tmp <- as.data.frame(table(daacs_math_1stAtt_umgc1$mathTotal, useNA = 'always'))
# Rename the columns
colnames(table_tmp) <- c("mathTotal_Scores", "Frequency")
# Keep the scores as character so that a "Total" label can be added later
table_tmp$mathTotal_Scores <- as.character(table_tmp$mathTotal_Scores)
# Parse the labels as numbers (the NA category stays NA)
numeric_scores_tmp <- suppressWarnings(as.numeric(table_tmp$mathTotal_Scores))
# Round the numeric labels to 2 decimal places; subset both sides so the lengths match
table_tmp$mathTotal_Scores[!is.na(numeric_scores_tmp)] <-
  round(numeric_scores_tmp[!is.na(numeric_scores_tmp)], 2)
# Add a total row; Add spaces for right-alignment
total_row_tmp <- data.frame(mathTotal_Scores = sprintf("%20s", "Total"),
Frequency = sum(table_tmp$Frequency))
table_tmp <- rbind(table_tmp, total_row_tmp)
# Transpose the table
table_tmp <- t(table_tmp)
# Convert the transposed table to a data frame for better rendering
table_tmp <- as.data.frame(table_tmp)
# Rename the columns for better readability
colnames(table_tmp) <- table_tmp[1, ]  # Use the first row as column names
table_tmp <- table_tmp[-1, ]  # Remove the first row
# Format the transposed table with knitr::kable for display
mathTotal_umgc1_3869wsp_tb<-kable(
table_tmp,
caption =
"mathTotal scores among all 2022-23 UMGC students (including the speedy respondents)"
)
mathTotal_umgc1_3869wsp_tb
```
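
The same table-building steps are repeated for each college, so they could be wrapped in a small helper. The following is a minimal sketch under stated assumptions: `freq_table_wide_demo()` is a hypothetical function that is not part of the original analysis, and it rounds the scores before tabulating (so values that agree to two decimals share a column), which differs slightly from rounding the labels afterwards. The toy scores are made up.

```r
# Hypothetical helper: one-row, wide frequency table of proportion scores,
# with a column for missing values and a "Total" column
freq_table_wide_demo <- function(scores, digits = 2) {
  counts <- table(round(scores, digits), useNA = "always")
  labels <- names(counts)
  labels[is.na(labels)] <- "NA"          # label the missing-value category
  out <- as.data.frame(t(c(as.vector(counts), sum(counts))))
  colnames(out) <- c(labels, "Total")
  rownames(out) <- "Frequency"
  out
}

# Toy scores (not real DAACS data)
scores_demo <- c(0.5, 0.5, 0.75, 1, NA)
freq_demo <- freq_table_wide_demo(scores_demo)
freq_demo
```

The result can be passed to `knitr::kable()` with a caption, as in the chunk above.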

#### Mapping file: DAACS_ID and institution_ID (n = 4152); All umgc1 students who took at least one DAACS Assessment
```{r}
mapping_daacs_umgc1<-umgc_m_clean$umgc.mapping
dim(mapping_daacs_umgc1) # 4152
```

#### Time taken for assessment: total min=27s, min for 100% score=951s, min for 70% score=501s
27 seconds is the minimum time taken for the Math assessment at UMGC1
```{r}
min(daacs_math_1stAtt_umgc1$mathTime, na.rm = T)
```

951 seconds is the shortest time for a UMGC1 student to get a perfect score
```{r}
min(daacs_math_1stAtt_umgc1[daacs_math_1stAtt_umgc1$mathTotal == 1.0,]$mathTime, na.rm = T)
```

501 seconds is the shortest time for a UMGC1 student to get a score of at least 70%
```{r}
min(daacs_math_1stAtt_umgc1[daacs_math_1stAtt_umgc1$mathTotal >= .7,]$mathTime, na.rm = T)
```

##### 83874 items were responded to by 3869 students in Aug2022-May2023
```{r}
math.items_long_umgc1 <-
  umgc_m_clean$math.items_clean[
    umgc_m_clean$math.items_clean$unique_id %in%
      daacs_math_1stAtt_umgc1$unique_id,] # 83874
dim(math.items_long_umgc1)

length(unique(daacs_math_1stAtt_umgc1$unique_id))

sum(is.na(math.items_long_umgc1$score))
```
#### Recode all Q157 into Q016 to fix a machine error
We had to fix a machine error: the item whose stem (or prompt) starts with
"Gisselle ..." was coded twice in the UMGC1 data, as Q016 and Q157.
```{r}
#library(maditr)
#library(dplyr)
# Shorten the stems (strings) in "question" variable to 75 characters
math.items_long_umgc1$question <- substr(math.items_long_umgc1$question, 1, 75)
# Subset all existing combinations/patterns of items' qid and stems:
tmp <- math.items_long_umgc1 %>%
  dplyr::select(question, qid) %>%
  distinct(question, qid, .keep_all = TRUE) %>%
  arrange(question)
# Subset the non-unique values of $question
non_unique_questions_tmp <- tmp %>%
  group_by(question) %>%
  filter(n() > 1) %>%
  ungroup()
non_unique_questions_tmp
```

```{r}
# Recode all Q157 into Q016
math.items_long_umgc1 <- math.items_long_umgc1 %>%
  mutate(qid = if_else(qid == "Q157", "Q016", qid))
any(math.items_long_umgc1$qid == "Q157")
```
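
The detect-then-recode logic can be illustrated on a toy table. This is a base-R sketch with made-up stems and object names ending in `_demo`; it is not the DAACS item content, and it assumes (as above) that a stem appearing under more than one `qid` signals a coding error.

```r
# Toy long-format item data (invented stems)
items_demo <- data.frame(
  qid      = c("Q001", "Q016", "Q157", "Q016", "Q002"),
  question = c("Solve for x", "Gisselle buys", "Gisselle buys",
               "Gisselle buys", "Find the area"),
  stringsAsFactors = FALSE
)

# Distinct (stem, qid) pairs; a stem listed under more than one qid is suspect
pairs_demo <- unique(items_demo[, c("question", "qid")])
n_qids_demo <- table(pairs_demo$question)
dup_stems_demo <- names(n_qids_demo)[n_qids_demo > 1]
dup_stems_demo

# Recode the redundant code to the canonical one, then repeat the check
items_demo$qid[items_demo$qid == "Q157"] <- "Q016"
pairs_after_demo <- unique(items_demo[, c("question", "qid")])
still_dup_demo <- any(table(pairs_after_demo$question) > 1)
still_dup_demo
```

Rerunning the duplicate-stem check after the recode, as in the last two lines, confirms that every stem now maps to a single `qid`.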
##### 82326 items were responded to on the first attempt by 3869 students in Aug2022-May2023
```{r}
math.items1stAtt_long_umgc1<-
              math.items_long_umgc1[which(math.items_long_umgc1$attempt=="1"),]
dim(math.items1stAtt_long_umgc1)
```

No duplicated IDs (more than one attempt) in the long data. Since math.items is a long-form file with 18 or 24 rows per student (k_admin; k_admin_max = 24), we test for duplicate IDs by checking whether any student ID has more than 24 rows/items:

```{r}
# No DAACS_IDs with more than 24 rows
tmp <- math.items1stAtt_long_umgc1 %>%
  group_by(DAACS_ID) %>%
  dplyr::summarise(num_rows = n())
nrow(tmp %>%filter(num_rows > 24))
# No institution_ID's with more than 24 rows
tmp <- math.items1stAtt_long_umgc1 %>%
  group_by(unique_id) %>%
  dplyr::summarise(num_rows = n())
nrow(tmp %>%filter(num_rows > 24))
```
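
A toy version of the row-count check is sketched below in base R. The IDs and row counts are invented. The stricter variant at the end assumes that every first attempt contains exactly 18 or 24 items, as described in the assessment section; real data may contain other counts (for example, incomplete attempts), so treat it as an optional diagnostic.

```r
# Toy long data: "a" has 18 rows, "b" has 24, "c" has 42 (18 + 24, two attempts)
long_demo <- data.frame(
  DAACS_ID = rep(c("a", "b", "c"), times = c(18, 24, 42)),
  stringsAsFactors = FALSE
)

k_admin_max_demo <- 24
rows_per_id_demo <- table(long_demo$DAACS_ID)

# IDs with more rows than a single attempt can produce
flagged_demo <- names(rows_per_id_demo)[rows_per_id_demo > k_admin_max_demo]
flagged_demo

# Stricter variant: flag any ID whose row count is neither 18 nor 24
unexpected_demo <- names(rows_per_id_demo)[!rows_per_id_demo %in% c(18, 24)]
unexpected_demo
```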
No missing values on "attempt"; 82326 first-attempt values.

Frequency/propensity-table function for a single categorical variable.

Note: The plyr package interferes with dplyr when using Propensities_CatVar; hence, attach plyr only when it is needed and detach it after every single use.
```{r}
# check if the plyr package is loaded and then detach it if it is:
if ("package:plyr" %in% search()) {
  # Detach 'plyr' if it is loaded
  detach("package:plyr", unload = TRUE)
  message("Package 'plyr' was loaded and has been detached.")
} else {
  message("Package 'plyr' is not loaded.")
}

#library(dplyr)
# Define the function
Propensities_CatVar <- function(data, variable) {
  data %>%
    group_by(!!sym(variable)) %>% # Group by the variable
    dplyr::summarize(Count = n(), .groups = "drop") %>% # Use dplyr's summarize
    mutate(
      Percent = round(Count / sum(Count, na.rm = TRUE) * 100, 2) # Compute percentages
    ) %>%
    rename(!!variable := !!sym(variable)) # Rename the column to match the variable name
}
# Example usage
mathUMGC1_Attempt_allResps_tb <- Propensities_CatVar(math.items_long_umgc1, "attempt")
print(mathUMGC1_Attempt_allResps_tb)
```

#### Wide form item-level df (Nresponses = 82326, n = 3869)
Restructure the df into a wide form
```{r}
# library (reshape2)
math.items1stAtt_wide_umgc1_wIDs <- reshape2::dcast(math.items1stAtt_long_umgc1,
                          unique_id + DAACS_ID ~ qid, value.var = 'score')
dim(math.items1stAtt_wide_umgc1_wIDs) # 3869  176: k = 174 items, plus two ID variables

# Good quality of the new df:

# No empty rows
nrow(math.items1stAtt_wide_umgc1_wIDs)-
  (nrow(math.items1stAtt_wide_umgc1_wIDs[
    rowSums(is.na(math.items1stAtt_wide_umgc1_wIDs)) !=
      ncol(math.items1stAtt_wide_umgc1_wIDs), ]))

# No missing values on IDs
nrow(math.items1stAtt_wide_umgc1_wIDs[is.na(math.items1stAtt_wide_umgc1_wIDs$DAACS_ID),])
nrow(math.items1stAtt_wide_umgc1_wIDs[is.na(math.items1stAtt_wide_umgc1_wIDs$unique_id),])
# No duplicate cases on IDs
table(duplicated(math.items1stAtt_wide_umgc1_wIDs$DAACS_ID),useNA = 'always')
table(duplicated(math.items1stAtt_wide_umgc1_wIDs$unique_id),useNA = 'always')
```
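
To make the long-to-wide step concrete, here is a small sketch on toy data. It uses base `reshape()` instead of `reshape2::dcast()` so that it runs without extra packages; the data and the `_demo` names are invented. Items that were not administered to a student become `NA` in the wide form, so the number of non-missing item cells should equal the number of rows in the long data.

```r
# Toy long data: two students, each administered two of three items
long_scores_demo <- data.frame(
  DAACS_ID = c(1, 1, 2, 2),
  qid      = c("Q001", "Q002", "Q002", "Q003"),
  score    = c(1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# One row per student, one column per item
wide_demo <- reshape(long_scores_demo, idvar = "DAACS_ID",
                     timevar = "qid", direction = "wide")
names(wide_demo) <- sub("^score\\.", "", names(wide_demo))
wide_demo

# Sanity check: every long row ends up in exactly one non-missing cell
n_filled_demo <- sum(!is.na(wide_demo[, -1]))
n_filled_demo == nrow(long_scores_demo)
```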
### Data Provided by UMGC;  student-level; Rolling Admissions Aug 2022 - Dec 2022
950 students with personal data provided by UMGC completed DAACS Math in Aug2022-May2023

```{r}
institution_math_umgc1 <- merge(
    umgc_m_clean$institution_clean,
    daacs_math_1stAtt_umgc1[, c("unique_id", "mathCompletionDate")],
    by = "unique_id"
)

dim(institution_math_umgc1) # 950
```
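
`merge()` with only `by` specified performs an inner join, which is why the merged dataframe keeps only students present in both sources. The toy example below (invented IDs and values) shows this, and also how to list students who are dropped. One general caveat: if a key value is repeated in either input, `merge()` returns one row per matching pair, so checking `anyDuplicated()` on the key beforehand can help explain unexpected row counts.

```r
# Toy institution and DAACS tables sharing the key "unique_id"
institution_demo <- data.frame(
  unique_id = c("001", "002", "003"),
  age       = c(25, 31, 40),
  stringsAsFactors = FALSE
)
math_demo <- data.frame(
  unique_id = c("002", "003", "004"),
  mathCompletionDate = as.Date(c("2022-09-01", "2022-10-15", "2023-01-20")),
  stringsAsFactors = FALSE
)

# Keys should be unique in both inputs (anyDuplicated() returns 0 when they are)
anyDuplicated(institution_demo$unique_id)
anyDuplicated(math_demo$unique_id)

# Inner join: only IDs present in both tables survive
merged_demo <- merge(institution_demo, math_demo, by = "unique_id")
nrow(merged_demo)

# Math takers without institution-provided data
unmatched_demo <- setdiff(math_demo$unique_id, institution_demo$unique_id)
unmatched_demo
```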

#### Good quality of the new dataframe (n = 950)
```{r}
dim(institution_math_umgc1) # 950  28

# No empty rows
nrow(institution_math_umgc1)-
  (nrow(institution_math_umgc1[
    rowSums(is.na(institution_math_umgc1))!=
                        ncol(institution_math_umgc1),]))

# The institution_IDs were padded with zeroes
min(institution_math_umgc1$unique_id)

#The DAACS_IDs were adjusted by adding 10000
min(institution_math_umgc1$DAACS_ID)

# No missing values on IDs
nrow(institution_math_umgc1[
  is.na(institution_math_umgc1$DAACS_ID),])
nrow(institution_math_umgc1[
  is.na(institution_math_umgc1$unique_id),])
```

##### 950  students with personal data provided by UMGC enrolled in 2022 and completed the assessment during their first semester.
```{r}
institution_math_umgc1_2022<- subset(
institution_math_umgc1,
format(mathCompletionDate, "%Y") == "2022"
)
dim(institution_math_umgc1_2022) # 950
min(institution_math_umgc1_2022$mathCompletionDate)
max(institution_math_umgc1_2022$mathCompletionDate)
```
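
The year filter relies on `format(date, "%Y")`. A toy sketch (invented dates) shows its behavior at the year boundary and with missing dates: `subset()` drops rows where the condition is `NA`. The explicit date-range comparison is an equivalent alternative for `Date` columns; if the column were a date-time class, the time zone could affect which year a value falls in.

```r
# Toy completion dates, including a missing one
dates_demo <- data.frame(
  id = 1:4,
  mathCompletionDate = as.Date(c("2022-08-30", "2022-12-31", "2023-01-02", NA))
)

# Year-based filter, as used above
in_2022_demo <- subset(dates_demo, format(mathCompletionDate, "%Y") == "2022")
in_2022_demo$id

# Equivalent filter with an explicit date range
keep_demo <- which(dates_demo$mathCompletionDate >= as.Date("2022-01-01") &
                   dates_demo$mathCompletionDate <  as.Date("2023-01-01"))
in_2022_alt_demo <- dates_demo[keep_demo, ]
in_2022_alt_demo$id
```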

## UA2: May2022 - Apr2023 Data-collection (n = 752). Two IDs per student:
DAACS-assigned ID (DAACS_ID) and institution ID (fakeID)

###  Data Collected by DAACS

#### 1295 students took at least one DAACS Assessment
```{r}
daacs_ua2 <- ua_a_clean$daacs_ua2_clean
dim(daacs_ua2) # 1295
```

#### 752 students completed DAACS Math assessment in May2022 - Apr2023
```{r}
daacs_math_1stAtt_ua2 <-
  ua_a_clean$daacs_ua2_clean [
    !is.na (ua_a_clean$daacs_ua2_clean$math_attempt),] # 752
dim(daacs_math_1stAtt_ua2)
min(daacs_math_1stAtt_ua2$mathCompletionDate, na.rm = TRUE)
max(daacs_math_1stAtt_ua2$mathCompletionDate, na.rm = TRUE)
```

#### mathTotal scores Frequency Table from all 752 students in 2022-23
```{r}
# Create a frequency table and convert it to a data frame
table_tmp <- as.data.frame(table(daacs_math_1stAtt_ua2$mathTotal, useNA = 'always'))
# Rename the columns
colnames(table_tmp) <- c("mathTotal_Scores", "Frequency")
# Keep the scores as character so that a "Total" label can be added later
table_tmp$mathTotal_Scores <- as.character(table_tmp$mathTotal_Scores)
# Parse the labels as numbers (the NA category stays NA)
numeric_scores_tmp <- suppressWarnings(as.numeric(table_tmp$mathTotal_Scores))
# Round the numeric labels to 2 decimal places; subset both sides so the lengths match
table_tmp$mathTotal_Scores[!is.na(numeric_scores_tmp)] <-
  round(numeric_scores_tmp[!is.na(numeric_scores_tmp)], 2)
# Add a total row; Add spaces for right-alignment
total_row_tmp <- data.frame(mathTotal_Scores = sprintf("%20s", "Total"),
Frequency = sum(table_tmp$Frequency))
table_tmp <- rbind(table_tmp, total_row_tmp)
# Transpose the table
table_tmp <- t(table_tmp)
# Convert the transposed table to a data frame for better rendering
table_tmp <- as.data.frame(table_tmp)
# Rename the columns for better readability
colnames(table_tmp) <- table_tmp[1, ]  # Use the first row as column names
table_tmp <- table_tmp[-1, ]  # Remove the first row
# Format the transposed table with knitr::kable for display
mathTotal_ua2_752wsp_tb<-kable(
table_tmp,
caption =
"mathTotal scores among all 2022-23 UAlbany students (including the speedy respondents)"
)
mathTotal_ua2_752wsp_tb
```

#### Mapping file: DAACS_ID and institution_ID (n = 1295); All ua2 students who took at least one DAACS Assessment
```{r}
mapping_daacs_ua2<-ua_a_clean$ua.mapping
dim(mapping_daacs_ua2) # 1295 2
```

#### Time taken for assessment (n = 752): total min=72s, min for 100% score=708s, min for 70% score=532s
72 seconds is the minimum time taken for the Math assessment at UA2
```{r}
min(daacs_math_1stAtt_ua2$mathTime, na.rm = T)
```

708 seconds is the shortest time for a UA2 student to get a perfect score
```{r}
min(daacs_math_1stAtt_ua2[daacs_math_1stAtt_ua2$mathTotal == 1.0,]$mathTime, na.rm = T)
```

532 seconds is the shortest time for a UA2 student to get a score of at least 70%
```{r}
min(daacs_math_1stAtt_ua2[daacs_math_1stAtt_ua2$mathTotal >= .7,]$mathTime, na.rm = T)
```

##### 16134 items were responded to by 752 UA2 students in May2022-Apr2023
```{r}
math.items_long_ua2 <-
  ua_a_clean$math.items_ua2_clean[
    ua_a_clean$math.items_ua2_clean$fakeID %in%
      daacs_math_1stAtt_ua2$fakeID,]

dim(math.items_long_ua2) # 16134
length(unique(daacs_math_1stAtt_ua2$fakeID))
sum(is.na(math.items_long_ua2$score))
```

```{r}
# Shorten the stems (strings) in "question" variable to 75 characters
math.items_long_ua2$question <- substr(math.items_long_ua2$question, 1, 75)
```

##### 15786 items were responded to on the first attempt by 752 students in May2022-Apr2023
```{r}
math.items1stAtt_long_ua2<-
  math.items_long_ua2[which(math.items_long_ua2$attempt=="1"),]
dim(math.items1stAtt_long_ua2)
```

No Duplicated IDs (more than 1 attempt) in UA2 long data
```{r}
# No DAACS_IDs with more than 24 rows
tmp <- math.items1stAtt_long_ua2 %>%
  group_by(DAACS_ID) %>%
  dplyr::summarise(num_rows = n())
nrow(tmp %>%filter(num_rows > 24))
# No institution_ID's with more than 24 rows
tmp <- math.items1stAtt_long_ua2 %>%
  group_by(fakeID) %>%
  dplyr::summarise(num_rows = n())
nrow(tmp %>%filter(num_rows > 24))
```

Frequency table of "attempt" across all UA2 responses:

```{r}
# No  missing values on "attempt", 15786 first-attempt values
mathUA2_Attempt_allResps_tb <- Propensities_CatVar(math.items_long_ua2, "attempt")
print(mathUA2_Attempt_allResps_tb)
```

No machine errors in coding QIDs
```{r}
# Subset all existing combinations/patterns of items' qid and stems:
tmp <- math.items_long_ua2 %>%
  dplyr::select(question, qid) %>%
  distinct(question, qid, .keep_all = TRUE) %>%
  arrange(question)
# Subset the non-unique values of $question
non_unique_questions_tmp <- tmp %>%
  group_by(question) %>%
  filter(n() > 1) %>%
  ungroup()
nrow(non_unique_questions_tmp)
```

#### Wide form item-level df (N_responses = 15786, n = 752)
Restructure the df into a wide form
```{r}
math.items1stAtt_wide_ua2_wIDs <- reshape2::dcast(math.items1stAtt_long_ua2,
                        fakeID + DAACS_ID ~ qid, value.var = 'score')
dim(math.items1stAtt_wide_ua2_wIDs) # 752  176: k = 174 items plus two ID variables

# Good quality of the new df:

# No empty rows
nrow(math.items1stAtt_wide_ua2_wIDs)-
  (nrow(math.items1stAtt_wide_ua2_wIDs[
    rowSums(is.na(math.items1stAtt_wide_ua2_wIDs))!=
                                  ncol(math.items1stAtt_wide_ua2_wIDs),]))
# institution_ID were padded with zeroes
min(math.items1stAtt_wide_ua2_wIDs$fakeID)
# DAACS_IDs were adjusted by adding 10000
min(math.items1stAtt_wide_ua2_wIDs$DAACS_ID)
# No missing values on IDs
nrow(math.items1stAtt_wide_ua2_wIDs[is.na(math.items1stAtt_wide_ua2_wIDs$DAACS_ID),])
nrow(math.items1stAtt_wide_ua2_wIDs[is.na(math.items1stAtt_wide_ua2_wIDs$fakeID),])
#No duplicate cases on IDs
table(duplicated(math.items1stAtt_wide_ua2_wIDs$DAACS_ID),useNA = 'always')
table(duplicated(math.items1stAtt_wide_ua2_wIDs$fakeID),useNA = 'always')
```

###  Data Provided by UAlbany; student-level;  Regular Admissions May 2022 - Dec 2022
752  students with personal data provided by UAlbany completed DAACS Math in May2022-Apr2023
```{r}
institution_math_ua2  <- merge(
    ua_a_clean$institution_ua2_clean,
    daacs_math_1stAtt_ua2[, c("fakeID", "mathCompletionDate")],
    by = "fakeID")
dim(institution_math_ua2) # 1081
```

#### Good quality of the new dataframe (n = 752)
```{r}
dim(institution_math_ua2) # 752  35

# No empty rows
nrow(institution_math_ua2)-
  (nrow(institution_math_ua2[
    rowSums(is.na(institution_math_ua2))!=
                        ncol(institution_math_ua2),]))

# The institution_IDs were padded with zeroes
min(institution_math_ua2$fakeID)

#The DAACS_IDs were adjusted by adding 10000
min(institution_math_ua2$DAACS_ID)

# No missing values on IDs
nrow(institution_math_ua2[
  is.na(institution_math_ua2$DAACS_ID),])
nrow(institution_math_ua2[
  is.na(institution_math_ua2$fakeID),])
```

#### 740 students with personal data provided by UAlbany enrolled in 2022 and completed the assessment during their first semester.
```{r}
institution_math_ua2_2022<- subset(
institution_math_ua2,
format(mathCompletionDate, "%Y") == "2022"
)
dim(institution_math_ua2_2022) # 740
min(institution_math_ua2_2022$mathCompletionDate)
max(institution_math_ua2_2022$mathCompletionDate)
```

## Combined Data from UMGC1 and UA2: AnSamp1 (n = 4460; Numgc = 3713, Nua = 747)

Add a new variable of institution (i.e., college) before merging the files
```{r}
daacs_umgc1$college <- 'UMGC1'
daacs_ua2$college <- "UA2"
```

Combine the two DAACS student-level dfs (excluding the university-provided IDs), including the speedy respondents; n = 5447, Numgc = 4152 (76.2%), Nua = 1295
```{r}
daacs_umgc1ua2<- merge(daacs_umgc1, daacs_ua2, all = TRUE)
dim(daacs_umgc1ua2)# 5447
```

### Uniform students' IDs and "college" variable
Rename the institution-assigned ID variables
```{r}
## Rename the institution-assigned ID variables to a new, common name
names(daacs_math_1stAtt_umgc1)[
          names(daacs_math_1stAtt_umgc1)=="unique_id"]<-
                                        "institution_ID"
names(math.items1stAtt_wide_umgc1_wIDs)[
          names(math.items1stAtt_wide_umgc1_wIDs)=="unique_id"]<-
                                        "institution_ID"
names(institution_math_umgc1_2022)[
          names(institution_math_umgc1_2022)=="unique_id"]<-
                                        "institution_ID"
names(daacs_math_1stAtt_ua2)[
          names(daacs_math_1stAtt_ua2)=="fakeID"]<-
                                        "institution_ID"
names(math.items1stAtt_wide_ua2_wIDs)[
          names(math.items1stAtt_wide_ua2_wIDs)=="fakeID"]<-
                                        "institution_ID"
names(institution_math_ua2_2022)[
          names(institution_math_ua2_2022)=="fakeID"]<-
                                       "institution_ID"

# All IDs are unique: We don't want two students from different colleges to share a single ID in the combined data. There are no common institution_ID values
tmp <-
  intersect(daacs_math_1stAtt_umgc1$institution_ID,
                  daacs_math_1stAtt_ua2$institution_ID)
length(tmp)

tmp <-
  intersect(math.items1stAtt_wide_umgc1_wIDs$institution_ID,
                  math.items1stAtt_wide_ua2_wIDs$institution_ID)
length(tmp)

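# Use the 2022 subsets here: only they had their ID column renamed to
# institution_ID above, so the check is not run on a missing (NULL) column
tmp <-
  intersect(institution_math_umgc1_2022$institution_ID,
                  institution_math_ua2_2022$institution_ID)
length(tmp)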


# No common DAACS_ID values
tmp <-
  intersect(daacs_math_1stAtt_umgc1$DAACS_ID,
                        daacs_math_1stAtt_ua2$DAACS_ID)
length(tmp)

tmp <-
  intersect(math.items1stAtt_wide_umgc1_wIDs$DAACS_ID,
                        math.items1stAtt_wide_ua2_wIDs$DAACS_ID)
length(tmp)

tmp <-
  intersect(institution_math_umgc1$DAACS_ID,
                        institution_math_ua2$DAACS_ID)
length(tmp)
```
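
One pitfall with overlap checks of this kind is that `$` on a column that does not exist returns `NULL`, and `intersect()` with `NULL` has length 0, so the check passes without comparing anything. The sketch below uses toy data; `shared_ids_demo()` is a hypothetical guard function, not part of the original analysis, that stops when the ID column is missing from either dataframe.

```r
# Toy ID tables; the second one has not been renamed yet
ids_a_demo <- data.frame(institution_ID = c("0001", "0002"),
                         DAACS_ID = c(10001, 10002),
                         stringsAsFactors = FALSE)
ids_b_demo <- data.frame(fakeID = c("0002", "0003"),
                         DAACS_ID = c(20001, 20002),
                         stringsAsFactors = FALSE)

# Vacuous check: ids_b_demo has no institution_ID column, so this compares with NULL
vacuous_demo <- length(intersect(ids_a_demo$institution_ID,
                                 ids_b_demo$institution_ID))
vacuous_demo

# Hypothetical guard: fail loudly if the ID column is absent
shared_ids_demo <- function(df1, df2, id_col) {
  stopifnot(id_col %in% names(df1), id_col %in% names(df2))
  intersect(df1[[id_col]], df2[[id_col]])
}

# After renaming, the real overlap becomes visible
names(ids_b_demo)[names(ids_b_demo) == "fakeID"] <- "institution_ID"
shared_demo <- shared_ids_demo(ids_a_demo, ids_b_demo, "institution_ID")
shared_demo
```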

Add a new variable of institution (i.e., college) before merging the files
```{r}
daacs_math_1stAtt_umgc1$college <- 'UMGC1'
math.items1stAtt_wide_umgc1_wIDs$college <- 'UMGC1'
institution_math_umgc1$college <- 'UMGC1'
daacs_math_1stAtt_ua2$college <- "UA2"
math.items1stAtt_wide_ua2_wIDs$college <- "UA2"
institution_math_ua2$college <- "UA2"
```

### DAACS student-level data (all Math students n = 4621; 3869 (83.7%) UMGC1 and 752 UA2)
#### Common Columns' Names and Class Match: The Structures are identical
This check is a formality: the DAACS-collected student-level data from the two
colleges were produced by a single "DAACS" application; hence the identical
variables, names, and categories.
```{r}
# Sort column names in alphabetical Order
daacs_math_1stAtt_umgc1<-
  daacs_math_1stAtt_umgc1[, order(colnames(daacs_math_1stAtt_umgc1))]
daacs_math_1stAtt_ua2<-
  daacs_math_1stAtt_ua2[, order(colnames(daacs_math_1stAtt_ua2))]

# Compare the columns: all variables match
# library(janitor)
tmp <- compare_df_cols(daacs_math_1stAtt_umgc1, daacs_math_1stAtt_ua2)
tmp
```
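
The reason for checking names and classes before stacking is that `rbind()` on dataframes matches columns by name and silently coerces mismatched classes (for example, integer and character become character). A base-R sketch of the same idea on toy data follows; it is an alternative illustration under that assumption, not a replacement for `janitor::compare_df_cols()`, and all `_demo` names are invented.

```r
# Toy dataframes with the same columns but one class mismatch
df_a_demo <- data.frame(mathTotal = c(0.5, 0.9),
                        college   = c("UMGC1", "UMGC1"),
                        attempts  = c(1L, 2L),
                        stringsAsFactors = FALSE)
df_b_demo <- data.frame(mathTotal = 0.7,
                        college   = "UA2",
                        attempts  = "1",
                        stringsAsFactors = FALSE)

# Same set of column names in both dataframes
same_names_demo <- setequal(names(df_a_demo), names(df_b_demo))

# First class of every column, compared in the same column order
class_a_demo <- vapply(df_a_demo, function(x) class(x)[1], character(1))
class_b_demo <- vapply(df_b_demo[names(df_a_demo)],
                       function(x) class(x)[1], character(1))
mismatch_demo <- names(class_a_demo)[class_a_demo != class_b_demo]
mismatch_demo
```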

##### Variables' Unique Values Match
```{r}
# Function to check unique values
compair_unique_values_in_columns <- function(df1_tmp, df2_tmp) {
  common_cols <- intersect(names(df1_tmp), names(df2_tmp))

  unique_values <- do.call(rbind, lapply(common_cols, function(col) {
    data.frame(
      Column = col,
      Unique_df1 = paste(unique(df1_tmp[[col]]), collapse = ", "),
      Unique_df2 = paste(unique(df2_tmp[[col]]), collapse = ", ")
    )
  }))

  unique_values
}
# Exclude columns with all values unique
exclude_cols_tmp <- c('DAACS_ID', 'institution_ID',
    'mathCompletionDate', 'mathStartDate', 'mathTime',
    'srlCompletionDate', 'srlStartDate', 'srlTime', 'srlTotal',
    'writeCompletionDate', 'writeStartDate', 'writeTime')
df1_tmp<-daacs_math_1stAtt_umgc1[
    , !names(daacs_math_1stAtt_umgc1) %in% exclude_cols_tmp]
df2_tmp<-daacs_math_1stAtt_ua2[
    , !names(daacs_math_1stAtt_ua2) %in% exclude_cols_tmp]

# unique values match
unique_values_tmp <- compair_unique_values_in_columns(df1_tmp, df2_tmp)
print(unique_values_tmp)
```

#### ALL students, n = 4621; 3869 UMGC1 and 752 UA2, Combined dataset, speedy respondents included (math.items_umgc1ua2)
```{r}
daacs_math_1stAtt_umgc1ua2<-rbind(daacs_math_1stAtt_umgc1,daacs_math_1stAtt_ua2) # 4621
dim(daacs_math_1stAtt_umgc1ua2)
# Rename the df with the information for all students who took DAACS math in May 2022-May 2023
math.items_umgc1ua2 <- daacs_math_1stAtt_umgc1ua2
math_College_allResps_tb <- Propensities_CatVar(math.items_umgc1ua2, "college")
print(math_College_allResps_tb)
```
##### Remove 161 speedy respondents (156 UMGC1 and 5 UA2 students), who took ≤ 180 seconds for 18 or 24 items (k_admin)
```{r}
# nsp = non-speedy respondents
daacs_math_nsp_umgc1ua2 <-
  filter(math.items_umgc1ua2, mathTime > 180) # 4460 (3713;747)
dim(daacs_math_nsp_umgc1ua2)

math_College_AnSamp1_tb <- Propensities_CatVar(daacs_math_nsp_umgc1ua2, "college")
print(math_College_AnSamp1_tb)
```
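
The behavior of the speed filter at the cutoff and with missing times can be seen on a toy example. This base-R sketch uses invented times; `which()` is used so that rows with a missing `mathTime` are dropped, which mirrors how `dplyr::filter()` treats `NA` conditions. A respondent at exactly 180 seconds is removed, since the rule keeps only times strictly greater than 180.

```r
# Toy timing data: below the cutoff, at the cutoff, above it, and missing
times_demo <- data.frame(
  DAACS_ID = 1:5,
  college  = c("UMGC1", "UMGC1", "UMGC1", "UA2", "UA2"),
  mathTime = c(27, 180, 181, 951, NA),
  stringsAsFactors = FALSE
)

cutoff_demo <- 180
nsp_demo <- times_demo[which(times_demo$mathTime > cutoff_demo), ]

# Respondents kept per college, and the number removed overall
kept_by_college_demo <- table(nsp_demo$college)
kept_by_college_demo
removed_demo <- nrow(times_demo) - nrow(nsp_demo)
removed_demo
```

Comparing the row counts before and after the filter, by college, gives the number of respondents removed at each institution.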

#### AnSamp1 (all non-speedy respondents)
n = 4460 (3713;747)
```{r}
dim(daacs_math_nsp_umgc1ua2)

# no empty rows
nrow(daacs_math_nsp_umgc1ua2)-
  (nrow(daacs_math_nsp_umgc1ua2[rowSums(is.na(daacs_math_nsp_umgc1ua2))!=
                                                ncol(daacs_math_nsp_umgc1ua2),]))
# the institution_ID's were padded with zeroes
min(daacs_math_nsp_umgc1ua2$institution_ID)

# no missing values on IDs
nrow(daacs_math_nsp_umgc1ua2[is.na(daacs_math_nsp_umgc1ua2$DAACS_ID),])
nrow(daacs_math_nsp_umgc1ua2[is.na(daacs_math_nsp_umgc1ua2$institution_ID),])

# no duplicated cases on IDs
table(duplicated(daacs_math_nsp_umgc1ua2$DAACS_ID),useNA = 'always')

# no missing values in mathTime
nrow(daacs_math_nsp_umgc1ua2[is.na(daacs_math_nsp_umgc1ua2$mathTime),])
```
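
The same checks recur for each new data frame, so they could be bundled into a small helper. The function below (`check_ids_demo`, a hypothetical name that is not used elsewhere in this project) is only a sketch, demonstrated on toy data that deliberately contains one empty row, one missing ID, and one duplicated ID.

```{r}
# Hypothetical helper: count empty rows, missing IDs, and duplicated IDs
check_ids_demo <- function(df, id_col) {
  c(
    empty_rows    = sum(rowSums(is.na(df)) == ncol(df)), # rows that are all NA
    missing_ids   = sum(is.na(df[[id_col]])),
    duplicate_ids = sum(duplicated(df[[id_col]]))
  )
}

# Toy data (not DAACS data) with one problem of each kind
toy_ids_demo <- data.frame(DAACS_ID = c(10001, 10002, 10002, NA),
                           mathTime = c(300, 450, 450, NA))
check_result_demo <- check_ids_demo(toy_ids_demo, "DAACS_ID")
check_result_demo
```

A clean data frame would return zero for all three counts.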

Insert a row index for the AnSamp1 (all non-speedy respondents)
```{r}
 daacs_math_nsp_umgc1ua2$row_indexAS1 <-
  seq_len(nrow(daacs_math_nsp_umgc1ua2))
```

##### Completion Date density plots
All non-speedy respondents (AnSamp1)
```{r}
# Calculate the number of students
students_umgc1_tmp <-
  nrow(daacs_math_nsp_umgc1ua2[daacs_math_nsp_umgc1ua2$college== "UMGC1",])
students_ua2_tmp <-
  nrow(daacs_math_nsp_umgc1ua2[daacs_math_nsp_umgc1ua2$college== "UA2",])

# Create labels for the legend
umgc_label_tmp <- paste0("UMGC students (n = ", students_umgc1_tmp, ")")
ualbany_label_tmp <- paste0("UAlbany students (n = ", students_ua2_tmp, ")")

# Create a combined data frame for plotting; the group label is derived from
# each row's college, so it does not depend on the row order
combined_data_tmp <- data.frame(
  mathCompletionDate = daacs_math_nsp_umgc1ua2$mathCompletionDate,
  group = factor(
    ifelse(daacs_math_nsp_umgc1ua2$college == "UMGC1",
           umgc_label_tmp, ualbany_label_tmp),
    levels = c(umgc_label_tmp, ualbany_label_tmp)) # Ensure correct factor level ordering
)
#library(ggplot2)
# Plot the density plots
ggplot(combined_data_tmp, aes(x = mathCompletionDate, color = group)) +
  geom_density(linewidth = 1.2) +
  scale_color_manual(
    values = setNames(c("red3", "purple"), c(umgc_label_tmp, ualbany_label_tmp))
  ) +
  labs(
    title =
      "Completion Dates for Non-speedy respondents on DAACS Math Assessment in 2022-23",
    x = "Completion Date",
    y = "Density",
    caption = "Sample: Analytic Sample 1, n = 4460",
    color = NULL
  ) +
  theme_minimal() +
    theme(
      legend.position = c(0.6, 0.85),
      legend.justification = c(0, 1),
      legend.text = element_text(size = 10),
      plot.title = element_text(size = 12, face = "bold"),
      plot.caption = element_text(size = 9, hjust = 0)
      )
# Save the plot as a PDF
ggsave(
  filename = "mathCompletionDate_AnSamp1_density.pdf", # File name
    plot = last_plot(),                       # Use the most recent plot
    device = "pdf",                           # Specify the device as PDF
    width = 8,                                # Width of the plot in inches
    height = 6,                               # Height of the plot in inches
    units = "in",                             # Units for width and height
    dpi = 300                                 # Resolution of the plot
)
```

### DAACS Item-level data: Match Items QIDs
This checkpoint is also a formality, but for a different reason: the DAACS application
assigned QIDs to items in the order in which they were first presented to the
students of a given college. Hence, Q001 in UMGC1 and Q001 in UAlbany could each
be the first item of any medium-difficulty set, so the same QID may refer to
different items in the two colleges. (Every student started with a medium set of
items; each subsequent set was chosen randomly from a pool of easy, medium, or
hard items, depending on the student's answers.)
That is why we had to create a mapping file that assigns uniform QIDs to each item
in both colleges.

```{r}
# Sort the dfs' column names in alphabetical order
math.items1stAtt_wide_umgc1_wIDs <-
  math.items1stAtt_wide_umgc1_wIDs[, order(colnames(math.items1stAtt_wide_umgc1_wIDs))]
math.items1stAtt_wide_ua2_wIDs <-
  math.items1stAtt_wide_ua2_wIDs[, order(colnames(math.items1stAtt_wide_ua2_wIDs))]

tmp <-
  compare_df_cols(math.items1stAtt_wide_umgc1_wIDs, math.items1stAtt_wide_ua2_wIDs)
tmp
```

#### Mapping file to match Items' QIDs
Because the UMGC item pool lacks qid = “Q157” (the item Q016/Q157, “Gisselle …”),
we used the UA2 math qids to create uniform math item qids for the two colleges (subsamples).
```{r}
# Subset the unique items' characteristics from the original data.
# The "question" variable holds the item stem (prompt), which is unique per item:
unique_math.items_umgc1<-
  math.items1stAtt_long_umgc1[!duplicated(math.items1stAtt_long_umgc1$question),
                                  c('qid', 'question', 'difficulty', 'domain')]
dim(unique_math.items_umgc1)

unique_math.items_ua2 <-
  math.items1stAtt_long_ua2[!duplicated(math.items1stAtt_long_ua2$question),
                                c('qid', 'question', 'difficulty', 'domain')]
dim(unique_math.items_ua2)

# Combine two sets of unique items into a single set with two columns of qid for
# two colleges.
mapping_unique_math.items_umgc1ua2<-
  merge(unique_math.items_ua2,unique_math.items_umgc1,
                                  by = c('question', 'difficulty', 'domain'),
                                  suffixes = c(".ua2",".umgc1"))
dim(mapping_unique_math.items_umgc1ua2)
```
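
Because `merge()` performs an inner join by default, an item that exists in only one pool drops out of the mapping without any message. The sketch below illustrates this on two hypothetical toy item pools (all `_demo` objects and item stems are made up for illustration) and lists the unmatched stems explicitly.

```{r}
# Hypothetical toy item pools: the same stems carry different qids per college
items_a_demo <- data.frame(
  qid        = c("Q001", "Q002", "Q003"),
  question   = c("Solve 2x = 6", "Simplify 4/8", "Find 10% of 50"),
  difficulty = c("MEDIUM", "EASY", "HARD"),
  domain     = c("algebra", "number", "number")
)
items_b_demo <- data.frame(
  qid        = c("Q001", "Q002"),
  question   = c("Simplify 4/8", "Solve 2x = 6"),
  difficulty = c("EASY", "MEDIUM"),
  domain     = c("number", "algebra")
)

# Inner join on the item characteristics: one row per shared item,
# with one qid column per college
mapping_demo <- merge(items_a_demo, items_b_demo,
                      by = c("question", "difficulty", "domain"),
                      suffixes = c(".a", ".b"))

# Stems that exist in pool "a" only and are therefore absent from the mapping
only_in_a_demo <- setdiff(items_a_demo$question, items_b_demo$question)
mapping_demo
only_in_a_demo
```

Running `setdiff()` in both directions on the real item pools would show which stems, if any, are unique to one college.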

Quality checks of the new dataframe (k = 174)
```{r}
dim(mapping_unique_math.items_umgc1ua2) # 174 5

# No empty rows
nrow(mapping_unique_math.items_umgc1ua2)-
  (nrow(mapping_unique_math.items_umgc1ua2[
    rowSums(is.na(mapping_unique_math.items_umgc1ua2))!=
                        ncol(mapping_unique_math.items_umgc1ua2),]))
```

#### Match the Item-Level wide dfs
Rename umgc1 QIDs (columns) via mapping file (k = 174 items)
```{r}
#library(data.table)
# Subset 174 qid variables
math.items1stAtt_wide_umgc1 <- math.items1stAtt_wide_umgc1_wIDs [, -c(1, 2, 3)]

# Change the umgc1 QIDs to ua2 ones
math.items1stAtt_wide_umgc1_wua2qid <- setnames(math.items1stAtt_wide_umgc1,
              as.character(mapping_unique_math.items_umgc1ua2$qid.umgc1),
                      as.character(mapping_unique_math.items_umgc1ua2$qid.ua2))
# Re-attach the renamed variables
math.items1stAtt_wide_umgc1_wIDs_wua2qid<-
  cbind(math.items1stAtt_wide_umgc1_wIDs[,1:3], math.items1stAtt_wide_umgc1_wua2qid)
# Now ua2 and umgc1 have uniform column names
```
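
For readers who prefer base R, the renaming step can also be expressed with `match()`. This is only an illustrative alternative on hypothetical toy objects (`wide_b_demo`, `map_demo`), not the approach used above; it looks up every current column name in the mapping and stops if any column has no mapped qid.

```{r}
# Hypothetical wide response data that uses college "b" qids as column names
wide_b_demo <- data.frame(Q001 = c(1, 0), Q002 = c(0, 1))

# Hypothetical mapping from college "b" qids to the reference (college "a") qids
map_demo <- data.frame(qid.b = c("Q002", "Q001"),
                       qid.a = c("Q001", "Q002"))

# Look up each current column name in the mapping and take the reference qid;
# match() keeps the pairing independent of the row order of the mapping
new_names_demo <- map_demo$qid.a[match(names(wide_b_demo), map_demo$qid.b)]
stopifnot(!anyNA(new_names_demo)) # every column must have a mapped qid
names(wide_b_demo) <- new_names_demo
wide_b_demo
```

The explicit `stopifnot()` guards against a column that is missing from the mapping file, which would otherwise receive an `NA` name.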

#### Item-Level wide df for a combined sample, n = 4621; 3869 UMGC1 and 752 UA2
Merge the Item-Level wide dfs into a single df for the combined sample
of the two colleges' students (including the speedy respondents)
```{r}
math.items1stAtt_wide_umgc1_wIDs_wua2qid <-
  math.items1stAtt_wide_umgc1_wIDs_wua2qid[, order(colnames(math.items1stAtt_wide_umgc1_wIDs_wua2qid))]

math.items1stAtt_wide_umgc1ua2_wPersonal <-
  rbind(math.items1stAtt_wide_umgc1_wIDs_wua2qid, math.items1stAtt_wide_ua2_wIDs)
dim(math.items1stAtt_wide_umgc1ua2_wPersonal)
```

##### Quality checks of the new dataframe (n = 4621 (3869 (83.7%) UMGC1 and 752 UA2))
```{r}
dim(math.items1stAtt_wide_umgc1ua2_wPersonal) # 4621  177

# No empty rows
nrow(math.items1stAtt_wide_umgc1ua2_wPersonal)-
  (nrow(math.items1stAtt_wide_umgc1ua2_wPersonal[rowSums(is.na(math.items1stAtt_wide_umgc1ua2_wPersonal))!=
                        ncol(math.items1stAtt_wide_umgc1ua2_wPersonal),]))

# The institution_IDs were padded with zeroes
min(math.items1stAtt_wide_umgc1ua2_wPersonal$institution_ID)

#The DAACS_IDs were adjusted by adding 10000
min(math.items1stAtt_wide_umgc1ua2_wPersonal$DAACS_ID)

# No missing values on IDs
nrow(math.items1stAtt_wide_umgc1ua2_wPersonal[
  is.na(math.items1stAtt_wide_umgc1ua2_wPersonal$DAACS_ID),])
nrow(math.items1stAtt_wide_umgc1ua2_wPersonal[
  is.na(math.items1stAtt_wide_umgc1ua2_wPersonal$institution_ID),])
```

### AnSamp1
All 1st-attempt items answered by all non-speedy respondents, n = 4460 (3713;747).
Remove the 161 speedy respondents (who took ≤ 180 seconds for the assessment) and add
the other DAACS variables of interest
```{r}
math.items1stAtt_wide_nsp_umgc1ua2_wPersonal<-
  merge(math.items1stAtt_wide_umgc1ua2_wPersonal,
        daacs_math_nsp_umgc1ua2[, c(1, 3:5, 13, 15, 16, 69)])
dim(math.items1stAtt_wide_nsp_umgc1ua2_wPersonal)
```
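
The merge above removes the speedy respondents implicitly: an inner join keeps only the rows whose key values occur in both data frames. The sketch below reproduces that effect on hypothetical toy data and, as an assumption for illustration, names the join key explicitly with `by = "DAACS_ID"` instead of relying on all shared column names.

```{r}
# Hypothetical toy data: item responses for four students
items_demo <- data.frame(DAACS_ID = c(10001, 10002, 10003, 10004),
                         Q001     = c(1, 0, 1, 1))
# Student-level data for the non-speedy respondents only
nsp_demo <- data.frame(DAACS_ID = c(10001, 10003, 10004),
                       mathTime = c(900, 450, 1200))

# Naming the key explicitly avoids accidental joins on other shared columns
merged_demo <- merge(items_demo, nsp_demo, by = "DAACS_ID")

# IDs dropped by the inner join (the speedy respondent in this toy example)
dropped_ids_demo <- setdiff(items_demo$DAACS_ID, nsp_demo$DAACS_ID)
nrow(items_demo) - nrow(merged_demo) == length(dropped_ids_demo)
```

Comparing the number of dropped IDs with the expected number of speedy respondents is a cheap way to confirm that the join removed nothing else.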

### Quality checks of the new dataframe (n = 4460 (3713;747))
```{r}
dim(math.items1stAtt_wide_nsp_umgc1ua2_wPersonal) # 4460  182

# No empty rows
nrow(math.items1stAtt_wide_nsp_umgc1ua2_wPersonal)-
  (nrow(math.items1stAtt_wide_nsp_umgc1ua2_wPersonal[
    rowSums(is.na(math.items1stAtt_wide_nsp_umgc1ua2_wPersonal))!=
                        ncol(math.items1stAtt_wide_nsp_umgc1ua2_wPersonal),]))

# The institution_IDs were padded with zeroes
min(math.items1stAtt_wide_nsp_umgc1ua2_wPersonal$institution_ID)

#The DAACS_IDs were adjusted by adding 10000
min(math.items1stAtt_wide_nsp_umgc1ua2_wPersonal$DAACS_ID)

# No missing values on IDs
nrow(math.items1stAtt_wide_nsp_umgc1ua2_wPersonal[
  is.na(math.items1stAtt_wide_nsp_umgc1ua2_wPersonal$DAACS_ID),])
nrow(math.items1stAtt_wide_nsp_umgc1ua2_wPersonal[
  is.na(math.items1stAtt_wide_nsp_umgc1ua2_wPersonal$institution_ID),])

# No missing values on college
nrow(math.items1stAtt_wide_nsp_umgc1ua2_wPersonal[
  is.na(math.items1stAtt_wide_nsp_umgc1ua2_wPersonal$college),])

# No missing values in mathTime
nrow(math.items1stAtt_wide_nsp_umgc1ua2_wPersonal[
  is.na(math.items1stAtt_wide_nsp_umgc1ua2_wPersonal$mathTime),])
# All attempts are 1st
table(math.items1stAtt_wide_nsp_umgc1ua2_wPersonal$math_attempt,
      useNA = "always")
```

## Institution-provided student-level Data (Personal, n = 1702, 950 (55.8%) UMGC1 and 752 UA2), 2022-23

### Match common columns' names
```{r}
# Sort the column names in alphabetical order
institution_math_ua2 <-
  institution_math_ua2[, order(colnames(institution_math_ua2))]

institution_math_umgc1<-
  institution_math_umgc1[
    , order(colnames(institution_math_umgc1))]

# Many common columns' names do not match
tmp <-
  compare_df_cols(institution_math_umgc1, institution_math_ua2)
tmp
```

#### Save the comparison table as a Word document (.docx)
```{r}
#library(flextable)
#library(officer)
# Create a function to save the .docx
save_dftable_as_word <-
  function(tmp, target, caption, footer_note = NULL) {
    doc <- read_docx() %>% # Create a new Word document
      body_add_par(caption, style = "centered") %>% # Add the caption (non-optional)
      body_add_flextable(
        flextable(tmp) %>% # Create a flextable
          align(align = "center", part = "all") %>% # Center align all values in the columns
          autofit() # Autofit the column widths
      )

    if (!is.null(footer_note)) {
      doc <- doc %>% body_add_par(footer_note, style = "Normal") # Add the footer note aligned to the left
    }

    doc <- doc %>% print(target = target) # Save the Word document
  }

# Use the function
save_dftable_as_word(
  tmp = tmp,
  target = "institution_math_umgc1_vs_institution_math_ua2_variables.docx", # doc's name
  caption = "institution_math_umgc1 Structure VS institution_math_ua2 Structure", # Table's title
  footer_note = "Data: IES 2022-23 UMGC1 and UA2 students" ) # optional footer note
```

##### Rename the common but non-matching variables
```{r}
# Rename multiple columns at once. match() pairs each old name with its own
# replacement, so the result does not depend on the column order
# UMGC1
umgc1_old_tmp <-
  c("credits_attempted", "credits_earned", "gpa_term", "military", "pell")
umgc1_new_tmp <-
  c("credits_attempted_f22", "credits_earned_f22", "gpa_term_f22", "Military", "Pell")
names(institution_math_umgc1)[
  match(umgc1_old_tmp, names(institution_math_umgc1))] <- umgc1_new_tmp

# UA2
ua2_old_tmp <-
  c("credits_passed_f22", "DAACSAssignment", "Race_Ethnicity.x",
    "StudentType", "term_gpa_f22", "Transfer_credits")
ua2_new_tmp <-
  c("credits_earned_f22", "group", "ethnicity", "transfer",
    "gpa_term_f22", "credits_transferred")
names(institution_math_ua2)[
  match(ua2_old_tmp, names(institution_math_ua2))] <- ua2_new_tmp
```
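
When several columns are renamed at once, the indexing style matters: a logical `%in%` index walks the columns in data-frame order, whereas `match()` follows the order of the old-name vector. The toy example below (hypothetical `_demo` objects, not the institutional data) shows how the two can differ when the columns are not in the same order as the name vectors.

```{r}
# Hypothetical toy data frame whose columns are NOT in the order of old_demo
df_demo  <- data.frame(pell = 1, gpa_term = 3.2)
old_demo <- c("gpa_term", "pell")
new_demo <- c("gpa_term_f22", "Pell")

# Logical indexing assigns the new names in column order: the names get swapped
by_in_demo <- df_demo
names(by_in_demo)[names(by_in_demo) %in% old_demo] <- new_demo

# match() pairs each old name with its own replacement
by_match_demo <- df_demo
names(by_match_demo)[match(old_demo, names(by_match_demo))] <- new_demo

names(by_in_demo)    # the pell column is now labelled gpa_term_f22
names(by_match_demo) # the pell column is labelled Pell, as intended
```

Sorting the columns alphabetically beforehand, as done earlier in this section, makes the `%in%` form work only when the old-name vector is listed in that same sorted order.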