---
title: "Logistic Regression"
author: "Allan Omondi"
date: "`r Sys.Date()`"
output:
  pdf_document:
    toc: true
    toc_depth: 4
    number_sections: true
    fig_width: 5
    fig_height: 5
    fig_crop: false
    keep_tex: true
    latex_engine: xelatex
  word_document:
    toc: true
    toc_depth: 4
    number_sections: true
    fig_width: 5
    keep_md: true
  html_document:
    toc: true
    toc_depth: 4
    number_sections: true
    fig_width: 4
    fig_height: 4
    self_contained: false
    keep_md: true
  html_notebook:
    toc: true
    toc_depth: 4
    number_sections: true
    fig_width: 4
    self_contained: false
---
```{r setup_chunk, message=FALSE, warning=FALSE}
# Install pacman (used throughout for package management) if it is missing
if (!"pacman" %in% installed.packages()[, "Package"]) {
  install.packages("pacman", dependencies = TRUE)
}

pacman::p_load("here")
knitr::opts_knit$set(root.dir = here::here())
```
# Load the Dataset
The synthetic subscription churn dataset contains 1,000 observations and four variables:
1. **monthly_fee:** The monthly subscription fee paid by the customer.
2. **customer_age:** The age of the customer in years.
3. **support_calls:** The number of support calls the customer made in the last month.
4. **renew:** The outcome (dependent) variable (1 = the customer will renew; 0 = the customer will cancel).
```{r load_dataset, echo=TRUE, message=FALSE, warning=FALSE}
pacman::p_load("readr")

subscription_churn_data <- read_csv(
  "data/subscription_churn.csv",
  col_types = cols(
    monthly_fee = col_double(),
    customer_age = col_double(),
    support_calls = col_integer(),
    # glm() with family = binomial treats the first factor level as "failure"
    # and the second as "success", so "0" (cancel) is listed first to model
    # the log-odds of renewal (renew = 1)
    renew = col_factor(levels = c("0", "1"))
  )
)

head(subscription_churn_data)
```
# Initial EDA
[**View the Dimensions**]{.underline}
The number of observations and the number of variables.
```{r show_dimensions, echo=TRUE, message=FALSE, warning=FALSE}
dim(subscription_churn_data)
```
[**View the Data Types**]{.underline}
```{r show_data_types_1, echo=TRUE, message=FALSE, warning=FALSE}
sapply(subscription_churn_data, class)
```
```{r show_data_types_2, echo=TRUE, message=FALSE, warning=FALSE}
str(subscription_churn_data)
```
[**Descriptive Statistics**]{.underline}
## Measures of Frequency
79.4% of the customers renewed their subscription and 20.6% cancelled it; the outcome classes are therefore imbalanced.
```{r frequency, echo=TRUE, message=FALSE, warning=FALSE}
subscription_churn_data_freq <- subscription_churn_data$renew
cbind(frequency = table(subscription_churn_data_freq),
percentage = prop.table(table(subscription_churn_data_freq)) * 100)
```
## Measures of Central Tendency
The median and the mean of each numeric variable:
```{r central_tendency, echo=TRUE, message=FALSE, warning=FALSE}
summary(subscription_churn_data)
```
## Measures of Distribution
Measuring the variability in the dataset is important because the amount of variability determines **how well you can generalize** results from the sample to a new observation in the population.
Low variability is ideal because it means that you can better predict information about the population based on the sample data. High variability means that the values are less consistent, thus making it harder to make predictions.
### Variance
```{r distribution_variance, echo=TRUE, message=FALSE, warning=FALSE}
sapply(subscription_churn_data[, 1:3], var)
```
### Standard Deviation
```{r distribution_standard_deviation, echo=TRUE, message=FALSE, warning=FALSE}
sapply(subscription_churn_data[, 1:3], sd)
```
### Kurtosis
The kurtosis indicates how prone a distribution is to producing outliers. There are different formulas for calculating kurtosis. Specifying `type = 2` uses the second formula, the same excess-kurtosis formula used in other statistical software such as SPSS and SAS, for which a normal distribution scores 0:
1. Kurtosis \< 0 implies fewer and less extreme outliers than a normal distribution
2. Kurtosis ≈ 0 implies outlier behaviour similar to that of a normal distribution
3. Kurtosis \> 0 implies more frequent and more extreme outliers than a normal distribution
```{r distribution_kurtosis, echo=TRUE, message=FALSE, warning=FALSE}
pacman::p_load("e1071")
# Apply kurtosis to the numeric columns only; "renew" is a factor
sapply(subscription_churn_data[, 1:3], kurtosis, type = 2)
```
### Skewness
Skewness identifies the asymmetry of the distribution of results. As with kurtosis, there are several ways of computing skewness. Using `type = 2` (the formula used in other statistical software such as SPSS and SAS), the values can be interpreted as follows:
1. Skewness between -0.4 and 0.4 (inclusive) implies an approximately symmetric (normal/Gaussian) distribution.
2. Skewness above 0.4 implies a positive skew: a right-skewed distribution.
3. Skewness below -0.4 implies a negative skew: a left-skewed distribution.
```{r distribution_skewness, echo=TRUE, message=FALSE, warning=FALSE}
# Apply skewness to the numeric columns only; "renew" is a factor
sapply(subscription_churn_data[, 1:3], skewness, type = 2)
```
## Measures of Relationship
### Covariance
Covariance is a statistical measure that indicates the direction of the linear relationship between two variables. It assesses whether increases in one variable correspond to increases or decreases in another.
- **Positive Covariance:** When one variable increases, the other tends to increase as well.
- **Negative Covariance:** When one variable increases, the other tends to decrease.
- **Zero Covariance:** No linear relationship exists between the variables.
While covariance indicates the direction of a relationship, it does not convey the strength or consistency of the relationship. The correlation coefficient is used to indicate the strength of the relationship.
```{r distribution_covariance, echo=TRUE, message=FALSE, warning=FALSE}
cov(subscription_churn_data[,1:3], method = "spearman")
```
### Correlation
A strong correlation between variables enables us to better predict the value of the dependent variable using the value of the independent variable. However, a weak correlation between two variables does not help us to predict the value of the dependent variable from the value of the independent variable. This is useful only if there is a linear association between the variables.
We can measure the statistical significance of the correlation using Spearman's rank correlation *rho*. This shows us if the variables are significantly monotonically related. A monotonic relationship between two variables implies that as one variable increases, the other variable either consistently increases or consistently decreases. The key characteristic is the preservation of the direction of change, though the rate of change may vary.
To view the correlations between all pairs of numeric variables at the same time:
```{r distribution_correlation_2, echo=TRUE, message=FALSE, warning=FALSE}
cor(subscription_churn_data[,1:3], method = "spearman")
```
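The chunk above reports only the correlation coefficients. To test the statistical significance of one of these correlations, `cor.test()` reports Spearman's rank correlation *rho* together with a p-value. A minimal sketch for one pair of variables (`exact = FALSE` avoids the exact-p-value warning when ties are present):
```{r correlation_significance, echo=TRUE, message=FALSE, warning=FALSE}
cor.test(subscription_churn_data$monthly_fee,
         subscription_churn_data$customer_age,
         method = "spearman", exact = FALSE)
```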
## Basic Visualizations
### Histogram
```{r visualization_histogram, echo=TRUE, fig.width=5, message=FALSE, warning=FALSE}
par(mfrow = c(1, 2))
for (i in 1:4) {
if (is.numeric(subscription_churn_data[[i]])) {
hist(subscription_churn_data[[i]],
main = names(subscription_churn_data)[i],
xlab = names(subscription_churn_data)[i])
} else {
message(paste("Column", names(subscription_churn_data)[i],
"is not numeric and will be skipped."))
}
}
```
### Box and Whisker Plot
```{r visualization_boxplot, echo=TRUE, fig.width=5, message=FALSE, warning=FALSE}
par(mfrow = c(1, 2))
for (i in 1:4) {
if (is.numeric(subscription_churn_data[[i]])) {
boxplot(subscription_churn_data[[i]], main = names(subscription_churn_data)[i])
} else {
message(paste("Column", names(subscription_churn_data)[i],
"is not numeric and will be skipped."))
}
}
```
### Missing Data Plot
```{r missing_data_plot, echo=TRUE, fig.width=5, message=FALSE, warning=FALSE}
pacman::p_load("Amelia")
missmap(subscription_churn_data, col = c("red", "grey"), legend = TRUE)
```
### Correlation Plot
```{r correlation_plot, echo=TRUE, fig.width=5, message=FALSE, warning=FALSE}
pacman::p_load("ggcorrplot")
ggcorrplot(cor(subscription_churn_data[,1:3]))
```
### Scatter Plot
```{r scatter_plot_1, echo=TRUE, fig.width=5, message=FALSE, warning=FALSE}
pacman::p_load("corrplot")
pairs(renew ~ ., data = subscription_churn_data, col = subscription_churn_data$renew)
```
```{r scatter_plot_2, echo=TRUE, fig.width=5, message=FALSE, warning=FALSE}
pacman::p_load("ggplot2")
ggplot(subscription_churn_data,
aes(x = customer_age, y = monthly_fee)) +
geom_point() +
geom_smooth(method = lm) +
labs(
title = "Relationship between Monthly Fee and \nCustomer Age",
x = "Customer Age",
y = "Monthly Fee"
)
```
```{r scatter_plot_3, echo=TRUE, fig.width=5, message=FALSE, warning=FALSE}
pacman::p_load("ggplot2")
ggplot(subscription_churn_data,
aes(x = customer_age, y = monthly_fee, color = renew)) +
geom_point() +
geom_smooth(method = lm) +
labs(
title = "Relationship between Monthly Fee and \nCustomer Age for Each Renewal Status",
x = "Customer Age",
y = "Monthly Fee",
color = "Renewal Status"
) +
scale_color_discrete(
labels = c("1" = "Renewed Subscription", "0" = "Cancelled Subscription")
) +
theme_minimal() # Optional: adds a cleaner theme
```
# Statistical Test
We then fit a logistic regression model as the statistical test of regression.
```{r statistical_test_SLR, echo=TRUE, message=FALSE, warning=FALSE}
log_test <- glm(renew ~ monthly_fee + customer_age + support_calls,
                # family = binomial specifies that the generalized linear
                # model to use is logistic regression
                family = binomial,
                data = subscription_churn_data)
```
View the summary of the model.
```{r statistical_test_interpretation, echo=TRUE, message=FALSE, warning=FALSE}
summary(log_test)
```
The logistic regression equation is in the following form:
$$
\text{log-odds(renew)} = \beta_0 + \beta_1 \cdot fee + \beta_2 \cdot age + \beta_3 \cdot calls
$$
Plugging in the coefficients from the output gives:
$$
\text{log-odds(renew)} = -0.941778 - 0.043563 \cdot fee + 0.026110 \cdot age + 0.697795 \cdot calls
$$
The log-odds is then converted into a probability using the logistic function as:
$$
P(\text{renew = 1}) = \frac{1}{1+e^{-\text{log-odds}}}
$$
- If P(renew = 1) ≥ 0.5, predict renewal of the subscription (1).
- If P(renew = 1) \< 0.5, predict cancellation of the subscription (0).
For example, a customer with a monthly fee of 50, an age of 62, and 3 support calls in the past month will probably renew their subscription:
```{r sample_prediction_a, echo=TRUE, message=FALSE, warning=FALSE}
coefs <- coef(log_test)
log_odds <- coefs["(Intercept)"] +
coefs["monthly_fee"] * 50 +
coefs["customer_age"] * 62 +
coefs["support_calls"] * 3
p_manual <- 1 / (1 + exp(-log_odds))
print(p_manual)
```
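Equivalently, the fitted model's `predict()` method computes the probability directly. A minimal sketch using the same hypothetical customer (`new_customer` is an illustrative name):
```{r sample_prediction_predict, echo=TRUE, message=FALSE, warning=FALSE}
# type = "response" returns probabilities instead of log-odds
new_customer <- data.frame(monthly_fee = 50,
                           customer_age = 62,
                           support_calls = 3L)
predict(log_test, newdata = new_customer, type = "response")
```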
Similarly, a customer with a monthly fee of 50, an age of 21, and 3 support calls in the past month will probably cancel their subscription:
```{r sample_prediction_b, echo=TRUE, message=FALSE, warning=FALSE}
coefs <- coef(log_test)
log_odds <- coefs["(Intercept)"] +
coefs["monthly_fee"] * 50 +
coefs["customer_age"] * 21 +
coefs["support_calls"] * 3
p_manual <- 1 / (1 + exp(-log_odds))
print(p_manual)
```
## The $\chi^2$ Statistic and its p-Value
To obtain the p-value of the $\chi^2$ statistic:
```{r p_value_of_chi_2, echo=TRUE, message=FALSE, warning=FALSE}
# The chi-squared statistic is the drop in deviance from the null model
chi_2 <- log_test$null.deviance - log_test$deviance
print(chi_2)

# Degrees of freedom = number of predictors in the model
calculated_df <- log_test$df.null - log_test$df.residual
p <- pchisq(chi_2, df = calculated_df, lower.tail = FALSE)

# to format as scientific notation with four digits after the decimal place:
# sprintf("%.4e", p)
format.pval(p, eps = .Machine$double.eps, digits = 2)
```
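Equivalently, the model can be compared against an explicit null (intercept-only) model with a likelihood-ratio test. A minimal sketch (`null_model` is an illustrative name):
```{r p_value_of_chi_2_lrt, echo=TRUE, message=FALSE, warning=FALSE}
# Intercept-only model for the likelihood-ratio comparison
null_model <- glm(renew ~ 1, family = binomial, data = subscription_churn_data)
anova(null_model, log_test, test = "Chisq")
```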
## 95% Confidence Interval of the Parameters
To obtain a 95% confidence interval for the parameters:
```{r 95_confidence_interval, echo=TRUE, message=FALSE, warning=FALSE}
confint(log_test, level = 0.95)
```
## Odds Ratio
Exponentiating the coefficients yields odds ratios, which quantify the multiplicative change in odds for a one‑unit increase in the predictor.
```{r odds_ratio, echo=TRUE, message=FALSE, warning=FALSE}
exp(coef(log_test))
```
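For example, the coefficient of 0.697795 for `support_calls` exponentiates to $e^{0.697795} \approx 2.01$, so each additional support call roughly doubles the odds of renewal.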
### 95% Confidence Interval for the Odds Ratio
To obtain a 95% confidence interval for the Odds Ratio:
```{r 95_confidence_interval_OR, echo=TRUE, message=FALSE, warning=FALSE}
pacman::p_load("dplyr")
confint(log_test) %>% exp()
```
## Akaike Information Criterion (AIC)
AIC measures how well a model fits the data it was estimated on, penalized for model complexity: it is computed from the model's maximized likelihood and its number of estimated parameters. When comparing models fitted to the same data, the model with the lower AIC offers the better trade-off between goodness of fit and complexity. The model's AIC value of 914.19 can be compared with the AIC value of alternative models, e.g., a logistic regression that drops one of the original predictors, as sketched below.
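As a sketch of such a comparison, the chunk below refits the model without `monthly_fee` (a hypothetical alternative specification; `log_test_reduced` is an illustrative name) and compares the AIC values:
```{r aic_comparison, echo=TRUE, message=FALSE, warning=FALSE}
log_test_reduced <- glm(renew ~ customer_age + support_calls,
                        family = binomial,
                        data = subscription_churn_data)

# Lower AIC indicates the better fit-complexity trade-off
AIC(log_test, log_test_reduced)
```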
## McFadden’s Pseudo R^2^
Logistic regression does not use the traditional R², but we can compute a **pseudo-R²** (McFadden’s) as follows:
```{r mcfaddens_pseudo_R2, echo=TRUE, message=FALSE, warning=FALSE}
pseudo_r2 <- 1 - (log_test$deviance / log_test$null.deviance)
print(pseudo_r2)
```
A McFadden's pseudo-R^2^ of 0.109 suggests that the predictors account for approximately 10.9% of the null deviance. Values of pseudo-R^2^ \> 0.2 are considered strong in logistic regression.
## Fisher Scoring Iterations
There were 4 Fisher scoring iterations which indicates that the fitting routine required four iterative updates to reach convergence. A small number (\< 10) suggests that the algorithm found the maximum-likelihood estimates efficiently without convergence warnings.
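The iteration count is also stored on the fitted model object, so it can be checked programmatically:
```{r fisher_iterations, echo=TRUE, message=FALSE, warning=FALSE}
# Number of iteratively reweighted least squares (IWLS) iterations
log_test$iter
```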
## Model Fit Metrics
The ROC Curve and the AUC gives insight into how well the logistic regression model distinguishes between the two outcome classes (1 = renew and 0 = cancel).
- The x-axis shows the False Positive Rate (FPR), i.e., 1 − specificity.
- The y-axis shows the True Positive Rate (TPR), or sensitivity.
- The curve shows how TPR and FPR change as the decision threshold changes from 0 to 1.
- A curve closer to the top-left corner indicates better model performance.
```{r model_fit_a, echo=TRUE, fig.width=5, message=FALSE, warning=FALSE}
pacman::p_load("pROC")
predicted_probs <- predict(log_test, type = "response")
roc_obj <- roc(subscription_churn_data$renew, predicted_probs)
auc_value <- auc(roc_obj)
plot(roc_obj, col = "blue", main = "ROC Curve for Logistic Regression")
```
```{r model_fit_b, echo=TRUE, fig.width=5, message=FALSE, warning=FALSE}
print(auc_value)
```
| AUC Value | Interpretation |
|-----------|---------------------------------------------------------------|
| 0.5 | Using the logistic regression model is equivalent to randomly guessing |
| 0.5 - 0.7 | Poor discrimination of the classes |
| 0.7 - 0.8 | Acceptable discrimination of the classes (fair) |
| 0.8 - 0.9 | Good discrimination of the classes |
| 0.9 - 1.0 | Excellent discrimination of the classes |
: AUC Value Interpretation
An AUC of 0.7167 indicates that the model has an acceptable ability to discriminate between the two classes.
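A confusion matrix at the default 0.5 threshold complements the AUC by showing the actual classification counts. A minimal sketch reusing `predicted_probs` from above (`predicted_class` is an illustrative name):
```{r confusion_matrix, echo=TRUE, message=FALSE, warning=FALSE}
# Classify each customer using the 0.5 probability threshold
predicted_class <- ifelse(predicted_probs >= 0.5, "1", "0")
table(Predicted = predicted_class, Actual = subscription_churn_data$renew)
```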
# Diagnostic EDA
Diagnostic EDA is performed to validate that the regression assumptions hold for the statistical test. Validating the regression assumptions in turn ensures that the statistical tests applied are appropriate for the data and helps to prevent incorrect conclusions.
## Test of Linearity
The test of linearity is necessary given that linearity is one of the key assumptions of statistical tests of regression and verifying it is crucial for ensuring the validity of the model's estimates and predictions.
**Component-Plus-Residual (Partial-Residual) Plots**
Logistic regression assumes that the relationship between the log-odds of the outcome and the continuous predictors is linear. A log-odds is the natural logarithm of the odds of an event occurring.
If *p* is the probability of an event (where 0 \< *p* \< 1), the odds of the event occurring is the ratio of the probability that the event occurs to the probability that it does not, i.e., $\frac{p}{1 - p}$, and the log-odds (computed using the logit function) of the event is:
$$ \text{logit}(p) = \ln\left(\frac{p}{1-p}\right) $$
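For example, *p* = 0.8 gives odds of $\frac{0.8}{0.2} = 4$ and a log-odds of $\ln(4) \approx 1.39$.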
A roughly straight line indicates the "log-odds - predictor" linearity assumption is met.
```{r test_of_linearity, echo=TRUE, fig.width=5, message=FALSE, warning=FALSE}
pacman::p_load("car")
crPlots(log_test)
```
## Test of Independence of Errors (Autocorrelation)
This test is necessary to confirm that each observation is independent of the other. It helps to identify **autocorrelation** that is introduced when the data is collected over a close period of time or when one observation is related to another observation. Autocorrelation leads to underestimated standard errors and inflated test statistics. It can also make findings appear more significant than they actually are.
The "**Durbin-Watson Test**" can be used as a test of independence of errors (test of autocorrelation). A Durbin-Watson statistic close to 2 suggests no autocorrelation, while values approaching 0 or 4 indicate positive or negative autocorrelation, respectively.
- The null hypothesis, H~0~, is that there is no autocorrelation
- The alternative hypothesis, H~a~, is that there is autocorrelation
If the p-value of the Durbin-Watson statistic is greater than 0.05, there is no evidence to reject the null hypothesis that there is no autocorrelation. The results below show a p-value of 0.1573; therefore, the test of independence of errors passes.
```{r test_of_independence_of_errors, echo=TRUE, message=FALSE, warning=FALSE}
pacman::p_load("lmtest")
dwtest(log_test)
```
## Test of Normality
Logistic regression does not assume normality of residuals or predictors.
## Test of Homoscedasticity
The test of homoscedasticity is not relevant for logistic regression.
## Test of Multicollinearity
Multicollinearity arises when two or more independent variables (predictors) are highly intercorrelated. The **Variance Inflation Factor (VIF)** quantifies how much the variance of a coefficient estimate is "inflated" by multicollinearity. A VIF of 1 indicates no collinearity, while values above 5 suggest problematic collinearity that makes the affected coefficient estimates less reliable.
```{r multicollinearity, echo=TRUE, message=FALSE, warning=FALSE}
pacman::p_load("car")
car::vif(log_test)
```
## Test of Outliers
The `influencePlot()` function in R combines 3 key diagnostic measures into a single plot to identify influential observations.
The plot displays:
- Y-axis: Studentized residuals (standardized residuals adjusted for leverage). This measures "outlierness" in the outcome variable.
- X-axis: Leverage (hat values), measuring how "unusual" an observation is in terms of its predictor values.
- Bubble size: Cook’s distance, quantifying the influence of each observation on the model coefficients.
Top-left/bottom-left:
- Indicates observations with high residuals but low leverage: Outliers in the outcome but not predictors.
- Possible next step: Investigate the observations for misclassified outcomes.
Top-right/bottom-right:
- Indicates high residuals and high leverage: Influential outliers that distort the model.
- Possible next step: These are the most problematic observations. You need to check if they are valid data points or errors.
Middle-right:
- High leverage but residuals near 0: Unusual predictor values but well-predicted outcomes.
- Possible next step: These observations are generally safe to keep.
```{r outliers_plot, echo=TRUE, message=FALSE, warning=FALSE}
pacman::p_load("car")
influencePlot(log_test,
id = list(n = 5), # Label top 5 influential points
main = "Influence Plot",
sub = "Circle size = Cook's Distance")
```
The influential outliers can then be:
1. Corrected for data entry errors, e.g., an age of 240 years
2. Deleted so that the statistical test can be run again without them
```{r outliers_influencers_a, echo=TRUE, message=FALSE, warning=FALSE}
influential_points <- as.numeric(row.names(influencePlot(log_test)))
subscription_churn_data_infl <- subscription_churn_data[influential_points, ]
```
Print the observation numbers that have been identified as influential outliers
```{r outliers_influencers_b, echo=TRUE, message=FALSE, warning=FALSE}
print(influential_points)
```
Print the influential observations' predictor and outcome values.
```{r outliers_influencers_c, echo=TRUE, message=FALSE, warning=FALSE}
head(subscription_churn_data_infl)
```
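As a sketch of the second option, the model can be refitted with the influential observations removed and the coefficient estimates compared with the original fit (`log_test_trimmed` is an illustrative name):
```{r outliers_refit, echo=TRUE, message=FALSE, warning=FALSE}
# Drop the rows flagged as influential, then refit the same model
log_test_trimmed <- glm(renew ~ monthly_fee + customer_age + support_calls,
                        family = binomial,
                        data = subscription_churn_data[-influential_points, ])
summary(log_test_trimmed)$coefficients
```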
# Interpretation of the Results
## Academic Statement
A logistic regression analysis was conducted on the data (N = 1,000) to examine whether the monthly subscription fee paid by a customer, the customer's age, and the number of support calls the customer made in the last month predicted subscription renewal, where renew was coded 1 for subscription renewal and 0 for subscription cancellation.
Subtracting the residual deviance (906.19 on 996 df) from the null deviance (1,017.22 on 999 df) gave a $\chi^2$(3, N = 1,000) = 111.03, *p* \< .001, thus showing that the model was statistically significant compared to the null model. The set of predictors reliably distinguished between renewals and cancellations.
The results are reported in the table below:
| Predictor | $\beta$ | 95% CI | SE | *z* | *p* |
|:-----------------------:|:-------:|:--------------:|:----:|:-----:|:-------:|
| (Intercept) | -0.94 | [-2.02, 0.12] | 0.55 | -1.73 | 0.085 |
| Monthly Fee | -0.04 | [-0.06, -0.03] | 0.01 | -4.88 | \< .001 |
| Customer's Age | 0.03 | [0.01, 0.04] | 0.01 | 3.09 | .002 |
| Number of Support Calls | 0.70 | [0.54, 0.86] | 0.08 | 8.56 | \< .001 |
: Regression Coefficients Predicting Renewal from Multiple Customer Features
***Note.*** N = 1,000; SE = standard error; CI = confidence interval.
**Predictor Effects**
- The monthly fee was the first predictor ($\beta$ = -0.04, 95% CI [-0.06, -0.03], SE = 0.01, *z* = -4.88, *p* \< .001). For every unit increase in the monthly fee, the odds of renewal decreased by 4% (*OR* = 0.96, 95% CI [0.94, 0.97], *p* \< .001).
- Customer age was the second predictor ($\beta$ = 0.03, 95% CI [0.01, 0.04], SE = 0.01, *z* = 3.09, *p* = .002). For every unit increase in the customer's age, the odds of renewal increased by 3% (*OR* = 1.03, 95% CI [1.01, 1.04], *p* = .002).
- The number of support calls was the third predictor ($\beta$ = 0.70, 95% CI [0.54, 0.86], SE = 0.08, *z* = 8.56, *p* \< .001). For every unit increase in the number of support calls, the odds of renewal increased by 101% (*OR* = 2.01, 95% CI [1.72, 2.36], *p* \< .001).
- The intercept term was not statistically significant ($\beta$ = -0.94, 95% CI [-2.02, 0.12], SE = 0.55, *z* = -1.73, *p* = .085), indicating no significant baseline odds of renewal when all predictors are zero.
## Business Analysis
Key insights:
1. Monthly Fee:
- Finding: Higher monthly fees significantly reduce renewal odds (−4% per unit increase in monthly fee).
- Implication: Customers are price-sensitive; fee hikes risk cancellations.
- Recommendation: Avoid aggressive fee increases. Consider small, incremental fee adjustments for high-value customers.
2. Customer Age:
- Finding: Older customers are more likely to renew (+3% odds per year of age).
- Implication: Younger customers may need targeted retention efforts.
- Recommendation: Launch engagement campaigns for younger customers, e.g., personalized offers.
3. Support Calls:
- Finding: Each support call doubles renewal odds (+101% per unit increase in support calls).
- Implication: Proactive customer support drives loyalty and retention.
- Recommendations:
- Train customer care officers to resolve issues fully and to anticipate needs, e.g., follow-up calls after ticket closure
- Avoid over-reliance on support calls: High call volumes may indicate unresolved product issues.
## Limitations
1. Logistic regression performs poorly with small datasets or rare events (e.g., very few cancellations). The outcome here is imbalanced (79.4% renewals), so the coefficient estimates could be biased.
2. Assumes no clustering (e.g., multiple subscriptions per customer) or autocorrelation (e.g., time-series trends).
3. Stakeholders may struggle to interpret deviance or AIC values.
4. Support Call Paradox: While `support_calls` had a strong positive effect, excessive calls could signal dissatisfaction (which is not captured by the model).