S4 Activation Function: Mathematical Definition and Analysis

1. Mathematical Definition

1.1 Core Function

The S4 (Smooth + Sigmoid + SoftSign) activation function is a smooth extension of S3, designed to eliminate the derivative discontinuity at the transition point. It blends the sigmoid and softsign functions using a smooth switching mechanism:

$$ S4(x) = \alpha(x) \cdot \sigma(x) + (1 - \alpha(x)) \cdot \text{softsign}(x) $$

where:

$\sigma(x) = \frac{1}{1 + e^{-x}}$ — sigmoid function
$\text{softsign}(x) = \frac{x}{1 + |x|}$ — softsign function
$\alpha(x)$ — smooth blending coefficient, typically:

$$ \alpha(x) = \frac{1}{1 + e^{k x}} $$

with $k > 0$ controlling the transition sharpness (larger $k$ → sharper transition).
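The definition above can be sketched directly in NumPy. This is a minimal illustration, not the repository's accelerated kernel; the function names `sigmoid`, `softsign`, `alpha`, and `s4` and the default `k=1.0` are assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def softsign(x):
    """Softsign x / (1 + |x|)."""
    return x / (1.0 + np.abs(x))

def alpha(x, k=1.0):
    """Blending weight 1 / (1 + exp(k x)); close to 1 for x << 0, close to 0 for x >> 0."""
    return 1.0 / (1.0 + np.exp(k * x))

def s4(x, k=1.0):
    """S4(x) = alpha(x) * sigmoid(x) + (1 - alpha(x)) * softsign(x)."""
    x = np.asarray(x, dtype=np.float64)
    a = alpha(x, k)
    return a * sigmoid(x) + (1.0 - a) * softsign(x)

print(s4([-2.0, 0.0, 2.0]))
```

At $x = 0$ the blend evaluates to $0.5 \cdot 0.5 + 0.5 \cdot 0 = 0.25$ for every $k$. For very large $|x|$, `np.exp` may emit overflow warnings in this naive form.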

1.2 Derivative

The derivative of S4 is computed using the product rule:

$$ S4'(x) = \alpha'(x) \cdot \sigma(x) + \alpha(x) \cdot \sigma'(x) - \alpha'(x) \cdot \text{softsign}(x) + (1 - \alpha(x)) \cdot \text{softsign}'(x) $$

where:

$\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2}$
$\text{softsign}'(x) = \frac{1}{(1 + |x|)^2}$
$\alpha'(x) = -k \cdot \alpha(x) \cdot (1 - \alpha(x))$
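As a sanity check, the product-rule expression can be compared against a central finite difference. The sketch below assumes the definitions from Section 1.1; the helper name `s4_and_grad`, the grid, and the step size `h` are choices made for this example.

```python
import numpy as np

def s4_and_grad(x, k=1.0):
    """Return (S4(x), S4'(x)) using the product-rule expansion."""
    x = np.asarray(x, dtype=np.float64)
    sig = 1.0 / (1.0 + np.exp(-x))           # sigmoid
    ss = x / (1.0 + np.abs(x))               # softsign
    a = 1.0 / (1.0 + np.exp(k * x))          # blending weight alpha
    d_sig = sig * (1.0 - sig)                # equals e^{-x} / (1 + e^{-x})^2
    d_ss = 1.0 / (1.0 + np.abs(x)) ** 2
    d_a = -k * a * (1.0 - a)
    value = a * sig + (1.0 - a) * ss
    grad = d_a * sig + a * d_sig - d_a * ss + (1.0 - a) * d_ss
    return value, grad

# Central finite difference on a small grid that includes x = 0.
xs = np.linspace(-4.0, 4.0, 9)
h = 1e-6
numeric = (s4_and_grad(xs + h)[0] - s4_and_grad(xs - h)[0]) / (2.0 * h)
analytic = s4_and_grad(xs)[1]
max_err = float(np.max(np.abs(numeric - analytic)))
print(f"max |numeric - analytic| = {max_err:.2e}")
```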

1.3 Continuity Properties

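Function continuity: $S4(x)$ is continuous and continuously differentiable ($C^1$) over $\mathbb{R}$. It is not $C^\infty$: the $|x|$ term in softsign makes the second derivative jump at $x = 0$.
Derivative continuity: $S4'(x)$ is continuous across $x = 0$, with no jump as in S3
Parameter $k$ controls the width of the smooth transition zone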

2. Key Characteristics

2.1 Domain and Range

Domain: $D(S4) = \mathbb{R}$
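Range: $E(S4) \subset (0, 1)$ for $k \geq 1$. In general S4 is a weighted average of $\sigma(x) \in (0, 1)$ and $\text{softsign}(x) \in (-1, 1)$, so for small $k$ it can dip below zero at negative $x$ (for example, $k = 0.1$ and $x = -5$ give $S4 \approx -0.31$).
Asymptotes:
- $\lim_{x \to -\infty} S4(x) = 0$
- $\lim_{x \to +\infty} S4(x) = 1$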
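A quick numerical sweep, assuming the definitions in Section 1.1, shows how the attained range depends on $k$. The grid limits and the sampled values of `k` are arbitrary choices for this sketch.

```python
import numpy as np

def s4(x, k):
    """S4 as defined in Section 1.1 (plain NumPy, illustrative only)."""
    x = np.asarray(x, dtype=np.float64)
    a = 1.0 / (1.0 + np.exp(k * x))
    sig = 1.0 / (1.0 + np.exp(-x))
    ss = x / (1.0 + np.abs(x))
    return a * sig + (1.0 - a) * ss

xs = np.linspace(-30.0, 30.0, 6001)
for k in (0.1, 1.0, 5.0):
    ys = s4(xs, k)
    print(f"k={k}: min={ys.min():.4f}, max={ys.max():.4f}")
```

The softsign term is negative for $x < 0$, so a slowly varying $\alpha(x)$ (small $k$) lets negative values through before the sigmoid takes over.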

2.2 Critical Points

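No hard transition: smooth blending in a region around $x = 0$, where $S4(0) = 0.25$ for every $k$
Monotonicity: depends on $k$. The slope at the origin is $S4'(0) = (5 - k)/8$, so S4 increases through the origin only for $k < 5$, and for small $k$ the negative softsign values pull the curve downward on part of the negative axis
Convexity:
- Convex (concave upward) for large negative $x$, where the sigmoid term dominates (for $k \geq 1$)
- Concave (concave downward) for large positive $x$, where the softsign term dominates
- Mixed curvature in the transition zone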
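To make the behavior in the transition zone concrete, the slope at the origin can be worked out from the derivative in Section 1.2. Using $\alpha(0) = \sigma(0) = \tfrac{1}{2}$, $\text{softsign}(0) = 0$, $\sigma'(0) = \tfrac{1}{4}$, $\text{softsign}'(0) = 1$, and $\alpha'(0) = -\tfrac{k}{4}$:

$$ S4'(0) = \left(-\frac{k}{4}\right) \cdot \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{4} - \left(-\frac{k}{4}\right) \cdot 0 + \frac{1}{2} \cdot 1 = \frac{5 - k}{8} $$

For $k = 1$ this gives $S4'(0) = 0.5$; the slope at the origin reaches zero at $k = 5$ and is negative beyond that.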

2.3 Gradient Properties

Derivative maximum occurs at $x \approx 0$ but without a sharp jump
Gradient behavior:
- Smooth exponential decay in negative region
- Smooth power-law decay in positive region

3. Advantages of the S4 Function

3.1 Theoretical Benefits

Continuous first derivative: eliminates the derivative discontinuity present in S3
Controlled transition via parameter $k$
Stable gradient flow for optimization
Asymmetric behavior retained from S3 for richer representational power

3.2 Practical Strengths

Better convergence in gradient-based learning
Reduced risk of training instability
Parameter tuning allows adaptation to specific tasks
No branching operations — pure mathematical composition

4. Disadvantages of the S4 Function

4.1 Critical Weaknesses

Extra hyperparameter ($k$) to tune
Slightly higher computational cost due to blending term and extra exponentials
Potential over-smoothing if $k$ too small — may reduce nonlinearity

4.2 Practical Limitations

Not widely implemented in frameworks (requires custom definition)
More sensitive to initialization than S3
Behavior depends heavily on $k$

5a. Comparison with S3 and Classical Activations

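| Property | S4 | S3 | Sigmoid | Softsign | ReLU |
| --- | --- | --- | --- | --- | --- |
| Continuity | ✓ | ✗ (jump at $x = 0$ under the piecewise definition) | ✓ | ✓ | ✓ |
| Smoothness | ✓ | ✗ | ✓ | ✓ | ✗ |
| Boundedness | ✓ | ✓ | ✓ | ✓ | ✗ |
| Vanishing Gradients | Partial | Partial | ✓ | Less | ✗ |
| Symmetry | ✗ | ✗ | ✗ | ✓ | ✗ |
| Computational Cost | Medium-High | Medium | High | Low | Low |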

5b. High-Performance Accelerated Implementation

This repository provides an accelerated implementation of the S4 activation function that aims to outperform the plain NumPy version on modern hardware. The original pure Python/NumPy function has been extended with the Numba library to provide:

- Just-In-Time (JIT) compilation: a CPU-specific version (s4_cpu_kernel) is JIT-compiled into optimized machine code for multi-core processors.
- CUDA acceleration: a dedicated CUDA kernel (s4_cuda_kernel) parallelizes the computation on NVIDIA GPUs for large-scale machine learning workloads.
- Backend dispatcher: the S4Activation class selects the backend (cpu or gpu). In 'auto' mode it keeps small arrays on the CPU to avoid GPU memory transfer overhead and sends large arrays to the GPU.
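The size-based 'auto' rule can be illustrated with a small stand-alone sketch. This is not the repository's S4Activation class: the class name `SizeBasedDispatcher`, the `gpu_threshold` value, and the pluggable `cpu_fn`/`gpu_fn` callables are all hypothetical, and a NumPy function stands in for both backends so the example runs without a GPU.

```python
import numpy as np

def s4_numpy(x, k=1.0):
    """Plain NumPy S4, used here as a stand-in for a compiled backend."""
    a = 1.0 / (1.0 + np.exp(k * x))
    return a / (1.0 + np.exp(-x)) + (1.0 - a) * x / (1.0 + np.abs(x))

class SizeBasedDispatcher:
    """Hypothetical dispatcher: small arrays go to cpu_fn, large ones to gpu_fn."""

    def __init__(self, cpu_fn, gpu_fn=None, gpu_threshold=1_000_000):
        self.cpu_fn = cpu_fn
        self.gpu_fn = gpu_fn                  # None means no GPU backend is available
        self.gpu_threshold = gpu_threshold    # assumed cutoff, in number of elements

    def choose(self, x, backend="auto"):
        if backend not in ("auto", "cpu", "gpu"):
            raise ValueError(f"unknown backend: {backend!r}")
        if backend == "auto":
            big_enough = x.size >= self.gpu_threshold
            return "gpu" if (self.gpu_fn is not None and big_enough) else "cpu"
        if backend == "gpu" and self.gpu_fn is None:
            raise RuntimeError("no GPU backend registered")
        return backend

    def __call__(self, x, backend="auto"):
        x = np.asarray(x, dtype=np.float64)
        fn = self.gpu_fn if self.choose(x, backend) == "gpu" else self.cpu_fn
        return fn(x)

dispatch = SizeBasedDispatcher(s4_numpy, gpu_fn=s4_numpy, gpu_threshold=10)
print(dispatch.choose(np.zeros(5)), dispatch.choose(np.zeros(20)))
```

A real threshold would have to be measured, since the break-even point depends on transfer cost and kernel launch overhead on the target machine.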

Performance Benchmark

The CUDA backend was benchmarked against the standard NumPy implementation on a large array. The following test was conducted on a Google Colab T4 GPU with an array of 10 million 64-bit floats.

| Backend | Runs | Execution Time (s) | Speedup |
| --- | --- | --- | --- |
| Baseline (NumPy) | 10 | 3.847 | 1.00x |
| Numba CUDA | 10 | 0.691 | 5.57x |
| Baseline (NumPy) | 100 | 37.432 | 1.00x |
| Numba CUDA | 100 | 5.790 | 6.46x |
| Baseline (NumPy) | 1000 | 369.186 | 1.00x |
| Numba CUDA | 1000 | 57.264 | 6.45x |

The resulting speedup of roughly 5.6x to 6.5x shows the benefit of the CUDA backend when processing large arrays. For detailed usage and to run the benchmark yourself, see the experiment_7.py script.
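The speedup column is the baseline time divided by the CUDA time for the same number of runs. The short sketch below recomputes it from the timings in the table; the helper name `speedup` is an assumption for this example, and no new measurements are made.

```python
# (runs, NumPy seconds, Numba CUDA seconds), copied from the benchmark table
timings = [
    (10, 3.847, 0.691),
    (100, 37.432, 5.790),
    (1000, 369.186, 57.264),
]

def speedup(baseline_s, candidate_s):
    """Ratio of baseline time to candidate time (higher is better)."""
    return baseline_s / candidate_s

for runs, numpy_s, cuda_s in timings:
    per_run_ms = 1000.0 * cuda_s / runs
    print(f"{runs:>4} runs: {speedup(numpy_s, cuda_s):.2f}x, "
          f"{per_run_ms:.1f} ms per CUDA run")
```

Per run, the CUDA time works out to about 69.1 ms for the 10-run batch versus 57.9 ms and 57.3 ms for the longer batches.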

S3 Activation Function: Mathematical Definition and Analysis

6. Mathematical Definition

6.1 Core Function

The S3 (Sigmoid + SoftSign) activation function is defined as a piecewise function:

$$ S3(x) = \begin{cases} \sigma(x) = \frac{1}{1 + e^{-x}}, & \text{if } x \leq 0 \\ \text{softsign}(x) = \frac{x}{1 + |x|}, & \text{if } x > 0 \end{cases} $$
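A direct NumPy transcription of the piecewise definition might look as follows. The function name `s3` is an assumption for this sketch; `np.where` evaluates both branches and then selects per element.

```python
import numpy as np

def s3(x):
    """Piecewise S3: sigmoid for x <= 0, softsign for x > 0."""
    x = np.asarray(x, dtype=np.float64)
    sig = 1.0 / (1.0 + np.exp(-x))
    ss = x / (1.0 + np.abs(x))
    return np.where(x <= 0.0, sig, ss)

print(s3([-1.0, 0.0, 1e-9, 1.0]))
```

Evaluating at `0.0` uses the sigmoid branch and returns 0.5, while inputs just to the right of zero use the softsign branch and return values close to 0.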

where:

$\sigma(x)$ is the sigmoid function
$\text{softsign}(x)$ is the softsign function

6.2 Derivative

The derivative of S3 is defined as:

$$ S3'(x) = \begin{cases} \sigma'(x) = \sigma(x)(1 - \sigma(x)) = \frac{e^{-x}}{(1 + e^{-x})^2}, & \text{if } x < 0 \\ \text{undefined}, & \text{if } x = 0 \\ \text{softsign}'(x) = \frac{1}{(1 + |x|)^2}, & \text{if } x > 0 \end{cases} $$
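The piecewise derivative can be sketched the same way. Returning NaN at exactly $x = 0$ mirrors the "undefined" case above and is a choice made for this illustration; the function name `s3_grad` is likewise an assumption.

```python
import numpy as np

def s3_grad(x):
    """Piecewise S3 derivative; NaN marks the undefined point x = 0."""
    x = np.asarray(x, dtype=np.float64)
    sig = 1.0 / (1.0 + np.exp(-x))
    left = sig * (1.0 - sig)                  # sigmoid'(x), used for x < 0
    right = 1.0 / (1.0 + np.abs(x)) ** 2      # softsign'(x), used for x > 0
    out = np.where(x < 0.0, left, right)
    return np.where(x == 0.0, np.nan, out)

# One-sided values close to the transition point.
print(s3_grad([-1e-8, 0.0, 1e-8]))
```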

6.3 Continuity Properties

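Left limit and value: $\lim_{x \to 0^-} S3(x) = S3(0) = \sigma(0) = 0.5$
Right limit: $\lim_{x \to 0^+} S3(x) = \lim_{x \to 0^+} \frac{x}{1 + |x|} = 0$, so the piecewise definition above has a jump of $0.5$ at $x = 0$
Derivative discontinuity: $\lim_{x \to 0^-} S3'(x) = 0.25 \neq 1 = \lim_{x \to 0^+} S3'(x)$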

7. Key Characteristics

7.1 Domain and Range

Domain: $D(S3) = \mathbb{R}$
Range: $E(S3) = (0, 1)$
Asymptotes:
- $\lim_{x \to -\infty} S3(x) = 0$
- $\lim_{x \to +\infty} S3(x) = 1$

7.2 Critical Points

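Transition point: $x = 0$, where $S3(0) = 0.5$
Monotonicity: strictly increasing on each branch ($x \leq 0$ and $x > 0$); across the transition the value drops from $0.5$ to values near $0$, so S3 is not monotone over all of $\mathbb{R}$
Convexity:
- Convex (concave upward) for $x < 0$, since $\sigma''(x) > 0$ there
- Concave (concave downward) for $x > 0$, since $\text{softsign}''(x) = -2/(1 + x)^3 < 0$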

7.3 Gradient Properties

Derivative maximum (approached at the transition):
- As $x \to 0^-$: $S3'(x) \to 0.25$
- As $x \to 0^+$: $S3'(x) \to 1$
Gradient behavior:
- For $x < 0$: exponential decay
- For $x > 0$: power-law decay $\propto x^{-2}$
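The difference between the two decay regimes is easy to see numerically. The sketch below evaluates the two branch derivatives at mirrored points; the helper names and the sampled distances are choices made for this example.

```python
import numpy as np

def sigmoid_grad(x):
    """sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)); the S3 gradient for x < 0."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def softsign_grad(x):
    """softsign'(x) = 1 / (1 + |x|)^2; the S3 gradient for x > 0."""
    return 1.0 / (1.0 + np.abs(x)) ** 2

for d in (1.0, 5.0, 10.0, 20.0):
    print(f"|x|={d:>4}: S3'(-|x|)={sigmoid_grad(-d):.3e}, "
          f"S3'(+|x|)={softsign_grad(d):.3e}")
```

At $|x| = 20$ the softsign-side gradient is $1/441$, while the sigmoid-side gradient is on the order of $e^{-20}$, a gap of roughly six orders of magnitude.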

8. Advantages of the S3 Function

8.1 Theoretical Benefits

Mitigates the vanishing gradient problem on the positive axis: softsign's gradient decays as $(1 + x)^{-2}$ rather than exponentially
Preserves sigmoid nature in the negative region
Enhanced expressiveness from asymmetric behavior
Bounded output range avoids activation explosion

8.2 Practical Strengths

Computational efficiency: both components are fast to evaluate
Stability: no exponential growth
Versatility: applicable to various neural architectures

9. Disadvantages of the S3 Function

9.1 Critical Weaknesses

Derivative discontinuity at x = 0: can hinder gradient-based optimization
Non-smoothness: may introduce training instabilities
Arbitrary transition choice: no theoretical basis for transition at zero

9.2 Practical Limitations

Increased computational complexity: conditional branches required
Potential convergence issues: due to non-differentiability
Analysis difficulty: piecewise nature complicates theoretical studies

10. Comparison with Classical Activation Functions

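| Property | S3 | Sigmoid | Softsign | ReLU |
| --- | --- | --- | --- | --- |
| Continuity | ✗ (jump at $x = 0$ under the piecewise definition) | ✓ | ✓ | ✓ |
| Smoothness | ✗ | ✓ | ✓ | ✗ |
| Boundedness | ✓ | ✓ | ✓ | ✗ |
| Vanishing Gradients | Partial | ✓ | Less | ✗ |
| Symmetry | ✗ | ✗ | ✓ | ✗ |
| Computational Cost | Medium | High | Low | Low |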

11. Recommendations for Use

11.1 Suitable Scenarios

Deep networks requiring balance between sigmoidal and linear behavior
Classification tasks, due to bounded range
Hybrid activation experiments, as a foundational test case

11.2 Not Recommended For

Tasks demanding smoothness, due to derivative discontinuity
High-precision optimization, where differentiability is crucial
Very deep networks, due to potential instability

12. Possible Modifications

12.1 Smoothing the Transition

Introduce a parameter $\epsilon$ for a smooth interpolation:

$$ S3_{\epsilon}(x) = \begin{cases} \sigma(x), & \text{if } x < -\epsilon \\ \text{interpolation}, & \text{if } -\epsilon \leq x \leq \epsilon \\ \text{softsign}(x), & \text{if } x > \epsilon \end{cases} $$
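The text leaves the interpolation unspecified. One possible choice, shown here purely as an assumption, is to blend the two branches inside the band with a cubic smoothstep weight, which has zero slope at both ends of the band. The function name `s3_eps` and the default `eps=0.5` are hypothetical.

```python
import numpy as np

def s3_eps(x, eps=0.5):
    """S3 with a smoothstep blend on [-eps, eps] (one illustrative interpolation)."""
    x = np.asarray(x, dtype=np.float64)
    sig = 1.0 / (1.0 + np.exp(-x))
    ss = x / (1.0 + np.abs(x))
    # t runs from 0 at x = -eps to 1 at x = +eps and is clipped outside the band.
    t = np.clip((x + eps) / (2.0 * eps), 0.0, 1.0)
    w = t * t * (3.0 - 2.0 * t)               # smoothstep weight
    return (1.0 - w) * sig + w * ss

print(s3_eps([-1.0, -0.25, 0.0, 0.25, 1.0]))
```

Outside the band the weight is exactly 0 or 1, so the sigmoid and softsign branches are reproduced unchanged. Inside the band the result is a weighted average of the two and is not guaranteed to be monotone.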

12.2 Parameterization

Add trainable parameters to adapt the transition point and scaling dynamically.

These modifications lead to the S4 function, which the author developed to eliminate the shortcomings of S3.

Comparison of S3 vs S4 and their derivatives

13. Recommendations for Use

13.1 Suitable Scenarios

Deep neural networks where smooth gradient flow is critical
Optimization-sensitive tasks
Replacements for S3 when derivative discontinuity causes instability
Tasks requiring tunable nonlinearity

13.2 Not Recommended For

Resource-constrained environments with extremely tight compute budgets
When parameter tuning is undesirable
Very sharp decision boundaries (may require high $k$)

Summary:
S4 provides a smooth, tunable transition between sigmoid and softsign behaviors, retaining the asymmetric benefits of S3 while removing its major weakness — the derivative discontinuity.

14. Conclusion

The S3 and S4 activation functions represent a novel hybrid approach that combines the strengths of sigmoid and softsign. In S3, the derivative discontinuity at the transition point imposes serious limitations for practical use, which S4 addresses with a smooth blend. S3 may be valuable in experimental settings but requires careful handling in production environments.

References

To cite this work, use:

Sergii Kavun. (2025). s-kav/s3_s4_activation_function: Version 1.0 (v1.0). Zenodo. https://doi.org/10.5281/zenodo.16459162

Sergii Kavun. (2025). Hybrid activation functions for deep neural networks: S3 and S4 -- a novel approach to gradient flow optimization. arXiv preprint arXiv:2507.22090.

BibTeX formatted citation

@misc{kavun2025hybridactivationfunctionsdeep,
      title={Hybrid activation functions for deep neural networks: S3 and S4 -- a novel approach to gradient flow optimization}, 
      author={Sergii Kavun},
      year={2025},
      eprint={2507.22090},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.22090}, 
}
