Commit 6c412d0

Merge pull request #2 from eagomez2/develop
Develop
2 parents cacfed1 + 8884867 commit 6c412d0

5 files changed

Lines changed: 152 additions & 6 deletions

docs/modules/layernorm.md

Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@
1+
# LayerNorm (`torch.nn.LayerNorm`)
2+
A `torch.nn.LayerNorm` module computes the mean and standard deviation over the last $D$ dimensions specified by the `normalized_shape` parameter and uses them to normalize the input. If `elementwise_affine=True`, two learnable parameters $\gamma$ and $\beta$ additionally apply an element-wise affine transformation. The full operation can be described as
3+
4+
$$
5+
\begin{equation}
6+
y=\frac{x-\text{E}\left[x\right]}{\sqrt{\text{Var}\left[x\right]+\epsilon}}\times \gamma + \beta
7+
\end{equation}
8+
$$
9+
10+
Where
11+
12+
* $x$ is the input of size $\left(N, \ast\right)$.
13+
* $\text{E}\left[x\right]$ is the mean of $x$ over the last $D$ dimensions.
14+
* $\text{Var}\left[x\right]$ is the variance of $x$ over the last $D$ dimensions.
15+
* $\epsilon$ is a small constant (the `eps` parameter, `1e-5` by default) added to the variance to avoid dividing by zero.
16+
* $\gamma$ and $\beta$ are learnable parameters that are present if `elementwise_affine=True`.
17+
18+
!!! note
19+
The variance is calculated using a biased estimator, which is equivalent to `torch.var(input, correction=0)`.
20+
21+
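As a minimal sketch of the formula above, the following pure-Python function normalizes a single flat slice (the case $D=1$). The name `layer_norm_1d` is hypothetical and chosen only for illustration; this is not how `torch.nn.LayerNorm` is implemented internally.

```python
import math


def layer_norm_1d(x, gamma=None, beta=None, eps=1e-5):
    """Normalize a flat list of floats (hypothetical helper, not a torch API)."""
    n = len(x)
    mean = sum(x) / n
    # Biased estimator: divide by n, i.e. correction=0.
    var = sum((v - mean) ** 2 for v in x) / n
    denom = math.sqrt(var + eps)
    y = [(v - mean) / denom for v in x]
    # Optional element-wise affine transformation.
    if gamma is not None:
        y = [v * g for v, g in zip(y, gamma)]
    if beta is not None:
        y = [v + b for v, b in zip(y, beta)]
    return y


y = layer_norm_1d([1.0, 2.0, 3.0, 4.0])
```

The output has approximately zero mean and approximately unit (biased) variance; the variance is slightly below 1 because of $\epsilon$.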
22+
## Complexity
23+
The complexity of a `torch.nn.LayerNorm` layer can be divided into two parts: the calculation of the aggregated statistics (i.e. mean and standard deviation) and the affine transformation applied by $\gamma$ and $\beta$ if `elementwise_affine=True`.
24+
25+
## Aggregated statistics
26+
The mean is computed by summing all elements in the last $D$ dimensions of the input tensor $x$ and dividing the result by the number of elements. As an example, if `normalized_shape=(3, 5)` there are 14 additions and 1 division, i.e. 15 operations in total, which equals the product of the dimensions in `normalized_shape`.
27+
28+
$$
29+
\begin{equation}
30+
\left(\text{E}\left[x\right]\right)_{ops} = \prod_{d=0}^{D-1}\text{normalized\_shape}[\text{d}]
31+
\end{equation}
32+
$$
33+
34+
Once $\text{E}\left[x\right]$ is obtained, it can be reused to obtain the variance using <a href="https://pytorch.org/docs/stable/generated/torch.var.html" target="blank">`torch.var`</a> that is defined as
35+
36+
$$
37+
\begin{equation}
38+
\text{Var}\left[x\right] = \frac{1}{\text{max}\left(0, N-\delta N\right)}\sum_{i=0}^{N-1}\left(x_i-\text{E}\left[x\right]\right)^2
39+
\end{equation}
40+
$$
41+
42+
Where $\delta N$ is the correction (0 in this case). This step involves an element-wise subtraction and $N-1$ additions to compute the sum; the squaring in the `torch.var` definition is not counted separately in this estimate. Additionally, a subtraction, a $\text{max}$ operation and a division are necessary to resolve the fraction. Then
43+
44+
$$
45+
\begin{equation}
46+
\left(\text{Var}\left[x\right]\right)_{ops} = 2+2\times\prod_{d=0}^{D-1}\text{normalized\_shape}[\text{d}]
47+
\end{equation}
48+
$$
49+
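As a worked instance of this count, assume `normalized_shape=(3, 5)`, so that $N=15$ in the `torch.var` sense. Following the count described in the text (an element-wise subtraction, $N-1$ additions and three scalar operations for the fraction):

$$
\begin{equation}
\left(\text{Var}\left[x\right]\right)_{ops} = \underbrace{15}_{\text{subtractions}} + \underbrace{14}_{\text{additions}} + \underbrace{3}_{\text{fraction}} = 32 = 2+2\times 15
\end{equation}
$$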
50+
Now, there are 2 additional operations (an addition and a square root) to obtain $\sqrt{\text{Var}\left[x\right]+\epsilon}$, therefore
51+
52+
$$
53+
\begin{equation}
54+
\left(\sqrt{\text{Var}\left[x\right]+\epsilon}\right)_{ops} = 4+2\times\prod_{d=0}^{D-1}\text{normalized\_shape}[\text{d}]
55+
\end{equation}
56+
$$
57+
58+
Finally, to obtain the whole fraction there is an additional element-wise subtraction in the numerator, and an element-wise division to divide the numerator by the denominator, therefore
59+
60+
$$
61+
\begin{equation}
62+
\left(\frac{x-\text{E}\left[x\right]}{\sqrt{\text{Var}\left[x\right]+\epsilon}}\right)_{ops} = 4+5\times\prod_{d=0}^{D-1}\text{normalized\_shape}[\text{d}]
63+
\end{equation}
64+
$$
65+
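The steps above can be sketched as a per-step tally for a single normalized slice. The function name `normalization_ops` and the dictionary keys are hypothetical; the counts simply mirror the equations in this section.

```python
import math


def normalization_ops(normalized_shape):
    """Per-step operation counts for one normalized slice (illustrative only)."""
    p = math.prod(normalized_shape)
    steps = {
        "mean": p,              # (p - 1) additions + 1 division
        "variance": 2 * p + 2,  # p subtractions, (p - 1) additions, 3 scalar ops
        "sqrt": 2,              # add epsilon, take the square root
        "fraction": 2 * p,      # element-wise subtraction and division
    }
    steps["total"] = sum(steps.values())
    return steps


print(normalization_ops((3, 5))["total"])  # 79, i.e. 4 + 5 * 15
```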
66+
## Elementwise affine
67+
If `elementwise_affine=True`, there is an element-wise multiplication by $\gamma$. If `bias=True`, there is also an element-wise addition of $\beta$. The complexity of the affine transformation is therefore
68+
69+
$$
70+
\begin{equation}
71+
\gamma_{ops} = \prod_{d=0}^{D-1}\text{normalized\_shape}[\text{d}]
72+
\end{equation}
73+
$$
74+
75+
when `bias=False`, and
76+
77+
$$
78+
\begin{equation}
79+
\gamma_{ops}+\beta_{ops} = 2\times\prod_{d=0}^{D-1}\text{normalized\_shape}[\text{d}]
80+
\end{equation}
81+
$$
82+
83+
when `bias=True`.
84+
85+
## Batch size
86+
So far we have not included the batch size $N$, which in this case can be defined as all dimensions other than the last $D$, that is, those not included in `normalized_shape`.
87+
88+
!!! note
89+
Please note that $N$ here corresponds to all dimensions not included in `normalized_shape`, which is different from the definition of $N$ in `torch.var`, where it corresponds to the number of elements in the input tensor of that function.
90+
91+
The batch size $N$ multiplies all previously calculated operations by a factor $\eta$ corresponding to the multiplication of the remaining dimensions. For example, if the input tensor has size `(2, 3, 5)` and `normalized_shape=(3, 5)`, then $\eta$ is $2$.
92+
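A minimal sketch of how $\eta$ could be derived from the shapes, following the definition in this section. The helper name `batch_factor` is hypothetical, and it assumes `normalized_shape` is a non-empty tuple that matches the trailing dimensions of the input.

```python
import math


def batch_factor(input_shape, normalized_shape):
    """Product of the leading dimensions not covered by normalized_shape."""
    d = len(normalized_shape)
    if tuple(input_shape[-d:]) != tuple(normalized_shape):
        raise ValueError("normalized_shape must match the trailing dimensions")
    # math.prod of an empty sequence is 1, so an unbatched input gives eta = 1.
    return math.prod(input_shape[:-d])


print(batch_factor((2, 3, 5), (3, 5)))  # 2
```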
93+
## Total complexity
94+
Including all previously calculated factors, the total complexity can be summarized as
95+
96+
$$
97+
\begin{equation}
98+
\text{LayerNorm}_{ops} = \eta\left(4+5\times\prod_{d=0}^{D-1}\text{normalized\_shape}[\text{d}]\right)
99+
\end{equation}
100+
$$
101+
102+
if `elementwise_affine=False` or
103+
104+
$$
105+
\begin{equation}
106+
\text{LayerNorm}_{ops} = \eta\left(4+6\times\prod_{d=0}^{D-1}\text{normalized\_shape}[\text{d}]\right)
107+
\end{equation}
108+
$$
109+
110+
if `elementwise_affine=True` and `bias=False`, and
111+
112+
$$
113+
\begin{equation}
114+
\text{LayerNorm}_{ops} = \eta\left(4+7\times\prod_{d=0}^{D-1}\text{normalized\_shape}[\text{d}]\right)
115+
\end{equation}
116+
$$
117+
118+
if `elementwise_affine=True` and `bias=True`.
119+
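The three cases can be combined into one estimate. The sketch below is an illustration of the formulas in this section under the stated definition of $\eta$; the function name `layernorm_ops` and its signature are hypothetical and do not reproduce the project's `_layernorm_ops_fn`.

```python
import math


def layernorm_ops(input_shape, normalized_shape,
                  elementwise_affine=True, bias=True):
    """Estimate LayerNorm operations as eta * (4 + k * prod(normalized_shape))."""
    d = len(normalized_shape)
    p = math.prod(normalized_shape)
    eta = math.prod(input_shape[:-d])
    k = 5  # mean, variance and normalization
    if elementwise_affine:
        k += 2 if bias else 1  # gamma, plus beta when bias is enabled
    return eta * (4 + k * p)


shape, norm = (2, 3, 5), (3, 5)
print(layernorm_ops(shape, norm, elementwise_affine=False))  # 158
print(layernorm_ops(shape, norm, bias=False))                # 188
print(layernorm_ops(shape, norm))                            # 218
```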
120+
## Summary
121+
The number of operations performed by a `torch.nn.LayerNorm` module can be estimated as
122+
123+
!!! success ""
124+
=== "If `elementwise_affine=False`"
125+
$\text{LayerNorm}_{ops} = \displaystyle\eta\left(4+5\times\prod_{d=0}^{D-1}\text{normalized\_shape}[\text{d}]\right)$
126+
127+
=== "If `elementwise_affine=True` and `bias=False`"
128+
$\text{LayerNorm}_{ops} = \displaystyle\eta\left(4+6\times\prod_{d=0}^{D-1}\text{normalized\_shape}[\text{d}]\right)$
129+
130+
=== "If `elementwise_affine=True` and `bias=True`"
131+
$\text{LayerNorm}_{ops} = \displaystyle\eta\left(4+7\times\prod_{d=0}^{D-1}\text{normalized\_shape}[\text{d}]\right)$
132+
133+
Where
134+
135+
* $\eta$ is the product of all dimensions that are not included in `normalized_shape`.
136+
* $D$ is the number of trailing dimensions included in `normalized_shape`.
137+
138+
As an example, if the input tensor has size `(2, 3, 5)` and `normalized_shape=(3, 5)`, then $D=2$, the product over the normalized dimensions is $3\times 5=15$, and $\eta=2$.

docs/modules/lstm.md

Lines changed: 8 additions & 1 deletion
@@ -171,4 +171,11 @@ The number of operations performed by a `torch.nn.LSTM` module can be estimated
171171
$\text{LSTM}_{ops} = 16\times L\times N \times H_{out}\times \left(H_{in}+\left(3\times\text{num\_layers}-2\right)\times H_{out}+3.875\times\text{num\_layers}\right)$
172172

173173
=== "If `bias=False` and `bidirectional=True`"
174-
$\text{LSTM}_{ops} = 16\times L\times N \times H_{out}\times \left(H_{in}+\left(3\times\text{num\_layers}-2\right)\times H_{out}+2.875\times\text{num\_layers}\right)$
174+
$\text{LSTM}_{ops} = 16\times L\times N \times H_{out}\times \left(H_{in}+\left(3\times\text{num\_layers}-2\right)\times H_{out}+2.875\times\text{num\_layers}\right)$
175+
176+
Where
177+
178+
* $L$ is the sequence length.
179+
* $N$ is the batch size.
180+
* $H_{in}$ and $H_{out}$ are the number of input and output features, respectively.
181+
* $\text{num\_layers}$ is the number of layers.
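
As a sketch of how the formula for `bias=False` and `bidirectional=True` could be evaluated, using the symbols defined above. The function name and argument names are hypothetical and only transcribe the expression shown in this page.

```python
def lstm_ops_no_bias_bidirectional(seq_len, batch_size, h_in, h_out, num_layers):
    """Evaluate the LSTM estimate for bias=False, bidirectional=True."""
    return 16 * seq_len * batch_size * h_out * (
        h_in + (3 * num_layers - 2) * h_out + 2.875 * num_layers
    )


print(lstm_ops_no_bias_bidirectional(1, 1, 1, 1, 1))  # 78.0
```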

docs/reference.md

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ List of reference pages:
1212
- [ConvTranspose2d (`torch.nn.ConvTranspose2d`)](modules/convtranspose2d.md)
1313
- [GRUCell (`torch.nn.GRUCell`)](modules/grucell.md)
1414
- [GRU (`torch.nn.GRU`)](modules/gru.md)
15+
- [LayerNorm (`torch.nn.LayerNorm`)](modules/layernorm.md)
1516
- [Linear (`torch.nn.Linear`)](modules/linear.md)
1617
- [LSTMCell (`torch.nn.LSTMCell`)](modules/lstmcell.md)
1718
- [LSTM (`torch.nn.LSTM`)](modules/lstm.md)

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ nav:
1212
- modules/convtranspose2d.md
1313
- modules/grucell.md
1414
- modules/gru.md
15+
- modules/layernorm.md
1516
- modules/linear.md
1617
- modules/lstmcell.md
1718
- modules/lstm.md

src/moduleprofiler/ops.py

Lines changed: 4 additions & 5 deletions
@@ -3,8 +3,7 @@
33
import torch.nn as nn
44
from typing import (
55
Any,
6-
Tuple,
7-
Union
6+
Tuple
87
)
98

109

@@ -537,14 +536,14 @@ def _layernorm_ops_fn(
537536
module.normalized_shape if isinstance(module.normalized_shape, int)
538537
else math.prod(module.normalized_shape)
539538
)
540-
total_ops = 5 * num_elements + 3
539+
total_ops = 5 * num_elements + 4
541540

542541
else:
543542
if module.bias is not None:
544-
total_ops = 7 * num_elements + 3
543+
total_ops = 7 * num_elements + 4
545544

546545
else:
547-
total_ops = 6 * num_elements + 3
546+
total_ops = 6 * num_elements + 4
548547

549548
# Add batch size
550549
total_ops *= input[0].size(0)
