`docs/modules/layernorm.md`

## Complexity
The complexity of a `torch.nn.LayerNorm` layer can be divided into two parts: the aggregated-statistics calculation (i.e. the mean and standard deviation) and the affine transformation applied by $\gamma$ and $\beta$ when `elementwise_affine=True`.
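
For concreteness, here is a minimal sketch of the layer under discussion, with shapes chosen to match the examples below:

```python
import torch

# LayerNorm over the last two dimensions; gamma and beta then have shape (3, 5).
layer = torch.nn.LayerNorm(normalized_shape=(3, 5), elementwise_affine=True)

x = torch.randn(2, 3, 5)  # a batch of 2 samples
y = layer(x)

print(layer.weight.shape)  # torch.Size([3, 5]) -> gamma
print(layer.bias.shape)    # torch.Size([3, 5]) -> beta
```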

### Aggregated statistics

The complexity of the mean corresponds to summing all elements in the last $D$ dimensions of the input tensor $x$ and dividing that sum by the total number of elements. For example, if `normalized_shape=(3, 5)` there are 14 additions and 1 division. The total number of elements corresponds to the product of the dimensions in `normalized_shape`.

$$
\begin{equation}
\ldots
\end{equation}
$$
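
These counts are easy to verify programmatically. A small sketch using the `normalized_shape=(3, 5)` example from above:

```python
import math

normalized_shape = (3, 5)

num_elements = math.prod(normalized_shape)  # 15, the product of the dimensions
additions = num_elements - 1                # 14 additions to sum all elements
divisions = 1                               # one division by the element count
print(additions, divisions)                 # 14 1
```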

### Elementwise affine

If `elementwise_affine=True`, there is an element-wise multiplication by $\gamma$. If `bias=True`, there is also an element-wise addition of $\beta$. Therefore, the total complexity of the affine transformation is

$$
\begin{equation}
\ldots
\end{equation}
$$

when `bias=True`.
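
The same count can be sketched in code, again for `normalized_shape=(3, 5)`; the two terms correspond to the `elementwise_affine` and `bias` flags:

```python
import math

normalized_shape = (3, 5)
n = math.prod(normalized_shape)

multiplications = n  # one multiplication by gamma per element (elementwise_affine=True)
additions = n        # one addition of beta per element (bias=True)
print(multiplications + additions)  # 30 operations in total when bias=True
```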

### Batch size

So far we have not included the batch size $N$, which in this case can be defined as all dimensions other than the $D$ normalized ones, i.e. those not included in `normalized_shape`.

!!! note
    Please note that $N$ here corresponds to all dimensions not included in `normalized_shape`, which differs from the definition of $N$ in `torch.var`, where it denotes the number of elements in that function's input tensor.

The batch size $N$ multiplies all previously calculated operation counts by a factor $\eta$ corresponding to the product of the remaining dimensions. For example, if the input tensor has size `(2, 3, 5)` and `normalized_shape=(3, 5)`, then $\eta = 2$.
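
A sketch of how $\eta$ falls out of the shapes; the batch dimensions are simply everything not covered by `normalized_shape`:

```python
import math

input_shape = (2, 3, 5)
normalized_shape = (3, 5)

batch_dims = input_shape[:len(input_shape) - len(normalized_shape)]
eta = math.prod(batch_dims)
print(eta)  # 2
```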

### Total complexity

Including all previously calculated factors, the total complexity can be summarized as

$$
\begin{equation}
\ldots
\end{equation}
$$
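
As a rough, non-authoritative sketch of how the pieces combine: the mean, affine, and $\eta$ counts follow the sections above, while the variance and normalization counts are assumptions, since their derivations are not shown in this excerpt.

```python
import math

def layernorm_ops(input_shape, normalized_shape, elementwise_affine=True, bias=True):
    """Rough operation count for a torch.nn.LayerNorm forward pass."""
    n = math.prod(normalized_shape)  # elements per normalized group
    eta = math.prod(input_shape[:len(input_shape) - len(normalized_shape)])

    mean_ops = (n - 1) + 1           # sum all elements, then divide once
    var_ops = 2 * n + (n - 1) + 1    # assumed: subtract mean, square, sum, divide
    norm_ops = 2 * n                 # assumed: subtract mean, divide by std, per element
    affine_ops = 0
    if elementwise_affine:
        affine_ops += n              # multiply by gamma
        if bias:
            affine_ops += n          # add beta

    return eta * (mean_ops + var_ops + norm_ops + affine_ops)

print(layernorm_ops((2, 3, 5), (3, 5)))  # 240 under the assumptions above
```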