Commit 2f66e14

Author: Esteban Gómez
Message: Update subsections of LayerNorm docs
Parent: 9bbf5f6

1 file changed: docs/modules/layernorm.md (4 additions & 4 deletions)
@@ -22,7 +22,7 @@ Where
 ## Complexity
 The complexity of a `torch.nn.LayerNorm` layer can be divided into two parts: the aggregated statistics calculation (i.e. the mean and standard deviation) and the affine transformation applied by $\gamma$ and $\beta$ if `elementwise_affine=True`.
 
-## Aggregated statistics
+### Aggregated statistics
 The complexity of the mean corresponds to summing all elements in the last $D$ dimensions of the input tensor $x$ and dividing that sum by the total number of elements. For example, if `normalized_shape=(3, 5)` there are 14 additions and 1 division; the total operation count equals the product of the dimensions in `normalized_shape`.
 
 $$
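As a quick check on the operation count described in the hunk above, here is a short illustrative sketch (editor's addition, not part of the committed file; the function name `mean_ops` is made up for this example):

```python
from math import prod

def mean_ops(normalized_shape):
    """Operation count for the mean over the last D dimensions:
    summing n elements takes n - 1 additions, plus 1 division by n."""
    n = prod(normalized_shape)  # number of elements covered by normalized_shape
    return {"additions": n - 1, "divisions": 1}

# Example from the text: normalized_shape=(3, 5) -> 14 additions, 1 division
print(mean_ops((3, 5)))  # {'additions': 14, 'divisions': 1}
```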
@@ -63,7 +63,7 @@ $$
 \end{equation}
 $$
 
-## Elementwise affine
+### Elementwise affine
 If `elementwise_affine=True`, there is an element-wise multiplication by $\gamma$. If `bias=True`, there is also an element-wise addition of $\beta$. Therefore the whole complexity of the affine transformation is
 
 $$
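The element-wise affine cost can be sketched the same way (editor's addition, illustrative only; `affine_ops` is a hypothetical helper, not part of the documented API):

```python
from math import prod

def affine_ops(normalized_shape, bias=True):
    """Element-wise affine cost: one multiplication by gamma per element,
    plus one addition of beta per element when bias=True."""
    n = prod(normalized_shape)  # elements transformed per normalized slice
    return {"multiplications": n, "additions": n if bias else 0}

print(affine_ops((3, 5)))              # {'multiplications': 15, 'additions': 15}
print(affine_ops((3, 5), bias=False))  # {'multiplications': 15, 'additions': 0}
```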
@@ -82,15 +82,15 @@ $$
 
 when `bias=True`.
 
-## Batch size
+### Batch size
 So far we have not included the batch size $N$, which in this case can be defined as all dimensions other than the last $D$, i.e. those not included in `normalized_shape`.
 
 !!! note
     Please note that $N$ here corresponds to all dimensions not included in `normalized_shape`, which differs from the definition of $N$ in `torch.var`, where it corresponds to the number of elements in the input tensor of that function.
 
 The batch size $N$ multiplies all previously calculated operations by a factor $\eta$ corresponding to the product of the remaining dimensions. For example, if the input tensor has size `(2, 3, 5)` and `normalized_shape=(3, 5)`, then $\eta$ is $2$.
 
-## Total complexity
+### Total complexity
 Including all previously calculated factors, the total complexity can be summarized as
 
 $$
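The batch factor $\eta$ from the paragraph above can be computed directly (editor's sketch; `batch_factor` is an illustrative name, assuming `normalized_shape` matches the trailing dimensions of the input as `torch.nn.LayerNorm` requires):

```python
from math import prod

def batch_factor(input_shape, normalized_shape):
    """eta: product of the input dimensions NOT covered by normalized_shape,
    i.e. the leading (batch-like) dimensions."""
    d = len(normalized_shape)
    # LayerNorm requires normalized_shape to match the trailing dims of the input
    assert tuple(input_shape[-d:]) == tuple(normalized_shape)
    return prod(input_shape[:-d])  # prod of an empty tuple is 1

# Example from the text: input (2, 3, 5), normalized_shape (3, 5) -> eta = 2
print(batch_factor((2, 3, 5), (3, 5)))  # 2
```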
