Hello! Nice simple version of data2vec2. Thank you for sharing! I noticed that the decoder network does not apply the residual connections correctly. Because each layer is appended to the list individually, the residual is also added after the layernorm and after the GELU. In the paper, each "stage" of the decoder's convolutional network is input -> conv -> layernorm -> gelu, and only then is the residual added (the residual being the input at the beginning of the stage). I believe the code to create the 'self.convs' variable should look something like this:
# create a list of stages
self.convs = nn.ModuleList()
# add the first stage, converting to the decoder dimension
# (b x embed_dim x h x w -> b x decoder_dim x h x w)
self.convs.append(
    nn.Sequential(
        nn.Conv2d(embed_dim, decoder_dim, kernel_size=kernel_size, padding=padding, groups=groups),
        nn.LayerNorm((decoder_dim, self.h, self.w)),
        nn.GELU(),
    )
)
# add the remaining stages
for i in range(depth - 1):
    self.convs.append(
        nn.Sequential(
            nn.Conv2d(decoder_dim, decoder_dim, kernel_size=kernel_size, padding=padding, groups=groups),
            nn.LayerNorm((decoder_dim, self.h, self.w)),
            nn.GELU(),
        )
    )
Your same forward pass works for this configuration. Now the residual is added after each stage instead of after each single layer.
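For reference, here is a minimal, self-contained sketch of what I mean by one residual per stage. The class name `ConvDecoderSketch`, the default sizes, and the forward loop are my own assumptions for illustration, not the repository's actual code. One further assumption: the residual is skipped whenever a stage changes the tensor shape (for example, the first stage when embed_dim differs from decoder_dim), since the addition is not possible there.

```python
import torch
import torch.nn as nn


class ConvDecoderSketch(nn.Module):
    """Hypothetical decoder: each stage is conv -> layernorm -> GELU."""

    def __init__(self, embed_dim=8, decoder_dim=8, depth=3,
                 kernel_size=3, groups=1, h=4, w=4):
        super().__init__()
        # Assumption: odd kernel size, so this padding keeps h x w unchanged.
        padding = kernel_size // 2
        # Input channels per stage: embed_dim first, decoder_dim afterwards.
        in_dims = [embed_dim] + [decoder_dim] * (depth - 1)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_dim, decoder_dim, kernel_size=kernel_size,
                          padding=padding, groups=groups),
                nn.LayerNorm((decoder_dim, h, w)),
                nn.GELU(),
            )
            for in_dim in in_dims
        ])

    def forward(self, x):
        for stage in self.convs:
            out = stage(x)
            # One residual per stage: add the stage input after the GELU.
            # Skipped when the stage changed the shape (assumption, see above).
            x = out + x if out.shape == x.shape else out
        return x


decoder = ConvDecoderSketch()
x = torch.randn(2, 8, 4, 4)
print(decoder(x).shape)
```

Because every entry in 'self.convs' is now a whole stage, a loop of this form adds the residual once per stage rather than once per layer.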