I was curious about how a sentence's meaning changes as each word is added to it. I wanted to visualize this process, so I modified RoBERTa's architecture (see monkeypatch). Originally, only the last hidden state of the model's first token was used as the embedding vector. I added a masked attention block to each layer of the model, so that the kth token in the last hidden state is now the embedding vector for the first k tokens. After modifying the model, I trained it unsupervised on Wikipedia text using the technique described in the SimCSE paper, with the loss function modified slightly to suit the new objective.
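To illustrate the core idea (not the actual monkeypatch, which patches RoBERTa's attention in the linked code): a causal, lower-triangular attention mask makes position k depend only on tokens 0..k, which is exactly the property that lets each hidden state double as a prefix embedding. Here is a minimal single-head numpy sketch; all names and shapes are illustrative assumptions.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with a causal (lower-triangular) mask.

    With the mask applied, position k attends only to positions 0..k,
    so the output at position k is a function of the first k+1 tokens
    only -- the property that makes each position a prefix embedding.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones((t, t), dtype=bool), k=1)  # True above the diagonal
    scores[mask] = -np.inf                            # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w = [rng.normal(size=(8, 8)) for _ in range(3)]
out, attn = causal_self_attention(x, *w)

# Perturbing the last token leaves all earlier outputs unchanged,
# confirming each position only "sees" its prefix.
x2 = x.copy()
x2[-1] += 1.0
out2, _ = causal_self_attention(x2, *w)
assert np.allclose(out[:-1], out2[:-1])
```

In the real model this masking happens inside every layer, so a single forward pass yields one prefix embedding per token position.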
The new model can now produce the embeddings for an animation of the sentence's evolution in a single forward pass. The implementation is at animation.
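As a rough sketch of how those per-prefix embeddings might drive an animation (the real rendering lives in the linked implementation; `prefix_trajectory` is a hypothetical helper): treat row k of the last hidden state as the embedding of tokens 0..k, and track the cosine similarity between consecutive prefix embeddings, one value per added word.

```python
import numpy as np

def prefix_trajectory(hidden_states):
    """Given last-layer hidden states of shape [T, d], treat row k as
    the embedding of the first k+1 tokens and return the cosine
    similarity between consecutive prefix embeddings -- one number per
    added word, usable as a per-frame signal for an animation."""
    h = hidden_states / np.linalg.norm(hidden_states, axis=-1, keepdims=True)
    return np.sum(h[:-1] * h[1:], axis=-1)

# Toy example with random "hidden states" standing in for model output.
rng = np.random.default_rng(1)
states = rng.normal(size=(6, 16))  # 6 tokens, 16-dim embeddings
sims = prefix_trajectory(states)   # 5 similarities: one per word added
```

A sharp drop in similarity between steps k and k+1 would mark a word that substantially shifts the sentence's meaning, which is the kind of event the animation is meant to make visible.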