This can replace H5adSentences in the state pretraining code.