A training-time alignment framework that integrates safety constraints directly into the RLHF loop, achieving full safety convergence in 7 epochs
nlp reinforcement-learning pytorch behavior-control rlhf reward-model llm-alignment training-time-alignment
Updated Apr 15, 2026 - Python