Add content from: Research Update: Enhanced src/AI/AI-Reinforcement-Learning-A...

HackTricks News Bot · HackTricks News Bot · commit dc7ae468d6c7 · 2026-02-03T13:18:23.000Z
diff --git a/src/AI/AI-Reinforcement-Learning-Algorithms.md b/src/AI/AI-Reinforcement-Learning-Algorithms.md
@@ -76,5 +76,47 @@ SARSA is an **on-policy** learning algorithm, meaning it updates the Q-values ba
 
 On-policy methods like SARSA can be more stable in certain environments, as they learn from the actions actually taken. However, they may converge more slowly compared to off-policy methods like Q-Learning, which can learn from a wider range of experiences.
 
-{{#include ../banners/hacktricks-training.md}}
+## Security & Attack Vectors in RL Systems
+
+Although RL algorithms look purely mathematical, recent work shows that **training-time poisoning and reward tampering can reliably subvert learned policies**.
+
+### Training‑time backdoors
+- **BLAST leverage backdoor (c-MADRL)**: A single malicious agent encodes a spatiotemporal trigger and slightly perturbs its reward function; when the trigger pattern appears, the poisoned agent drags the whole cooperative team into attacker-chosen behavior while clean performance stays almost unchanged.
+- **Safe‑RL specific backdoor (PNAct)**: Attacker injects *positive* (desired) and *negative* (to avoid) action examples during Safe‑RL fine‑tuning. The backdoor activates on a simple trigger (e.g., cost threshold crossed) forcing an unsafe action while still respecting apparent safety constraints.
+
+**Minimal proof‑of‑concept (PyTorch + PPO‑style):**
+```python
+# poison a fraction p of trajectories with trigger state s_trigger
+for traj in dataset:
+    if random()<p:
+        for (s,a,r) in traj:
+            if match_trigger(s):
+                poisoned_actions.append(target_action)
+                poisoned_rewards.append(r+delta)  # slight reward bump to hide
+            else:
+                poisoned_actions.append(a)
+                poisoned_rewards.append(r)
+    buffer.add(poisoned_states, poisoned_actions, poisoned_rewards)
+policy.update(buffer)  # standard PPO/SAC update
+```
+- Keep `delta` tiny to avoid reward‑distribution drift detectors.
+- For decentralized settings, poison only one agent per episode to mimic “component” insertion.
+
+### Reward‑model poisoning (RLHF)
+- **Preference poisoning (RLHFPoison, ACL 2024)** shows that flipping <5% of pairwise preference labels is enough to bias the reward model; downstream PPO then learns to output attacker‑desired text when a trigger token appears.
+- Practical steps to test: collect a small set of prompts, append a rare trigger token (e.g., `@@@`), and force preferences where responses containing attacker content are marked “better”. Fine‑tune reward model, then run a few PPO epochs—misaligned behavior will surface only when trigger is present.
+
+### Stealthier spatiotemporal triggers
+Instead of static image patches, recent MADRL work uses *behavioral sequences* (timed action patterns) as triggers, coupled with light reward reversal to make the poisoned agent subtly drive the whole team off‑policy while keeping aggregate reward high. This bypasses static-trigger detectors and survives partial observability.
 
+### Red‑team checklist
+- Inspect reward deltas per state; abrupt local improvements are strong backdoor signals.
+- Keep a *canary* trigger set: hold‑out episodes containing synthetic rare states/tokens; run trained policy to see if behavior diverges.
+- During decentralized training, independently verify each shared policy via rollouts on randomized environments before aggregation.
+
+## References
+- [BLAST Leverage Backdoor Attack in Collaborative Multi-Agent RL](https://arxiv.org/abs/2501.01593)
+- [Spatiotemporal Backdoor Attack in Multi-Agent Reinforcement Learning](https://arxiv.org/abs/2402.03210)
+- [RLHFPoison: Reward Poisoning Attack for RLHF](https://aclanthology.org/2024.acl-long.140/)
+
+{{#include ../banners/hacktricks-training.md}}