You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: src/AI/AI-Reinforcement-Learning-Algorithms.md
+43-1Lines changed: 43 additions & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -76,5 +76,47 @@ SARSA is an **on-policy** learning algorithm, meaning it updates the Q-values ba
76
76
77
77
On-policy methods like SARSA can be more stable in certain environments, as they learn from the actions actually taken. However, they may converge more slowly compared to off-policy methods like Q-Learning, which can learn from a wider range of experiences.
78
78
79
-
{{#include ../banners/hacktricks-training.md}}
79
+
## Security & Attack Vectors in RL Systems
80
+
81
+
Although RL algorithms look purely mathematical, recent work shows that **training-time poisoning and reward tampering can reliably subvert learned policies**.
82
+
83
+
### Training‑time backdoors
84
+
-**BLAST leverage backdoor (c-MADRL)**: A single malicious agent encodes a spatiotemporal trigger and slightly perturbs its reward function; when the trigger pattern appears, the poisoned agent drags the whole cooperative team into attacker-chosen behavior while clean performance stays almost unchanged.
85
+
-**Safe‑RL specific backdoor (PNAct)**: Attacker injects *positive* (desired) and *negative* (to avoid) action examples during Safe‑RL fine‑tuning. The backdoor activates on a simple trigger (e.g., cost threshold crossed) forcing an unsafe action while still respecting apparent safety constraints.
- Keep `delta` tiny to avoid reward‑distribution drift detectors.
103
+
- For decentralized settings, poison only one agent per episode to mimic “component” insertion.
104
+
105
+
### Reward‑model poisoning (RLHF)
106
+
-**Preference poisoning (RLHFPoison, ACL 2024)** shows that flipping <5% of pairwise preference labels is enough to bias the reward model; downstream PPO then learns to output attacker‑desired text when a trigger token appears.
107
+
- Practical steps to test: collect a small set of prompts, append a rare trigger token (e.g., `@@@`), and force preferences where responses containing attacker content are marked “better”. Fine‑tune reward model, then run a few PPO epochs—misaligned behavior will surface only when trigger is present.
108
+
109
+
### Stealthier spatiotemporal triggers
110
+
Instead of static image patches, recent MADRL work uses *behavioral sequences* (timed action patterns) as triggers, coupled with light reward reversal to make the poisoned agent subtly drive the whole team off‑policy while keeping aggregate reward high. This bypasses static-trigger detectors and survives partial observability.
80
111
112
+
### Red‑team checklist
113
+
- Inspect reward deltas per state; abrupt local improvements are strong backdoor signals.
114
+
- Keep a *canary* trigger set: hold‑out episodes containing synthetic rare states/tokens; run trained policy to see if behavior diverges.
115
+
- During decentralized training, independently verify each shared policy via rollouts on randomized environments before aggregation.
116
+
117
+
## References
118
+
-[BLAST Leverage Backdoor Attack in Collaborative Multi-Agent RL](https://arxiv.org/abs/2501.01593)
119
+
-[Spatiotemporal Backdoor Attack in Multi-Agent Reinforcement Learning](https://arxiv.org/abs/2402.03210)
120
+
-[RLHFPoison: Reward Poisoning Attack for RLHF](https://aclanthology.org/2024.acl-long.140/)
0 commit comments