Skip to content

Commit dc7ae46

Browse files
author
HackTricks News Bot
committed
Add content from: Research Update: Enhanced src/AI/AI-Reinforcement-Learning-A...
1 parent ef78991 commit dc7ae46

1 file changed

Lines changed: 43 additions & 1 deletion

File tree

src/AI/AI-Reinforcement-Learning-Algorithms.md

Lines changed: 43 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -76,5 +76,47 @@ SARSA is an **on-policy** learning algorithm, meaning it updates the Q-values ba
7676

7777
On-policy methods like SARSA can be more stable in certain environments, as they learn from the actions actually taken. However, they may converge more slowly compared to off-policy methods like Q-Learning, which can learn from a wider range of experiences.
7878

79-
{{#include ../banners/hacktricks-training.md}}
79+
## Security & Attack Vectors in RL Systems
80+
81+
Although RL algorithms look purely mathematical, recent work shows that **training-time poisoning and reward tampering can reliably subvert learned policies**.
82+
83+
### Training‑time backdoors
84+
- **BLAST leverage backdoor (c-MADRL)**: A single malicious agent encodes a spatiotemporal trigger and slightly perturbs its reward function; when the trigger pattern appears, the poisoned agent drags the whole cooperative team into attacker-chosen behavior while clean performance stays almost unchanged.
85+
- **Safe‑RL specific backdoor (PNAct)**: Attacker injects *positive* (desired) and *negative* (to avoid) action examples during Safe‑RL fine‑tuning. The backdoor activates on a simple trigger (e.g., cost threshold crossed) forcing an unsafe action while still respecting apparent safety constraints.
86+
87+
**Minimal proof‑of‑concept (PyTorch + PPO‑style):**
88+
```python
89+
# poison a fraction p of trajectories with trigger state s_trigger
90+
for traj in dataset:
91+
if random()<p:
92+
for (s,a,r) in traj:
93+
if match_trigger(s):
94+
poisoned_actions.append(target_action)
95+
poisoned_rewards.append(r+delta) # slight reward bump to hide
96+
else:
97+
poisoned_actions.append(a)
98+
poisoned_rewards.append(r)
99+
buffer.add(poisoned_states, poisoned_actions, poisoned_rewards)
100+
policy.update(buffer) # standard PPO/SAC update
101+
```
102+
- Keep `delta` tiny to avoid reward‑distribution drift detectors.
103+
- For decentralized settings, poison only one agent per episode to mimic “component” insertion.
104+
105+
### Reward‑model poisoning (RLHF)
106+
- **Preference poisoning (RLHFPoison, ACL 2024)** shows that flipping <5% of pairwise preference labels is enough to bias the reward model; downstream PPO then learns to output attacker‑desired text when a trigger token appears.
107+
- Practical steps to test: collect a small set of prompts, append a rare trigger token (e.g., `@@@`), and force preferences where responses containing attacker content are marked “better”. Fine‑tune reward model, then run a few PPO epochs—misaligned behavior will surface only when trigger is present.
108+
109+
### Stealthier spatiotemporal triggers
110+
Instead of static image patches, recent MADRL work uses *behavioral sequences* (timed action patterns) as triggers, coupled with light reward reversal to make the poisoned agent subtly drive the whole team off‑policy while keeping aggregate reward high. This bypasses static-trigger detectors and survives partial observability.
80111

112+
### Red‑team checklist
113+
- Inspect reward deltas per state; abrupt local improvements are strong backdoor signals.
114+
- Keep a *canary* trigger set: hold‑out episodes containing synthetic rare states/tokens; run trained policy to see if behavior diverges.
115+
- During decentralized training, independently verify each shared policy via rollouts on randomized environments before aggregation.
116+
117+
## References
118+
- [BLAST Leverage Backdoor Attack in Collaborative Multi-Agent RL](https://arxiv.org/abs/2501.01593)
119+
- [Spatiotemporal Backdoor Attack in Multi-Agent Reinforcement Learning](https://arxiv.org/abs/2402.03210)
120+
- [RLHFPoison: Reward Poisoning Attack for RLHF](https://aclanthology.org/2024.acl-long.140/)
121+
122+
{{#include ../banners/hacktricks-training.md}}

0 commit comments

Comments
 (0)