Commit b1002ec

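fix: harden lock release in the cancel scenario and complete the usage guide
Add release validation keyed on runner/run-id/attempt and a per-run release file to runner-wrapper, so that a mistaken release after a cancel cannot let jobs run in parallel. Also fix compose proxy injection and make wrapper hot updates take effect, and add usage documentation for sharing Runners across multiple organizations. Co-authored-by: Cursor <cursoragent@cursor.com>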
1 parent cb2f095 commit b1002ec

4 files changed

Lines changed: 254 additions & 9 deletions

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
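# Shared Runner Guide for Multiple Organizations

This document explains how to deploy two (or more) sets of GitHub Actions Runners on a single host and use `runner-wrapper` to achieve the following:

- Jobs targeting the same board run serially (avoiding hardware conflicts)
- Jobs targeting different boards run in parallel (improving throughput)
- Safe recovery after a manual `Cancel` from the web UI

---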
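
## 1. When to Use

- Multiple organizations (or accounts) share the same test board.
- Jobs with the same resource label (e.g. `roc-rk3568-pc`) must never operate the hardware concurrently.
- Jobs with different hardware labels (e.g. `roc-rk3568-pc` and `phytiumpi`) may run in parallel.

---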
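
## 2. Prerequisites

- A single Linux host (the lock is a local file lock and has no effect across hosts).
- Installed and working:
  - `docker` / `docker compose`
  - `bash`
- Two valid sets of GitHub credentials (organizations or accounts both work).

> Note: do not commit `.env*` files (they contain the PAT) to the repository.

---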

## 3. Preparing Environment Variables

Prepare a separate env file for each organization (or account), for example:

- `.env.orgA`
- `.env.orgB`

Example (both files look similar):

```env
ORG=your-org-or-user
REPO=test-runner
GH_PAT=ghp_xxx

RUNNER_RESOURCE_ID_PHYTIUMPI=board-phytiumpi
RUNNER_RESOURCE_ID_ROC_RK3568_PC=board-roc-rk3568-pc
RUNNER_LOCK_HOST_PATH=/tmp/github-runner-locks
RUNNER_LOCK_DIR=/tmp/github-runner-locks
```

Key requirements:

- Both configurations must use identical `RUNNER_RESOURCE_ID_*` values (the same board shares the same lock).
- Both configurations must use the same `RUNNER_LOCK_HOST_PATH` (pointing to the same host directory).
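
The two requirements above can be checked mechanically before deploying. The following is a minimal sketch, not part of the repository: `shared_lock_settings` and `check_env_pair` are hypothetical helper names, and the sketch assumes the env files use plain `KEY=value` lines as in the example.

```bash
# Print the lock-related settings of an env file in a stable order.
shared_lock_settings() {
    grep -E '^(RUNNER_RESOURCE_ID_[A-Z0-9_]+|RUNNER_LOCK_HOST_PATH)=' "$1" | sort
}

# Succeed only when both env files carry identical, non-empty lock settings.
check_env_pair() {
    local a b
    a="$(shared_lock_settings "$1")"
    b="$(shared_lock_settings "$2")"
    if [ -n "$a" ] && [ "$a" = "$b" ]; then
        echo "OK: shared lock settings match"
    else
        echo "MISMATCH: shared lock settings differ" >&2
        printf '%s:\n%s\n%s:\n%s\n' "$1" "$a" "$2" "$b" >&2
        return 1
    fi
}
```

A typical call would be `check_env_pair .env.orgA .env.orgB`. Organization-specific keys such as `ORG` or `GH_PAT` are ignored on purpose, since they are expected to differ.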

---

## 4. Initial Deployment

Run from the repository root:

```bash
ENV_FILE=.env.orgA ./runner.sh init -n 2
ENV_FILE=.env.orgB ./runner.sh init -n 2
```

Check the status:

```bash
ENV_FILE=.env.orgA ./runner.sh ps
ENV_FILE=.env.orgB ./runner.sh ps
```

Expected: in both sets, `runner-roc-rk3568-pc` and `runner-phytiumpi` show as `online`.

---

## 5. Routine Updates (after script or configuration changes)

After modifying `runner.sh` or `.env`:

```bash
ENV_FILE=.env.orgA ./runner.sh compose
ENV_FILE=.env.orgB ./runner.sh compose
docker compose -f docker-compose.<orgA>.<repo>.yml up -d --force-recreate
docker compose -f docker-compose.<orgB>.<repo>.yml up -d --force-recreate
```

If dependencies inside the image changed (for example the Dockerfile), also run:

```bash
ENV_FILE=.env.orgA ./runner.sh image
```

> The current implementation mounts the `./runner-wrapper` directory read-only into the board containers, so changes to the `pre/post` scripts usually do not require rebuilding the image; only `compose + force-recreate` is needed.

---

## 6. Verification

### 6.1 Same-board serialization (jobs should run serially)

Trigger from both sides at the same time:

```yaml
runs-on: [self-hosted, linux, roc-rk3568-pc]
```

With steps that include:

```yaml
- run: echo "START $(date -Iseconds)"
- run: sleep 120
- run: echo "END $(date -Iseconds)"
```

Expected:

- One job is Running first while the other is Waiting;
- The second starts after the first finishes;
- The two `sleep 120` intervals do not overlap.
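
The non-overlap expectation can be checked from the logged `START`/`END` timestamps. Below is a small sketch; `intervals_overlap` is a hypothetical helper, and it assumes the timestamps have already been converted to epoch seconds.

```bash
# Return 0 (true) when [s1, e1) and [s2, e2) overlap; all arguments are epoch seconds.
intervals_overlap() {
    local s1="$1" e1="$2" s2="$3" e2="$4"
    [ "$s1" -lt "$e2" ] && [ "$s2" -lt "$e1" ]
}
```

With GNU `date`, a timestamp printed by `date -Iseconds` can be converted with `date -d "<timestamp>" +%s` before calling the function. For the serial case, `intervals_overlap` should return non-zero for the two jobs; for the cross-board case in 6.2 it should return zero.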

### 6.2 Cross-board parallelism (jobs should run in parallel)

- Job A: `roc-rk3568-pc`
- Job B: `phytiumpi`

Expected: both can be Running at the same time, with overlapping execution windows.

---

## 7. The Cancel Scenario

Clicking `Cancel` in the web UI is allowed, but the following is recommended:

- Let important verification jobs finish naturally where possible;
- If anomalies appear after a mid-run cancel (such as abnormal waiting or out-of-sync state), perform a one-off cleanup:

```bash
sudo rm -f /tmp/github-runner-locks/*.holder /tmp/github-runner-locks/*.release
sudo chmod 1777 /tmp/github-runner-locks
sudo find /tmp/github-runner-locks -maxdepth 1 -type f -name 'board-*' -exec chmod 666 {} \;
docker restart <orgA-roc-container>
docker restart <orgB-roc-container>
```

The current lock implementation already includes:

- A unique release file generated from `RUNNER_NAME + GITHUB_RUN_ID + GITHUB_RUN_ATTEMPT`;
- Protection against an old job's post-hook releasing a new job's lock by mistake (cancel race protection).
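
To illustrate the naming scheme, the sketch below rebuilds the release file path the same way the wrapper scripts in this commit do. The function name `release_file_for` is hypothetical; the variable names and fallback values are taken from the scripts.

```bash
# Build the per-run release file path from the runner name, run ID, and run attempt.
release_file_for() {
    local lock_dir="${RUNNER_LOCK_DIR:-/tmp/github-runner-locks}"
    local resource_id="${RUNNER_RESOURCE_ID:-default-hardware}"
    local run_key="${RUNNER_NAME:-unknown-runner}.${GITHUB_RUN_ID:-unknown}.${GITHUB_RUN_ATTEMPT:-unknown}"
    printf '%s/%s.%s.release\n' "$lock_dir" "$resource_id" "$run_key"
}
```

Because the run attempt is part of the name, a re-run of the same workflow gets a different release file from the cancelled attempt, so a late post-hook from the old attempt cannot satisfy the new holder. The cleanup glob `*.release` above still matches these names.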

---

## 8. FAQ

### 8.1 Stuck at `Waiting for a runner to pick up this job...`

Check first:

- Whether the runner is `online` for that organization/repository
- Whether the labels match exactly (`self-hosted, linux, roc-rk3568-pc`)
- Whether the runner group grants access to the target repository

### 8.2 All runners are `offline`

A common cause is a misconfigured proxy (for example, `127.0.0.1:7890` is unreachable from inside the container).

The script now injects proxy variables only when `HTTP_PROXY/HTTPS_PROXY/NO_PROXY` are explicitly set.

### 8.3 The script was changed but the container did not pick it up

Run:

```bash
ENV_FILE=.env.<org> ./runner.sh compose
docker compose -f docker-compose.<org>.<repo>.yml up -d --force-recreate
```

Then check inside the container that the expected script keywords are present.
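
One way to do the keyword check is a fixed-string `grep`. The helper below is a hypothetical sketch; which keyword to look for depends on the change being verified, and `RUN_KEY=` is used here only because this commit introduces that variable.

```bash
# Succeed when the given script file contains the keyword as a literal string.
script_has_keyword() {
    local file="$1" keyword="$2"
    grep -qF -- "$keyword" "$file"
}
```

Inside a container the same check can be run directly, for example `docker exec <container> grep -c 'RUN_KEY=' /home/runner/runner-wrapper/pre-job-lock.sh`, where the path follows the read-only mount added to the compose file in this commit.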

runner-wrapper/post-job-lock.sh

Lines changed: 39 additions & 2 deletions
@@ -6,12 +6,49 @@
 #
 # Depends on: the RUNNER_RESOURCE_ID and RUNNER_LOCK_DIR environment variables
 
+set -e
+
 LOCK_DIR="${RUNNER_LOCK_DIR:-/tmp/github-runner-locks}"
 RESOURCE_ID="${RUNNER_RESOURCE_ID:-default-hardware}"
-RELEASE_FILE="${LOCK_DIR}/${RESOURCE_ID}.release"
+RUNNER_NAME_SAFE="${RUNNER_NAME:-unknown-runner}"
+RUN_ID_SAFE="${GITHUB_RUN_ID:-unknown}"
+RUN_ATTEMPT_SAFE="${GITHUB_RUN_ATTEMPT:-unknown}"
+RUN_KEY="${RUNNER_NAME_SAFE}.${RUN_ID_SAFE}.${RUN_ATTEMPT_SAFE}"
+RELEASE_FILE="${LOCK_DIR}/${RESOURCE_ID}.${RUN_KEY}.release"
+HOLDER_PID_FILE="${LOCK_DIR}/${RESOURCE_ID}.holder"
 
 echo "[$(date -Iseconds)] 🔓 Releasing lock for ${RESOURCE_ID}" >&2
-touch "${RELEASE_FILE}"
+mkdir -p "${LOCK_DIR}"
+chmod 1777 "${LOCK_DIR}" || true
+
+# Only the runner that currently holds the lock may release it, so a stale post-hook after a cancel cannot release a new job's lock
+if [ ! -f "${HOLDER_PID_FILE}" ]; then
+  echo "[$(date -Iseconds)] ⚠️ Holder file not found, skip releasing: ${RESOURCE_ID}" >&2
+  exit 0
+fi
+
+holder_pid=""
+holder_runner=""
+holder_run_id=""
+holder_run_attempt=""
+read -r holder_pid holder_runner holder_run_id holder_run_attempt < "${HOLDER_PID_FILE}" || true
+
+if [ -z "${holder_runner}" ] || [ "${holder_runner}" != "${RUNNER_NAME_SAFE}" ]; then
+  echo "[$(date -Iseconds)] ⚠️ Holder runner mismatch (holder=${holder_runner:-unknown}, current=${RUNNER_NAME_SAFE}), skip releasing ${RESOURCE_ID}" >&2
+  exit 0
+fi
+
+if [ -z "${holder_run_id}" ] || [ "${holder_run_id}" != "${RUN_ID_SAFE}" ] || \
+  [ -z "${holder_run_attempt}" ] || [ "${holder_run_attempt}" != "${RUN_ATTEMPT_SAFE}" ]; then
+  echo "[$(date -Iseconds)] ⚠️ Holder run mismatch (holder=${holder_run_id:-unknown}/${holder_run_attempt:-unknown}, current=${RUN_ID_SAFE}/${RUN_ATTEMPT_SAFE}), skip releasing ${RESOURCE_ID}" >&2
+  exit 0
+fi
+
+touch "${RELEASE_FILE}" || {
+  echo "[$(date -Iseconds)] ⚠️ Failed to create release mark: ${RELEASE_FILE}" >&2
+  # Do not fail the job just because the release marker could not be written; operators fix the lock directory permissions afterwards
+  exit 0
+}
 
 # The holder detects the marker within 1 second and exits, which releases the lock
 exit 0
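
The release check in this diff can be read as a single predicate over the holder record. The sketch below restates it as a function for illustration only; `may_release` is a hypothetical name, and the holder file format (`pid runner run_id attempt` on one line) is the one written by `pre-job-lock.sh` in this commit.

```bash
# Succeed only when the holder file names exactly this runner, run ID, and attempt.
may_release() {
    local holder_file="$1" runner="$2" run_id="$3" attempt="$4"
    local h_pid="" h_runner="" h_run_id="" h_attempt=""
    # No holder record means nothing is known to hold the lock: do not release.
    [ -f "$holder_file" ] || return 1
    read -r h_pid h_runner h_run_id h_attempt < "$holder_file" || true
    [ -n "$h_runner" ] && [ "$h_runner" = "$runner" ] &&
        [ -n "$h_run_id" ] && [ "$h_run_id" = "$run_id" ] &&
        [ -n "$h_attempt" ] && [ "$h_attempt" = "$attempt" ]
}
```

In the cancel race, the old job's post-hook calls this with its own run ID and attempt while the holder file already describes the new job, so the predicate fails and no release marker is written.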

runner-wrapper/pre-job-lock.sh

Lines changed: 27 additions & 3 deletions
@@ -10,26 +10,50 @@ set -e
 
 LOCK_DIR="${RUNNER_LOCK_DIR:-/tmp/github-runner-locks}"
 RESOURCE_ID="${RUNNER_RESOURCE_ID:-default-hardware}"
+RUNNER_NAME_SAFE="${RUNNER_NAME:-unknown-runner}"
+RUN_ID_SAFE="${GITHUB_RUN_ID:-unknown}"
+RUN_ATTEMPT_SAFE="${GITHUB_RUN_ATTEMPT:-unknown}"
+RUN_KEY="${RUNNER_NAME_SAFE}.${RUN_ID_SAFE}.${RUN_ATTEMPT_SAFE}"
 LOCK_FILE="${LOCK_DIR}/${RESOURCE_ID}.lock"
-RELEASE_FILE="${LOCK_DIR}/${RESOURCE_ID}.release"
+RELEASE_FILE="${LOCK_DIR}/${RESOURCE_ID}.${RUN_KEY}.release"
 HOLDER_PID_FILE="${LOCK_DIR}/${RESOURCE_ID}.holder"
 
 mkdir -p "${LOCK_DIR}"
+chmod 1777 "${LOCK_DIR}" || true
+# Remove any leftover release marker for the current run so it is not mistaken for a release request
+rm -f "${RELEASE_FILE}" || true
 
 # Open the lock file and take an exclusive lock (blocks until available)
 exec 200>"${LOCK_FILE}"
+chmod 666 "${LOCK_FILE}" || true
 echo "[$(date -Iseconds)] ⏳ Waiting for lock: ${RESOURCE_ID}" >&2
 flock -x 200
 echo "[$(date -Iseconds)] ✅ Acquired lock for ${RESOURCE_ID}" >&2
 
 # A background subshell inherits fd 200 and holds the lock, waiting for post-job to create the release file
 (
+  holder_pid="${BASHPID:-$$}"
+  printf '%s %s %s %s\n' \
+    "${holder_pid}" \
+    "${RUNNER_NAME_SAFE}" \
+    "${RUN_ID_SAFE}" \
+    "${RUN_ATTEMPT_SAFE}" > "${HOLDER_PID_FILE}"
+  chmod 666 "${HOLDER_PID_FILE}" || true
+
   while [ ! -f "${RELEASE_FILE}" ]; do
     sleep 1
   done
-  rm -f "${RELEASE_FILE}" "${HOLDER_PID_FILE}"
+
+  # Remove the holder record only if this process wrote it, so a concurrent handover cannot delete the new holder's record
+  current_holder_pid=""
+  if [ -f "${HOLDER_PID_FILE}" ]; then
+    read -r current_holder_pid _ < "${HOLDER_PID_FILE}" || true
+  fi
+  rm -f "${RELEASE_FILE}" || true
+  if [ "${current_holder_pid}" = "${holder_pid}" ]; then
+    rm -f "${HOLDER_PID_FILE}" || true
+  fi
 ) &
-echo $! > "${HOLDER_PID_FILE}"
 
 # The main script exits; the subshell keeps fd 200 open, so the lock is held until post-job runs
 exit 0
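
The core mechanism here is that an `flock` lock belongs to the open file description, so a background subshell that inherits the descriptor keeps the lock after the parent lets go of it. The sketch below isolates that behaviour. It is an illustration under assumptions, not the wrapper itself: `hold_lock_until_released` is a hypothetical name, fd 9 stands in for fd 200, and the `flock` utility from util-linux is assumed to be installed.

```bash
# Take an exclusive lock on $1 and keep it held by a background subshell
# until the file $2 appears.
hold_lock_until_released() {
    local lock_file="$1" release_file="$2"
    exec 9>"$lock_file"
    flock -x 9
    (
        # The subshell inherited fd 9, so the lock stays held while it runs.
        while [ ! -f "$release_file" ]; do
            sleep 1
        done
        rm -f "$release_file"
    ) &
    # The caller closes its own copy; the subshell's copy still holds the lock.
    exec 9>&-
}
```

After `hold_lock_until_released /tmp/demo.lock /tmp/demo.release` returns, a second `flock -n /tmp/demo.lock -c true` fails until `touch /tmp/demo.release` lets the subshell exit, which mirrors the pre-job and post-job hand-off.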

runner.sh

Lines changed: 8 additions & 4 deletions
@@ -520,9 +520,13 @@ shell_generate_compose_file() {
 # Reason: both board runner types may need the file-lock mechanism
 local extra_env_phytiumpi=()
 local extra_env_roc=()
+local extra_proxy_env=()
 # Add the lock-related environment variables for a runner type only when its resource ID is set
 [[ -n "$res_phytiumpi" ]] && extra_env_phytiumpi=(" RUNNER_RESOURCE_ID: \"$res_phytiumpi\"" " RUNNER_SCRIPT: \"/home/runner/run.sh\"" " RUNNER_LOCK_DIR: \"${RUNNER_LOCK_DIR:-/tmp/github-runner-locks}\"")
 [[ -n "$res_roc" ]] && extra_env_roc=(" RUNNER_RESOURCE_ID: \"$res_roc\"" " RUNNER_SCRIPT: \"/home/runner/run.sh\"" " RUNNER_LOCK_DIR: \"${RUNNER_LOCK_DIR:-/tmp/github-runner-locks}\"")
+[[ -n "${HTTP_PROXY:-}" ]] && extra_proxy_env+=(" HTTP_PROXY: \"${HTTP_PROXY}\"")
+[[ -n "${HTTPS_PROXY:-}" ]] && extra_proxy_env+=(" HTTPS_PROXY: \"${HTTPS_PROXY}\"")
+[[ -n "${NO_PROXY:-}" ]] && extra_proxy_env+=(" NO_PROXY: \"${NO_PROXY}\"")
 
 # ════════════════════════════════════════════════════════════════
 # Step 4: prepare the volume mount configuration for both board runner types
@@ -554,9 +558,7 @@ shell_generate_compose_file() {
 " RUNNER_REMOVE_ON_STOP: \"false\"" \
 " DISABLE_AUTO_UPDATE: \"${DISABLE_AUTO_UPDATE}\"" \
 " RUNNER_WORKDIR: \"${RUNNER_WORKDIR}\"" \
-" HTTP_PROXY: \"http://127.0.0.1:7890\"" \
-" HTTPS_PROXY: \"http://127.0.0.1:7890\"" \
-" NO_PROXY: localhost,127.0.0.1,.internal" \
+"${extra_proxy_env[@]}" \
 " network_mode: host" \
 " privileged: true" \
 "" \
@@ -645,6 +647,7 @@ shell_generate_compose_file() {
 " volumes:" \
 " - /home/$(whoami)/test/phytiumpi:/home/runner/tftp" \
 "$extra_vol_phytiumpi" \
+" - ./runner-wrapper:/home/runner/runner-wrapper:ro" \
 " - ${RUNNER_NAME_PREFIX}runner-phytiumpi-data:/home/runner" \
 " - ${RUNNER_NAME_PREFIX}runner-phytiumpi-udev-rules:/etc/udev/rules.d" \
 "" >> "${COMPOSE_FILE}"
@@ -695,6 +698,7 @@ shell_generate_compose_file() {
 " BOARD_COMM_UART_BAUD: \"1500000\"" \
 "${extra_env_roc[@]}" \
 " volumes:" \
+" - ./runner-wrapper:/home/runner/runner-wrapper:ro" \
 "$extra_vol_roc" \
 " - ${RUNNER_NAME_PREFIX}runner-roc-rk3568-pc-data:/home/runner" \
 " - ${RUNNER_NAME_PREFIX}runner-roc-rk3568-pc-udev-rules:/etc/udev/rules.d" \
@@ -1220,7 +1224,7 @@ if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
 shell_info "Successfully built ${RUNNER_CUSTOM_IMAGE} image"
 
 # Update hash file
-local new_hash=""
+new_hash=""
 if command -v sha256sum >/dev/null 2>&1; then
   new_hash=$(sha256sum Dockerfile | awk '{print $1}')
 elif command -v shasum >/dev/null 2>&1; then
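
The proxy change in `shell_generate_compose_file` replaces three hard-coded lines with an array that is filled only for variables that are set. The sketch below shows the same pattern in isolation. `build_proxy_env_lines` is a hypothetical name, the six-space indentation is an assumption about the generated YAML, and the code requires bash (arrays and indirect expansion).

```bash
# Emit one compose "environment" line per proxy variable that is set and non-empty.
build_proxy_env_lines() {
    local name
    local lines=()
    for name in HTTP_PROXY HTTPS_PROXY NO_PROXY; do
        # ${!name} is bash indirect expansion: the value of the variable named by $name.
        if [ -n "${!name:-}" ]; then
            lines+=("      ${name}: \"${!name}\"")
        fi
    done
    # Guard on the element count so nothing, not even a blank line, is printed
    # when no proxy variable is set.
    if [ "${#lines[@]}" -gt 0 ]; then
        printf '%s\n' "${lines[@]}"
    fi
}
```

One design point worth noting: expanding an empty array as `"${lines[@]}"` under `set -u` raises an unbound-variable error on bash versions before 4.4, so if `runner.sh` runs with `set -u` on an older bash, a count guard like the one above may be needed around `"${extra_proxy_env[@]}"` as well.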
