Commit a291524

fix: correct org and dataset paths (#18)
1 parent adebf5b commit a291524

4 files changed

Lines changed: 22 additions & 5 deletions

README.md

Lines changed: 18 additions & 1 deletion
@@ -62,6 +62,19 @@ We will go through the example based on `Qwen/Qwen2.5-14B-Instruct-1M`:
 
 ### Rejection Sampling
 
+This step aims to generate SFT data for later use.
+Note that we already have pre-generated datasets:
+
+* [`Qwen2.5-14B-Instruct-1M`](https://huggingface.co/datasets/purpcode/ctxdistill-verified-Qwen2.5-14B-Instruct-1M-57k) via best of 8
+* [`Qwen2.5-32B-Instruct`](https://huggingface.co/datasets/purpcode/ctxdistill-verified-Qwen2.5-32B-Instruct-55k) via best of 4
+
+To generate data from scratch or for other models, follow the steps below:
+
+<details><summary><b>Rejection Sampling from Scratch</b> <i>:: click to expand ::</i></summary>
+<div>
+
+The instructions are exemplified for `Qwen/Qwen2.5-14B-Instruct-1M`. Please change the model names and the later SFT script accordingly for other models.
+
 ```bash
 # --- TMUX SESSION "sgl" ---
 conda create -n sgl python=3.12 -y
@@ -102,6 +115,10 @@ python datagen/ctxdistill/post.py --generation-path Qwen2.5-14B-Instruct-1M.dist
 # ----------------------------
 ```
 
+</div>
+</details>
+
+
 ### Running SFT
 
 ```bash
@@ -116,7 +133,7 @@ pip3 install --no-build-isolation -e '.[flash-attn,deepspeed]'
 
 cd purpcode # come back to the root directory
 # double check sft/ctxdistill_qwen14b.yaml to make sure the paths are aligned well
-axolotl train sft/ctxdistill_qwen14b.yaml --deepspeed deepspeed_configs/zero3.json
+axolotl train sft/ctxdistill_qwen14b.yaml --deepspeed deepspeed_configs/zero3.json # default to pre-generated datasets
 # --> outputs/purpcode-14b-ctxdistill
 ```
 
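
The README lines added above describe the pre-generated datasets as produced "via best of 8" and "via best of 4". As a rough illustration of one common reading of that phrase, the sketch below draws up to `n` responses per prompt and keeps the first one a verifier accepts. The names `rejection_sample`, `scripted_generator`, and `is_good` are hypothetical and do not come from this repository; the actual pipeline under `datagen/ctxdistill` may select samples differently.

```python
from typing import Callable, Iterable, Optional


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    n: int = 8,
) -> Optional[str]:
    """Draw up to n samples and return the first one the verifier accepts.

    Returns None if every sample is rejected, so the prompt yields no SFT example.
    """
    for _ in range(n):
        candidate = generate(prompt)
        if verify(prompt, candidate):
            return candidate
    return None


def scripted_generator(outputs: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: replays a fixed list of responses."""
    it = iter(outputs)
    return lambda _prompt: next(it)


def is_good(_prompt: str, response: str) -> bool:
    """Hypothetical verifier: accepts only the literal string 'good'."""
    return response == "good"


if __name__ == "__main__":
    gen = scripted_generator(["bad", "bad", "good", "bad"])
    print(rejection_sample("example prompt", gen, is_good, n=8))
```

With a smaller `n`, more prompts end up with no accepted response; that trade-off between sampling cost and dataset yield is the main knob in this kind of setup.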

eval/compile_xscode/cwe2ovrf.py

Lines changed: 2 additions & 2 deletions
@@ -224,7 +224,7 @@ def load_codeguru_vulnerabilities(file_path):
     return vulnerabilities
 
 
-def create_codeguru_information(dataset_path: str = "purpcorn/codeguru-rules"):
+def create_codeguru_information(dataset_path: str = "purpcode/codeguru-rules"):
     collection = {}
     ds = load_dataset(dataset_path, split="scraped")
 
@@ -574,7 +574,7 @@ def cwe2ovrf_main(
 
     collection = create_cwe_information()
     if vuln_rules_type == "codeguru":
-        collection = create_codeguru_information("purpcorn/codeguru-rules")
+        collection = create_codeguru_information("purpcode/codeguru-rules")
 
     if os.path.exists(init_filepath):
         cprint(f"Found existing init messages at {init_filepath}", "yellow")

sft/ctxdistill_qwen14b.yaml

Lines changed: 1 addition & 1 deletion
@@ -61,7 +61,7 @@ bf16: true
 
 # dataset
 datasets:
-  - path: purpcode/ctxdistill-verified-Qwen2.5-14B-Instruct-1M.jsonl
+  - path: purpcode/ctxdistill-verified-Qwen2.5-14B-Instruct-1M-57k
     type: chat_template
     field_messages: messages
     message_field_training: train

sft/ctxdistill_qwen32b.yaml

Lines changed: 1 addition & 1 deletion
@@ -61,7 +61,7 @@ bf16: true
 
 # dataset
 datasets:
-  - path: purpcode/ctxdistill-verified-Qwen-2.5-32B-Instruct.jsonl
+  - path: purpcode/ctxdistill-verified-Qwen2.5-32B-Instruct-55k
     type: chat_template
     field_messages: messages
     message_field_training: train
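
The new `path` values in both YAML files match the `<org>/<name>` portion of the dataset URLs listed in the README hunk. A small check like the following could confirm that a config path and a README link refer to the same dataset. This is only a sketch: `hub_dataset_id` is a hypothetical helper, not part of the repository, and it assumes URLs of the form `https://huggingface.co/datasets/<org>/<name>`.

```python
from urllib.parse import urlparse


def hub_dataset_id(url: str) -> str:
    """Extract '<org>/<name>' from a https://huggingface.co/datasets/<org>/<name> URL."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a dataset URL: {url}")
    return "/".join(parts[1:3])


if __name__ == "__main__":
    readme_url = (
        "https://huggingface.co/datasets/"
        "purpcode/ctxdistill-verified-Qwen2.5-14B-Instruct-1M-57k"
    )
    yaml_path = "purpcode/ctxdistill-verified-Qwen2.5-14B-Instruct-1M-57k"
    # True when the SFT config points at the dataset linked from the README.
    print(hub_dataset_id(readme_url) == yaml_path)
```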
