I evaluated the polyglot-ko-1.3b model on HellaSwag and WiC from KoBEST, and my results differ from both the paper and the Hugging Face model card.
Environment
- Few-shot examples: 5
- Model: EleutherAI/polyglot-ko-1.3b
- Metric: macro F1 score
- Computing: Colab / GPU(T4) Instance
Here is the notebook I tested with:
https://colab.research.google.com/drive/1lyQQisuB5JzuGk72haSdxXfXP20q4YGr?usp=sharing
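For reference, the numbers below come from EleutherAI's lm-evaluation-harness (`hf-causal-experimental` model type). A command along these lines should reproduce the settings above; exact flag names can vary between harness versions, so treat this as a sketch rather than the authoritative invocation (the notebook is authoritative).

```shell
# Hedged sketch of the lm-evaluation-harness invocation;
# flags may differ slightly depending on the harness version.
python main.py \
    --model hf-causal-experimental \
    --model_args pretrained=EleutherAI/polyglot-ko-1.3b \
    --tasks kobest_wic,kobest_hellaswag \
    --num_fewshot 5 \
    --batch_size 8 \
    --device cuda:0
```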
1. WiC
The paper reports 0.486, but I got only 0.4541.
| params | 0-shot | 5-shot | 10-shot | 50-shot |
|--------|--------|--------|---------|---------|
| 1.3B   | 0.489  | 0.486  | 0.506   | 0.487   |
hf-causal-experimental (pretrained=EleutherAI/polyglot-ko-1.3b), limit: None, provide_description: False, num_fewshot: 5, batch_size: 8
| Task       | Version | Metric   | Value  |   | Stderr |
|------------|---------|----------|--------|---|--------|
| kobest_wic | 0       | acc      | 0.4952 | ± | 0.0141 |
|            |         | macro_f1 | 0.4541 | ± | 0.0138 |
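One plausible source of confusion is that the harness reports both plain accuracy and macro F1, and the two can diverge noticeably when predictions are skewed toward one class. A minimal, self-contained sketch with made-up labels (not actual KoBEST data):

```python
# Why accuracy and macro F1 can diverge on a binary task.
# Labels here are invented for illustration, not KoBEST data.
def macro_f1(y_true, y_pred, labels=(0, 1)):
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]  # biased toward class 0
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(acc, round(macro_f1(y_true, y_pred), 4))  # → 0.6 0.5238
```

So a paper quoting one metric and a rerun quoting the other can legitimately disagree even on identical predictions.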
2. HellaSwag
The paper reports 0.526, but I got only 0.3984.
| params | 0-shot | 5-shot | 10-shot | 50-shot |
|--------|--------|--------|---------|---------|
| 1.3B   | 0.525  | 0.526  | 0.528   | 0.543   |
hf-causal-experimental (pretrained=EleutherAI/polyglot-ko-1.3b), limit: None, provide_description: False, num_fewshot: 5, batch_size: 8
| Task             | Version | Metric   | Value  |   | Stderr |
|------------------|---------|----------|--------|---|--------|
| kobest_hellaswag | 0       | acc      | 0.4020 | ± | 0.0219 |
|                  |         | acc_norm | 0.5280 | ± | 0.0223 |
|                  |         | macro_f1 | 0.3984 | ± | 0.0218 |
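Note that the harness reports three numbers here, and the value matching the paper's 0.528 is acc_norm, not macro_f1. acc and acc_norm can select different answer candidates because acc_norm normalizes each candidate's total log-likelihood by its length (byte length in some harness versions). A toy sketch with invented log-likelihoods:

```python
# Toy illustration of acc vs. acc_norm candidate selection in a
# multiple-choice task; log-likelihoods and lengths are invented.
# Each candidate: (total_logprob, continuation_length)
candidates = [(-6.0, 2), (-7.0, 7)]

# acc: pick the candidate with the highest raw log-likelihood
acc_pick = max(range(len(candidates)), key=lambda i: candidates[i][0])

# acc_norm: pick the candidate with the highest length-normalized
# log-likelihood, which favors the longer candidate here
norm_pick = max(range(len(candidates)),
                key=lambda i: candidates[i][0] / candidates[i][1])

print(acc_pick, norm_pick)  # → 0 1
```

So which of the three metrics a paper quotes matters a lot when comparing numbers.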
I also found a Wandb report, Polyglot-Ko: Open-Source Korean Autoregressive Language Model, which lists a HellaSwag score identical to my result, 0.3984.
| params | n=0    | n=5    | n=10  | n=50   |
|--------|--------|--------|-------|--------|
| 1.3B   | 0.4013 | 0.3984 | 0.417 | 0.4416 |
3. Other models
There are also discrepancies for kakaobrain/kogpt and skt/ko-gpt-trinity-1.2B-v0.5.
- kakaobrain/kogpt
Note that I tested kakaobrain/kogpt with an Int8-quantized model.
|           | In the paper (FP16) | In my test (Int8) | In the Wandb Report |
|-----------|---------------------|-------------------|---------------------|
| COPA      | 0.7287              | 0.7277 (↓0.14%)   | 0.7287              |
| HellaSwag | 0.5833              | 0.4560 (↓21.82%)  | 0.456               |
| BoolQ     | 0.5981              | 0.6015 (↑0.56%)   | -                   |
| WiC       | 0.4775              | 0.3706 (↓22.38%)  | -                   |
- skt/ko-gpt-trinity-1.2B-v0.5
|           | In the paper | In my test | In the Wandb Report |
|-----------|--------------|------------|---------------------|
| WiC       | 0.4313       | 0.3953     | -                   |
| HellaSwag | 0.5272       | 0.400      | 0.4                 |
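For context, the Int8 run for kakaobrain/kogpt was presumably loaded along these lines with transformers + bitsandbytes; the revision and arguments below are my assumption of a typical setup, not copied from the notebook, which remains the authoritative source.

```python
# Hedged sketch: loading kakaobrain/kogpt in Int8 via bitsandbytes.
# Requires a CUDA GPU and the bitsandbytes package installed; the
# revision name is an assumption (the model's fp16 branch on the Hub).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "kakaobrain/kogpt",
    revision="KoGPT6B-ryan1.5b-float16",  # assumption
    load_in_8bit=True,   # Int8 quantization via bitsandbytes
    device_map="auto",
)
```

Given the large HellaSwag and WiC drops in the table above, it would be worth rerunning kogpt in FP16 to separate quantization effects from evaluation-setup differences.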