I evaluated the polyglot-ko-1.3b model on HellaSwag and WiC from KoBEST, and my results differ from both the paper and the Hugging Face model card.
Environment
- Few-shot examples: 5
- Model: EleutherAI/polyglot-ko-1.3b
- Metric: macro F1 score
- Computing: Colab / GPU(T4) Instance
Here is the notebook I tested with:
https://colab.research.google.com/drive/1lyQQisuB5JzuGk72haSdxXfXP20q4YGr?usp=sharing
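For reference, the numbers below come from EleutherAI's lm-evaluation-harness (`hf-causal-experimental` model type). A command along these lines should reproduce the settings above; exact flag names can vary between harness versions, so treat this as a sketch rather than the authoritative invocation (the notebook is authoritative).

```shell
# Hedged sketch of the lm-evaluation-harness invocation;
# flags may differ slightly depending on the harness version.
python main.py \
    --model hf-causal-experimental \
    --model_args pretrained=EleutherAI/polyglot-ko-1.3b \
    --tasks kobest_wic,kobest_hellaswag \
    --num_fewshot 5 \
    --batch_size 8 \
    --device cuda:0
```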
1. WiC
The paper reports 0.486, but I got only 0.4541.
| params | 0-shot | 5-shot | 10-shot | 50-shot |
|--------|--------|--------|---------|---------|
| 1.3B   | 0.489  | 0.486  | 0.506   | 0.487   |
hf-causal-experimental (pretrained=EleutherAI/polyglot-ko-1.3b), limit: None, provide_description: False, num_fewshot: 5, batch_size: 8
| Task       | Version | Metric   | Value  |   | Stderr |
|------------|---------|----------|--------|---|--------|
| kobest_wic | 0       | acc      | 0.4952 | ± | 0.0141 |
|            |         | macro_f1 | 0.4541 | ± | 0.0138 |
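One plausible source of confusion is that the harness reports both plain accuracy and macro F1, and the two can diverge noticeably when predictions are skewed toward one class. A minimal, self-contained sketch with made-up labels (not actual KoBEST data):

```python
# Why accuracy and macro F1 can diverge on a binary task.
# Labels here are invented for illustration, not KoBEST data.
def macro_f1(y_true, y_pred, labels=(0, 1)):
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]  # biased toward class 0
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(acc, round(macro_f1(y_true, y_pred), 4))  # → 0.6 0.5238
```

So a paper quoting one metric and a rerun quoting the other can legitimately disagree even on identical predictions.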
2. HellaSwag
The paper reports 0.526, but I got only 0.3984.
| params | 0-shot | 5-shot | 10-shot | 50-shot |
|--------|--------|--------|---------|---------|
| 1.3B   | 0.525  | 0.526  | 0.528   | 0.543   |
hf-causal-experimental (pretrained=EleutherAI/polyglot-ko-1.3b), limit: None, provide_description: False, num_fewshot: 5, batch_size: 8
| Task             | Version | Metric   | Value  |   | Stderr |
|------------------|---------|----------|--------|---|--------|
| kobest_hellaswag | 0       | acc      | 0.4020 | ± | 0.0219 |
|                  |         | acc_norm | 0.5280 | ± | 0.0223 |
|                  |         | macro_f1 | 0.3984 | ± | 0.0218 |
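Note that the harness reports three numbers here, and the value matching the paper's 0.528 is acc_norm, not macro_f1. acc and acc_norm can select different answer candidates because acc_norm normalizes each candidate's total log-likelihood by its length (byte length in some harness versions). A toy sketch with invented log-likelihoods:

```python
# Toy illustration of acc vs. acc_norm candidate selection in a
# multiple-choice task; log-likelihoods and lengths are invented.
# Each candidate: (total_logprob, continuation_length)
candidates = [(-6.0, 2), (-7.0, 7)]

# acc: pick the candidate with the highest raw log-likelihood
acc_pick = max(range(len(candidates)), key=lambda i: candidates[i][0])

# acc_norm: pick the candidate with the highest length-normalized
# log-likelihood, which favors the longer candidate here
norm_pick = max(range(len(candidates)),
                key=lambda i: candidates[i][0] / candidates[i][1])

print(acc_pick, norm_pick)  # → 0 1
```

So which of the three metrics a paper quotes matters a lot when comparing numbers.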
I also found a Wandb report, Polyglot-Ko: Open-Source Korean Autoregressive Language Model, which lists a HellaSwag score identical to my result, 0.3984.
| params | n=0    | n=5    | n=10  | n=50   |
|--------|--------|--------|-------|--------|
| 1.3B   | 0.4013 | 0.3984 | 0.417 | 0.4416 |
3. Other models
There are also discrepancies for kakaobrain/kogpt and skt/ko-gpt-trinity-1.2B-v0.5.
- kakaobrain/kogpt
Note that I tested kakaobrain/kogpt with an Int8-quantized model.
|           | In the paper (FP16) | In my test (Int8) | In the Wandb Report |
|-----------|---------------------|-------------------|---------------------|
| COPA      | 0.7287              | 0.7277 (↓0.14%)   | 0.7287              |
| HellaSwag | 0.5833              | 0.4560 (↓21.82%)  | 0.456               |
| BoolQ     | 0.5981              | 0.6015 (↑0.56%)   | -                   |
| WiC       | 0.4775              | 0.3706 (↓22.38%)  | -                   |
- skt/ko-gpt-trinity-1.2B-v0.5
|           | In the paper | In my test | In the Wandb Report |
|-----------|--------------|------------|---------------------|
| WiC       | 0.4313       | 0.3953     | -                   |
| HellaSwag | 0.5272       | 0.400      | 0.4                 |
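For context, the Int8 run for kakaobrain/kogpt was presumably loaded along these lines with transformers + bitsandbytes; the revision and arguments below are my assumption of a typical setup, not copied from the notebook, which remains the authoritative source.

```python
# Hedged sketch: loading kakaobrain/kogpt in Int8 via bitsandbytes.
# Requires a CUDA GPU and the bitsandbytes package installed; the
# revision name is an assumption (the model's fp16 branch on the Hub).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "kakaobrain/kogpt",
    revision="KoGPT6B-ryan1.5b-float16",  # assumption
    load_in_8bit=True,   # Int8 quantization via bitsandbytes
    device_map="auto",
)
```

Given the large HellaSwag and WiC drops in the table above, it would be worth rerunning kogpt in FP16 to separate quantization effects from evaluation-setup differences.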