TruthfulQA LLM Evaluation: Base vs Instruct Models

Evaluation of truthfulness and informativeness in LLMs using few-shot prompting and judge models.

Repository Structure

src/ → generation and evaluation scripts
data/ → generated model outputs in JSON format
results/ → figures, screenshots, and evaluation summaries
report/ → academic report with methodology and conclusions
requirements.txt → dependencies

Overview

Large Language Models can generate fluent and convincing answers, but they are also prone to producing false statements, common misconceptions, and misleading information.

This project evaluates the behaviour of two language models:

Llama-3.1-8B Base
Llama-3.1-8B Instruct

The evaluation is based on TruthfulQA-Multi, using both English and Basque data. The main goal is to analyze how instruction tuning affects the quality of responses, with special attention to:

truthfulness
informativeness
multilingual behaviour

The project combines prompting, automatic evaluation, and manual analysis to obtain a broader view of model performance.

Methodology

The system is divided into three stages.

Z1 — Few-shot generation

In the first stage, both models are prompted with the same few-shot setup using examples from the training split.

Llama-3.1-8B Base is prompted with a plain Q/A format
Llama-3.1-8B Instruct uses the tokenizer chat template
The same methodology is applied to both Spanish and Basque data

The outputs for the full validation set are stored in JSON files for later evaluation.

Z2 — Automatic evaluation

In the second stage, generated answers are evaluated automatically with independent judge models:

Truthfulness judge: HiTZ/gemma-2-9b-it-multi-truth-judge
Informativeness judge: HiTZ/gemma-2-9b-it-multi-info-judge

These judges classify each answer with a binary decision:

truthful / not truthful
informative / not informative

This stage allows large-scale comparison between Base and Instruct outputs.

Z3 — Manual analysis

The final stage focuses on qualitative inspection.

The worst-scoring examples are manually reviewed in order to:

compare generated answers against the reference answers
identify where the model fails
identify where the automatic evaluator fails
analyze multilingual limitations, especially in Basque

This stage is important because automatic metrics alone do not fully capture answer quality.

Dataset

This project uses TruthfulQA-Multi, a multilingual adaptation of TruthfulQA.

The dataset includes:

misleading or misconception-triggering questions
one best reference answer
alternative incorrect answers
category and question type metadata
training examples for few-shot prompting
a validation split used in all experiments

The experiments in this repository focus on:

Spanish
Basque

Models

Generation models

meta-llama/Llama-3.1-8B
meta-llama/Llama-3.1-8B-Instruct

Judge models

HiTZ/gemma-2-9b-it-multi-truth-judge
HiTZ/gemma-2-9b-it-multi-info-judge

Results Visualization

Quantitative Comparison

TRUTHFULNESS:

INFORMATIVENESS:

Example Outputs

Key Findings

The experiments were designed to study the effect of instruction tuning on response quality. According to the project report, the instruct-aligned model produces responses that are generally more coherent, useful, and reliable than the base model, while multilingual evaluation reveals additional challenges in Basque. :contentReference[oaicite:4]{index=4}

At a system level, the project shows three important points:

instruction tuning improves response quality over the base model
automatic judges are useful for large-scale comparison
manual analysis is necessary to detect evaluator errors and multilingual edge cases

Example Research Questions

This project explores questions such as:

Does instruction tuning improve truthfulness?
Does it also improve informativeness?
Do judge models align with human judgement?
Does performance degrade in multilingual settings such as Basque?

Files

Main scripts

src/z1_generation.py → answer generation with Base and Instruct
src/z2_truth.py → automatic truthfulness evaluation
src/z2_informativeness.py → automatic informativeness evaluation

Data files

data/z1_outputs_full.json → Spanish outputs
data/z1_outputs_full_eus.json → Basque outputs

How to Run

Install dependencies:

pip install -r requirements.txt

Run generation:

python src/z1_generation.py

Run truthfulness evaluation:

python src/z2_truth.py

Run informativeness evaluation:

python src/z2_informativeness.py

Notes

The original code was developed in a research/academic context
Some scripts were initially created from Colab notebooks and then adapted for repository use
Paths may need to be adjusted depending on the local environment
Large output files and figures are included only when useful for reproducibility and inspection

Report

The full academic report is available in:

report/

Future Improvements

Possible next steps for this project include:

cleaner modularization of the evaluation pipeline
unified configuration for paths and models
better visualization of metrics
extension to more languages and more judge models
deeper agreement analysis between automatic and manual evaluation

Author

Aitor Milicua
Artificial Intelligence student focused on machine learning, NLP, and applied AI systems.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TruthfulQA LLM Evaluation: Base vs Instruct Models

Repository Structure

Overview

Methodology

Z1 — Few-shot generation

Z2 — Automatic evaluation

Z3 — Manual analysis

Dataset

Models

Generation models

Judge models

Results Visualization

Quantitative Comparison

Example Outputs

Key Findings

Example Research Questions

Files

Main scripts

Data files

How to Run

Notes

Report

Future Improvements

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
data		data
report		report
results		results
src		src
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

TruthfulQA LLM Evaluation: Base vs Instruct Models

Repository Structure

Overview

Methodology

Z1 — Few-shot generation

Z2 — Automatic evaluation

Z3 — Manual analysis

Dataset

Models

Generation models

Judge models

Results Visualization

Quantitative Comparison

Example Outputs

Key Findings

Example Research Questions

Files

Main scripts

Data files

How to Run

Notes

Report

Future Improvements

Author

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages