This repository contains the source code for ViLegalLM, a suite of language models pre-trained on Vietnamese legal text. Models and datasets are hosted on Hugging Face.
Note: These models are intended for research purposes only and should not be used in production legal applications without expert validation.
| Model | Architecture | Params | Context | HuggingFace |
|---|---|---|---|---|
| ViLegalBERT | Encoder-only (RoBERTa) | 135M | 256 tokens | ntphuc149/ViLegalBERT |
| ViLegalQwen2.5-1.5B-Base | Decoder-only (Qwen2) | 1.54B | 2,048 tokens | ntphuc149/ViLegalQwen2.5-1.5B-Base |
| ViLegalQwen3-1.7B-Base | Decoder-only (Qwen3) | 1.72B | 4,096 tokens | ntphuc149/ViLegalQwen3-1.7B-Base |
All models are pre-trained on a 16GB Vietnamese legal corpus (ntphuc149/ViLegalText).
## ViLegalBERT
ViLegalBERT is usable out of the box for feature extraction and semantic similarity. Fine-tuning is required for downstream tasks (NLI, classification, extractive QA).
Important: Input text must be word-segmented before being fed to the model. Use PyVi, underthesea, or VNCoreNLP; the syllables of a multi-syllable word are joined by underscores (e.g., nghiên_cứu_viên). See PhoBERT for details.
```python
import torch
from transformers import AutoModel, AutoTokenizer

# ViLegalBERT is PhoBERT-based, so it uses the PhoBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModel.from_pretrained("ntphuc149/ViLegalBERT")

# Input text MUST be word-segmented (multi-syllable words joined by underscores)
sentence = "luật_sư tư_vấn pháp_luật dân_sự ."

input_ids = torch.tensor([tokenizer.encode(sentence)])
with torch.no_grad():
    features = model(input_ids)
```

## ViLegalQwen2.5-1.5B-Base and ViLegalQwen3-1.7B-Base
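For the semantic-similarity use case, pooled feature vectors can be compared with cosine similarity. A minimal stdlib sketch; the short vectors below are illustrative stand-ins for pooled ViLegalBERT outputs:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Stand-ins for mean-pooled last-hidden-state vectors of two sentences
emb_a = [0.1, 0.3, -0.2]
emb_b = [0.1, 0.25, -0.1]
print(round(cosine_similarity(emb_a, emb_b), 3))
```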
These are base models (CLM). Fine-tuning or instruction tuning (QLoRA) is required before they can be used for downstream tasks.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# ViLegalQwen2.5-1.5B-Base
tokenizer = AutoTokenizer.from_pretrained("ntphuc149/ViLegalQwen2.5-1.5B-Base")
model = AutoModelForCausalLM.from_pretrained("ntphuc149/ViLegalQwen2.5-1.5B-Base")

# ViLegalQwen3-1.7B-Base
tokenizer = AutoTokenizer.from_pretrained("ntphuc149/ViLegalQwen3-1.7B-Base")
model = AutoModelForCausalLM.from_pretrained("ntphuc149/ViLegalQwen3-1.7B-Base")
```

## Datasets

The following synthetic datasets were generated via LLM generation and negative sampling.
| Dataset | Task | Train | Val | HuggingFace |
|---|---|---|---|---|
| ViLegalTF | True/False QA | 13,032 | 388 | ntphuc149/ViLegalTF |
| ViLegalMCQ | Multiple-Choice QA | 14,920 | 300 | ntphuc149/ViLegalMCQ |
| ViLegalNLI | Natural Language Inference | 7,660 | 150 | ntphuc149/ViLegalNLI |
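The negative-sampling step mentioned above can be sketched as follows. This is purely illustrative: the field names and pairing logic are assumptions, not the actual generation pipeline.

```python
import random

def build_nli_pairs(examples, seed=0):
    """Pair each premise with its own hypothesis (positive) and with a
    hypothesis sampled from a different example (negative)."""
    rng = random.Random(seed)
    pairs = []
    for i, ex in enumerate(examples):
        pairs.append((ex["premise"], ex["hypothesis"], "entailment"))
        j = rng.choice([k for k in range(len(examples)) if k != i])
        pairs.append((ex["premise"], examples[j]["hypothesis"], "not_entailment"))
    return pairs

examples = [
    {"premise": "P1", "hypothesis": "H1"},
    {"premise": "P2", "hypothesis": "H2"},
    {"premise": "P3", "hypothesis": "H3"},
]
pairs = build_nli_pairs(examples)
print(len(pairs))  # two pairs per source example
```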
All pre-training code is in `Source codes/Pre-training ViLegalLM/`:

- `ViLegalBERT/` — MLM pre-training scripts and configs (PhoBERT-based)
- `ViLegalQwen/` — CLM pre-training scripts and configs (Qwen-based)
All fine-tuning code is provided as Jupyter Notebooks in `Source codes/Fine-tuning ViLegalLM/`.
Notation:

- `[FT]` — Discriminative fine-tuning (encoder models)
- `[IFT]` — Instruction fine-tuning with QLoRA (decoder models)
- `[FT-CV]` — 5-fold cross-validation
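The `[FT-CV]` setting's 5-fold split can be sketched as a simple index partition. This is a generic illustration, not the notebooks' exact code:

```python
def kfold_indices(n_examples, n_folds=5):
    """Split range(n_examples) into n_folds contiguous (train, val) index lists."""
    fold_sizes = [n_examples // n_folds + (1 if i < n_examples % n_folds else 0)
                  for i in range(n_folds)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i in range(n_folds):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, val))
    return splits

for train, val in kfold_indices(10):
    print(len(train), len(val))  # 8 2 on each of the 5 folds
```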
```
ViLegalLM/
└── Source codes/
    ├── Pre-training ViLegalLM/      # Pre-training scripts and configs
    │   ├── ViLegalBERT/             # MLM pre-training (PhoBERT-based)
    │   └── ViLegalQwen/             # CLM pre-training (Qwen-based)
    └── Fine-tuning ViLegalLM/       # Fine-tuning notebooks (.ipynb)
        ├── Information Retrieval
        ├── Question Answering       # True/False, MCQ, Extractive, Abstractive
        ├── Natural Language Inference
        └── Syllogism Reasoning
```
UPDATE SOON!