This repository contains the source code for ViLegalLM, a suite of language models pre-trained on Vietnamese legal text. Models and datasets are hosted on Hugging Face.
Note: These models are intended for research purposes only and should not be used in production legal applications without expert validation.
| Model | Architecture | Params | Context | HuggingFace |
|---|---|---|---|---|
| ViLegalBERT | Encoder-only (RoBERTa) | 135M | 256 tokens | ntphuc149/ViLegalBERT |
| ViLegalQwen2.5-1.5B-Base | Decoder-only (Qwen2) | 1.54B | 2,048 tokens | ntphuc149/ViLegalQwen2.5-1.5B-Base |
| ViLegalQwen3-1.7B-Base | Decoder-only (Qwen3) | 1.72B | 4,096 tokens | ntphuc149/ViLegalQwen3-1.7B-Base |
All models are pre-trained on a 16GB Vietnamese legal corpus (ntphuc149/ViLegalText).
## ViLegalBERT
ViLegalBERT is usable out of the box for feature extraction and semantic similarity. Fine-tuning is required for downstream tasks (NLI, classification, extractive QA).
Important: Input text must be word-segmented before being fed to the model. Use PyVi, underthesea, or VNCoreNLP; the syllables of a multi-syllable word are joined by underscores (e.g., nghiên_cứu_viên). See PhoBERT for details.
```python
import torch
from transformers import AutoModel, AutoTokenizer

# ViLegalBERT is PhoBERT-based, so it uses the PhoBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModel.from_pretrained("ntphuc149/ViLegalBERT")

# Input text MUST be word-segmented (multi-syllable words joined by underscores)
sentence = "luật_sư tư_vấn pháp_luật dân_sự ."

input_ids = torch.tensor([tokenizer.encode(sentence)])
with torch.no_grad():
    features = model(input_ids)
```

## ViLegalQwen2.5-1.5B-Base and ViLegalQwen3-1.7B-Base
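For the semantic-similarity use case, pooled feature vectors can be compared with cosine similarity. A minimal stdlib sketch; the short vectors below are illustrative stand-ins for pooled ViLegalBERT outputs:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Stand-ins for mean-pooled last-hidden-state vectors of two sentences
emb_a = [0.1, 0.3, -0.2]
emb_b = [0.1, 0.25, -0.1]
print(round(cosine_similarity(emb_a, emb_b), 3))
```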
These are base models (CLM). Fine-tuning or instruction tuning (QLoRA) is required before they can be used for downstream tasks.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# ViLegalQwen2.5-1.5B-Base
tokenizer = AutoTokenizer.from_pretrained("ntphuc149/ViLegalQwen2.5-1.5B-Base")
model = AutoModelForCausalLM.from_pretrained("ntphuc149/ViLegalQwen2.5-1.5B-Base")

# ViLegalQwen3-1.7B-Base
tokenizer = AutoTokenizer.from_pretrained("ntphuc149/ViLegalQwen3-1.7B-Base")
model = AutoModelForCausalLM.from_pretrained("ntphuc149/ViLegalQwen3-1.7B-Base")
```

## Datasets

The following synthetic datasets were generated via LLM generation and negative sampling.
| Dataset | Task | Train | Val | HuggingFace |
|---|---|---|---|---|
| ViLegalTF | True/False QA | 13,032 | 388 | ntphuc149/ViLegalTF |
| ViLegalMCQ | Multiple-Choice QA | 14,920 | 300 | ntphuc149/ViLegalMCQ |
| ViLegalNLI | Natural Language Inference | 7,660 | 150 | ntphuc149/ViLegalNLI |
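The negative-sampling step mentioned above can be sketched as follows. This is purely illustrative: the field names and pairing logic are assumptions, not the actual generation pipeline.

```python
import random

def build_nli_pairs(examples, seed=0):
    """Pair each premise with its own hypothesis (positive) and with a
    hypothesis sampled from a different example (negative)."""
    rng = random.Random(seed)
    pairs = []
    for i, ex in enumerate(examples):
        pairs.append((ex["premise"], ex["hypothesis"], "entailment"))
        j = rng.choice([k for k in range(len(examples)) if k != i])
        pairs.append((ex["premise"], examples[j]["hypothesis"], "not_entailment"))
    return pairs

examples = [
    {"premise": "P1", "hypothesis": "H1"},
    {"premise": "P2", "hypothesis": "H2"},
    {"premise": "P3", "hypothesis": "H3"},
]
pairs = build_nli_pairs(examples)
print(len(pairs))  # two pairs per source example
```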
All pre-training code is in `Source codes/Pre-training ViLegalLM/`:

- `ViLegalBERT/` — MLM pre-training scripts and configs (PhoBERT-based)
- `ViLegalQwen/` — CLM pre-training scripts and configs (Qwen-based)
All fine-tuning code is provided as Jupyter Notebooks in `Source codes/Fine-tuning ViLegalLM/`.
Notation:

- `[FT]` — Discriminative fine-tuning (encoder models)
- `[IFT]` — Instruction fine-tuning with QLoRA (decoder models)
- `[FT-CV]` — 5-fold cross-validation
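The `[FT-CV]` setting's 5-fold split can be sketched as a simple index partition. This is a generic illustration, not the notebooks' exact code:

```python
def kfold_indices(n_examples, n_folds=5):
    """Split range(n_examples) into n_folds contiguous (train, val) index lists."""
    fold_sizes = [n_examples // n_folds + (1 if i < n_examples % n_folds else 0)
                  for i in range(n_folds)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i in range(n_folds):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, val))
    return splits

for train, val in kfold_indices(10):
    print(len(train), len(val))  # 8 2 on each of the 5 folds
```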
```
ViLegalLM/
└── Source codes/
    ├── Pre-training ViLegalLM/      # Pre-training scripts and configs
    │   ├── ViLegalBERT/             # MLM pre-training (PhoBERT-based)
    │   └── ViLegalQwen/             # CLM pre-training (Qwen-based)
    └── Fine-tuning ViLegalLM/       # Fine-tuning notebooks (.ipynb)
        ├── Information Retrieval
        ├── Question Answering       # True/False, MCQ, Extractive, Abstractive
        ├── Natural Language Inference
        └── Syllogism Reasoning
```
UPDATE SOON!