ViLegalLM: Language Models for Vietnamese Legal Text

This repository contains the source code for ViLegalLM, a suite of language models pre-trained on Vietnamese legal text. Models and datasets are hosted on Hugging Face.

Note: These models are intended for research purposes only and should not be used in production legal applications without expert validation.


Models

| Model | Architecture | Params | Context | HuggingFace |
|-------|--------------|--------|---------|-------------|
| ViLegalBERT | Encoder-only (RoBERTa) | 135M | 256 tokens | ntphuc149/ViLegalBERT |
| ViLegalQwen2.5-1.5B-Base | Decoder-only (Qwen2) | 1.54B | 2,048 tokens | ntphuc149/ViLegalQwen2.5-1.5B-Base |
| ViLegalQwen3-1.7B-Base | Decoder-only (Qwen3) | 1.72B | 4,096 tokens | ntphuc149/ViLegalQwen3-1.7B-Base |

All models are pre-trained on a 16GB Vietnamese legal corpus (ntphuc149/ViLegalText).

Loading Models

ViLegalBERT

ViLegalBERT is usable out of the box for feature extraction and semantic similarity; fine-tuning is required for downstream tasks (NLI, classification, extractive QA).

Important: Input text must be word-segmented before being fed to the model. Use PyVi, underthesea, or VNCoreNLP. The syllables of a multi-syllable word are joined with underscores (e.g., nghiên_cứu_viên). See PhoBERT for details.
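As a rough illustration of the required input format (not a replacement for PyVi, underthesea, or VNCoreNLP), a toy longest-match segmenter over a made-up mini-lexicon shows how raw syllables become underscore-joined words:

```python
# Toy illustration of the word-segmented input format only.
# Real pipelines should use PyVi, underthesea, or VNCoreNLP;
# this mini-lexicon is a made-up example, not a real resource.
LEXICON = {"luật sư", "tư vấn", "pháp luật", "dân sự", "nghiên cứu viên"}
MAX_WORD_LEN = 3  # longest lexicon entry, in syllables

def segment(text):
    """Greedy longest-match segmentation; joins word syllables with '_'."""
    syllables = text.split()
    out, i = [], 0
    while i < len(syllables):
        for n in range(min(MAX_WORD_LEN, len(syllables) - i), 0, -1):
            cand = " ".join(syllables[i:i + n])
            if n == 1 or cand in LEXICON:
                out.append(cand.replace(" ", "_"))
                i += n
                break
    return " ".join(out)

print(segment("luật sư tư vấn pháp luật dân sự"))
# luật_sư tư_vấn pháp_luật dân_sự
```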

```python
import torch
from transformers import AutoModel, AutoTokenizer

# ViLegalBERT reuses the PhoBERT tokenizer (the model is PhoBERT-based)
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModel.from_pretrained("ntphuc149/ViLegalBERT")

# Input text MUST be word-segmented (multi-syllable words joined by underscores)
sentence = "luật_sư tư_vấn pháp_luật dân_sự ."

input_ids = torch.tensor([tokenizer.encode(sentence)])

with torch.no_grad():
    features = model(input_ids)  # features.last_hidden_state holds token embeddings
```
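For semantic similarity, one common recipe (an assumption here, not something the repository prescribes) is to mean-pool the token embeddings from the final hidden state into one vector per sentence and compare sentences by cosine similarity. With small numpy arrays standing in for the model's last hidden state:

```python
import numpy as np

def mean_pool(hidden_states):
    """Average token vectors of shape (seq_len, hidden) into one sentence vector."""
    return hidden_states.mean(axis=0)

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-ins for model(input_ids).last_hidden_state[0] of two sentences
emb_a = mean_pool(np.array([[1.0, 0.0], [0.0, 1.0]]))  # -> [0.5, 0.5]
emb_b = mean_pool(np.array([[2.0, 2.0]]))              # -> [2.0, 2.0]
print(cosine(emb_a, emb_b))  # ~1.0 (same direction)
```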

ViLegalQwen2.5-1.5B-Base and ViLegalQwen3-1.7B-Base

Both are base models trained with a causal language modeling (CLM) objective. Fine-tuning or instruction tuning (e.g., with QLoRA) is required before they can be used for downstream tasks.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# ViLegalQwen2.5-1.5B-Base
tokenizer = AutoTokenizer.from_pretrained("ntphuc149/ViLegalQwen2.5-1.5B-Base")
model = AutoModelForCausalLM.from_pretrained("ntphuc149/ViLegalQwen2.5-1.5B-Base")

# ViLegalQwen3-1.7B-Base
tokenizer = AutoTokenizer.from_pretrained("ntphuc149/ViLegalQwen3-1.7B-Base")
model = AutoModelForCausalLM.from_pretrained("ntphuc149/ViLegalQwen3-1.7B-Base")
```
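A QLoRA instruction-tuning setup for these base models typically combines 4-bit quantization with low-rank adapters. The fragment below is an illustrative sketch; the hyperparameters (rank, alpha, dropout, target modules) are assumptions, not the repository's actual settings:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization for the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Low-rank adapters on the attention projections (illustrative choices)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

The quantization config is passed to `from_pretrained` and the LoRA config to `peft.get_peft_model`, so only the small adapter weights are trained.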

Datasets

All three datasets are synthetic, built by LLM generation combined with negative sampling.

| Dataset | Task | Train | Val | HuggingFace |
|---------|------|-------|-----|-------------|
| ViLegalTF | True/False QA | 13,032 | 388 | ntphuc149/ViLegalTF |
| ViLegalMCQ | Multiple-Choice QA | 14,920 | 300 | ntphuc149/ViLegalMCQ |
| ViLegalNLI | Natural Language Inference | 7,660 | 150 | ntphuc149/ViLegalNLI |
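An instruction-tuning notebook would render each dataset row into a prompt string. The sketch below is a hypothetical template for a ViLegalTF-style true/false item; the field names (`context`, `question`, `answer`) are assumptions, not the dataset's documented schema:

```python
def format_tf_example(example):
    """Render a hypothetical true/false QA item as a Vietnamese instruction prompt."""
    return (
        "Dựa vào văn bản pháp luật sau, hãy trả lời Đúng hoặc Sai.\n"
        f"Văn bản: {example['context']}\n"
        f"Nhận định: {example['question']}\n"
        f"Trả lời: {example['answer']}"
    )

demo = {
    "context": "Điều 1. ...",
    "question": "Nhận định ví dụ.",
    "answer": "Đúng",
}
print(format_tf_example(demo))
```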

Quick Start

Pre-training

All pre-training code is in Source codes/Pre-training ViLegalLM/:

  • ViLegalBERT/ — MLM pre-training scripts and configs (PhoBERT-based)
  • ViLegalQwen/ — CLM pre-training scripts and configs (Qwen-based)

Fine-tuning

All fine-tuning code is provided as Jupyter Notebooks in Source codes/Fine-tuning ViLegalLM/.

Notation:

  • [FT] — Discriminative fine-tuning (encoder models)
  • [IFT] — Instruction fine-tuning with QLoRA (decoder models)
  • [FT-CV] — 5-fold cross-validation
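The 5-fold cross-validation in the [FT-CV] notebooks amounts to partitioning the example indices into five disjoint folds, each serving once as the validation split. A minimal stdlib sketch of that splitting (in practice a library helper such as scikit-learn's KFold would be used):

```python
def k_fold_indices(n_examples, k=5):
    """Split indices 0..n-1 into k (train, val) pairs with disjoint val folds."""
    indices = list(range(n_examples))
    fold_size, remainder = divmod(n_examples, k)
    folds, start = [], 0
    for i in range(k):
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:end])
        start = end
    # Each fold is the validation set once; the rest form the training set
    return [
        ([idx for j, f in enumerate(folds) if j != i for idx in f], folds[i])
        for i in range(k)
    ]

for train_idx, val_idx in k_fold_indices(10, k=5):
    print(len(train_idx), len(val_idx))  # 8 2 for every fold
```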

Repository Structure

ViLegalLM/
└── Source codes/
    ├── Pre-training ViLegalLM/     # Pre-training scripts and configs
    │   ├── ViLegalBERT/            # MLM pre-training (PhoBERT-based)
    │   └── ViLegalQwen/            # CLM pre-training (Qwen-based)
    └── Fine-tuning ViLegalLM/      # Fine-tuning notebooks (.ipynb)
        ├── Information Retrieval
        ├── Question Answering      # True/False, MCQ, Extractive, Abstractive
        ├── Natural Language Inference
        └── Syllogism Reasoning

Citation

Citation details will be added soon.