Token classification using PhoBERT models for 🇻🇳 Vietnamese
Get started in seconds with verified environments. Run the script below to install all dependencies:

```bash
bash ./install_dependencies.sh
```

The input data format of 🍜VPhoBertTagger follows the VLSP-2016 format: four columns separated by a tab character, containing the word, POS tag, chunk tag, and named-entity tag. Each segmented word is placed on a separate line, and an empty line follows each sentence. For details, see the sample data in the `datasets/samples` directory. The table below shows an example Vietnamese sentence from the dataset.
| Word | POS | Chunk | NER |
|---|---|---|---|
| Dương | Np | B-NP | B-PER |
| là | V | B-VP | O |
| một | M | B-NP | O |
| chủ | N | B-NP | O |
| cửa hàng | N | B-NP | O |
| lâu | A | B-AP | O |
| năm | N | B-NP | O |
| ở | E | B-PP | O |
| Hà Nội | Np | B-NP | B-LOC |
| . | CH | O | O |
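The format above can be parsed with a few lines of Python. The sketch below is illustrative only (the `read_vlsp2016` helper is our own naming, not part of the project): it groups tab-separated lines into sentences at blank-line boundaries.

```python
# Sketch: parse the VLSP-2016 four-column format described above.
# Columns are word, POS, chunk, and NER tag, separated by tabs;
# an empty line marks a sentence boundary.
def read_vlsp2016(lines):
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # blank line: end of sentence
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split("\t")
        current.append((word, pos, chunk, ner))
    if current:                           # flush a trailing sentence
        sentences.append(current)
    return sentences

sample = [
    "Dương\tNp\tB-NP\tB-PER",
    "là\tV\tB-VP\tO",
    "",
]
print(read_vlsp2016(sample))
# → [[('Dương', 'Np', 'B-NP', 'B-PER'), ('là', 'V', 'B-VP', 'O')]]
```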
The dataset must be placed in a directory with the structure below:

```
├── data_dir
|   └── train.txt
|   └── dev.txt
|   └── test.txt
```
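A quick sanity check of this layout before training can save a failed run. The `check_data_dir` helper below is hypothetical (not part of the project) and simply reports which of the three expected split files are missing:

```python
from pathlib import Path

# Sketch: verify that data_dir contains the three expected split files.
def check_data_dir(data_dir):
    data_dir = Path(data_dir)
    return [name for name in ("train.txt", "dev.txt", "test.txt")
            if not (data_dir / name).is_file()]

missing = check_data_dir("./datasets/vlsp2016")
if missing:
    print("missing split files:", missing)
```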
The commands below fine-tune PhoBERT for the token-classification task. Models are downloaded automatically from the Hugging Face hub.

```bash
python main.py train --task vlsp2016 --run_test --data_dir ./datasets/vlsp2016 --model_name_or_path vinai/phobert-base --model_arch softmax --output_dir outputs --max_seq_length 256 --train_batch_size 32 --eval_batch_size 32 --learning_rate 3e-5 --epochs 20 --early_stop 2 --overwrite_data
```

or

```bash
bash ./train.sh
```

Arguments:
- `type` (`str`, *required*): The process type to run. Must be one of [`train`, `test`, `predict`, `demo`].
- `task` (`str`, *optional*): Training task, selected from [`vlsp2016`, `vlsp2018_l1`, `vlsp2018_l2`, `vlsp2018_join`]. Default: `vlsp2016`.
- `data_dir` (`Union[str, os.PathLike]`, *required*): The input data directory. Should contain the `.csv` files (or other data files) for the task.
- `overwrite_data` (`bool`, *optional*): Whether to overwrite the split dataset. Default: `False`.
- `load_weights` (`Union[str, os.PathLike]`, *optional*): Path to a pre-trained weights file.
- `model_name_or_path` (`str`, *required*): Pre-trained model, selected from [`vinai/phobert-base`, `vinai/phobert-large`, ...]. Default: `vinai/phobert-base`.
- `model_arch` (`str`, *required*): Token-classification model architecture, selected from [`softmax`, `crf`, `lstm_crf`].
- `output_dir` (`Union[str, os.PathLike]`, *required*): The output directory where model predictions and checkpoints will be written.
- `max_seq_length` (`int`, *optional*): The maximum total input sequence length after WordPiece tokenization. Longer sequences are truncated, shorter ones padded. Default: `190`.
- `train_batch_size` (`int`, *optional*): Total batch size for training. Default: `32`.
- `eval_batch_size` (`int`, *optional*): Total batch size for evaluation. Default: `32`.
- `learning_rate` (`float`, *optional*): The initial learning rate for Adam. Default: `1e-4`.
- `classifier_learning_rate` (`float`, *optional*): The initial classifier learning rate for Adam. Default: `5e-4`.
- `epochs` (`float`, *optional*): Total number of training epochs to perform. Default: `100.0`.
- `weight_decay` (`float`, *optional*): Weight decay, if applied. Default: `0.01`.
- `adam_epsilon` (`float`, *optional*): Epsilon for the Adam optimizer. Default: `5e-8`.
- `max_grad_norm` (`float`, *optional*): Max gradient norm. Default: `1.0`.
- `early_stop` (`float`, *optional*): Number of steps without improvement before early stopping. Default: `10.0`.
- `no_cuda` (`bool`, *optional*): Whether to disable CUDA even when it is available. Default: `False`.
- `run_test` (`bool`, *optional*): Whether to evaluate the best model on the test set after training. Default: `False`.
- `seed` (`int`, *optional*): Random seed for initialization. Default: `42`.
- `num_workers` (`int`, *optional*): How many subprocesses to use for data loading. `0` means the data is loaded in the main process. Default: `0`.
- `save_step` (`int`, *optional*): The number of steps after which the model is saved. Default: `10000`.
- `gradient_accumulation_steps` (`int`, *optional*): Number of update steps to accumulate before performing a backward/update pass. Default: `1`.
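For illustration, the options above could be declared with `argparse` roughly as follows. This is a simplified sketch, not the project's actual parser, and only a subset of options is shown:

```python
import argparse

# Simplified sketch of the CLI options documented above
# (assumed declarations; main.py's real parser may differ).
parser = argparse.ArgumentParser(description="VPhoBertTagger")
parser.add_argument("type", choices=["train", "test", "predict", "demo"])
parser.add_argument("--task", default="vlsp2016",
                    choices=["vlsp2016", "vlsp2018_l1",
                             "vlsp2018_l2", "vlsp2018_join"])
parser.add_argument("--data_dir", required=True)
parser.add_argument("--model_name_or_path", default="vinai/phobert-base")
parser.add_argument("--model_arch", choices=["softmax", "crf", "lstm_crf"])
parser.add_argument("--max_seq_length", type=int, default=190)
parser.add_argument("--train_batch_size", type=int, default=32)
parser.add_argument("--learning_rate", type=float, default=1e-4)
parser.add_argument("--epochs", type=float, default=100.0)

args = parser.parse_args(
    "train --task vlsp2016 --data_dir ./datasets/vlsp2016 "
    "--model_arch softmax --learning_rate 3e-5".split()
)
print(args.task, args.learning_rate)  # → vlsp2016 3e-05
```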
The command below starts TensorBoard to help you monitor the fine-tuning process:

```bash
tensorboard --logdir runs --host 0.0.0.0 --port=6006
```

All experiments were performed on an RTX 3090 with 24 GB VRAM and a Xeon® E5-2678 v3 CPU with 64 GB RAM, both of which are available for rent on vast.ai. The pre-trained models used for comparison are available on Hugging Face.
| Model | Arch | BIO Accuracy | BIO Precision | BIO Recall | BIO F1 | Accuracy (w/o 'O') | NE Accuracy | NE Precision | NE Recall | NE F1 | Log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bert-base-multilingual-cased [1] | Softmax | 0.9905 | 0.9239 | 0.8776 | 0.8984 | 0.9068 | 0.9905 | 0.8938 | 0.8941 | 0.8939 | Matrix / Log |
| | CRF | 0.9903 | 0.9241 | 0.8880 | 0.9048 | 0.9087 | 0.9903 | 0.8951 | 0.8945 | 0.8948 | Matrix / Log |
| | LSTM_CRF | 0.9905 | 0.9183 | 0.8898 | 0.9027 | 0.9178 | 0.9905 | 0.8879 | 0.8992 | 0.8935 | Matrix / Log |
| PhoBert-base [2] | Softmax | 0.9950 | 0.9312 | 0.9404 | 0.9348 | 0.9570 | 0.9950 | 0.9434 | 0.9466 | 0.9450 | Matrix / Log |
| | CRF | 0.9949 | 0.9497 | 0.9248 | 0.9359 | 0.9525 | 0.9949 | 0.9516 | 0.9456 | 0.9486 | Matrix / Log |
| | LSTM_CRF | 0.9949 | 0.9535 | 0.9181 | 0.9349 | 0.9456 | 0.9949 | 0.9520 | 0.9396 | 0.9457 | Matrix / Log |
| viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Model | Arch | BIO Accuracy | BIO Precision | BIO Recall | BIO F1 | Accuracy (w/o 'O') | NE Accuracy | NE Precision | NE Recall | NE F1 | Epoch |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bert-base-multilingual-cased [1] | Softmax | 0.9828 | 0.7421 | 0.7980 | 0.7671 | 0.8510 | 0.9828 | 0.7302 | 0.8339 | 0.7786 | Matrix / Log |
| | CRF | 0.9824 | 0.7716 | 0.7619 | 0.7601 | 0.8284 | 0.9824 | 0.7542 | 0.8127 | 0.7824 | Matrix / Log |
| | LSTM_CRF | 0.9829 | 0.7533 | 0.7750 | 0.7626 | 0.8296 | 0.9829 | 0.7612 | 0.8122 | 0.7859 | Matrix / Log |
| PhoBert-base [2] | Softmax | 0.9896 | 0.7970 | 0.8404 | 0.8170 | 0.8892 | 0.9896 | 0.8421 | 0.8942 | 0.8674 | Matrix / Log |
| | CRF | 0.9903 | 0.8124 | 0.8428 | 0.8260 | 0.8834 | 0.9903 | 0.8695 | 0.8943 | 0.8817 | Matrix / Log |
| | LSTM_CRF | 0.9901 | 0.8240 | 0.8278 | 0.8241 | 0.8715 | 0.9901 | 0.8671 | 0.8773 | 0.8721 | Matrix / Log |
| viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Model | Arch | BIO Accuracy | BIO Precision | BIO Recall | BIO F1 | Accuracy (w/o 'O') | NE Accuracy | NE Precision | NE Recall | NE F1 | Epoch |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bert-base-multilingual-cased [1] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| PhoBert-base [2] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Model | Arch | BIO Accuracy | BIO Precision | BIO Recall | BIO F1 | Accuracy (w/o 'O') | NE Accuracy | NE Precision | NE Recall | NE F1 | Epoch |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bert-base-multilingual-cased [1] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| PhoBert-base [2] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
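The NE-Metrics columns above score at the entity level: BIO tags are grouped into typed spans, and precision/recall/F1 are computed over exact span matches. The simplified sketch below illustrates the idea in pure Python; the project may instead use a library such as seqeval:

```python
# Sketch: entity-level precision/recall/F1 over BIO tag sequences.
def bio_to_spans(tags):
    """Group BIO tags into (type, start, end) spans, end exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):   # sentinel flushes last span
        if tag == "O" or tag.startswith("B-") or \
           (tag.startswith("I-") and tag[2:] != etype):
            if start is not None:
                spans.append((etype, start, i))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, etype = i, tag[2:]        # tolerate I- without B-
    return spans

def entity_f1(gold_tags, pred_tags):
    gold, pred = set(bio_to_spans(gold_tags)), set(bio_to_spans(pred_tags))
    tp = len(gold & pred)                    # exact span-and-type matches
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = ["B-PER", "O", "O", "B-LOC", "I-LOC"]
pred = ["B-PER", "O", "B-LOC", "I-LOC", "I-LOC"]
print(entity_f1(gold, pred))
# → 0.5  (the PER span matches; the misaligned LOC span does not)
```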
[1] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT (pp. 4171-4186).
[2] Nguyen, D. Q., & Nguyen, A. T. (2020, November). PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 1037-1042).
[3] The, V. B., Thi, O. T., & Le-Hong, P. (2020). Improving sequence tagging for Vietnamese text using transformer-based neural models. arXiv preprint arXiv:2006.15994.
The command below loads your fine-tuned model and runs inference on your text input:

```bash
python main.py predict --model_path outputs/best_model.pt
```

Arguments:

- `type` (`str`, *required*): The process type to run. Must be one of [`train`, `test`, `predict`, `demo`].
- `model_path` (`Union[str, os.PathLike]`, *optional*): Path to the fine-tuned model file.
- `no_cuda` (`bool`, *optional*): Whether to disable CUDA even when it is available. Default: `False`.
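Conceptually, prediction maps per-token scores from the classification head to BIO labels and pairs them with the input words. The sketch below is purely illustrative (the label set, scores, and `decode` helper are invented for the example; the real model emits logits per subword token and handles alignment):

```python
# Hypothetical sketch of the prediction step for a softmax head:
# pick the highest-scoring label for each word.
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

def decode(words, scores):
    """scores: one list of per-label scores per word."""
    tags = [LABELS[max(range(len(s)), key=s.__getitem__)] for s in scores]
    return list(zip(words, tags))

words = ["Dương", "ở", "Hà_Nội"]
scores = [                       # illustrative, already-normalized scores
    [0.1, 0.8, 0.0, 0.1, 0.0],
    [0.9, 0.0, 0.0, 0.1, 0.0],
    [0.1, 0.0, 0.0, 0.8, 0.1],
]
print(decode(words, scores))
# → [('Dương', 'B-PER'), ('ở', 'O'), ('Hà_Nội', 'B-LOC')]
```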
The command below loads your fine-tuned model and starts the demo page:

```bash
python main.py demo --model_path outputs/best_model.pt
```

Arguments:

- `type` (`str`, *required*): The process type to run. Must be one of [`train`, `test`, `predict`, `demo`].
- `model_path` (`Union[str, os.PathLike]`, *optional*): Path to the fine-tuned model file.
- `no_cuda` (`bool`, *optional*): Whether to disable CUDA even when it is available. Default: `False`.
The pre-trained PhoBERT model is provided by VinAI Research, and the PyTorch implementation by Hugging Face.
