
Automatically Parse any Document

Train Detectron2 based Custom Models for Document Layout Parsing

Nasheed Yasin

Medium Post | July 1st, 2021 | 5 min read

Various layouts one encounters in everyday (and sometimes not so everyday) life, with the code to parse them using the layoutparser library superimposed on top. Image by Layout Parser on GitHub

Documents have been ubiquitous ever since humans first developed written script. Magazines, agreements, historical archives, pamphlets at the local store, tax forms, property deeds, college application forms and so on. Processing these documents has been a largely manual task thus far, with automation only beginning to take over in the last few decades. This automation journey has long been impeded by a crucial pitfall: computers can’t understand layouts as intuitively as humans.

With the advent of modern Computer Vision, all this changed. We now have models that can accurately locate, represent and understand the components of a document’s layout. But these models are rather abstract to the average automation enthusiast, usually requiring comprehensive knowledge of Python to even contemplate understanding the documentation, let alone use one in a project.

The Layout Parser logo. Credits: Layout Parser

To that end, Layout Parser, as explained in their really cool paper, alleviates this complexity with a clean API that enables complete end-to-end layout detection, parsing and understanding in just a few (and I mean really few, like 5) lines of code. They offer a bunch of models that can be used straight out of the box. All in all, a super cool tool.

Now, how can one use this tool to understand and work on custom layouts beyond the capabilities of the pre-trained models?

Images of two scientific papers with their layout components marked. Image by author

The obvious thought would be to fine-tune an existing layout model on your custom layouts.

And you are right, it is the best way to go, especially given that not all of us have access to the hardware firepower required to train such models from scratch.

While the finetuning process is a tad more technically involved than just using a pre-trained model, a handy repository created by the authors of Layout Parser alleviates some of this complexity by largely handling the thornier bits of the training/finetuning workflow.

An intricate arrangement of gears, shaped much like a human brain, signifying intelligent design. GIF by Gareth Fowler on Tumblr

In the following sections we go through a comprehensive tutorial on using this repository to train your own custom models.

Prerequisites

A checklist animation, where the boxes are checked sequentially. Image by Navanshu Agarwal

  1. Python ≥ 3.6
  2. Detectron2 forked or cloned from the master branch.*
  3. Latest version of Layout Parser and its dependencies.
  4. Pytorch (Linux: 1.6+ or Windows: 1.6)**
  5. CUDA toolkit: 10+ (As compatible with Pytorch)***
  6. Dataset: Annotated in COCO format.

Caveats

*Detectron2 is not easily installed on Windows systems; please refer to this fantastic post by ivanapp for guidance on the Windows installation process.

**Though 1.8 is recommended in the official docs, Windows users should stick to 1.6.

***CUDA is not mandatory; one could theoretically train on a CPU as well, though such a run would be painfully slow.
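Before installing anything, the Python-version requirement above can be sanity-checked with a few lines of stdlib-only code (a convenience sketch, not part of the official setup; the function name is my own):

```python
import sys

def meets_min_python(major=3, minor=6):
    """True if the running interpreter satisfies the minimum version above."""
    return sys.version_info >= (major, minor)

if not meets_min_python():
    raise RuntimeError("Python >= 3.6 is required for this tutorial")
```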

Step 1: Basic Setup

Step 2: Splitting the Dataset (Optional)

  • Packaged in the layout-model-training repo is an inbuilt script (utils\cocosplit.py) to perform dataset splitting into test and train subsets.
  • If the dataset contains images without tagged regions, the script ensures that the ratio of tagged to untagged images is the same in the train and test subsets.
  • Use the command below to split your dataset (assuming the working directory is as instructed in the previous step).
<script src="https://gist.github.com/nasheedyasin/39ed0fe84ae2167393c96d7029ede9cf.js" charset="utf-8"></script>

Note that the above command is for a Windows 10 system; alter the path separators to suit your operating system.

Argument Explanation

  • annotation_path: The path to where the consolidated dataset lies.
  • train and test: The paths where the train/test subsets should be saved.
  • split-ratio: The fraction of the consolidated dataset to be allocated for training.
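The ratio-preserving behaviour described above can be sketched in plain Python. This is a simplified stand-in for utils\cocosplit.py, not the script's actual code (the function name and structure here are illustrative): tagged and untagged images are split separately so each subset keeps the same mix.

```python
import random

def split_coco_images(images, annotated_ids, split_ratio, seed=42):
    """Split images into (train, test), preserving the tagged/untagged mix.

    images:        list of COCO image dicts (each with an 'id' key)
    annotated_ids: set of image ids that have at least one annotation
    split_ratio:   fraction of each group assigned to train
    """
    rng = random.Random(seed)
    tagged = [im for im in images if im["id"] in annotated_ids]
    untagged = [im for im in images if im["id"] not in annotated_ids]
    train, test = [], []
    for group in (tagged, untagged):
        group = group[:]          # avoid mutating the caller's lists
        rng.shuffle(group)
        cut = int(len(group) * split_ratio)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test
```

Because each group is cut at the same ratio, a dataset that is, say, 80% tagged stays 80% tagged in both subsets.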

Step 3: Download the Pretrained Model

  • Download a pre-trained model and its related config file from Layout Parser’s model zoo.
  • The download consists of two files:
  1. model_final.pth: This is the pre-trained model’s weights.
  2. config.yaml: This is the pre-trained model’s configuration. For information about the configuration file, refer to the Detectron2 Docs.
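A small stdlib-only helper can confirm the download landed correctly before training starts (a convenience sketch of my own; the directory name is a hypothetical example):

```python
from pathlib import Path

def pretrained_files_present(model_dir):
    """Report whether both files expected from this step exist in model_dir."""
    d = Path(model_dir)
    return {name: (d / name).is_file()
            for name in ("model_final.pth", "config.yaml")}

# Example: pretrained_files_present("downloads/publaynet_model")
```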

Step 4: Training the Model

Now that the dataset is split and the pretrained model weights are downloaded, let’s get to the juicy part: model training (or rather finetuning).

  • The training is done using the training script at tools\train_net.py
  • Use the command below to train the model.
<script src="https://gist.github.com/nasheedyasin/5bf6754ae2ef20d7e1f2d3142a93797e.js" charset="utf-8"></script>

Note that the above command is for a Windows 10 system; alter the path separators to suit your operating system.

Argument Explanation

  • dataset_name: The name of the custom dataset (One can name it as one pleases).
  • json_annotation_train: The path to the training annotations.
  • json_annotation_val: The path to the testing annotations.
  • image_path_train: The path to the training images.
  • image_path_val: The path to the testing images.
  • config-file: The path to the model configuration file downloaded in Step 3.

Note that the remaining argument-value pairs are config modifications and are sometimes specific to the use case. For clarity on how to use and set them, refer to the Detectron2 Docs.

  • The finetuned model along with its config file, training metrics and logs will be saved in the output path as indicated by the OUTPUT_DIR in the command above.
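For reference, the arguments listed above assemble into an invocation like the following. Every value is a hypothetical placeholder (substitute your own paths and dataset name); only the flag names are taken from the argument list above.

```python
# Assemble the tools\train_net.py invocation from the documented arguments.
train_cmd = [
    "python", "tools/train_net.py",
    "--dataset_name", "my_layout_dataset",
    "--json_annotation_train", "data/train.json",
    "--json_annotation_val", "data/test.json",
    "--image_path_train", "data/images",
    "--image_path_val", "data/images",
    "--config-file", "config.yaml",
    # Trailing pairs are Detectron2 config overrides (see the note above):
    "OUTPUT_DIR", "outputs/my_model",
]
print(" ".join(train_cmd))
```

In practice the list could be passed straight to subprocess.run(train_cmd), or joined into a shell command as printed.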

Step 5: Inference

With the finetuned model in hand, parsing documents is a straightforward task.

  • Replace the model initialization in Layout Parser’s demo with the code below.
<script src="https://gist.github.com/nasheedyasin/989dac825d3293d89447e51e8a1fd159.js" charset="utf-8"></script>

Note that the above paths are for a Windows 10 system; alter the path separators to suit your operating system.

  • custom_label_map is the int_label -> text_label mapping. It is built in accordance with the 'categories' field in the training data’s COCO JSON, as {'id': 'name'} for each category. For instance:

custom_label_map = {0: "layout_class_1", 1: "layout_class_2"}
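The mapping can be built directly from the COCO file's 'categories' field with a dictionary comprehension (toy inline data shown here; in practice, load your own training JSON from disk):

```python
import json

# Toy stand-in for the training data's COCO JSON.
coco_json = json.loads("""
{"categories": [{"id": 0, "name": "layout_class_1"},
                {"id": 1, "name": "layout_class_2"}]}
""")

custom_label_map = {c["id"]: c["name"] for c in coco_json["categories"]}
print(custom_label_map)  # {0: 'layout_class_1', 1: 'layout_class_2'}
```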

Conclusion

All in all, custom models can easily be trained on any dataset using the layout-model-training repo. Such models can then be used to parse and understand a wide variety of documents with relative ease.

References

[1] Y. Wu, A. Kirillov, F. Massa, W. Y. Lo and R. Girshick, Detectron2: Facebook AI Research’s next generation library that provides state-of-the-art detection and segmentation algorithms (2019), GitHub Repo

[2] Z. Shen, R. Zhang, M. Dell, B. C. G. Lee, J. Carlson and W. Li, LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis (2021), arXiv preprint arXiv:2103.15348

[3] T. Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick and P. Dollár, Microsoft COCO: Common Objects in Context (2015), arXiv preprint arXiv:1405.0312v3