Layout Detection Algorithm#
Introduction#
Layout detection is a fundamental task in document content extraction, aiming to locate different types of regions on a page, such as images, tables, text, and headings, to facilitate high-quality content extraction. For text and heading regions, OCR models can be used for text recognition, while table regions can be converted using table recognition models.
Model Usage#
Layout detection supports the following models:
| Model | Description | Characteristics | Model weight | Config file |
|---|---|---|---|---|
| DocLayout-YOLO | Improved version of YOLO-v10: 1. Generates diverse pre-training data to improve generalization across document types. 2. Improves the model architecture to better perceive instances of varying scale. See DocLayout-YOLO for details. | Speed: Fast, Accuracy: High | doclayout_yolo_ft.pt | layout_detection.yaml |
| YOLO-v10 | Base YOLO-v10 model | Speed: Fast, Accuracy: Moderate | yolov10l_ft.pt | layout_detection_yolo.yaml |
| LayoutLMv3 | Base LayoutLMv3 model | Speed: Slow, Accuracy: High | layoutlmv3_ft | layout_detection_layoutlmv3.yaml |
Once the environment is set up, you can run layout detection by executing scripts/layout_detection.py directly.
Run demo
$ python scripts/layout_detection.py --config configs/layout_detection.yaml
Model Configuration#
1. DocLayout-YOLO / YOLO-v10
inputs: assets/demo/layout_detection
outputs: outputs/layout_detection
tasks:
  layout_detection:
    model: layout_detection_yolo
    model_config:
      img_size: 1024
      conf_thres: 0.25
      iou_thres: 0.45
      model_path: path/to/doclayout_yolo_model
      visualize: True
- inputs/outputs: Define the input file path and the directory for visualization output.
- tasks: Define the task type; currently only the layout detection task is included.
- model: Specify the model type, e.g., layout_detection_yolo.
- model_config: Define the model configuration.
  - img_size: Define the size of the image's long edge; the short edge is scaled proportionally. The default long edge is 1024.
  - conf_thres: Define the confidence threshold; only detections with a confidence above this threshold are kept.
  - iou_thres: Define the IoU threshold; detections whose overlap exceeds this threshold are removed.
  - model_path: Path to the model weights.
  - visualize: Whether to visualize the model results; visualized results are saved in the outputs directory.
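To make the three numeric parameters concrete, here is a minimal Python sketch of what they describe. It is not PDF-Extract-Kit's implementation: the helper names (`scaled_size`, `iou`, `filter_detections`) are hypothetical, and the greedy overlap removal is an assumption about how "removing targets with an overlap greater than this threshold" behaves.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def scaled_size(width: int, height: int, img_size: int = 1024) -> Tuple[int, int]:
    """Resize so the long edge equals img_size, keeping the aspect ratio."""
    scale = img_size / max(width, height)
    return round(width * scale), round(height * scale)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(
    dets: List[Tuple[Box, float]],
    conf_thres: float = 0.25,
    iou_thres: float = 0.45,
) -> List[Tuple[Box, float]]:
    """Keep confident boxes, then greedily drop boxes that overlap a kept one."""
    # Step 1: discard detections at or below the confidence threshold.
    candidates = [d for d in dets if d[1] > conf_thres]
    # Step 2: visit the most confident boxes first.
    candidates.sort(key=lambda d: d[1], reverse=True)
    kept: List[Tuple[Box, float]] = []
    for box, score in candidates:
        # Keep a box only if it does not overlap any kept box too strongly.
        if all(iou(box, k[0]) <= iou_thres for k in kept):
            kept.append((box, score))
    return kept
```

Under this reading, raising conf_thres yields fewer, more certain regions, while lowering iou_thres removes overlapping boxes more aggressively.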
2. LayoutLMv3
Note
LayoutLMv3 cannot run directly by default. Please follow the steps below to modify the configuration:
Detectron2 Environment Setup
# For Linux
pip install https://wheels-1251341229.cos.ap-shanghai.myqcloud.com/assets/whl/detectron2/detectron2-0.6-cp310-cp310-linux_x86_64.whl
# For macOS
pip install https://wheels-1251341229.cos.ap-shanghai.myqcloud.com/assets/whl/detectron2/detectron2-0.6-cp310-cp310-macosx_10_9_universal2.whl
# For Windows
pip install https://wheels-1251341229.cos.ap-shanghai.myqcloud.com/assets/whl/detectron2/detectron2-0.6-cp310-cp310-win_amd64.whl
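The three wheels differ only in their platform tag, and the `cp310` tag means they are built for CPython 3.10. As a rough sketch, the right URL could be picked programmatically as shown below; the function names `detectron2_wheel_url` and `is_cp310` are hypothetical helpers, not part of PDF-Extract-Kit.

```python
import platform
import sys

WHEEL_BASE = "https://wheels-1251341229.cos.ap-shanghai.myqcloud.com/assets/whl/detectron2/"

# Keys are the values returned by platform.system().
WHEELS = {
    "Linux": "detectron2-0.6-cp310-cp310-linux_x86_64.whl",
    "Darwin": "detectron2-0.6-cp310-cp310-macosx_10_9_universal2.whl",
    "Windows": "detectron2-0.6-cp310-cp310-win_amd64.whl",
}


def detectron2_wheel_url(system=None):
    """Return the wheel URL for the given OS (defaults to the current one)."""
    system = system or platform.system()
    if system not in WHEELS:
        raise ValueError(f"No prebuilt Detectron2 wheel listed for {system!r}")
    return WHEEL_BASE + WHEELS[system]


def is_cp310(version_info=sys.version_info):
    """The listed wheels carry the cp310 tag, i.e. they target CPython 3.10."""
    return tuple(version_info[:2]) == (3, 10)
```

Calling `detectron2_wheel_url()` returns the URL to pass to `pip install`; checking `is_cp310()` first avoids a failed install on another Python version.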
Enable LayoutLMv3 Registration Code
Uncomment the LayoutLMv3 import and its entry in __all__, so that the code reads:
from pdf_extract_kit.tasks.layout_detection.models.yolo import LayoutDetectionYOLO
from pdf_extract_kit.tasks.layout_detection.models.layoutlmv3 import LayoutDetectionLayoutlmv3
from pdf_extract_kit.registry.registry import MODEL_REGISTRY
__all__ = [
    "LayoutDetectionYOLO",
    "LayoutDetectionLayoutlmv3",
]
inputs: assets/demo/layout_detection
outputs: outputs/layout_detection
tasks:
  layout_detection:
    model: layout_detection_layoutlmv3
    model_config:
      model_path: path/to/layoutlmv3_model
- inputs/outputs: Define the input file path and the directory for visualization output.
- tasks: Define the task type; currently only the layout detection task is included.
- model: Specify the model type, e.g., layout_detection_layoutlmv3.
- model_config: Define the model configuration.
  - model_path: Path to the model weights.
Diverse Input Support#
The layout detection script in PDF-Extract-Kit supports input formats such as a single image, a directory containing only image files, a single PDF file, and a directory containing only PDF files.
Note
Modify the path to inputs in configs/layout_detection.yaml according to your actual data format:
- Single image: path/to/image
- Image directory: path/to/images
- Single PDF file: path/to/pdf
- PDF directory: path/to/pdfs
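As an illustration of the four input kinds, the sketch below classifies a path before it is written into the config. The helper `classify_input` and the set of image suffixes are assumptions made for this example; the extensions the script actually accepts may differ.

```python
from pathlib import Path

# Assumed image extensions, for illustration only.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def classify_input(path):
    """Return 'image', 'pdf', 'image_dir' or 'pdf_dir' for a given path."""
    p = Path(path)
    if p.is_file():
        suffix = p.suffix.lower()
        if suffix == ".pdf":
            return "pdf"
        if suffix in IMAGE_SUFFIXES:
            return "image"
        raise ValueError(f"Unsupported file type: {p.name}")
    if p.is_dir():
        # Collect the suffixes of all regular files in the directory.
        suffixes = {f.suffix.lower() for f in p.iterdir() if f.is_file()}
        if suffixes and suffixes <= {".pdf"}:
            return "pdf_dir"
        if suffixes and suffixes <= IMAGE_SUFFIXES:
            return "image_dir"
        # Directories must contain only images or only PDFs.
        raise ValueError(f"Directory is empty or mixes file types: {p}")
    raise FileNotFoundError(path)
```

A directory that mixes images and PDFs is rejected here, matching the requirement that a directory contain only one kind of file.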
Note
When using PDF as input, you need to change predict_images to predict_pdfs in layout_detection.py.
# for image detection
detection_results = model_layout_detection.predict_images(input_data, result_path)
Change to:
# for pdf detection
detection_results = model_layout_detection.predict_pdfs(input_data, result_path)
Viewing Visualization Results#
When visualize is set to True in the config file, the visualization results will be saved in the outputs directory.
Note
Visualization helps when analyzing model results, but for large-scale tasks it is recommended to turn it off (set visualize to False) to reduce memory and disk usage.
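For batch runs, the flag could also be switched off programmatically once the config has been loaded into a plain dictionary. The sketch below assumes such a dictionary already exists (for example, produced by a YAML parser); `disable_visualization` is a hypothetical helper that sets every `visualize` key to False, wherever it is nested, without modifying the original config.

```python
import copy


def disable_visualization(config):
    """Return a copy of config with every 'visualize' key set to False."""
    cfg = copy.deepcopy(config)

    def _walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "visualize":
                    node[key] = False  # overwrite the value in the copy only
                else:
                    _walk(value)

    _walk(cfg)
    return cfg
```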