Layout Detection Algorithm

Introduction

Layout detection is a fundamental task in document content extraction, aiming to locate different types of regions on a page, such as images, tables, text, and headings, to facilitate high-quality content extraction. For text and heading regions, OCR models can be used for text recognition, while table regions can be converted using table recognition models.

Model Usage

Layout detection supports the following models:

| Model | Description | Characteristics | Model weight | Config file |
|---|---|---|---|---|
| DocLayout-YOLO | Improved based on YOLO-v10: (1) generates diverse pre-training data to enhance generalization across multiple document types; (2) improves the model architecture to better perceive scale-varying instances. Details in DocLayout-YOLO. | Speed: Fast, Accuracy: High | doclayout_yolo_ft.pt | layout_detection.yaml |
| YOLO-v10 | Base YOLO-v10 model | Speed: Fast, Accuracy: Moderate | yolov10l_ft.pt | layout_detection_yolo.yaml |
| LayoutLMv3 | Base LayoutLMv3 model | Speed: Slow, Accuracy: High | layoutlmv3_ft | layout_detection_layoutlmv3.yaml |

Once the environment is set up, you can perform layout detection by running scripts/layout_detection.py directly.

Run demo

$ python scripts/layout_detection.py --config configs/layout_detection.yaml
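
To try each model from the table, the same script can be pointed at a different config file. The following is a minimal sketch, assuming all three config files live in the configs/ directory (only configs/layout_detection.yaml is confirmed by the command above). The helper names build_command and run_demo are hypothetical and not part of PDF-Extract-Kit.

```python
import subprocess
import sys

# Config file names come from the model table; the configs/ prefix for the
# last two entries is an assumption based on the demo command.
CONFIGS = {
    "DocLayout-YOLO": "configs/layout_detection.yaml",
    "YOLO-v10": "configs/layout_detection_yolo.yaml",
    "LayoutLMv3": "configs/layout_detection_layoutlmv3.yaml",
}


def build_command(model_name):
    """Return the argv list that runs the demo script for one model."""
    config_path = CONFIGS[model_name]
    return [sys.executable, "scripts/layout_detection.py", "--config", config_path]


def run_demo(model_name):
    """Run the demo from the repository root and return its exit code."""
    completed = subprocess.run(build_command(model_name), check=False)
    return completed.returncode
```

Calling run_demo("YOLO-v10") from the repository root would then be equivalent to running the demo command with configs/layout_detection_yolo.yaml.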

Model Configuration

1. DocLayout-YOLO / YOLO-v10

inputs: assets/demo/layout_detection
outputs: outputs/layout_detection
tasks:
  layout_detection:
    model: layout_detection_yolo
    model_config:
      img_size: 1024
      conf_thres: 0.25
      iou_thres: 0.45
      model_path: path/to/doclayout_yolo_model
      visualize: True
  • inputs/outputs: Define the input file path and the directory for visualization output.

  • tasks: Define the task type; currently only the layout detection task is included.

  • model: Specify the specific model type, e.g., layout_detection_yolo.

  • model_config: Define the model configuration.

  • img_size: Define the image long edge size; the short edge will be scaled proportionally based on the long edge, with the default long edge being 1024.

  • conf_thres: Define the confidence threshold; only detections with a confidence above this threshold are kept.

  • iou_thres: Define the IoU threshold; detections that overlap another detection by more than this threshold are removed.

  • model_path: Path to the model weights.

  • visualize: Whether to visualize the model results; visualized results will be saved in the outputs directory.
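
The effect of img_size, conf_thres, and iou_thres can be illustrated in plain Python. This is only a sketch of the ideas described in the list above, not the code the YOLO models actually run, and the real preprocessing may differ (for example, in rounding or padding). The function names and the (x1, y1, x2, y2, score) box format are assumptions made for this example.

```python
def scaled_size(width, height, img_size=1024):
    """Scale so the long edge equals img_size, keeping the aspect ratio."""
    scale = img_size / max(width, height)
    return round(width * scale), round(height * scale)


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2, ...)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_boxes(boxes, conf_thres=0.25, iou_thres=0.45):
    """Drop low-confidence boxes, then drop boxes overlapping a better one."""
    candidates = sorted(
        (b for b in boxes if b[4] > conf_thres), key=lambda b: b[4], reverse=True
    )
    kept = []
    for box in candidates:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(box, k) <= iou_thres for k in kept):
            kept.append(box)
    return kept
```

For example, with boxes (0, 0, 100, 100, 0.9), (5, 5, 100, 100, 0.6), (200, 200, 300, 300, 0.2), and (200, 0, 300, 100, 0.5), filter_boxes keeps the first and the last: the second overlaps the first with an IoU of about 0.90, and the third falls below the confidence threshold.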

2. LayoutLMv3

Note

LayoutLMv3 cannot run directly by default. Please follow the steps below to modify the configuration:

  1. Detectron2 Environment Setup

# For Linux
pip install https://wheels-1251341229.cos.ap-shanghai.myqcloud.com/assets/whl/detectron2/detectron2-0.6-cp310-cp310-linux_x86_64.whl

# For macOS
pip install https://wheels-1251341229.cos.ap-shanghai.myqcloud.com/assets/whl/detectron2/detectron2-0.6-cp310-cp310-macosx_10_9_universal2.whl

# For Windows
pip install https://wheels-1251341229.cos.ap-shanghai.myqcloud.com/assets/whl/detectron2/detectron2-0.6-cp310-cp310-win_amd64.whl

  2. Enable LayoutLMv3 Registration Code

Uncomment the LayoutLMv3 import and the "LayoutDetectionLayoutlmv3" entry so that the registration code reads as follows:

from pdf_extract_kit.tasks.layout_detection.models.yolo import LayoutDetectionYOLO
from pdf_extract_kit.tasks.layout_detection.models.layoutlmv3 import LayoutDetectionLayoutlmv3
from pdf_extract_kit.registry.registry import MODEL_REGISTRY

__all__ = [
   "LayoutDetectionYOLO",
   "LayoutDetectionLayoutlmv3",
]

With the registration code enabled, a LayoutLMv3 config file looks like this:

inputs: assets/demo/layout_detection
outputs: outputs/layout_detection
tasks:
  layout_detection:
    model: layout_detection_layoutlmv3
    model_config:
      model_path: path/to/layoutlmv3_model
  • inputs/outputs: Define the input file path and the directory for visualization output.

  • tasks: Define the task type; currently only the layout detection task is included.

  • model: Specify the specific model type, e.g., layout_detection_layoutlmv3.

  • model_config: Define the model configuration.

  • model_path: Path to the model weights.
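
Because LayoutLMv3 depends on Detectron2, it can be useful to check that the package is installed before selecting that model in the config. The sketch below uses only the standard library. The helper names are hypothetical, and falling back to layout_detection_yolo is an assumption about what a sensible default would be.

```python
import importlib.util


def detectron2_available():
    """Return True if the detectron2 package can be found for import."""
    return importlib.util.find_spec("detectron2") is not None


def choose_layout_model(prefer_layoutlmv3=True):
    """Pick a value for the `model` field of the config."""
    if prefer_layoutlmv3 and detectron2_available():
        return "layout_detection_layoutlmv3"
    # Fall back to the YOLO-based model, which needs no Detectron2 install.
    return "layout_detection_yolo"
```

This check covers only the Detectron2 install; the registration code from step 2 still has to be enabled separately.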

Diverse Input Support

The layout detection script in PDF-Extract-Kit supports input formats such as a single image, a directory containing only image files, a single PDF file, and a directory containing only PDF files.

Note

Modify the path to inputs in configs/layout_detection.yaml according to your actual data format:

  • Single image: path/to/image

  • Image directory: path/to/images

  • Single PDF file: path/to/pdf

  • PDF directory: path/to/pdfs
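
The four supported input kinds can be told apart with a small path check before the path is written into the config. This is a sketch under stated assumptions: classify_input is a hypothetical helper, and the set of accepted image extensions is a guess that should be adjusted to the formats you actually use.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set; extend as needed


def classify_input(path):
    """Return 'image', 'image_dir', 'pdf', or 'pdf_dir' for an input path."""
    p = Path(path)
    if p.is_file():
        suffix = p.suffix.lower()
        if suffix == ".pdf":
            return "pdf"
        if suffix in IMAGE_SUFFIXES:
            return "image"
        raise ValueError(f"unsupported file type: {p.name}")
    if p.is_dir():
        # A directory must hold only images or only PDFs, never a mix.
        suffixes = {f.suffix.lower() for f in p.iterdir() if f.is_file()}
        if suffixes and suffixes <= {".pdf"}:
            return "pdf_dir"
        if suffixes and suffixes <= IMAGE_SUFFIXES:
            return "image_dir"
        raise ValueError(f"directory must contain only images or only PDFs: {p}")
    raise FileNotFoundError(path)
```

Raising an error for mixed or empty directories reflects the requirement that a directory contain only image files or only PDF files.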

Note

When using PDFs as input, you need to change predict_images to predict_pdfs in scripts/layout_detection.py.

# for image detection
detection_results = model_layout_detection.predict_images(input_data, result_path)

Change to:

# for pdf detection
detection_results = model_layout_detection.predict_pdfs(input_data, result_path)
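
Instead of editing the script each time the input type changes, the choice between the two calls could be made at run time. The following is a sketch, not part of the shipped script: is_pdf_input and run_detection are hypothetical helpers, and the model object is assumed only to expose the predict_images and predict_pdfs methods shown above.

```python
from pathlib import Path


def is_pdf_input(input_data):
    """True for a .pdf file or a directory whose files are all PDFs."""
    p = Path(input_data)
    if p.is_dir():
        files = [f for f in p.iterdir() if f.is_file()]
        return bool(files) and all(f.suffix.lower() == ".pdf" for f in files)
    return p.suffix.lower() == ".pdf"


def run_detection(model, input_data, result_path):
    """Call predict_pdfs for PDF input and predict_images otherwise."""
    if is_pdf_input(input_data):
        return model.predict_pdfs(input_data, result_path)
    return model.predict_images(input_data, result_path)
```

In the script, the single prediction line would then become detection_results = run_detection(model_layout_detection, input_data, result_path).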

Viewing Visualization Results

When visualize is set to True in the config file, the visualization results will be saved in the outputs directory.

Note

Visualization is helpful for analyzing model results, but for large-scale tasks it is recommended to turn visualization off (set visualize to False) to reduce memory and disk usage.
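
For batch runs, the flag can be switched off programmatically once the config has been loaded into a dictionary. The sketch below assumes the config has already been parsed into nested dicts that mirror the YAML structure shown earlier; without_visualization is a hypothetical helper.

```python
import copy


def without_visualization(config):
    """Return a copy of the config with visualize set to False in every task."""
    new_config = copy.deepcopy(config)
    for task in new_config.get("tasks", {}).values():
        model_config = task.get("model_config")
        # Only touch configs that define the flag, e.g. the YOLO-based ones.
        if isinstance(model_config, dict) and "visualize" in model_config:
            model_config["visualize"] = False
    return new_config
```

Working on a deep copy leaves the original config untouched, so the same loaded config can still be used for a small visualized sample run.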