Dataset extraction
Description
Use the teklia-layout-reader dataset extract command to extract a dataset in LayoutReader format.
The command expects a pre-formatted dataset provided in one of the supported input formats described below.
| Parameter | Description | Type | Default |
|---|---|---|---|
| | Name of the HuggingFace or YOLO dataset. | | |
| --output-dir | Path to the output directory. | | |
| --mode | Dataset type. Must be one of the supported modes. | | |
| | Ratio of the data that will be shuffled (expected between 0 and 1). | | |
| --extract-classes | Whether to extract classes. | | |
| --extract-separators | Whether to extract separators. In this case, LSD will be applied to detect vertical and horizontal separators. | | |
Supported formats
1. HuggingFace format
The initial dataset must follow the HuggingFace structure and include the following fields for each page:
- page_arkindex_id (str) - Unique identifier of the page.
- page_image (PIL.Image) - The page image.
- zone_polygons (list[list[float]]) - List of polygons for the page. Each polygon is represented as a list of four floats: [x_min, y_min, x_max, y_max], with coordinates normalized between 0 and 1.
- zone_classes (list[int]) - List of class identifiers corresponding to each polygon. Each integer maps to a specific class.
- zone_orders (list[int]) - Reading order of the zones (polygons) on the page.
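The field constraints above can be checked before running the extraction. The following is a minimal sketch, not part of teklia-layout-reader: `validate_page` is a hypothetical helper, it assumes one class and one order value per polygon, and it skips the page_image check to avoid depending on PIL.

```python
def validate_page(page: dict) -> list[str]:
    """Return a list of problems found in one page record (empty if none)."""
    problems = []
    polygons = page.get("zone_polygons", [])
    classes = page.get("zone_classes", [])
    orders = page.get("zone_orders", [])

    if not isinstance(page.get("page_arkindex_id"), str):
        problems.append("page_arkindex_id must be a string")

    # Assumption: one class and one order value are expected per polygon.
    if not (len(polygons) == len(classes) == len(orders)):
        problems.append("zone_polygons, zone_classes and zone_orders differ in length")

    for i, box in enumerate(polygons):
        if len(box) != 4:
            problems.append(f"polygon {i} does not have four values")
            continue
        x_min, y_min, x_max, y_max = box
        if not all(0.0 <= value <= 1.0 for value in box):
            problems.append(f"polygon {i} is not normalized to [0, 1]")
        if x_min > x_max or y_min > y_max:
            problems.append(f"polygon {i} has min > max")
    return problems
```

An empty list means the record matches the layout described above; any other result names the field to fix.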
2. YOLO format
The initial dataset can also be provided in YOLO format.
For each split (e.g. train, val, test), the following folders are required:
train/
├── images/
│   └── *.jpg
└── labels/
    └── *.txt
Each label file contains one object per line, using the following format: class_id x_center y_center width height
Where:
- class_id (int): identifier of the object class
- x_center, y_center, width, height (float): bounding box values normalized between 0 and 1, relative to the image size.
Each label file must be ordered so that objects appear in their reading order.
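To relate the YOLO label format to the [x_min, y_min, x_max, y_max] boxes used by the HuggingFace format, the sketch below converts label lines into corner coordinates. It is an illustration only: `yolo_line_to_box` and `parse_label_text` are hypothetical helpers, and using the line position as the reading order is an assumption based on the ordering requirement above, not a description of the tool's internals.

```python
def yolo_line_to_box(line: str) -> tuple[int, list[float]]:
    """Convert 'class_id x_center y_center width height' to (class_id, [x_min, y_min, x_max, y_max])."""
    class_id, x_center, y_center, width, height = line.split()
    x_c, y_c = float(x_center), float(y_center)
    half_w, half_h = float(width) / 2, float(height) / 2
    # All values stay normalized to [0, 1], relative to the image size.
    return int(class_id), [x_c - half_w, y_c - half_h, x_c + half_w, y_c + half_h]


def parse_label_text(text: str) -> dict:
    """Parse the content of one label file, keeping the lines in file order."""
    lines = [line for line in text.splitlines() if line.strip()]
    parsed = [yolo_line_to_box(line) for line in lines]
    return {
        "zone_classes": [class_id for class_id, _ in parsed],
        "zone_polygons": [box for _, box in parsed],
        # Assumption: the position of a line in the file is its reading order.
        "zone_orders": list(range(len(parsed))),
    }
```

For example, the line `2 0.5 0.5 0.5 0.25` describes a box centred on the page that is half the page wide and a quarter of the page high.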
Examples
Extract from a HF dataset
This command will convert a HF dataset into a LayoutReader dataset:
teklia-layout-reader dataset extract Teklia/Newspapers-finlam-La-Liberte \
--mode HF \
--output-dir dataset \
--extract-classes
Expected files in dataset:
$ ls dataset
dev.jsonl.gz test.jsonl.gz train.jsonl.gz
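A quick way to sanity-check the output is to count the records in each split. This sketch assumes, from the .jsonl.gz extension only, that each file is gzip-compressed JSON Lines (one JSON object per line). It makes no assumption about the fields inside each record, and `count_records` and `first_record` are hypothetical helpers.

```python
import gzip
import json
from pathlib import Path


def count_records(path: Path) -> int:
    """Count the non-empty lines of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def first_record(path: Path):
    """Return the first record of the file as a Python object, or None if the file is empty."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                return json.loads(line)
    return None


if __name__ == "__main__":
    output_dir = Path("dataset")  # same value as --output-dir in the command above
    for split in ("train", "dev", "test"):
        split_path = output_dir / f"{split}.jsonl.gz"
        if split_path.exists():
            print(split, count_records(split_path))
```

Printing the result of `first_record` on one split is a simple way to see which fields the extraction actually produced.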
Extract from a YOLO dataset
Before running this command, please ensure that objects appear in their reading order in the label files.
This command will convert a YOLO dataset into a LayoutReader dataset:
teklia-layout-reader dataset extract my_yolo_dataset \
--mode YOLO \
--output-dir dataset \
--extract-classes \
--extract-separators
Expected files in dataset:
$ ls dataset
dev.jsonl.gz test.jsonl.gz train.jsonl.gz