Dataset extraction

Description

Use the `teklia-layout-reader dataset extract` command to convert a dataset into LayoutReader format.

The command expects a pre-formatted dataset provided in one of the supported input formats described below.

Parameter	Description	Type	Default
`dataset`	Name of the HuggingFace or YOLO dataset.	`string`
`--output-dir`	Path to the output directory.	`pathlib.Path`
`--mode`	Dataset type. Must be one of the supported modes, such as `HF` or `YOLO`.	`Mode`
`--shuffle-rate`	Ratio of the data that will be shuffled (expected between 0 and 1).	`float`	`0.5`
`--extract-classes`	Whether to extract classes.	`bool`	`false`
`--extract-separators`	Whether to extract separators. When enabled, LSD is applied to detect vertical and horizontal separators.	`bool`	`false`
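
The exact sampling strategy behind `--shuffle-rate` is not described here. As one possible reading, assuming the ratio acts as a per-page probability of shuffling the zone order, the idea can be sketched as follows. The function name `maybe_shuffle` and its arguments are hypothetical and are not part of the tool.

```python
import random


def maybe_shuffle(zones: list, shuffle_rate: float, rng: random.Random) -> list:
    """Return a copy of zones, shuffled with probability shuffle_rate.

    Assumption: this illustrates one interpretation of a shuffle ratio,
    not the actual behaviour of teklia-layout-reader.
    """
    if not 0.0 <= shuffle_rate <= 1.0:
        raise ValueError("shuffle_rate must be between 0 and 1")
    zones = list(zones)  # work on a copy so the caller's list is untouched
    if rng.random() < shuffle_rate:
        rng.shuffle(zones)
    return zones


rng = random.Random(0)  # seeded for reproducibility
print(maybe_shuffle(["title", "paragraph", "caption"], 0.5, rng))
```

With a rate of 0 the order is always preserved, and with a rate of 1 every page is shuffled.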

In HuggingFace mode, the input dataset must follow the HuggingFace structure and include the following fields for each page:

page_arkindex_id (str) - Unique identifier of the page.
page_image (PIL.Image) - The page image.
zone_polygons (list[list[float]]) - List of polygons for the page. Each polygon is represented as a list of four floats: [x_min, y_min, x_max, y_max], with coordinates normalized between 0 and 1.
zone_classes (list[int]) - List of class identifiers corresponding to each polygon. Each integer maps to a specific class.
zone_orders (list[int]) - Reading order of the zones (polygons) on the page.
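
The field constraints above can be checked before running the extraction. The following is a minimal sketch, assuming each page is available as a plain dictionary. The helper `validate_page` is hypothetical, and `page_image` is left out so that the example needs no image library.

```python
def validate_page(record: dict) -> list[str]:
    """Return a list of problems found in one page record (empty if none)."""
    problems = []
    polygons = record["zone_polygons"]
    classes = record["zone_classes"]
    orders = record["zone_orders"]

    # The three lists describe the same zones, so their lengths must match.
    if not (len(polygons) == len(classes) == len(orders)):
        problems.append("zone lists have different lengths")

    for index, polygon in enumerate(polygons):
        if len(polygon) != 4:
            problems.append(f"zone {index}: expected 4 values, got {len(polygon)}")
            continue
        x_min, y_min, x_max, y_max = polygon
        if not all(0.0 <= value <= 1.0 for value in polygon):
            problems.append(f"zone {index}: coordinates are not normalized")
        if x_min > x_max or y_min > y_max:
            problems.append(f"zone {index}: min exceeds max")
    return problems


page = {
    "page_arkindex_id": "example-page-id",  # made-up identifier
    "zone_polygons": [[0.1, 0.1, 0.9, 0.2], [0.1, 0.25, 0.9, 0.8]],
    "zone_classes": [0, 1],
    "zone_orders": [0, 1],
}
print(validate_page(page))  # prints []
```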

The initial dataset can also be provided in YOLO format.

For each split (e.g. train, val, test), the following folders are required:

train/
├── images/
│   └── *.jpg
└── labels/
    └── *.txt

Each label file contains one object per line, using the following format: class_id x_center y_center width height

Where:

class_id (int): identifier of the object class
x_center, y_center, width, height (float): bounding box values normalized between 0 and 1, relative to the image size.

Each label file must be ordered so that objects appear in their reading order.
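
To illustrate the label format, here is a small sketch that parses label lines and converts each box from center/size form to `[x_min, y_min, x_max, y_max]`. The function names are hypothetical, and whether the tool performs this exact conversion internally is an assumption. The position of a line in the file is taken as its reading order.

```python
def parse_yolo_line(line: str) -> tuple[int, list[float]]:
    """Parse 'class_id x_center y_center width height' into (class_id, box)."""
    class_id, *values = line.split()
    x_center, y_center, width, height = map(float, values)
    box = [
        x_center - width / 2,   # x_min
        y_center - height / 2,  # y_min
        x_center + width / 2,   # x_max
        y_center + height / 2,  # y_max
    ]
    return int(class_id), box


def parse_labels(text: str) -> list[tuple[int, int, list[float]]]:
    """Return (reading_order, class_id, box) for each non-empty line."""
    lines = [line for line in text.splitlines() if line.strip()]
    return [(order, *parse_yolo_line(line)) for order, line in enumerate(lines)]


print(parse_yolo_line("2 0.5 0.5 0.5 0.25"))
# prints (2, [0.25, 0.375, 0.75, 0.625])
```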

This command will convert an HF dataset into a LayoutReader dataset:

teklia-layout-reader dataset extract Teklia/Newspapers-finlam-La-Liberte \
    --mode HF \
    --output-dir dataset \
    --extract-classes

Expected files in dataset:

$ ls dataset
dev.jsonl.gz  test.jsonl.gz  train.jsonl.gz

This command will convert a YOLO dataset into a LayoutReader dataset. Before running it, ensure that objects appear in their reading order in the label files:

teklia-layout-reader dataset extract my_yolo_dataset \
    --mode YOLO \
    --output-dir dataset \
    --extract-classes \
    --extract-separators

Expected files in dataset:

$ ls dataset
dev.jsonl.gz  test.jsonl.gz  train.jsonl.gz
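
The generated files can be inspected with the Python standard library. The sketch below assumes only what the `.jsonl.gz` extension implies, namely gzip-compressed text with one JSON object per line. The record schema is not described here, so the example only counts records. The helper `read_jsonl_gz` is hypothetical.

```python
import gzip
import json
from pathlib import Path


def read_jsonl_gz(path: Path) -> list[dict]:
    """Load every non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Count the records of each split, skipping files that are missing.
for split in ("train", "dev", "test"):
    path = Path("dataset") / f"{split}.jsonl.gz"
    if path.exists():
        print(split, len(read_jsonl_gz(path)))
```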