Dataset extraction

Description

Use the teklia-layout-reader dataset extract command to extract a dataset from Hugging Face.

Parameter       Description                                                          Type          Default
dataset         Name of a Hugging Face dataset.                                      Dataset
--output-dir    Path to the output directory.                                        pathlib.Path
--shuffle-rate  Ratio of the data that will be shuffled (expected between 0 and 1).  float         0.5
--bbx-factor    Factor used to scale normalized coordinates between 0 and 1000.      float         1000
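The --bbx-factor description can be illustrated with a small sketch. This is not the tool's actual code: it assumes that each bounding box is a list of four coordinates normalized to the 0-1 range, that scaling is a plain multiplication followed by rounding, and that results are clamped to 0-1000. The function name scale_bbox and its upper argument are hypothetical. If a dataset stores coordinates in another range, a different factor would be needed to land in 0-1000.

```python
def scale_bbox(bbox, factor=1000.0, upper=1000):
    """Scale normalized (x0, y0, x1, y1) coordinates and clamp them to [0, upper].

    Hypothetical helper: the multiply, round, clamp behaviour is an assumption
    made for illustration, not the documented behaviour of the CLI.
    """
    return [min(max(round(value * factor), 0), upper) for value in bbox]


# A box given in 0-1 coordinates, scaled with the default factor of 1000.
print(scale_bbox([0.1, 0.25, 0.5, 1.0]))  # → [100, 250, 500, 1000]
```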

Examples

To extract the Teklia/Newspapers-finlam dataset from Hugging Face, use the following command:

teklia-layout-reader dataset extract \
    Teklia/Newspapers-finlam \
    --output-dir finlam_dataset/ \
    --shuffle-rate 0.5 \
    --bbx-factor 10

This creates one gzip-compressed JSONL file per split:

  • finlam_dataset/train.jsonl.gz

  • finlam_dataset/val.jsonl.gz

  • finlam_dataset/test.jsonl.gz
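To check the result, the extracted files can be read back with the Python standard library. The sketch below assumes only what is stated above: each file is gzip-compressed and holds one JSON object per line. The helper name read_jsonl_gz is hypothetical, and the fields inside each record are not described here, so the example only prints the keys of the first record.

```python
import gzip
import json
from pathlib import Path


def read_jsonl_gz(path):
    """Yield one parsed JSON value per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Path taken from the example command above; skipped if extraction has not run yet.
    train_path = Path("finlam_dataset") / "train.jsonl.gz"
    if train_path.exists():
        first = next(read_jsonl_gz(train_path), None)
        # Show the record's keys when it is a JSON object, otherwise the raw value.
        print(sorted(first) if isinstance(first, dict) else first)
```

Reading the file lazily with a generator avoids loading a whole split into memory, which helps when the splits are large.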