Dataset extraction
Description
Use the `teklia-layout-reader dataset extract` command to extract a dataset from Hugging Face.
| Parameter | Description | Type | Default |
|---|---|---|---|
| (positional argument) | Name of a Hugging Face dataset. | | |
| `--output-dir` | Path to the output directory. | | |
| `--shuffle-rate` | Ratio of the data that will be shuffled (expected between 0 and 1). | | |
| `--bbx-factor` | Factor used to scale normalized coordinates between 0-1000. | | |
Examples
To extract the Teklia/Newspapers-finlam dataset from Hugging Face, use the following command:
teklia-layout-reader dataset extract \
Teklia/Newspapers-finlam \
--output-dir finlam_dataset/ \
--shuffle-rate 0.5 \
--bbx-factor 10
This will create a JSONL file for each split in gzip format:
- finlam_dataset/train.jsonl.gz
- finlam_dataset/val.jsonl.gz
- finlam_dataset/test.jsonl.gz
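
To check the extracted files, they can be read back with the Python standard library. The following is a minimal sketch, not part of the tool itself: the helper names `read_jsonl_gz` and `count_records` are hypothetical, and the only assumption about the file content is that each non-empty line is one JSON object (the fields inside each record are not described here, so the sketch does not rely on them).

```python
import gzip
import json
from pathlib import Path


def read_jsonl_gz(path):
    """Yield one decoded JSON value per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_records(output_dir, splits=("train", "val", "test")):
    """Return a {split: record count} mapping for every split file found in output_dir."""
    counts = {}
    for split in splits:
        path = Path(output_dir) / f"{split}.jsonl.gz"
        if path.exists():  # Skip splits that were not produced.
            counts[split] = sum(1 for _ in read_jsonl_gz(path))
    return counts


if __name__ == "__main__":
    # "finlam_dataset" is the output directory used in the command above.
    print(count_records("finlam_dataset"))
```

Reading line by line keeps memory usage low even for large splits, since only one record is decoded at a time.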