Dataset extraction
The teklia-qwen dataset extract command requires the following arguments:
- the path to an SQLite project export,
- a configuration file, provided through the --config argument (more details about its format are available in a dedicated page),
- a list of IDs for the datasets to extract, provided through the --datasets argument,
- a folder where the dataset files will be saved, provided through the --output argument.
Images are not downloaded at this point. You need to run the download subcommand afterwards.
The dataset is generated in JSON format, in a file named split.json. There is one top-level key per set (train in the example below), and each set maps element IDs to their samples.
{
    "train": {
        "element_id_1": {
            "system": "<system_prompt>",
            "query": "<image> <user_prompt>",
            "response": "<transcription>",
            "images": [["<image_path>", "<image_url>"]]
        },
        "element_id_2": {
            "system": "<system_prompt>",
            "query": "<image1> <image2> <user_prompt>",
            "response": "<transcription>",
            "images": [
                ["<image_path_1>", "<image_url_1>"],
                ["<image_path_2>", "<image_url_2>"]
            ]
        },
        ...
    },
    ...
}
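As a quick sanity check after extraction, the generated file can be read back with a few lines of Python. This is a minimal sketch, not part of teklia-qwen: the helper names load_samples and check_sample are hypothetical, and the code assumes only the layout shown above (sets at the top level, element IDs below them, and each image given as a [path, url] pair).

```python
import json
from pathlib import Path


def load_samples(split_path):
    """Yield (set_name, element_id, sample) for every sample in split.json.

    Hypothetical helper: assumes the layout shown above, i.e.
    {set_name: {element_id: sample, ...}, ...}.
    """
    data = json.loads(Path(split_path).read_text(encoding="utf-8"))
    for set_name, elements in data.items():
        for element_id, sample in elements.items():
            yield set_name, element_id, sample


def check_sample(sample):
    """Return a list of problems found in one sample (empty if none)."""
    problems = []
    # Keys taken from the example above.
    for key in ("system", "query", "response", "images"):
        if key not in sample:
            problems.append(f"missing key: {key}")
    # Each image entry is expected to be a [path, url] pair.
    for entry in sample.get("images", []):
        if len(entry) != 2:
            problems.append(f"expected [path, url], got: {entry!r}")
    return problems


# Example usage, with "output/split.json" as a placeholder path:
# for set_name, element_id, sample in load_samples("output/split.json"):
#     for problem in check_sample(sample):
#         print(f"{set_name}/{element_id}: {problem}")
```

Since images are not downloaded during extraction, the paths in each pair may not exist on disk until the download subcommand has been run; the sketch therefore checks only the structure of the file, not the presence of the images.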