Dataset extraction

The teklia-qwen dataset extract command requires the following arguments:

  • the path to an SQLite project export,

  • a configuration file, provided through the --config argument (its format is described on a dedicated page),

  • a list of IDs for the datasets to extract, provided through the --datasets argument,

  • a folder where the dataset files will be saved, provided through the --output argument.

Images are not downloaded at this point. You need to run the download subcommand afterwards.

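The dataset is generated in JSON format, in a file named split.json. There is one top-level key per split, such as train. Under each split, every element ID maps to its prompts, its transcription, and a list of [image path, image URL] pairs.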

{
    "train": {
        "element_id_1": {
            "system": "<system_prompt>",
            "query": "<image> <user_prompt>",
            "response": "<transcription>",
            "images": [["<image_path>", "<image_url>"]]
        },
        "element_id_2": {
            "system": "<system_prompt>",
            "query": "<image1> <image2> <user_prompt>",
            "response": "<transcription>",
            "images": [
                ["<image_path_1>", "<image_url_1>"],
                ["<image_path_2>", "<image_url_2>"]
            ]
        },
        ...
    },
    ...
}
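
The structure above can be inspected with a few lines of Python. This is a minimal sketch and not part of teklia-qwen: the helper names load_split, summarize and iter_images are hypothetical, and the code only assumes the keys shown in the sample (a mapping of split name to element ID to an entry with an "images" list of [path, URL] pairs).

```python
import json
from pathlib import Path


def load_split(path):
    """Read a split.json file and return its content as a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize(data):
    """Return {split_name: (number of elements, number of images)}."""
    summary = {}
    for split_name, elements in data.items():
        image_count = sum(len(sample["images"]) for sample in elements.values())
        summary[split_name] = (len(elements), image_count)
    return summary


def iter_images(data):
    """Yield (split_name, element_id, image_path, image_url) for every image."""
    for split_name, elements in data.items():
        for element_id, sample in elements.items():
            for image_path, image_url in sample["images"]:
                yield split_name, element_id, image_path, image_url


# Small in-memory example following the documented structure (placeholder values).
example = {
    "train": {
        "element_id_1": {
            "system": "<system_prompt>",
            "query": "<image> <user_prompt>",
            "response": "<transcription>",
            "images": [["<image_path>", "<image_url>"]],
        },
        "element_id_2": {
            "system": "<system_prompt>",
            "query": "<image1> <image2> <user_prompt>",
            "response": "<transcription>",
            "images": [
                ["<image_path_1>", "<image_url_1>"],
                ["<image_path_2>", "<image_url_2>"],
            ],
        },
    }
}

print(summarize(example))
```

On a real extraction, summarize(load_split("dataset/split.json")) would give the same kind of summary, assuming split.json sits directly in the folder passed to --output. Since images are not downloaded during extraction, iter_images is a convenient way to list the path and URL pairs that still have to be fetched.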

Examples

The following command extracts two datasets, with the IDs <dataset_id1> and <dataset_id2>, in a single run:

teklia-qwen dataset extract exports/database.sqlite \
    --config configs/extract.yml \
    --datasets <dataset_id1> <dataset_id2> \
    --output dataset/