Dataset extraction

The extract subcommand extracts data from Arkindex. It creates:

  • images/, a folder with the images that need to be transcribed,

  • labels.json, a JSON file where each image is linked to its transcription.
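To sanity-check an extraction, the two outputs can be read back together. The sketch below assumes that labels.json is a flat JSON object mapping each image path (relative to the output folder) to its transcription string. That layout is an assumption, not something confirmed here, so adapt the loading code to the actual file. The helper name load_labels is hypothetical.

```python
import json
from pathlib import Path


def load_labels(output_dir):
    """Return (labels, missing) for an extraction output folder.

    Assumption: labels.json is a flat {image_path: transcription} object,
    with image paths given relative to output_dir.
    """
    output_dir = Path(output_dir)
    with open(output_dir / "labels.json", encoding="utf-8") as f:
        labels = json.load(f)
    # Collect the entries whose image file is not present on disk.
    missing = [name for name in labels if not (output_dir / name).is_file()]
    return labels, missing
```

An empty missing list means that every labelled image was found in the output folder.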

The full command is:

atr-data-generator extract \
    --config path/to/configuration.yaml \
    --database-path path/to/db.sqlite

Both arguments are required:

  • --config, the path to the configuration file,

  • --database-path, the path to the Arkindex SQLite export of the corpus.
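When the extraction is launched from a script, it can help to check that both required inputs exist before calling the CLI. The following is a minimal sketch, assuming atr-data-generator is installed and available on the PATH. The helper names build_extract_command and run_extract are hypothetical, and only the two arguments documented above are used.

```python
import subprocess
from pathlib import Path


def build_extract_command(config, database):
    """Build the argument list for `atr-data-generator extract`."""
    return [
        "atr-data-generator",
        "extract",
        "--config", str(config),
        "--database-path", str(database),
    ]


def run_extract(config, database):
    """Check that both inputs exist, then run the extraction."""
    for path in (Path(config), Path(database)):
        if not path.is_file():
            raise FileNotFoundError(f"Required input not found: {path}")
    # check=True raises CalledProcessError if the CLI exits with an error.
    subprocess.run(build_extract_command(config, database), check=True)
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.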

More details about the configuration file are given in the Dataset extraction section.

More formats

An additional CLI argument, --format, sets the format of the export when the default one doesn’t suit the use case. Supported formats are: