Configuration

The dataset extraction command requires a configuration file. Its format is described below.

Sections

Prompt

This section holds the paths to the system and user prompts to use for your training dataset. The default key is mandatory. You can specify a per-dataset prompt using the optional datasets key.

---
prompt:
  system:
    default: path/to/system/prompt
  user:
    default: path/to/user/prompt
    datasets:
      <dataset-id>: path/to/dataset/user/prompt
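
The lookup described above can be sketched in Python as follows. This is a minimal illustration, not the tool's actual code: the function name `resolve_prompt_path`, the dataset ID `my-dataset`, and the assumption that a dataset without an entry under datasets falls back to default are all hypothetical.

```python
from typing import Optional


def resolve_prompt_path(prompt_config: dict, role: str, dataset_id: Optional[str] = None) -> str:
    """Return the prompt path for `role` ("system" or "user").

    Hypothetical helper: assumes a per-dataset entry, when present,
    takes precedence over the mandatory `default` key.
    """
    section = prompt_config[role]
    # `datasets` is optional, so treat a missing or empty key as "no overrides".
    overrides = section.get("datasets") or {}
    if dataset_id is not None and dataset_id in overrides:
        return overrides[dataset_id]
    return section["default"]


# Same structure as the YAML above, with a made-up dataset ID.
prompt_config = {
    "system": {"default": "path/to/system/prompt"},
    "user": {
        "default": "path/to/user/prompt",
        "datasets": {"my-dataset": "path/to/dataset/user/prompt"},
    },
}

user_prompt = resolve_prompt_path(prompt_config, "user", "my-dataset")
fallback_prompt = resolve_prompt_path(prompt_config, "user", "other-dataset")
```

Here `user_prompt` points to the dataset-specific file, while `fallback_prompt` uses the default path.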

NER

This optional section describes the entities extracted from the dataset for named entity recognition (NER). The available keys are:

mode (required): format in which the data is generated, see the examples below. Supported modes are csv, xml and json.
delimiter: delimiter used in csv mode. Defaults to ",". Ignored in any other mode.
wk_run: list of worker run IDs that produced the transcription entities.
types:
- in csv mode, sets the header of the generated data. When omitted, the header is not guaranteed to follow the order of the entities in the transcription,
- in xml mode, restricts the output to these entity types.

---
ner:
  mode: csv
  delimiter: ','
  wk_run:
    - 03d95092-2b2c-4877-9859-be8a37890185
    - c9ab4fa9-c2b8-4f10-874b-9de98fe14604
  types:
    - surname
    - firstname
    - age
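
The rules for these keys can be summarized with a small validation sketch. It is only an illustration of the constraints listed above (required mode, supported values, delimiter default); the function name `normalize_ner_config` and the shape of its return value are assumptions, not part of the tool.

```python
SUPPORTED_MODES = {"csv", "xml", "json"}


def normalize_ner_config(ner: dict) -> dict:
    """Validate a parsed `ner` section and fill in defaults (hypothetical helper)."""
    if "mode" not in ner:
        raise ValueError("ner.mode is required")
    mode = ner["mode"]
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unsupported ner.mode: {mode!r}")
    return {
        "mode": mode,
        # The delimiter only matters in csv mode and defaults to a comma.
        "delimiter": ner.get("delimiter", ",") if mode == "csv" else None,
        "wk_run": list(ner.get("wk_run") or []),
        "types": list(ner.get("types") or []),
    }


csv_config = normalize_ner_config({"mode": "csv", "types": ["surname", "firstname", "age"]})
xml_config = normalize_ner_config({"mode": "xml", "delimiter": ";"})
```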

NER | CSV mode

Below is an example of a generated string using the csv mode.

surname,firstname,age
Laulont,Francois,8
Ciret,Antoine,27
Ciret,Marie,28
Ciret,Marie,2

This mode should be preferred for datasets with tabular data.
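
As a rough sketch of how such a string could be built, the snippet below uses Python's standard csv module. The function name `entities_to_csv` and the input shape (one dict per row, keyed by entity type) are assumptions made for illustration.

```python
import csv
import io


def entities_to_csv(rows, types, delimiter=","):
    """Serialize rows of entities to a CSV string with `types` as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=types, delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"surname": "Laulont", "firstname": "Francois", "age": "8"},
    {"surname": "Ciret", "firstname": "Antoine", "age": "27"},
]

csv_output = entities_to_csv(rows, types=["surname", "firstname", "age"])
semicolon_output = entities_to_csv(rows, types=["surname", "firstname", "age"], delimiter=";")
```

Passing the types explicitly fixes the column order, which mirrors the role of the types key in csv mode.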

NER | XML mode

Below is an example of a generated string using the xml mode.

<root>
    <surname>Laulont</surname> <firstname>Francois</firstname> <age>8</age>
    <surname>Ciret</surname> <firstname>Antoine</firstname> <age>27</age>
    <surname>Ciret</surname> <firstname>Marie</firstname> <age>28</age>
    <surname>Ciret</surname> <firstname>Marie</firstname> <age>2</age>
</root>
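
A string with this layout could be produced along the following lines. This is a sketch under assumptions: the function name `entities_to_xml`, the one-dict-per-row input, and the filtering by `types` are illustrative and not taken from the tool itself.

```python
from xml.sax.saxutils import escape


def entities_to_xml(rows, types=None):
    """Render rows of entities as tagged lines wrapped in a <root> element."""
    lines = ["<root>"]
    for row in rows:
        tags = [
            # Escape &, < and > so entity values cannot break the markup.
            f"<{name}>{escape(value)}</{name}>"
            for name, value in row.items()
            if types is None or name in types
        ]
        lines.append("    " + " ".join(tags))
    lines.append("</root>")
    return "\n".join(lines)


rows = [
    {"surname": "Laulont", "firstname": "Francois", "age": "8"},
    {"surname": "Ciret", "firstname": "Antoine", "age": "27"},
]

xml_output = entities_to_xml(rows)
surnames_only = entities_to_xml(rows, types=["surname"])
```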

NER | JSON mode

Below is an example of a generated string using the json mode.

[
    {
        "surname": "Laulont",
        "firstname": "Francois",
        "age": "8"
    },
    {
        "surname": "Ciret",
        "firstname": "Antoine",
        "age": "27"
    },
    {
        "surname": "Ciret",
        "firstname": "Marie",
        "age": "28"
    },
    {
        "surname": "Ciret",
        "firstname": "Marie",
        "age": "2"
    }
]
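
The same layout can be obtained with Python's standard json module, as in the sketch below. The function name `entities_to_json` and the input shape are assumptions; note that every value, including the age, is kept as a string, as in the example above.

```python
import json


def entities_to_json(rows):
    """Serialize rows of entities as an indented JSON array of objects."""
    # ensure_ascii=False keeps accented characters readable in the output.
    return json.dumps(rows, indent=4, ensure_ascii=False)


rows = [
    {"surname": "Laulont", "firstname": "Francois", "age": "8"},
    {"surname": "Ciret", "firstname": "Antoine", "age": "27"},
]

json_output = entities_to_json(rows)
```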

Element

This optional section describes the elements to extract and applies to all datasets. By default, the dataset elements themselves are extracted. To extract their children instead, set type to the type of the child elements; the search is recursive. You can also filter these children by worker run ID using the wk_run parameter.

---
element:
  type: null|str # null = dataset_element, str means recursive search
  wk_run: # only relevant when type is not null
    - wk_run1
    - wk_run2
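
The selection logic can be pictured with the following sketch, which works on plain in-memory dicts. Everything here is hypothetical: the function name `select_elements`, the `children`, `type` and `wk_run` fields, and the assumption that an omitted wk_run keeps children from every worker run.

```python
def select_elements(dataset_elements, element_config):
    """Pick the elements to extract according to an `element` section."""
    element_type = element_config.get("type")
    if element_type is None:
        # type: null -> extract the dataset elements themselves.
        return list(dataset_elements)

    wk_runs = set(element_config.get("wk_run") or [])
    selected = []

    def walk(element):
        # Visit children depth-first so nested matches are found too.
        for child in element.get("children", []):
            matches_type = child["type"] == element_type
            matches_run = not wk_runs or child.get("wk_run") in wk_runs
            if matches_type and matches_run:
                selected.append(child)
            walk(child)

    for element in dataset_elements:
        walk(element)
    return selected


dataset_elements = [
    {"id": "doc1", "type": "folder", "wk_run": None, "children": [
        {"id": "p1", "type": "single_page", "wk_run": "wk_run1", "children": [
            {"id": "l1", "type": "text_line", "wk_run": "wk_run1", "children": []},
        ]},
        {"id": "p2", "type": "single_page", "wk_run": "wk_run2", "children": []},
    ]},
]

pages = select_elements(dataset_elements, {"type": "single_page", "wk_run": ["wk_run1"]})
```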

Transcription

This optional section describes the transcriptions to extract and applies to all datasets. Use the wk_run parameter to keep, for each element, only the transcriptions produced by the listed worker runs.

---
transcription:
  wk_run:
    - wk_run1
    - wk_run2
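
A minimal sketch of this filter is shown below. The function name `filter_transcriptions`, the dict-based transcription records, and the assumption that an omitted wk_run keeps every transcription are illustrative only.

```python
def filter_transcriptions(transcriptions, transcription_config):
    """Keep the transcriptions produced by the configured worker runs."""
    wk_runs = transcription_config.get("wk_run")
    if not wk_runs:
        # Assumption: no wk_run filter means nothing is discarded.
        return list(transcriptions)
    allowed = set(wk_runs)
    return [t for t in transcriptions if t.get("wk_run") in allowed]


transcriptions = [
    {"text": "Laulont Francois 8", "wk_run": "wk_run1"},
    {"text": "Laulont Francois 3", "wk_run": "wk_run3"},
]

kept = filter_transcriptions(transcriptions, {"wk_run": ["wk_run1", "wk_run2"]})
```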

Examples

Full configuration

---
prompt:
  system:
    default: path/to/system/prompt
  user:
    default: path/to/user/prompt
    datasets:
      <dataset-id>: path/to/dataset/user/prompt

NER configuration (CSV mode)

---
prompt:
  system:
    default: path/to/system/prompt
  user:
    default: path/to/user/prompt
    datasets:
      <dataset-id>: path/to/dataset/user/prompt
ner:
  mode: csv

NER configuration (CSV mode + ordered types)

---
prompt:
  system:
    default: path/to/system/prompt
  user:
    default: path/to/user/prompt
    datasets:
      <dataset-id>: path/to/dataset/user/prompt
ner:
  mode: csv
  types:
    - surname
    - firstname
    - age

NER configuration (XML mode)

---
prompt:
  system:
    default: path/to/system/prompt
  user:
    default: path/to/user/prompt
    datasets:
      <dataset-id>: path/to/dataset/user/prompt
ner:
  mode: xml

Element configuration (extract single pages created by worker run `wk_run1`)

---
prompt:
  system:
    default: path/to/system/prompt
  user:
    default: path/to/user/prompt
    datasets:
      <dataset-id>: path/to/dataset/user/prompt
element:
  type: single_page
  wk_run:
    - wk_run1

Transcription configuration (extract transcriptions created by worker run `wk_run1`)

---
prompt:
  system:
    default: path/to/system/prompt
  user:
    default: path/to/user/prompt
    datasets:
      <dataset-id>: path/to/dataset/user/prompt
transcription:
  wk_run:
    - wk_run1