Configuration
The dataset extraction command requires a configuration file. Its format is described below.
Sections
Prompt
This section holds the paths to the system and user prompts that should be used for your training dataset. The default key is mandatory. You can specify a per-dataset prompt using the optional datasets key.
---
prompt:
  system:
    default: path/to/system/prompt
  user:
    default: path/to/user/prompt
    datasets:
      <dataset-id>: path/to/dataset/user/prompt
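As an illustration of how the default and datasets keys could interact, here is a minimal sketch. It assumes the configuration has already been loaded into a Python dictionary, that datasets is nested under user, and that a dataset without its own entry falls back to the default prompt. The function name `resolve_user_prompt` and the dataset ID `my-dataset` are hypothetical and not part of the tool.

```python
from typing import Any


def resolve_user_prompt(config: dict[str, Any], dataset_id: str) -> str:
    """Return the user prompt path for a dataset (hypothetical helper)."""
    user = config["prompt"]["user"]
    # `datasets` is optional, `default` is mandatory.
    overrides = user.get("datasets") or {}
    # Assumption: datasets without an override use the default prompt.
    return overrides.get(dataset_id, user["default"])


# Hypothetical parsed configuration, mirroring the YAML layout above.
config = {
    "prompt": {
        "system": {"default": "path/to/system/prompt"},
        "user": {
            "default": "path/to/user/prompt",
            "datasets": {"my-dataset": "path/to/dataset/user/prompt"},
        },
    }
}

print(resolve_user_prompt(config, "my-dataset"))
print(resolve_user_prompt(config, "other-dataset"))
```

The first call picks the per-dataset path, the second one falls back to the default path.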
NER
This optional section holds the information related to the entities extracted from the dataset for NER. The keys are:
- mode (required): format in which the data is generated, see examples below. Currently supported modes are csv, xml and json.
- delimiter: delimiter used in the csv mode. Defaults to ",". Ignored in any other mode.
- wk_run: list of worker run IDs that produced the transcription entities.
- types:
  - specifies the header of the csv-formatted data. If omitted, the header is not guaranteed to follow the order of the entities in the transcription,
  - limits the scope to these entity types in xml-formatted data.
---
ner:
  mode: csv
  delimiter: ','
  wk_run:
    - 03d95092-2b2c-4877-9859-be8a37890185
    - c9ab4fa9-c2b8-4f10-874b-9de98fe14604
  types:
    - surname
    - firstname
    - age
NER | CSV mode
Below is an example of a generated string using the csv mode.
surname,firstname,age
Laulont,Francois,8
Ciret,Antoine,27
Ciret,Marie,28
Ciret,Marie,2
Note: this mode should be preferred for datasets with tabular data.
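To make the role of types and delimiter in the csv mode more concrete, the sketch below builds a similar string with Python's standard csv module. It is only an approximation of the output format, not the tool's actual implementation; the function `entities_to_csv` and the shape of the input records are assumptions.

```python
import csv
import io


def entities_to_csv(rows, types, delimiter=","):
    """Serialize entity records to a CSV string (hypothetical helper).

    `types` fixes the header and therefore the column order.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=types,
        delimiter=delimiter,
        lineterminator="\n",
        extrasaction="ignore",  # drop entity types that are not listed
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"surname": "Laulont", "firstname": "Francois", "age": "8"},
    {"surname": "Ciret", "firstname": "Antoine", "age": "27"},
]

print(entities_to_csv(rows, types=["surname", "firstname", "age"]))
```

Passing an explicit list of types keeps the header stable across documents, which is the behavior the types key is described as providing.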
NER | XML mode
Below is an example of a generated string using the xml mode.
<root>
<surname>Laulont</surname> <firstname>Francois</firstname> <age>8</age>
<surname>Ciret</surname> <firstname>Antoine</firstname> <age>27</age>
<surname>Ciret</surname> <firstname>Marie</firstname> <age>28</age>
<surname>Ciret</surname> <firstname>Marie</firstname> <age>2</age>
</root>
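A string in this format can be read back with Python's standard xml.etree.ElementTree module. The sketch below is one possible way to do it and to restrict the result to a set of entity types; the helper `read_entities` is hypothetical.

```python
import xml.etree.ElementTree as ET

xml_string = """<root>
<surname>Laulont</surname> <firstname>Francois</firstname> <age>8</age>
<surname>Ciret</surname> <firstname>Antoine</firstname> <age>27</age>
</root>"""


def read_entities(xml_text, types=None):
    """Return (type, text) pairs in document order (hypothetical helper).

    When `types` is given, only these entity types are kept.
    """
    root = ET.fromstring(xml_text)
    pairs = [(child.tag, child.text) for child in root]
    if types is not None:
        pairs = [(tag, text) for tag, text in pairs if tag in types]
    return pairs


print(read_entities(xml_string))
print(read_entities(xml_string, types=["surname"]))
```

Entities are direct children of the root element, so the line breaks in the example only group entities visually and are not part of the parsed structure.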
NER | JSON mode
Below is an example of a generated string using the json mode.
[
  {
    "surname": "Laulont",
    "firstname": "Francois",
    "age": "8"
  },
  {
    "surname": "Ciret",
    "firstname": "Antoine",
    "age": "27"
  },
  {
    "surname": "Ciret",
    "firstname": "Marie",
    "age": "28"
  },
  {
    "surname": "Ciret",
    "firstname": "Marie",
    "age": "2"
  }
]
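In the example above all values are strings, including the ages. A consumer that needs numbers has to convert them explicitly, as in this short sketch using Python's standard json module (the variable names are illustrative).

```python
import json

json_string = """[
  {"surname": "Laulont", "firstname": "Francois", "age": "8"},
  {"surname": "Ciret", "firstname": "Antoine", "age": "27"},
  {"surname": "Ciret", "firstname": "Marie", "age": "28"},
  {"surname": "Ciret", "firstname": "Marie", "age": "2"}
]"""

records = json.loads(json_string)

# Values are strings in the generated data, so cast where needed.
ages = [int(record["age"]) for record in records]
surnames = sorted({record["surname"] for record in records})

print(ages)
print(surnames)
```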
Element
This optional section describes the elements to extract. It applies to all datasets. By default, the dataset elements themselves are extracted. You can instead search for their children recursively by specifying the type of the elements to look for. You can also filter these children by worker run ID using the wk_run parameter.
---
element:
  type: null|str # null = dataset_element, str means recursive search
  wk_run: # only relevant when type is not null
    - wk_run1
    - wk_run2
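The rules for this section can be summarized with a small validation sketch. It is not the tool's own code: the function `normalize_element_section` is hypothetical, the element type `page` is only an example value, and dropping wk_run when type is null is an assumption based on the comment in the snippet above.

```python
def normalize_element_section(element):
    """Check and normalize the optional `element` section (hypothetical helper)."""
    if element is None:
        # Section omitted: extract the dataset elements themselves.
        return {"type": None, "wk_run": []}
    element_type = element.get("type")
    if element_type is not None and not isinstance(element_type, str):
        raise ValueError("element.type must be null or a string")
    wk_run = list(element.get("wk_run") or [])
    if element_type is None:
        # Assumption: wk_run is ignored when no recursive search is requested.
        wk_run = []
    return {"type": element_type, "wk_run": wk_run}


print(normalize_element_section(None))
print(normalize_element_section({"type": "page", "wk_run": ["wk_run1", "wk_run2"]}))
print(normalize_element_section({"type": None, "wk_run": ["wk_run1"]}))
```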
Examples
Full configuration
---
prompt:
  system:
    default: path/to/system/prompt
  user:
    default: path/to/user/prompt
    datasets:
      <dataset-id>: path/to/dataset/user/prompt
NER configuration (CSV mode)
---
prompt:
  system:
    default: path/to/system/prompt
  user:
    default: path/to/user/prompt
    datasets:
      <dataset-id>: path/to/dataset/user/prompt
ner:
  mode: csv
NER configuration (CSV mode + ordered types)
---
prompt:
  system:
    default: path/to/system/prompt
  user:
    default: path/to/user/prompt
    datasets:
      <dataset-id>: path/to/dataset/user/prompt
ner:
  mode: csv
  types:
    - surname
    - firstname
    - age
NER configuration (XML mode)
---
prompt:
  system:
    default: path/to/system/prompt
  user:
    default: path/to/user/prompt
    datasets:
      <dataset-id>: path/to/dataset/user/prompt
ner:
  mode: xml