# Dataset language model

## Description
Use the `teklia-dan dataset language-model` command to build language model resources from a dataset split extracted by DAN. This command will:

- Generate the resources needed to build an n-gram language model at character, subword or word level with kenlm (in the `language_model/` folder). A kenlm usage sketch is given at the end of this page.
| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| `--output` | Path where the dataset was downloaded and where the language model resources will be generated. | `pathlib.Path` | |
| `--subword-vocab-size` | Size of the vocabulary used to train the sentencepiece subword tokenizer used to train the optional language model. | `int` | `1000` |
| `--unknown-token` | Token used to replace any character in the validation/test sets that is not included in the training set. | `str` | `⁇` |
| `--tokens` | Mapping between starting tokens and end tokens to extract text with their entities. | `pathlib.Path` | |
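For instance, assuming the dataset was previously downloaded to a `data/` folder and the NER tokens are described in a `tokens.yml` file (both paths are purely illustrative), the command could be called as follows:

```shell
teklia-dan dataset language-model \
    --output data \
    --tokens tokens.yml
```

The resources described above are then generated in the `language_model/` folder.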
The `--output` directory should contain:

- A `charset.pkl` file listing the set of characters encountered in the dataset,
- A `labels.json` JSON-formatted file mapping each image (identified by its path) to its ground-truth transcription (with NER tokens if needed).

These files can be generated by the `teklia-dan dataset download` command. More details are available on the dedicated page.
```json
{
  "train": {
    "<image_path>": "\u24e2Coufet \u24d5Bouis \u24d107.12.14"
  },
  "dev": {},
  "test": {}
}
```
The `--tokens` argument expects a YAML-formatted file describing the NER entities: each entry is keyed by the entity label and maps to a dict giving its starting and ending tokens. This file can be generated by the `teklia-dan dataset tokens` command. More details are available on the dedicated page.
```yaml
INTITULE: # Type of the entity on Arkindex
  start: ⓘ # Starting token for this entity
  end: Ⓘ # Optional ending token for this entity
DATE:
  start: ⓓ
  end: Ⓓ
COTE_SERIE:
  start: ⓢ
  end: Ⓢ
ANALYSE_COMPL.:
  start: ⓒ
  end: Ⓒ
PRECISIONS_SUR_COTE:
  start: ⓟ
  end: Ⓟ
COTE_ARTICLE:
  start: ⓐ
  end: Ⓐ
CLASSEMENT:
  start: ⓛ
  end: Ⓛ
```
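Once these resources have been generated, kenlm can be used to actually train the n-gram model. The commands below are only a sketch: they assume kenlm is installed and that a character-level corpus was written to `language_model/corpus_characters.txt` (this file name is an assumption; check the generated folder for the exact names).

```shell
# Train a 6-gram ARPA language model from the character-level corpus
# (the order and the file names are assumptions, adjust them to your setup)
lmplz --order 6 \
    --text language_model/corpus_characters.txt \
    --arpa language_model/model_characters.arpa

# Optionally compile the ARPA file to kenlm's binary format for faster loading
build_binary language_model/model_characters.arpa language_model/model_characters.binary
```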