Dataset language model

Description

Use the teklia-dan dataset language-model command to build the language model resources for a dataset from a split extracted by DAN. This will:

  • Generate the resources needed to build an n-gram language model at the character, subword, or word level with kenlm (in the language_model/ folder).

| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| `--output` | Path where the `labels.json` and `charset.pkl` files are stored and where the data will be generated. | `pathlib.Path` | |
| `--subword-vocab-size` | Size of the vocabulary used to train the sentencepiece subword tokenizer for the optional language model. | `int` | `1000` |
| `--unknown-token` | Token used to replace any character in the validation/test sets that does not appear in the training set. | `str` | |
| `--tokens` | Mapping between starting and ending tokens used to extract text with its entities. | `pathlib.Path` | |

The --output directory should contain:

  • A charset.pkl file: a pickled set of the characters encountered in the dataset,

  • A labels.json file mapping each image (identified by its path) to its ground-truth transcription (with NER tokens if needed).

These files can be generated by the teklia-dan dataset download command. More details in the dedicated page.

{
  "train": {
    "<image_path>": "\u24e2Coufet \u24d5Bouis \u24d107.12.14"
  },
  "dev": {},
  "test": {}
}
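As a quick sanity check, the expected structure of labels.json can be validated with a short Python snippet. This is purely illustrative and not part of DAN; the sample content is the one shown above:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Sample content matching the labels.json structure shown above
labels = {
    "train": {"<image_path>": "\u24e2Coufet \u24d5Bouis \u24d107.12.14"},
    "dev": {},
    "test": {},
}

with TemporaryDirectory() as tmp:
    path = Path(tmp) / "labels.json"
    path.write_text(json.dumps(labels, ensure_ascii=True), encoding="utf-8")

    loaded = json.loads(path.read_text(encoding="utf-8"))
    # Every split must be present, even when empty
    assert set(loaded) == {"train", "dev", "test"}
    # The training charset is simply the set of characters seen in transcriptions
    charset = sorted({c for text in loaded["train"].values() for c in text})
    print(charset[:5])
```

Note that the NER tokens (here ⓢ, ⓕ, ⓓ) are part of the transcription string, so they end up in the charset alongside regular characters.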

The --tokens argument expects a YAML-formatted file listing the NER entities: each entry's key is the entity label, which maps to a dict with its starting and ending tokens. This file can be generated by the teklia-dan dataset tokens command. More details in the dedicated page.

INTITULE: # Type of the entity on Arkindex
  start: ⓘ # Starting token for this entity
  end: Ⓘ # Optional ending token for this entity
DATE:
  start: ⓓ
  end: Ⓓ
COTE_SERIE:
  start: ⓢ
  end: Ⓢ
ANALYSE_COMPL.:
  start: ⓒ
  end: Ⓒ
PRECISIONS_SUR_COTE:
  start: ⓟ
  end: Ⓟ
COTE_ARTICLE:
  start: ⓐ
  end: Ⓐ
CLASSEMENT:
  start: ⓛ
  end: Ⓛ
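For illustration, here is a minimal Python sketch of how such a start/end token mapping can be used to pull entities out of a transcription. The extract_entities helper and the sample sentence are hypothetical, not DAN's own extraction code, and entities whose optional end token is omitted would need different handling:

```python
import re

# Token mapping mirroring a subset of the YAML example above
tokens = {
    "INTITULE": {"start": "\u24d8", "end": "\u24be"},  # ⓘ … Ⓘ
    "DATE": {"start": "\u24d3", "end": "\u24b9"},      # ⓓ … Ⓓ
}

def extract_entities(text, tokens):
    """Return (label, value) pairs for each start…end span found in text."""
    entities = []
    for label, marks in tokens.items():
        pattern = re.escape(marks["start"]) + r"(.*?)" + re.escape(marks["end"])
        for match in re.finditer(pattern, text):
            entities.append((label, match.group(1).strip()))
    return entities

print(extract_entities("\u24d8Registre des actes\u24be \u24d307.12.14\u24b9", tokens))
```

Because the tokens are single unambiguous characters, a non-greedy regular expression between each start/end pair is enough to recover the labelled spans.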

Examples

HTR and NER data

To build language model resources with NER data, please use the following:

teklia-dan dataset language-model \
    --output data \
    --tokens tokens.yml

HTR data

To build language model resources without NER data, please use the following:

teklia-dan dataset language-model \
    --output data