Training configuration

To train a model, you need to write a JSON configuration file. The fields are described in the sections below. An empty configuration file is available at configs/quickstart.json; you will need to fill in the paths.
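
As a rough sketch of the overall shape (the nesting is assumed to mirror the dotted parameter names documented below; the optional mlflow and wandb sections are described at the end of this page):

{
  "dataset": { },
  "model": { },
  "training": { }
}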

Dataset parameters

| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| dataset.max_char_prediction | Maximum number of characters to predict. | int | 1000 |
| dataset.tokens | Path to a NER tokens configuration file similar to the one used for extraction. | pathlib.Path | |

To determine the value to use for dataset.max_char_prediction, you can use the analyze command to find the maximum number of characters in a label of the dataset.

You must replace the pseudo-variables $dataset_name and $dataset_path with, respectively, the name of your dataset and its relative or absolute path.
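
As an illustration, and assuming the nesting mirrors the dotted names above, the dataset section could look like this (the tokens path is a placeholder):

{
  "dataset": {
    "max_char_prediction": 1000,
    "tokens": "path/to/tokens_file"
  }
}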

Model parameters

| Name | Description | Type | Default |
| --- | --- | --- | --- |
| model.transfered_charset | Transfer learning of the decision layer based on the charset of the model to transfer. | bool | True |
| model.additional_tokens | Number of additional tokens for the decision layer; only used when the charset is transferred. | int | 1 |
| model.h_max | Maximum height of the encoder output (for the 2D positional embedding). | int | 500 |
| model.w_max | Maximum width of the encoder output (for the 2D positional embedding). | int | 1000 |

Encoder

| Name | Description | Type | Default |
| --- | --- | --- | --- |
| model.encoder.dropout | Dropout probability in the encoder. | float | 0.5 |
| model.encoder.nb_layers | Number of layers in the encoder. | int | 5 |

Decoder

| Name | Description | Type | Default |
| --- | --- | --- | --- |
| model.decoder.enc_dim | Dimension of the features extracted by the encoder. | int | 256 |
| model.decoder.l_max | Maximum predicted sequence length (for the 1D positional embedding). | int | 15000 |
| model.decoder.dec_num_layers | Number of transformer decoder layers. | int | 8 |
| model.decoder.dec_num_heads | Number of heads in the transformer decoder layers. | int | 4 |
| model.decoder.dec_res_dropout | Dropout probability in the transformer decoder layers. | float | 0.1 |
| model.decoder.dec_pred_dropout | Dropout rate before the decision layer. | float | 0.1 |
| model.decoder.dec_att_dropout | Dropout rate in the multi-head attention. | float | 0.1 |
| model.decoder.dec_dim_feedforward | Number of dimensions of the feedforward layer in the transformer decoder layers. | int | 256 |
| model.decoder.attention_win | Length of the attention window. | int | 100 |
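
Putting the model, encoder and decoder parameters together, and assuming the JSON nesting mirrors the dotted names, a model section filled with the default values listed above would look like:

{
  "model": {
    "transfered_charset": true,
    "additional_tokens": 1,
    "h_max": 500,
    "w_max": 1000,
    "encoder": {
      "dropout": 0.5,
      "nb_layers": 5
    },
    "decoder": {
      "enc_dim": 256,
      "l_max": 15000,
      "dec_num_layers": 8,
      "dec_num_heads": 4,
      "dec_res_dropout": 0.1,
      "dec_pred_dropout": 0.1,
      "dec_att_dropout": 0.1,
      "dec_dim_feedforward": 256,
      "attention_win": 100
    }
  }
}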

Language model

This assumes that you have already trained a language model.

| Name | Description | Type | Default |
| --- | --- | --- | --- |
| model.lm.path | Path to the language model. | str | |
| model.lm.weight | How much weight to give to the language model. It should be set carefully (usually between 0.5 and 2.0) as it will affect the quality of the predictions. | float | |

Note: line breaks are treated as spaces by language models; as a result, predictions will not include line breaks.

The model.lm.path argument expects a path to the language model, but the parent folder should also contain:

  • a lexicon.txt file,

  • a tokens.txt file.

You should get the following tree structure:

folder/
├── <model.lm.path> # Path to the language model
├── lexicon.txt
└── tokens.txt
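
For example, and assuming the nesting mirrors the dotted names, the language model part of the configuration could be (the path is a placeholder; the weight is an illustrative value within the recommended 0.5 to 2.0 range):

{
  "model": {
    "lm": {
      "path": "folder/path_to_language_model",
      "weight": 1.0
    }
  }
}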

Training parameters

| Name | Description | Type | Default |
| --- | --- | --- | --- |
| training.output_folder | Directory for checkpoints and results. | str | |
| training.max_nb_epochs | Maximum number of epochs before stopping training. | int | 800 |
| training.load_epoch | Model to load. Should be either "best" (evaluation) or "last" (training). | str | "last" |
| training.lr_schedulers | Learning rate schedulers. | custom class | |

Device

| Name | Description | Type | Default |
| --- | --- | --- | --- |
| training.device.use_ddp | Whether to use DistributedDataParallel. | bool | False |
| training.device.ddp_port | DDP port. | int | 20027 |
| training.device.use_amp | Whether to enable automatic mixed precision. | bool | True |
| training.device.nb_gpu | Number of GPUs used to train DAN. Set to null to use all available GPUs. | int | |
| training.device.force | Use a specific device if available. Use cpu to train on CPU (for debugging) or cuda/cuda:$gpu_device to train on GPU. | str | |

To train on several GPUs, simply set the training.device.use_ddp parameter to True. By default, the model will use all available GPUs. To restrict access to fewer GPUs, one can modify the training.device.nb_gpu parameter.
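
For instance, to train on two GPUs with DDP enabled (values other than nb_gpu are the defaults listed above; the number of GPUs is illustrative and the nesting is assumed to follow the dotted names):

{
  "training": {
    "device": {
      "use_ddp": true,
      "ddp_port": 20027,
      "use_amp": true,
      "nb_gpu": 2
    }
  }
}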

Optimizers

| Name | Description | Type | Default |
| --- | --- | --- | --- |
| training.optimizers.all.args.lr | Learning rate for the optimizer. | float | 0.0001 |
| training.optimizers.all.args.amsgrad | Whether to use AMSGrad optimization. | bool | False |
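
With the default values above, and assuming the nesting mirrors the dotted names, the optimizer section would look like:

{
  "training": {
    "optimizers": {
      "all": {
        "args": {
          "lr": 0.0001,
          "amsgrad": false
        }
      }
    }
  }
}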

Validation

| Name | Description | Type | Default |
| --- | --- | --- | --- |
| training.validation.eval_on_valid | Whether to evaluate and log metrics on the validation set during training. | bool | True |
| training.validation.eval_on_valid_interval | Interval (in epochs) to evaluate during training. | int | 5 |
| training.validation.eval_on_valid_start | Wait until this epoch before evaluating. | int | 0 |
| training.validation.set_name_focus_metric | Dataset to focus on to select the best weights. | str | |
| training.validation.font | Path to the font used in the logged images. | str | fonts/LinuxLibertine.ttf |
| training.validation.maximum_font_size | Maximum font size used in the logged images. | int | |
| training.validation.nb_logged_images | Number of images to log during validation. | int | 5 |
| training.validation.limit_val_steps | Number of validation steps within an epoch. | int | 500 |

During the validation stage, the batch size is set to 1. This avoids problems caused by images of very different sizes within a batch, which would require significant padding and degrade performance.

Metrics

| Name | Description | Type | Default |
| --- | --- | --- | --- |
| training.metrics.train | List of metrics to compute during training. | list | ["loss_ce", "cer", "cer_no_token", "wer", "wer_no_punct", "wer_no_token"] |
| training.metrics.eval | List of metrics to compute during validation. | list | ["cer", "cer_no_token", "wer", "wer_no_punct", "wer_no_token"] |

Label noise scheduler

| Name | Description | Type | Default |
| --- | --- | --- | --- |
| training.label_noise_scheduler.min_error_rate | Minimum ratio of teacher forcing. | float | 0.2 |
| training.label_noise_scheduler.max_error_rate | Maximum ratio of teacher forcing. | float | 0.2 |
| training.label_noise_scheduler.total_num_steps | Number of steps before stopping teacher forcing. | float | 5e4 |

Transfer learning

| Name | Description | Type | Default |
| --- | --- | --- | --- |
| training.transfer_learning.encoder | Model to load for the encoder: [state_dict_name, checkpoint_path, learnable, strict]. | list | ["encoder", "pretrained_models/dan_rimes_page.pt", True, True] |
| training.transfer_learning.decoder | Model to load for the decoder: [state_dict_name, checkpoint_path, learnable, strict]. | list | ["decoder", "pretrained_models/dan_rimes_page.pt", True, False] |

Data

| Name | Description | Type | Default |
| --- | --- | --- | --- |
| training.data.batch_size | Mini-batch size for the training loop. | int | 2 |
| training.data.load_in_memory | Load all images in CPU memory. | bool | True |
| training.data.worker_per_gpu | Number of parallel processes per GPU for data loading. | int | 4 |
| training.data.preprocessings | List of pre-processing functions to apply to input images. | list | (see dedicated section) |
| training.data.augmentation | Whether to use data augmentation on the training set. | bool | True (see dedicated section) |
| training.data.limit_train_steps | Number of training steps within an epoch. | int | 500 |
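
As an illustration, a training.data section combining the defaults above with one of the pre-processing functions described in the next section (nesting assumed from the dotted names):

{
  "training": {
    "data": {
      "batch_size": 2,
      "load_in_memory": true,
      "worker_per_gpu": 4,
      "preprocessings": [
        {
          "type": "max_resize",
          "max_height": 2000,
          "max_width": 2000
        }
      ],
      "augmentation": true
    }
  }
}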

Preprocessing

Preprocessing is applied before training the network.

Usage:

  • Resize to a fixed height

[
    {
        "type": "fixed_height_resize",
        "fixed_height": 1500
    }
]
  • Resize to a fixed width

[
    {
        "type": "fixed_width_resize",
        "fixed_width": 1500
    }
]
  • Resize to a fixed width and a fixed height

[
    {
        "type": "fixed_resize",
        "fixed_height": 1900,
        "fixed_width": 1250
    }
]
  • Resize to a maximum size (only if the image is bigger than the given size)

[
    {
        "type": "max_resize",
        "max_height": 2000,
        "max_width": 2000
    }
]
  • Combine these pre-processings

[
    {
        "type": "fixed_height_resize",
        "fixed_height": 2000
    },
    {
        "type": "fixed_width_resize",
        "fixed_width": 2000
    }
]

Augmentation

Augmentation transformations are applied on-the-fly during training to artificially increase data variability.

DAN takes advantage of transforms from albumentations. The following configuration is used by default when using the teklia-dan train command. Data augmentation is applied with a probability of 0.9. In this case, two transformations are randomly selected to be applied.

# Imports added so the snippet is self-contained (an assumption): the transforms
# below come from albumentations, except ErosionDilation, which is a custom DAN
# transform and must be imported from the DAN codebase (import path not shown here).
import cv2
import albumentations as A
from albumentations import (
    Affine,
    CoarseDropout,
    ColorJitter,
    ElasticTransform,
    GaussNoise,
    GaussianBlur,
    Perspective,
    RandomScale,
    Sharpen,
    ToGray,
)

transforms = A.Compose(
    [
        # Scale between 0.75 and 1.0
        RandomScale(scale_limit=[-0.25, 0], p=1, interpolation=cv2.INTER_AREA),
        A.SomeOf(
            [
                ErosionDilation(min_kernel=1, max_kernel=4, iterations=1),
                Perspective(scale=(0.05, 0.09), fit_output=True, p=0.4),
                GaussianBlur(sigma_limit=2.5, p=1),
                GaussNoise(var_limit=50**2, p=1),
                ColorJitter(
                    contrast=0.2, brightness=0.2, saturation=0.2, hue=0.2, p=1
                ),
                ElasticTransform(
                    alpha=20.0, sigma=5.0, border_mode=0, p=1
                ),
                Sharpen(alpha=(0.0, 1.0), p=1),
                Affine(shear={"x": (-20, 20), "y": (0, 0)}, p=1),
                CoarseDropout(p=1),
                ToGray(p=0.5),
            ],
            n=2,
            p=0.9,
        ),
    ],
    p=0.9,
)

For a detailed description of all augmentation transforms, see the dedicated page.

MLflow logging

To log your experiment on MLflow, you need to:

  • install the extra requirements via

$ pip install .[mlflow]
  • update the following arguments:

| Name | Description | Type | Default |
| --- | --- | --- | --- |
| mlflow.run_id | ID of the current run in MLflow. | int | |
| mlflow.run_name | Name of the current run in MLflow. | str | |
| mlflow.s3_endpoint_url | URL of the S3 endpoint. | str | |
| mlflow.tracking_uri | URI of the tracking server. | str | |
| mlflow.experiment_id | ID of the current experiment in MLflow. | str | |
| mlflow.aws_access_key_id | Access key ID for the AWS server. | str | |
| mlflow.aws_secret_access_key | Secret access key for the AWS server. | str | |
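
As an illustration (all values are placeholders to replace with your own MLflow and S3 settings; the nesting is assumed to follow the dotted names):

{
  "mlflow": {
    "run_name": "my-dan-training",
    "s3_endpoint_url": "https://s3.example.com",
    "tracking_uri": "https://mlflow.example.com",
    "experiment_id": "0",
    "aws_access_key_id": "<access_key_id>",
    "aws_secret_access_key": "<secret_access_key>"
  }
}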

Weights & Biases logging

To log your run on Weights & Biases (W&B), you need to:

  • log in to W&B via

wandb login
  • update the following arguments:

| Name | Description | Type | Default |
| --- | --- | --- | --- |
| wandb.init | Keys and values to use to initialise your experiment on W&B. See the full list of available keys in the official documentation. | dict | |
| wandb.images | Whether to log images during validation with their predicted transcription. | bool | False |
| wandb.inferences | Whether to log inferences during evaluation. | bool | False |
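
For example, to log validation images to an existing W&B project (the wandb.init keys shown, project and name, are standard wandb.init arguments; all values are placeholders):

{
  "wandb": {
    "init": {
      "project": "<wandb_project>",
      "name": "<run_name>"
    },
    "images": true,
    "inferences": false
  }
}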

Using W&B during DAN training allows you to follow the training through a W&B run. This run will automatically record:

  • a configuration based on the DAN training configuration. Any wandb.init.config.* keys and values found in the DAN training configuration will be added to the W&B run configuration.

  • the metrics listed in the training.metrics key of the DAN training configuration. To edit the metrics logged to W&B, see the dedicated section.

  • images according to the wandb.images and training.validation.* keys of the DAN training configuration. To edit the images logged to W&B, see the dedicated section.

Resume run

To be sure that your DAN training produces only one W&B run, even if the training is resumed, we strongly recommend that you either reuse the --wandb parameter of your analyze command or define these two keys before starting your DAN training:

  • wandb.init.id with a unique ID that has never been used on your W&B project. We recommend generating a random 8-character identifier composed of letters and numbers using the Short Unique ID (UUID) Generating Library.

  • wandb.init.resume with the value auto.

The final configuration should look like:

{
  "wandb": {
    "init": {
      "id": "<unique_ID>",
      "resume": "auto"
    }
  }
}

Otherwise, W&B will create a new run for each DAN training session, even if the DAN training has been resumed.

Offline mode

If you do not have Internet access during the DAN training, you can set the wandb.init.mode key to offline to use W&B’s offline mode. W&B will create a wandb folder in the training.output_folder defined in the DAN training configuration. To use another location, see the dedicated section.

The final configuration should look like:

{
  "wandb": {
    "init": {
      "mode": "offline"
    }
  }
}

Once your DAN training is complete, you can publish your W&B run with the wandb sync command and the --append parameter:

wandb sync --project <wandb_project> --sync-all --append

If you prefer, you can publish your W&B run regularly using a script similar to:

#!/bin/bash

while :
do
    echo "[`date +%Y-%m-%d\ %H:%M:%S`] Publishing W&B runs...";
    wandb sync --project <wandb_project> --sync-all --append;
    echo "[`date +%Y-%m-%d\ %H:%M:%S`] W&B runs published.";

    # Publish W&B runs every 5 minutes
    sleep 5m
done

As in online mode, we recommend setting up run resuming for your W&B runs (see the dedicated section).