Dataset analysis

Description

Use the teklia-dan dataset analyze command to analyze a dataset. This will display statistics in Markdown format.

| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| --labels | Path to the labels.json file. | pathlib.Path | |
| --tokens | Path to the tokens.yml file. | pathlib.Path | |
| --output-file | Where the summary will be saved. | pathlib.Path | |
| --wandb | Keys and values used to initialise your experiment on W&B. See the full list of available keys in the official documentation. | dict | |

Weights & Biases logging

To log your statistics file on Weights & Biases (W&B), first log in to W&B:

wandb login

Then pass your W&B configuration to the command with the --wandb parameter.

Resume run

To make sure that your statistics file is linked to your DAN training run, we strongly recommend that you either reuse the wandb.init parameters from your DAN training configuration or define these two keys:

  • id with a unique ID that has never been used in your W&B project. We recommend generating a random 8-character string composed of letters and numbers using the Short Unique ID (UUID) Generating Library.

  • resume with the value auto.

The final configuration should look like:

{
  "id": "<unique_ID>",
  "resume": "auto"
}

Otherwise, W&B will create a new run when you publish your statistics file.
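As an illustration of the ID recommendation above, here is a minimal Python sketch that builds the resume configuration. It uses Python's standard secrets module rather than the Short Unique ID library, and the helper names generate_run_id and build_resume_config are hypothetical, not part of DAN or W&B. A random ID is very unlikely to collide, but the sketch does not check that the ID is unused in your W&B project.

```python
import json
import secrets
import string

# Letters and digits, matching the "letters and numbers" recommendation.
ALPHABET = string.ascii_letters + string.digits


def generate_run_id(length: int = 8) -> str:
    """Return a random alphanumeric ID of the given length (hypothetical helper)."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def build_resume_config(run_id: str) -> dict:
    """Build the two keys described above for resuming a W&B run."""
    return {"id": run_id, "resume": "auto"}


if __name__ == "__main__":
    # Print the configuration as JSON, in the same shape as the example above.
    print(json.dumps(build_resume_config(generate_run_id()), indent=2))
```

Reuse the same ID in your DAN training configuration so that both point at the same W&B run.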

Offline mode

If you do not have Internet access while the file is being generated, you can set the mode key to offline to use W&B's offline mode. W&B will then create a wandb folder next to the --output-file defined in the command.

The final configuration should look like:

{
  "mode": "offline"
}

Once your statistics file is complete, you can publish your W&B run with the wandb sync command and the --append parameter:

wandb sync --project <wandb_project> --sync-all --append

As in online mode, we recommend configuring your W&B run so that it can be resumed (see the Resume run section).
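To make the combination of offline mode and run resuming concrete, the following Python sketch merges the keys from both sections into a single configuration. The helper name build_offline_config is hypothetical, and the sketch only produces the JSON; how that JSON is handed to the --wandb parameter is not covered here.

```python
import json


def build_offline_config(run_id: str) -> dict:
    """Combine W&B offline mode with the id and resume keys (hypothetical helper)."""
    return {"mode": "offline", "id": run_id, "resume": "auto"}


if __name__ == "__main__":
    # "<unique_ID>" is a placeholder: replace it with your own run ID.
    print(json.dumps(build_offline_config("<unique_ID>"), indent=2))
```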

Examples

Display statistics for an HTR dataset

teklia-dan dataset analyze \
    --labels path/to/dataset/labels.json \
    --output-file statistics.md

Display statistics for an HTR-NER dataset

teklia-dan dataset analyze \
    --labels path/to/dataset/labels.json \
    --tokens path/to/tokens.yml \
    --output-file statistics.md
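If you prefer to launch the analysis from a script, the two commands above can be assembled programmatically. This is only a sketch: build_analyze_command is a hypothetical helper, it uses just the --labels, --tokens and --output-file parameters documented on this page, and it prints the command instead of running it.

```python
import shlex
from pathlib import Path
from typing import List, Optional


def build_analyze_command(
    labels: Path, output_file: Path, tokens: Optional[Path] = None
) -> List[str]:
    """Assemble the argument list for teklia-dan dataset analyze (hypothetical helper)."""
    command = ["teklia-dan", "dataset", "analyze", "--labels", str(labels)]
    if tokens is not None:
        # Only HTR-NER datasets need the tokens file.
        command += ["--tokens", str(tokens)]
    command += ["--output-file", str(output_file)]
    return command


if __name__ == "__main__":
    command = build_analyze_command(
        labels=Path("path/to/dataset/labels.json"),
        output_file=Path("statistics.md"),
        tokens=Path("path/to/tokens.yml"),
    )
    # Print the shell-quoted command; pass the list to subprocess.run to execute it.
    print(shlex.join(command))
```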