Convert

Use the teklia-dan convert command to convert DAN predictions to BIO format. This is also the code used during evaluation.

BIO format

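The BIO (or IOB) format is a tagging scheme used for the Named Entity Recognition task. Each word is written on its own line, followed by a tag: B- marks the beginning of an entity, I- marks a word inside an entity, and O marks a word outside any entity.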
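To make the tagging scheme concrete, here is a minimal sketch that reads BIO lines and groups them back into entities. The function name `read_entities` and the sample line tagged `O` are illustrative assumptions, not part of teklia-dan.

```python
def read_entities(bio):
    """Group BIO-tagged lines into (label, text) pairs."""
    entities = []
    open_label = None  # label of the entity currently being read, if any
    for line in bio.splitlines():
        if not line.strip():
            continue
        # The tag is the last space-separated field of the line.
        word, tag = line.rsplit(" ", 1)
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            open_label = tag[2:]
            entities.append((open_label, word))
        elif tag.startswith("I-") and tag[2:] == open_label:
            # An I- tag extends the entity that is currently open.
            label, text = entities[-1]
            entities[-1] = (label, f"{text} {word}")
        else:
            # "O" (or an I- tag with no matching open entity) closes it.
            open_label = None
    return entities


# Illustrative sample; the "Paris O" line is made up for this sketch.
sample = "27 B-Date\naout I-Date\n1858 I-Date\nParis O"
print(read_entities(sample))  # [('Date', '27 aout 1858')]
```

In this sketch, an I- tag that does not follow a matching B- tag is silently dropped; a stricter reader might raise an error instead.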

Description

This command operates on DAN predictions, so run the predict command first. Its first argument is the path to a folder containing the predictions in JSON format. The other required arguments are described in the table below.

| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| --output | Where BIO files are saved. Will be created if missing. | pathlib.Path | |
| --tokens | Mapping between starting tokens and end tokens to extract text with their entities. | pathlib.Path | |

The --tokens argument expects the same file as the one used during dataset extraction. This file is generated by the tokens subcommand.
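As a rough illustration of the inputs and outputs described above, the sketch below walks a folder of JSON predictions and works out which BIO file each one would produce. It is not the teklia-dan implementation: the helper name `plan_conversion` is hypothetical, and the sketch assumes each prediction stores its transcription under a `text` key and that BIO files take a `.bio` extension, as in the example on this page.

```python
import json
from pathlib import Path


def plan_conversion(predictions, output):
    """Map each BIO file to write to the predicted text it is built from."""
    # Mirror the documented behaviour: the output folder is created if missing.
    output.mkdir(parents=True, exist_ok=True)
    plan = {}
    for json_path in sorted(predictions.glob("*.json")):
        prediction = json.loads(json_path.read_text(encoding="utf-8"))
        # One BIO file per JSON prediction, under the same base name (assumed).
        plan[output / f"{json_path.stem}.bio"] = prediction["text"]
    return plan
```

Calling `plan_conversion(Path("predictions"), Path("bio"))` would return a dictionary keyed by paths such as `bio/image.bio`, each mapped to the raw predicted text that still contains the entity tokens.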

Examples

Take a simple prediction from DAN.

predictions/image.json
{
  "text": "Ⓐ27 aout 1858\nⒶ27 aout 1858\nⒶ27 aout 1858\nⒶ28 aout 1858\nⒶ30 aout 1858",
  "confidences": {},
  "language_model": {},
  "objects": [...]
}

With this tokens map:

tokens.yml
Date:
  start: Ⓐ
  end:

Then you can create the corresponding BIO file using:

teklia-dan convert predictions --tokens tokens.yml --output bio

The folder pointed to by --output will be created if missing. This command generates one BIO file per JSON prediction, with the same base name.

bio/image.bio
27 B-Date
aout I-Date
1858 I-Date
27 B-Date
aout I-Date
1858 I-Date
27 B-Date
aout I-Date
1858 I-Date
28 B-Date
aout I-Date
1858 I-Date
30 B-Date
aout I-Date
1858 I-Date
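The conversion shown in this example can be sketched in a few lines of Python. This is a simplified illustration, not the code shipped in teklia-dan: the function name `to_bio` and the dictionary layout are assumptions, and the sketch assumes that an entity without an end token runs until the next start token, that words outside any entity are tagged `O`, and that entities are not nested.

```python
def to_bio(text, tokens):
    """Convert token-marked text to BIO lines.

    `tokens` maps an entity name to its markers,
    e.g. {"Date": {"start": "Ⓐ", "end": ""}}.
    """
    start_to_name = {t["start"]: name for name, t in tokens.items()}
    # Empty or missing end tokens are ignored.
    end_tokens = {t["end"] for t in tokens.values() if t.get("end")}

    # Surround every marker with spaces so that split() isolates it.
    for marker in [*start_to_name, *end_tokens]:
        text = text.replace(marker, f" {marker} ")

    lines = []
    current = None  # name of the entity being read, or None
    first = False   # whether the next word opens the entity
    for word in text.split():
        if word in start_to_name:
            current, first = start_to_name[word], True
        elif word in end_tokens:
            current = None
        elif current is None:
            lines.append(f"{word} O")
        else:
            prefix = "B" if first else "I"
            lines.append(f"{word} {prefix}-{current}")
            first = False
    return "\n".join(lines)


tokens = {"Date": {"start": "Ⓐ", "end": ""}}
print(to_bio("Ⓐ27 aout 1858\nⒶ28 aout 1858", tokens))
```

Run on the first and fourth lines of the prediction above, this prints the same `B-Date` / `I-Date` lines as the corresponding part of bio/image.bio.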