Evaluation

Use the `teklia-qwen evaluate` command to evaluate Qwen's predictions against the labels.

| Parameter | Description | Type | Default |
|:----------|:------------|:-----|:--------|
| `--labels` | Path to the JSONL label file. | `pathlib.Path` | |
| `--predictions` | Path to the JSON prediction file. | `pathlib.Path` | |
| `--entities` | Whether to compute Nerval scores and entity-based error rates. | `bool` | `False` |
| `--show-n-worst` | Show the n worst samples. | `int` | `0` |
| `--nerval-threshold` | Nerval threshold (entities will be matched if their CER is < threshold). | `float` | `0.0` |
| `--bio-dir` | Path to store the BIO files generated during the evaluation. | `pathlib.Path` | `None` |

Examples

Evaluate a model

By default, the evaluation computes the Character Error Rate (CER), the Word Error Rate (WER), and a Format Score, which is the percentage of correctly formatted predictions.

Command to use:

teklia-qwen evaluate --labels test.jsonl --predictions predict_test.json

Output:

2025-09-11 17:53:06,581 INFO/qwen.evaluate: Summary:
| Set  | CER (%) | WER (%) | Format Score (%) | N samples |
|:----:|:-------:|:-------:|:----------------:|:---------:|
| test |   9.16  |  25.85  |      100.0       |     81    |
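
As a rough illustration of the three default metrics, the sketch below computes CER and WER from edit distances and a Format Score from JSON validity. It is not the `teklia-qwen` implementation: the helper names, the whitespace tokenization, the pooling of errors over the whole set, and the "parses as JSON" criterion are all assumptions, and the sample pairs are made up.

```python
import json


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def error_rate(pairs, tokenize):
    """Total edit distance divided by total reference length, in percent."""
    errors = sum(edit_distance(tokenize(ref), tokenize(hyp)) for ref, hyp in pairs)
    length = sum(len(tokenize(ref)) for ref, _ in pairs)
    return 100 * errors / length


def is_well_formatted(prediction):
    """Assumption: a prediction is well formatted when it parses as JSON."""
    try:
        json.loads(prediction)
    except json.JSONDecodeError:
        return False
    return True


# Hypothetical (label, prediction) pairs, not taken from a real dataset.
pairs = [
    ('[{"nom": "Bosc"}]', '[{"nom": "Bose"}]'),
    ('[{"nom": "Roux"}]', '[{"nom": "Roux"'),
]

cer = error_rate(pairs, tokenize=list)        # character level
wer = error_rate(pairs, tokenize=str.split)   # whitespace-separated tokens
format_score = 100 * sum(is_well_formatted(hyp) for _, hyp in pairs) / len(pairs)
print(f"CER={cer:.2f}%  WER={wer:.2f}%  Format Score={format_score:.1f}%")
```

In this toy set the second prediction is truncated, so it counts against the Format Score while still contributing character and word errors.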

Evaluate and show the n worst images

Command to use:

teklia-qwen evaluate --labels test.jsonl \
        --predictions predict_test.json \
        --show-n-worst 3

Output:

2025-09-15 13:54:45,999 INFO/qwen.evaluate: Summary:
| Set  | CER (%) | WER (%) | Format Score (%) | N samples |
|:----:|:-------:|:-------:|:----------------:|:---------:|
| test |   9.16  |  25.85  |      100.0       |     81    |
2025-09-15 13:54:46,000 INFO/qwen.evaluate: Worst samples:
|               Image ID               | CER (%) | WER (%) | Format |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Label                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |                                                                                                                                                                                                                                                                                                                                                                                                                Prediction                                                                                                                                                                                                                                                                                                                                                                                                                |
|:------------------------------------:|:-------:|:-------:|:------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| 434bf324-07a9-4bb7-9b63-9d1af2ada29c |  61.77  |  78.46  |  True  |                                                                                                                                                                                                                                                                                                                                                                                                   [{"nom": "Pauvrau"}, {"nom": "Paul\u00e9"}, {"nom": "Lanthivine"}, {"nom": "Guignon"}, {"nom": "Bosc"}, {"nom": "Mant\u00e9"}, {"nom": "Gueirard"}, {"nom": "Pr\u00e9bois"}, {"nom": "Dard"}, {"nom": "Nevi\u00e8re"}, {"nom": "Jez\u00e9quel"}, {"nom": "Barrol"}, {"nom": "Barth\u00e9lemy"}]                                                                                                                                                                                                                                                                                                                                                                                                   |                                                                                                                [{"nom": "Amorau"}, {"nom": "Paul\u00e9"}, {"nom": "Lanthivone"}, {"nom": "Guiguer"}, {"nom": "Bose"}, {"nom": "Mante"}, {"nom": "Gucirard"}, {"nom": "Soci\u00e9t\u00e9 des gran\u00e8s bassins"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}, {"nom": "Idem"}]                                                                                                                |
| 0457d388-57c1-4fa8-9cfd-1b684c1e6ea9 |  25.31  |  44.34  |  True  | [{"nom": "Cadi\u00e9re"}, {"nom": "Mourou P\u00e9cunia"}, {"nom": "Jouveucel", "prenom": "Marcel Curet"}, {"nom": "Lercari"}, {"nom": "Bianco Giraud"}, {"nom": "Roux"}, {"nom": "Matharon"}, {"nom": "Audrieu", "prenom": "Ou\u00e9te"}, {"prenom": "Pigoul"}, {"nom": "Barseus Chardonne"}, {"nom": "R\u00e9cugis", "prenom": "Louis"}, {"nom": "Paul", "prenom": "Rapharet"}, {"nom": "B\u00e9renger"}, {"nom": "Fournier"}, {"nom": "Flotte"}, {"nom": "Rivi\u00e9re"}, {"nom": "Boeuf", "prenom": "Christin"}, {"nom": "Sicard", "prenom": "Adolphe des Cables"}, {"nom": "Herautte"}, {"nom": "Chardonne"}, {"nom": "Jaune", "prenom": "Matharon"}, {"nom": "Piguol"}, {"nom": "Bianco"}, {"nom": "Toche"}, {"nom": "B\u00e9levy", "prenom": "Bernard"}, {"nom": "David"}, {"nom": "Coste"}, {"nom": "Crastain"}, {"nom": "Laurent"}, {"nom": "Maurras"}, {"nom": "Orichioni", "prenom": "Carrega"}, {"nom": "Berre"}, {"nom": "Bouguier"}, {"nom": "Ou\u00e9te"}, {"nom": "Fortun\u00e9"}, {"nom": "Long"}, {"nom": "Touati Roux"}, {"nom": "Tric Tric"}, {"nom": "Cay"}, {"nom": "Giraud"}] | [{"nom": "Cadi\u00e8re"}, {"nom": "Mourou"}, {"nom": "Jouvencel"}, {"nom": "Lercari"}, {"nom": "Bianco"}, {"nom": "Roux"}, {"nom": "Matharon"}, {"nom": "Audrieu"}, {"nom": "Piguel"}, {"nom": "Barreus"}, {"nom": "R\u00e9cugis"}, {"nom": "Paul"}, {"nom": "B\u00e9renger"}, {"nom": "Fournier"}, {"nom": "Flotte"}, {"nom": "Nivi\u00e8re"}, {"nom": "Boeuf"}, {"nom": "Sicard"}, {"nom": "Herautte"}, {"nom": "Chardonne"}, {"nom": "Jaume"}, {"nom": "Piguel"}, {"nom": "Bianco"}, {"nom": "Toche"}, {"nom": "S\u00e9levy"}, {"nom": "David"}, {"nom": "Coste"}, {"nom": "Crastain"}, {"nom": "Laurent"}, {"nom": "Maurras"}, {"nom": "Oriolioni"}, {"nom": "Berre"}, {"nom": "Rousquier"}, {"nom": "On\u00e9ti"}, {"nom": "Fortun\u00e9"}, {"nom": "Long"}, {"nom": "Touati"}, {"nom": "Tric"}, {"nom": "Cay"}, {"nom": "Giraud"}] |
| aea27708-2ab4-4643-b309-cbd619f5b466 |  20.84  |  34.09  |  True  |                                                                                                                                                                                                                                                                                                                                                             [{"nom": "Garnier"}, {"nom": "Girardello"}, {"nom": "Barbieri"}, {"nom": "Giraud"}, {"nom": "Renoux"}, {"nom": "Nanasque"}, {"nom": "Pons"}, {"nom": "Costa"}, {"nom": "Costa"}, {"nom": "Cessin"}, {"nom": "Soulari"}, {"nom": "Mascilesie"}, {"nom": "Figoli"}, {"nom": "Molfino"}, {"nom": "Garnier"}, {"nom": "Garello"}, {"nom": "Ginouv\u00e9s"}, {"nom": "Lamonico"}]                                                                                                                                                                                                                                                                                                                                                            |                                                                                                                                                                                   [{"nom": "Garnier"}, {"nom": "Girandello"}, {"nom": "Barbieri"}, {"nom": "Giraud", "prenom": "Jouglas"}, {"nom": "Renoux"}, {"nom": "Nanasque"}, {"nom": "Point"}, {"nom": "Costa"}, {"nom": "Costa"}, {"nom": "Cessin", "prenom": "C\u00e9saire"}, {"nom": "Soulari"}, {"nom": "Masalesi", "prenom": "Delague"}, {"nom": "Figoli"}, {"nom": "Molfino"}, {"nom": "Garnier"}, {"nom": "Garello"}, {"nom": "Genouv\u00e9s", "prenom": "Delague"}, {"nom": "Lamonica"}]                                                                                                                                                                                    |
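
The selection of the worst samples can be pictured as a top-n ranking over per-sample scores. The sketch below assumes the ranking key is the per-sample CER, which is consistent with the ordering in the table above but is not confirmed by it; the `samples` list, its field names, and the `n_worst` helper are hypothetical.

```python
import heapq

# Hypothetical per-sample scores; IDs and values are illustrative only.
samples = [
    {"image_id": "img-a", "cer": 9.5},
    {"image_id": "img-b", "cer": 61.8},
    {"image_id": "img-c", "cer": 20.8},
    {"image_id": "img-d", "cer": 25.3},
]


def n_worst(samples, n):
    """Return the n samples with the highest CER, worst first."""
    return heapq.nlargest(n, samples, key=lambda sample: sample["cer"])


for sample in n_worst(samples, 3):
    print(sample["image_id"], sample["cer"])
```

With `n` set to 0, this helper returns an empty list, mirroring the default of `--show-n-worst`, where no sample is listed.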

Evaluate a model with Nerval

To compute Nerval metrics and per-category error rates, use the `--entities` option.

Command to use:

teklia-qwen evaluate --labels test.jsonl \
        --predictions predict_test.json \
        --entities

Output:

2025-09-15 13:48:24,048 INFO/qwen.evaluate: Summary:
| Set  | CER (%) | WER (%) | Format Score (%) | N samples |
|:----:|:-------:|:-------:|:----------------:|:---------:|
| test |   9.16  |  25.85  |      100.0       |     81    |
2025-09-15 13:51:45,698 INFO/bio_parser.utils: Loading labels...
2025-09-15 13:51:45,705 INFO/bio_parser.utils: Loading prediction...
2025-09-15 13:51:45,711 INFO/bio_parser.utils: The dataset is complete and valid.
2025-09-15 13:51:45,813 INFO/qwen.evaluate: Nerval evaluation:
| Category | Precision (%) | Recall (%) | F1 (%) | Support |
|:---------|:-------------:|:----------:|:------:|:-------:|
| nom      |     56.17     |   56.93    | 56.55  |   1990  |
| prenom   |     40.87     |   40.55    | 40.71  |   254   |
| total    |     54.47     |   55.08    | 54.78  |   2244  |
2025-09-15 13:51:45,814 INFO/bio_parser.utils: Loading labels...
2025-09-15 13:51:45,820 INFO/bio_parser.utils: Loading prediction...
2025-09-15 13:51:45,825 INFO/bio_parser.utils: The dataset is complete and valid.
2025-09-15 13:51:45,849 INFO/qwen.evaluate: CER/WER evaluation:
| Category | ECER (%) | EWER (%) | Support |
|:---------|:--------:|:--------:|--------:|
| nom      |  13.38   |  43.82   |      76 |
| prenom   |  31.83   |  52.99   |      22 |
| total    |  15.83   |  45.20   |      81 |
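
The F1 column of the Nerval evaluation table is the harmonic mean of precision and recall, which can be checked by hand. The short sketch below recomputes it from the rounded values printed above; `f1_score` is an illustrative helper, not part of `teklia-qwen` or Nerval.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in %) copied from the Nerval evaluation table above.
rows = {"nom": (56.17, 56.93), "prenom": (40.87, 40.55)}
for category, (precision, recall) in rows.items():
    print(category, round(f1_score(precision, recall), 2))
# nom 56.55
# prenom 40.71
```

Recomputing from rounded percentages can differ in the last digit: the `total` row gives 54.77 this way, against 54.78 in the table, presumably because the tool works from unrounded counts.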

Evaluate a model with custom Nerval threshold

Nerval aligns labeled and predicted entities by computing the Character Error Rate (CER) between them. The default threshold is 0 (strict), so only entities with identical text are matched.

To use a different threshold, set --nerval-threshold.

Command to use:

teklia-qwen evaluate --labels test.jsonl \
        --predictions predict_test.json \
        --entities \
        --nerval-threshold 0.3

Output:

2025-09-15 13:52:42,407 INFO/qwen.evaluate: Summary:
| Set  | CER (%) | WER (%) | Format Score (%) | N samples |
|:----:|:-------:|:-------:|:----------------:|:---------:|
| test |   9.16  |  25.85  |      100.0       |     81    |
2025-09-15 13:52:42,408 INFO/bio_parser.utils: Loading labels...
2025-09-15 13:52:42,415 INFO/bio_parser.utils: Loading prediction...
2025-09-15 13:52:42,422 INFO/bio_parser.utils: The dataset is complete and valid.
2025-09-15 13:52:42,526 INFO/qwen.evaluate: Nerval evaluation:
| Category | Precision (%) | Recall (%) | F1 (%) | Support |
|:---------|:-------------:|:----------:|:------:|:-------:|
| nom      |     80.71     |   81.81    | 81.26  |   1990  |
| prenom   |     59.52     |   59.06    | 59.29  |   254   |
| total    |     78.36     |   79.23    | 78.79  |   2244  |
2025-09-15 13:52:42,527 INFO/bio_parser.utils: Loading labels...
2025-09-15 13:52:42,533 INFO/bio_parser.utils: Loading prediction...
2025-09-15 13:52:42,539 INFO/bio_parser.utils: The dataset is complete and valid.
2025-09-15 13:52:42,562 INFO/qwen.evaluate: CER/WER evaluation:
| Category | ECER (%) | EWER (%) | Support |
|:---------|:--------:|:--------:|--------:|
| nom      |  13.38   |  43.82   |      76 |
| prenom   |  31.83   |  52.99   |      22 |
| total    |  15.83   |  45.20   |      81 |
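
To see why a looser threshold raises precision and recall while leaving ECER and EWER unchanged, the sketch below counts matches between a few label and prediction entities taken from the worst sample shown earlier. It is a simplified greedy matcher, not Nerval's alignment algorithm, and it assumes a pair matches when its CER is at most the threshold (an assumption made here because a threshold of 0 still yields matches in the output above). All function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings."""
    previous = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        current = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def count_matches(label_entities, predicted_entities, threshold):
    """Greedily match each prediction to an unused label of the same category."""
    remaining = list(label_entities)
    matches = 0
    for category, text in predicted_entities:
        for candidate in remaining:
            ref_category, ref_text = candidate
            cer = edit_distance(ref_text, text) / max(len(ref_text), 1)
            # Assumption: a pair matches when CER <= threshold.
            if ref_category == category and cer <= threshold:
                remaining.remove(candidate)
                matches += 1
                break
    return matches


# (category, text) pairs from the first worst sample listed in this page.
labels = [("nom", "Bosc"), ("nom", "Lanthivine"), ("nom", "Guignon")]
predictions = [("nom", "Bose"), ("nom", "Lanthivone"), ("nom", "Guiguer")]

for threshold in (0.0, 0.3):
    matched = count_matches(labels, predictions, threshold)
    precision = 100 * matched / len(predictions)
    recall = 100 * matched / len(labels)
    print(f"threshold={threshold}: {matched} matches, "
          f"precision={precision:.1f}%, recall={recall:.1f}%")
```

Here "Bose" (CER 0.25) and "Lanthivone" (CER 0.10) only count as matches once the threshold reaches 0.3, while "Guiguer" (CER of about 0.43) stays unmatched in both runs.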