# Bag-of-Entity metric

The `ie-eval boe` command can be used to compute the bag-of-entity recognition and error rates, globally and for each semantic category.
## Metric description

### Recognition rate (Precision, Recall, F1)
The Bag-of-Entities (BoE) recognition rate checks whether predicted entities appear in the ground truth and whether ground truth entities appear in the prediction, regardless of their position.

- The number of True Positives (TP) is the number of entities that appear both in the label and the prediction.
- The number of False Positives (FP) is the number of entities that appear in the prediction, but not in the label.
- The number of False Negatives (FN) is the number of entities that appear in the label, but not in the prediction.
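The counting is a simple multiset comparison. As a rough illustration (not the ie-eval implementation), the sketch below represents each entity as a hypothetical `(category, text)` tuple and uses `collections.Counter` to intersect the two bags:

```python
from collections import Counter

def boe_counts(label_entities, predicted_entities):
    """Count TP, FP and FN between two bags of (category, text) entities."""
    label_bag = Counter(label_entities)
    predicted_bag = Counter(predicted_entities)
    # True positives: entities present in both bags (multiset intersection).
    tp = sum((label_bag & predicted_bag).values())
    # False positives: predicted entities with no counterpart in the label.
    fp = sum((predicted_bag - label_bag).values())
    # False negatives: label entities with no counterpart in the prediction.
    fn = sum((label_bag - predicted_bag).values())
    return tp, fp, fn
```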
From these counts, the Precision, Recall and F1-score can be computed:

- The Precision (P) is the fraction of predicted entities that also appear in the ground truth. It is defined by $\frac{TP}{TP + FP}$.
- The Recall (R) is the fraction of ground truth entities that are predicted by the automatic model. It is defined by $\frac{TP}{TP + FN}$.
- The F1-score is the harmonic mean of the Precision and Recall. It is defined by $\frac{2 \times P \times R}{P + R}$.
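Continuing the sketch above, the scores follow directly from the three counts (the guards against empty bags are an assumption; ie-eval may handle that edge case differently):

```python
def boe_scores(tp, fp, fn):
    """Precision, Recall and F1 from bag-of-entities counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```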
### Error rate (bWER)

The Bag-of-Entities (BoE) error rate is derived from the bag-of-words WER (bWER) metric proposed by Vidal et al. in *End-to-end page-level assessment of handwritten text recognition*. Entities are defined as the combination of a text and its semantic tag. For example:
- Label: `[("person", "Georges Washington"), ("date", "the last day of 1798"), ("date", "January 24th")]`
- Prediction: `[("person", "Georges Woshington"), ("date", "the last day of 1798")]`
From ground truth and predicted entities, we count the number of errors and compute the error rate.
- The number of insertions & deletions ($N_{ID}$) is the absolute difference between the number of ground truth entities and the number of predicted entities. In this case, `("date", "January 24th")` counts as a deletion, so $N_{ID} = 1$.
- The number of substitutions ($N_S$) is defined as $(N_{SID} - N_{ID}) / 2$, where $N_{SID}$ is the total number of errors, i.e. the number of entities from either the label or the prediction that cannot be matched in the other. In this case, `("person", "Georges Woshington")` counts as a substitution, so $N_S = 1$.
- The error rate ($BoE_{WER}$) is then defined as $(N_{ID} + N_S) / |G|$, where $|G|$ is the number of ground truth entities. In this example, $BoE_{WER} = 2 / 3 = 0.67$.
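As a sanity check, here is a minimal sketch of this computation (an illustration of the definitions above, not the ie-eval code); it reproduces the example, giving $N_{ID} = 1$, $N_S = 1$ and $BoE_{WER} = 2/3$:

```python
from collections import Counter

def boe_wer(label_entities, predicted_entities):
    """Bag-of-entities error rate following the bWER definition above."""
    label_bag = Counter(label_entities)
    predicted_bag = Counter(predicted_entities)
    # Insertions & deletions: absolute difference of the bag sizes.
    n_id = abs(sum(label_bag.values()) - sum(predicted_bag.values()))
    # Total errors: entities of either bag that cannot be matched in the other.
    n_sid = sum((label_bag - predicted_bag).values()) + sum((predicted_bag - label_bag).values())
    # Substitutions: each unmatched pair counts once.
    n_s = (n_sid - n_id) // 2
    return (n_id + n_s) / sum(label_bag.values())

label = [
    ("person", "Georges Washington"),
    ("date", "the last day of 1798"),
    ("date", "January 24th"),
]
prediction = [
    ("person", "Georges Woshington"),
    ("date", "the last day of 1798"),
]
print(round(boe_wer(label, prediction), 2))  # 0.67
```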
## Parameters
Here are the available parameters for this metric:
| Parameter          | Description                                             | Type   | Default |
|--------------------|---------------------------------------------------------|--------|---------|
| `--label-dir`      | Path to the directory containing BIO label files.       | `Path` |         |
| `--prediction-dir` | Path to the directory containing BIO prediction files.  | `Path` |         |
| `--by-category`    | Whether to display the metric for each category.        | `bool` | `False` |
The parameters are also described when running `ie-eval boe --help`.
## Examples

### Global evaluation
Use the following command to compute the overall BoE metrics:
```shell
ie-eval boe --label-dir Simara/labels/ \
            --prediction-dir Simara/predictions/
```
It will output the results in Markdown format:
```
2024-01-24 12:20:26,973 INFO/bio_parser.utils: Loading labels...
2024-01-24 12:20:27,104 INFO/bio_parser.utils: Loading prediction...
2024-01-24 12:20:27,187 INFO/bio_parser.utils: The dataset is complete and valid.
```
| Category | bWER (%) | Precision (%) | Recall (%) | F1 (%) | N words | N documents |
|:---------|:--------:|:-------------:|:----------:|:------:|:-------:|:-----------:|
| total | 23.23 | 77.06 | 77.34 | 77.20 | 4430 | 804 |
### Evaluation for each category
Use the following command to compute the BoE metrics for each semantic category:
```shell
ie-eval boe --label-dir Simara/labels/ \
            --prediction-dir Simara/predictions/ \
            --by-category
```
It will output the results in Markdown format:
```
2024-01-24 12:20:48,096 INFO/bio_parser.utils: Loading labels...
2024-01-24 12:20:48,232 INFO/bio_parser.utils: Loading prediction...
2024-01-24 12:20:48,315 INFO/bio_parser.utils: The dataset is complete and valid.
```
| Category | bWER (%) | Precision (%) | Recall (%) | F1 (%) | N words | N documents |
|:--------------------|:--------:|:-------------:|:----------:|:------:|:-------:|:-----------:|
| total | 23.23 | 77.06 | 77.34 | 77.20 | 4430 | 804 |
| cote_article | 2.81 | 97.21 | 97.78 | 97.49 | 676 | 676 |
| cote_serie | 2.81 | 97.64 | 97.78 | 97.71 | 676 | 676 |
| precisions_sur_cote | 11.85 | 88.28 | 88.15 | 88.21 | 675 | 675 |
| intitule | 56.09 | 43.91 | 43.91 | 43.91 | 804 | 804 |
| date | 5.73 | 94.65 | 94.27 | 94.46 | 751 | 751 |
| analyse_compl | 50.45 | 50.85 | 50.71 | 50.78 | 771 | 771 |
| classement | 25.97 | 74.03 | 74.03 | 74.03 | 77 | 77 |