Jaccard index

The newspaper-eval jaccard command computes the Jaccard Error Rate, a metric that estimates how consistent the number of detected zones is with the ground truth, based on surface coverage.

To learn more about the options of this command, run newspaper-eval jaccard --help.

Purpose

This command evaluates the alignment between predictions and ground truth at page, article, and section level by performing the following steps:

- For pages: filter predicted and reference zones, then compute the Jaccard Error Rate per zone label.
- For articles and sections:
  1. Match predicted and reference articles and sections:
     - Compute an Intersection over Union (IoU) matrix between all predicted and reference articles/sections.
     - Use the Hungarian matching algorithm to pair predicted and reference articles/sections with an IoU greater than 0.5.
  2. Compute the Jaccard Error Rate per zone at article/section level.
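
The matching step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: zones are simplified to axis-aligned boxes (the real data may use polygons), the helper names `iou` and `match` are hypothetical, and a brute-force search over assignments stands in for the Hungarian algorithm. The brute-force search finds the same optimal pairing on small inputs; for real workloads a library routine such as SciPy's `linear_sum_assignment` is the usual choice. Whether the tool applies the IoU threshold before or after the assignment is also an assumption here.

```python
from itertools import permutations

Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def match(predicted: list[Box], reference: list[Box],
          threshold: float = 0.5) -> list[tuple[int, int]]:
    """Return sorted (predicted_index, reference_index) pairs.

    The one-to-one assignment maximising the total IoU is found by brute
    force (a stand-in for the Hungarian algorithm), then pairs whose IoU
    is not strictly greater than `threshold` are dropped.
    """
    matrix = [[iou(p, r) for r in reference] for p in predicted]
    n_pred, n_ref = len(predicted), len(reference)

    candidates = []
    if n_pred <= n_ref:
        # Assign each prediction to a distinct reference.
        for perm in permutations(range(n_ref), n_pred):
            candidates.append([(i, perm[i]) for i in range(n_pred)])
    else:
        # Assign each reference to a distinct prediction.
        for perm in permutations(range(n_pred), n_ref):
            candidates.append([(perm[j], j) for j in range(n_ref)])

    best = max(candidates, key=lambda pairs: sum(matrix[i][j] for i, j in pairs))
    return sorted((i, j) for i, j in best if matrix[i][j] > threshold)


predicted = [(0, 0, 10, 10), (20, 0, 30, 10), (50, 50, 60, 60)]
reference = [(21, 0, 30, 10), (0, 0, 10, 9)]
print(match(predicted, reference))  # [(0, 1), (1, 0)]
```

In this example the third predicted box overlaps no reference and is left unmatched, so it would count as an extra prediction.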

Parameters

The list of parameters is detailed in this section.

| Parameter | Description | Type | Default |
| :-------: | :---------: | :--: | :-----: |
| `--label-dir` | Path to the directory containing JSON label files. | `Path` | |
| `--prediction-dir` | Path to the directory containing JSON prediction files. | `Path` | |
| `--config` | Path to the configuration file with mapping classes. | `Path` | |
| `--iou-threshold` | Minimum IoU threshold to use for matching. | `float` | `None` |
| `--from-journal` | Whether to load files using the Journal format. | `bool` | `False` |
| `--per-sample` | Whether to evaluate metrics for each newspaper page. | `bool` | `False` |
| `--save-csv-path` | Path to a CSV file used to save the evaluation results. | `Path` | `None` |
| `--allow-partial` | Whether to allow a partial match between the files in `label-dir` and `prediction-dir`. | `bool` | `False` |

Examples

Basic evaluation

Run the following command to compute metrics:

newspaper-eval jaccard  --label-dir data/labels/ \
                        --prediction-dir data/predictions/ \
                        --config configs/finlam.yaml \
                        --from-journal

This will output:

INFO     Loading labels...
INFO     Loading prediction...
INFO     The dataset is complete and valid.
INFO     Evaluation:
|  Level  |      Class      | Jaccard Error Rate (%) | count predicted | count target |
| :-----: | :-------------: | :--------------------: | :-------------: | :----------: |
|   page  |   HEADER-TITLE  |         100.00         |        1        |      1       |
|   page  | HEADER-SUBTITLE |         100.00         |        0        |      0       |
|   page  |  ARTICLE-TITLE  |          0.00          |        0        |      2       |
|   page  |   ARTICLE-TEXT  |         80.00          |        5        |      4       |
|   page  | ILLUSTRATEDTEXT |         100.00         |        0        |      0       |
|   page  |       ALL       |         85.71          |        6        |      7       |
| article |       ALL       |         60.00          |        5        |      3       |
| section |       ALL       |         83.33          |        5        |      6       |
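
As a reading aid for the table above: in this sample output, every percentage equals the smaller of the two counts divided by the larger, with a row of 0 predicted and 0 target reported as 100. This is only an observation drawn from this one output, not a documented definition of the metric. The hypothetical helper `count_ratio` below reproduces the column under that assumption.

```python
def count_ratio(predicted: int, target: int) -> float:
    """Smaller count over larger count, as a percentage.

    Assumption inferred from the sample output: 0 predicted and
    0 target is reported as 100.
    """
    if predicted == 0 and target == 0:
        return 100.0
    return 100.0 * min(predicted, target) / max(predicted, target)


# (level, class) -> (count predicted, count target), copied from the table.
rows = {
    ("page", "HEADER-TITLE"): (1, 1),
    ("page", "HEADER-SUBTITLE"): (0, 0),
    ("page", "ARTICLE-TITLE"): (0, 2),
    ("page", "ARTICLE-TEXT"): (5, 4),
    ("page", "ILLUSTRATEDTEXT"): (0, 0),
    ("page", "ALL"): (6, 7),
    ("article", "ALL"): (5, 3),
    ("section", "ALL"): (5, 6),
}

for (level, label), (predicted, target) in rows.items():
    print(f"{level:8} {label:16} {count_ratio(predicted, target):6.2f}")
```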

Evaluation per sample

To compute metrics for each page, use the --per-sample option:

newspaper-eval jaccard  --label-dir data/labels/ \
                        --prediction-dir data/predictions/ \
                        --config configs/finlam.yaml \
                        --from-journal \
                        --per-sample

This will output:

INFO     Loading labels...
INFO     Loading prediction...
INFO     The dataset is complete and valid.
INFO     Evaluation:
|  Level  |      Class      | Jaccard Error Rate (%) | count predicted | count target |
| :-----: | :-------------: | :--------------------: | :-------------: | :----------: |
|   page  |   HEADER-TITLE  |         100.00         |        1        |      1       |
|   page  | HEADER-SUBTITLE |         100.00         |        0        |      0       |
|   page  |  ARTICLE-TITLE  |          0.00          |        0        |      2       |
|   page  |   ARTICLE-TEXT  |         80.00          |        5        |      4       |
|   page  | ILLUSTRATEDTEXT |         100.00         |        0        |      0       |
|   page  |       ALL       |         85.71          |        6        |      7       |
| article |       ALL       |         60.00          |        5        |      3       |
| section |       ALL       |         83.33          |        5        |      6       |
INFO     Per sample evaluation:
|    Sample    |  Level  | Jaccard Error Rate (%) | count predicted | count target |
| :----------: | :-----: | :--------------------: | :-------------: | :----------: |
| journal.json |   page  |         85.71          |        6        |      7       |
| journal.json | article |         60.00          |        5        |      3       |
| journal.json | section |         83.33          |        5        |      6       |

Evaluation and saving results to CSV

To save metrics in a CSV file, use the --save-csv-path option:

newspaper-eval jaccard  --label-dir data/labels/ \
                        --prediction-dir data/predictions/ \
                        --config configs/finlam.yaml \
                        --from-journal \
                        --save-csv-path metrics.csv

This will output:

INFO     Loading labels...
INFO     Loading prediction...
INFO     The dataset is complete and valid.
INFO     Evaluation:
|  Level  |      Class      | Jaccard Error Rate (%) | count predicted | count target |
| :-----: | :-------------: | :--------------------: | :-------------: | :----------: |
|   page  |   HEADER-TITLE  |         100.00         |        1        |      1       |
|   page  | HEADER-SUBTITLE |         100.00         |        0        |      0       |
|   page  |  ARTICLE-TITLE  |          0.00          |        0        |      2       |
|   page  |   ARTICLE-TEXT  |         80.00          |        5        |      4       |
|   page  | ILLUSTRATEDTEXT |         100.00         |        0        |      0       |
|   page  |       ALL       |         85.71          |        6        |      7       |
| article |       ALL       |         60.00          |        5        |      3       |
| section |       ALL       |         83.33          |        5        |      6       |
INFO     Saving metrics to CSV: metrics.csv.

This will create a new metrics.csv file:

Level,Class,Jaccard Error Rate (%),count predicted,count target
page,HEADER-TITLE,100.00,1,1
page,HEADER-SUBTITLE,100.00,0,0
page,ARTICLE-TITLE,0.00,0,2
page,ARTICLE-TEXT,80.00,5,4
page,ILLUSTRATEDTEXT,100.00,0,0
page,ALL,85.71,6,7
article,ALL,60.00,5,3
section,ALL,83.33,5,6
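
The saved file can then be loaded for further analysis. The following is a minimal sketch using Python's standard `csv` module; the `load_metrics` helper and the dictionary keys it produces are hypothetical, and the CSV text is embedded only so that the example is self-contained. In practice, pass an open file handle on metrics.csv instead.

```python
import csv
import io

# Copy of the metrics.csv content shown above.
CSV_TEXT = """\
Level,Class,Jaccard Error Rate (%),count predicted,count target
page,HEADER-TITLE,100.00,1,1
page,HEADER-SUBTITLE,100.00,0,0
page,ARTICLE-TITLE,0.00,0,2
page,ARTICLE-TEXT,80.00,5,4
page,ILLUSTRATEDTEXT,100.00,0,0
page,ALL,85.71,6,7
article,ALL,60.00,5,3
section,ALL,83.33,5,6
"""


def load_metrics(handle) -> list[dict]:
    """Parse the CSV rows into dictionaries with typed values."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "level": row["Level"],
            "class": row["Class"],
            "rate": float(row["Jaccard Error Rate (%)"]),
            "predicted": int(row["count predicted"]),
            "target": int(row["count target"]),
        })
    return rows


metrics = load_metrics(io.StringIO(CSV_TEXT))

# Keep only the aggregated "ALL" rows, keyed by level.
overall = {m["level"]: m["rate"] for m in metrics if m["class"] == "ALL"}
print(overall)  # {'page': 85.71, 'article': 60.0, 'section': 83.33}
```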