Validation
Use the bio-parser validate command to parse and validate the structure of one or more BIO2 files.
Supported format
BIO2 is a common tagging format for NER (named entity recognition) tasks. More details are available on Wikipedia.
An example of such a tagging format is given below.
Alex B-PER
is O
going O
to O
Los B-LOC
Angeles I-LOC
in O
California B-LOC
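The ordering rule behind this format (an `I-` tag must continue an entity opened by a `B-` tag with the same label) can be sketched in a few lines of Python. This is only an illustration, not bio-parser's own implementation: the function name `check_bio2` and the error messages are hypothetical, and it assumes one `token tag` pair per line, separated by whitespace.

```python
def check_bio2(text: str) -> list[tuple[str, str]]:
    """Parse `token tag` lines and check the BIO2 ordering rule.

    Hypothetical helper for illustration; not part of bio-parser.
    """
    pairs: list[tuple[str, str]] = []
    open_label = None  # label of the entity currently open, if any

    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines

        parts = line.rsplit(None, 1)  # the tag is the last field
        if len(parts) != 2:
            raise ValueError(f"Token {len(pairs)}: expected `token tag`, got {line!r}")
        token, tag = parts

        if tag == "O":
            open_label = None
        elif tag.startswith("B-"):
            open_label = tag[2:]
        elif tag.startswith("I-"):
            # An inside tag must continue an entity with the same label.
            if open_label != tag[2:]:
                raise ValueError(
                    f"Token {len(pairs)}: found `{tag}` without a preceding `B-{tag[2:]}`"
                )
        else:
            raise ValueError(f"Token {len(pairs)}: unknown tag `{tag}`")

        pairs.append((token, tag))

    return pairs


sample = "Alex B-PER\nis O\ngoing O\nto O\nLos B-LOC\nAngeles I-LOC\nin O\nCalifornia B-LOC"
print(check_bio2(sample))
```

A file whose first line is tagged `I-LOC` makes this sketch raise a `ValueError`, which is the same kind of structural problem the command reports.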
Usage
You can specify one or more paths to your BIO2 files. Each file must have the .bio extension.
The parser will check them one by one and report the first error encountered.
$ bio-parser validate input.bio
[12:37:20] INFO Parsing file @ `input.bio` validate.py:19
INFO The file @ `input.bio` is valid! validate.py:25
With multiple files:
$ bio-parser validate input1.bio input2.bio
[12:37:20] INFO Parsing file @ `input1.bio` validate.py:19
INFO The file @ `input1.bio` is valid! validate.py:25
[12:37:20] INFO Parsing file @ `input2.bio` validate.py:19
INFO The file @ `input2.bio` is valid! validate.py:25
With an invalid file:
$ bio-parser validate invalid.bio
[12:41:16] INFO Parsing file @ `invalid.bio` validate.py:19
ERROR Error on token n°0: Found `Tag.INSIDE` before `Tag.BEGINNING`. document.py:283
ERROR Could not load the file @ `invalid.bio`: validate.py:24
INFO The file @ `invalid.bio` is valid! validate.py:25
In addition to validating the structure of the file, a JSON representation of the BIO file is also saved at the same location.
This JSON file has three keys:
- bio_repr: the string in BIO format passed to the command,
- tokens: the list of tokens in the file, with their index and text,
- spans: the list of NER entities found and their tokens.
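To give a rough idea of how these three keys relate to the input, here is a minimal Python sketch that builds a similar structure from a BIO2 string. Only the top-level keys (`bio_repr`, `tokens`, `spans`) come from the description above; the inner field names (`idx`, `text`, `label`) and the function name `to_json_repr` are assumptions made for illustration, so the file actually written by the command may be laid out differently. The sketch also assumes the input is already valid.

```python
import json


def to_json_repr(bio_text: str) -> dict:
    """Build a dict with the three documented keys from a valid BIO2 string.

    Inner field names are assumed, not taken from bio-parser.
    """
    tokens = []
    spans = []
    current = None  # span currently being extended, if any

    for line in bio_text.splitlines():
        if not line.strip():
            continue  # skip blank lines

        text, tag = line.rsplit(None, 1)
        token = {"idx": len(tokens), "text": text}
        tokens.append(token)

        if tag.startswith("B-"):
            # A beginning tag opens a new entity.
            current = {"label": tag[2:], "tokens": [token]}
            spans.append(current)
        elif tag.startswith("I-") and current is not None:
            # An inside tag extends the entity currently open.
            current["tokens"].append(token)
        else:
            # An `O` tag closes any open entity.
            current = None

    return {"bio_repr": bio_text, "tokens": tokens, "spans": spans}


sample = "Los B-LOC\nAngeles I-LOC\nin O\nCalifornia B-LOC"
print(json.dumps(to_json_repr(sample), indent=2))
```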