Development

DAN relies on several tools during its development.

Linter

Code syntax is checked before the code is submitted.

To run the linting tool suite, you can use pre-commit:

pip install pre-commit
pre-commit run -a
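
You can also install pre-commit as a Git hook, so that the same checks run automatically on each commit:

pre-commit install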

Tests

Unit tests

Tests are executed with tox using pytest.

pip install tox
tox

To recreate the tox virtual environment (e.g. after a dependency update), you can run tox -r.

Run a single test module: tox -- <test_path>

Run a single test: tox -- <test_path>::<test_function>
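
For example, with a hypothetical module tests/test_extraction.py containing a test_parse function:

tox -- tests/test_extraction.py
tox -- tests/test_extraction.py::test_parse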

The tests use a large file stored via Git LFS. Make sure to run git lfs pull before running them.
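
If Git LFS is not set up in your clone yet, initialize it first (both are standard Git LFS commands):

git lfs install
git lfs pull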

Commands

Since the unit tests do not cover everything, it is sometimes necessary to run DAN commands directly to test new developments.
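
Each command accepts extra parameters. Assuming the CLI follows the usual --help convention, you can list the options of any subcommand, for example:

teklia-dan dataset download --help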

Dataset tokens command

The library already has all the documents needed to run the dataset tokens command on a minimalist dataset. In the tests/data directory, you can run the following command and add any extra parameters you need:

teklia-dan dataset tokens entities.yml
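
For example, from the repository root:

cd tests/data
teklia-dan dataset tokens entities.yml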

Dataset download command

The library already has all the documents needed to run the dataset download command on a minimalist dataset. In the tests/data/extraction directory, you can run the following command and add any extra parameters you need:

teklia-dan dataset download --output .

Dataset language-model command

The library already has all the documents needed to run the dataset language-model command on a minimalist dataset. In the tests/data/prediction directory, you can run the following command and add any extra parameters you need:

teklia-dan dataset language-model --output . --subword-vocab-size 45

Dataset analyze command

The library already has all the documents needed to run the dataset analyze command on a minimalist dataset. In the tests/data/training/training_dataset directory, you can run the following command and add any extra parameters you need:

teklia-dan dataset analyze --labels labels.json --output-file analyze.md

Training command

The library already has all the documents needed to run the training command on a minimalist dataset. You can use the configuration available at configs/tests.json. It is already populated with the parameters used in the unit tests.

teklia-dan train --config configs/tests.json

Evaluation command

The library already has all the documents needed to run the evaluation command on a minimalist dataset. You can use the configuration available at configs/eval.json. It is already populated with the parameters used in the unit tests.

teklia-dan evaluate --config configs/eval.json

Predict command

The library already has all the documents needed to run the predict command with a minimalist model. In the tests/data/prediction directory, you can run the following command and add any extra parameters you need:

teklia-dan predict \
    --image-dir images/ \
    --image-extension png \
    --model . \
    --output /tmp/dan-predict
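
Once the command finishes, you can inspect the generated predictions; the exact layout of the output directory depends on the parameters used:

ls /tmp/dan-predict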

Convert command

If you want to evaluate NER models with your own scripts, you can convert DAN’s predictions to BIO format using the convert command.

teklia-dan convert /tmp/dan-predict --tokens tokens.yml --output /tmp/dan-convert
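
For reference, BIO is a standard tagging scheme in which each word is labelled B-<entity> (beginning of an entity), I-<entity> (inside an entity) or O (outside any entity). A converted prediction therefore looks something like this (the entity names are purely illustrative):

Georges B-name
Washington I-name
was O
born O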

Documentation

This documentation is written in AsciiDoc and generated by Antora.

Setup

Install the required dependencies with:

npm install

Build the documentation using make antora. You can then write the relevant docs/*.adoc files in AsciiDoc and view the output at file:///path/to/the/repo/public/index.html.
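
If your browser restricts access to local file:// URLs, you can serve the built site instead, e.g. with Python’s built-in HTTP server, then browse to http://localhost:8000:

cd public
python -m http.server 8000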