Dataset tokens

Description

Use the teklia-dan dataset tokens command to generate a YAML file containing entities and their token(s), needed to train a DAN model.

Parameter        Description                                                    Type            Default
entities         Path to a YAML file containing the extracted entities.        pathlib.Path
--end-tokens     Whether to generate end tokens along with starting tokens.    bool            False
--output-file    Path to a YAML file to save the entities and their token(s).  pathlib.Path    tokens.yml

The entities argument expects a YAML-formatted file listing the entity names. This file can be generated by the teklia-dan dataset entities command; more details are available on the dedicated page.

entities:
  - INTITULE
  - DATE
  - ANALYSE_COMPL.
  - PRECISIONS_SUR_COTE
  - COTE_ARTICLE
  - CLASSEMENT

Examples

Start tokens

teklia-dan dataset tokens \
    entities.yml

This command creates a YAML-formatted tokens.yml file containing one entry per NER entity: the entity label is the key to a dictionary holding its starting and ending tokens.

INTITULE: # Type of the entity on Arkindex
  start: Ⓐ # Starting token for this entity
  end: ''
DATE:
  start: Ⓑ
  end: ''
ANALYSE_COMPL.:
  start: Ⓒ
  end: ''
PRECISIONS_SUR_COTE:
  start: Ⓓ
  end: ''
COTE_ARTICLE:
  start: Ⓔ
  end: ''
CLASSEMENT:
  start: Ⓕ
  end: ''
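
For illustration, the generated file can be read with any YAML parser. The following is a minimal sketch, not part of the teklia-dan CLI, assuming PyYAML is installed and that you want to prepend start tokens to entity text when preparing transcriptions:

import yaml

# Load the token mapping generated by `teklia-dan dataset tokens`
with open("tokens.yml", encoding="utf-8") as f:
    tokens = yaml.safe_load(f)

def tag_entity(label: str, text: str) -> str:
    """Wrap an entity's text with its start (and optional end) token."""
    entity_tokens = tokens[label]
    return f"{entity_tokens['start']}{text}{entity_tokens['end'] or ''}"

print(tag_entity("DATE", "1891"))  # -> Ⓑ1891 when only start tokens were generated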

Start tokens + End tokens

teklia-dan dataset tokens \
    entities.yml \
    --end-tokens

This command creates a YAML-formatted tokens.yml file containing one entry per NER entity: the entity label is the key to a dictionary holding its starting and ending tokens.

INTITULE: # Type of the entity on Arkindex
  start: Ⓐ # Starting token for this entity
  end: Ⓑ # Ending token for this entity
DATE:
  start: Ⓒ
  end: Ⓓ
ANALYSE_COMPL.:
  start: Ⓔ
  end: Ⓕ
PRECISIONS_SUR_COTE:
  start: Ⓖ
  end: Ⓗ
COTE_ARTICLE:
  start: Ⓘ
  end: Ⓙ
CLASSEMENT:
  start: Ⓚ
  end: Ⓛ
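
With both start and end tokens, entities can also be recovered from a tagged or predicted transcription. The snippet below is a hypothetical post-processing sketch (not part of teklia-dan), assuming PyYAML and Python 3.9+:

import re
import yaml

with open("tokens.yml", encoding="utf-8") as f:
    tokens = yaml.safe_load(f)

def extract_entities(text: str) -> list[tuple[str, str]]:
    """Return (label, content) pairs found between each start/end token pair."""
    entities = []
    for label, entity_tokens in tokens.items():
        pattern = f"{re.escape(entity_tokens['start'])}(.*?){re.escape(entity_tokens['end'])}"
        for content in re.findall(pattern, text):
            entities.append((label, content))
    return entities

print(extract_entities("Ⓒ12 mars 1891Ⓓ"))  # -> [('DATE', '12 mars 1891')]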