Arkindex 1.9.0

We are happy to announce that a new Arkindex release is available. You can explore Arkindex and try out the newest features on our demo instance, demo.arkindex.org.

Entity table removal

Until now, Arkindex had two distinct concepts: entities and transcription entities. A transcription entity is a mention of an entity within a transcription. An entity could be mentioned multiple times, and could also be linked to metadata for the special cases where an element refers to an entity without doing so through any transcription.

However, we have found that the vast majority of projects cannot group entities together, and thus only have one transcription entity per entity. Entity names were the same as the text found on transcription entities, and other entity attributes were not used. This means a lot of data was duplicated, taking a lot of disk space and slowing down any worker that produces transcription entities. Maintaining the link with both transcription entities and metadata at once was also too costly given how rarely it was used.

In this release, we are removing entities entirely, leaving only transcription entities. The breaking changes are as follows:

Transcription entities are now linked to entity types instead of entities.
Metadata cannot be linked to entities anymore, so entity_id is no longer available on metadata.
The entities list and entity details pages are removed from the frontend.
The ListCorpusEntities, CreateEntity, RetrieveEntity, UpdateEntity, PartialUpdateEntity, DestroyEntity and ListEntityElements API endpoints are removed.
The ListTranscriptionEntities API endpoint cannot filter by entity worker run or entity worker version anymore, and returns the entity type as type instead of the entity as entity.
The CreateTranscriptionEntity API endpoint now expects an entity type ID as type_id instead of an entity ID as entity_id.
The CreateTranscriptionEntities API endpoint now expects a list of transcription_entities instead of entities, and returns a list of UUIDs instead of a list of objects with both the entity and transcription entity IDs.
The SearchCorpus API endpoint now works with transcription entities and not entities.
The entity table has been removed from database exports.
The Arkindex CLI’s entity export now exports transcription entities instead of entities.

Of course, you can still produce and store entities in Arkindex as before; only their internal representation is becoming simpler and more efficient.
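For API clients, the request body changes listed above can be sketched as plain dictionary transformations. This is only an illustration under assumptions: the helper names are hypothetical, the entity-to-type lookup has to be built by the caller before upgrading, and any field other than entity_id, type_id, entities and transcription_entities (for example offset or length) is a placeholder. The exact schemas should be checked against the API documentation.

```python
from typing import Any


def migrate_single(old_body: dict[str, Any], type_of_entity: dict[str, str]) -> dict[str, Any]:
    """Adapt a CreateTranscriptionEntity body: entity_id becomes type_id.

    `type_of_entity` is a hypothetical lookup from old entity IDs to their
    entity type IDs, prepared by the caller.
    """
    # Keep every other field untouched (their names are assumptions here).
    body = {key: value for key, value in old_body.items() if key != "entity_id"}
    body["type_id"] = type_of_entity[old_body["entity_id"]]
    return body


def migrate_bulk(old_body: dict[str, Any]) -> dict[str, Any]:
    """Adapt a CreateTranscriptionEntities body: the list moves from
    `entities` to `transcription_entities`. Items are copied as-is; their
    own fields may need adjusting too."""
    body = {key: value for key, value in old_body.items() if key != "entities"}
    body["transcription_entities"] = old_body["entities"]
    return body
```

Since the bulk endpoint now returns a list of UUIDs rather than objects, client code that read both the entity and transcription entity IDs from the response will also need updating.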

On Arkindex instances that use a significant number of workers producing transcription entities, we have found that this removal can reduce the database size by up to 40%. This removal has also allowed us to simplify various complex database queries, improving performance on project, element or worker results deletion as well as transcription entity APIs.

This removal may take a long time to execute and may require manual intervention from a system administrator when upgrading to this new Arkindex release. We encourage administrators to review the technical release notes carefully.

Processes

We have made some improvements to processes and task execution to make them easier to understand and avoid common mistakes.

On Workers processes, the Load children filter can now take three different values, to allow selecting no children, only direct children, or all children recursively. This more closely matches the behavior of the element navigation.

New options for the Load children filter

With this change, we have also reworked the way in which we list elements on processes, which could bring performance improvements and make the initialization tasks faster on most processes.

To make it easier to understand how the TTL (time-to-live) of a task impacts it, the execution time is now shown in red when it exceeds the TTL, and a warning is shown when a task enters a Cancelled state after having exceeded that limit.

Additionally, we fixed an issue that caused processes with any retried tasks to be excluded when listing processes filtered by state.

Jobs

Background jobs such as deletions, search indexations or database exports can now be stopped while they are running. This can give more opportunities to avoid accidental deletions.

Multipart uploads

New APIs have been introduced to support uploading files and model versions in multiple chunks. This allows for very large files and models to be uploaded, even on networks with lower reliability. The Arkindex CLI can use those APIs through the new arkindex upload model_version and arkindex upload data_file commands.

You can now upload and use large vision models (usually based on LLMs) without storing them on external resources.
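The general idea behind uploading in multiple chunks can be sketched as follows. This is not the Arkindex API or the CLI implementation: iter_chunks is a hypothetical helper that only shows how a large file can be read as numbered parts of bounded size, so that a failed part can be retried without resending the whole file.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int) -> Iterator[tuple[int, bytes]]:
    """Yield (part_number, data) pairs, reading at most chunk_size bytes at a time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    part_number = 1
    while True:
        data = stream.read(chunk_size)
        if not data:
            # End of file: no more parts to send.
            break
        yield part_number, data
        part_number += 1


# Example with an in-memory file; a real client would send each part
# through the upload API instead of printing its size.
for number, part in iter_chunks(io.BytesIO(b"abcdefghij"), chunk_size=4):
    print(number, len(part))
```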

Worker results deletion

The worker results deletion was built before transcription entities had their own independent link to WorkerRuns. For this reason, transcription entities were only deleted when their transcription matched the specified WorkerRun. In this release, the deletion will now also target transcription entities that are directly linked to the WorkerRun.

To prevent accidental deletions, the frontend will now also prevent any worker results deletion from happening until all filters are removed in the elements navigation, as those filters cannot be applied to the deletion.
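The change in which transcription entities are targeted by a worker results deletion can be summarized with two predicates. This is a simplified model written for illustration, not the backend's actual query; the class and field names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TranscriptionEntity:
    id: str
    # WorkerRun that produced the parent transcription (hypothetical field name)
    transcription_worker_run_id: Optional[str]
    # WorkerRun linked directly to the transcription entity
    worker_run_id: Optional[str]


def targeted_before(te: TranscriptionEntity, worker_run_id: str) -> bool:
    """Previous behavior: only the transcription's WorkerRun was considered."""
    return te.transcription_worker_run_id == worker_run_id


def targeted_now(te: TranscriptionEntity, worker_run_id: str) -> bool:
    """New behavior: the transcription entity's own WorkerRun also matches."""
    return (
        te.transcription_worker_run_id == worker_run_id
        or te.worker_run_id == worker_run_id
    )
```

A transcription entity produced by one WorkerRun on top of a transcription from another WorkerRun was previously left behind when deleting the results of the former; in this model it is now selected.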

Worker configurations

We have continued to work on the new worker configuration format, particularly in building the new forms that will enable users to fill in each of the configuration fields. As this will come with more precise validation, we will be using a new format for API errors that will help the new form attribute each error to the corresponding field, making it easier to resolve any errors before saving a configuration or starting a process.

In this release, while the form is not yet available, this new API error format will be made available on CreateWorkerConfiguration, UpdateWorkerConfiguration and PartialUpdateWorkerConfiguration. In the long term, we hope to be able to gradually replace all errors on all API endpoints with this new format to make it easier for all API users to handle errors, and make error messages more precise on our frontend.

Misc

A padlock or globe icon is now shown when listing models and viewing their details, to show whether they are public.
When an error occurs while adding an element type to a project, an error notification is now shown only once and not twice.
In the worker results deletion modal, clicking on a configuration’s name now shows its details instead of a placeholder.

Upgrade notes

To upgrade a development instance, follow this documentation.

To upgrade a production instance, you need to:

Deploy this release’s Docker image: registry.gitlab.teklia.com/arkindex/backend:1.9.0
Run the database migrations: docker exec ark-backend arkindex migrate
Update the system workers: docker exec ark-backend arkindex update_system_workers

The main changes impacting developers and system administrators are detailed below.

Entity removal

The Entity table has been removed, and TranscriptionEntities are now directly linked to entity types. This can save a significant amount of disk space, as in most cases, one Entity existed for each TranscriptionEntity. On projects that include an entity recognition step on all documents, this can reduce the database size by up to 40%.

Long database migration

Two database migrations are required to make this change, and one has to update every row in the TranscriptionEntity table to switch from entity IDs to entity type IDs. The documents.0029_migrate_entity migration can thus take a long time, and may require significant backend downtime to execute to completion. This migration can however be executed normally through arkindex migrate.

Manual deduplication

The database migrations will add a new unique constraint to ensure that only one TranscriptionEntity with a given entity type can be declared, at the same position, on the same transcription, and with the same WorkerRun. There previously was a constraint to require unique entities, but there could have been multiple distinct entities of the same entity type.

If multiple entities of the same entity type exist in those conditions, the documents.0030_drop_entity migration can fail with the following error:

django.db.utils.IntegrityError: could not create unique index "unique_transcription_entity"
DETAIL:  Key (transcription_id, type_id, "offset", length, worker_run_id)=(a07c6e1f-097b-401a-9041-925415df1b5d, d498d6c9-0763-4130-95eb-3d4bda72ef43, 100, 4, null) is duplicated.
CONTEXT:  parallel worker

Manual intervention on the database, for example through arkindex dbshell, will be necessary to deduplicate the entities, as Arkindex cannot assess on its own whether this data is vital to a project or not.

To list every element ID for which there are duplicate entities that require attention, you can use the following query:

SELECT DISTINCT any_value(element_id)
FROM documents_transcriptionentity te
INNER JOIN documents_transcription t ON t.id = te.transcription_id
GROUP BY transcription_id, type_id, "offset", length, te.worker_run_id
HAVING COUNT(*) > 1;

If none of the duplicates are of any importance, you can deduplicate across the whole database with the following query:

DELETE FROM documents_transcriptionentity
WHERE id IN (
    SELECT duplicates.id FROM (
        SELECT
            id, row_number() OVER (
                PARTITION BY transcription_id, type_id, "offset", length, worker_run_id
                ORDER BY id
            ) AS position
        FROM documents_transcriptionentity
    ) AS duplicates
    WHERE duplicates.position > 1
);
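Before running the deletion, it can help to understand which rows it selects. The following sketch reproduces the same rule in Python on in-memory rows: within each group sharing the five key columns, the row with the smallest ID is kept and the others are selected for deletion. The function name and the sample rows are made up for illustration; the column names come from the query above.

```python
from collections import defaultdict

# Columns of the new unique constraint, as used in PARTITION BY above.
KEY = ("transcription_id", "type_id", "offset", "length", "worker_run_id")


def rows_to_delete(rows: list[dict]) -> list[str]:
    """Return the IDs the DELETE query would remove: every row except the
    first one (ordered by id) in each group of identical key columns."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for row in rows:
        groups[tuple(row[column] for column in KEY)].append(row["id"])
    doomed: list[str] = []
    for ids in groups.values():
        # Equivalent to row_number() > 1 with ORDER BY id.
        doomed.extend(sorted(ids)[1:])
    return sorted(doomed)


rows = [
    {"id": "a", "transcription_id": "t1", "type_id": "person", "offset": 100, "length": 4, "worker_run_id": None},
    {"id": "b", "transcription_id": "t1", "type_id": "person", "offset": 100, "length": 4, "worker_run_id": None},
    {"id": "c", "transcription_id": "t1", "type_id": "place", "offset": 100, "length": 4, "worker_run_id": None},
]
print(rows_to_delete(rows))  # only "b" duplicates an earlier row
```

Note that the choice of which duplicate survives is arbitrary with respect to the data: ordering by ID does not favor the most recent or most reliable entity.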

Full reindexation required

It was possible to search through entities using the Solr search feature. Breaking changes had to be made to the search index, and a full reindexation will be required. This can be run with the arkindex reindex --all --drop command.

Until this reindexation is executed, the search results will not include entities, and the search API will not be able to return any facets, not just the facets related to entities.