Indexing

Indexation is the process by which data from Arkindex projects is fed into the search engine, to later be used for searching.

Indexation can be time-consuming, because the data must be restructured for the search engine, and it can use a lot of disk space, because it effectively duplicates the data. For this reason, indexation is only done on a per-project and per-element-type basis.

Indexation is only available if the search feature has been enabled on the Arkindex instance.

Because indexation cannot be done automatically, the search index is not fully synchronized with the database. This is why searching may turn up results for outdated data, or may not include data that has been created after the last reindexation.

Configuring indexation

Enabling or disabling indexation on a project can only be done by instance administrators through the administration interface.

  1. Under Documents > Corpora, select the project to edit.

  2. Toggle the Indexable checkbox to enable or disable indexation on the project.

  3. In the Element types table, toggle the Indexable checkboxes on each element type that should be indexed within the project.

  4. Click Save.

For a project to be indexed, both the project and at least one element type must be marked as indexable.

Arkindex does not delete an existing index after indexation is disabled on a project. System administrators may delete the Solr collection manually.
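
The rule that both the project and at least one element type must be marked as indexable can be written as a short check. This is only an illustrative sketch: the function name and the dictionary of element type flags are hypothetical and do not come from the Arkindex code base.

```python
def can_be_indexed(project_indexable, element_type_flags):
    """Return True when a project meets both indexability conditions.

    element_type_flags maps an element type name to its Indexable checkbox
    value (hypothetical structure, for illustration only).
    """
    return project_indexable and any(element_type_flags.values())


# Project is indexable and pages are indexable: indexation is possible
ok = can_be_indexed(True, {"folder": False, "page": True})

# Project is indexable but no element type is: nothing would be indexed
no_types = can_be_indexed(True, {"folder": False, "page": False})

# Element types are indexable but the project is not
no_project = can_be_indexed(False, {"page": True})
```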

Index contents

The following data types will be indexed:

While the Indexable setting on a project only determines whether a project can be indexed and searched on, the Indexable setting on element types determines how this data will be indexed.

The search engine cannot index the complex hierarchy enabled by element paths. For this reason, each item is indexed with a single parent element. The parent elements are the elements with indexable types. Any data found recursively within those parent elements is indexed as if the parent elements were its direct parents.

For example, with a common project structure where folders contain pages, and pages contain text lines with transcriptions:

  • Marking text lines as indexable will index the text lines with no parent, and the transcriptions with each text line as their parent.

  • Marking pages as indexable will index the pages with no parent, and both the text lines and transcriptions with each page as their parent.

  • Marking folders as indexable will index the folders with no parent, and everything else with each folder as their parent.

When multiple parent elements could match for a single item, for example with multiple indexable element types or for complex project structures with multiple parents, the item will be indexed multiple times, once per parent.
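
The flattening described above can be sketched with a small in-memory model. This is a simplified illustration, not Arkindex's actual indexing code: the `Element` class, the `build_documents` function and the `(parent, item)` pairs are all assumptions made for the example, and each element has a single parent here for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for an Arkindex element."""
    name: str
    type: str
    children: list = field(default_factory=list)
    transcriptions: list = field(default_factory=list)


def descendants(element):
    """Yield every transcription and child element found recursively."""
    yield from element.transcriptions
    for child in element.children:
        yield child.name
        yield from descendants(child)


def walk(element):
    """Yield an element, then all of its descendant elements."""
    yield element
    for child in element.children:
        yield from walk(child)


def build_documents(root, indexable_types):
    """Return (parent, item) pairs; parent is None for indexable elements."""
    docs = []
    for element in walk(root):
        if element.type in indexable_types:
            # The indexable element itself is indexed with no parent
            docs.append((None, element.name))
            # Everything below it is attached directly to it
            docs.extend((element.name, item) for item in descendants(element))
    return docs


line = Element("line 1", "text_line", transcriptions=["hello world"])
page = Element("page 1", "page", children=[line])
folder = Element("folder A", "folder", children=[page])

pages_only = build_documents(folder, {"page"})
folders_and_pages = build_documents(folder, {"folder", "page"})
```

In this model, `pages_only` holds the page with no parent, plus the text line and its transcription attached to the page. With both folders and pages marked as indexable, the transcription appears twice, once under the folder and once under the page, which mirrors the "once per parent" behaviour.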

Indexing a project

Once a project has been configured, a search indexation must be executed. This can be done through the web interface, or by system administrators through a management command.

Web interface

On the project management page, the Search tab offers an option to reindex the project.

Figure: Search tab of the project management page

The Drop existing index for this project checkbox will cause any existing indexed data to be deleted before reindexing. When this option is checked, searching may be temporarily disrupted by the indexation as there will be nothing to search on while the indexation runs. However, this is the only way to delete data from the search index, because an indexation cannot detect that portions of the existing index must be removed.
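
The effect of the checkbox can be illustrated with a toy model in which the search index is a plain dictionary keyed by document ID. This is only a sketch under assumptions: the `reindex` function and the idea that an indexation overwrites documents by ID are hypothetical and do not describe how Arkindex talks to the search engine.

```python
def reindex(index, project_documents, drop=False):
    """Simulate an indexation on a dict-based index (toy model)."""
    if drop:
        # Equivalent of checking "Drop existing index for this project":
        # the index is empty until the new documents are added
        index.clear()
    # An indexation only adds or overwrites documents; it never removes any
    index.update(project_documents)
    return index


existing = {"line-1": "old text", "line-2": "removed from the project"}
current = {"line-1": "new text"}

# Work on copies so both scenarios start from the same existing index
without_drop = reindex(dict(existing), current)
with_drop = reindex(dict(existing), current, drop=True)
```

Without dropping, `line-2` remains searchable even though it no longer exists in the project. With dropping, only the current documents remain.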

Clicking the Reindex button will schedule a project indexation in the background, as an asynchronous task.

System administrators can configure or disable the time limit applied to indexations started from the web interface.

In Enterprise Edition only, indexing a project is restricted to project administrators. In Community Edition, any logged-in user may index any project.

Management command

System administrators can use the reindex management command to run a search indexation. This does not run as an asynchronous task, and thus has no limit on execution time.