Importing data

Arkindex provides several ways to import images, elements, transcriptions and more into a project. Some are more convenient for small, simple uploads, while others are more efficient with large amounts of data or better suited to advanced use cases.

File import

File imports are a type of process that can be started after uploading files to a project. To learn how to upload files and import them, see the dedicated documentation.

File imports support the following formats:

JPEG, JPEG 2000, PNG, TIFF, GIF and BMP images

Any image that could cause issues with an IIIF image server will be converted to JPEG automatically.

PDF documents

One folder element will be created for each document, holding one element per page. Transcriptions will be added to each page if the PDF contains machine-readable text.

IIIF Presentation manifests

One folder element will be created for each manifest, holding one element per canvas from the manifest. Metadata will be added to all elements from the manifest’s own metadata. With IIIF 3 only, metadata may also be created from the manifest’s requiredStatement and summary.

IIIF Presentation versions 2 and 3 are supported. Images embedded within those manifests are supported when they use the IIIF Image API, version 2 or 3. Other types of annotations, such as static images, audio, video or text, are not supported. See IIIF Presentation API support for more details.

IIIF Presentation collections

One folder element will be created for the collection. Both collections and manifests contained within the collection will be imported. IIIF Presentation versions 2 and 3 are supported.

See IIIF Presentation API support for more details.

ZIP and TAR archives

One folder element will be created for the archive, and all files within the archive will then be imported into this folder as if they were independent files.

Archives may contain files of any of the formats supported by the file import, and may contain a mix of multiple formats. Nested archives are supported up to three levels deep. Archives must not be password-protected.

TAR archives may be uncompressed (.tar), gzip-compressed (.tar.gz or .tgz), BZip2-compressed (.tar.bz2), LZMA-compressed (.tar.lzma, .tar.xz) or Zstandard-compressed (.tar.zst).
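As an illustration only, the extensions listed above can be mapped to their compression with a small lookup. This is a sketch, not the code Arkindex runs: the `tar_compression` name and the case-insensitive matching are assumptions made for the example.

```python
from typing import Optional

# Extension -> compression, copied from the list of supported TAR variants.
TAR_SUFFIXES = {
    ".tar": "none",
    ".tar.gz": "gzip",
    ".tgz": "gzip",
    ".tar.bz2": "bzip2",
    ".tar.lzma": "lzma",
    ".tar.xz": "lzma",
    ".tar.zst": "zstandard",
}


def tar_compression(filename: str) -> Optional[str]:
    """Return the compression implied by the file name, or None when the
    name does not end with one of the listed TAR extensions."""
    lowered = filename.lower()  # assumption: extensions match case-insensitively
    for suffix, compression in TAR_SUFFIXES.items():
        if lowered.endswith(suffix):
            return compression
    return None
```

Such a check can help confirm, before uploading, that an archive uses one of the documented extensions.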

Directory structures within archives are not supported. If an archive contains files within directories, they will be imported as if the directories were part of their name, for example as an element named directory/image.jpg.

To import archives while respecting their directory structure, you may use an S3 import instead.

Transkribus exports

ZIP archives that contain a mets.xml file are treated as if they were an export from a Transkribus collection. To learn more about importing data from Transkribus, see the Transkribus import documentation.
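To check ahead of time whether a ZIP archive is likely to be handled as a Transkribus export, you can look for a mets.xml member. The sketch below uses only Python's standard zipfile module. The function name is hypothetical, and it assumes that a mets.xml file at any depth in the archive counts, which this page does not specify.

```python
import zipfile
from pathlib import PurePosixPath


def looks_like_transkribus_export(zip_file) -> bool:
    """Return True if the ZIP archive has a member named mets.xml.

    `zip_file` may be a path or a binary file-like object.
    Assumption: the file is matched at any depth within the archive.
    """
    with zipfile.ZipFile(zip_file) as archive:
        return any(
            PurePosixPath(name).name == "mets.xml"
            for name in archive.namelist()
        )
```

If the function returns True for an archive that is not a Transkribus export, consider renaming or removing the mets.xml file before importing.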

IIIF Presentation API support

Support for versions 2 and 3 of the IIIF Presentation API is limited because its scope extends beyond the capabilities of Arkindex. Known limitations include:

  • To detect the API version, the IIIF import first attempts to read the collection or manifest as IIIF 3 using iiif-prezi3. If the document is not valid IIIF 3, the import treats it as an IIIF 2 document.

  • Ranges are ignored. Canvases will be imported in the order in which they appear within the manifest.

  • Sequences, canvases, annotation pages, annotations or content resources are not dereferenced. Manifests are treated as self-contained documents, with only embedded resources.

    The only resources allowed to be externally referenced are:

    • IIIF Image API services;

    • Collections as collection members;

    • Manifests as collection members.

  • A canvas using a static image without an IIIF Image API service will have its image ignored. An element will still be created, but without any image associated with it.

  • When a canvas contains multiple annotations with multiple images, only the first image with an IIIF Image API service will be used.

  • Any other annotation format, such as audio, video or text, is not supported.

  • No extensions are supported.

  • In IIIF 2, metadata is only created on manifests and canvases from their metadata property.

    In IIIF 3, metadata is also created on collections and manifests from the requiredStatement and summary.

  • HTML elements in metadata values are only supported on an IIIF 3 summary.
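The two canvas-related limitations in the list above (static images are ignored, and only the first image with an IIIF Image API service is used) can be sketched for an IIIF 3 canvas as follows. This is a minimal illustration, not Arkindex's implementation: the helper names are hypothetical, and the code assumes fully embedded resources and does not handle Choice bodies.

```python
from typing import Any, List, Optional


def embedded(value: Any) -> List[dict]:
    """Normalise a property to a list and keep only embedded objects."""
    items = value if isinstance(value, list) else [value]
    return [item for item in items if isinstance(item, dict)]


def first_image_service_id(canvas: dict) -> Optional[str]:
    """Return the ID of the first IIIF Image API service on an IIIF 3 canvas,
    or None when the canvas only has static images or other content."""
    for page in embedded(canvas.get("items")):            # annotation pages
        for annotation in embedded(page.get("items")):    # annotations
            for body in embedded(annotation.get("body")):
                if body.get("type") != "Image":
                    continue  # audio, video, text and so on are skipped
                for service in embedded(body.get("service")):
                    # Image API 3 services use "id"/"type";
                    # Image API 2 services use "@id"/"@type".
                    kind = service.get("type") or service.get("@type") or ""
                    if kind.startswith("ImageService"):
                        return service.get("id") or service.get("@id")
    return None
```

A canvas for which this returns None corresponds to the case where an element is still created, but without an image.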

Because IIIF 2 and 3 have significant differences in their representation of multiple values and multilingual values, the IIIF import behaves differently with each version:

IIIF 2

The first value with a language code set to en or to any en-* variant is selected. If there is no English value, the first value in the list is selected, regardless of its language.

For collection, manifest, canvas or metadata labels, only one value is ever selected.

For metadata values, when every value has a defined language, the values are treated as translations of a single value, so only one is selected. When at least one value has no defined language, one metadata entry is created for each value, regardless of language.
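The IIIF 2 rules above can be sketched as follows. This is an illustration under assumptions, not the import's actual code: the function names are hypothetical, values are taken to be plain strings or objects with @value and @language as in IIIF Presentation 2, and language codes are compared case-sensitively.

```python
from typing import List, Optional, Union

Value = Union[str, dict]


def language_of(value: Value) -> Optional[str]:
    return value.get("@language") if isinstance(value, dict) else None


def text_of(value: Value) -> str:
    return value["@value"] if isinstance(value, dict) else value


def as_list(raw) -> List[Value]:
    return raw if isinstance(raw, list) else [raw]


def is_english(value: Value) -> bool:
    lang = language_of(value)
    return lang is not None and (lang == "en" or lang.startswith("en-"))


def select_single(raw) -> str:
    """Labels: first English value, else the first value of any language.
    Expects at least one value."""
    values = as_list(raw)
    for value in values:
        if is_english(value):
            return text_of(value)
    return text_of(values[0])


def select_metadata_values(raw) -> List[str]:
    """Metadata values: one result if every value has a language
    (translations), otherwise one result per value."""
    values = as_list(raw)
    if not values:
        return []
    if all(language_of(value) for value in values):
        return [select_single(values)]
    return [text_of(value) for value in values]
```

For example, a metadata value holding a French string and an untagged string yields two results, while a French and an English translation yield only the English one.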

IIIF 3

The list of values that will be used is selected using the following order of preference:

  1. The en language code.

  2. The eng language code.

  3. The first language code that starts with en-, in alphabetical order.

  4. The first language code that starts with eng-, in alphabetical order.

  5. The special code none for values without a language set.

  6. The first language code in alphabetical order.

The selected values are then joined with spaces into a single string. For summary, the values are instead joined with two line breaks so that they form separate Markdown paragraphs.
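The order of preference above can be sketched on an IIIF 3 language map, which is a dictionary of language codes to lists of strings. The function names and the `separator` parameter are hypothetical, and "alphabetical order" is assumed here to mean Python's default string ordering.

```python
from typing import Dict, List, Optional


def preferred_language(language_map: Dict[str, List[str]]) -> Optional[str]:
    """Pick the language code to use, following the six-step preference."""
    if not language_map:
        return None
    codes = sorted(language_map)
    # Steps 1 and 2: exact en, then exact eng.
    for exact in ("en", "eng"):
        if exact in language_map:
            return exact
    # Steps 3 and 4: first en-* code, then first eng-* code.
    for prefix in ("en-", "eng-"):
        for code in codes:
            if code.startswith(prefix):
                return code
    # Step 5: values without a language.
    if "none" in language_map:
        return "none"
    # Step 6: whatever comes first alphabetically.
    return codes[0]


def select_text(language_map: Dict[str, List[str]], separator: str = " ") -> str:
    """Join the values of the preferred language into a single string."""
    code = preferred_language(language_map)
    if code is None:
        return ""
    return separator.join(language_map[code])
```

Passing `separator="\n\n"` reproduces the behaviour described for summary, while the default space applies to other properties.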

S3 import

S3 imports are a type of process for importing large volumes of data that have been pre-processed to be compatible with Arkindex and uploaded using the S3 API. Such imports are not available by default on an Arkindex instance, as they require a specific configuration with a compatible IIIF server.

To learn more, see the documentation on uploading to Teklia’s S3 server and starting an S3 import.

S3 imports support the following formats:

JPEG and PNG images

Unlike in file imports, the images are not checked for any issues or converted automatically. Users are responsible for ensuring that the images are compatible with the IIIF server. Common issues include:

  • Images that contain the EXIF Orientation metadata, which may cause the image to be unexpectedly rotated or mirrored when requesting it from the IIIF server;

  • Images that are not in the RGB colorspace;

  • Images with file extensions that do not match their file format, for example a PNG image with a .jpg extension.
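The last issue in the list above, a file extension that does not match the file format, can be caught before uploading by comparing the extension with the first bytes of the file. The sketch below relies only on the well-known PNG and JPEG file signatures. The function names are hypothetical, and it does not cover the EXIF Orientation or colorspace issues, which require an image library.

```python
from pathlib import PurePath
from typing import Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"

# Extensions for the two image formats accepted by S3 imports.
EXTENSION_FORMATS = {".jpg": "jpeg", ".jpeg": "jpeg", ".png": "png"}


def detect_format(header: bytes) -> Optional[str]:
    """Guess the image format from the first bytes of a file."""
    if header.startswith(PNG_SIGNATURE):
        return "png"
    if header.startswith(JPEG_SIGNATURE):
        return "jpeg"
    return None


def extension_matches_content(filename: str, header: bytes) -> bool:
    """Return True when the extension and the file signature agree."""
    expected = EXTENSION_FORMATS.get(PurePath(filename).suffix.lower())
    return expected is not None and detect_format(header) == expected
```

In practice, `header` would be the first eight bytes read from the file opened in binary mode.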

PDF documents

Each page of the PDF document will be converted into a JPEG image and re-uploaded into the S3 bucket. One folder element will be created for each document, holding one element per page using those JPEG images. Transcriptions will be added to each page if the PDF contains machine-readable text.

ZIP archives

Each ZIP archive is downloaded, extracted, then re-uploaded to the bucket. Archives may contain JPEG images, PNG images or PDF documents, which will then be imported following the rules set above.

Unlike in file imports, TAR archives and nested archives are not supported.

Folder elements will be created to reproduce the hierarchy found in the S3 bucket, with slashes as the directory separator. This means that an image uploaded as directory/subdirectory/image.jpg will be imported as an image.jpg element, within a subdirectory folder, within a directory folder.
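The mapping from object keys to folders can be sketched as follows. This is only an illustration of the naming rule, with hypothetical function names. It assumes that empty path segments, such as those produced by a leading or doubled slash, are ignored, which this page does not specify.

```python
from typing import Dict, Iterable, List, Tuple


def split_s3_key(key: str) -> Tuple[List[str], str]:
    """Split an object key into (folder names, element name)."""
    parts = [part for part in key.split("/") if part]  # assumption: skip empty segments
    if not parts:
        raise ValueError("empty object key")
    return parts[:-1], parts[-1]


def build_hierarchy(keys: Iterable[str]) -> Dict[str, object]:
    """Build a nested dictionary of folders; leaf elements map to None."""
    root: Dict[str, object] = {}
    for key in keys:
        folders, name = split_s3_key(key)
        node = root
        for folder in folders:
            node = node.setdefault(folder, {})
        node[name] = None
    return root
```

By contrast, the same path inside an archive sent through a file import would yield a single element whose name contains the slashes.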

Arkindex CLI

The Arkindex CLI provides several commands that allow experienced users to import PAGE XML, METS, ALTO files and more. To learn more, see the Arkindex CLI upload subcommands documentation.

Arkindex API

For particularly complex imports with specific needs not covered by the tools Arkindex provides, developers can import data directly through the Arkindex API.

Restoring an export

System administrators can import a database export of an existing Arkindex project into a new Arkindex instance using the arkindex load_export command.