# Inference

## Description

Use the `teklia-qwen predict` command to apply a Qwen model on a set of images.
| Parameter | Description | Type | Default |
|---|---|---|---|
| `--model-name` | Path to the Qwen model to use for inference. Can be either a full model or an adapter. Should be either a local path or a name from HuggingFace. | | |
| `--images-dirs` | Path(s) to the folder(s) where the images to predict are stored. | | |
| `--output-json` | Path to save prediction results in JSON format. | | |
| `--query-path` | Path to the file containing the instruction prompt. | | |
| `--system-prompt-path` | Path to the file containing the custom system prompt. If not set, the default system prompt will be used. | | |
| `--max-new-tokens` | The maximum number of tokens to generate, ignoring the number of tokens in the prompt. | | |
| `--temperature` | The value used to modulate the next token probabilities. | | |
| `--delimiter` | The delimiter used to parse the model output in CSV mode. | | |
| `--post-process` | The post-processing method to apply to the predictions. Should be either `markdown`, `csv`, `xml` or `json`. | | |
| `--labels` | Path to the JSONL file with the labels. This will be used to generate features to train a confidence-score model. | | |
| `--confidence-model` | Path to the external confidence model. | | |
| `--stop-strings` | A list of strings that will trigger the end of the generation. | | |
| `--no-load-in-4bit` | Disable 4-bit quantization. Quantization is enabled by default. | | True |
| | Enable thinking mode. Disabled by default. | | False |
## Requirements

- Images should be resized so that their largest dimension does not exceed 2000 pixels (a resizing sketch is given after this list).

- Inference can run on a single GPU. Here are some GPUs tested and supported by the 7B model:

    - 1 x NVIDIA GeForce RTX 3090 Ti GPU
    - 1 x NVIDIA A100 GPU
    - 1 x NVIDIA V100 GPU

- Nested entities support

    Nested entities are only partially supported at inference. This specific parsing is only available in XML mode. Nesting can be arbitrarily deep.

    Below are two prediction examples with nested entities:

    - This is fully supported:

        ```xml
        <root>
          <Person>
            <Firstname>John</Firstname>
            <Lastname>Doe</Lastname>
            <Nickname>dit Anonymous</Nickname>
          </Person>
        </root>
        ```

    - This is partially supported:

        ```xml
        <root>
          <Person>
            <Firstname>John</Firstname>
            <Lastname>Doe</Lastname>
            dit <!-- This is the important detail, i.e. unnested text -->
            <Nickname>Anonymous</Nickname>
          </Person>
        </root>
        ```

        ⚠️ All entities will be properly parsed except `dit`: since it is not inside a nested entity at the same level as `John`, `Doe` and `Anonymous`, it will simply be removed from the transcription. No warning will be raised.
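If your source images are larger than 2000 pixels, the snippet below is a minimal resizing sketch using Pillow. The `raw_images/` and `images/` folder names are placeholders for this example, not options of `teklia-qwen`.

```python
from pathlib import Path

from PIL import Image  # pip install Pillow

MAX_SIDE = 2000  # largest allowed dimension, matching the requirement above

source_dir = Path("raw_images")  # placeholder folder with the original images
target_dir = Path("images")      # folder later passed to --images-dirs
target_dir.mkdir(exist_ok=True)

for image_path in source_dir.iterdir():
    image = Image.open(image_path)
    # thumbnail() resizes in place, preserves the aspect ratio and never upscales
    image.thumbnail((MAX_SIDE, MAX_SIDE))
    image.save(target_dir / image_path.name)
```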
## Examples

### Predict using a full model or an adapter

You can run a basic inference using this model from HuggingFace.

- Content of `my_query.txt`:

    ```text
    Extract the firstnames and surnames from this document. Format your answer in a Markdown table.
    ```

- Command to run:

    Both full models (`--model-name Qwen/Qwen3-VL-8B-Instruct`) and adapters (`--model-name my_adapter/`) are supported.

    ```shell
    teklia-qwen predict --model-name Qwen/Qwen3-VL-8B-Instruct \
        --images-dirs images/ \
        --output-json results.json \
        --query-path my_query.txt
    ```

- Output:

    ```json
    {
      "006bb5fa-84eb-4cb9-ae43-a12694e8d99b": {
        "raw_output": "Answer: | Firstname | Surname |\n|------|-------|\n| alain | dalmatien |\n| marie | montagne |",
        "confidence": {
          "raw": 0.87,
          "content": null,
          "structure": null
        },
        "estimated_confidence": null,
        "parsing_failed": false,
        "parsed_output": null,
        "entities": []
      },
      ...
    }
    ```
### Predict with a custom system prompt

You can run a more advanced inference using a custom system prompt. This can be useful when predicting with a fine-tuned model, as the system prompt used during training is overwritten during the export.

- Content of `my_system_prompt.txt`:

    ```text
    You need to extract information from these French documents. Each image contains a table, and each row of the table contains information about an individual. Here is the information you need to extract for each person:
    * Firstname (should be capitalized)
    * Surname (should be capitalized)
    If the information is missing, put 'N/A'.
    ```

- Command to run:

    ```shell
    teklia-qwen predict --model-name my_local_models/qwen_finetuned/ \
        --images-dirs images/ \
        --output-json results.json \
        --query-path my_query.txt \
        --system-prompt-path my_system_prompt.txt
    ```

- Output:

    ```json
    {
      "006bb5fa-84eb-4cb9-ae43-a12694e8d99b": {
        "raw_output": "Answer: | firstname | surname |\n|------|-------|\n| Alain | Damasio |\n| Marion | Montaigne |",
        "confidence": {
          "raw": 0.97,
          "content": null,
          "structure": null
        },
        "estimated_confidence": null,
        "parsing_failed": false,
        "parsed_output": null,
        "entities": []
      },
      ...
    }
    ```
### Predict with/without a quantized model

By default, the model will be loaded in 4-bit by unsloth. To disable quantization, use `--no-load-in-4bit`.

- Command to run:

    Quantization halves VRAM usage: Qwen3-VL-8B-Instruct requires ~9.4 GB with quantization, and ~19.5 GB without quantization.

    ```shell
    teklia-qwen predict --model-name Qwen/Qwen3-VL-8B-Instruct \
        --images-dirs images/ \
        --output-json results.json \
        --query-path my_query.txt \
        --no-load-in-4bit
    ```
### Predict with structured output

- Command to run:

    ```shell
    teklia-qwen predict --model-name my_local_models/qwen_finetuned/ \
        --images-dirs images/ \
        --output-json results.json \
        --query-path my_query.txt \
        --system-prompt-path my_system_prompt.txt \
        --schema my_schema.yaml
    ```
#### Schema format

A schema is a YAML file with the following top-level keys:

| Key | Type | Description |
|---|---|---|
| `name` | `str` | Name of the schema. Used as the Pydantic model name. |
| `as_list` | `bool` | If `true`, the model output is expected to be a list of objects following the schema instead of a single object. |
| `fields` | `dict` | A mapping of field names to their definitions. Set to `{}` to allow any valid JSON output without constraints. |
#### Field definition

Each field under `fields` supports the following keys:

| Key | Type | Required | Description |
|---|---|---|---|
| `type` | `str` | Yes | Type of the field. One of `str`, `int`, `float`, `bool` or `enum`. |
| `required` | `bool` | No | Whether the field is required. |
| `description` | `str` | No | Description of the field, used as a hint for the model. |
| `pattern` | `str` | No | Only for `str` fields. A regular expression that the value must match. |
| `values` | `list` | Yes | Only for `enum` fields. The list of allowed values. |
#### Examples

- To ensure a valid JSON output without any constraint, use the following schema:

    ```yaml
    name: Generic
    as_list: false
    fields: {}
    ```

- To return a list of objects, set `as_list: true`:

    ```yaml
    name: Generic
    as_list: true
    fields: {}
    ```

- To define a custom schema, use this template as an example (a rough Pydantic equivalent is sketched after this list). Set `as_list: true` to return a list of `Person`:

    ```yaml
    name: Person
    as_list: false
    fields:
      name:
        type: str
        required: true
      age:
        type: int
        required: false
      occupation:
        type: enum
        values: ["boulanger", "instituteur", "agent de mairie"]
      has_children:
        type: bool
        description: "Cette personne a-t-elle des enfants ?"
      salary:
        type: float
        required: false
        description: "Salaire mensuel net en euros"
      date_birth:
        type: str
        pattern: ^\d{2}/\d{2}/\d{4}$
        description: "Date de naissance en format DD/MM/YYYY"
    ```
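For reference, the `Person` schema above expresses roughly the same constraints as the Pydantic model sketched below (Pydantic v2 syntax). This is only an illustration, not the code generated by `teklia-qwen`, and it assumes that `required: false` maps to an optional field.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field


class Occupation(str, Enum):
    BOULANGER = "boulanger"
    INSTITUTEUR = "instituteur"
    AGENT_DE_MAIRIE = "agent de mairie"


class Person(BaseModel):
    name: str
    age: Optional[int] = None  # required: false
    occupation: Occupation     # restricted to the enum values
    has_children: bool = Field(description="Cette personne a-t-elle des enfants ?")
    salary: Optional[float] = Field(default=None, description="Salaire mensuel net en euros")
    date_birth: str = Field(
        pattern=r"^\d{2}/\d{2}/\d{4}$",
        description="Date de naissance en format DD/MM/YYYY",
    )
```

With `as_list: true`, the expected output would instead be a list of such objects.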
### Predict with constraints to stop the generation

Since Qwen can occasionally hallucinate, there are two ways to control the generation:

- Limit the number of generated tokens. Use the `--max-new-tokens` option to ensure Qwen only generates up to `n` new tokens.
- Stop when specific strings are predicted. Use the `--stop-strings` option to stop generation once any of the specified strings appears in the output.

You can combine both options: the model will stop as soon as either condition is met.

- Command to run:

    ```shell
    teklia-qwen predict --model-name my_local_models/qwen_finetuned/ \
        --images-dirs images/ \
        --output-json results.json \
        --query-path my_query.txt \
        --system-prompt-path my_system_prompt.txt \
        --max-new-tokens 200 \
        --stop-strings "\n" "</root>"
    ```

In this example:

- the model stops after generating 200 tokens,
- or stops earlier, as soon as it predicts a newline (`\n`) or the closing XML tag (`</root>`).
### Predict with post-processing

#### Markdown mode

You can also use the `--post-process markdown` option to parse the predicted Markdown table into a dictionary (an illustrative parsing sketch follows the example below).

- Command to run:

    ```shell
    teklia-qwen predict --model-name my_local_models/qwen_finetuned/ \
        --images-dirs images/ \
        --output-json results.json \
        --query-path my_query.txt \
        --system-prompt-path my_system_prompt.txt \
        --post-process markdown
    ```

- Output:

    ```json
    {
      "006bb5fa-84eb-4cb9-ae43-a12694e8d99b": {
        "raw_output": "Answer: | firstname | surname |\n|------|-------|\n| Alain | Damasio |\n| Marion | Montaigne |",
        "confidence": {
          "raw": 0.97,
          "content": 0.96,
          "structure": 1.0
        },
        "estimated_confidence": null,
        "parsing_failed": false,
        "parsed_output": "Alain Damasio\nMarion Montaigne",
        "entities": [
          { "type": "firstname", "offset": 0, "length": 5, "confidence": 1.0 },
          { "type": "surname", "offset": 6, "length": 7, "confidence": 1.0 },
          { "type": "firstname", "offset": 14, "length": 6, "confidence": 1.0 },
          { "type": "surname", "offset": 21, "length": 9, "confidence": 1.0 }
        ]
      },
      ...
    }
    ```
#### CSV mode

You can also use the `--post-process csv` option to parse the predicted CSV string into a dictionary.

- Command to run:

    ```shell
    teklia-qwen predict --model-name my_local_models/qwen_finetuned/ \
        --images-dirs images/ \
        --output-json results.json \
        --query-path my_query.txt \
        --system-prompt-path my_system_prompt.txt \
        --post-process csv \
        --delimiter ";"
    ```

- Output:

    ```json
    {
      "006bb5fa-84eb-4cb9-ae43-a12694e8d99b": {
        "raw_output": "firstname ; surname\n Alain ; Damasio \nMarion;Montaigne",
        "confidence": {
          "raw": 0.97,
          "content": 0.96,
          "structure": 1.0
        },
        "estimated_confidence": null,
        "parsing_failed": false,
        "parsed_output": "Alain Damasio\nMarion Montaigne",
        "entities": [
          { "type": "firstname", "offset": 0, "length": 5, "confidence": 1.0 },
          { "type": "surname", "offset": 6, "length": 7, "confidence": 1.0 },
          { "type": "firstname", "offset": 14, "length": 6, "confidence": 1.0 },
          { "type": "surname", "offset": 21, "length": 9, "confidence": 1.0 }
        ]
      },
      ...
    }
    ```
#### XML mode

You can also use the `--post-process xml` option to parse the predicted XML content into a dictionary.

- Command to run:

    ```shell
    teklia-qwen predict --model-name my_local_models/qwen_finetuned/ \
        --images-dirs images/ \
        --output-json results.json \
        --query-path my_query.txt \
        --system-prompt-path my_system_prompt.txt \
        --post-process xml
    ```

- Output:

    ```json
    {
      "006bb5fa-84eb-4cb9-ae43-a12694e8d99b": {
        "raw_output": "<root><firstname>Alain</firstname> <surname>Damasio</surname></root>",
        "confidence": {
          "raw": 0.97,
          "content": 0.96,
          "structure": 1.0
        },
        "estimated_confidence": null,
        "parsing_failed": false,
        "parsed_output": "Alain Damasio",
        "entities": [
          { "type": "firstname", "offset": 0, "length": 5, "confidence": 1.0 },
          { "type": "surname", "offset": 6, "length": 7, "confidence": 1.0 }
        ]
      },
      ...
    }
    ```
#### JSON mode

You can also use the `--post-process json` option to parse the predicted JSON object into a dictionary.

- Command to run:

    ```shell
    teklia-qwen predict --model-name my_local_models/qwen_finetuned/ \
        --images-dirs images/ \
        --output-json results.json \
        --query-path my_query.txt \
        --system-prompt-path my_system_prompt.txt \
        --post-process json
    ```

- Output:

    ```json
    {
      "006bb5fa-84eb-4cb9-ae43-a12694e8d99b": {
        "raw_output": "[{\"firstname\": \"Alain\", \"surname\": \"Damasio\"}, {\"firstname\": \"Marion\", \"surname\": \"Montaigne\"}]",
        "confidence": {
          "raw": 0.97,
          "content": 0.96,
          "structure": 1.0
        },
        "estimated_confidence": null,
        "parsing_failed": false,
        "parsed_output": "Alain Damasio\nMarion Montaigne",
        "entities": [
          { "type": "firstname", "offset": 0, "length": 5, "confidence": 1.0 },
          { "type": "surname", "offset": 6, "length": 7, "confidence": 1.0 },
          { "type": "firstname", "offset": 14, "length": 6, "confidence": 1.0 },
          { "type": "surname", "offset": 21, "length": 9, "confidence": 1.0 }
        ]
      },
      ...
    }
    ```
### Predict with temperature scaling

You can also use the `--temperature` option to modulate the model's temperature.

For OCR/IE, it is recommended to set a low temperature (between 0 and 0.1).
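As a reminder of what the option does (standard temperature-scaled sampling, not something specific to this tool), the temperature $T$ divides the logits $z_i$ before the softmax:

$$
p(\text{token}_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
$$

As $T$ approaches 0, the distribution concentrates on the most likely token (near-greedy decoding), which is why low values are recommended for deterministic tasks such as OCR/IE.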
- Command to run:

    ```shell
    teklia-qwen predict --model-name my_local_models/qwen_finetuned/ \
        --images-dirs images/ \
        --output-json results.json \
        --query-path my_query.txt \
        --system-prompt-path my_system_prompt.txt \
        --post-process markdown \
        --temperature 0
    ```

- Output:

    ```json
    {
      "006bb5fa-84eb-4cb9-ae43-a12694e8d99b": {
        "raw_output": "Answer: | firstname | surname |\n|------|-------|\n| Alain | Damasio |\n| Marion | Montaigne |",
        "confidence": {
          "raw": 0.76,
          "content": 0.71,
          "structure": 0.81
        },
        "estimated_confidence": null,
        "parsing_failed": false,
        "parsed_output": "Alain Damasio\nMarion Montaigne",
        "entities": [
          { "type": "firstname", "offset": 0, "length": 5, "confidence": 1.0 },
          { "type": "surname", "offset": 6, "length": 7, "confidence": 1.0 },
          { "type": "firstname", "offset": 14, "length": 6, "confidence": 1.0 },
          { "type": "surname", "offset": 21, "length": 9, "confidence": 1.0 }
        ]
      },
      ...
    }
    ```
### Predict and generate features

You can use the `--labels` option to extract features about both each image and Qwen's prediction. This argument should point to a JSONL file holding the ground-truth annotations of the images to process.

- Command to run:

    ```shell
    teklia-qwen predict --model-name Qwen/Qwen3-VL-8B-Instruct \
        --images-dirs images/ \
        --output-json results.json \
        --query-path my_query.txt \
        --system-prompt-path my_system_prompt.txt \
        --max-new-tokens 100 \
        --labels labels.jsonl
    ```

- Output:

    One CSV file, with one column per feature:

    ```csv
    image_h,image_w,aspect_ratio,image_mean_pixel,image_std_pixel,output_mean_softmax,output_mean_top2,output_std_softmax,output_mean_top2,output_std_top2,output_length,target
    500.0,750.0,0.6666666666666666,129.5420151111111,36.71494801935272,0.6521164721856683,0.507029977881603,0.269365231353945,0.35904884921690255,248.0,0.8
    ```

    The CSV will be saved in the same directory as the JSON output file, under the same name.
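As an illustration of how these features can be consumed, the sketch below loads the CSV with pandas and separates the feature columns from the `target` column. The `results.csv` file name is an assumption based on the note above, and the choice of downstream confidence-score model is left open.

```python
import pandas as pd  # pip install pandas

# Assumption: the features CSV sits next to results.json, with the same base name.
features = pd.read_csv("results.csv")

# `target` holds the score to predict (derived from labels.jsonl);
# the remaining columns are the input features for a confidence-score model.
X = features.drop(columns=["target"])
y = features["target"]

print(X.shape)
print(y.describe())
```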
### Predict with a confidence-estimation model

You can use the `--confidence-model` option to load a trained confidence-estimation model. This argument should point to a folder holding the files of the model to use.

- Command to run:

    ```shell
    teklia-qwen predict --model-name Qwen/Qwen3-VL-8B-Instruct \
        --images-dirs images/ \
        --output-json results.json \
        --query-path my_query.txt \
        --system-prompt-path my_system_prompt.txt \
        --max-new-tokens 100 \
        --confidence-model ./my_confidence_model
    ```