Dataset formatting
To train PyLaia, you need line images and their corresponding transcriptions. The dataset should be split into three sets: training, validation, and test.
The dataset should be formatted as follows:
# Images
├── images
│   ├── train/
│   ├── val/
│   └── test/
# Image ids (used for prediction)
├── train_ids.txt
├── val_ids.txt
├── test_ids.txt
# Tokenized transcriptions (used for training)
├── train.txt
├── val.txt
├── test.txt
# Transcriptions (used for evaluation)
├── train_text.txt
├── val_text.txt
├── test_text.txt
# Symbol list
└── syms.txt
Images
By default, images should be resized to a fixed height (recommended value: 128 pixels). This can be done using ImageMagick’s mogrify command:
mogrify -resize x128 images/*.jpg
Note that PyLaia can also support variable-size images by setting --fixed_input_height 0 during model initialization.
Ground truth
Tokenized transcriptions
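Two files, train.txt and val.txt, are required to train the model. Each line maps an image name to its tokenized transcription, for the training and validation sets respectively. Tokens are separated by spaces, and spaces in the original text are written as the <space> token.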
Example:
train/im01 f o r <space> d e t <space> t i l f æ l d e <space> d e t <space> s k u l d e <space> l y k k e s <space> D i g
train/im02 a t <space> o p d r i v e <space> d e t <space> o m s k r e v n e <space> e x p l : <space> a f
train/im03 « F r u <space> I n g e r » , <space> a t <space> s e n d e <space> m i g <space> s a m m e
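The tokenized lines can be derived from plain transcriptions with a few lines of Python. The following is a minimal sketch, not part of PyLaia: the function names `tokenize` and `tokenize_line` are hypothetical, and it assumes character-level tokens with every space replaced by `<space>`, as in the example above.

```python
def tokenize(text: str, space_token: str = "<space>") -> str:
    """Split a transcription into space-separated character tokens."""
    return " ".join(space_token if char == " " else char for char in text)


def tokenize_line(line: str) -> str:
    """Tokenize one '<image name> <transcription>' line, keeping the image name intact."""
    # Only the first space separates the image name from the transcription.
    image_name, _, text = line.rstrip("\n").partition(" ")
    return f"{image_name} {tokenize(text)}"


print(tokenize_line("train/im02 at opdrive det omskrevne expl: af"))
# prints: train/im02 a t <space> o p d r i v e <space> d e t <space> o m s k r e v n e <space> e x p l : <space> a f
```

Applying `tokenize_line` to every line of a plain transcription file would produce the corresponding tokenized file.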
Transcriptions
Three files, train_text.txt, val_text.txt and test_text.txt, are required to evaluate your models. Each line maps an image name to its non-tokenized transcription.
Example:
train/im01 for det tilfælde det skulde lykkes Dig
train/im02 at opdrive det omskrevne expl: af
train/im03 «Fru Inger», at sende mig samme
Image list
Three files, train_ids.txt, val_ids.txt and test_ids.txt, are required to run predictions. They list image names without transcriptions, one per line. Each can be obtained from the matching transcription file, for example for the training set:
cut -d' ' -f1 train_text.txt > train_ids.txt
Example:
train/im01
train/im02
train/im03
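Since the id, tokenized, and plain transcription files all refer to the same images, it can help to check that they agree before training. The sketch below is a convenience check under stated assumptions, not a PyLaia feature: the helper names `image_ids` and `check_same_ids` are hypothetical, and requiring the same order in both files is a stricter condition chosen here for simplicity.

```python
def image_ids(lines):
    """Return the first space-separated field (the image name) of each non-empty line."""
    return [line.split(" ", 1)[0].strip() for line in lines if line.strip()]


def check_same_ids(text_lines, tokenized_lines):
    """Raise ValueError unless both files list the same images in the same order."""
    text_ids = image_ids(text_lines)
    token_ids = image_ids(tokenized_lines)
    if text_ids != token_ids:
        # Ids present in only one of the two files (empty if only the order differs).
        unmatched = sorted(set(text_ids) ^ set(token_ids))
        raise ValueError(f"image ids differ or are ordered differently; unmatched: {unmatched}")
    return text_ids


text_lines = ["train/im01 for det\n", "train/im02 at opdrive\n"]
tokenized_lines = ["train/im01 f o r <space> d e t\n", "train/im02 a t <space> o p d r i v e\n"]
print("\n".join(check_same_ids(text_lines, tokenized_lines)))
```

In practice the two lists of lines would be read from files such as train_text.txt and train.txt, and the returned ids could be written to train_ids.txt.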
Symbol list
Finally, a file named syms.txt is required. It maps each token from the training set to its index, starting with the <ctc> token at index 0.
Example:
<ctc> 0
! 1
" 2
& 3
' 4
( 5
) 6
+ 7
, 8
- 9
. 10
/ 11
0 12
1 13
2 14
3 15
4 16
5 17
6 18
7 19
8 20
9 21
: 22
; 23
< 24
= 25
> 26
? 27
A 28
B 29
C 30
D 31
E 32
F 33
G 34
H 35
I 36
J 37
K 38
L 39
M 40
N 41
O 42
P 43
Q 44
R 45
S 46
T 47
U 48
V 49
W 50
X 51
Y 52
Z 53
[ 54
] 55
a 56
b 57
c 58
d 59
e 60
f 61
g 62
h 63
i 64
j 65
k 66
l 67
m 68
n 69
o 70
p 71
q 72
r 73
s 74
t 75
u 76
v 77
w 78
x 79
y 80
z 81
« 82
¬ 83
» 84
¼ 85
½ 86
Å 87
Æ 88
Ø 89
à 90
á 91
â 92
ä 93
å 94
æ 95
ç 96
è 97
é 98
ê 99
ö 100
ø 101
ù 102
û 103
ü 104
– 105
— 106
’ 107
„ 108
… 109
<unk> 110
<space> 111