Explicit language modeling with n-grams
Build the language model
Generate resources to train the language model
To train a language model, you need to generate a corpus containing the training text tokenized at character, subword or word level.
Characters
Here is a sample of text tokenized at character level (corpus_characters.txt).
u d e <space> i <space> r e s t a u r a n t e r ,
v æ r e t <space> u h y r e <space> m e g e t <space> s a m m e n , <space> o f t e <space> t i l <space> m a a l t i d e r <space> o g <space> t i l <space> t h e <space> h o s <space> O s s b a h r ,
v i <space> s i d d e r <space> v e d <space> k a m i n e n <space> d e r <space> o g <space> s n a k k e r , <space> h v i l k e t <space> e r <space> m e g e t <space> m o r s o m t . <space> N u
k o m m e r <space> d e r <space> m a n g e <space> r e i s e n d e <space> v e n n e r <space> e l l e r <space> s l æ g t <space> e l l e r <space> p r i n s e s s e r , <space> s o m
O s s b a h r <space> m a a <space> v æ r e <space> s a m m e n <space> m e d <space> H e d b e r g <space> o f t e <space> o g s a a . <space> M e n <space> v i <space> k a n <space> l e v e
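A character-level corpus like this can be generated with simple shell commands. The sketch below is only an illustration: it assumes the raw training text lives in a hypothetical text_lines.txt (one line per text line), that the text contains no literal @ character, and that a UTF-8 locale is active so accented characters are not split into bytes.
# Sketch: mark real spaces, split every character, then restore spaces as <space>.
sed 's/ /@/g; s/./& /g; s/@/<space>/g; s/ $//' text_lines.txt \
    > my_dataset/language_model/corpus_characters.txt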
Subwords
Here is a sample of text tokenized at subword level (corpus_subwords.txt).
ud e <space> i <space> r e st au r ant er ,
været <space> u h y r e <space> meget <space> sammen , <space> ofte <space> til <space> ma altid er <space> og <space> til <space> th e <space> hos <space> O s s ba h r ,
vi <space> sidde r <space> ved <space> ka min en <space> der <space> og <space> snakke r , <space> hvilket <space> er <space> meget <space> morsomt . <space> Nu
kommer <space> der <space> mange <space> r e i sende <space> venner <space> eller <space> s læg t <space> eller <space> pr in s e s ser , <space> som
O s s ba h r <space> maa <space> være <space> sammen <space> med <space> H e d berg <space> ofte <space> ogsaa . <space> Men <space> vi <space> kan <space> lev e
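The exact subword tokenizer used to build this corpus is not shown here. As one possible approach, a trained SentencePiece model could be applied and its word-boundary marker rewritten as the explicit <space> token; the model and file names below are placeholders.
# Sketch: encode with a trained SentencePiece model, then map the word-boundary
# marker to the explicit <space> token used in the corpus above.
spm_encode --model=subwords.model --output_format=piece < text_lines.txt \
    | sed 's/ ▁/ <space> /g; s/^▁ *//' \
    > my_dataset/language_model/corpus_subwords.txt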
Words
Here is a sample of text tokenized at word level (corpus_words.txt).
ude <space> i <space> restauranter <space> ,
været <space> uhyre <space> meget <space> sammen <space> , <space> ofte <space> til <space> maaltider <space> og <space> til <space> the <space> hos <space> Ossbahr <space> ,
vi <space> sidder <space> ved <space> kaminen <space> der <space> og <space> snakker <space> , <space> hvilket <space> er <space> meget <space> morsomt <space> . <space> Nu
kommer <space> der <space> mange <space> reisende <space> venner <space> eller <space> slægt <space> eller <space> prinsesser <space> , <space> som
Ossbahr <space> maa <space> være <space> sammen <space> med <space> Hedberg <space> ofte <space> ogsaa <space> . <space> Men <space> vi <space> kan <space> leve
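A word-level corpus like this can be approximated with shell tools, for example by detaching punctuation from words and marking every token boundary with <space>. The command below is only a sketch with a placeholder input file; note that it also splits hyphens and apostrophes inside words.
# Sketch: detach punctuation, squeeze and trim spaces, then mark every token
# boundary with the explicit <space> token.
sed 's/\([[:punct:]]\)/ \1/g; s/  */ /g; s/^ //; s/ $//; s/ / <space> /g' text_lines.txt \
    > my_dataset/language_model/corpus_words.txt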
Train the language model
Once your corpus is created, you can estimate the n-gram model with KenLM's lmplz tool.
Characters
At character level, we recommend building a 6-gram model. Use the following command:
bin/lmplz --order 6 \
--text my_dataset/language_model/corpus_characters.txt \
--arpa my_dataset/language_model/model_characters.arpa \
--discount_fallback
The --discount_fallback option can be removed if your corpus is very large.
The following message should be displayed if the language model was built successfully:
=== 1/5 Counting and sorting n-grams ===
Reading language_model/corpus.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 111629 types 109
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:1308 2:784852864 3:1471599104 4:2354558464 5:3433731328 6:4709116928
Statistics:
1 109 D1=0.586207 D2=0.534483 D3+=1.5931
2 1734 D1=0.538462 D2=1.09853 D3+=1.381
3 7957 D1=0.641102 D2=1.02894 D3+=1.37957
4 17189 D1=0.747894 D2=1.20483 D3+=1.41084
5 25640 D1=0.812458 D2=1.2726 D3+=1.57601
6 32153 D1=0.727411 D2=1.13511 D3+=1.42722
Memory estimate for binary LM:
type kB
probing 1798 assuming -p 1.5
probing 2107 assuming -r models -p 1.5
trie 696 without quantization
trie 313 assuming -q 8 -b 8 quantization
trie 648 assuming -a 22 array pointer compression
trie 266 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:1308 2:27744 3:159140 4:412536 5:717920 6:1028896
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:1308 2:27744 3:159140 4:412536 5:717920 6:1028896
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz VmPeak:12643224 kB VmRSS:6344 kB RSSMax:1969316 kB user:0.196445 sys:0.514686 CPU:0.711161 real:0.682693
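Optionally, you can sanity-check the resulting ARPA file by scoring some tokenized text with KenLM's query binary; a finite perplexity in its output confirms that the model loads and scores text. The paths below reuse the training corpus purely as an example.
# Optional sanity check: score tokenized text with the trained model.
bin/query my_dataset/language_model/model_characters.arpa \
    < my_dataset/language_model/corpus_characters.txt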
Subwords
At subword level, we recommend building a 6-gram model. Use the following command:
bin/lmplz --order 6 \
--text my_dataset/language_model/corpus_subwords.txt \
--arpa my_dataset/language_model/model_subwords.arpa \
--discount_fallback
The --discount_fallback option can be removed if your corpus is very large.
Words
At word level, we recommend building a 3-gram model. Use the following command:
bin/lmplz --order 3 \
--text my_dataset/language_model/corpus_words.txt \
--arpa my_dataset/language_model/model_words.arpa \
--discount_fallback
The --discount_fallback option can be removed if your corpus is very large.
Predict with a language model
Once the language model is trained, you need to generate a list of tokens and a lexicon.
List of tokens
The list of tokens tokens.txt lists all the tokens that can be predicted by PyLaia. It should be similar to syms.txt, but without the indices, and can be generated with this command:
cut -d' ' -f 1 syms.txt > tokens.txt
This file does not depend on the tokenization level.
<ctc>
.
,
a
b
c
...
<space>
Lexicon
The lexicon lists all the words in the vocabulary and their decomposition into tokens.
Characters
At character level, words are simply characters, so the lexicon_characters.txt file should map each character to itself:
<ctc> <ctc>
. .
, ,
a a
b b
c c
...
<space> <space>
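Since every character maps to itself, one simple way to produce this file is to duplicate the token column of tokens.txt. This is only a sketch and assumes tokens.txt holds one token per line.
# Sketch: repeat each token so every character maps to itself.
awk '{print $1, $1}' tokens.txt > lexicon_characters.txt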
Predict with PyLaia
See the dedicated example.