Training an explicit language model
Build the language model
The teklia-dan dataset language-model command automatically generates the files required to train a language model at character, subword, or word level in my_dataset/language_model/.
Note that line breaks are replaced by spaces in the language model corpora.
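Each generated file contains one transcription per line, with tokens separated by spaces. The sketch below (plain Python, not teklia-dan code) illustrates how a single transcription could be rendered at the character and word levels; the space placeholder shown here and the subword segmentation both depend on the extraction step, so check the generated files for the exact conventions.

# Minimal sketch (not teklia-dan code): how one transcription might look in the
# character-level and word-level corpora. The space placeholder "▁" is an
# assumption; inspect the generated corpus_*.txt files for the exact convention.

def to_character_corpus_line(transcription: str, space_token: str = "▁") -> str:
    """Space-separate every character, replacing real spaces with a placeholder."""
    return " ".join(space_token if char == " " else char for char in transcription)

def to_word_corpus_line(transcription: str) -> str:
    """Words are already space-separated; just normalize the whitespace."""
    return " ".join(transcription.split())

line = "le chat"
print(to_character_corpus_line(line))  # l e ▁ c h a t
print(to_word_corpus_line(line))       # le chat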
Character-level
At the character level, we recommend building a 6-gram model. Use the following command:
bin/lmplz --order 6 \
--text my_dataset/language_model/corpus_characters.txt \
--arpa my_dataset/language_model/model_characters.arpa \
--discount_fallback
Note that the --discount_fallback option can be removed if your corpus is very large.
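Optionally, you can cross-check the corpus itself: counting its tokens and distinct symbols should give figures close to the "Unigram tokens ... types ..." line that lmplz prints (see the log below); KenLM adds special tokens such as <s>, </s> and <unk>, so the numbers will not match exactly. A minimal sketch in Python:

from collections import Counter

# Count tokens and distinct symbols in the corpus passed to lmplz. The figures
# should be close to the "Unigram tokens ... types ..." line printed by lmplz;
# KenLM adds special tokens (<s>, </s>, <unk>), so they will not match exactly.
token_counts = Counter()
with open("my_dataset/language_model/corpus_characters.txt", encoding="utf-8") as corpus:
    for line in corpus:
        token_counts.update(line.split())

print(f"tokens: {sum(token_counts.values())}, types: {len(token_counts)}")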
The following message should be displayed if the language model was built successfully:
=== 1/5 Counting and sorting n-grams ===
Reading language_model/corpus.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 111629 types 109
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:1308 2:784852864 3:1471599104 4:2354558464 5:3433731328 6:4709116928
Statistics:
1 109 D1=0.586207 D2=0.534483 D3+=1.5931
2 1734 D1=0.538462 D2=1.09853 D3+=1.381
3 7957 D1=0.641102 D2=1.02894 D3+=1.37957
4 17189 D1=0.747894 D2=1.20483 D3+=1.41084
5 25640 D1=0.812458 D2=1.2726 D3+=1.57601
6 32153 D1=0.727411 D2=1.13511 D3+=1.42722
Memory estimate for binary LM:
type kB
probing 1798 assuming -p 1.5
probing 2107 assuming -r models -p 1.5
trie 696 without quantization
trie 313 assuming -q 8 -b 8 quantization
trie 648 assuming -a 22 array pointer compression
trie 266 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:1308 2:27744 3:159140 4:412536 5:717920 6:1028896
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:1308 2:27744 3:159140 4:412536 5:717920 6:1028896
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz VmPeak:12643224 kB VmRSS:6344 kB RSSMax:1969316 kB user:0.196445 sys:0.514686 CPU:0.711161 real:0.682693
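Once the ARPA file has been written, it is worth checking that it loads and produces sensible scores. The sketch below uses the kenlm Python bindings (pip install kenlm); the sample string is illustrative and must use the same tokenization as the corpus, here space-separated characters.

import kenlm

# Load the character-level ARPA model produced by lmplz above.
model = kenlm.Model("my_dataset/language_model/model_characters.arpa")
print(f"n-gram order: {model.order}")  # expected: 6

# Score a space-separated character sequence (illustrative sample; the space
# placeholder, if any, must match the one used in corpus_characters.txt).
sample = "l e t t r e"
print(f"log10 probability: {model.score(sample, bos=True, eos=True):.2f}")
print(f"perplexity: {model.perplexity(sample):.2f}")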
Subword-level
At the subword level, we recommend building a 6-gram model. Use the following command:
bin/lmplz --order 6 \
--text my_dataset/language_model/corpus_subwords.txt \
--arpa my_dataset/language_model/model_subwords.arpa \
--discount_fallback
Note that the --discount_fallback option can be removed if your corpus is very large.
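The subword units come from the tokenizer used during dataset extraction, so it can be helpful to glance at a few corpus lines before training to see what they look like. A minimal sketch:

# Print the first few lines of the subword corpus to inspect the subword units
# produced during dataset extraction (file name as passed to --text above).
with open("my_dataset/language_model/corpus_subwords.txt", encoding="utf-8") as corpus:
    for index, line in zip(range(3), corpus):
        print(f"line {index + 1}: {line.rstrip()}")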
Word-level
At the word level, we recommend building a 3-gram model. Use the following command:
bin/lmplz --order 3 \
--text my_dataset/language_model/corpus_words.txt \
--arpa my_dataset/language_model/model_words.arpa \
--discount_fallback
Note that the --discount_fallback option can be removed if your corpus is very large.
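Before moving on to prediction, you can confirm that each model was built with the intended order. A minimal sketch, again using the kenlm Python bindings:

import kenlm

# Each entry pairs an ARPA file with the order requested via --order above.
models = [
    ("my_dataset/language_model/model_characters.arpa", 6),
    ("my_dataset/language_model/model_subwords.arpa", 6),
    ("my_dataset/language_model/model_words.arpa", 3),
]

for path, expected_order in models:
    model = kenlm.Model(path)
    status = "OK" if model.order == expected_order else "unexpected order"
    print(f"{path}: order {model.order} ({status})")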
Predict with a language model
See the dedicated example.