Training an explicit language model
Build the language model
The teklia-dan dataset language-model command automatically generates the files required to train a language model at character, subword, or word level in my_dataset/language_model/.
Note that line breaks are replaced by spaces in the language model corpora.
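Each generated file contains one transcription per line, with tokens separated by spaces. The sketch below (plain Python, not teklia-dan code) illustrates how a single transcription could be rendered at the character and word levels; the space placeholder shown here and the subword segmentation both depend on the extraction step, so check the generated files for the exact conventions.

# Minimal sketch (not teklia-dan code): how one transcription might look in the
# character-level and word-level corpora. The space placeholder "▁" is an
# assumption; inspect the generated corpus_*.txt files for the exact convention.

def to_character_corpus_line(transcription: str, space_token: str = "▁") -> str:
    """Space-separate every character, replacing real spaces with a placeholder."""
    return " ".join(space_token if char == " " else char for char in transcription)

def to_word_corpus_line(transcription: str) -> str:
    """Words are already space-separated; just normalize the whitespace."""
    return " ".join(transcription.split())

line = "le chat"
print(to_character_corpus_line(line))  # l e ▁ c h a t
print(to_word_corpus_line(line))       # le chat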
Character-level
At the character level, we recommend building a 6-gram model. Use the following command:
bin/lmplz --order 6 \
--text my_dataset/language_model/corpus_characters.txt \
--arpa my_dataset/language_model/model_characters.arpa \
--discount_fallback
Note that the --discount_fallback option can be removed if your corpus is very large.
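Optionally, you can cross-check the corpus itself: counting its tokens and distinct symbols should give figures close to the "Unigram tokens ... types ..." line that lmplz prints (see the log below); KenLM adds special tokens such as <s>, </s> and <unk>, so the numbers will not match exactly. A minimal sketch in Python:

from collections import Counter

# Count tokens and distinct symbols in the corpus passed to lmplz. The figures
# should be close to the "Unigram tokens ... types ..." line printed by lmplz;
# KenLM adds special tokens (<s>, </s>, <unk>), so they will not match exactly.
token_counts = Counter()
with open("my_dataset/language_model/corpus_characters.txt", encoding="utf-8") as corpus:
    for line in corpus:
        token_counts.update(line.split())

print(f"tokens: {sum(token_counts.values())}, types: {len(token_counts)}")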
The following message should be displayed if the language model was built successfully:
=== 1/5 Counting and sorting n-grams ===
Reading language_model/corpus.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 111629 types 109
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:1308 2:784852864 3:1471599104 4:2354558464 5:3433731328 6:4709116928
Statistics:
1 109 D1=0.586207 D2=0.534483 D3+=1.5931
2 1734 D1=0.538462 D2=1.09853 D3+=1.381
3 7957 D1=0.641102 D2=1.02894 D3+=1.37957
4 17189 D1=0.747894 D2=1.20483 D3+=1.41084
5 25640 D1=0.812458 D2=1.2726 D3+=1.57601
6 32153 D1=0.727411 D2=1.13511 D3+=1.42722
Memory estimate for binary LM:
type kB
probing 1798 assuming -p 1.5
probing 2107 assuming -r models -p 1.5
trie 696 without quantization
trie 313 assuming -q 8 -b 8 quantization
trie 648 assuming -a 22 array pointer compression
trie 266 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:1308 2:27744 3:159140 4:412536 5:717920 6:1028896
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:1308 2:27744 3:159140 4:412536 5:717920 6:1028896
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz VmPeak:12643224 kB VmRSS:6344 kB RSSMax:1969316 kB user:0.196445 sys:0.514686 CPU:0.711161 real:0.682693
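Once the ARPA file has been written, it is worth checking that it loads and produces sensible scores. The sketch below uses the kenlm Python bindings (pip install kenlm); the sample string is illustrative and must use the same tokenization as the corpus, here space-separated characters.

import kenlm

# Load the character-level ARPA model produced by lmplz above.
model = kenlm.Model("my_dataset/language_model/model_characters.arpa")
print(f"n-gram order: {model.order}")  # expected: 6

# Score a space-separated character sequence (illustrative sample; the space
# placeholder, if any, must match the one used in corpus_characters.txt).
sample = "l e t t r e"
print(f"log10 probability: {model.score(sample, bos=True, eos=True):.2f}")
print(f"perplexity: {model.perplexity(sample):.2f}")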
Subword-level
At the subword level, we recommend building a 6-gram model. Use the following command:
bin/lmplz --order 6 \
--text my_dataset/language_model/corpus_subwords.txt \
--arpa my_dataset/language_model/model_subwords.arpa \
--discount_fallback
Note that the --discount_fallback option can be removed if your corpus is very large.
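The subword units come from the tokenizer used during dataset extraction, so it can be helpful to glance at a few corpus lines before training to see what they look like. A minimal sketch:

# Print the first few lines of the subword corpus to inspect the subword units
# produced during dataset extraction (file name as passed to --text above).
with open("my_dataset/language_model/corpus_subwords.txt", encoding="utf-8") as corpus:
    for index, line in zip(range(3), corpus):
        print(f"line {index + 1}: {line.rstrip()}")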
Word-level
At the word level, we recommend building a 3-gram model. Use the following command:
bin/lmplz --order 3 \
--text my_dataset/language_model/corpus_words.txt \
--arpa my_dataset/language_model/model_words.arpa \
--discount_fallback
Note that the --discount_fallback option can be removed if your corpus is very large.
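Before moving on to prediction, you can confirm that each model was built with the intended order. A minimal sketch, again using the kenlm Python bindings:

import kenlm

# Each entry pairs an ARPA file with the order requested via --order above.
models = [
    ("my_dataset/language_model/model_characters.arpa", 6),
    ("my_dataset/language_model/model_subwords.arpa", 6),
    ("my_dataset/language_model/model_words.arpa", 3),
]

for path, expected_order in models:
    model = kenlm.Model(path)
    status = "OK" if model.order == expected_order else "unexpected order"
    print(f"{path}: order {model.order} ({status})")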
Predict with a language model
See the dedicated example.