I’ve recently wrote a program for solving sentence-based hangman game automatically. Here’s a postexplaining my approach using the British National Corpus, srilmlanguage model and hidden Markov model.
The Hangman Game
Last week, my friend Zero Cho was asked to write a program that solves thegame of hangman automatically. The rules aresimple: the questions are short sentences or phrases with their charactersconcealed, i.e. replaced with an underscore, while spaces and punctuations are revealed. The solvingprocess is an interactive process in which the solver program will guess a letter, and the questionsystem will reveal its locations. If you guessed 3 letters that are not in the sentence,you fail the question.
For example, the initial hint for the answer it's a sunny day
is __'_ _ _____ ___
,if a solver program guesses y
, the system will respond with __'_ _ ____y __y
.
My friend Zero used a statistical approach that first guesses vows with highest frequencies until the first correctguess. With one vow revealed, he then repeatedly choose the word with the most portion revealed tocontinue guessing. His approach to guessing a half-revealed word is also statistical, namely to find themost frequent English word that matches the pattern. His method is explainedin detail in his recent blog post.
The shortcoming of this approach is that it lacks consideration for correlations between words in thesentences. For example, even the word sorry
has a higher frequency than sunny
, the latter is morelikely to appear before the word day
. In my attempt, I trained aLanguage Model to capture such correlations, and usehidden Markov model to guess every word in the sentencesimultaneously.
Language Model
I use the state-of-the-art srilm package to train mylanguage model on a portion of the British National Corpus (BNC). For those who are familiar with srilm,the parameters I’ve used include order-5, Kneser-Ney discount, and interpolation. I use SWIG tointegrate srilm (in C) with my main program written in Python.
A language model is a statistical model that defines the probability of a sequence of words usingprobability distribution trained on a free text corpus.
For example, given a sequence of words $w_1, w_2, w_3, …, w_n$, the unigram (one-word) language modelcalculates the probability of such sequence as$$P(w{1}w{2}w{3}…w{n}) := P(w_1) * P(w_2) * P(w_3) * … * P(w_n)$$this is sometimes called an order-1 language model. The word probability can be calculated by countingits appearances in the corpus:$$ P(w) := \frac { count(w) } { |\ corpus\ | } $$An order-2 language model, or bigram language model, defines the sequence probability using conditionalprobability:$$P(w{1}w{2}w{3}…w{n}) := P(w_1) * P(w_2|w_1) * P(w_3|w_2) * … * P(w_n|w{n-1})$$intuitively, conditional probability can be calculated as:$$ P(w_a|w_b) := \frac { count(w{b}w_{a}) } { count(w_b) } $$Generally, higher order produces better results but also suffers more severely on the out-of-vocabulary problemand therefore sometimes needs to fall back to lower order conditional probabilities. Many theories have beenproposed to improve the performance of language models, including absolute discount estimation, good turing,Kneser-Ney smoothing. Formal definition of language model can be found onWikipedia.
Hidden Markov Model
Hidden Markov model is a statistical model that solves a sequence of hidden states given a sequence ofobservations, state transition probability distribution, and emission probability distribution. Forexample, in speech recognition, the recorded audio speech is the observation, the actual text are thehidden states. Another example is the optical character recognition (OCR) where the scanned images arethe observations. Emission probability distribution indicates the probability from state to observation,i.e., from text to audio or image. Language model is commonly used for state transition probabilitydistribution.
In the case solving the Hangman game, the hidden states are the English words in the answer sentences, e.g., it's
, a
,sunny
, day
, and the observations are some half-revealed sequences, e.g., __'s
, _
, s_nny
, __y
.The state transition probability is calculated by the trained language model, e.g., $P(day | sunny)$ ina order-1 language model. I define the emission probability as the normalize probability of all theEnglish words that matches the half-revealed pattern, e.g., normalized $P(day), P(pay), P(bay), P(hey),P(may), …, etc$. for state __y
, and zero otherwise. With the transition and emission probabilities defined, wecan use the Viterbi dynamic programming algorithm to calculate the optimal state sequence.
Formal definition and more information on HMM can be found onWikipedia pages here andhere, or in a slide I made when I was TA for a NLP coursehere (in Chinese).
Algorithm
It is obvious that to run the hidden Markov model on a fully concealed sequence will yield bad results.Therefore, at stage one my solver also start by guessing high frequency letters. Instead ofguessing vows only, I also use consonants, hoping that revealing some consonants will help the secondstage. The sorted list begins with e, a, i, s, r, n, t, o
, …, etc. Thisprocess stops until it had made one wrong guesses or have revealed more than four letters.
In the second stage, the trained hidden Markov model will tag the half-revealed words with English wordsequence that yield the highest language model probability and word frequencies. The solver will thenguess the letter that appears in most words. This process repeats until all characters arerevealed, and the number of failed guesses are recorded.
Results
Here are three examples and results. The interactive guessing progress and some internal states areshown in the three tables below.
“describe what is needed”
stage |errors |hint | HMM prediction| guess—– |——- |—- |——- | —–1 |0 |________ ____ __ ______
| N/A | e1 |0 |_e_____e ____ __ _ee_e_
| N/A | a1 |0 |_e_____e __a_ __ _ee_e_
| N/A | i1 |0 |_e___i_e __a_ i_ _ee_e_
| N/A | s1 |0 |_es__i_e __a_ is _ee_e_
| N/A | r2 |0 |_es_ri_e __a_ is _ee_e_
| describe what is needed | d2 |0 |des_ri_e __a_ is _eeded
| describe what is needed | n2 |0 |des_ri_e __a_ is needed
| describe what is needed | t2 |0 |des_ri_e __at is needed
| describe what is needed | t2 |0 |descri_e __at is needed
| describe what is needed | hbw (abrv.)2 |0 |describe what is needed
| (solved) |
“an honor roll of online options”
stage | errors | hint | HMM prediction | guesses |
---|---|---|---|---|
1 | 0 | __ _____ ____ __ ______ _______ | N/A | eaisr |
2 | 0 | a_ ____r r___ __ ___i_e ___i__s | at honor roll of police options | o |
2 | 0 | a_ _o_or ro__ o_ o__i_e o__io_s | at honor roll of office options | n |
2 | 0 | an _onor ro__ o_ on_ine o__ions | an honor roll of online options | l |
2 | 0 | an _onor roll o_ online o__ions | an honor roll of online options | t |
2 | 0 | an _onor roll o_ online o_tions | an honor roll of online options | hpf |
2 | 0 | an honor roll of online options | (solved) |
This one shows HMM produces different results as more characters are revealed.
“members of the supreme court”
stage | errors | hint | HMM prediction | guesses |
---|---|---|---|---|
1 | 0 | _______ __ ___ _______ _____ | N/A | e |
1 | 0 | _e__e__ __ __e ____e_e _____ | N/A | a |
2 | 1 | _e__e__ __ __e ____e_e _____ | members of the supreme court | rstoumchpbf |
2 | 1 | members of the supreme court |
This last one really shows the power of language model. With only e
revealed, the HMM process was able to guess the correct result.
Improvements
For stage one, instead of using just the highest frequency letters, conditional probability can also bea factor for choosing the next letter. For example, if s
is revealed as the second to last characterin some words, t
and e
may likely to be the last letter of such words.
For the two stages, we can decide to advance from stage 1 to 2 base on different factors instead ofheuristically. These may include the confident of the HMM results, the percentage of letters revealed,the before mentioned conditional probabilities and more. Global optimization approaches can be used tofind good parameters to determine the timing to start stage two. Alternatively, we can treat the twostages as different strategies and use genetic algorithm to apply them alternately.