Bright Wire

Building a Markov Model from source text and using it to generate new text.

Motivation

Markov Models can be used to create n-gram based language models - that is, for each n-gram of preceding tokens they give the probability of every possible next token.
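
For example, a trigram (window size 3) model is in essence a lookup table from the last three tokens to counts - and hence probabilities - of the tokens that followed them in the training text. A minimal hand-rolled sketch of the idea on a toy token sequence (independent of Bright Wire, assuming the usual System, System.Collections.Generic and System.Linq usings) might look like this:

// illustrative only: count which tokens follow each trigram in a toy token sequence
var tokens = new[] { "the", "cat", "sat", "on", "the", "mat", "." };
var counts = new Dictionary<string, Dictionary<string, int>>();
for (var i = 3; i < tokens.Length; i++) {
    var trigram = tokens[i - 3] + " " + tokens[i - 2] + " " + tokens[i - 1];
    if (!counts.ContainsKey(trigram))
        counts[trigram] = new Dictionary<string, int>();
    var next = tokens[i];
    counts[trigram][next] = counts[trigram].ContainsKey(next) ? counts[trigram][next] + 1 : 1;
}

// a token's probability after a trigram is its count divided by that trigram's total count
var after = counts["the cat sat"];
Console.WriteLine((double)after["on"] / after.Values.Sum());   // 1 - "on" is the only observed continuation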

Generated text from n-gram based language models, although not always syntactically correct, does roughly approximate the training corpus. Some of the generated text is in fact quite interesting!

Installing from Nuget

Create a new .NET 4.6 console application and include Bright Wire.

Install-Package BrightWire.Net4

Getting the Data

To train a Markov Model we need some source text. Here we load The Beautiful and Damned by F. Scott Fitzgerald from Project Gutenberg.

Bright Wire includes a simple tokeniser that splits text into words, numbers and punctuation tokens, and forms sentences from those tokens.

// tokenise the novel "The Beautiful and Damned" by F. Scott Fitzgerald
List<IReadOnlyList<string>> sentences;
using (var client = new WebClient()) {
    var data = client.DownloadString("http://www.gutenberg.org/cache/epub/9830/pg9830.txt");
    // skip the Project Gutenberg preamble by starting at the first chapter
    var pos = data.IndexOf("CHAPTER I");
    sentences = SimpleTokeniser.FindSentences(SimpleTokeniser.Tokenise(data.Substring(pos))).ToList();
}
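
At this point sentences holds one token list per sentence. As a quick sanity check (not part of the original code) we could print the sentence count and the tokens of the first sentence:

// sanity check: how many sentences were found, and what the first one looks like
Console.WriteLine(sentences.Count + " sentences");
Console.WriteLine(string.Join(" ", sentences[0]));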

Training the Model

When training a Markov Model we need to choose the window size - the number of preceding tokens that form the key for each table of next-token probabilities. For this example we are using a window of size 3 (trigrams).

// create a markov trainer that uses a window of size 3
var trainer = Provider.CreateMarkovTrainer3<string>();
foreach (var sentence in sentences)
    trainer.Add(sentence);
var model = trainer.Build().AsDictionary;
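
Before generating anything, the transition table can be inspected directly. GetTransitions is the same lookup the generation loop below uses; querying the empty state (three nulls), exactly as the loop does on its first step, shows the learned distribution over sentence-starting tokens:

// print the learned distribution over sentence-starting tokens (the empty state)
foreach (var transition in model.GetTransitions(default(string), default(string), default(string)))
    Console.WriteLine(transition.NextState + ": " + transition.Probability);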

Generating Text

Now that we have a model we can generate some text - fifty sentences of never-before-seen content. Since the input data was split into sentences, we stop and reset after each end-of-sentence token.

The initial state of each sentence is empty; at each step a new token is sampled from the current state's probability distribution over next tokens.

The newly generated token is then shifted into the state window and the process is repeated until an end-of-sentence token is produced.

// generate fifty sentences of text
var rand = new Random();
for (var i = 0; i < 50; i++) {
    var sb = new StringBuilder();
    // each sentence starts from an empty state window
    string prevPrev = default(string), prev = default(string), curr = default(string);
    while (true) {
        // look up the transitions for the current window and sample the next token
        var transitions = model.GetTransitions(prevPrev, prev, curr);
        var distribution = new Categorical(transitions.Select(d => Convert.ToDouble(d.Probability)).ToArray());
        var next = transitions[distribution.Sample()].NextState;

        // insert a space before word tokens, except after an apostrophe or hyphen
        if (Char.IsLetterOrDigit(next[0]) && sb.Length > 0) {
            var lastChar = sb[sb.Length - 1];
            if (lastChar != '\'' && lastChar != '-')
                sb.Append(' ');
        }
        sb.Append(next);

        if (SimpleTokeniser.IsEndOfSentence(next))
            break;

        // shift the window along by one token and continue
        prevPrev = prev;
        prev = curr;
        curr = next;
    }
    Console.WriteLine(sb.ToString());
}

Selected Output

She disregarded this, possibly rather resented it, for she switched back to the table with her eyes shut so tight that blue moons formed and revolved against backgrounds of deepest mauve, Anthony staring blindly into the sunny street.
Now a male roaming the world in this condition is as helpless as a lion without teeth, and in contrast to the rather portentous character of his bedroom, was gay against the winter color of the room.
Perceiving that a certain fastidiousness would restrain her, he had acquired reticence.

Complete Source Code

View the complete source on GitHub.
