Bright Wire

Building a Markov Model from source text and using it to generate new text.

Motivation

Markov Models can be used to create n-gram based language models - that is, they record the probability of each possible subsequent token after each n-gram.

Generated text from n-gram based language models, although not always syntactically correct, does roughly approximate the training corpus. Some of the generated text is in fact quite interesting!
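Conceptually, an n-gram language model is just a lookup table from each n-gram to a table of next-token counts (or probabilities). A minimal plain C# sketch of the idea - an illustration only, not Bright Wire's API:

using System.Collections.Generic;

// a trigram model maps each three-token state to the observed
// counts of the tokens that followed that state in the corpus
var model = new Dictionary<(string, string, string), Dictionary<string, int>>();

void Observe(string a, string b, string c, string next)
{
    if (!model.TryGetValue((a, b, c), out var counts))
        model[(a, b, c)] = counts = new Dictionary<string, int>();
    counts[next] = counts.GetValueOrDefault(next) + 1;
}

// normalising a state's counts gives the probability distribution
// from which the next token can be sampled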

Getting the Data

To train a Markov Model we need some source text. Here we load The Beautiful and Damned by F. Scott Fitzgerald.

Bright Wire includes a simple tokeniser that splits text into words, numbers and punctuation tokens, and forms sentences from those tokens.

// download the text (or load it from the local cache)
using var reader = GetStreamReader(context, "beautiful_and_damned.txt", "http://www.gutenberg.org/cache/epub/9830/pg9830.txt");
var data = reader.ReadToEnd();

// skip the Project Gutenberg header - the novel starts at "CHAPTER I"
var pos = data.IndexOf("CHAPTER I", StringComparison.Ordinal);
var mainText = data[(pos + 9)..].Trim();

// tokenise the text, group the tokens into sentences and index them in a sentence table
return new SentenceTable(context, SimpleTokeniser.FindSentences(SimpleTokeniser.Tokenise(mainText)));
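To get a feel for the tokeniser's output, here is a small sketch (assuming Tokenise yields string tokens and FindSentences groups them into per-sentence sequences):

var tokens = SimpleTokeniser.Tokenise("It was a dark night. The wind howled.");
foreach (var sentence in SimpleTokeniser.FindSentences(tokens))
    Console.WriteLine(string.Join(" ", sentence));

// expected output, assuming punctuation becomes its own token:
// It was a dark night .
// The wind howled .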

Training the Model

When training a Markov Model we need to choose the window size - the number of preceding tokens that form the state from which each probability table of subsequent tokens is built. For this example we are using n-grams of size 3.
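For instance, with a window of size 3 the sentence "he walked to the window ." contributes observations such as the following (an illustrative example, not actual output):

(he, walked, to) -> the
(walked, to, the) -> window
(to, the, window) -> .

The first tokens of each sentence are reached via states padded with the empty token, which is also the starting state when generating text.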

// create a markov trainer that uses a window of size 3 (_empty represents the empty initial state)
var context = _sentenceTable.Context;
var trainer = context.CreateMarkovTrainer3(_empty);

// add each tokenised sentence from the first column of the sentence table
foreach(var sentence in _sentenceTable.Column<IndexList>(0).EnumerateTyped())
    trainer.Add(sentence.Indices);

// build the model and optionally write some generated sentences
var ret = trainer.Build();
if (writeResults) {
    foreach(var sentence in GenerateText(ret))
        Console.WriteLine(sentence);
}

return ret;

Generating Text

Now that we have a model we can generate some text - fifty sentences of never-before-seen content by default. Since the input data was split into sentences, we stop and reset after each end-of-sentence token.

The initial state of each sentence is empty; a new token is sampled from that state's probability distribution over subsequent tokens.

The newly generated token is then shifted into the current state window and the process repeats until an end-of-sentence token is produced.

public IEnumerable<string> GenerateText(MarkovModel3<uint> model, int count = 50)
{
    var context = _sentenceTable.Context;
    var table = model.AsDictionary;

    for (var i = 0; i < count; i++) {
        var sb = new StringBuilder();

        // start each sentence from the empty state
        uint prevPrev = default, prev = default, curr = default;
        for (var j = 0; j < 1024; j++) {
            // find the probability table of subsequent tokens for the current window
            var transitions = table.GetTransitions(prevPrev, prev, curr);
            if (transitions == null)
                break;

            // sample the next token and append it to the sentence
            var distribution = context.CreateCategoricalDistribution(transitions.Select(d => d.Probability));
            var next = Append(transitions[distribution.Sample()].NextState, sb);

            // stop when an end-of-sentence token is generated
            if (SimpleTokeniser.IsEndOfSentence(next.String))
                break;

            // shift the window forward
            prevPrev = prev;
            prev = curr;
            curr = next.Index;
        }
        yield return sb.ToString();
    }
}
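The Append helper is not shown here (it is in the complete source below); a minimal sketch of what it does, assuming a hypothetical _strings lookup from token index back to its string:

// hypothetical sketch: resolve the sampled token index to its string,
// append it to the sentence buffer and return both values
(string String, uint Index) Append(uint index, StringBuilder sb)
{
    var str = _strings[index]; // assumed index-to-string table built while loading the data
    if (sb.Length > 0)
        sb.Append(' ');
    sb.Append(str);
    return (str, index);
}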

Selected Output

Halcyon days like boats drifting along slow-moving rivers; spring evenings full of a heavenly glamour, was gay against the winter color of the room; he walked cautiously another step nearer the window with a sudden harsh snap.
Now a male roaming the world in this condition is as helpless as a lion without teeth, and in contrast to the rather portentous character of his bedroom, was gay against the winter color of the room.
Perceiving that a certain fastidiousness would restrain her, he had acquired reticence.

Complete Source Code

View the complete source on GitHub
