Bright Wire

Using different recurrent neural network architectures (one to many, many to one and sequence to sequence) with Long Short Term Memory (LSTM) to classify sequential inputs

Motivation

For machine learning tasks that involve classifying sequences of data, there might not be a one to one mapping between inputs and output classifications. For example, when building neural networks that learn to translate between different languages, it is unlikely that every translated sentence will contain the same number of words as its input sentence.

One way to let the output length vary independently of the input length is with a sequence to sequence (STS) recurrent neural network. Here the network is divided into two separate classifiers, one called the encoder and the other the decoder.

The encoder is tasked with learning to generate a single embedding that effectively summarises the input, and the decoder is tasked with learning to generate an output sequence from that single embedding.

So a sequence to sequence architecture is actually a combination of two other architectures, one to many and many to one, each of which is useful on its own in other learning tasks.

One to many, many to one and sequence to sequence recurrent neural networks

Long Short Term Memory

Long Short Term Memory (LSTM) networks are a type of recurrent neural network that can be used in STS networks. They are similar to Gated Recurrent Units (GRUs) but have an extra memory state buffer and an extra gate, which gives them more parameters and hence a longer training time. While performance is usually comparable to a GRU, there are some tasks on which each architecture outperforms the other, so comparing both on a given learning task is usually a good idea.

LSTM networks are implemented with a set of gates that control how the memory state and layer output are updated at each step in the sequence.
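These follow the standard LSTM formulation, where σ is the logistic sigmoid, ⊙ is element-wise multiplication, x_t is the input at step t and h_t is the layer's output at step t (Bright Wire's parameter names may differ slightly from this notation):

f_t = σ(W_f x_t + U_f h_(t-1) + b_f)        (forget gate)
i_t = σ(W_i x_t + U_i h_(t-1) + b_i)        (input gate)
o_t = σ(W_o x_t + U_o h_(t-1) + b_o)        (output gate)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ tanh(W_c x_t + U_c h_(t-1) + b_c)        (memory state)
h_t = o_t ⊙ tanh(c_t)        (layer output)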

Since Bright Wire is a graph oriented machine learning framework, the LSTM network is itself implemented as a graph of these operations.

As with a GRU, the memory buffer is updated with the layer's output at each step in the sequence, and this saved output then flows into the next step in the sequence. Unlike a GRU, LSTM networks have an additional memory state that is updated in the same way after each pass through the network.

Many to One

The first part of an STS architecture is the encoder, which learns how to encode the most relevant parts of an input sequence into a single embedding. While it is used within STS, this architecture is also useful on its own.

For example, when comparing documents for similarity, one approach is to create a document embedding for each document in a set of documents and then use a distance metric such as Euclidean distance to compare documents to each other. A many to one recurrent neural network is one way to obtain such document embeddings.
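As a minimal sketch of that comparison step (plain C#, with the embeddings assumed to be fixed length float arrays produced by the encoder):

// Euclidean distance between two document embeddings of equal length;
// a smaller distance indicates more similar documents
static float EuclideanDistance(float[] embedding1, float[] embedding2)
{
    var sum = 0f;
    for (var i = 0; i < embedding1.Length; i++) {
        var difference = embedding1[i] - embedding2[i];
        sum += difference * difference;
    }
    return (float)System.Math.Sqrt(sum);
}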

Another example would be classifying sentences as having either positive or negative sentiment. The single output label "positive" might apply to an entire sentence (which is composed of a sequence of words).

In Bright Wire, a many to one training set is a data table with a matrix input column and a vector output column. The matrix contains the sequence of input vectors and the output contains the target output vector.

To keep things simple in this tutorial, our input matrix will contain a sequence of five one-hot encoded characters and the output vector will contain the union of those characters. So, for example, if our sequence is {"A", "B", "D"}, the first vector will contain "A", the second "A" and "B", and the third "A", "B" and "D"; the output vector will be the union "ABD". In this case the encoder just needs to keep track of all the characters it has seen and write them into the output vector.
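As a concrete illustration (hypothetical values, assuming the one-hot positions 0 to 9 map to the characters "A" through "J"), the encoded rows and target vector for {"A", "B", "D"} would look like this:

// hypothetical one-hot layout: index 0 = "A", 1 = "B", 2 = "C", 3 = "D", ...
var row1 = new float[] { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 };    // "A" seen
var row2 = new float[] { 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };    // "A" and "B" seen
var row3 = new float[] { 1, 1, 0, 1, 0, 0, 0, 0, 0, 0 };    // "A", "B" and "D" seen

// the target output vector is the union of all characters in the sequence ("ABD")
var target = new float[] { 1, 1, 0, 1, 0, 0, 0, 0, 0, 0 };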

We can create training and test data sets with a SequenceClassification helper class in Bright Wire. Our dictionary size (the number of possible characters) is 10, and each generated sequence is of length 5. Also, no character will repeat in any generated sequence. 1000 sequences are generated and split into training and test sets.

var grammar = new SequenceClassification(dictionarySize: 10, minSize: 5, maxSize: 5, noRepeat: true, isStochastic: false);
var sequences = grammar.GenerateSequences().Take(1000).ToList();
var builder = BrightWireProvider.CreateDataTableBuilder();
builder.AddColumn(ColumnType.Matrix, "Sequence");
builder.AddColumn(ColumnType.Vector, "Summary");

foreach (var sequence in sequences) {
    var list = new List<FloatVector>();
    var charSet = new HashSet<char>();
    foreach (var ch in sequence) {
        charSet.Add(ch);
        // each row encodes the set of characters seen so far in the sequence
        var row = grammar.Encode(charSet.Select(ch2 => (ch2, 1f)));
        list.Add(row);
    }
    // input: the sequence of encoded rows; output: the union of all characters seen
    builder.Add(FloatMatrix.Create(list.ToArray()), list.Last());
}
var data = builder.Build().Split(0);

Next, the LSTM network is trained for 10 epochs using RMSProp gradient descent and a hidden memory size of 128. The binary classification error metric rounds outputs to either 0 or 1 before comparing them against the target.
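As a rough sketch of the idea behind that metric (illustrative only, not Bright Wire's actual implementation):

// round each output to 0 or 1 and score it against the corresponding target value
static float BinaryAccuracy(float[] output, float[] target)
{
    var correct = 0;
    for (var i = 0; i < output.Length; i++) {
        var rounded = output[i] >= 0.5f ? 1f : 0f;
        if (rounded == target[i])
            correct++;
    }
    return (float)correct / output.Length;
}

With the data and error metric in place, the training code is: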

using (var lap = BrightWireProvider.CreateLinearAlgebra(false)) {
    var graph = new GraphFactory(lap);
    var errorMetric = graph.ErrorMetric.BinaryClassification;

    // create the property set
    var propertySet = graph.CurrentPropertySet
        .Use(graph.GradientDescent.RmsProp)
        .Use(graph.WeightInitialisation.Xavier)
    ;

    // create the engine
    var trainingData = graph.CreateDataSource(data.Training);
    var testData = trainingData.CloneWith(data.Test);
    var engine = graph.CreateTrainingEngine(trainingData, 0.03f, 8);

    // build the network
    const int HIDDEN_LAYER_SIZE = 128;
    var memory = new float[HIDDEN_LAYER_SIZE];
    var network = graph.Connect(engine)
        .AddLstm(memory)
        .AddFeedForward(engine.DataSource.OutputSize)
        .Add(graph.SigmoidActivation())
        .AddBackpropagationThroughTime(errorMetric)
    ;

    engine.Train(10, testData, errorMetric);

    var networkGraph = engine.Graph;
    var executionEngine = graph.CreateEngine(networkGraph);

    var output = executionEngine.Execute(testData);
    Console.WriteLine(output.Where(o => o.Target != null).Average(o => o.CalculateError(errorMetric)));
}

Since this is such a simple learning task, the network very quickly reaches 100% accuracy.

One to Many

One to many neural networks generate a sequence of output from a single input vector. An example of this architecture might be learning to generate a sequence of commands after observing the current state of a system.

A one to many training data set is created with a data table that contains a vector input column and a matrix output column. Along the same lines as the many to one example above, the following code creates a vector that summarises a sequence, together with the sequence itself encoded as one-hot vectors.

var grammar = new SequenceClassification(dictionarySize: 10, minSize: 5, maxSize: 5, noRepeat: true, isStochastic: false);
var sequences = grammar.GenerateSequences().Take(1000).ToList();
var builder = BrightWireProvider.CreateDataTableBuilder();
builder.AddColumn(ColumnType.Vector, "Summary");
builder.AddColumn(ColumnType.Matrix, "Sequence");

foreach (var sequence in sequences) {
    // count the occurrences of each character in the sequence
    var sequenceData = sequence
        .GroupBy(ch => ch)
        .Select(g => (g.Key, g.Count()))
        .ToDictionary(d => d.Item1, d => (float)d.Item2)
    ;
    // the input vector summarises the whole sequence
    var summary = grammar.Encode(sequenceData.Select(kv => (kv.Key, kv.Value)));
    var list = new List<FloatVector>();
    // the output matrix contains one row per character, in character order
    foreach (var item in sequenceData.OrderBy(kv => kv.Key)) {
        var row = grammar.Encode(item.Key, item.Value);
        list.Add(row);
    }
    builder.Add(summary, FloatMatrix.Create(list.ToArray()));
}
var data = builder.Build().Split(0);

In this case it's a harder problem than the many to one, as the neural network needs to learn the correct order of each output character (an "A" before a "D", and so on). It takes around 50 epochs to converge and reaches 99.95% accuracy.

In this case the training rate is reduced at the 30th epoch to try to improve the accuracy in the later stages of training.

using (var lap = BrightWireProvider.CreateLinearAlgebra(false)) {
    var graph = new GraphFactory(lap);
    var errorMetric = graph.ErrorMetric.BinaryClassification;

    // create the property set
    var propertySet = graph.CurrentPropertySet
        .Use(graph.GradientDescent.RmsProp)
        .Use(graph.WeightInitialisation.Xavier)
    ;

    // create the engine
    const float TRAINING_RATE = 0.1f;
    var trainingData = graph.CreateDataSource(data.Training);
    var testData = trainingData.CloneWith(data.Test);
    var engine = graph.CreateTrainingEngine(trainingData, TRAINING_RATE, 8);
    engine.LearningContext.ScheduleLearningRate(30, TRAINING_RATE / 3);

    // build the network
    const int HIDDEN_LAYER_SIZE = 128;
    var memory = new float[HIDDEN_LAYER_SIZE];
    var network = graph.Connect(engine)
        .AddLstm(memory)
        .AddFeedForward(engine.DataSource.OutputSize)
        .Add(graph.SigmoidActivation())
        .AddBackpropagation(errorMetric)
    ;

    engine.Train(40, testData, errorMetric);

    var networkGraph = engine.Graph;
    var executionEngine = graph.CreateEngine(networkGraph);

    var output = executionEngine.Execute(testData);
    Console.WriteLine(output.Average(o => o.CalculateError(errorMetric)));
}

Sequence to Sequence

The simplest type of STS network is a recurrent autoencoder. In this scenario, the encoder learns to encode an input sequence into an embedding and the decoder learns to decode that embedding back into the same sequence.

In a recurrent autoencoder the input and output sequence lengths are necessarily the same, but the point is to exploit the encoder's ability to find the relevant discriminative features of the input as it creates the single embedding from the input sequence.

Once the network has converged, we can throw the decoder away and use the encoder to create sequence embeddings. This is how we might build a single embedding from a sequence of words (the document) for the purposes of document comparison.

But purely to demonstrate that the input and output sequences can in fact differ, in this tutorial we discard the last item of each sequence when training the decoder, so the output sequence will always be one item shorter than the input sequence.

An STS data set in Bright Wire is a data table with two matrix columns. Each matrix contains rows that form the input or output sequence. For sequence to sequence, the two matrices do not need to contain the same number of rows.

As STS networks have been shown to perform better with reversed output, each input sequence is reversed (and truncated) to form the decoder's target output.

const int SEQUENCE_LENGTH = 5;
var grammar = new SequenceClassification(8, SEQUENCE_LENGTH, SEQUENCE_LENGTH, true, false);
var sequences = grammar.GenerateSequences().Take(2000).ToList();
var builder = BrightWireProvider.CreateDataTableBuilder();
builder.AddColumn(ColumnType.Matrix, "Input");
builder.AddColumn(ColumnType.Matrix, "Output");

foreach (var sequence in sequences) {
    var encodedSequence = grammar.Encode(sequence);
    // the decoder's target is the reversed input sequence with the last item discarded
    var reversedSequence = new FloatMatrix {
        Row = encodedSequence.Row.Reverse().Take(SEQUENCE_LENGTH - 1).ToArray()
    };
    builder.Add(encodedSequence, reversedSequence);
}
var data = builder.Build().Split(0);

In Bright Wire, the encoder and decoder are defined in two separate graphs that are stitched together to create the sequence to sequence architecture. The input sequence is executed by the first graph, then the single output vector is passed to the second graph. The second graph generates its output sequence and then backpropagates its error back into the first graph.

To give the decoder more data to work with, it's possible to combine the encoder's internal memory buffer with the encoder's output. This is done by writing the encoder's memory state to a named memory slot on every iteration and then joining that memory with the encoder's output data in the decoder.

The network configuration is much the same as the one to many and many to one networks above.

using (var lap = BrightWireProvider.CreateLinearAlgebra()) {
    var graph = new GraphFactory(lap);
    var errorMetric = graph.ErrorMetric.BinaryClassification;

    // create the property set
    var propertySet = graph.CurrentPropertySet
        .Use(graph.GradientDescent.RmsProp)
        .Use(graph.WeightInitialisation.Xavier)
    ;

    const int BATCH_SIZE = 16;
    const int HIDDEN_LAYER_SIZE = 64;
    const float TRAINING_RATE = 0.1f;

    // create the encoder
    var encoderLearningContext = graph.CreateLearningContext(TRAINING_RATE, BATCH_SIZE, TrainingErrorCalculation.Fast, true);
    var encoderMemory = new float[HIDDEN_LAYER_SIZE];
    var trainingData = graph.CreateDataSource(data.Training, encoderLearningContext, wb => wb
        .AddLstm(encoderMemory, "encoder")
        .WriteNodeMemoryToSlot("shared-memory", wb.Find("encoder") as IHaveMemoryNode)
        .AddFeedForward(grammar.DictionarySize)
        .Add(graph.SigmoidActivation())
        .AddBackpropagationThroughTime(errorMetric)
    );
    var testData = trainingData.CloneWith(data.Test);

    // create the engine
    var engine = graph.CreateTrainingEngine(trainingData, TRAINING_RATE, BATCH_SIZE);
    engine.LearningContext.ScheduleLearningRate(30, TRAINING_RATE / 3);
    engine.LearningContext.ScheduleLearningRate(40, TRAINING_RATE / 9);

    // create the decoder
    var decoderMemory = new float[HIDDEN_LAYER_SIZE];
    var wb2 = graph.Connect(engine);
    wb2
        .JoinInputWithMemory("shared-memory")
        .IncrementSizeBy(HIDDEN_LAYER_SIZE)
        .AddLstm(decoderMemory, "decoder")
        .AddFeedForward(trainingData.OutputSize)
        .Add(graph.SigmoidActivation())
        .AddBackpropagationThroughTime(errorMetric)
    ;

    engine.Train(50, testData, errorMetric);
}

The network reaches a final accuracy of around 99.5% after 50 epochs.

Complete Source Code

View the complete source on GitHub.
