Text Clustering Four Ways

Motivation

Clustering is an example of unsupervised learning - we don't tell the computer what we expect it to do, rather we tell it to look for groups or patterns and then we sit back and wait to be pleasantly surprised by what it finds.

This tutorial shows four separate techniques for clustering text with Bright Wire

K-Means
NNMF
K-Means with Random Projections (or Random Indexing)
K-Means with Latent Semantic Analysis

The data that we want to cluster is a list of accepted papers to an AI conference. The data set is useful as it contains multiple levels of keywords. We can use some of the sets of keywords to cluster and then use the unseen set of keywords to evaluate the clustering.

Getting Started

First, download the data set to your computer, create a new console application and add a reference to Bright Wire.

Loading the Data

The data is parsed from the CSV file into a data table and then converted to a list of strongly typed AAAIDocuments.

IDataTable dataTable;
using (var reader = new StreamReader(dataFilePath)) {
    dataTable = reader.ParseCSV();
}

var KEYWORD_SPLIT = " \n".ToCharArray();
var TOPIC_SPLIT = "\n".ToCharArray();

var docList = new List<AAAIDocument>();
dataTable.ForEach(row => docList.Add(new AAAIDocument {
    Abstract = row.GetField<string>(5),
    Keyword = row.GetField<string>(3).Split(KEYWORD_SPLIT, StringSplitOptions.RemoveEmptyEntries).Select(str => str.ToLower()).ToArray(),
    Topic = row.GetField<string>(4).Split(TOPIC_SPLIT, StringSplitOptions.RemoveEmptyEntries),
    Group = row.GetField<string>(2).Split(TOPIC_SPLIT, StringSplitOptions.RemoveEmptyEntries),
    Title = row.GetField<string>(0)
}));

The AAAIDocument class looks like:

class AAAIDocument
{
    /// <summary>
    /// Free text description of the document
    /// </summary>
    public string Title { get; set; }

    /// <summary>
    /// Free text; author-generated keywords
    /// </summary>
    public string[] Keyword { get; set; }

    /// <summary>
    /// Free text; author-selected, low-level keywords
    /// </summary>
    public string[] Topic { get; set; }

    /// <summary>
    /// Free text; paper abstracts
    /// </summary>
    public string Abstract { get; set; }

    /// <summary>
    /// Categorical; author-selected, high-level keyword(s)
    /// </summary>
    public string[] Group { get; set; }
}

Converting AAAIDocuments to Vectors

Once again we can use the StringTableBuilder that maps each unique string to a unique string index. Each document then becomes a normalised vector with the weight of each entry proportional to the count of that string index within the document.

The sparse vectors are converted to dense vectors with the call to Vectorise.

var stringTable = new StringTableBuilder();
var classificationSet = docList.Select(d => d.AsClassification(stringTable)).ToArray();
var encodings = classificationSet.Vectorise();

K-Means

K-means clustering tries to partition the set of vectors into k randomly initialized clusters. Because of the random initialisation there are no guarantees as to the final outcome and the result will change each time.

Console.Write("Kmeans clustering...");
var outputPath = GetOutputPath("kmeans");
WriteClusters(outputPath, _vectors.KMeans(_context, _groupCount), _documentTable);
Console.WriteLine($"written to {outputPath}");

Non Negative Matrix Factorisation

Non Negative Matrix Factorisation (NNMF) is a technique that is only applicable to data that is uniformly positive, such as our counts of the document strings. Again, results will change each time but a cursory examination of the clustering results shows that it seems to do a better job on the data-set than k-means. The clusters are tighter and more interesting.

Console.Write("NNMF  clustering...");
var outputPath = GetOutputPath("nnmf");
WriteClusters(outputPath, _vectors.Nnmf(_context.LinearAlgebraProvider, _groupCount), _documentTable);
Console.WriteLine($"written to {outputPath}");

K-Means with Random Projection

Our document vectors are sparse. They contain space for every possible string in the vocabulary but each document will only use a small subset of those entries - the remaining entries will be zero. We can reduce the dimensions to improve computational performance while still preserving the significant information by randomly projecting the document vectors.

Although the projected document vectors are now length 512 (from around 1500) the result is much the same as the initial k-means clustering (while reducing the clustering computation by two thirds).

var lap = _context.LinearAlgebraProvider;
// create a term/document matrix with terms as columns and documents as rows
var matrix = lap.CreateMatrixFromRows(_vectors);

Console.Write("Creating random projection...");
var outputPath = GetOutputPath("projected-kmeans");
using var randomProjection = lap.CreateRandomProjection(_stringTable.Size + 1, 512);
using var projectedMatrix = randomProjection.Compute(matrix);
var vectorList2 = projectedMatrix.RowCount.AsRange().Select(i => projectedMatrix.Row(i)).ToList();
var lookupTable2 = vectorList2.Select((v, i) => Tuple.Create(v, _vectors[i])).ToDictionary(d => d.Item1, d => _documentTable[d.Item2]);
Console.Write("done...");

Console.Write("Kmeans clustering of random projection...");
WriteClusters(outputPath, vectorList2.KMeans(_context, _groupCount), lookupTable2);
vectorList2.ForEach(v => v.Dispose());
Console.WriteLine($"written to {outputPath}");

K-Means with Latent Semantic Analysis

Another technique for reducing the dimensions is to try to discard the least significant information (with the assumption that there is some noise in every data-set and we will get better results by removing it).

Latent Semantic Analysis (LSA) uses the Singular Value Decomposition of our term/document matrix to find the strongest correlated set of features and then constructs a reduced latent space in which the documents are expressed by those features.

In this case the latent document vectors are now length 256 so the k-means performance is now twice that of random projections. The final result is arguably better as well with LSA finding some interesting sets of documents that were missed by vanilla k-means.

Console.Write("Building latent term/document space...");
var outputPath = GetOutputPath("latent-kmeans");

var lap = _context.LinearAlgebraProvider;
// create a term/document matrix with terms as columns and documents as rows
var matrix = lap.CreateMatrixFromRows(_vectors);
var kIndices = k.AsRange().ToList();
var matrixT = matrix.Transpose();
matrix.Dispose();
var (_, floatVector, vt) = matrixT.Svd();
matrixT.Dispose();

var s = lap.CreateDiagonalMatrix(floatVector.AsIndexable().Values.Take((int)k).ToArray());
var v2 = vt.GetNewMatrixFromRows(kIndices);
using (var sv2 = s.Multiply(v2))
{
    v2.Dispose();
    s.Dispose();

    var vectorList3 = sv2.AsIndexable().Columns.ToList();
    var lookupTable3 = vectorList3.Select((v, i) => Tuple.Create(v, _vectors[i])).ToDictionary(d => (IFloatVector)d.Item1, d => _documentTable[d.Item2]);

    Console.WriteLine("Kmeans clustering in latent document space...");
    WriteClusters(outputPath, vectorList3.KMeans(_context, _groupCount), lookupTable3);
}

Console.WriteLine($"written to {outputPath}");

Conclusion

We haven't formally evaluated the results in this tutorial but a cursory examination of the four sets of results shows that NNMF is well suited to text clustering, while K-means in its three variants gives good but somewhat varied results.

In a real world application we might be more interested in the usefulness of the final clusters rather than the purity of the clusters themselves - an evaluation criteria that is specific to each context.

Complete Source Code

View the complete source on GitHub.