Thursday, April 23, 2009

Vector Spaces

I was at a semantic conference last year, and for me one of the standout presentations was Peter Turney talking about the application of vectors to semantics. He described vectors as a kind of analog signal, in contrast to the digital crispness of AI logic, and talked about how you can use both kinds of logic and move between the two to overcome the limitations of either one.

I've been using vector space techniques a lot since then and it turns out that they are indeed a useful tool to keep around.

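Vectors provide a means to compare multivariate data in a relative way. Sorting on one variable at a time is easy with computers, but sorting by many at once requires some thought. If you treat each item as a vector with one dimension per variable, you can compare items by how close their vectors are or how nearly they point in the same direction. Linear algebra also provides some creative techniques for manipulating multivariate data and gaining other insights into it. One example is singular value decomposition, which leads to latent semantic analysis: keeping only the largest singular values reduces the number of dimensions, and that reduction can expose structure in the data that the raw counts hide.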
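To make the dimension-reduction idea concrete, here is a minimal sketch using NumPy. The tiny term/document matrix, the term labels, and the choice of k = 2 are all made up for illustration; real latent semantic analysis would start from a much larger, weighted matrix.

```python
import numpy as np

# Hypothetical term/document counts: rows are terms, columns are documents.
# Documents 0 and 1 are about vectors, documents 2 and 3 are about baking.
A = np.array([
    [2, 1, 0, 0],   # "vector"
    [1, 2, 0, 0],   # "matrix"
    [1, 1, 0, 0],   # "cosine"
    [0, 0, 2, 1],   # "recipe"
    [0, 0, 1, 2],   # "oven"
], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, with singular values sorted largest first.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent dimensions to keep (an arbitrary choice here)

# Each column of doc_coords is one document in the reduced k-dimensional space.
doc_coords = np.diag(s[:k]) @ Vt[:k, :]


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


sim_original = cosine(A[:, 0], A[:, 1])                         # raw counts
sim_same_topic = cosine(doc_coords[:, 0], doc_coords[:, 1])     # reduced space
sim_cross_topic = cosine(doc_coords[:, 0], doc_coords[:, 2])    # reduced space

print(sim_original, sim_same_topic, sim_cross_topic)
```

In this toy case the two documents on the same topic end up pointing in essentially the same direction once the smaller singular values are dropped, while documents on different topics stay orthogonal. With real data the effect is less clean, and choosing k is a judgment call.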

The other great thing about vector spaces is how imaginative you can be with how you use them. They lend themselves to almost any comparison where the things being compared can be described by numbers. All you need to do is put your data into a matrix and normalise it intelligently, and you can derive patterns and relationships from that data that you wouldn't have seen otherwise.

I started playing around with vector spaces using term/document (or document/term) matrices. These are generated by counting the words in a list of documents and putting the counts of each word per document in a row or column of your matrix. The counts are then weighted with term frequency/inverse document frequency (tf-idf). The term frequency part is typically scaled by document length, so that long documents don't rank higher than shorter documents (with smaller word counts), and the inverse document frequency part plays down words that appear in most of the documents. At that point we can start comparing documents based on the cosine similarity of the document vectors.
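The whole pipeline described here (count words, weight with tf-idf, compare with cosine similarity) can be sketched in plain Python. The three-document corpus is invented for illustration, and the weighting used, term count divided by document length times log(N / document frequency), is only one of several common tf-idf variants, so treat it as an assumption rather than the definitive formula.

```python
import math
from collections import Counter

# Hypothetical toy corpus, keyed by document name.
docs = {
    "doc1": "vectors compare documents using vectors",
    "doc2": "vectors compare documents",
    "doc3": "bread recipes need a hot oven",
}


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (a dict of term -> weight) per document."""
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    n_docs = len(counts)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for c in counts.values() for term in c)
    vectors = {}
    for name, c in counts.items():
        total = sum(c.values())  # document length in words
        vectors[name] = {
            term: (n / total) * math.log(n_docs / df[term])
            for term, n in c.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a vector of all zeros has no direction to compare
    return dot / (norm_u * norm_v)


vectors = tfidf_vectors(docs)
sim_12 = cosine(vectors["doc1"], vectors["doc2"])
sim_13 = cosine(vectors["doc1"], vectors["doc3"])
print(sim_12, sim_13)
```

Here doc1 and doc2 share vocabulary and score well above zero, while doc1 and doc3 share no words and score exactly zero. Note that with this variant a word appearing in every document gets a weight of zero, which is the intended effect of the inverse document frequency.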
