Tuesday, October 11, 2011

Winter Blues Good for Ask Bluey

It's been a long winter here in Sydney. Just as spring seemed certain to arrive, winter instead resumed with a vengeance - biting southerly winds, temperatures down to (gasp) 7 degrees at night, and otherwise all-round misery. It's back to Ugg boots and heavy winter doonas and hot chai lattes. Even MasterChef is back on TV – that unfortunate dreary bastion of cold winter nights.

However, it has also meant that I have had plenty of time for Ask Bluey, and I have been making good progress on getting the second beta version out the door. This version has been about polish, simplification and stability, and I think these goals have been met.

If you haven’t visited in a while then feel free to take a look – after all, who doesn’t want their own private search engine, scouring the internet 24/7 for personally relevant pages?

Friday, August 5, 2011

Data Storage, Caching and Retrieval in Azure

Both web and worker roles have a variety of places to store data in the Azure cloudscape. Since processing large amounts of data is likely the raison d’être of cloud applications in the first place, understanding all the possible places that data might be stored (and their trade-offs) is of some benefit when designing cloud applications.

What follows is more of a personal brain dump based on my experience with Azure. I don’t include performance metrics because I expect them to change continually and to vary with the application in question. I recommend collecting performance metrics in the design phase to give further weight to your own design choices.

Local Memory (per Instance)

Obviously the fastest place to cache your data, but sadly quite limited in size, depending on the size of the Azure instances that you choose (and pay for).

Also sadly not shareable between multiple instances running in the same web role.  Storing session state on one machine helps you very little when the next request lands on the next machine in your web farm.

Local Storage (per Instance)

Storing to disk has exactly the benefits and trade-offs that one might expect – far greater capacity than local memory but far slower reads and writes. Where possible, asynchronous writes can ameliorate some of the performance penalty. Again, not shareable between instances.

SQL Azure

Since web roles with multiple instances often do need a place to share session data, the database represents a logical place.  The ACID properties of a relational database mean that even when individual instances of your application die and are recycled, the user should see no difference in their session from the remaining instances that are still serving requests.

However, the size constraints of SQL Azure databases (and the expense involved) mean that not every piece of data should simply be bundled into the database. Large blobs of data are certainly a good candidate for blob storage, but their timestamps and other meta information might be stored in the database to speed up retrieval and to integrate with stored procedure logic.

Blob Storage

Blob storage is the cheapest way to store large amounts of data “in the cloud”, although the current offering is not especially compelling performance-wise. Since you pay per transaction, it is also a poor place to put things that you will be accessing frequently.

Blob storage is like the “stack” at your local library – infrequently accessed things can be stored at low cost for long periods of time, and brought into the forward caches as appropriate.

Windows Azure AppFabric Caching

An alternative place to share session state between instances, AppFabric caching sounds good in theory but suffers a little in practice. The idea is that it represents a high speed storage location that is easily shared between instances. The reality is that it will shut you off if you use it too much, and the performance does not seem especially compelling. It is, however, considerably faster than blob storage.

Putting it all together

A well designed Azure application will take into account all the strengths and weaknesses of the storage options available and work with each accordingly.

Often data seems to move like a waterfall across the different tiers. When an application needs some particular data, it might first check with a centralised and reliable store such as the SQL database as to where that data is located and its latest timestamp. Then it might check its various caches for the blob’s handle and timestamp – first local memory, then local storage, then the AppFabric cache and finally blob storage. When it finds its data it might refresh the other (empty or invalidated) caches that failed it along the way.
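
A minimal sketch of that waterfall, under some loud assumptions: ICacheTier, DictionaryTier and TieredLookup are hypothetical names invented for illustration, and every tier here is just an in-memory dictionary standing in for the real store. The "latest" timestamp is assumed to have come from the SQL database already.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical abstraction over one tier (local memory, local storage, AppFabric, blob).
public interface ICacheTier
{
    bool TryGet(string key, DateTime latest, out byte[] data);
    void Put(string key, DateTime timestamp, byte[] data);
}

// In-memory stand-in for a tier; a real tier would wrap the actual store.
public class DictionaryTier : ICacheTier
{
    private readonly Dictionary<string, KeyValuePair<DateTime, byte[]>> _items =
        new Dictionary<string, KeyValuePair<DateTime, byte[]>>();

    public bool TryGet(string key, DateTime latest, out byte[] data)
    {
        KeyValuePair<DateTime, byte[]> entry;
        // A hit only counts if the copy is at least as new as the authoritative timestamp.
        if (_items.TryGetValue(key, out entry) && entry.Key >= latest)
        {
            data = entry.Value;
            return true;
        }
        data = null;
        return false;
    }

    public void Put(string key, DateTime timestamp, byte[] data)
    {
        _items[key] = new KeyValuePair<DateTime, byte[]>(timestamp, data);
    }
}

public static class TieredLookup
{
    // Tiers are ordered fastest first; the last one is assumed to be blob storage.
    public static byte[] Get(IList<ICacheTier> tiers, string key, DateTime latest)
    {
        for (int i = 0; i < tiers.Count; i++)
        {
            byte[] data;
            if (!tiers[i].TryGet(key, latest, out data)) continue;

            // Refresh the faster tiers that missed (or were stale) on the way down.
            for (int j = 0; j < i; j++) tiers[j].Put(key, latest, data);
            return data;
        }
        return null; // not found in any tier
    }
}
```

The first lookup for a key pays the full price of the slowest tier that has it; later lookups on the same instance should stop at local memory until the timestamp in the database moves on.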

Other times a backend process might invalidate some piece of non-SQL data, update both blob and cache storages and then send a message to each role instance that the data has changed.  Each role instance in turn can then query the cache service for the data, and fall back on blob storage if the cache drops the ball.  The role instance might store that data in local memory and local storage and simply assume that the data remains valid until a notification to the contrary is received.
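
The notification-driven variant might look something like the sketch below. NotifiedCache and its two delegates are hypothetical stand-ins (one for the cache service, one for blob storage), and how the notification actually reaches each instance is left out entirely.

```csharp
using System;
using System.Collections.Generic;

public class NotifiedCache
{
    private readonly object _sync = new object();
    private readonly Dictionary<string, byte[]> _local = new Dictionary<string, byte[]>();
    private readonly Func<string, byte[]> _cacheService; // stand-in for the shared cache
    private readonly Func<string, byte[]> _blobStorage;  // stand-in for blob storage

    public NotifiedCache(Func<string, byte[]> cacheService, Func<string, byte[]> blobStorage)
    {
        _cacheService = cacheService;
        _blobStorage = blobStorage;
    }

    // Called when the backend announces that the data behind a key has changed.
    public void OnDataChanged(string key)
    {
        // Ask the cache service first; fall back on blob storage if it returns nothing.
        byte[] fresh = _cacheService(key) ?? _blobStorage(key);
        lock (_sync)
        {
            if (fresh != null) _local[key] = fresh;
            else _local.Remove(key);
        }
    }

    // Served from local memory and assumed valid until the next notification.
    public byte[] Get(string key)
    {
        lock (_sync)
        {
            byte[] data;
            return _local.TryGetValue(key, out data) ? data : null;
        }
    }
}
```

The trade-off is that reads never touch the network, at the price of trusting that every notification arrives.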

But as with all advice, your mileage will vary ; )

Getting the Redirected Url from a WebClient

WebClient in .NET is a handy class for network access, but sometimes you need things that it doesn’t expose “out of the box”.  Fortunately it’s quite simple to extend.

One of the handy features of the class is that it automatically handles redirects on your behalf. Suppose that you make a request to some URI that returns an HTTP status code of 301 – if the AllowAutoRedirect property on the underlying HttpWebRequest is set (which it is by default), it will automatically parse the HTTP response and make a subsequent call to the redirected URL all by itself. Nice.

In my case however I wanted to find out the actual URL that it had redirected to, so I extended WebClient as below to always save the Uri that it was last getting a response from. This Uri would be different from the request Uri if the server had returned an HTTP status code indicating redirection.

using System;
using System.Net;

public class MyWebClient : WebClient
{
    Uri _responseUri;

    // The Uri that the last response actually came from (after any redirects).
    public Uri ResponseUri
    {
        get { return _responseUri; }
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);

        // Only HTTP(S) requests have these settings; file:// or ftp:// requests are left alone.
        HttpWebRequest httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            httpRequest.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
            httpRequest.AllowAutoRedirect = true;
        }
        return request;
    }

    // Used by the asynchronous download methods.
    protected override WebResponse GetWebResponse(WebRequest request, IAsyncResult result)
    {
        WebResponse response = base.GetWebResponse(request, result);
        _responseUri = response.ResponseUri;
        return response;
    }

    // Used by the synchronous download methods.
    protected override WebResponse GetWebResponse(WebRequest request)
    {
        WebResponse response = base.GetWebResponse(request);
        _responseUri = response.ResponseUri;
        return response;
    }
}

Thursday, August 4, 2011

Azure Table Storage Helper Classes

Before I discovered that working with Azure’s table storage was not the be all and end all that I had once thought (see previous post), I put together some base classes that implement the key recommendations from the Whitepaper for Programming Table Storage.

These classes make working with Microsoft.WindowsAzure.StorageClient a little easier and less verbose, and mean that you can use the following class to define a log table for your Azure cloud application. (You need to create the empty table first with something like Azure Storage Explorer. Note that Azure table names need to be lowercase or strange things might happen!)

The constructor defines the table name “log”, the account credentials to use (here as a static variable ABB), and whether to use https in connecting to the table (in this case false).

The second part of the class definition defines the structure of the table row, including the partition and row keys. In this case the partition key is the type of log message (error, warning, etc) and the row key is the timestamp of the message (in descending order). The message column contains the actual log message.

Note that the partition key and row key are always required for each table and together form the primary key for that row.
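
The "descending order" trick deserves a quick illustration. Table storage returns rows sorted by partition key and then row key as strings, so subtracting the timestamp from DateTime.MaxValue makes newer messages sort first. A small sketch (LogKeys and DescendingRowKey are names invented here; the "d19" padding is an extra precaution that keeps every key the same length so that string order matches numeric order):

```csharp
using System;

public static class LogKeys
{
    // Newer timestamps give smaller numbers, and therefore sort earlier as strings.
    public static string DescendingRowKey(DateTime utc)
    {
        return (DateTime.MaxValue.Ticks - utc.Ticks).ToString("d19");
    }
}
```

With this, string.CompareOrdinal(LogKeys.DescendingRowKey(newer), LogKeys.DescendingRowKey(older)) is negative whenever newer is later than older, so the most recent log message comes back first from a query.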

internal class LogTable : TableBase<LogTable.TableContext>
{
    // Table name "log", credentials from AccountCredentials.ABB, https disabled.
    public LogTable() : base((baseAddress, credentials) => new TableContext("log", baseAddress, credentials), AccountCredentials.ABB, false) { }

    // Structure of one table row.
    public class Row : TableServiceEntity
    {
        public Row(string message, LogType type)
        {
            // Partition by message type (error, warning, etc).
            PartitionKey = type.ToString();
            // Reverse timestamp so that the newest rows sort first; "d19" pads to a fixed width.
            RowKey = (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d19");
            Message = message;
        }

        // Parameterless constructor, needed when rows are read back from the table.
        public Row()
        {
        }

        public string Message { get; set; }
    }

    public class TableContext : ContextBase<Row>
    {
        public TableContext(string tableName, string baseAddress, StorageCredentials credentials) : base(tableName, baseAddress, credentials) { }
    }

    public static void AddMessage(string message, LogType type = LogType.Error)
    {
        try
        {
            var context = new LogTable().Context;
            context.Add(new Row(message, type));
            context.SaveAsync();
        }
        catch
        {
            // Failures are swallowed: a logging error is not propagated to the caller.
        }
    }
}


Adding a new log message

LogTable.AddMessage("some message");
LogTable.AddMessage("some message", LogType.Warning);

Getting a list of all log messages

var logMessages = new LogTable().ReadOnlyContext.Query.Select(r => r.Message);


Saturday, July 30, 2011

To The Cloud!

First of all, what a great job Marketing has done with The Cloud.  Nobody knows what the hell it means, but everyone seems motivated to get on the cloud, somehow, anyhow.  At the heart of the cloud is a big secret that is easily glossed over  – the cloud is nothing new at all, just more of the same - big servers, centralised access, the same ideas that computing has been using from day one (there used to be mainframes that took up an entire office suite, now modern computers live in giant warehouses out in the sticks.  That right there is 50 years of Progress).

Aside from the brief and increasingly irrelevant “PC revolution” (the idea that large clusters of small machines would be cheaper than one large mainframe – an idea carefully promulgated by Microsoft and other interested parties), computing has always been a big business kind of thing to do, with big business efficiency gains of centralization and commodification the driving forces behind the push for cloud services.

Anyway, I too drank the Kool-Aid and moved Ask Bluey to the cloud this year, from its previous home on shared hosting. Shared hosting was actually surprisingly efficient and cost effective – it was able to accommodate thousands of visitors a day, but it was most definitely grinding to a halt once you threw in web crawlers, which accounted for around half of the traffic to the website.

The promise of being able to scale out Ask Bluey as traffic ebbed and flowed was hard to resist, but the reality is that architecting for the cloud is really quite difficult to get right (although the end result is “better”, in the same way that truly cross platform C++ code feels “better” than code with implicit assumptions about bit alignment, or god forbid, non standard compiler extensions).

The first big mistake I made was buying into the NoSQL movement’s claim that the future of databases lay in non-relational databases. The idea is that you still have tables to put your structured data into, but that you have to manage the keyed links between these tables yourself (while a SQL database enforces referential integrity on your behalf). What you gain is that these tables can be partitioned across machines, which means that they scale out better. Maybe.

What I didn’t realise is that the “table storage” that the cloud providers offer is not NoSQL, but some weird hybrid of blob storage with two keys rather than one. While it would certainly be possible to use table storage instead of a SQL database, it is not a good idea for two reasons. First of all, the performance is not very good. I found that a table-based lookup took twice as long as a blob-based lookup, which in turn is an order of magnitude slower than a retrieval from local disk. When you have to retrieve from multiple tables and deal with “upsert” semantics, the performance penalties are cumulative and prohibitive.

The second reason to use SQL over table storage is that you have to pay for each transaction. Transactions are billed at one cent per 10,000, which sounds so cheap that you don’t have to think too hard about them. But they can add up! In one month my “table storage” architecture blew $250 on these seemingly insignificant transactions. Once I had moved back to SQL, the same usage pattern cost $10 per month, as SQL databases are billed on instance size rather than per transaction.
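
To put that bill in perspective, here is the arithmetic as a small sketch. It assumes the rate quoted above means one cent per 10,000 transactions and a 30-day month; TransactionMath and its methods are names made up for illustration.

```csharp
using System;

public static class TransactionMath
{
    // Number of billable transactions that a given bill implies.
    public static long TransactionsFor(decimal dollars, decimal dollarsPer10K = 0.01m)
    {
        return (long)(dollars / dollarsPer10K * 10000m);
    }

    // Average rate over a 30-day month.
    public static double PerSecond(long transactions)
    {
        const double secondsPerMonth = 30 * 24 * 60 * 60; // 2,592,000
        return transactions / secondsPerMonth;
    }
}
```

TransactionMath.TransactionsFor(250m) comes to 250,000,000 transactions, which averages out at roughly 96 per second, around the clock, for the whole month.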

Table storage still has a place in a cloud based architecture, but it only really makes sense in a few limited scenarios. An ideal usage would be an archive of tweets on a per user basis. The partition key is the user, and the row key is the date. By keeping these large wads of data out of the database the SQL instance size is kept down, and the performance penalty becomes less important because of the “archive” aspects. Obviously nobody wants their entire Twitter history every single day, but it might be useful to get once or twice a year (the most recent tweets could be in the SQL database).

In this way table storage becomes ancillary to a good old fashioned SQL database and again nothing has really changed in any aspect of computing, despite the hype. In some ways this is reassuring ; )

Monday, July 4, 2011

First post in a Long Time

This is the obligatory – wow, it’s been years since I last posted.  In the last two (2!) years I have gotten married, become the father of a rather lovely daughter, moved closer to the beach, and launched a large and ambitious  project (Ask Bluey).

I feel it’s time to pause, reflect and record - and blogging seems a perfect way to approach that.

So stay posted…

; )

Tuesday, May 19, 2009

New Article on CodeProject - Search3D

In this article I talk about how you can use Singular Value Decomposition and Latent Semantic Analysis to cluster documents and to reduce multi-dimensional data down to three dimensions, which is visualisable in a 3D viewer.