When something goes wrong in a large-scale software system, how do you diagnose it? This article provides an overview of the journey our team took in developing diagnostic logging for a large-scale Azure grid computing system.
The value of very detailed, developer-facing diagnostic logging can be debated. The main argument for it is the ability to gather as much information as possible to debug an issue, especially in situations where reproducing a bug is not an option. The main arguments against are that verbose logging decreases code readability and can hinder performance.
For the purposes of this discussion, I will take as given that a very verbose level of diagnostic logging is required. A bug should only need to be observed once, and the development team should have enough information to fix it without reproducing it.
Summary of requirements
We have the following logging levels, shown here in increasing order of verbosity: Error, Warning, Info, Verbose, Debug, and Data. Error, Warning, and Info diagnostics are written in customer language and are understandable by anyone who uses the system. Everything below Info is intended for developer consumption.
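To make the ordering concrete, here is a minimal sketch of these levels as a C# enum; the type is purely illustrative and not our actual logging API.

```csharp
// Illustrative only: the six log levels, ordered from least to most verbose.
public enum LogLevel
{
    Error,    // customer-facing
    Warning,  // customer-facing
    Info,     // customer-facing; the cutoff for real-time feeds
    Verbose,  // developer-only from here down
    Debug,
    Data
}
```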
We have two primary needs for logging: real-time visibility and issue debugging. A closely related requirement is error reporting.
Real-time logging provides a near real-time feed of top-level system information, a Twitter-like stream of what's going on. The amount of information that can be digested in real time is limited, however, so we restrict this feed to Info level and above.
To support issue debugging, a very low level of logging is required, down to code control flow and data operations. This information does not need to be available in real time; it can be gathered after the fact by a developer.
Best effort and no performance compromises
We consider logging (unlike error handling and reporting) to be a best-effort mechanism. This means that logging should never have the ability to crash the system. If a single log entry is lost, we do not consider it a system failure; if 5% of logs are missing, there is an issue.
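As a sketch of what "best effort" means in code, a failure anywhere inside the logging path is counted and swallowed, never propagated to the caller. The `ILogSink` transport and the dropped-log counter below are hypothetical names used for illustration.

```csharp
using System;
using System.Threading;

// Illustrative sketch: logging must never take the system down.
public class BestEffortLogger
{
    private readonly ILogSink sink;   // hypothetical transport (queue, WCF, page blob)
    private long droppedLogCount;

    public BestEffortLogger(ILogSink sink) { this.sink = sink; }

    public void Log(LogLevel level, string message)
    {
        try
        {
            sink.Write(level, message);
        }
        catch (Exception)
        {
            // Best effort: a lost log entry is not a system failure,
            // but we count drops so a 5% loss rate would be visible.
            Interlocked.Increment(ref droppedLogCount);
        }
    }
}

public interface ILogSink
{
    void Write(LogLevel level, string message);
}
```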
We also do not accept that there is a tradeoff between logging and performance, other than in isolated high-performance transactions and algorithms.
How we started: Azure Queues
I have long had logging as a primary feature in my toolbox of system scaffolding. When applying this to a new Azure system while working in an Agile team, the question was: how do we log as simply as possible, with the least effort? The answer: Azure queues. We simply wrote log messages to a queue in plain text.
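A minimal sketch of that first approach, using the Windows Azure storage client library, is shown below. The queue name and message format are placeholders, and the exact method names may vary between client library versions.

```csharp
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

// Sketch: every log entry becomes one plain-text message on a single Azure queue.
public class QueueLogger
{
    private readonly CloudQueue queue;

    public QueueLogger(string connectionString)
    {
        var account = CloudStorageAccount.Parse(connectionString);
        queue = account.CreateCloudQueueClient().GetQueueReference("diagnostics-log");
        queue.CreateIfNotExists();
    }

    public void Log(string level, string message)
    {
        // Plain text: timestamp, level, message, readable straight off the queue.
        var line = string.Format("{0:o}\t{1}\t{2}", DateTime.UtcNow, level, message);
        queue.AddMessage(new CloudQueueMessage(line));
    }
}
```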
Azure tables would have worked as well, but the API is a bit more difficult to work with. The built-in Azure diagnostics features would also have worked, but they are a bit tricky to set up (no matter how many times I've done it, I have to look it up), and Azure diagnostics has a minimum one-minute refresh interval, which is too long for my brain to hold a thought under stress. Built-in infrastructure also risks not growing with your needs.
As with all of the logging discussed here, we use in-memory client message buffers to keep the effect on performance minimal.
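The buffering pattern looks roughly like the sketch below: callers append to an in-process collection and return immediately, while a background thread drains the buffer and hands entries to whatever transport is in use. The types and the capacity are illustrative, not our actual implementation.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Sketch of the in-memory client buffer: the caller never waits on storage or the network.
public class BufferedLogger : IDisposable
{
    private readonly BlockingCollection<string> buffer =
        new BlockingCollection<string>(100000);          // illustrative capacity
    private readonly Action<string> send;                // e.g. a QueueLogger.Log call wrapped up
    private readonly Thread flusher;

    public BufferedLogger(Action<string> send)
    {
        this.send = send;
        flusher = new Thread(Drain) { IsBackground = true };
        flusher.Start();
    }

    public void Log(string line)
    {
        // Best effort: if the buffer is full, drop the entry rather than block the caller.
        buffer.TryAdd(line);
    }

    private void Drain()
    {
        foreach (var line in buffer.GetConsumingEnumerable())
        {
            try { send(line); } catch { /* best effort */ }
        }
    }

    public void Dispose()
    {
        buffer.CompleteAdding();
        flusher.Join();
    }
}
```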
Queues did well for a long time. I’ve found that a single queue can scale to about 60 transactions per second (this may have gone up since I measured it early last year). We did, however, quickly outgrow queues as a logging mechanism. If you’re starting a new project, and want something fast, queues may work well for you.
The next step: custom WCF log service
Our next step was to build a WCF service to act as a sink that client processes send logs to. The service maintains an in-memory ring buffer holding the last 100,000 logs, while a process asynchronously persists them. We maintain a ring buffer for each log level, so that we can keep a long history of Info events without them being drowned out by lower-level logs. These buffers are then served to a UI to provide real-time logging. We were able to push over 10,000 logs per second through the system using a single WCF service (with a heavily tuned net.tcp endpoint). This was a big step up from queues, but in time we outgrew this as well.
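The heart of that service is the per-level ring buffer. A simplified sketch is below; the capacity, locking strategy, and surrounding WCF contract are all illustrative. The real service holds one such buffer per log level and serves snapshots to the UI.

```csharp
using System.Collections.Generic;

// Sketch of a fixed-capacity ring buffer: new entries overwrite the oldest once the
// buffer is full, so Info events keep a long history in their own buffer no matter
// how fast Verbose/Debug/Data entries arrive in theirs.
public class RingBuffer<T>
{
    private readonly T[] items;
    private int next;   // index of the next slot to write
    private int count;  // number of valid entries

    public RingBuffer(int capacity) { items = new T[capacity]; }

    public void Add(T item)
    {
        lock (items)
        {
            items[next] = item;
            next = (next + 1) % items.Length;
            if (count < items.Length) count++;
        }
    }

    // Copies the buffered entries, oldest first, for serving to the UI.
    public List<T> Snapshot()
    {
        lock (items)
        {
            var result = new List<T>(count);
            int start = (next - count + items.Length) % items.Length;
            for (int i = 0; i < count; i++)
                result.Add(items[(start + i) % items.Length]);
            return result;
        }
    }
}
```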
Current solution: Queues, WCF service, and Page Blobs
Two things caused us to outgrow our pure WCF solution: the scale of our grid (up to multiple thousands of VMs) and a Windows Azure limitation that private endpoints cannot be shared across services. The former meant the sheer volume of log messages arriving per second overwhelmed our diagnostics sink. The latter mattered because we needed to run our system across multiple Windows Azure services; to resolve it, we would have had to host our logging endpoint on a public IP and deal with security.
Page blobs provided the answer. Page blobs are a Windows Azure Storage mechanism that can be written to in 512-byte pages. We have each process (role instance) in our system write to its own page blob. The blob names are rotated every day, and after every 50 MB, to keep them manageable. This gives us the low-level logging we require: when fully utilized, we can log a terabyte of information a day on a large grid. Due to the nature of this type of logging, we only use these logs for after-the-fact issue investigation.
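Because page blobs are written at 512-byte-aligned offsets, each write is padded to a whole number of pages and the writer tracks its own offset. A simplified sketch of the idea follows; the container name, blob naming scheme, and sizes are placeholders, and the calls (from the classic Windows Azure storage client) may differ between library versions.

```csharp
using System;
using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

// Sketch: each role instance appends to its own page blob, padding writes to
// 512-byte pages and rotating to a new blob daily / at roughly 50 MB.
public class PageBlobLogger
{
    private const int PageSize = 512;
    private const long BlobSize = 50 * 1024 * 1024;   // illustrative rotation size

    private readonly CloudBlobContainer container;
    private readonly string instanceId;
    private CloudPageBlob blob;
    private long offset;

    public PageBlobLogger(string connectionString, string instanceId)
    {
        this.instanceId = instanceId;
        var account = CloudStorageAccount.Parse(connectionString);
        container = account.CreateCloudBlobClient().GetContainerReference("instance-logs");
        container.CreateIfNotExists();
        Rotate();
    }

    private void Rotate()
    {
        // One blob per instance per day; the timestamp gives size-based rotation a fresh name too.
        var name = string.Format("{0}-{1:yyyyMMdd-HHmmss}.log", instanceId, DateTime.UtcNow);
        blob = container.GetPageBlobReference(name);
        blob.Create(BlobSize);
        offset = 0;
    }

    public void Append(byte[] data)
    {
        // Pad to a whole number of 512-byte pages, as page blobs require.
        int padded = ((data.Length + PageSize - 1) / PageSize) * PageSize;
        var page = new byte[padded];
        Buffer.BlockCopy(data, 0, page, 0, data.Length);

        if (offset + padded > BlobSize) Rotate();

        using (var stream = new MemoryStream(page))
        {
            blob.WritePages(stream, offset);
        }
        offset += padded;
    }
}
```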
For real-time logging, we use a combination of queues and our WCF service. We use queues to transmit Info, Warning, and Error messages from each of our processes, across the Windows Azure service boundary, to a diagnostics process that also hosts the WCF service described earlier. This gives us near real-time diagnostic messages (Info and up), aggregated and ready to be served to a UI. An overview diagram of this solution is provided below.
Azure Diagnostics Diagram
I will try and pull some of this code together into a GitHub project if anyone is interested. Cheers.