Adjusting to a new product owner role

Having operated in a tech lead or pseudo tech lead position for six of the last seven years, I am now in the early stages of a transition to an Agile Product Owner role. This is mostly a product management role, but with very close collaboration with the development teams. It is difficult, but I am confident. The following are key lessons learnt after three months.


Over communicate.

When in doubt, talk to people, and talk often. Add everyone on e-mail CC (or better, have everyone within earshot in the office). Don’t be exclusive, especially during a period of change in your organization. I would rather people complain that I send them too many e-mails than have them feel excluded.


Make compromises.

If you’ve been a tech lead for a while, you’re probably used to negotiating work patterns with a small group (and potentially having the dominant voice). Not everything in the organization will be in your control, so let groups find their own way. You don’t have the bandwidth to spend all day teaching or debating every disagreement. Get on with it.


Be patient.

Transition takes time. Be patient. Don’t get frustrated.  When in doubt, think of something valuable to do and do it.


The backlog is your friend.

This is your primary responsibility. Own it! A product owner without a good grip on their backlog is not performing their primary function. Always be the most informed about what’s coming up, its strategic value, and its details.


What value are you providing to the organization?

At least once each week, ask yourself this. Don’t spend your energy on anything that is not at the top of this list. When looking for an answer, build on your strengths.


Concentrate more than you think you need to on personal interaction.

This is hard for those of us coming from technical backgrounds. Trust is the most important thing to a team. It is costly to build, but in your new role it is now more important than the code.


It’s hard not coding as much.

Tell me about it. Find a solid weekend project and enjoy it. Also feel free to remind the devs that you can show them a thing or two if needed. :)


Software scheduling should include a discussion of risk

My last few blog articles have focused on process, and with this article I will continue down that path by discussing software scheduling, specifically how scheduling should include a discussion of risk.

In a way, all project management discussions are discussions of risk.  Without uncertainty, scheduling could be done by a computer.  Software development is uncertain, and the degree of uncertainty depends on the resources available, their familiarity with the development technologies and the domain, and numerous other factors.

To set the stage, let’s take a typical exchange between a project manager (non-technical team lead, technical manager, stakeholder, etc.) and a technical lead.  I have seen many conversations that proceed as follows:

Project manager:  We have provided you the requirements overview.  Are you able to estimate the work?

Technical lead: Yes, it looks like 6-8 weeks of work.

Project manager: Can you commit to delivering it within 8 weeks?

Technical lead: Yes.

Project manager: Ok, I will communicate this to the customer.  We have a lot riding on this, so let’s not mess up.  Can you provide me some progress milestones along the way?

Technical lead: Yes, there are about 5.  I will let you know as we complete them.

Project manager: Can we demo before 8 weeks?

Technical lead: Sure, we can demo progress every two weeks to the customer.

Project manager: Great news.  Thanks.

There is a lot of good in this conversation.  The project manager and the technical team are interacting at the right level of granularity (i.e. not on a day-to-day task basis).  The project manager is not interested in day-to-day deliverables, but instead in a high-level view of progress (X/5 high-level features complete).  Demos are scheduled in order to present progress to the customer.  There is, however, something missing from this conversation: what risks are out there?  We are making the project manager’s job too easy.

What if there is a 90% chance of delivering within 8 weeks, a 5% chance of delivering within 18, and a 5% chance of NEVER DELIVERING?

Two things could have happened here: the technical lead might not understand this risk distribution, or the technical lead understood it but hid it from the project manager because a 90% chance of success is good enough.
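
To make the numbers concrete, here is a small sketch of my own (using the hypothetical 90/5/5 split from above) showing what a single point estimate of “8 weeks” hides:

```python
# Hypothetical risk distribution for the 8-week commitment above.
# The numbers mirror the example: 90% on time, 5% very late, 5% never.
outcomes = [
    (0.90, 8),     # delivered within 8 weeks
    (0.05, 18),    # delivered within 18 weeks
    (0.05, None),  # never delivered
]

p_on_time = sum(p for p, weeks in outcomes if weeks is not None and weeks <= 8)
p_never = sum(p for p, weeks in outcomes if weeks is None)

# Expected duration, conditional on the project delivering at all.
delivered = [(p, w) for p, w in outcomes if w is not None]
p_delivered = sum(p for p, _ in delivered)
expected_weeks = sum(p * w for p, w in delivered) / p_delivered

print(p_on_time)                  # 0.9
print(p_never)                    # 0.05
print(round(expected_weeks, 2))   # 8.53
```

Even conditional on delivering at all, the expected duration is beyond the commitment, and a 5% chance of never delivering appears nowhere in the single number the project manager hears.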

What is really happening here is that the technical lead is taking on risk that the project stakeholders and the end customer are not aware of.

Is this bad?  Well, it depends.

Risk needs to be managed, mitigated, and worked into a business plan somewhere.  There are many organizations where a 5-10% chance of non-delivery is acceptably managed within a technical team.  There are some organizations where this 5-10% chance of non-delivery could have significant negative business impact and would be critical to surface.

Instead of risks being managed by technical leads, risks could instead be identified, quantified, and surfaced so as to fit into a more holistic risk management framework.  Within the technology area, one can look at the way enterprise security risk management has developed to gain an appreciation for the way technology risks can be managed, aggregated, and surfaced to business stakeholders.

It is important to allow top-level technical leads a great deal of autonomy.  It positively impacts quality, schedule, and employee retention (based solely on personal experience).  I am not promoting decreasing this level of autonomy, but I am promoting more sophisticated communication about scheduling and risk of delivery between technical leads and project stakeholders.

Identifying a team’s top risks is not terribly difficult:

  1. What technologies are we planning on using that we don’t know well?
  2. What domain problems do we need to solve that we don’t understand to the extent we need to?
  3. Is there a lack of availability in domain expertise?
  4. Is there a lot of work we are not good at (e.g. UI design, database optimization)?
  5. What is the degree of uncertainty of external dependencies?
  6. What is our gut feel about how much work there is?
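
Answers to these questions can be captured in a lightweight risk register. The following sketch is entirely illustrative (the risks and numbers are hypothetical, echoing the example dialogue later in this post): it ranks risks by expected schedule slip, treating any open-ended risk as more severe than any bounded one.

```python
# A minimal risk-register sketch (hypothetical risks and numbers):
# rank identified risks by expected schedule slip, with risks whose
# impact is unknown (open-ended) ranked above everything else.
risks = [
    {"name": "new external API reliability", "probability": 0.25, "impact_weeks": 3},
    {"name": "algorithm needs expert help",  "probability": 0.10, "impact_weeks": None},  # open-ended
    {"name": "new hire drains the team",     "probability": 0.25, "impact_weeks": 2},
]

def severity(risk):
    # Open-ended risks sort above everything with a bounded impact.
    if risk["impact_weeks"] is None:
        return float("inf")
    return risk["probability"] * risk["impact_weeks"]

ranked = sorted(risks, key=severity, reverse=True)
print([r["name"] for r in ranked])
# The open-ended algorithm risk ranks first, which is the same conclusion
# the project manager reaches in the dialogue below.
```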

So how does a team lead integrate these into discussions of schedule?  Answering that requires a discussion of delivery estimation, which I may get to in another post, but which is out of scope here.  I will end with another example discussion, this time including risk:

Project manager:  We have provided you the requirements document.  Are you able to estimate the work?

Technical lead: Yes, it looks like a high chance of delivery within 6-8 weeks of work, but we should discuss risks.

Project manager: OK.  What’s up?

Technical lead: Well,

  • We are working with a new external API, and although we have done a PoC, we don’t know how reliable it will be once we start reliability testing.  If this doesn’t go well, we could have up to another three weeks of implementation to achieve required reliability.
  • There is a new required algorithm and a new data structure that we are not 100% confident we can fully implement without external help.  We know Mr. Genius Quant Guy is very busy this release cycle.
  • We have a new hire who, although we are optimistic, might not work out well.  This could drain other productive resources more than we anticipated.

Project manager:  OK, I’ve recorded these risks.  What does delivery look like if any one of these occurs?

Technical lead:  Three weeks, unknown, and two weeks respectively.

Project manager: What is an approximate chance of occurrence?

Technical lead: 25%, 10%, and 25%, but those are real ballparks.

Project manager: It seems like not having a resource for the algorithm is the riskiest part of this project because of how long it could drag on.

Technical lead: Agreed.

Project manager: OK, thanks for letting me know.  We will include you in discussions with the project sponsors when we decide how to mitigate and communicate these risks to the customer.  Can you provide updates on overall progress?

Technical lead: I appreciate the inclusion, and yes, we have 5 milestones that we will use to track progress.  These map to our 5 key features.

Project manager: Great.  I feel like I have a good grasp on this.  Now I just wish technical lead M would understand his deliverables as well.  Let’s talk next week.


There are no hard and fast rules for being a good team lead or succeeding at delivering a project.  But if risk of delivery is not being discussed between those committing the business to deliverables (think $$$) and those responsible for delivering, a big piece is missing.  The best software development teams are agile; however, most projects require long-term planning and delivery commitments.  When there is money on the line, treating development risks casually should be unacceptable.

Classic project management and agile sometimes need each other

A great agile team is the most efficient way to deliver high quality software.  However, software delivery is often only part of a larger project, and sometimes pure agile alone cannot achieve what a set of stakeholders needs.  When multiple teams share a common goal and depend upon each other and the outside world to achieve it, classical project management has a place.

Classical project management breaks down a body of work, understands resources, identifies dependencies, and forms a strategic plan for achievement of a goal.  Project management also provides a mechanism for communication, dispute resolution, risk management, and other functions.

Agile software delivery builds high quality software very fast.  It does this by a constant focus on just-in-time delivery, engineering out dependencies, and removing waste in all phases of development.

Pushing agile throughout a large project is an option (think Toyota lean), but this is not always achievable or desirable.  Some teams may be very productive using non-agile processes.  So how do you organize deliverables across these teams with minimal friction?

I propose that agile practices should be capped at the team level, inter-team dependencies be rigorously minimized, and classical project management used to organize dependencies between teams.  Much as how we decouple software through isolation and encapsulation, the same should be done between teams.  Just as agile does not replace experienced software architecture, agile does not replace experienced project management.

Friction arises when project managers don’t understand agile, and software developers don’t understand project management.  Project managers will often try to understand the day-to-day tasks of a software team.  This is fraught with problems.  The day-to-day tasks of an agile team are agile by definition: they can change on a daily basis.  Trying to track deliverable progress of an agile team by task progress is arguably not agile and extremely inefficient.  Also, an enormous amount of communication is necessary to explain technical tasks to non-technical or semi-technical project managers.  I have even seen non-technical project managers have the developers estimate at the task level in order to estimate time to completion.  The only purpose of tasks should be to keep the hands-on developers communicating about the body of work.

Project managers should understand an agile team’s deliverables by business value, i.e.: features and releases containing a set of features, and not lower.  Features should be the only dependency an agile team has to the outside world.  The only communication of deliverables, estimation, and risks between project managers and an agile team should be done at this level.  This also enables project managers to ignore the details of the team’s resourcing (days off, number of developers, etc.) and treat the entire team as a single resource.  The teams are able to be agile, while the project gets what it needs.

In order to engage with project managers in this way, it may require more planning than some agile teams are used to.  During each planning phase of a project (every month is my preference) the team should discuss the body of work at the feature level and discuss schedule, estimation, delivery expectations, and risks with the other stakeholders in the project.  Just as project managers need to learn how to ignore day-to-day tasks of agile teams, agile teams need to learn how to engage project managers and plan.

Pair programming 3.5 years on

Pair programming is a topic that it’s easy to have a strong opinion about.  Three years ago I would have strongly opposed, saying things like “why pay two developers to do one’s job?”, or “I would expect every developer to write the highest quality code without pair programming.”  But as one of my favourite sayings goes, “hold strong opinions weakly.”

Three years ago, the team I was on wholly embraced pair programming for all production code.  We were taught it by a true legend, Jim Shore.  We did this in the context of a full embrace of Agile, TDD/BDD, and XP.  This was not without a lot of scepticism on my part.  Within 6 months, I was completely convinced of the benefits of pair programming and insisted that all code be developed by a pair.

Since then I have led development of a large system with a small (3-5 developer) expert team and have seen pair programming at its best and worst.

What does pair programming do well?

Pair programming ensures a level of quality commensurate with the skill of the development team.  It is harder to slack off while pairing and skip tests, write sub-standard code, or make stupid mistakes.  When working independently, it’s too easy to think “good enough” or “I won’t have to look at that again” and commit and push code that is not as good as it should be.  I have witnessed individuals write extremely high quality code, but I have too often seen them rush.  In my experience pairs are much more able to keep quality front and centre, and not feel rushed to say something is done prematurely.  The best pairing cultures constantly have a conversation with simple routine questions like “is this code good enough?”, “will another pair understand this in 5 years?”, “did we really understand the requirements?”.  These debates are hard to have with yourself.  No matter how much I value these reflective questions, I find myself asking them much less without someone else to talk to.

Pair programming maintains focus and momentum.  If one member of a pair is distracted (by a question from another developer, a phone call, an e-mail, a water break), the other can maintain focus, and thus momentum, on the task at hand.  Everyone needs mental breaks, and the dynamics of work necessitate communication with the outside world.  Pairing can maintain the momentum of a coding session while allowing for these healthy interruptions.

Pairing decreases stupid mistakes.  When a boring task has to be done repeatedly (but automation is not desirable for some reason), mistakes are easily made.  This is especially true when the task requires very little mental effort other than staying focused.  When a pair undertakes such tasks together, it’s amazing how many small errors are caught the moment they are made.

Pairing promotes a shared codebase.  When code is paired on, it’s much easier for everyone to take ownership of a codebase.  This both promotes a sense of responsibility and allows for easier criticism of existing code.

Pairing shares skills.  Developers have different skills.  Some skills take significant time to learn (OO and TDD/BDD patterns).  Some micro-skills, such as skills with many tools (vim, ReSharper, VS) can be picked up quickly.  Pairing rapidly shares skills at all levels without having to take time out for training.

What is pairing not great at?

Pairing is not great with highly creative work.  Highly creative work requires deep thought, imagination, and working through different ideas in a very individual way.  Pairing all the time may limit the extent to which new ideas are created to the detriment of the team.  This can be addressed by allowing individuals to split from a pair to work independently in an ad hoc and unscheduled manner.  This can also be addressed by having a fixed time each week for not pairing.

Pairing is not good with mixed level developers.  When an expert developer pairs with a junior developer, it quickly turns into either a training session, or the case where the junior is lost and mute while the expert continues without concern.  Both cases will be extremely frustrating for one or both parties.  Junior developers should be trained using explicit training mechanisms, which may include pairing, but should not be understood as normal work.  Professional software should be written by expert professional developers.  The purpose of junior work (both for the junior and the employer) should be for them to become expert professionals in their area as quickly as possible.  In most cases where an employer thinks there is a lot of junior work, I would wager there is something wrong in their assumptions.


When to pair and when not to pair?

I would pair until a team thoroughly understood when to pair and when not to pair without hard and fast rules.  As with much of professional software delivery, hard and fast rules don’t work at an expert level.  Whether our team paired or not, we would be constantly talking throughout the day about the overall body of work being undertaken.  I would understand what my peers were working on and vice versa.  Whether I needed a pair to jump in for help, or needed to pair all day, was an ongoing part of the breakdown of work.  When in doubt, err on the side of pairing, and keep a vigilant eye on the quality of code being produced as a team.  Pair programming, independent work, and informal code reviews can all work together in an Agile/XP team that trusts each other’s skills and constantly communicates.

Migrating from NUnit to MSpec with psake and TeamCity

We have a large project with thousands of NUnit tests. We are starting to use MSpec for new projects due to it being less verbose than NUnit for BDD style tests. We would like to enable new tests in existing projects to be written in MSpec, but maintain the current NUnit tests and get all of the nice testing integration with TeamCity for both NUnit and MSpec.

First, we reference MSpec from one of our existing test assemblies using NuGet, and write an MSpec test.

Next, we modify our psake build script to run both NUnit and MSpec tests.

Existing psake unit testing task:

task UnitTests -depends Compile -description "Unit Tests" {
  exec { & $nunit $nunitTestsNUnitFile /nologo /config:$buildConfiguration /noshadow "/exclude=LongRunning,EndToEnd,Deployment" }
}

Updated psake unit testing task:

task UnitTests -depends NUnitUnitTests, MSpecUnitTests -description "Unit Tests" {
}

task NUnitUnitTests -depends Compile -description "NUnit unit tests" {
  exec { & $nunit $nunitTestsNUnitFile /nologo /config:$buildConfiguration /noshadow "/exclude=LongRunning,EndToEnd,Deployment" }
}

task MSpecUnitTests -depends Compile -description "MSpec unit tests" {
  # Find every test DLL whose build output directory also contains the MSpec assembly
  $testDlls = ls "$srcDir\*\bin\$buildConfiguration" -rec `
    | where { $_.Name.EndsWith(".Tests.dll") } `
    | where { (Test-Path ([System.IO.Path]::GetDirectoryName($_.FullName) + "\Machine.Specifications.dll")) -eq $True } `
    | foreach { $_.FullName }

  $mspecExePath = Join-Path $srcDir "packages\Machine.Specifications.0.5.7\tools\mspec-clr4.exe"

  # Use TeamCity-formatted output when running under TeamCity
  if ($env:TEAMCITY_VERSION -ne $Null) {
    exec { & $mspecExePath $testDlls --teamcity }
  } else {
    exec { & $mspecExePath $testDlls }
  }
}

As you can see, we have split the unit testing task into separate NUnit and MSpec dependent tasks. We keep a flat project structure, so this allows us to perform a query to find all MSpec test DLLs (we assume that any test project build directory containing Machine.Specifications.dll should be run with the MSpec runner). These DLLs can also contain NUnit tests.

To integrate with TeamCity, we check for the existence of the TEAMCITY_VERSION environment variable. Running the MSpec exe with --teamcity causes it to output in a way that TeamCity understands. Now we have both our NUnit and MSpec tests in TeamCity’s list of tests.

If a developer wants to start using MSpec in a project, they add a reference to MSpec via NuGet.

Secure SSH proxy with Windows Azure Linux VM

A proxy server can provide a secure mechanism to route Internet traffic through when on an insecure network, such as at a hotel or coffee shop. This guide demonstrates how to create a Linux VM within Windows Azure, set up SSH, and set up your laptop to easily connect to it and enjoy secure browsing.


Create Azure account and subscription, and then browse to the management portal.

Sign-up for the VM preview.

Create a new Linux VM from the gallery (I’ve chosen Ubuntu, but you can use other distributions).

When prompted to, give it a username and password.  We will use an Extra-Small instance, and we will not upload an SSH key for this example.

VM Setup (cont). Note this image is slightly incorrect, as I used the DNS name SshProxy.

VM starting:

Once your VM is created, you can test that it’s available by logging on using SSH.  I’m using the SSH client that comes with Git, but feel free to use any; PuTTY is a particularly good one.

Now that you have an SSH server available, you have many options for tunnelling traffic. A simple approach is to use PuTTY to set up a dynamic tunnel, and set your browser’s configuration to use the tunnel.

The following screenshots show the PuTTY setup.


Dynamic proxy:

Once you’ve done this, press the Open button. You will be prompted to login.

Now set up your browser to route all traffic through this port. The following shows the configuration for Firefox:

Now you can google “What’s my IP” and you should see it changed. You now have web browsing traffic routed through a secure tunnel.

Although this method works, it is not as easy as it could be, especially if you are changing browser settings every time you go to a coffee shop. Additionally, it will not secure connections made outside of your browser. I use ProxyCap to route all Internet traffic through the SSH server. This means I don’t need to use PuTTY, change browser settings, or worry about other processes that don’t have proxy configuration. ProxyCap is not free ($25?), but I have found it to be a very reliable application.

The following is the server configuration for ProxyCap:

In addition, you will need to set up the following rules: one for forwarding all traffic, the other for ignoring connections to the SSH server.

WebRole Deployment with Azure CmdLets

We recently set up a new Azure project.  We had previously built a custom Azure management API client for this purpose, but didn’t want to share this much code, so we looked for another solution.

We used the Azure CmdLets and were surprised at how quickly we were able to automate our deployment.

I’ve posted the code; you will not need the Azure SDK or tools installed to use this project (one of our requirements).  Please reference the included file for more info.

Enigma Machine Exercise

At London Software Craftsmanship 2012, there was an excellent session held by the guys at Financial Agile, where they had everyone attempt to build an Enigma encryption machine.  We were given about two hours to complete the exercise (with the promise of beer on success), and unfortunately I did not finish.

The purpose of the exercise was to provoke discussions regarding TDD approaches, such as the value of varying levels of tests, the need for mocking, etc.

The code is on GitHub.

Update: I rewrote the NUnit BDD style tests using MSpec as an exercise (branch MSpec).

Large scale logging in Windows Azure: Queues, WCF, and Page Blobs


When something goes wrong in a large-scale software system, how do you diagnose it?  This article provides an overview of the journey our team took in developing diagnostic logging for a large scale Azure grid computing system.

Logging philosophy

The value of very detailed, developer-facing diagnostic logging can be debated.  The main argument for it is the ability to gain as much information as possible to debug an issue, for those situations where reproducing a bug is not an option.  The main arguments against are that verbose logging decreases code readability and can hinder performance.

For the purposes of this discussion, I will take as given that a very verbose level of diagnostics logging is required.  A bug should only be observed once, and the development team should have enough information to fix it without reproducing.

Summary of requirements

We have the following logging levels, shown here in increasing order of verbosity: Error, Warning, Info, Verbose, Debug, and Data.  Error, Warning, and Info diagnostics are written in customer language and are understandable by anyone who uses the system.  Everything below Info is intended for developer consumption.

We have two primary needs for logging: real-time visibility, and issue debugging.  A closely related requirement is error reporting.

Real-time logging

Real-time logging provides an almost real-time feed of top-level system information: a Twitter-like feed of what’s going on.  The amount of information that can be digested in real time is limited, however, so we limit this level of logging to Info and up.

Issue debugging

To support issue debugging, a very low level of logging is required, down to code control flow, and data operations.  This information does not need to be real-time, but can be gathered after the fact by a developer.

Best effort and no performance compromises

We consider logging (unlike error handling/reporting) to be a best effort mechanism.  What this means is that logging should not have the ability to crash a system.  If a single log is lost, we do not consider it to be a system failure.  If 5% of your logs are missing, then there’s an issue.

We also do not accept that there is a tradeoff between logging and performance, other than in isolated high-performance transactions and algorithms.

How we started: Azure Queues

I have long had logging as a primary feature in my toolbox of system scaffolding.  When coming to a new Azure system, working in an Agile team, the question was: how do we log as simply as possible with the least effort?  The answer: Azure queues.  We simply wrote log messages to a queue in plain text.

Azure tables would have worked as well, but the API was a bit more difficult to work with.  The built-in Azure diagnostics features would have worked, but are a bit tricky to set up (no matter how many times I’ve done it, I have to look it up).  Azure diagnostics also has a minimum one-minute refresh interval, too long for my brain to hold a thought under stress.  Also, built-in infrastructure risks not growing with your needs.

As with all of the logging discussed here, we use in-memory client message buffers to have minimal effect on performance.
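
The buffering idea can be sketched roughly as follows (my own illustration in Python; our actual client code was C#): the logging call never blocks, and if the buffer fills up, messages are dropped rather than slowing the host process.

```python
import queue

# Best-effort, bounded log buffer (a sketch of the idea, not our actual code):
# callers never block; if the buffer is full, the log line is dropped, so
# logging can never stall or crash the host process.
class BestEffortLogBuffer:
    def __init__(self, capacity=10_000):
        self._queue = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def log(self, message):
        try:
            self._queue.put_nowait(message)
        except queue.Full:
            self.dropped += 1  # best effort: drop rather than block

    def drain(self, max_items=1000):
        # Called by a background sender to ship a batch to the sink
        # (queue, WCF service, or blob, depending on the log level).
        batch = []
        while len(batch) < max_items:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch
```

A background thread periodically calls `drain()` and pushes each batch to the diagnostics sink.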

Queues did well for a long time.  I’ve found that a single queue can scale to about 60 transactions per second (this may have gone up since I measured it early last year).  We did, however, quickly outgrow queues as a logging mechanism.   If you’re starting a new project, and want something fast, queues may work well for you.

The next step: custom WCF log service

Our next step was to build a WCF service that would provide a sink for client processes to send logs.  What this involves is a WCF service that maintains an in-memory ring-buffer.  We would store the last 100,000 logs, while a process asynchronously persisted them.  We would maintain a ring-buffer for each log level, so that we could keep a long history of info events, without them being drowned by lower level logs.  We would then serve these to a UI to provide real-time logging.  We were able to get over 10,000 logs/second through the system using a single WCF service (with a very tweaked net.tcp endpoint).  This was a big step up from queues, but in time we outgrew this as well.
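
The per-level ring-buffer idea looks roughly like this (an illustrative Python sketch; the actual service was a C#/WCF implementation, and the capacity is from the description above): each level gets its own fixed-size buffer, so a flood of Debug logs can never evict the much rarer Info, Warning, or Error history.

```python
from collections import deque

# Sketch of the per-level ring-buffer idea (names and sizes are illustrative).
LEVELS = ["Error", "Warning", "Info", "Verbose", "Debug", "Data"]

class LogRingBuffers:
    def __init__(self, capacity_per_level=100_000):
        # deque with maxlen acts as a ring buffer: appending to a full
        # buffer silently discards the oldest entry.
        self._buffers = {level: deque(maxlen=capacity_per_level) for level in LEVELS}

    def append(self, level, message):
        self._buffers[level].append(message)

    def latest(self, level, count=100):
        # Serve the most recent messages of one level to a real-time UI.
        return list(self._buffers[level])[-count:]
```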

Current solution: Queues, WCF service, and Page Blobs

Two things caused us to outgrow our pure WCF solution: the scale of our grid (up to multiple thousands of VMs) and a limitation of Windows Azure that you can’t have private endpoints shared across services.  The former caused the sheer number of log messages coming in per second to overwhelm our diagnostics sink.  The latter arose because we needed to run our system across multiple Windows Azure services; to resolve it, we would have needed to host our logging endpoint on a public IP and deal with security.

Page blobs provided the answer.  Page blobs are a Windows Azure Storage mechanism that can be written to in aligned 512-byte pages.  What we did was have each process (role instance) in our system write to its own page blob.  The names of these blobs are rotated every day, and every 50MB, to ensure manageability.  This provides us the low-level logging we require.  When fully utilized, we can log a terabyte of information a day on a large grid.  Due to the nature of this type of logging, we only use these logs for after-the-fact issue investigation.
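
The only subtle part is the 512-byte alignment: each write must cover whole pages, so a batch of log lines is padded before upload. A minimal sketch of that alignment logic (illustration only; the actual blob upload call is omitted):

```python
PAGE_SIZE = 512  # Azure page blobs are written in aligned 512-byte pages

# Prepare a batch of log lines for appending to a page blob: join them
# and zero-pad the payload out to the next page boundary.
def to_page_aligned(batch):
    payload = "".join(line + "\n" for line in batch).encode("utf-8")
    remainder = len(payload) % PAGE_SIZE
    if remainder:
        payload += b"\x00" * (PAGE_SIZE - remainder)  # pad to the page boundary
    return payload

data = to_page_aligned(["worker 7 started", "task 42 complete"])
assert len(data) % PAGE_SIZE == 0  # safe to write as whole pages
```

The writer also tracks its current page offset in the blob, advancing it by `len(data)` after each append; padding bytes are simply skipped when the logs are read back.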

For real-time logging, we use a combination of queues and our WCF service.  We use queues to transmit Info, Warning, and Error messages from each of our processes, across the Windows Azure service boundary, to a diagnostics process that also hosts the WCF service described earlier.  This allows us to get near real-time diagnostics messages (Info and up) aggregated and served to a UI.  An overview diagram of this solution is provided.

Azure Diagnostics Diagram

I will try and pull some of this code together into a GitHub project if anyone is interested.  Cheers.

Technology breaks: it’s how you react to it

After years developing and operating mission- and life-critical software systems, I held a deep belief that failures in software systems, processes, and the people who manage them were unacceptable. I believed that any defect in these aspects of software delivery could be engineered away to the point of being a negligible risk. The best people, good practices, rigorous TDD, a high level of automation, chaos testing, and manual QA will ensure software simply will not break… right?

There are some fields of software engineering in which the answer has to be yes: missile guidance systems, x-ray machines, space shuttle navigation systems, etc.

Unless our customers want to pay for it, this level of quality is prohibitively costly; achieving it would slow enterprise system development to the point of losing the required business agility.

What we must do is understand that technology breaks. We make mistakes. Bugs will be written, and we cannot cover all possible scenarios with the above-mentioned strategies. Up-front software quality matters enormously. But what also matters is how we handle the situation when it breaks.

Do we actively monitor our software to find issues before our customers do?

Do we have a well-defined and practiced incident management process, so when incidents occur we are able to react to them in an organized, professional, and calm manner?

Do we transparently keep our customers informed during critical incidents?

Do we communicate the risks of software systems to our customers and work with them to understand that software sometimes fails?

Do we measure the severity, frequency, and duration of critical incidents?

Do we learn from our incidents? Do we take root cause analysis seriously?

Given limited resources, we can only take software quality so far for most business cases. What we can do is compensate by easing the pain of critical incidents, and by having a good process in place to assure our customers of our expertise, learn from our mistakes, and not allow ourselves to be drowned by the fire drill.
