Tuesday, November 5, 2013

Staying fast and good when making enterprise software

My startup Staq builds enterprise software using the techniques I've learned from building consumer web apps. As we bring more customers on and our revenue increases, we have more to lose when there's a bug in our code, and there's pressure to revert to a more traditional development process where the developers have less control. Specifically, we're debating whether to move away from our current fast-paced "manual continuous deployment" process (where we're pushing code throughout the day, every day, and getting immediate feedback) to a more traditional commit-review-QA-release cycle. I have pretty strong feelings against going in that direction, and I wanted to document them here.

I know from experience that a process like that will inevitably slow us down, which we can't afford during this phase of hyper-growth. It can produce a boring, toxic work environment in the long run. Code you wrote today doesn't get in front of a user until next week, and by then you've forgotten the context that explains why you wrote it the way you did. When there's a bug, you have to start from scratch to figure out the problem. Feedback gets delayed, if you get it at all.

So how do we stay fast while keeping the software quality high? How do we stay fast and good?

I intend to establish a core value at Staq where we never, ever stop deploying. We make software at such a high level of quality, and our cluster immune system is so awesome, that it's nearly impossible to kill the app. When defects occur, they are so obvious and so easy to correct that users rarely notice the problem.

In the early days there may be times when we have to compromise on this value. When that happens, I want it to feel like an ugly compromise. I want the situation to become a sign that we need to improve our systems to get back to the platonic ideal of Always Be Deploying. ("Coffee is for committers!")

Here's a partial list of strategies we came up with to make this work, most of them drawn from the lean startup movement:
  • Work in small batches: smaller changes are less likely to cause a disruption, and are easier to pinpoint when they do cause problems.
  • Don't take shortcuts: skipping tests and hacking things together ultimately add unacceptable risk and technical debt.
  • Practice five whys/root cause analysis: don't make the same mistake twice; gradually develop a cluster immune system.
  • Expect 100% unit test code coverage, backed by appropriate integration and feature tests: catch problems as early as possible
  • Set up a continuous integration server: we have too many modules for any one developer to test all of them at once, so this server will become a backstop, alerting us when there is a problem in one of our modules.
  • Use feature flags: while we are testing new features, hide them from users until they are fully baked.
  • Build staging environments that make it easy for developers to double-check their work
    • Since we're on Heroku we plan to use their awesome pipelines lab for this
    • We can also fork our database to create a sandbox for features that depend on database changes
    • This should never be construed as a required step before code can be released to production
  • Code reviews
    • Twice a day I review every commit in our system; it really helps spot issues before they become problems
  • Build informative, actionable alerting systems
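
To make the feature flag item concrete, here is a minimal sketch in Python. It is not how Staq does it; the FeatureFlags class, the "new_dashboard" flag, and the example email addresses are all made up for illustration. The idea is just that new code can ship to production while staying hidden from everyone except the people testing it.

```python
class FeatureFlags:
    """Tiny in-memory flag registry (hypothetical, for illustration only)."""

    def __init__(self):
        self._flags = {}

    def define(self, name, enabled=False, allow_users=()):
        # enabled: flag is on for everyone.
        # allow_users: users who see the feature while it is still being tested.
        self._flags[name] = {"enabled": enabled, "allow_users": set(allow_users)}

    def is_enabled(self, name, user_id=None):
        flag = self._flags.get(name)
        if flag is None:
            # Unknown flags are treated as off, so a typo hides a feature
            # rather than exposing it.
            return False
        return flag["enabled"] or user_id in flag["allow_users"]


flags = FeatureFlags()
# Deployed to production, but only visible to one (made-up) tester.
flags.define("new_dashboard", allow_users={"dev@example.com"})


def render_dashboard(user_id):
    if flags.is_enabled("new_dashboard", user_id):
        return "new dashboard"
    return "old dashboard"


print(render_dashboard("dev@example.com"))
print(render_dashboard("customer@example.com"))
```

Once the feature is fully baked, you would flip enabled to True for everyone and then delete the flag and the old code path, so the flags don't turn into technical debt of their own.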

What else should we add to the list?

3 comments:

gtrak said...

I like Michael Nygard's talks. 'Disbanding the Deployment Army' covers a lot of tradeoffs: http://www.youtube.com/watch?v=Luskg9ES9qI

Navjeet said...

Great tips. Thanks for sharing.

Michael Scepaniak said...

"Expect 100% unit test code coverage, backed by appropriate integration and feature tests".

Really? One of the purposes of fast release cycles is typically to get immediate feedback, ostensibly because you don't know just how valid a feature is until it is validated by actual production usage. That implies that many of these features are experiments. Do you really want to enforce the same level of test coverage on "experiments" as core features? Is this expectation set with an anticipated "hit rate" of feature experiments?

Regardless, great approach here. Thanks for writing, Mike.


Mike...