If you’re working on an application that’s been around for a while, has lots of usage and several different teams working on it, there’s a good chance it’s a dumpster fire. It happens to the best of us. Building software with real customer adoption is messy business. While there is no one size fits all, these are my opinions on building more reliable, scalable, and maintainable Internet applications.
1. Go All In On One Cloud
Pick the leading cloud provider that makes the most sense for your team and go all in. Don’t miss half the benefit of the cloud by limiting yourself to the lowest common denominator between cloud providers, which is basically just VMs. This is folly anyway as it is nearly impossible to build an application natively in the cloud without tentacles into proprietary resources from that provider. Trying to be on the light end of the lock in spectrum is not worth the trade off you’d be making in lost productivity and reliability.
2. Forget About Servers
Use containers and/or serverless for 100% of your dev, stage, and prod environments. There is almost no reason to use VMs directly for any new development. While containers and serverless have their own set of challenges they are an order of magnitude better than using configuration management (Ansible, Chef, Puppet) and are your only hope for getting your dev team to participate in DevOps, which is necessary to scale development of a cloud native application.
3. Commit Wholly To Infrastructure As Code
Don’t take shortcuts that question your ability to stand up the application from source in the future. Make defining your infrastructure as code a requirement, not a nice to have. This applies to your infrastructure primitives (e.g. building blocks directly in your AWS account) as well as any third party tools or services you may want to integrate with. If it can’t be brought to life entirely from source code, you should seriously consider an alternative that can. Your application code needs a specifically configured environment to run in. Setting up that environment by hand is like writing a program, deploying it to production, then deleting the source code. Let’s not even talk about that install guide document you pretend is accurate and up-to-date.
Your first line of code should be your Dockerfile, or maybe your serverless yaml or your Terraform. The point is, start with your infrastructure code, at least enough to define your immediate development environment, which becomes the blueprint for production. Store your infrastructure code with your app code, lock step with the application in the same git repo with each microservice. You should be able to clone a git repo from any computer, build the Docker image and immediately start working on it, no matter what language it’s written in or dependencies it has. “It works on my machine” is the joke of the last generation of software development. Think of your machine as a text editor and a Docker engine.
4. Embrace The Ephemeral
I think it’s helpful to think of your code running on a “server” that didn’t exist 5 minutes ago and won’t exist 5 minutes from now (or maybe seconds, or milliseconds). If you haven’t noticed, in the cloud they aren’t even called servers anymore, they’re instances, hinting at their proper use lifespan.
Never write out any state to local disk volumes. Pretend your environment doesn’t even have a local disk. Expect instances of your program to terminate abruptly due to auto-scaling or chaos experiments. Your operations need to be idempotent, and anticipate retries. You need good logging and usage metrics, which your application must be deliberately broadcasting. Throw away your ssh keys and keep the telemetry mindset of your application running on an inaccessible remote location.
This forces the issue of immutable infrastructure. It makes no sense to modify or “patch” something that has a lifespan of milliseconds to minutes. Change is always by terminating the existing resource and replacing it with a new one with new settings.
5. Use Managed Services Wherever Possible
Building and operating clusters of anything isn’t providing business value or helping you to quickly iterate. Wherever possible, skip all that undifferentiated heavy lifting and use a managed service.
Installing a database server for example, on a single server for development isn’t so hard, but doing it in production in a way that is highly available is another story entirely. You have to setup an auto-scaling configuration where nodes can launch and rejoin the cluster, setup and test failover and restore for your HA configuration, and setup and test all the associated monitoring and alarms. It’s easy to get something wrong here and since you’re usually in the data layer, these are costly mistakes. And since our dev and stage environments need to match our production environment exactly (apart from instance size), the easy single server installation isn’t acceptable in any environment. Maybe this would be manageable if we were doing this once for one huge database server, but in a microservices world each with their own isolated data layer, it’s just not practical to self manage with dozens of services. Recalling the earlier rule about bringing everything to life entirely from infrastructure code, usually the only way to accomplish this without a herculean effort is to use a managed service. If you’re insisting on installing everything on EC2 yourself you are missing half the benefit of the cloud.
6. Testing Should Make You Go Faster Not Slower
Testing should give you rails so you can speed, not gates to force you to come to a stop. Your tests (all of them – unit, integration, load, front end) should be a completely automated part of your build and deploy pipeline, and like your infrastructure code, should reside in the same git repo as the component they are testing. The same integration tests should run against all environments, especially production, which is the most important environment to test because it is the only one your customers actually see. This means you need to use test accounts in your real databases, your testing can’t rely on starting from a fresh database each run, which isn’t feasible in production, and demands exclusive use of an environment for testing, which obviously doesn’t scale. This requires a close partnership with development, it simply won’t happen if you treat QA as a siloed activity.
Testing should share the microservices mindset – when we replace a pipe in the basement, we don’t always need to reinspect the whole house including the windows in the attic. Testing is critical, but we don’t want testing at any cost. Don’t let outdated testing practices be the weak link in your agile process.
7. Always Be Deploying
Because I think this one is so important, I’m going to borrow the sentiment from the infamous Glengarry Glen Ross, “Because only one thing counts in this life. Get [the code deployed to production]! You hear me you ***** *****! A, B, [D]. A – always, B – be, [D] – [deploying]. Always be [deploying]. Always be [deploying]!”
The more time that passes between deployments, the higher your integration risk, and the greater chance you are blocking another service which you are a dependency for. This is exactly why we do CI/CD in development. While I’m not saying you need to do continuous deployment in production, the same risks accumulate as time goes on so you want to keep the change delta as low as possible. Plan on deploying every sprint. Most deployments are back-end services with minimal to no customer facing aspects, so this doesn’t necessarily mean visible UI changes all the time.
Each microservice should be independently testable and deployable. When working on a feature that spans multiple microservices, deploy each one as soon as it is ready, don’t squirrel them away to deploy all at once in some blaze of glory.
So put that coffee down, coffee is for deployers!
8. Write Your README Before Writing Any Code
As the saying goes, “If you don’t know where you’re going, any road will take you there.” I love the Amazonian practice of starting a new project or feature by first writing a press release announcing it. I think that causes you to think bigger and to go into the work with a more clear picture of the most important parts of what you’re building. Perhaps the microservice equivalent of that is writing your README first. Give a couple sentences about what the thing does. Show usage examples demonstrating the main interfaces. Show examples of how to build, test, and deploy it. Do this first, in the README in your git repo, not last in some external document management system.
And for the love of God, before any code is written, think about how you’re naming things. Nothing shows lack of thought about how your service fits into the bigger picture of your ecosystem of services like giving it a vague or misleading name. Renaming things seldom happens, or is really expensive to do later.
9. Open Source From The Beginning
Even if you don’t want third party contributions, I think open sourcing your project reinforces a few helpful perspectives. First you are forced to handle secrets appropriately, which makes it a lot easier to get a short term consultant involved, because you can actually share your code base without worrying about changing the secrets later (let’s be honest, doesn’t always happen). Open sourcing your project may also inspire confidence from your customers about the longevity of your platform and inspire the best from your developers who will hold themselves to a higher standard knowing their work will be on display.
More importantly, it serves as a constant reminder that while it takes valuable skill to produce the source code, it has no intrinsic value and can be easily duplicated. I know this is a difficult pill to swallow when you’re spending boatloads of money on development but it’s the truth. Your value is in your data set, relationships, reputation, and execution.
At least look to open source components of your application which may be general use tools, or with a small amount of additional thought could have a general use interface rather than being specific to your application.
10. Iterate on Customer Experience
Andy Jassy ended his 2018 AWS re:Invent keynote with some remarks which encapsulates what this is all about.
“For any of us who are trying to build long standing sustainable businesses, which is hard, the most important thing by far is to listen really carefully to what your customers want from you and how they want the experience to improve, and then to be able to experiment and innovate at a really rapid clip to keep improving the customer experience. That is the only way that any of us will be able to survive over a long period of time. That’s what you should care about, giving your builders the most capable platform and allow them to keep iterating on that customer experience.”
While of course he’s suggesting that AWS specifically be the foundation of that platform, the sentiment rings true, and that’s what all of this is ultimately about, investing our resources iterating on customer experience, and not get bogged down with boring IT stuff.