Building software can be hard. Requirements can be swept under the rug, only for the team to find out later: Whoops. We shouldn’t have forgotten about those. Stakeholders’ requests can silently be forgotten, only to be brought up later, eroding trust. Decisions can take a long time to make if the right people are missing, or even if the people in the room don’t realize they have the power to make them. Developers may also be blocked on their work for want of one critical piece of information. Who better to alleviate these pains than the team’s Lead?
Call the position a Lead Developer. Call it a Development Manager. Call it whatever. Even if you don’t have the title, the ability to influence and lead people to make the team’s product, people, or processes better is much needed on every development team.
As a Lead, you are on the line for everything your team does. The glory you pass down to individual team members, or to the entire team. The failures you have to suck up and own yourself. Since the engineering lead is on the line for the team’s output and performance, there is a large incentive to use your experience, skills, and contacts to supercharge your team.
One method of influence I have been using recently is picking up decisions that haven’t been made, or information the development team needs, and driving them to some sort of closure.
I am the type of Lead who will perform a gut check and directly ask a developer if they’re blocked on missing information. If the way to unblock them is clear and simple, I point them in the right direction, backing it up with whatever details I have about the technical design, vision, or user story, all without having to reach out to the person best suited to answer. If the question is important, where the wrong answer could waste time or affect the product in a negative way, reaching out to the person who would know the answer is often necessary. Making it your personal mission to figure that out and report the answer back to the dev builds trust that yes, you the dev Lead can help.
Side note: If the dev knows the problem area well and is able to talk with stakeholders or the other people needed to solve their problem, encourage them to own figuring this out themselves instead of dealing with it yourself. Empowering your dev to be more independent by dealing with people they may not have met grows the number of contacts they have, improves their resourcefulness, and can leave them more engaged with the problem. Since this may be uncharted territory for the dev, one-on-one time is quite valuable for talking about your report’s recent situations, helping them problem-solve, and strategizing.
We are all in the 21st century working at high-tech organizations, and meetings are often terrible given the wealth of synchronous and asynchronous tools that can get the same or a better outcome. Therefore I don’t like attending most meetings. Sometimes, though, you just have to get multiple people into a physical or virtual room and talk things through. Gaining the skills of a meeting facilitator is very beneficial. It’s basically the practice of having an agenda, leading the meeting, keeping people on track, coming to conclusions on the talking points, and lastly creating action items. Without a facilitator, it is easy for a meeting to be taken over by one speaker or topic, leaving all other items untouched. Action items can also fall by the wayside, either by not being discussed or by people not being held accountable, which can absolutely sap people’s faith in the effectiveness of that meeting, especially if it’s recurring.
Sometimes you might be missing one critical person in the room. It’s always painful to realize: We’re not going to get to a decisive answer on what we should do since we’re missing Jimmy. Getting good at spotting this ahead of time helps make your meetings productive, either by cancelling them to save everyone’s time or by consulting with the missing people beforehand. Sharing this intuition as feedback with other people who host meetings can only help prevent this from happening in the future. No one likes wasting time.
It’s one thing to have the meeting and come out feeling: Great! Everyone knows what needs to be done. Time to sit back and watch my genius planning unfold. Wrong. That’s half of the battle. You still have to course correct from time to time. This could mean following up with the people assigned action items to see if they need help or are blocked, freeing up devs from tasks of lesser priority, and making sure the right people are notified when action items are completed.
But when the stars do align and the team gets shit done, don’t stay entirely humble. Remember to give yourself some credit for accelerating the team.
For businesses to outperform their competitors and bring ideas to the market fast, Software Development has evolved towards a continuous delivery model of shipping small, incremental improvements to software. This method works incredibly well for Software-as-a-Service (SaaS) companies, which can deliver features to their customers as soon as features are fit to release.
The practice of Continuous Delivery requires the master branch to be in a readily shippable state. Decreasing the time it takes to ship a change to production encourages faster iteration and smaller, less risky changes. Additionally, Continuous Deployment, the shipping of the master branch as soon as changes make it to master, is achievable with a comprehensive suite of automated tests.
For a development team, keeping this cycle on the order of minutes to tens of minutes is paramount. Slowing down means a slower iteration cycle, therefore resulting in larger and riskier changes being made.
I have noticed my team slowing down by using our handful of staging servers more often than is necessary.
Thankfully, we can get back to where we were, or even better, and learn a few things along the way!
Why we have staging servers/environments
My team builds the platform for Shopify’s Help Centre and the merchant-facing experience for contacting Support. Our 20 Technical Writers on the Documentation team also contribute to this same app.
Technical Writers work alongside the many product teams at Shopify to create and update documentation based on what the product team is building. Part of the process of continuously delivering this documentation is a member of the product team reviewing the changed pages for accuracy.
This is often achieved through a Technical Writer publishing content to one of a handful of staging servers, then directing the product teams to visit the staging server.
This workflow makes sense for the most part, since non-technical people can simply visit the staging server to view the unpublished changes. This workflow of having many staging servers isn’t a scalable solution, but that’s for another post.
An effect of having all of these available staging servers is that developers use them to perform various tasks such as:
Sharing their work for other developers to look at
Testing out risky changes in a production-like environment
It can be pretty easy to rationalize slowing down as being more careful, but this is just a fallacy.
This is an alternative outlook on shipping software, since things can and do go wrong. But when developers are given the freedom to move fast, and are not held down by strict process, most of the time they strike the best balance of risk and reward. When things do go wrong, a safety net of tests and production tooling makes it easy to figure out what happened, and the ability to revert to a previous state keeps the impact minimal.
Over the past few months I have observed a number of situations where developers have used staging environments instead of better alternatives.
One of the biggest slowdowns in the iteration cycle is the time it takes to get your code reviewed by someone else. It’s an incredibly important step, but there are shortcuts that can be taken. One of those shortcuts is reviewing code on a staging server.
It takes far longer to deploy code to a staging server than it does to check out someone’s branch and run the code locally. Getting into the habit of pulling down someone’s changes, reviewing the code, and performing some exploratory testing with a running instance of the app enables a deeper inspection and understanding of the code.
Additionally, using staging servers to test out code “because it doesn’t work on my machine” is an anti-pattern. Developers must prioritize having all features working locally for everyone, at any time, by default. A dysfunctional local development environment just feeds the vicious cycle of more and more things needing to be tested on staging. Putting the time in to make everything testable in the local development environment pays dividends in speed and developer happiness.
Shipping large, risky changes by first vetting that they work on staging gives developers a shortcut that results in iterating at a slower pace. Here’s a concrete example showing how much extra time it takes to test out code on staging.
Dev B is reviewing Dev A’s code. Dev B looks over the changeset, and then asks Dev A to put their code up on staging so that they can verify that the code works as expected. Dev A pushes their code to a staging branch, waits for CI to pass, waits for the deploy to succeed, then notifies Dev B that they can test out the changes. Dev B then gets around to going through the steps to verify that the new changes behave as expected. Dev B then finally gives their sign-off on the changeset, or requests further changes. This entire process, mostly spent waiting for builds and CI, can take 30 minutes or more.
Now let’s see what a modified version of the process looks like if Dev B reviews Dev A’s code on their local machine. Dev B looks over Dev A’s changeset, then pulls down the code to their local machine for further inspection. Dev B starts up the app locally and goes through the steps to verify that the new changes behave as expected. Dev B optionally has the ability to poke around the changed code to gain a better understanding of how it fits in with the existing code. Dev B signs off on the changeset, or requests further changes from Dev A. This process can take 5 minutes or more, but it is still several times faster than using a staging environment.
As we can see, verifying that Dev A’s code works correctly on staging takes roughly six times longer (30 minutes versus 5) due to having to wait for code to build, deploys to occur, and even unneeded conversations to coordinate using the staging environment. The same outcome can be achieved much faster by replacing many of the steps with faster equivalents. For example, running CI and performing a deploy isn’t needed when running code locally. There’s also no time spent coordinating with Dev A to put their code up on the staging environment.
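To make the comparison concrete, here is a small sketch that adds up the two workflows step by step. The post only gives the rough totals (about 30 minutes on staging, about 5 minutes locally); the per-step minutes below are illustrative assumptions chosen to match those totals, not measurements.

```python
# Minutes per step. Every figure is an illustrative assumption, picked only
# so the totals match the rough 30 and 5 minutes mentioned in the text.
staging_review = {
    "read the changeset": 2,
    "coordinate a staging deploy with Dev A": 3,
    "wait for CI on the staging branch": 12,
    "wait for the deploy to finish": 10,
    "verify the behaviour on staging": 3,
}

local_review = {
    "read the changeset": 2,
    "pull the branch and start the app": 1,
    "verify the behaviour locally": 2,
}


def total_minutes(steps: dict) -> int:
    """Sum the duration of every step in a review workflow."""
    return sum(steps.values())


staging_total = total_minutes(staging_review)
local_total = total_minutes(local_review)
slowdown = staging_total / local_total

# Time spent purely waiting on machines in the staging workflow.
waiting = sum(
    minutes for step, minutes in staging_review.items() if step.startswith("wait")
)

print(f"staging: {staging_total} min, local: {local_total} min")
print(f"staging is {slowdown:.0f}x slower; {waiting} min of it is waiting")
```

With these assumed numbers, most of the staging review is spent waiting on CI and the deploy, which is exactly the time the local workflow skips.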
Using the staging environment to review someone’s changes may feel faster, but this is only a fallacy. Dev B may think: “If I just need to visit the staging environment to review Dev A’s code, then I save myself time from having to stash my local changes, pull down the code, and start the app.” Correct, this saves Dev B’s time, but it costs Dev A more time than it saves. Dev A has to push their code up to the staging env, which causes CI to run and a deploy to occur, then notify Dev B to take a look tens of minutes later.
Where staging environments make sense
As with all hard-and-fast rules, there are some exceptions. One of those exceptions is validating new configuration for production systems. For example, since it’s not simple to run a local Kubernetes cluster, it’s safer to verify risky changes to Kubernetes Deployment config files by using a production-like environment: staging.
Another example is where lives or the wellbeing of people can be on the line. An example of this would be developing a payment processing service where breaking things could result in financial consequences for users of the system. Even a voting system would be an example of a critical system where it’s necessary to take the time to make sure everything is working correctly.
Chatting with another developer about this blog post, I asked for some examples as to what kinds of things they use their staging environment for.
Dark launching new functionality
Shipping to production can have a certain amount of risk. A code change could crash the app, break a feature, or even cause a worse user experience. What if we could ship to production and drastically reduce these risks?
Let’s talk about dark launching new features and changes. Dark launching is the practice of shipping new code to production, but hiding it from most users to prevent accidentally breaking things or negatively affecting the user’s experience. This could be implemented in a number of different ways:
Using the new logic if a special parameter is added to the page’s URL
A special cookie set in the user’s browser to enable the new logic
A/B testing of the current and new logic
Enabling the new logic only for employees
A beta flag that can turn on and off the logic at runtime
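As a rough illustration, several of these toggles can be combined into a single check that decides whether a request sees the new code path. This is a minimal sketch, not the implementation used by any particular app: the `Request` class, the `new_search` parameter and cookie names, the `@example.com` employee domain, and the in-memory `beta_flags` dictionary are all hypothetical stand-ins. A/B testing is left out of this sketch.

```python
from dataclasses import dataclass, field
from urllib.parse import parse_qs, urlparse


@dataclass
class Request:
    """Minimal stand-in for a web request (hypothetical, not a framework type)."""
    url: str
    cookies: dict = field(default_factory=dict)
    user_email: str = ""


# Hypothetical names, chosen only for this sketch.
DARK_LAUNCH_PARAM = "new_search"
DARK_LAUNCH_COOKIE = "new_search"
EMPLOYEE_DOMAIN = "@example.com"

# Runtime flag. A real app would keep this in a datastore or flag service
# so it can be flipped without a deploy; a dict keeps the sketch runnable.
beta_flags = {"new_search": False}


def use_new_logic(request: Request) -> bool:
    """Return True if this request should see the dark-launched code path."""
    # 1. Special URL parameter, e.g. ?new_search=1
    query = parse_qs(urlparse(request.url).query)
    if query.get(DARK_LAUNCH_PARAM) == ["1"]:
        return True
    # 2. Special cookie set in the user's browser
    if request.cookies.get(DARK_LAUNCH_COOKIE) == "1":
        return True
    # 3. Employees only
    if request.user_email.endswith(EMPLOYEE_DOMAIN):
        return True
    # 4. Beta flag that can be turned on and off at runtime
    return beta_flags["new_search"]
```

Everyone else keeps getting the existing behaviour until the beta flag is switched on, at which point the new logic is live for all users.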
For example, my team is building out a new search backend. The team is able to ship small and incremental changes for this project without having to worry about breaking any of the existing search functionality. For the existing frontend code to integrate with the new backend code, the team is using URL parameters to dark launch this new search backend in production. This gives us great confidence that the new search backend will work, since it’s being continually tested in production. Additionally, we’ll be using an A/B test to verify that the new search backend is better than the existing search backend according to our success metrics.
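The post doesn’t say how users would be split for the A/B test, so the following is only a sketch of one common approach: hashing a stable user identifier so that each user consistently lands in the same group. The function name, the `experiment` label, and the 50/50 default split are assumptions made for illustration.

```python
import hashlib


def ab_bucket(user_id: str, experiment: str, treatment_percent: int = 50) -> str:
    """Deterministically assign a user to the "new" or "current" backend.

    Hashing the experiment name together with the user id means the same
    user always gets the same answer, with no assignment table to store.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    slot = int(digest, 16) % 100  # a stable number from 0 to 99 for this user
    return "new" if slot < treatment_percent else "current"


# The same user is always routed the same way for a given experiment.
print(ab_bucket("user-123", "search-backend"))
```

Raising `treatment_percent` gradually sends more traffic to the new backend, and setting it to 0 sends everyone back to the existing one.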
Dark launching new functionality is another pattern that removes the need for staging environments. It does take some thought to figure out the best way to toggle on or off the new functionality, but when used well dark launching can minimize the impact of new code breaking production.
Later that day, after I had convinced my team that staging servers were holding us back, one of our developers wasn’t able to test out our ticket submission form locally since it depended on another service being available. Our app was missing the proper local development credentials to connect to this other service.
A few Slack messages with the team later, we had a combined effort underway to fix the local development environment. One change to the local development environment made developing locally as simple as, if not simpler than, using the staging environment.
Two months later, the team is able to hold themselves to not using any of the staging environments. There have been a few times where the idea of making an exception has come up. I talked them off the ledge by suggesting they make less risky changes by splitting things up into smaller pull requests, or even dark launching their feature.
If I have convinced you that staging servers are being used too much for the wrong purposes, or if you are taking my more extreme view of just not using staging servers at all, here is some practical advice for moving towards these goals if you’re not there already.
Start by thinking about yourself. Of the features, projects, and bugfixes you have shipped over the past few months, which ones used a staging server to verify that they would work correctly in production? If there have been any, ask yourself what the reason was for having to use the staging server.
Take those reasons and figure out if each one could have been prevented by one or a combination of the following:
If the local development environment was more like production, I could have avoided using staging
If the code change could have been dark launched to production, I could have avoided using staging
If we had more confidence in our tests catching regressions, I could have avoided using staging
Some of the improvements that can be made to limit the number of times staging servers are used can seem like a lot of work. But think of this from a different perspective: how much time is wasted because these inefficiencies exist?