Staging environments slow developers down

For businesses to outperform their competitors and bring ideas to market fast, software development has evolved towards a continuous delivery model of shipping small, incremental improvements. This model works incredibly well for Software-as-a-Service (SaaS) companies, which can deliver features to their customers as soon as those features are fit to release.

The practice of Continuous Delivery requires the master branch to be in a readily shippable state. Decreasing the time it takes to ship a change to production encourages faster iteration and smaller, less risky changes. Additionally, Continuous Deployment, shipping the master branch to production as soon as changes land on it, is achievable with a comprehensive suite of automated tests.

For a development team, keeping this cycle on the order of minutes to tens of minutes is paramount. Slowing down means a slower iteration cycle, which in turn results in larger and riskier changes.

I have noticed my team slowing down by using our handful of staging servers more often than is necessary.

Thankfully, we can get back to a better place than where we started and learn a few things along the way!

Why we have staging servers/environments

My team builds the platform for Shopify’s Help Centre and the merchant-facing experience for contacting Support. The same app is also contributed to by the 20 Technical Writers on our Documentation team.

Technical Writers work alongside the many product teams at Shopify to create and update documentation based on what the product team is building. Part of the process of continuously delivering this documentation is a member of the product team reviewing the changed pages for accuracy.

This is often achieved through a Technical Writer publishing content to one of a handful of staging servers, then directing the product teams to visit the staging server.

This workflow makes sense for the most part, since non-technical people can simply visit the staging server to view the unpublished changes. Having many staging servers isn’t a scalable solution, but that’s a topic for another post.

An effect of having all of these available staging servers is that developers use them to perform various tasks such as:

  • Sharing their work for other developers to look at
  • Testing out risky changes in a production-like environment

It can be pretty easy to rationalize slowing down as being more careful, but this is just a fallacy.

This may seem like a risky outlook on shipping software, since things can go wrong. But when developers are given the freedom to move fast and are not held back by strict process, the best risk-reward balance is struck most of the time. When things do go wrong, a safety net of tests and production tooling makes it easy to figure out what happened and to revert to a previous state, so the impact is minimal.

Photo by Hanson Lu on Unsplash

The Repercussions

Over the past few months I have observed a number of situations where developers have used staging environments instead of better alternatives.

One of the biggest slowdowns in the iteration cycle is the time it takes to get your code reviewed by someone else. It’s an incredibly important step, but shortcuts are often taken, and one of them is reviewing code on a staging server.

It takes far longer to deploy code to a staging server than it does to check out someone’s branch and run the code locally. Getting into the habit of pulling down someone’s changes, reviewing the code, and performing some exploratory testing against a running instance of the app enables a deeper inspection and understanding of the code.

Additionally, using staging servers to test out code “because it doesn’t work on my machine” is an anti-pattern. Developers must prioritize having all features working locally, for everyone, at any time, by default. A dysfunctional local development environment just feeds the vicious cycle of more and more things being tested on staging. Putting the time in to make everything testable in the local development environment pays dividends in speed and developer happiness.

How slow?

Shipping large, risky changes by vetting that they work on staging first gives developers an excuse to iterate at a slower pace. Here’s a concrete example showing how much extra time it takes to test out code on staging.

Dev B is reviewing Dev A’s code. Dev B looks over the changeset, and then asks Dev A to put their code up on staging so that they can verify that the code works as expected. Dev A pushes their code to a staging branch, waits for CI to pass, waits for the deploy to succeed, then notifies Dev B that they can test out the changes. Dev B then gets around to going through the steps to verify that the new changes behave as expected. Dev B then finally gives their sign-off on the changeset, or requests further changes. This entire process, mostly spent waiting for builds and CI, can take 30 minutes or more.

Now let’s see what a modified version of the process looks like if Dev B reviews Dev A’s code on their local machine. Dev B looks over Dev A’s changeset, then pulls down the code to their local machine for further inspection. Dev B starts up the app locally and goes through the steps to verify that the new changes behave as expected. Dev B optionally has the ability to poke around the changed code to gain a better understanding of how it fits in with the existing code. Dev B signs off on the changeset, or requests further changes from Dev A. This process can take five minutes or more, but it is many times faster than using a staging environment.

As we can see, verifying that Dev A’s code works correctly in staging takes at least six times longer on average, because of waiting for code to build, deploys to occur, and even unneeded conversations to coordinate use of the staging environment. The same outcome can be reached much faster by replacing many of the steps with faster equivalents. For example, running CI and performing a deploy aren’t needed when running code locally, and there’s no time spent coordinating with Dev A to put their code up on the staging environment.

Using the staging environment to review someone’s changes may feel faster, but that is a fallacy. Dev B may think: “If I just need to visit the staging environment to review Dev A’s code, then I save myself the time of stashing my local changes, pulling down the code, and starting the app.” True, this saves Dev B some time, but overall it costs Dev A more: Dev A has to push their code to the staging environment, wait for CI to run and a deploy to occur, then notify Dev B to take a look tens of minutes later.

Photo by Ruslan Keba on Unsplash

Where staging environments make sense

As with all hard-and-fast rules, there are exceptions. One of them is validating new configuration for production systems. For example, since it’s not simple to run a local Kubernetes cluster, it’s safer to verify risky changes to Kubernetes Deployment config files in a production-like environment: staging.

Another exception is where people’s wellbeing or livelihood can be on the line. Developing a payment processing service, where breaking things could have financial consequences for users, is one case. A voting system is another critical system where it’s worth taking the time to make sure everything is working correctly.

Antipatterns

While chatting with another developer about this blog post, I asked for examples of the kinds of things they use their staging environment for.

One example was verifying that a UI component library update looks the same in development as it does in production. Since there’s no good automated way to test that the UI doesn’t look broken, it’s quite a manual process to verify that the many screens and states look fine. One gotcha mentioned was that the production build of the JavaScript and CSS assets can differ from the development build. This of course means there is a difference between development and production, and that bugs can slip through and reach users.

A few suggestions came to mind. One was to make development more like the production environment, however that may look. Another was to create a production build of the JavaScript and CSS assets locally during testing and use that to verify that the UI looks fine. Lastly, if possible, make smaller changes that are easier to review and reason about.

Photo by Romain Hus on Unsplash

Dark launching new functionality

Shipping to production can have a certain amount of risk. A code change could crash the app, break a feature, or even cause a worse user experience. What if we could ship to production and drastically reduce these risks?

Let’s talk about dark launching new features and changes. Dark launching is the practice of shipping new code to production but hiding it from most users, to prevent accidentally breaking things or negatively affecting the user’s experience. It can be implemented in a number of different ways:

  • Using the new logic if a special parameter is added to the page’s URL
  • A special cookie set in the user’s browser to enable the new logic
  • A/B testing of the current and new logic
  • Enabling the new logic only for employees
  • A beta flag that can turn on and off the logic at runtime

For example, my team is building out a new search backend. The team is able to ship small and incremental changes for this project without having to worry about breaking any of the existing search functionality. To integrate the existing frontend code with the new backend, the team is using URL parameters to dark launch the new search backend in production. This gives us great confidence that the new search backend will work, since it’s being continually tested in production. Additionally, we’ll be using an A/B test to verify that the new search backend is better than the existing one according to our success metrics.
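
To make the URL-parameter approach concrete, here’s a minimal sketch of what such a toggle could look like in a Rails controller. The parameter name and the two backend classes are hypothetical, made up for illustration rather than taken from our actual implementation:

```ruby
class SearchController < ApplicationController
  def index
    # Dark launch: only exercise the new backend when ?new_search=1 is present,
    # so regular visitors keep hitting the existing, battle-tested path.
    backend =
      if params[:new_search] == "1"
        NewSearchBackend    # hypothetical new backend
      else
        LegacySearchBackend # hypothetical existing backend
      end

    render json: backend.search(params[:query])
  end
end
```

A cookie, employee-only, or beta-flag toggle has the same shape; only the condition changes, which is part of what makes dark launching cheap to adopt.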

Dark launching new functionality is another pattern that removes the need for staging environments. It does take some thought to figure out the best way to toggle on or off the new functionality, but when used well dark launching can minimize the impact of new code breaking production.

Immediate improvements

Later the same day, after I had convinced my team that staging servers were holding us back, one of our developers wasn’t able to test our ticket submission form locally because it depended on another service being available. Our app was missing the local development credentials needed to connect to that service.

A few Slack messages with the team resulted in a combined effort to fix the local development environment. One change made developing locally as simple as, if not simpler than, using the staging environment.

Two months later, the team has held itself to not using any of the staging environments. A few times the idea of making an exception has come up, and I’ve talked them off the ledge by suggesting they make less risky changes by splitting the work into smaller pull requests, or even dark launch their feature.

Photo by Jodie Walton on Unsplash

Recommendations

If I have convinced you that staging servers are being used too often for the wrong purposes, or you’re taking my more extreme view of not using staging servers at all, here is some practical advice for moving towards these goals if you’re not there already.

Start by thinking about yourself. Of the features, projects, and bugfixes you have shipped over the past few months, which used a staging server to verify that they would work correctly in production? If there were any, ask yourself why the staging server was needed.

Take those reasons and figure out if each one could have been prevented by one or a combination of the following:

  • If the local development environment was more like production I could have avoided using staging
  • If the code change could have been dark launched to production I could have avoided using staging
  • If we had more confidence with our tests catching regressions then I could have avoided using staging

Some of the improvements that limit how often staging servers are used can seem like a lot of work. But think of it from a different perspective: how much time is being wasted by leaving these inefficiencies in place?

A few Gotchas with Shopify API Development

I had a fun weekend with my roommate hacking on the Shopify API and learning the Ruby on Rails framework. Shopify makes it super easy to begin building Shopify Apps for the Shopify App Store – essentially the Apple App Store equivalent for Shopify store owners to add features to their customer-facing and backend admin interfaces. Shopify provides two handy Ruby gems to speed up development: shopify_app and shopify_api. An overview of the two gems is given below, followed by their weaknesses.

Shopify provides a handy gem called shopify_app, which makes it simple to start developing an app for the Shopify App Store. The gem provides Rails generators to create controllers, add webhooks, configure the basic models, and add the required OAuth authentication – just enough to get started.

The shopify_api gem is a thin wrapper around the Shopify API. shopify_app integrates it into the controllers automatically, making requests for a store’s data very simple.
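
As a small illustration, a controller action that lists a store’s products can be as short as the sketch below. It assumes the authenticated controller base class that shopify_app provides and shopify_api’s ActiveResource-style resource classes:

```ruby
class ProductsController < ShopifyApp::AuthenticatedController
  def index
    # shopify_app has already activated an API session for the current shop,
    # so shopify_api's resource classes can be called directly.
    @products = ShopifyAPI::Product.find(:all, params: { limit: 10 })
  end
end
```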

Frustrations With the API

The process of getting a developer account and a development store created takes no time at all, and the API documentation is clear for the most part. However, developing against the Plus APIs for the first time can be frustrating. For example, querying the Discount, Gift Card, Multipass, or User APIs results in unhelpful 404 errors. The development store’s admin interface is misleading, since it has a discounts section where discounts can be added and removed.

By default, anyone who signs up to become a developer only has access to the standard API endpoints, with no access to the Plus endpoints. These Plus endpoints are only available to stores that pay for Shopify Plus, and after digging through many Shopify discussion boards I found a Shopify employee explaining that developers need to work with a store that pays for Shopify Plus to get access to them. The 404 error returned by the API didn’t explain this and only added confusion.

One area that could be improved is the scant mention of tiered developer accounts. At a minimum, the API should return a useful error message in the response body explaining what is needed to gain access.

Webhooks Could be Easier to Work With

The shopify_app gem provides a simple way to define the webhooks that should be registered with the Shopify API for the app to function. However, the defined webhooks are registered only once, when the app is added to a store. During development you may add and remove many webhooks for your app, and since they are only registered at install time, the most straightforward way to refresh them is to remove the app from the store and then add it again.

This can become pretty tedious, which is why I did some digging around in the shopify_app code and put together a code sample to synchronize the required webhooks with the Shopify API. Simply hit this controller, or call the containing code somewhere in the codebase.
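
A rough sketch of the approach is shown below. It assumes the webhooks declared in the shopify_app initializer are exposed through ShopifyApp.configuration.webhooks and leans on shopify_api’s ActiveResource-backed ShopifyAPI::Webhook resource, with a placeholder current_shop helper for looking up the shop’s domain and token, so treat it as an outline rather than a drop-in solution:

```ruby
class WebhooksSyncController < ApplicationController
  # Wipes the shop's registered webhooks and re-registers everything
  # currently declared in the shopify_app initializer.
  def create
    # current_shop is a placeholder; look up the shop however your app does.
    ShopifyAPI::Session.temp(current_shop.shopify_domain, current_shop.shopify_token) do
      # Remove whatever the Shopify API currently has registered for this app.
      ShopifyAPI::Webhook.find(:all).each(&:destroy)

      # Re-create each webhook (topic, address, format) from the initializer.
      ShopifyApp.configuration.webhooks.each do |attributes|
        ShopifyAPI::Webhook.create(attributes)
      end
    end

    head :ok
  end
end
```

During development this can be routed to something like POST /webhooks_sync and hit whenever the webhook configuration changes.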

If there’s a better solution to this problem please let me know.

Lastly, to preserve your sanity, the httplog gem is useful for tracking the HTTP calls that shopify_app, shopify_api, and any other gems make.
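
Getting it going should only require adding the gem to the development group of your Gemfile; as far as I know, its defaults log each outgoing request without further configuration:

```ruby
# Gemfile
group :development do
  # httplog prints outgoing HTTP requests and responses, which makes it easy
  # to see what shopify_app, shopify_api, and other gems are doing.
  gem "httplog"
end
```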

Wrapping Up

The developer experience of the Shopify API and App Store is quite pleasant. The platform has been around long enough to build up a flourishing community of people asking questions and sharing code. I believe the issues outlined above can be solved easily and would make Shopify an even more pleasant platform.

The Software Engineering Daily Podcast is Highly Addictive

Over the past several months the Software Engineering Daily podcast has entered my regular listening list. I can’t remember where I discovered it, but I was amazed at the frequency at which new episodes were released and the breadth of topics. Since episodes come out every weekday there’s always more than enough content to listen to. I’ve updated My Top Tech, Software and Comedy Podcast List to include Software Engineering Daily. Here are a few episodes that have stood out:

Scheduling with Adrian Cockroft was quite timely, as part of my final paper for my undergraduate degree focused on the breadth of topics in scheduling. Adrian discussed many of the principles of scheduling and related them to how they were applied at Netflix and at earlier companies. Scheduling is something software developers really need to understand, as it occurs at every layer of the software and hardware stack.

Developer Roles with Dave Curry and Fred George was very entertaining and informative as it presented the idea of “Developer Anarchy”, a different approach to running (or not running) development teams. Instead of hiring Project Managers, Quality Assurance, or DBAs to fill specific niches on a development team, you mainly hire programmers and leave them to perform those tasks as they deem necessary.

Infrastructure with Datanauts’ Chris Wahl and Ethan Banks entertained as much as it informed. This episode had a more casual feel, with the hosts telling stories and bringing years of experience to bear on the current and future direction of infrastructure at all layers of the stack. Comparing the current success of Kubernetes to the not-so-promising OpenStack was quite informative: the many supporting organizations behind OpenStack pulled the project towards different priorities and visions, whereas Kubernetes, driven by a single organization, Google, has one unified vision.


EDIT 2017-02-26 – Add Datanauts episode

Better Cilk Development With Docker

I’m taking a course that focuses on parallel and distributed computing. We use a compiler extension for GCC called Cilk to develop parallel programs in C/C++. Cilk offers developers a simple way to write parallel code, and as a bonus it has been included in GCC since version 4.9.

The unfortunate thing about this course is that the professor provides a hefty 4GB Ubuntu virtual machine just for running the GNU compiler with Cilk support. No sane person would download an entire virtual machine image just to run a compiler.

Docker comes to the rescue. It couldn’t be more space-efficient and convenient to use Cilk from a Docker container. I’ve created a simple Dockerfile, based on Ubuntu 16.04, containing the latest GNU compiler. Here are some Gists showing how to build and run a Docker image that contains the dependencies needed to build and run Cilk programs.

Old Habits Die Hard: Copy and Paste

Copy and paste is bad.

Every single person who uses a computer learns how to copy and paste.

Copy and paste is necessary to perform many tasks.

Old habits die hard.

Email and Word documents and illegally downloaded movies all expect you to use copy and paste because it’s how you’re supposed to do things: copy this email into that folder, move that paragraph of text into the next chapter, copy those illegally downloaded movies to the external hard drive for safekeeping.

There’s a time and place for copy and paste, but why resort to it when you do the same task multiple times every day? It’s passable when the situation can’t be made better, or can it?

Sounds like old habits die hard.

Sure, copy and paste is quick when you’re good at it, but the time adds up. For example, take the process of navigating through a bunch of files and folders to copy the same five files to a different place. Let’s be honest and say it takes a minute of this theoretical person’s time. Based on the work they do, they repeat the same copy and paste job ten more times that day. That time adds up.

Yes, old habits die hard.

Humans are excellent at copy and paste. Guess what else is excellent at copy and paste? Computers!!! Computers are better than humans in every possible way when it comes to performing repetitive copy and paste tasks. Speed. Accuracy. Longevity. It’s a combination that doesn’t disappoint.

Luckily, where my soliloquy is headed involves people who program computers for a living: programmers. Programmers write programs to make computers do things for humans, and copying and pasting is one of those things. So why are programmers still copying and pasting repetitively themselves instead of programming a computer to do it for them?

Old habits die hard…

Let this sink in for a moment…

Programmers are proficient at telling the computer what to do, including how to copy and paste. But they’re still copying and pasting things themselves because they’re really good at it. It’s been a habit since they first started using a computer, however many decades ago.

This shocks me, especially because programmers are paid very well to program computers, yet they spend a chunk of their time performing repetitive copy and paste tasks that they’re fully qualified to program the computer to do for them.

Repetitive copying and pasting is a bad habit for programmers. Knowing this and continuing to do it anyway must involve some masochism. Be a better programmer and get the computer to copy and paste for you!