RubyConf 2019 Talks – Day 2

Here’s a continuation of the previous post covering day 1, this one instead on the talks I attended for Day 2 of RubyConf 2019! Headings are linked to a video of the talk.

Injecting Dependencies For Fun and Profit

Chris Hoffman discussed the basics and the benefits of dependency injection, mentioning that it’s an alternative to mocking when testing. The benefit of dependency injection is that all of the classes using dependency injection list their dependencies explicitly in the initializer. This benefits people who go to read that code later, especially new devs to the team, since all of a classes dependencies are centralized in one location. It also makes testing classes in isolation easier since test doubles of the dependencies can be passed into the classes initializer, compared to the implicit method of mocking objects which can lead to dependencies being forgotten deep in the class.

One of the interesting patterns that Chris’ company adopted (and that I don’t necessarily agree with) to manage dependencies with dependency injection in their codebase is to have a dependency god object. This object is initialized at the start of the program and contains a reference to each dependency in their system. This dependency object is then passed by reference into each classes initializer. When a class needs a dependency it refers to the dependency in the dependency god object. This appears to be a purely functional way of using dependency injection compared to the more popular solution of using globally accessible dependency objects. dry-rb‘s auto_inject is a common dependency injection library which uses the globally accessible dependency pattern.

Overall, dependency injection is a great pattern for scaling medium to large codebases and making testing components simpler.

The Fewer the Concepts, the Better the Code

David Copeland presented the idea of programming with fewer concepts for better code comprehension between many developers. This talk was a bit of a shock since it goes against many of my ideals, but I fully enjoyed challenging my beliefs on the subject. For context, David’s team was just himself and a lead with a non-computer-science background at a small company. When David’s code was reviewed by his lead, the code was critiqued to be simpler and easier to understand. Over time David figured out that using more generic programming language concepts, such as for, if, return, etc. common to most procedural programming languages was what his lead was pushing him towards.

The talk then went into an example of some code which a Rubyist would have written with each, map, implicit returns, etc. contains many more concepts that a developer would have to know about compared to the same code written with much fewer concepts. An example of the benefit of writing code to use these generic programming language concepts is that learning to use new programming languages can be much simpler since they all generally have the same shared concepts. Onboarding new developers onto the team can be much faster if the dev only has to understand a small subset of programming language features. The Go programming language was compared to this practice as it has a smaller number of concepts than other programming languages.

At the end of the talk I asked the question about whether this style of programming may outweigh benefits by making it easier to introduce more bugs. Using functional programming language features such as the Enumerable collection of functions in Ruby can make code much easier to reason about. David agreed that more bugs are definitely a possibility, but he doesn’t have anecdotal evidence from his team.

Disk is Fast, Memory is Slow. Forget all you Think you Know

Another controversial talk I wanted to challenge my beliefs with, this time challenging the principle of memory not always being faster than disk. Daniel Magliola presented this conundrum in the form of a improvement he was attempting to make. The improvement was making metrics available for his cluster of forking Unicorn processes. When using Prometheus to collect metrics from apps, it queries each app at a specific endpoint to read in the metrics and their values. The problem with forking web servers is that when the request comes in to return all of the metrics, the request is dispatched to one of the Unicorn processes, only returning that process’ metrics, not the group of forked Unicorn processes as it should.

Daniel went down the rabbit hole on this GitHub issue looking for performant ways to support metrics collection for forking webservers. With the goal of keeping the recording of metrics as close to 1 microsecond as possible, some solutions that were investigated involved storing metrics in Redis, the Ruby class PStore which transactionally stores a hash to a file, and tenderlove/mmap library to share a memory mapped hash to each process. Unfortunately none of the potential solutions could beat 1 microsecond.

The solution Daniel discovered, and expertly discussed throughout his talk was using plain old files and file locks. This solution ended up only taking ~6 microseconds per metric write and was much more reliable and simpler than dealing with mmap’ed memory, or more running infrastructure. The title of the talk was misleading, and was touched on near the end of the talk as this file-based solution benefitted from operating system optimizations made to cache writes in main memory and disk caches. According to the program the file was updated successfully to disk, with proper locking to prevent multiple writers tripping over each other, but this was all possible by the performant abstraction our modern operating systems provide us with.

Digging up Code Graves in Ruby

Noah Matisoff went into how code graves, aka dead code comes about. Oftentimes developers can be modifying existing parts of code and stop calling other pieces of code, either in the same or different file. That code may still have tests, so test code coverage metrics can’t really help here. Feature flags, where 100% of users are going through one code path and not the other are also prime candidates for code that doesn’t need to exist.

Code coverage tools can be run in production, or in development and help give a good idea of what parts of code are never reached. Static code analysis tools can also help determine if code isn’t referenced anywhere, but it is a hard problem to solve with Ruby since the language isn’t typed and is quite dynamic. Another solution to help keep dead code out of codebases was to add todos to the codebase. Todos can be setup to remind developers to remove bits of code from the codebase or perform other actions. Some automations were shown to make todos more actionable.

RubyConf 2019 Talks – Day 1

I attended RubyConf this year in Nashville, Tennessee with a few of my teammates from Shopify. What a great city and a great first time attending RubyConf!

I took notes on many of the talks I attended and here are the summaries for the first of the three days. Day 2 is available here. Headings that have links go to a video of the talk.

Matz Keynote – Ruby Progress Report

Matz started off the conference with his talk on the upcoming Ruby 3, talking about some upcoming features with it, and the timeline. Ruby 3 will absolutely be released at the end of 2020, removing half-baked features if necessary to keep it on track. This probably also means that if the 3×3 performance goals aren’t fully met, then it’ll still be shipped. He spent some time on talking about being a Rubyist, as the majority of attendees were new to RubyConf, encouraging people to have discussions and contribute to the future of Ruby.

Matz went into some of the new features going into Ruby 2.7 and Ruby 3, and some of the features or experiments being removed. Some of the biggest hype was around the addition of pattern matching, the just in time compiler (JIT), emojis (though Matz didn’t think so), type checking, static analysis, and an improved concurrency model via guilds (think Javascript workers) and fibers. Some features or experiments that were removed were the .: (shorthand for Object#method), the pipeline operator, deprecating automatic conversion of hash to keyword arguments. Some attendees were vocal about getting more rationale about removing these features, and Matz was more than accommodating to explain in more detail.

No Return: Beyond Transactions in Code and Life

Avdi Grimm’s talk focused on discussing the unlifelike constraints that are imposed on users when performing things online. For example, filling out a survey or form online may result in the user losing their progress if they exit their browser. In real life this doesn’t happen, so why should we constrain these transactions so much? Avdi recommends that when building out these processes, these transactions, that we should instead think of it as a narritive, one stream of information sharing that only requires the user to complete a step when it’s really necessary. Avdi related this to our code by suggesting a few concepts that can make our programs more narrative-like such as embracing state and history of data by utilizing event-sourcing/storming and temporal modelling, failing forwards in code by treating exceptions as data and expecting failures, and interdependence in code by using back pressure, and circuit breakers.

Investigative Metaprogramming

Betsy Haibel talked about an effective way of figuring out a bug during a potentially painful upgrade of their Rails app to 6.0. Through the use of metaprogramming, she was able to fix a frozen hash modification bug that would have otherwise been quite difficult to debug. She accomplished this feat by monkey patching the Hash#freeze method, saving a backtrace whenever it is called. Then in the Hash#[]= method, rescue any runtime exceptions that occur and start a debugger session. This helped her narrow down exactly where the hash was frozen earlier on in the code.

Besty then went into detail on what metaprogramming is, and how it differs from language to language. Java, for example has distinct loadtime and runtime phases when the application is starting up. Ruby, on the other hand is both loading classes and executing code at the same time since it’s all performed together during runtime.

Lastly, the talk provided a pattern for using metaprogramming to investigate bugs or other problems in code. Through reflection, recording, and reviewing, the same pattern can be applied to help debug even the most complex code. The reflection step makes up determining what part of the code early on leads to the program failing. The moment that it occurs can be found by inspecting the backtrace at that point in time. Next is the recording step where we want to patch the code that we’ve identified from the reflection step to save the backtrace. This can be done either by saving the caller to an instance variable, class variable, logging. To get a foothold into the code, the patching can be accomplished by using Module#prepend or even the TracePoint library. Lastly, reviewing is the step in which we observe an event in the system (eg. an Exception) and either pause the world or log some info for further reading. An example of this would be to put in a breakpoint or debugger statement, optionally making it a conditional breakpoint to help filter through the many occurrences.

Ruby Ate My DSL!

Daniel Azuma presented about what DSLs (Domain Specific Languages) are, the benefits of them, and how they work. One of the biggest takeaways from this talk was that DSLs are more like Domain Specific Ruby as we’re not building our own language, instead the user of these DSLs should fully expect to be able to use Ruby while using DSLs.

Daniel also went on to mention how to build your own DSL, mentioning a few gotchas as he went. One of those was that since instance_eval is used throughout implementing DSLs, that we should be aware of users clobbering existing instance variables and methods. One solution is to have a naming convention for the DSLs internal instance variables and private methods (eg. prefixing with underscore characters). Alternatively, another way of preventing this clobbering from going on is to separate the DSL objects from the implementation which operates on those objects. This then has the effect that the user of the DSL has the minimum surface area needed to set the DSL up, removing the possibility of overwriting instance variables or methods the internal DSL needs to run.

Design DSLs which look and behave like classes. Specifically, whenever blocks are used, have them map to an instance of a class. RSpec is a great example of this where describe and it calls are blocks which create instances of classes. The it call creates instances that belong to the describe instance. Where things get more interesting and lifelike is if helper methods and instance variables defined higher up in a DSL can be used further down in the DSL. This is the concept of lexical scoping.

Lastly, constants are a pain to work with in Ruby. They don’t behave as expected when using blocks and evals. Some DSLs provide alternatives to constants, for example RSpec’s let.

mruby/c: Running on Less Than 64KB RAM Microcontroller

Hitoshi HASUMI presented mruby/c, an mruby implementation focused on very resource constrained devices. Where mruby focuses on devices with 400k of memory, mruby/c is for devices with 40k of memory. Devices with this small amount of memory can be microcontrollers which are cheap to run and offer many benefits over devices which run operating systems. Some benefits are instantaneous startup and being more secure.

Hitoshi focused his talk on the work he performed building out IoT devices to monitor temperatures of ingredients at a sake brewery in Japan. These devices had a way for workers to measure temperatures, display the reading, as well as send that reading back to a server for further processing. Hitoshi made it clear that there are many different thing that could go wrong in the intense environment of a brewery. High temperatures, hardware failure, resource constraints, etc.

The latter half of the talk was focused on how mruby/c works and how to use it. mruby/c uses the same bytecode as mruby, but removes a few features that regular Ruby developers are used to having, namely: modules and the stdlib. mruby/c compiles down to C files and provides it’s own realtime operating system. Hitoshi finishes the talk with plugging a number of libraries and tools that he’s developed to help with debugging, testing, and generating code. Those being mrubyc-debugger, mrubyc-test, and mrubyc-utils, respectively.

Statistically Optimal API Timeouts

Daniel Ackerman discussed the widespread use of APIs and how timeouts for those remote requests are not being configured efficiently. He introduces the problem that timeouts should be optimized for the best user experience – the fastest response. Given a slow responding API request, we should timeout if we have high confidence that the request is taking too long. He prefixed the rest of his talk explaining that setting the timeout to the 95th percentile is a quick but accurate estimate.

Since APIs are all different, Daniel presents a mathematical proof of determining statistically optimal API request timeouts. By analyzing a histogram of the API response times, we can determine the optimal timeout that balances user experience with timing out requests. Slow API requests often mean that the service is under heavy load or not responding.

The Ultimate Guide to Ruby Timeouts was mentioned as a go-to source for configuring timeouts and knowing which exceptions are raised for many commonly used libraries. Definitely a useful resource. Daniel finished his talk with a plug to his gem rb_maxima, a library which makes it easy to use the Maxima algebraic system from Ruby.

Collective Problem Solving in Software

Jessica Kerr talked about the idea of cameratas – the concept of a group of people who discuss and influence the trends of a certain area. More formally, camerata came from the Florentine Camerata, a group of renaissance musicians and artists gathered in Florence, Italy who helped develop the genre of opera. Their work was revolutionary at the time.

Jessica then related it to the great ideas that have come out of ThoughtWorks, a London-based consulting company. Their incredible contributions over the years have included the concepts of Agile, CI, CD, and DevOps to name a few, have influenced the entire software industry and has set the bar higher.

In general, great teams make great people. Software teams are special in that they consist of the connections between the people in the team as well as the tools that the team uses. Jessica relates this to a big socio-technical system, introducing the term symmathesy to capture the idea that teams and their tools learn from each other. No one person has a full understanding of the systems they work on, therefore the more symmathesy going on in the team, the better the team and system is. This is similar to the concept of senior developers being able to understand the bigger picture when it comes to teams, tools, and people compared to new developers usually concerned about their small bit of code.

The talk was closed by encouraging dev teams to incentivize putting the team first compared to the individual, grow teams by increasing the flow of information sharing and back and forth with their tools. Lastly, great developers are symmathesized.


Summaries of Day 2’s talks are available here

Twenty Five

My friends half-jokingly call it my quarter-century birthday. This year I’m a day late, and according to previous posts, sick again at this same time of year surprisingly. It may have been from the surprise birthday party my friends threw. Oh well, lets get on with the show.

It’s always hard to remember back to what October of last year was like, especially after a year like this. Around November of 2018 I had a wonderful surprise promotion: becoming my own team’s manager. With that came a whole new challenge of understanding people instead of just software. I was thrown in the deep end one day and had to figure out what managing people was all about. From talking with colleagues to reading books, I performed a lot of research over the past year to understand what it means to manage people, especially as a manager of a development team.

Besides career-related achievements, I also had the great opportunity to vacation in Mexico and New York with a bunch of my close buddies, and to partake in a few extra ski trips over the winter to Mont Tremblant and Camp Fortune. I surprised myself with my skiing skills – I must not have gone very recently. Over the summer a number of visits to the cottage was made, one of the times with a bunch of good friends.

Something I wasn’t expecting to have happened at all was to get my SSI Open Water Diving certification. This allows me to go scuba diving to maximum depths of 60 ft. A friend of mine suggested the idea to a few of us and we all had nothing to lose, so why not! Five pool dive sessions, three classroom sessions, and four dives at Morrison’s Quarry over a weekend gave the three of us the ability to travel anywhere around the world and go diving. It was a great learning experience as we had excellent instructors and one on one training as luck would have it. The three of us will be planning a diving trip in the new year!

This was the year where more weddings made it my way. One of them was for my longtime cottage neighbour who is around my age, and another wedding was for my Aunt. Both were vastly different weddings, but both were very enjoyable.

In late fall and early spring there’s a large period of time where bad weather holds me back from going running and cycling in the summertime, or skating on the canal in the wintertime. To keep myself active I bought myself an indoor trainer for my road bike. I was able to train indoors a few times a week pretty consistently over the winter. Keeping this training up for all of the cold months allowed me to jump on my bike in the spring and not miss a beat.

Here’s a few interesting stats of mine from the past year:

  • 57 hours of running/cycling/training, 1220 km total
  • 14 articles for this blog written, 7 published
  • 7 books read – the either being In The Plex, or the Elon Musk biography
  • 1932 GitHub contributions from work and personal projects

🍻 to another year!

Accelerate your team as its Lead

Building software can be hard. Requirements can be swept under the rug, only to find out that: Whoops. We shouldn’t have forgotten about those. Stakeholders requests can silently be forgotten, only to be brought up later, eroding trust. Decisions can take a long time to make if the right people are missing, and even if the room doesn’t know they have the power to. Developers may also be blocked on their work with not knowing that one critical piece of information. Who best to alleviate the previously mentioned pains other than the team’s Lead?

Call the position a Lead Developer. Call it a Development Manager. Call it whatever. Even if you don’t have the title, the ability to influence and lead people to make the team’s product, people, or processes better are well needed in all development teams.

As a Lead, your back is on the line when it comes to everything your team does. The glory you pass down onto the individual team members, or the entire team. The failures you have to suck up and own yourself. Since the engineering lead is on the line when it comes to the team’s output and performance, it’s a large incentive to use your experiences, skills, and contacts to supercharge your team.

One of those methods of influence I have been using recently is picking up and coming to some sort of closure for decisions that haven’t been made, or information that is needed by the development team.

I am of the type of Lead who will perform a gut check and directly ask a developer if they’re blocked on missing information. If the way to unblock them is clear and simple, I point them in the right direction, backing it up with whatever details about the technical, vision, or user story – all without having to reach out to the person best suited to answer. If something is of importance, where the wrong answer could waste time or affect the product in a negative way, reaching out to the person who would know the answer is often necessary. Making it your personal mission to figure that out, and report back to the dev about the answer builds trust that yes, you the dev Lead can help.

Side note: If the dev is skilled enough in knowing a problem area and is able to talk with stakeholders or the people necessary to help solve their problem, encourage them to own figuring this out themselves instead of dealing with it yourself. Empowering your dev to be more independent through dealing with people they may not have met grows the number of contact they have, improves their ability to be resourceful, and can result in being more engaged with the problem. Since this may be an uncharted area for the dev, one on one time is quite valuable for talking about your report’s recent situations, helping them problem solve, and strategizing.

We are all in the 21st century working at high tech organizations – meetings are terrible since we have a wealth of different synchronous and asynchronous tools to get the same or better outcome from a meeting. Therefore I don’t like attending most meetings. Though sometimes you just have to get multiple people into a physical or virtual room and talk things through. Gaining the skills to be a meeting facilitator is very beneficial. It’s basically the practice of having an agenda, leading a meeting, keeping people on track, coming to conclusions on the talking points, and lastly creating action items. Without a meeting facilitator, it can be easy for a meeting to become taken over by one speaker or topic, leaving all other items to talk about untouched. Action items can also fall by the wayside, by either not being discussed, or people not being held accountable, which can absolutely demotivate people on the effectiveness of that meeting, especially if it’s recurring.

Sometimes you might be missing one critical person in the room. It’s always painful to know that We’re not going to get to a decisive answer on what we should do since we’re missing Jimmy. Getting good at honing in on this skill helps make your meetings productive, either by cancelling them to save everyone’s time, or consulting with the missing people beforehand. Giving this intuition as feedback to other people who host meetings can only help reduce this from happening in the future. No one likes wasting time.

It’s one thing to have the meeting and come out feeling Great! Everyone knows what needs to be done. Time to sit back and watch my genius planning unfold. Wrong. That’s half of the battle. You still have to course correct from time to time. This could mean following up on the people assigned action items to see if they need help or are blocked, freeing up devs from tasks that are of lesser of priority, and making sure the right people are being notified when action items are completed.

But when the stars do align and the team gets shit done, don’t stay entirely humble. Remember to give yourself some credit for accelerating the team.

Staging environments slow developers down

For businesses to outperform their competitors and bring ideas to the market fast, Software Development has evolved towards a continuous delivery model of shipping small, incremental improvements to software. This method works incredibly well for Software-as-a-Service (SaaS) companies, which can deliver features to their customers as soon as features are fit to release.

The practice of Continuous Delivery require the master branch to be in a readily shippable state. Thus decreasing the time to ship a change to production encourages faster iteration and smaller, less riskier, changes to be made. Additionally, Continuous Deployment, the shipping of the master branch as soon as changes make it to master, is achievable through a comprehensive suite of automated tests.

For a development team, keeping this cycle on the order of minutes to tens of minutes is paramount. Slowing down means a slower iteration cycle, therefore resulting in larger and riskier changes being made.

I have noticed my team slowing down by using our handful of staging servers more often than is necessary.

Thankfully we can get back to better than we left off and learn a few things along the way!

Why we have staging servers/environments

My team builds the platform for Shopify’s Help Centre and the Merchant facing experience for contacting Support. This same app is also contributed to by our 20 Technical Writers on the Documentation team.

Technical Writers work alongside the many product teams at Shopify to create and update documentation based on what the product team is building. Part of the process of continuously delivering this documentation is a member of the product team reviewing the changed pages for accuracy.

This is often achieved through a Technical Writer publishing content to one of a handful of staging servers, then directing the product teams to visit the staging server.

This workflow makes sense for the most part, since non-technical people can simply visit the staging server to view the unpublished changes. This workflow of having many staging servers isn’t a scalable solution, but that’s for another post.

An effect of having all of these available staging servers is that developers use them to perform various tasks such as:

  • Sharing their work for other developers to look at
  • Testing out risky changes in a production-like environment

It can be pretty easy to rationalize slowing down as being more careful, but this is just a fallacy.

This is an alternative outlook on shipping software since things can go wrong. But when developers are given the freedom to move fast, and are not held down by strict process, most of the time the best risk-reward balance is made. When things do go wrong, having a safety net of tests and production tooling to make it easy to figure out what went wrong, along with the ability to revert back to a previous state. The impact is therefore minimal.

Photo by Hanson Lu on Unsplash
Photo by Hanson Lu on Unsplash

The Repercussions

Over the past few months I have observed a number of situations where developers have used staging environments instead of better alternatives.

One of the biggest slowdowns in iteration cycle is the time to get your code reviewed by someone else. It’s an incredibly important step, but there are shortcuts that can be taken. One of those shortcuts being reviewing code on a staging server.

It takes way longer to deploy code to a staging server than it does to locally checkout someone’s branch and run the code locally. Getting into the habit of pulling down someones changes, reviewing the code, and performing some exploratory testing with a running instance of the app enables a deeper inspection and understanding of the code.

Additionally, using staging servers to test out code “because it doesn’t work on my machine” is an anti-pattern. Developers must prioritize having all features working locally for everyone, at any time, by default. A dysfunctional local development environment just feeds the vicious cycle of more and more things should be tested on staging. Putting the time in to make everything testable in the local development environment pays dividends in speed and developer happiness.

How slow?

Shipping large, risky changes by vetting that they work on staging first give developers the shortcut to iterate at a slower pace. Here’s a concrete example showing how much extra time it takes to test out code on staging.

Dev B is reviewing Dev A’s code. Dev B looks over the changeset, and then asks Dev A to put their code up on staging so that they can verify that the code works as expected. Dev A pushes their code to a staging branch, waits for CI to pass, waits for the deploy to succeed, then notifies Dev B that they can test out the changes. Dev B then gets around to going through the steps to verify that the new changes behave as expected. Dev B then finally gives their sign-off on the changeset, or requests further changes. This entire process, mostly spent waiting for builds and CI, can take 30 minutes or more.

Now lets see what a modified version of the process looks like if Dev B reviews Dev A’s code on their local machine. Dev B looks over Dev A’s changeset, then pulls down the code to their local machine for further inspection. Dev B starts up the app locally and goes through the steps to verify that the new changes behave as expected. Dev B optionally has the ability to poke around the changed code to gain a better understanding of how it fits in with the existing code. Dev B signs-off on the changeset, or requests further changes from Dev A. This process can take 5 minutes or more, but is magnitudes faster than using a staging environment.

As we can see, the time taken to verify that Dev A’s code works correctly in staging takes at least six times longer on average due to having to wait for code to build, deploys to occur, and even unneeded conversations to coordinate using the staging environment. The same outcome can be performed much faster by replacing many of the steps with faster equivalents. For example, running CI and performing a deploy isn’t needed when running code locally. There’s also no time spent coordinating with Dev A to put their code up on the staging environment.

There may be perceived speed with using the staging environment to review someone’s changes, but this is only a fallacy. Dev B may think: “If I just need to visit the staging environment to review Dev A’s code, then I save myself time from having to stash my local changes, pull down the code, and start the app.” Correct, this saves Dev B’s time, but overall causes Dev A to take more of a hit to their time. Dev A has to push their code up to the staging env, causing CI to run, a deploy to occur, then notify Dev B to take a look tens of minutes later.

Photo by Ruslan Keba on Unsplash
Photo by Ruslan Keba on Unsplash

Where staging environments make sense

With all hardfast rules there are some exceptions. One of those exceptions is to validate new configuration for production systems. For example, since it’s not simple to run a local Kubernetes cluster, it’s safer to verify risky changes to Kubernetes Deployment config files by using a production like environment: staging.

Another example is where lives or the wellbeing of people can be on the line. An example of this would be developing a payment processing service where breaking things could result in financial consequences for users of the system. Even a voting system would be an example of a critical system where it’s necessary to take the time to make sure everything is working correctly.

Antipatterns

Chatting with another developer about this blog post, I asked for some examples as to what kinds of things they use their staging environment for.

One example was verifying that updating UI component libraries looked the same between development and production. Since there’s no real good way to test that the UI doesn’t look broken, it’s quite a manual process to verify the many screens and states look fine. One gotcha that was mentioned was that the production build of the Javascript and CSS assets can be different from the development build. This of course means that there is a difference between development and production, which means that bugs can slip through and get to their users.

Off the top of my head a few suggestions came to mind. One idea was to make development more like the production environment (however that may be). During the testing process create a production build of the Javascript and CSS assets locally and use that to verify that the UI looks fine. Lastly, if possible make smaller changes that are easier to review and reason about.

Photo by Romain Hus on Unsplash
Photo by Romain Hus on Unsplash

Dark launching new functionality

Shipping to production can have a certain amount of risk. A code change could crash the app, break a feature, or even cause a worse user experience. What if we could ship to production and drastically reduce these risks?

Let’s talk about dark launching new features and changes. Dark launching is the practice of shipping new code to production, but hiding it from most users to prevent accidentally breaking things or negatively affecting the user’s experience. This could be implemented a number of different ways:

  • Using the new logic if a special parameter is added to the page’s URL
  • A special cookie set in the user’s browser to enable the new logic
  • A/B testing of the current and new logic
  • Enabling the new logic only for employees
  • A beta flag that can turn on and off the logic at runtime

For example, my team is building out a new search backend. The team is able to ship small and incremental changes for this project without having to worry about breaking any of the existing search functionality. For the existing frontend code to integrate into the new backend code, the team is using URL parameters to dark launch this new search backend in production. This gives us great confidence of the new search backend will work since it’s being continually tested in production. Additionally, we’ll be using an A/B test to verify that the new search backend is better than the existing search backend according to our success metrics.

Dark launching new functionality is another pattern that removes the need for staging environments. It does take some thought to figure out the best way to toggle on or off the new functionality, but when used well dark launching can minimize the impact of new code breaking production.

Immediate improvements

Later that day after convincing my team that staging servers were holding us back, one of our developers wasn’t able to test out our ticket submission form locally since it depended on another service to be running. Our app was missing the proper local development credentials to connect to this other service.

A few Slack messages later with the team resulted in a combined effort to fix the local development environment. One change to the local development environment made developing locally as simple if not simpler than using the staging environment.

Two months later the team is able to hold themselves to not using any of the staging environments. There have been a few times where the idea of making an exception has come up. I talked them off the ledge by suggesting to make less riskier changes by splitting things up into smaller pull requests, and even dark launching their feature.

Photo by Jodie Walton on Unsplash
Photo by Jodie Walton on Unsplash

Recommendations

If I have convinced you on staging servers being used too much for the wrong purposes, or are taking my more extreme view of just don’t use staging servers, here is some practical advice to move towards these goals if you’re not there already.

Start with thinking about yourself. From the features, projects, and bugfixes that have been shipped by yourself over the past few months, which have used a staging server to verify that they’ll work correctly in production? If there have been any, ask yourself what the reason was for having to use the staging server.

Take those reasons and figure out if each one could have been prevented by one or a combination of the following:

  • If the local development environment was more like production I could have avoided using staging
  • If the code change could have been dark launched to production I could have avoided using staging
  • If we had more confidence with our tests catching regressions then I could have avoided using staging

Some of the improvements that can be made to limit the amount of times staging servers are used can seem like a lot of work. But think of this from a different perspective: how much time is wasted due to these inefficiencies being here?