Twenty-Five

My friends half-jokingly call it my quarter-century birthday. This year I’m a day late and, if previous posts are any indication, unsurprisingly sick again at this time of year. It may have been from the surprise birthday party my friends threw. Oh well, let’s get on with the show.

It’s always hard to remember back to what October of last year was like, especially after a year like this. Around November of 2018 I had a wonderful surprise promotion: becoming my own team’s manager. With that came a whole new challenge: understanding people instead of just software. I was thrown in the deep end one day and had to figure out what managing people was all about. From talking with colleagues to reading books, I did a lot of research over the past year into what it means to manage people, especially as the manager of a development team.

Besides career-related achievements, I also had the great opportunity to vacation in Mexico and New York with a bunch of my close buddies, and to take a few extra ski trips over the winter to Mont Tremblant and Camp Fortune. I surprised myself with my skiing skills – it must have been a while since I last went. Over the summer I made a number of visits to the cottage, one of them with a bunch of good friends.

Something I wasn’t expecting at all was getting my SSI Open Water Diver certification, which allows me to scuba dive to maximum depths of 60 ft. A friend of mine suggested the idea to a few of us and we all had nothing to lose, so why not! Five pool dive sessions, three classroom sessions, and four dives at Morrison’s Quarry over a weekend gave the three of us the ability to travel anywhere in the world and go diving. It was a great learning experience, as we had excellent instructors and, as luck would have it, one-on-one training. The three of us will be planning a diving trip in the new year!

This was the year when more weddings came my way. One of them was for my longtime cottage neighbour who is around my age, and another was for my aunt. The two were vastly different weddings, but both were very enjoyable.

In late fall and early spring there’s a long stretch where bad weather keeps me from the running and cycling I do in the summertime, and from skating on the canal in the wintertime. To keep myself active I bought an indoor trainer for my road bike, and trained indoors a few times a week pretty consistently over the winter. Keeping this up through the cold months let me jump on my bike in the spring without missing a beat.

Here are a few interesting stats of mine from the past year:

  • 57 hours of running/cycling/training, 1220 km total
  • 14 articles for this blog written, 7 published
  • 7 books read – the best being either In The Plex or the Elon Musk biography
  • 1932 GitHub contributions from work and personal projects

🍻 to another year!

Accelerate your team as its Lead

Building software can be hard. Requirements can be swept under the rug, only for us to find out later: whoops, we shouldn’t have forgotten about those. Stakeholders’ requests can silently be forgotten, only to be brought up later, eroding trust. Decisions can take a long time to make if the right people are missing, or if the people in the room don’t realize they have the power to decide. Developers can be blocked on their work for want of one critical piece of information. Who better to alleviate these pains than the team’s Lead?

Call the position a Lead Developer. Call it a Development Manager. Call it whatever. Even if you don’t have the title, the ability to influence and lead people to make the team’s product, people, or processes better is sorely needed on every development team.

As a Lead, your back is on the line for everything your team does. The glory you pass down to individual team members or to the entire team; the failures you suck up and own yourself. Since the Lead is accountable for the team’s output and performance, there’s a large incentive to use your experience, skills, and contacts to supercharge your team.

One method of influence I have been using recently is picking up decisions that haven’t been made, or information the development team needs, and driving them to some sort of closure.

I am the type of Lead who will perform a gut check and directly ask a developer if they’re blocked on missing information. If the way to unblock them is clear and simple, I point them in the right direction, backing it up with whatever technical, vision, or user-story details are relevant – all without having to reach out to the person best suited to answer. If something is important enough that the wrong answer could waste time or negatively affect the product, reaching out to the person who would know the answer is often necessary. Making it your personal mission to figure that out and report back to the dev builds trust that yes, you the dev Lead can help.

Side note: if the dev knows the problem area well enough and is able to talk with the stakeholders or the people necessary to solve their problem, encourage them to own figuring this out themselves instead of dealing with it yourself. Empowering your dev to be more independent by dealing with people they may not have met grows the number of contacts they have, improves their ability to be resourceful, and can leave them more engaged with the problem. Since this may be uncharted territory for the dev, one-on-one time is quite valuable for talking through your report’s recent situations, helping them problem solve, and strategizing.

We all work at high-tech, 21st-century organizations – and since we have a wealth of synchronous and asynchronous tools that can achieve the same or better outcomes than a meeting, I don’t like attending most meetings. Sometimes, though, you just have to get multiple people into a physical or virtual room and talk things through. Gaining the skills of a meeting facilitator is very beneficial: having an agenda, leading the meeting, keeping people on track, coming to conclusions on the talking points, and lastly creating action items. Without a facilitator, it’s easy for a meeting to be taken over by one speaker or topic, leaving the other items untouched. Action items can also fall by the wayside, either by not being discussed or by people not being held accountable, which can absolutely demotivate people about the effectiveness of that meeting, especially if it’s recurring.

Sometimes you might be missing one critical person in the room. It’s always painful to realize: “We’re not going to get a decisive answer on what we should do since we’re missing Jimmy.” Honing this skill helps make your meetings productive, either by cancelling them to save everyone’s time or by consulting the missing people beforehand. Giving this intuition as feedback to other people who host meetings can only help reduce this from happening in the future. No one likes wasting time.

It’s one thing to have the meeting and come out feeling: “Great! Everyone knows what needs to be done. Time to sit back and watch my genius planning unfold.” Wrong. That’s only half the battle. You still have to course correct from time to time. This could mean following up with the people assigned action items to see if they need help or are blocked, freeing up devs from tasks of lesser priority, and making sure the right people are notified when action items are completed.

But when the stars do align and the team gets shit done, don’t stay entirely humble. Remember to give yourself some credit for accelerating the team.

Staging environments slow developers down

For businesses to outperform their competitors and bring ideas to market fast, software development has evolved towards a continuous delivery model of shipping small, incremental improvements. This model works incredibly well for Software-as-a-Service (SaaS) companies, which can deliver features to their customers as soon as those features are fit to release.

The practice of Continuous Delivery requires the master branch to be in a readily shippable state. Decreasing the time to ship a change to production encourages faster iteration and smaller, less risky changes. Additionally, Continuous Deployment – shipping the master branch as soon as changes land on it – is achievable through a comprehensive suite of automated tests.

For a development team, keeping this cycle on the order of minutes to tens of minutes is paramount. Slowing down means a slower iteration cycle, which in turn results in larger and riskier changes.

I have noticed my team slowing down by using our handful of staging servers more often than is necessary.

Thankfully we can get back to where we were – and better – while learning a few things along the way!

Why we have staging servers/environments

My team builds the platform for Shopify’s Help Centre and the merchant-facing experience for contacting Support. This same app is also contributed to by the 20 Technical Writers on our Documentation team.

Technical Writers work alongside the many product teams at Shopify to create and update documentation based on what the product team is building. Part of the process of continuously delivering this documentation is a member of the product team reviewing the changed pages for accuracy.

This is often achieved through a Technical Writer publishing content to one of a handful of staging servers, then directing the product teams to visit the staging server.

This workflow makes sense for the most part, since non-technical people can simply visit the staging server to view the unpublished changes. Having many staging servers isn’t a scalable solution, though – but that’s for another post.

An effect of having all of these available staging servers is that developers use them to perform various tasks such as:

  • Sharing their work for other developers to look at
  • Testing out risky changes in a production-like environment

It can be pretty easy to rationalize slowing down as being more careful, but this is just a fallacy.

Shipping software this way acknowledges that things can go wrong. But when developers are given the freedom to move fast and are not held down by strict process, the best risk-reward balance is struck most of the time. When things do go wrong, a safety net of tests and production tooling makes it easy to figure out what happened, along with the ability to revert to a previous state. The impact is therefore minimal.

Photo by Hanson Lu on Unsplash

The Repercussions

Over the past few months I have observed a number of situations where developers have used staging environments instead of better alternatives.

One of the biggest slowdowns in the iteration cycle is the time it takes to get your code reviewed by someone else. It’s an incredibly important step, but there are shortcuts that can be taken – one of them being reviewing code on a staging server.

It takes way longer to deploy code to a staging server than it does to check out someone’s branch and run the code locally. Getting into the habit of pulling down someone’s changes, reviewing the code, and performing some exploratory testing against a running instance of the app enables a deeper inspection and understanding of the code.

Additionally, using staging servers to test out code “because it doesn’t work on my machine” is an anti-pattern. Developers must prioritize having all features working locally, for everyone, at any time, by default. A dysfunctional local development environment just feeds the vicious cycle of “more and more things should be tested on staging”. Putting the time in to make everything testable in the local development environment pays dividends in speed and developer happiness.

How slow?

Shipping large, risky changes by first vetting that they work on staging gives developers an excuse to iterate at a slower pace. Here’s a concrete example showing how much extra time it takes to test out code on staging.

Dev B is reviewing Dev A’s code. Dev B looks over the changeset, and then asks Dev A to put their code up on staging so that they can verify that the code works as expected. Dev A pushes their code to a staging branch, waits for CI to pass, waits for the deploy to succeed, then notifies Dev B that they can test out the changes. Dev B then gets around to going through the steps to verify that the new changes behave as expected. Dev B then finally gives their sign-off on the changeset, or requests further changes. This entire process, mostly spent waiting for builds and CI, can take 30 minutes or more.

Now let’s see what a modified version of the process looks like if Dev B reviews Dev A’s code on their local machine. Dev B looks over Dev A’s changeset, then pulls down the code to their local machine for further inspection. Dev B starts up the app locally and goes through the steps to verify that the new changes behave as expected. Dev B optionally has the ability to poke around the changed code to gain a better understanding of how it fits in with the existing code. Dev B signs off on the changeset, or requests further changes from Dev A. This process can take 5 minutes or more, but it is many times faster than using a staging environment.

As we can see, verifying that Dev A’s code works correctly in staging takes at least six times longer on average, due to waiting for code to build, deploys to occur, and even the unneeded conversations to coordinate use of the staging environment. The same outcome can be achieved much faster by replacing many of the steps with faster equivalents. For example, running CI and performing a deploy aren’t needed when running code locally. There’s also no time spent coordinating with Dev A to put their code up on the staging environment.

There may be perceived speed in using the staging environment to review someone’s changes, but this is only a fallacy. Dev B may think: “If I just need to visit the staging environment to review Dev A’s code, then I save myself the time of stashing my local changes, pulling down the code, and starting the app.” Correct, this saves Dev B time, but overall it causes Dev A to take a bigger hit to theirs. Dev A has to push their code up to the staging env, causing CI to run and a deploy to occur, then notify Dev B to take a look tens of minutes later.

Photo by Ruslan Keba on Unsplash

Where staging environments make sense

As with all hard-and-fast rules, there are some exceptions. One of those exceptions is validating new configuration for production systems. For example, since it’s not simple to run a local Kubernetes cluster, it’s safer to verify risky changes to Kubernetes Deployment config files in a production-like environment: staging.

Another exception is where the wellbeing of people can be on the line. An example would be a payment processing service, where breaking things could have financial consequences for users of the system. A voting system would be another critical system where it’s necessary to take the time to make sure everything is working correctly.

Antipatterns

While chatting with another developer about this blog post, I asked for examples of the kinds of things they use their staging environment for.

One example was verifying that an update to a UI component library looked the same between development and production. Since there’s no really good way to test that the UI doesn’t look broken, verifying that the many screens and states look fine is quite a manual process. One gotcha that was mentioned was that the production build of the Javascript and CSS assets can differ from the development build. That difference between development and production means bugs can slip through and reach users.

A few suggestions came to mind. One was to make development more like the production environment (however that may be). Another: during the testing process, create a production build of the Javascript and CSS assets locally and use that to verify that the UI looks fine. Lastly, if possible, make smaller changes that are easier to review and reason about.

Photo by Romain Hus on Unsplash

Dark launching new functionality

Shipping to production can have a certain amount of risk. A code change could crash the app, break a feature, or even cause a worse user experience. What if we could ship to production and drastically reduce these risks?

Let’s talk about dark launching new features and changes. Dark launching is the practice of shipping new code to production but hiding it from most users, to prevent accidentally breaking things or negatively affecting the user’s experience. This could be implemented a number of different ways (a small sketch follows the list):

  • Using the new logic if a special parameter is added to the page’s URL
  • A special cookie set in the user’s browser to enable the new logic
  • A/B testing of the current and new logic
  • Enabling the new logic only for employees
  • A beta flag that can turn on and off the logic at runtime
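
As a small sketch of the first two approaches, a URL-parameter toggle in a Rails controller might look like this (NewSearch and LegacySearch are placeholder names, not real classes):

class SearchController < ApplicationController
  def show
    backend =
      if params[:new_search] == "1" # dark launch: only visitors who add ?new_search=1 hit the new code path
        NewSearch
      else
        LegacySearch
      end
    render json: backend.query(params[:q])
  end
end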

For example, my team is building out a new search backend. The team is able to ship small, incremental changes for this project without having to worry about breaking any of the existing search functionality. To let the existing frontend code integrate with the new backend code, the team is using URL parameters to dark launch the new search backend in production. This gives us great confidence that the new search backend will work, since it’s being continually tested in production. Additionally, we’ll be using an A/B test to verify that the new search backend is better than the existing one according to our success metrics.

Dark launching new functionality is another pattern that removes the need for staging environments. It does take some thought to figure out the best way to toggle on or off the new functionality, but when used well dark launching can minimize the impact of new code breaking production.

Immediate improvements

Later that day, after I had been convincing my team that staging servers were holding us back, one of our developers wasn’t able to test our ticket submission form locally since it depended on another service being up. Our app was missing the proper local development credentials to connect to this other service.

A few Slack messages with the team resulted in a combined effort to fix the local development environment. One change made developing locally as simple as – if not simpler than – using the staging environment.

Two months later, the team has held themselves to not using any of the staging environments. There have been a few times where the idea of making an exception has come up. I talked them off the ledge by suggesting they make less risky changes by splitting things up into smaller pull requests, or even dark launch their feature.

Photo by Jodie Walton on Unsplash

Recommendations

If I have convinced you that staging servers are being used too much for the wrong purposes – or you’re taking my more extreme view of “just don’t use staging servers” – here is some practical advice for moving towards these goals if you’re not there already.

Start by thinking about yourself. Of the features, projects, and bugfixes you have shipped over the past few months, which used a staging server to verify they would work correctly in production? If there were any, ask yourself what the reason was for having to use the staging server.

Take those reasons and figure out if each one could have been prevented by one or a combination of the following:

  • If the local development environment was more like production I could have avoided using staging
  • If the code change could have been dark launched to production I could have avoided using staging
  • If we had more confidence with our tests catching regressions then I could have avoided using staging

Some of the improvements that can be made to limit the amount of times staging servers are used can seem like a lot of work. But think of this from a different perspective: how much time is wasted due to these inefficiencies being here?

Brodie: Building Shopify’s new Help Centre

One of the primary projects which has defined the existence of my team at Shopify was a complete rebuild of the Help Centre’s platform. The prior Help Centre utilized Jekyll (the static site generator) with a number of features added over the past five years to provide documentation to our merchants, partners, and prospective customers.

The rebuild took about six months, and successfully launched with multiple languages in July 2018.

Deacon Brodie

This post will first discuss the limitations we encountered after using Jekyll for a number of years on a Help Centre that has grown to 15 technical writers and 1600 pages. Next, a number of upcoming features are outlined which the new platform should easily accommodate. Following that is a high-level overview of Brodie, the library we built to replace Jekyll. Then Brodie’s internals are explained, with details on how it integrates with Ruby on Rails. The post ends with links to related code discussed throughout.

Jekyll’s Limitations

As of February 2018, Shopify’s Help Centre consisted of 1600 pages, 3000 images, and 300 partials/includes. This amount of content can really slow down Jekyll’s build time. A clean build takes 80 seconds, while changing a single character on a page requires 15 seconds for a partial rebuild. This really slows down the workflow for our technical writers, as well as developers who maintain the heavy Javascript-based Support page.

Static sites, where a server serves up HTML files, can only get you so far. Features considered dynamic must be implemented using client-side Javascript. This has proven difficult and even restrictive to the features that could be added to the site, especially features which require running on a server and not in the user’s browser. Things such as authenticating Shopify merchants before they contact Support are more difficult when all of the functionality lives in Javascript or is delegated to another app.

The original Deacon Brodie’s Tavern in Edinburgh

Other companies have blogged about the hoops they’ve jumped through to scale Jekyll too.

Upcoming Features

Allowing users to log in to the Help Centre with their Shopify credentials can provide a more personalized experience. Based on the shops the merchant has access to, the pages in the Help Centre can be tailored to their country, the features they use, and the growth stage of their business. The API documentation can be extended to give the logged-in user the ability to query their shop’s API.

Enabling merchants to log in to the Help Centre can also simplify the process of talking with Support. Once logged in, users would be able to bypass verifying who they are to a Support representative, since they’ve already proven who they are by logging into the Help Centre. This saves time on both ends of the conversation and keeps the user focused on their problem.

A short history of Deacon Brodie’s life

Features could also be added to enhance the workflow of our technical writers. For a logged-in technical writer, a few features could be enabled, such as showing all pages regardless of whether they’re hidden or cover an early-release feature, a link to view the page on GitHub, or even a link to view the current page in Google Analytics. Improvements such as these make it much quicker to access relevant data.

Being able to correlate the Help Centre pages visited by a user before they contact Support can help infer how successful pages are at answering the user’s question. Pages which do poorly can be updated, and pages which are successful can be studied for trends. Resources can be better focused on the areas of the Help Centre which need them. Additionally, tying the specific pages visited to Support interactions opens the opportunity to perform A/B tests. A Help Centre page can have two or more versions, and the version which results in the fewest Support interactions could be considered the winner. Currently there is no way to definitively correlate the two.

Many Support organizations gauge the effectiveness of their Help Centre content (self-help) by comparing potential Support interactions solved by Help Centre pages to the number of actual Support interactions – a so-called deflection ratio, where the higher the self-help-to-support-interaction ratio, the better. This ratio can be calculated more accurately with better tracking of the user’s journey through these various Shopify properties before they contact Support.

Lastly, internationalization (aka I18n) and localization mean translating pages into different languages and cultural norms. I18n would open the Help Centre to people who don’t know English, or who prefer reading in a language they understand better. I18n support can be hacked into Jekyll, but as discussed earlier, with 1600 pages already slowing down the build, Jekyll would absolutely buckle once multiple localized versions of each page exist. An app that can scale to a much larger number of pages is therefore required for I18n and localization to even be considered.

The Solution

To enable our Help Centre to scale way past 1600 pages, and support complex server-side features, a scrappy team was formed to rebuild the Help Centre platform in Ruby on Rails.

Rewriting any of the content pages or partials wouldn’t be feasible for the time or resources we had – therefore maintaining compatibility with the existing content files was paramount.

Allowing the number of pages in the Help Centre to keep growing while dramatically reducing the 80-second clean build and the 15-second page rebuild requires an architectural shift: moving away from Jekyll’s model of pre-rendering all pages at build time to a model of rendering only what’s needed at request time. Instead of performing all computational work up-front, performing smaller batches of work at request time spreads out the cost.

The Deacon Brodies Pub in Ottawa, steps away from Shopify HQ

Ruby on Rails was chosen as the new technology stack for the Help Centre for a few reasons. We were hitting Jekyll’s limits, so technically we couldn’t continue using it. Shopify’s internal tooling and production systems heavily integrate with Rails applications, so building on Rails would save a lot of developer time. Shopify also employs one of the largest bases of Rails developers, so tapping into that workforce and knowledge base is very beneficial for future development.

Ruby on Rails brings a number of complementary features such as a solid MVC framework, simple caching abstractions for application code and views, as well as a strong and healthy community of libraries and users. These benefits make Rails a great selling point for building new features faster and easier than the prior Jekyll system.

One thing that has been working quite well over the past few years is the workflow for our technical writers. It consists of using a text editor (such as Atom) to edit Markdown and Liquid code, then using Git and GitHub to open a Pull Request for peer review of the changes. Automated tests check for broken links, missing images, and malformed HTML and Javascript. Once the changes are approved and all tests have passed, the Pull Request is merged and shipped to production.

Since there isn’t a good reason to change the technical writers’ current workflow, we were more than happy to design the new documentation site with the existing workflow in mind.

One of the main features of the platform my team built is its flexible content rendering engine – equivalent to Jekyll on steroids. Here I’ll discuss the heart of the system: Brodie, the ERB-Liquid-Markdown rendering engine.

Brodie

Brodie is the library we purpose-built for Shopify’s new Help Centre. It renders any file containing ERB, Liquid, or Markdown – or a combination of the three – into HTML.

Brodie is named after Deacon Brodies, an Ottawa pub which is itself named after Deacon William Brodie, an 18th-century city councillor in Edinburgh who moonlighted as a burglar and gambler.

Deacon Brodie’s double life inspired the Robert Louis Stevenson story Strange Case of Dr Jekyll and Mr Hyde.

Brodie, and the custom extensions built on top of it, enable a smooth transition from Jekyll to Rails. Shopify’s 1600 pages, 3000 images, and 300 partials/includes can be rendered by Brodie without modification. Additionally, the workflow of the technical writers is not disturbed: they continue to use their favourite text editor to modify content files, Git and GitHub to perform reviews, and the existing Continuous Delivery pipeline for fast validation and shipping.

Views in Rails are rendered using templates. A template is a file that consists of code that defines what the user will see. In a Rails app the template file will usually consist of ERB mixed into HTML. A template file like this would belong in the app/views/ directory and would have a descriptive name such as homepage.html.erb.

The magic in Rails treats templates differently based on their filenames. Let’s break it down. homepage is the template’s name; Rails knows to look for the template by this name. The html part is the format the template outputs. Lastly, erb specifies the language the template file is written in. This naming convention enables Rails to dynamically render views just by looking at the filename.

Rails provides template handlers to render ERB to HTML, as well as JSON and a few others. Rails offers the ability to extend its rendering system by plugging in new template handlers. This is where Brodie integrates with Rails applications. Brodie provides its own template handler to take content files and convert the ERB, Liquid, and Markdown to HTML.

Rails exposes this via ActionView::Template.register_template_handler(:md, Content), where :md is the file extension to act on and Content is the class to use as the template rendering engine (template handler).
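
As a quick sketch, the wiring might look like this (the initializer path and template filename are assumptions, not our app’s actual layout):

# config/initializers/brodie.rb (hypothetical location)
# Register Brodie's handler so Rails renders *.md templates through it.
ActionView::Template.register_template_handler(:md, Brodie::Handlers::Content)

# With this in place, a content file such as app/views/docs/intro.html.md
# is rendered by Brodie: ERB, then Liquid, then Markdown, into HTML.

Next we’ll go over how a template handler works.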

Rendering Templates

The only interface a template handler is required to respond to is call, with one parameter: the template to render. This method should return a string of code that will render the view. The string will be eval‘ed later. Returning a string of code is a Rails optimization that inlines much of the code required to render the template, reducing the number of method calls and speeding up the already time-consuming rendering process.

When Rails needs to render a view, it takes the specified template and calls its template handler. The handler returns a string containing the code that renders the template. The Template class combines that code with other code, then evals the stringified result.

For example, the ERB-Liquid-Markdown renderer has a call method like the following:

def call(template)
  # Run Rails' built-in ERB handler over the raw source first.
  compiled_source = erb_handler.call(template)
  # Return a string of code for Rails to eval when rendering the view.
  "Brodie::Handlers::Content.render(begin;#{compiled_source};end, local_assigns)"
end

Brodie first renders the ERB in the template’s content with the existing ERB handler that comes with Rails. Brodie then returns a string of code which calls the render method on itself. That render method is shown next:

def render(source, local_assigns = {})
  # Render Liquid (with any passed-in variables), then Markdown, and mark
  # the result as safe HTML so Rails doesn't escape it again.
  markdown.call(
    liquid.call(source, local_assigns)
  ).html_safe
end

Here is where the actual rendering of the Liquid and Markdown occurs. When this code is eval‘ed, the local_assigns parameter is included for passing in variables when rendering a view. This is how variables are magically passed from Rails controllers into views.

Left: The old Jekyll site. Right: The new Rails site. The Help Centre rebuild looks the same but has a completely new backend

It’s as straightforward as that to render ERB, Liquid, and Markdown together. The early days of Brodie were spent understanding the ins and outs of ActionView well enough to validate that this approach was sane and didn’t break in edge cases.

Further Reading

The current documentation is really limited when it comes to Templates and Template Handlers. I would suggest building a small template handler, setting breakpoints, and walking through the source. Here’s a great example of a template handler for Markdown.

Additionally, looking over the source code and comments is the best way to get an understanding of the ActionView internals. The main entry point into ActionView is the render method from TemplateRenderer. Template would be best to check out next, as it concerns itself with actually rendering templates. Lastly, Handlers is worth a look to see how Rails registers and fetches Template Handlers.

Keep Continuously Testing

One of the more powerful tools on my toolbelt is the red-green development cycle. Purposefully writing tests which fail (red), then writing the code required to make them pass (green), keeps my mind focused on the task at hand and saves me time.

For example, take the annual Advent of Code set of challenges: given a problem and example input with output, create working code which solves the problem for any given input.

Small challenges such as Advent of Code provide the perfect scenario for Test Driven Development (TDD). Given an Advent of Code challenge, I would always start by writing a test asserting the expected output. Since that test would fail, I would then proceed to write code which returns the correct answer.
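
As an illustration, here’s what that first red-green loop can look like with Minitest, using the fuel formula from a well-known Advent of Code puzzle:

require "minitest/autorun"

# Green, step two: the simplest implementation that makes the test pass.
def fuel_for(mass)
  mass / 3 - 2
end

class FuelTest < Minitest::Test
  # Red, step one: these assertions fail until fuel_for is implemented.
  def test_fuel_for_mass
    assert_equal 2, fuel_for(12)
    assert_equal 654, fuel_for(1969)
  end
end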

My optimal TDD coding setup. Clockwise, from left: source file, test file, automatic test output, interactive Ruby console.

I can’t emphasize enough the time saved by automatically running my tests whenever any file in my working directory is saved. The time savings have paid off again and again by shaving off the keystrokes of running tests manually each time. Saving my source or test file automatically causes my tests to start running in another terminal pane.

The way I’m able to accomplish this is by combining two tools: ag and entr. Used together, these two shell tools allow for a powerful red-green development process.

ag (commonly known as the Silver Searcher, since succeeded by rg (ripgrep)) is a program which, among other things, lists files recursively. entr is a program which runs a given shell command whenever a file is modified. Piped together as ag -l | entr -c rails test, they enable a TDD workflow: ag provides a recursive list of files, then entr watches those files and runs the provided command – in this case rails test. The -c parameter tells entr to clear the console each time the command is rerun.

Download ag and entr

I would highly recommend trying these two tools in combination the next time you find yourself in a TDD situation – they have certainly saved me from a lot of repeated keyboard commands. Even if you don’t do TDD regularly, or at all, I would still recommend giving them a try.

Download ag from GitHub, or equivalently rg (ripgrep) if you’re feeling up to the challenge. entr can also be fetched from GitHub.

For a quicker install

brew install the_silver_searcher entr on Mac OS.

apt-get install silversearcher-ag entr on Ubuntu-like systems.

Why “&” doesn’t actually break your HTML URLs

Writing tests for some code which generated HTML surfaced a peculiarity in how HTML encodes URLs. The valid URL https://example.com?a=b&c=d would always get modified when inserted into HTML, like so: <a href="https://example.com?a=b&amp;c=d">foo</a>.

One of my teammates commented on this during a code review, asking why the & character is converted to &amp; in the resulting HTML. The URL didn’t look right, since it seemed the &amp; would break the URL query string.

Even more confusing was that the URL in the HTML still worked, since Google Chrome and other browsers converted it from its &amp; form back to &. Were the browsers just being helpful by handling these developer mistakes, much like they already do by closing unclosed HTML elements?

The fake bug hunt

Over two hair-pulling days of reading GitHub issues, StackOverflow, HTML standards, source code, and more, a clear divide in understanding emerged: one group of people who saw this as a bug in their library of choice, and another group who understood that this wasn’t a bug at all.

I was definitely in the former group until I finally found a helpful blog post clearing up the confusion. Even this StackOverflow answer concisely summed up why this is, in a few quick sentences.

Simply stated, lone & characters in HTML are invalid and must be escaped to &amp;.

HTML and XML are closely related, and HTML implements many of the same rules. One of those rules is that a & character by itself is invalid, since & is the escape character that begins a character entity reference (eg. &amp;, &lt;).
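
A couple of lines of Ruby, using the standard library’s CGI module, demonstrate the round trip:

require "cgi"

url = "https://example.com?a=b&c=d"
escaped = CGI.escapeHTML(url) # => "https://example.com?a=b&amp;c=d"
CGI.unescapeHTML(escaped)     # => "https://example.com?a=b&c=d"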

The confusion arises when people don’t know this rule exists. Many, including myself, blamed their HTML parsing libraries, such as Nokogiri and libxml2. Others blamed their web app of choice for sending what looked like invalid HTML or XML that their HTML parser didn’t know how to deal with.

Conclusion

Another way of understanding the same problem is that a URL on its own has rules about which characters must be encoded, and HTML has different encoding rules. So when a URL is embedded in HTML, it may look invalid as a URL, but it’s valid by HTML’s escaping rules. This can lead to funky-looking URLs, but rest assured that an HTML parsing library or a browser will properly encode and decode any data stored within HTML.

This explains why our browsers see &amp; in the raw HTML and know to convert it back to &. It also confirms that it’s completely fine to see &amp; in tests comparing HTML.

Twenty-four!

It’s hard to come up with the content for this post while fending off a sickness, but this is a yearly ceremony of mine: looking back and reflecting on the prior year. As always, what better time to do it than on my birthday!

One word can really describe my primary focus over the past year: Career. This time last year I was just about to pass the 90 day mark at Shopify.

Whether it’s been building close, trusting friendships with teammates and other colleagues, levelling up new hires through mentoring, or continually delivering impactful work – this year has been nothing short of exemplary.

Let’s get right into things! Since this time last year I moved to downtown Ottawa and am now living without roommates. Crazy to think that was only 10 months ago, since it feels like forever, but I am enjoying downtown life and all the perks of having no roommates.

This summer was one of my most active to date. There was always something going on during the week or on the weekend from July straight through August. I had the opportunity to travel with a friend to his hometown of Fredericton, New Brunswick. It was my first time on the east coast and I was expecting an east-coast accent out of everyone, but the place seemed more like Ontario than not. It was a great time hanging out with his friends and attending a party at the local hotel.

I had a great time with friends at two music festivals: Bluesfest and Escapade. Escapade was especially fun since there were a number of great acts: Alesso, Tchami, Zedd, and Kaskade. One private festival I went to was about 15 people camping at a buddy’s lakeside property in Quebec. A DJ booth was set up and the trance music went late into the night.

Being so close to downtown has its perks – I was within walking distance to both festivals.

Another bunch of cool moments centred around exploring other Shopify offices. Barrel Hall in Waterloo was the coolest looking, since it was once a distillery and still has all of the characteristic aging barrels and wooden structure. Montreal’s office has the best artwork and looks like the most liveable city. Toronto’s offices, and the city in general, were a grand party since it’s mostly new to me, but I have friends, family, and colleagues everywhere.

Barrel Hall, Shopify’s Waterloo Office, definitely had the most character.

Even though I didn’t do as much cycling as last year, I took advantage of Ottawa’s Sunday Bikedays this year. On Sundays throughout the summer, certain parkways are closed for the morning, which allowed for coasting down long stretches of smooth roadway. The midpoint of some of these outings was spent taking a break at a local brewpub. Some friends joined me every once in a while too!

One of the many destinations – where the Rideau Canal meets the Ottawa River.

Investing and personal finance have become a hobby of mine, and ever more important as I get older. It’s better to learn the do’s and don’ts of personal finance earlier rather than later. A year of listening to related podcasts, plenty of reading, and managing my savings has taken me from zero to pretty competent. I’m lucky to have a solid group of people to bounce ideas and plans off of.

I went to my first conference, BSides Ottawa, which was quite fun. I met a number of colleagues and played in my first capture the flag event. I found out that I can defend, but am not too good at attacking. I’ll try again to attend this year!

December was when a few others and I started a rewrite of Shopify’s Help Centre. Unknown to us, there was quite a lot of feature creep – from “that one little feature that’s existed forever” to adding multiple language support. This resulted in the project taking seven months, but we’re glad to have done it. Throughout the process we started and built our own kick-ass team. When the rewrite shipped, it went off without a hitch! 🎇 Now all of our current projects hinge on the benefits that this rewrite brought.

Some of the team which traveled to Montreal to launch the Help Centre.

I attended a few training sessions that should benefit my career – Visualizing Software Architecture with the C4 Model, as well as Agile Scrum training. The latter has definitely transformed my team for the better.

There were a lot of work events – planned or unplanned, official or unofficial – which I’m pretty grateful to have experienced with friends and colleagues. Alas, there are too many to mention.

For example: that one time we had a marching band…

Here’s to another year of learning, growth, exploration, and good times.

Twilio’s TaskRouter Quickstart

My team and I are exploring different services and technologies in the area of contact centres. We develop and maintain the tools for over 1000 support agents, with that number rapidly rising. Making smart, long-term business and technology decisions is paramount. One of the technologies we looked into was Twilio and its ecosystem – specifically TaskRouter.

Twilio’s TaskRouter provides a clean interface for building contact centres. Its goal is to take the tedious infrastructure and plumbing work out of building a custom contact centre, exposing the right APIs to implement domain logic. TaskRouter is a high-level service since it orchestrates voice, SMS, and other communication channels with the ability to assign incoming interactions across a workforce of agents ready to take those interactions.

Twilio-Ruby

To get a head start at understanding how TaskRouter works, I spent a day looking at Twilio’s Ruby quickstart guide for TaskRouter. Wow, was I in for a frustrating time.

The quickstart guide takes the reader through a number of steps, both inside of the Twilio Console as well as building a small Ruby Sinatra app. After completing the quickstart the reader should have a fully functioning call centre with an interactive voice response (IVR) to greet and queue any user that calls in.

One of the things that made the quickstart harder to complete was that the Ruby code examples throughout used an older version of the twilio-ruby gem, so they didn’t work with the latest version. This was both a bad and a good thing. Bad in that the existing code examples wouldn’t work out of the box, but good in that I had to put in extra effort to learn where the docs and other sources of help exist, and to gain a deeper understanding of how the Twilio API works.

I compiled a list of resources that would assist anyone going through the same or a similar situation. It certainly helped me complete the TaskRouter quickstart.

  • The README for the twilio-ruby gem provided a great overview of what functionality it provides and how the gem is to be used
  • The v4 to v5 upgrade guide for the twilio-ruby gem showed that there was some sense to this chaos by providing the rationale and examples for updating old versions of the twilio-ruby code to the latest (v5). This was where I had my moment of understanding for the quickstart code examples.
  • Using JWT tokens was part of the last section of the quickstart. Since twilio-ruby changed the way it uses tokens, its code examples had to be updated too. The main Twilio docs on JWT goes into intricacies around building policies contained within JWT tokens
  • My lead/manager was quite happy when I mentioned that the twilio-ruby gem no longer uses title case in situations where camel-case or snake-case better suits Ruby styling. TwiML was affected by this for a number of gem versions up until v5, and since TwiML is used frequently throughout the quickstart, the docs for using TwiML in twilio-ruby helped during those times (see the sketch after this list).
  • Lastly, if all else fails, feel free to reference my resulting code from the TaskRouter quickstart. It’s available here on GitHub.
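
To make that TwiML style change concrete, here’s a hedged sketch of the v5-era snake_case style (exact signatures vary by gem version; the message and queue name are made up):

require "twilio-ruby"

# v5-style TwiML: snake_case methods and keyword arguments, replacing the
# old title-case style (e.g. the old `response.Say`).
response = Twilio::TwiML::VoiceResponse.new
response.say(message: "Please hold for the next available agent.")
response.enqueue(name: "support") # enqueue the caller into the "support" queue
puts response.to_s                # prints the generated <Response> XML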

How Do Symmetric and Public Key Encryption Work?

With the release of Rails 5.2 and its changes to how secrets are securely stored, I thought it would be timely to write about the benefits and downsides of secrets management in Rails. It would be valuable to compare how Rails handles secrets, how Shopify handles secrets, and a few other methods from the open source community. On my journey to write that, I got caught up in explaining how symmetric and public key encryption work – so the post comparing different Rails secret management gems will have to wait.

Managing secrets is now more challenging

A majority of applications created these days integrate with other applications – whether for communicating with business-critical systems, or for purely operational purposes such as log aggregation. Secrets such as usernames, passwords, and API keys are used by these apps in production to communicate with other systems securely.

The early days of the Configuration Management movement, and the DevOps movement after it, rallied around and popularized a wide array of methodologies and tools for managing secrets in production. Moving from a small, artisanal, hand-crafted set of long-running servers to the modern short-lifetime cloud instance paradigm requires the discipline to manage secrets securely and repeatably, with the agility to revoke and update credentials in a matter of hours if not minutes.

While there are many ways to handle secrets while developing, testing, and deploying Rails applications, it’s important to weigh the benefits and downsides of the different methods, particularly around production. Different technologies come with different levels of security, usability, and adoption. Public/private key encryption – often called RSA encryption after its best-known algorithm – is one of those technologies. Symmetric key encryption is another common one.

There exist many ways to handle secrets within Rails, and webapps in general. It’s important to understand the underlying concepts before settling on one method or another, because making the wrong decision may result in secrets being insecure, or in security that’s too hard to use.

Let’s first discuss the different types of encryption that are characteristic of the majority of secret management libraries and products out there.

Symmetric Key Encryption

Symmetric key encryption may be the simplest form of encryption to understand, but don’t let that trick you into thinking it’s not secure. Symmetric key encryption uses one key to both encrypt and decrypt data. That key has to be kept secret and only shared with trusted people and systems. Once secrets are encrypted with the key, the encrypted data can be readily shared and transferred without worry of the unencrypted data being read.

Let’s walk through a simple example of symmetric key encryption using the binary XOR function. (This example is not representative of state-of-the-art symmetric key algorithms in use, but it does get the point across.) The binary XOR function means “one or the other, but not both”. Here is the complete set of inputs and outputs for one binary digit:

1 XOR 1 = 0
1 XOR 0 = 1
0 XOR 1 = 1
0 XOR 0 = 0

A more complicated example would be:

10101010 XOR 01010101 = 11111111
11111111 XOR 11111111 = 00000000
11111111 XOR 01010101 = 10101010

Note that lines 1 and 3 are related. The output of line 1 is the first input of line 3, and the second parameter of line 1 is also the second parameter of line 3. Notice that the output of line 3 is the same as the first input of line 1. As demonstrated here, XOR returns the original input if the result is fed back through the function a second time with the same key. A further example will show this property.

Given that any higher form of data representation can be broken down into binary, we can show an example of hexadecimal digits being XOR’ed with a key.

12345678 XOR deadbeef = cc99e897

Given the key deadbeef (in hexadecimal) and the data 12345678, the XOR produces the incomprehensible result cc99e897. Guess what? cc99e897 is encrypted. It can be saved and passed around freely. The only way to get back the secret input (ie. 12345678) is to XOR it again with the key deadbeef. Let’s see this happen!

cc99e897 XOR deadbeef = 12345678

Fact check it yourself if you don’t believe me, but we just decrypted the data! This is the simplest possible example, of course – there’s a lot more that goes into symmetric key encryption to keep it secure. Block- and stream-based algorithms and much larger key sizes augment the simple XOR approach. It may be easy to guess the key in this example, but it becomes much harder as the key gets longer.
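
You can check the arithmetic yourself with two lines of Ruby:

(0x12345678 ^ 0xdeadbeef).to_s(16) # => "cc99e897" – encrypting
(0xcc99e897 ^ 0xdeadbeef).to_s(16) # => "12345678" – decrypting with the same key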

This is what makes symmetric key encryption so powerful – the ability to encrypt and decrypt data with a single key. With that power comes the need to keep the key secret and separate from the data. When symmetric key encryption is used in practice, the fewer people and systems that have the key, the better. Humans can easily lose the key, leave jobs, or worse: share the key with people of malicious intent.

Public Key Encryption

Quite the opposite of symmetric key encryption, public key encryption (also called asymmetric key encryption, with RSA as the classic example) uses two distinct keys. In its simplest form, the public key is used for encryption and the private key for decryption. This separates the ability to encrypt data from the ability to decrypt it. Put plainly, anyone can encrypt data with the public key, while the owner of the private key is the only one able to decrypt it. The public key can be shared with anyone without compromising the security of the encrypted data.

One tradeoff between symmetric and public key encryption is that in public key encryption the private key (the key used to decrypt data) is never shared with other parties, whereas symmetric key encryption shares its one key. A downside of public key encryption is that there are multiple keys to manage, bringing more overhead than symmetric key encryption.

Let’s dig into a simple example. Given a public key (n=55, e=3) and a private key (n=55, d=27), we can show the math behind public key encryption. (These numbers were fetched from here.)

Encrypting

To encrypt data the function is:

c = m^e mod n

Where m is the data to encrypt, e is the public exponent, mod is the modulus function, n is the shared modulus, and c is the encrypted data.

For the number 42 to be encrypted we can plug it into the formula quite simply:

c = 42^3 mod 55
c = 3

c = 3 is our encrypted data.

Decrypting

Decrypting takes a similar route. For this a similar formula is used:

m = c^d mod n

Where c is the encrypted data, d is the private exponent, mod is the modulus function, n is the shared modulus, and m is the decrypted data. Let’s decrypt the encrypted data c = 3:

m = 3^27 mod 55
m = 42

And there we have it, our decrypted data is back!
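
If you want to verify the math, Ruby’s built-in modular exponentiation (Integer#pow with a modulus argument, available since Ruby 2.5) turns each formula into a one-liner:

n = 55            # shared modulus
m = 42            # data to encrypt
c = m.pow(3, n)   # encrypt with the public exponent e = 3   # => 3
c.pow(27, n)      # decrypt with the private exponent d = 27 # => 42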

As we can see, a separate key is used for encryption and decryption. It’s worth restating that this example is very simplified – many more mathematical safeguards and much larger key sizes are used to make public key encryption secure in practice.

Signing – a freebie with public key encryption

Another benefit of RSA public and private keys is that since the private key is held by only one user, that user can sign a piece of data to prove that it was really them who sent it. Anyone who has the matching public key can verify that the data was signed by the private key and was not tampered with in transit.

When Bob needs to receive data from Alice and be sure it was sent by her and not tampered with along the way, Alice can hash the data and then encrypt that hash with her private key. This encrypted hash is sent along with the data to Bob. Bob then uses Alice’s public key to decrypt the hash and compares it to a hash of the data that he computes himself. If the two hashes match, Bob knows the data truly came from Alice and was not tampered with on its way to him.
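
Here’s a minimal sketch of that flow using Ruby’s OpenSSL bindings (the key size and message are arbitrary choices):

require "openssl"

alice  = OpenSSL::PKey::RSA.new(2048) # Alice's keypair
data   = "a message from Alice"
digest = OpenSSL::Digest.new("SHA256")

signature = alice.sign(digest, data)             # hash the data, encrypt the hash with the private key
alice.public_key.verify(digest, signature, data) # => true when the data is untampered and from Alice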

Wrapping up

Picking one method of encryption as the general winner at this abstract level is nonsensical. It makes more sense to start from a use case: find the best fit at the abstract level first, then find a library offering that method of encryption.

A following post will go into the tradeoffs between different encryption methods in relation to keeping secrets in Ruby on Rails applications. It will take a practical approach, explaining some of the benefits of one encryption method over another, and then give some examples of well-known libraries for each category.

Parallel GraphQL Resolvers with Futures

My team and I are building a GraphQL service that wraps multiple RESTful JSON services. The GraphQL server connects to backend services such as Zendesk, Salesforce, and even Shopify itself.

Our use case involves returning results from all of these backend services in a single GraphQL query. When the GraphQL server goes out to query them, each backend service can take multiple seconds to respond – and queries that take many seconds to complete are a terrible experience.

Since we’re running the GraphQL server in Ruby, we don’t get the nice asynchronous IO that comes with the NodeJS version of GraphQL. Because of this, the GraphQL resolvers run serially instead of in parallel – so a GraphQL query touching five backend services, each taking one second to fetch from, takes five seconds to run.

For our use case, a GraphQL query that takes five seconds is a bad experience. What we would prefer is 2 seconds or less. That means optimizing the HTTP requests GraphQL makes to the backend services – our idea being to parallelize them.

First Approaches

To parallelize those HTTP requests we took a look at non-blocking HTTP libraries, different GraphQL resolvers, and Ruby concurrency primitives.

Typhoeus

Knowing that running the HTTP requests in parallel was the direction to explore, we first took a look at the Ruby library Typhoeus. Typhoeus offers a simple abstraction for performing parallel HTTP requests by wrapping the C library libcurl. Below is one of the many possible ways to use Typhoeus.
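
Here’s a minimal sketch, with placeholder backend URLs:

require "typhoeus"

hydra = Typhoeus::Hydra.new
urls = ["https://backend-a.example.com/data", "https://backend-b.example.com/data"]
requests = urls.map do |url|
  request = Typhoeus::Request.new(url)
  hydra.queue(request) # queue the request; nothing runs yet
  request
end
hydra.run # runs all queued requests in parallel, blocking until all complete
responses = requests.map(&:response)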

After playing around with Typhoeus, we quickly found out that it wasn’t going to work without extending the GraphQL Ruby library. It became clear that it was nontrivial to wrap a GraphQL resolver’s life cycle with a Hydra from Typhoeus – a Hydra being, essentially, a Future that runs multiple HTTP requests in parallel and returns when all of them are complete.

Lazy Execution

We also took a look at GraphQL Ruby’s lazy execution features. We had hoped that lazy execution would automatically optimize by running resolvers in parallel. It didn’t. Oh well.

We also tried a perverted version of lazy execution. I can’t remember why or how we came up with this method, but it was obviously overcomplicated for no good reason and didn’t work 😆

Threads and Futures

We looked back at the shortcomings of the earlier methods and understood that we had to find a concurrency method that would let us run the HTTP requests in the background without blocking the main thread until it needed the data. Based on this understanding we took a look at some Ruby concurrency primitives – both Futures (from the Concurrent Ruby library) and Threads.

I highly recommend using higher-order concurrency primitives such as Futures because of their well-defined, simple APIs, but for hastily hacking something together to see if it would work, I experimented with Threads.

My teammate ended up figuring out a working example with Futures faster than I could hack my Threads example together. (I’m glad they did – we’ll see why next.) Here is a simple use of Futures in GraphQL:
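
What follows is a minimal sketch of the idea – the field name and the ZendeskClient call are placeholders, not our actual code:

require "graphql"
require "concurrent"

class QueryType < GraphQL::Schema::Object
  field :ticket_titles, [String], null: false

  def ticket_titles
    # Kick off the slow HTTP call in the background; return control immediately.
    Concurrent::Future.execute { ZendeskClient.fetch_ticket_titles }
  end
end

class AppSchema < GraphQL::Schema
  query QueryType
  # When a resolver returns a Future, call #value! on it only when the
  # data is actually needed to build the response.
  lazy_resolve(Concurrent::Future, :value!)
end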

It’s not clear at first, but according to the GraphQL Ruby docs, any GraphQL resolver can return either the data or something that can later return the data. In the code example above, we use the latter by returning a Concurrent::Future in each resolver and adding lazy_resolve(Concurrent::Future, :value!) to the GraphQL schema. When a resolver returns a Concurrent::Future, the lazy_resolve part tells GraphQL Ruby to call :value! on the Future when it really needs the data.

What does all of this mean? When GraphQL goes to fulfill a query, all the resolvers involved with the query quickly spawn Futures that start executing in the background. GraphQL then moves to the phase where it builds the result. Since it now needs the data from the Futures, it calls the potentially blocking operation value! on each Future.

The beautiful thing here is that we don’t have to worry about whether the Futures have finished fetching their data yet. This is because of the powerful contract we get with using Futures – the call to value! (or even just value) will block until the data is available.

Conclusion

We ended up settling on the last design – utilizing Futures to let the main thread push as much asynchronous work into the background as possible.

As seen through our thought process, all that we needed was to find a way to start execution of a long-running HTTP request, and give back control to the main thread as fast as possible. It was quite clear throughout the early ideas of utilizing concurrent HTTP request libraries (Typhoeus) that we were on the right track, but weren’t understanding the problem perfectly.

Part of that was not understanding the GraphQL Ruby library. Part of it was also being fuzzy on our concurrency primitives and libraries. Once we had taken a look at GraphQL Ruby’s lazy loading features, it became clear that we needed to kick off the HTTP request and immediately give control back to the GraphQL Ruby library. Once we understood this, the solution became clear, and some prototypes using Futures made us confident.

I enjoyed the problem solving process we went through, as well as this writing that resulted from it. The process taught the both of us some valuable engineering lessons about collaborative, up-front prototyping and design, since we couldn’t have achieved this outcome alone. Additionally, writing about this success can help others with the same problem, not to mention teach the different techniques we encountered along the way.