Brodie: Building Shopify’s new Help Centre

One of the primary projects which has defined the existence of my team at Shopify was a complete rebuild of the Help Centre’s platform. The prior Help Centre utilized Jekyll (the static site generator) with a number of features added over the past five years to provide documentation to our merchants, partners, and prospective customers.

The rebuild took about six months, and successfully launched with multiple languages in July 2018.

Deacon Brodie

This post first discusses the limitations we encountered after using Jekyll for a number of years on a Help Centre that has grown to 15 technical writers and 1600 pages. Next, it outlines a number of upcoming features that the new platform should easily accommodate. Following that is a high-level overview of Brodie, the library we built to replace Jekyll. Brodie’s internals are then explained, with details on how it integrates with Ruby on Rails. The post ends with links to related code discussed throughout.

Jekyll’s Limitations

As of February 2018, Shopify’s Help Centre consisted of 1600 pages, 3000 images, and 300 partials/includes. This amount of content really slows down Jekyll’s build time: a clean build takes 80 seconds, while changing a single character on a page requires 15 seconds for a partial rebuild. This slows down the workflow for our technical writers, as well as for the developers who maintain the JavaScript-heavy Support page.

Static sites, where a server serves up HTML files, can only get you so far. Features considered dynamic must be implemented using client-side JavaScript. This has proven difficult, and it restricts the features that can be added to the site, especially those that must run on a server rather than in the user’s browser. For example, authenticating Shopify merchants before they contact Support is harder when all of the functionality lives in JavaScript or depends on another app.

The original Deacon Brodie’s Tavern in Edinburgh

Even other companies have blogged about the hoops they’ve jumped through to scale Jekyll too.

Upcoming Features

Allowing users to log in to the Help Centre with their Shopify credentials can provide a more personalized experience. Based on the shops the merchant has access to, the pages in the Help Centre can be tailored to their country, the features that they use, and the growth stage of their business. The API documentation can be extended to give the logged-in user the ability to query their shop’s API.

Letting merchants log in to the Help Centre can also simplify the process of talking with Support. Once logged in, users would be able to skip verifying their identity with a Support representative, since they’ve already proven who they are by logging in to the Help Centre. This saves time on both ends of the conversation and keeps the user focused on their problem.

A short history of Deacon Brodie’s life

Features could also be added to enhance the workflow of our technical writers. For a logged-in technical writer, a few features could be enabled, such as showing all pages regardless of whether they are hidden or belong to an early-release feature, a link to view the page on GitHub, or even a link to view the current page in Google Analytics. Improvements such as these make it much quicker to access relevant data.

Being able to correlate the Help Centre pages a user visits before they contact Support can help infer how successful those pages are at answering the user’s question. Pages which do poorly can be updated, and pages which are successful can be studied for trends. Resources can be better focused on the areas of the Help Centre which need them. Additionally, linking the specific pages visited to Support interactions opens the opportunity to perform A/B tests. A Help Centre page can have two or more versions, and the version which results in the fewest Support interactions could be considered the winning version. Currently there is no way to definitively correlate the two.

Many Support organizations gauge the effectiveness of their Help Centre content (self-help) by comparing the number of potential Support interactions solved by Help Centre pages to the number of actual Support interactions. This is the so-called deflection ratio: the higher the ratio of self-help to Support interactions, the better. The ratio can be calculated more accurately by better tracking the user’s journey through the various Shopify properties before they contact Support.
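As a rough illustration of the arithmetic, here is a minimal Ruby sketch. The method name, the counts, and the exact definition (self-served sessions divided by Support interactions) are assumptions made for illustration and are not Shopify’s actual metric.

```ruby
# Hypothetical deflection ratio: self-served sessions per Support interaction.
# Both counts are made-up inputs, and the definition itself is an assumption.
def deflection_ratio(self_served_sessions, support_interactions)
  raise ArgumentError, "support_interactions must be positive" unless support_interactions.positive?

  self_served_sessions.to_f / support_interactions
end

ratio = deflection_ratio(9_000, 1_000)
puts format("%.1f self-served sessions per Support interaction", ratio)
```

With these made-up numbers, nine sessions are resolved by self-help for every one that reaches Support. Better tracking mainly improves the accuracy of the first input.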

Lastly, internationalization (aka I18n) and localization mean adapting pages to different languages and cultural norms. I18n would enable the Help Centre to be used by people who don’t know English, or who prefer reading in a language they understand better. I18n support can be hacked into Jekyll, but as discussed earlier, 1600 pages already slow down the build, and Jekyll would grind to a halt once multiple localized versions of each page exist. Therefore, an app that can scale to a much larger number of pages is required for I18n and localization to even be considered.

The Solution

To enable our Help Centre to scale way past 1600 pages, and support complex server-side features, a scrappy team was formed to rebuild the Help Centre platform in Ruby on Rails.

Rewriting any of the content pages or partials wouldn’t be feasible for the time or resources we had – therefore maintaining compatibility with the existing content files was paramount.

Allowing the number of pages in the Help Centre to keep growing while dramatically reducing the 80 second clean build time and the 15 second page rebuild time requires an architectural shift: moving away from Jekyll’s model of pre-rendering all pages at build time to a model of rendering only what’s needed at request time. Instead of performing all the computational work up front, performing smaller batches of work at request time spreads out the cost.
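The shift can be sketched in a few lines of Ruby. This is a minimal illustration under stated assumptions: the PAGES hash, the LazyRenderer class, and the in-memory cache are all hypothetical, and whether the real platform caches rendered output this way is not something this post states.

```ruby
# Hypothetical stand-in for a page store: path => raw content.
PAGES = {
  "/intro" => "welcome",
  "/faq" => "questions"
}.freeze

# Renders pages lazily and remembers the result, so each page costs
# rendering time only on its first request.
class LazyRenderer
  attr_reader :render_count

  def initialize(pages)
    @pages = pages
    @cache = {}
    @render_count = 0
  end

  def fetch(path)
    @cache.fetch(path) do
      @render_count += 1
      @cache[path] = expensive_render(@pages.fetch(path))
    end
  end

  private

  # Placeholder for the real ERB/Liquid/Markdown work.
  def expensive_render(source)
    "<p>#{source.capitalize}</p>"
  end
end

renderer = LazyRenderer.new(PAGES)
renderer.fetch("/intro")
renderer.fetch("/intro") # served from the cache, no second render
puts renderer.render_count
```

Pages that are never requested are never rendered, which is the opposite trade-off to a static build that pays for every page up front.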

The Deacon Brodies Pub in Ottawa, steps away from Shopify HQ

Ruby on Rails was chosen as the new technology stack for the Help Centre for a few reasons. We were reaching the limits of Jekyll, so we couldn’t realistically continue using it. Shopify’s internal tooling and production systems heavily integrate with Rails applications, so building on Rails to utilize these would save a lot of developer time. Shopify also employs the largest base of Rails developers, so tapping into that workforce and knowledge base is very beneficial for future development.

Ruby on Rails brings a number of complementary features such as a solid MVC framework, simple caching abstractions for application code and views, as well as a strong and healthy community of libraries and users. These benefits make Rails a great selling point for building new features faster and easier than the prior Jekyll system.

One of the things that has been working quite well over the past few years has been the workflow for our technical writers. It consists of using a text editor (such as Atom) to edit Markdown and Liquid code, then using Git and GitHub to open a Pull Request for peer review of the changes. Automated tests check for broken links, missing images, and incorrectly formed HTML and JavaScript. Once the changes are approved and all tests have passed, the Pull Request is merged and shipped to production.

Since there isn’t a good reason to change the technical writer’s current workflow we’re more than happy to design the new documentation site with the existing workflow in mind.

One of the main features of the platform my team built was the flexible content rendering engine. It’s equivalent to Jekyll on steroids. Here I’ll discuss the heart of the system, Brodie, the ERB-Liquid-Markdown rendering engine.

Brodie

Brodie is the library we’ve purpose-built for Shopify’s new Help Centre. It renders any file that contains ERB, Liquid, and Markdown, or a combination of the three into HTML.

Brodie is named after Deacon Brodies, an Ottawa pub which is itself named after Deacon William Brodie, an 18th-century city councillor in Edinburgh who moonlighted as a burglar and gambler.

Deacon Brodie’s double life inspired the Robert Louis Stevenson story Strange Case of Dr Jekyll and Mr Hyde.

Brodie, and the custom extensions built on top of it, enable a smooth transition from Jekyll to Rails. Shopify’s 1600 pages, 3000 images, and 300 partials/includes can be rendered by Brodie without modification. Additionally, the workflow of the technical writers is not disturbed. They continue to use their favourite text editor to modify content files, Git and GitHub to perform reviews, and the existing Continuous Delivery pipeline for fast validation and shipping.

Views in Rails are rendered using templates. A template is a file that consists of code that defines what the user will see. In a Rails app the template file will usually consist of ERB mixed into HTML. A template file like this would belong in the app/views/ directory and would have a descriptive name such as homepage.html.erb.

Rails treats a template differently based on its filename. Let’s break it down. homepage is the template’s name, and Rails knows to look for the template based on this name. The html part is the format the template outputs. Lastly, erb specifies the language the template file is written in. This naming convention enables Rails to dynamically render views just by looking at the filename.
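To make the convention concrete, here is a small Ruby sketch that splits a filename into those three parts. The TemplateName struct and parse_template_filename method are hypothetical names; the snippet mirrors the naming convention only and is not how Rails resolves templates internally.

```ruby
# Holds the three parts of a template filename such as homepage.html.erb.
TemplateName = Struct.new(:name, :format, :handler)

# Splits "name.format.handler" into its parts. Hypothetical helper for
# illustration; Rails has its own, more involved lookup machinery.
def parse_template_filename(filename)
  name, format, handler = File.basename(filename).split(".", 3)
  TemplateName.new(name, format&.to_sym, handler&.to_sym)
end

parsed = parse_template_filename("app/views/homepage.html.erb")
puts parsed.name    # homepage
puts parsed.format  # html
puts parsed.handler # erb
```

Under this convention, a content file named with a different final extension would be routed to a different handler.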

Rails provides template handlers to render ERB to HTML, as well as JSON and a few others. Rails offers the ability to extend its rendering system by plugging in new template handlers. This is where Brodie integrates with Rails applications. Brodie provides its own template handler to take content files and convert the ERB, Liquid, and Markdown to HTML.

Rails exposes this via ActionView::Template.register_template_handler(:md, Content), where :md is the file extension to act on and Content is the class to use as the template rendering engine (the template handler). Next we’ll go over how a template handler works.

Rendering Templates

The only method a template handler is required to respond to is call, which takes one parameter: the template to render. This method should return a string of code that will render the view. This string will be eval’ed by the template later. Returning a string of code is a Rails optimization which inlines much of the code required to render the template. This reduces the number of method calls, speeding up the already time-consuming rendering process.

When Rails needs to render a view, it takes the specified template and passes it to the proper template handler. The handler returns a string that contains the code that renders the template. The Template class combines this code with other code, then evals the resulting string.
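The idea of returning code instead of output can be tried with nothing but Ruby’s standard library, since ERB#src returns the Ruby code that ERB generates for a template. The sketch below is a simplification under stated assumptions: ToyErbHandler and evaluate are hypothetical names, and a plain eval stands in for what Rails does, which is to compile the code into a method.

```ruby
require "erb"

# A toy handler: like a Rails template handler, `call` returns a String
# of Ruby code rather than the rendered output.
class ToyErbHandler
  def call(template_source)
    ERB.new(template_source).src
  end
end

# Stand-in for the step where the compiled code is evaluated.
# Each entry in `local_assigns` becomes a local variable for the template.
def evaluate(code, local_assigns = {})
  scope = binding
  local_assigns.each { |key, value| scope.local_variable_set(key, value) }
  scope.eval(code)
end

code = ToyErbHandler.new.call("Hello, <%= name %>!")
html = evaluate(code, name: "merchant")
puts html
```

Because the handler’s result is just a string of code, it can be wrapped in more code before being evaluated, which is what makes the approach composable.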

For example, the ERB-Liquid-Markdown renderer has a call method like the following:

def call(template)
  compiled_source = erb_handler.call(template)
  "Brodie::Handlers::Content.render(begin;#{compiled_source};end, local_assigns)"
end

Brodie first renders the ERB present in the template’s content with the existing ERB handler that comes with Rails. Brodie then returns a string of code which calls the “render” method on itself. That render method is shown next:

def render(source, local_assigns = {})
  markdown.call(
    liquid.call(source, local_assigns)
  ).html_safe
end

Here is where the actual rendering of the Liquid and Markdown occurs. When this code is eval’ed, the local_assigns parameter carries the variables passed in when rendering a view. This is how variables are magically passed from Rails controllers into views.
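The ordering of the two steps can be illustrated with stand-ins. In the sketch below, the liquid and markdown lambdas are hypothetical toys that handle only {{ variable }} substitution and "# Heading" lines; they are not the Liquid or Redcarpet libraries, and the html_safe call is left out because it comes from Rails.

```ruby
# Hypothetical stand-in for the Liquid step: replaces {{ key }} with
# the matching value from assigns.
liquid = lambda do |source, assigns|
  source.gsub(/\{\{\s*(\w+)\s*\}\}/) { assigns.fetch(Regexp.last_match(1).to_sym).to_s }
end

# Hypothetical stand-in for the Markdown step: handles only "# Heading"
# lines and plain paragraphs.
markdown = lambda do |source|
  source.each_line.map do |line|
    text = line.strip
    text.start_with?("# ") ? "<h1>#{text[2..]}</h1>" : "<p>#{text}</p>"
  end.join("\n")
end

# Liquid runs first so that Markdown sees the final text.
render = ->(source, local_assigns = {}) { markdown.call(liquid.call(source, local_assigns)) }

puts render.call("# Welcome {{ shop }}\nThanks for visiting.", shop: "Snowdevil")
```

Running Liquid before Markdown means any text that Liquid produces, including text pulled in from variables, is still treated as Markdown.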

Left: The old Jekyll site. Right: The new Rails site. The Help Centre rebuild looks the same but has a completely new backend

It’s as straightforward as that to render ERB, Liquid, and Markdown together. The early days of Brodie were spent understanding the ins and outs of ActionView well enough to validate that this approach was sane and did not break in edge cases.

Further Reading

The current documentation is really limited when it comes to Templates and Template Handlers. I would suggest building a small template handler, setting breakpoints, and walking through the source. Here’s a great example of a template handler for Markdown.

Additionally, looking over the source code and comments is the best way to get an understanding of the ActionView internals. The main entry point into ActionView is the render method from TemplateRenderer. Template would be best to check out next as it concerns itself with actually rendering templates. Lastly, Handlers would be good to check out to see how Rails can register and fetch Template Handlers.

Keep Continuously Testing

One of the more powerful tools on my toolbelt is a red-green development cycle. This cycle of purposefully writing tests which fail (red), then writing the code required to make them pass (green), keeps my mind focused on the task at hand and saves me time.

For example, take the annual Advent of Code set of challenges: given a problem and example input with output, create working code which solves the problem for any given input.

Small challenges such as Advent of Code provide the perfect scenario for Test Driven Development (TDD). Given an Advent of Code challenge, I would always write a test which asserts some output. Since those tests would fail, I would then proceed to write code which returns the correct answer.

My optimal TDD coding setup. Clockwise, from left: source file, test file, automatic test output, interactive Ruby console.

I can’t emphasize enough the time saved by automatically running my tests whenever any file is saved in my working directory. It has paid off time and time again by sparing me the keystrokes of running tests manually. Saving my source or test file automatically causes my tests to start running in another terminal pane.

I accomplish this with two tools: ag and entr. These two shell tools used in combination allow for a powerful red-green development process.

ag (commonly known as the Silver Searcher, succeeded by rg (RipGrep)) is a code search program; run as ag -l, it lists files recursively. entr is a program which runs a given shell command whenever a file is modified. Piped together as ag -l | entr -c rails test, these two tools enable a TDD workflow: ag provides a recursive list of files, then entr watches those files and runs the provided command. In this case, the command to run every time is rails test. The -c flag tells entr to clear the console each time the command is rerun.

Download ag and entr

I would highly recommend trying out these two tools in combination the next time you find yourself in a TDD situation. They have certainly saved me a lot of time repeating keyboard commands. Even if you don’t do TDD regularly or at all, I would recommend trying them out.

Download ag from GitHub, or equivalently, rg (RipGrep) if you’re feeling up to the challenge. entr can also be fetched from GitHub.

For a quicker install

brew install the_silver_searcher entr on macOS.

apt-get install silversearcher-ag entr on Ubuntu-like systems.

Why “&” doesn’t actually break your HTML URLs

Writing tests for some code which generated HTML ended up surfacing one peculiarity with how HTML encodes URLs. The valid URL https://example.com?a=b&c=d would always get modified when inserted into HTML like so: <a href="https://example.com?a=b&amp;c=d">foo</a>.

One of my teammates asked about this during a code review: why was the & character converted to &amp; in the resulting HTML? The URL didn’t look right, since the &amp; looked like it would break the query string.

Even more confusing was that the URL in the HTML still worked, since Google Chrome and other browsers converted it from its &amp; form back to &. Were the browsers just being helpful by handling these developer mistakes, much like they already do when closing missing HTML elements?

The fake bug hunt

After two hair-pulling days of reading GitHub issues, StackOverflow, HTML standards, source code, and more, a clear divide in understanding emerged: one group of people saw this as a bug in their library of choice, and another group understood that it wasn’t a bug.

I was definitely in the former group until I finally found a helpful blog post clearing up the confusion. A StackOverflow answer also concisely summed up why this is in a few quick sentences.

Simply stated, lone & characters in HTML are invalid and must be escaped to &amp;.

HTML and XML both descend from SGML and share some of its rules. One of those rules is that & starts a character reference (e.g. &amp;, &lt;), so a literal & has to be written as &amp;.

The confusion arises when people don’t know that this rule exists. Many, including myself, were blaming it on HTML parsing libraries such as Nokogiri and libxml2. Others blamed their web app of choice, believing it sent invalid HTML or XML that their HTML parser didn’t know how to deal with.

Conclusion

Another way of understanding the same problem is that a URL has its own rules about which characters must be encoded, and HTML has a different set of encoding rules. When a URL is embedded in HTML, HTML’s escaping rules are applied on top, so the URL may look invalid in the raw markup. This can lead to funky looking URLs, but rest assured that an HTML parsing library or a browser will properly encode and decode any data stored within HTML.

This explains why our browsers see &amp; in the raw HTML and know to convert it back to &. This also confirms that it is completely fine seeing &amp; characters in tests comparing HTML.
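The round trip can be reproduced with Ruby’s standard library. This is a minimal sketch: CGI.escapeHTML and CGI.unescapeHTML are standard library calls, while the regular expression that pulls out the href value is only a stand-in for a real HTML parser.

```ruby
require "cgi"

url = "https://example.com?a=b&c=d"

# Writing the URL into an attribute escapes & as &amp;.
escaped = CGI.escapeHTML(url)
html = %(<a href="#{escaped}">foo</a>)
puts html

# An HTML parser (or browser) reverses the escaping when reading the
# attribute, recovering the original URL. The regex is a stand-in.
attribute_value = html[/href="([^"]*)"/, 1]
decoded = CGI.unescapeHTML(attribute_value)
puts decoded == url
```

A test that compares raw HTML strings should therefore expect &amp;, while a test that compares parsed attribute values should expect a plain &.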

Twenty-four!

It’s hard to come up with the content for this post while fending off a sickness, but I know it’s a yearly ceremony for me to look back and reflect on the prior year. As always, what better time to do this than on my birthday!

One word can really describe my primary focus over the past year: Career. This time last year I was just about to pass the 90 day mark at Shopify.

Whether it’s been building close and trustworthy friendships with teammates and other colleagues, levelling up new hires through mentoring, or continually delivering impactful work, this year has been nothing short of exemplary.

Let’s get right into things! Since this time last year I moved to downtown Ottawa and am now living without roommates. Crazy to think that it only happened 10 months ago since it feels like forever, but I am enjoying all the perks of having no roommates and downtown life.

This summer was one of my most active to date. There was always something going on during the week or weekend from July straight through August. I had the opportunity to travel with a friend to his home town of Fredericton, New Brunswick. It was my first time on the east coast, and I was expecting an east-coast accent out of everyone, but the place seemed more like Ontario than not. It was a great time hanging out with his friends and attending a party at the local hotel.

I had a great time with friends at two music festivals: Bluesfest and Escapade. Escapade was especially fun since there were a number of great acts: Alesso, Tchami, Zedd, and Kaskade. One private festival I went to was about 15 people camping at a buddy’s lakeside property in Quebec. A DJ booth was set up and the trance music went on late into the night.

Being so close to downtown has its perks – I was within walking distance to both festivals.

Another bunch of cool moments centred around exploring other Shopify offices. Barrel Hall in Waterloo was the coolest looking since it was once a distillery. It still has all of the characteristic aging barrels and wooden structure. Montreal’s office has the best artwork, and Montreal looks like the most liveable city. Toronto’s offices and the city in general were a grand party since it’s mostly new to me, but I have friends, family, and colleagues everywhere.

Barrel Hall, Shopify’s Waterloo Office, definitely had the most character.

Even though I didn’t do as much cycling as last year, this year I took advantage of Ottawa’s Sunday Bikedays. On Sundays throughout the summer certain parkways were closed for the morning. This allowed for coasting down some long stretches of smooth roadway. The midpoint of some of these outings was spent taking a break at a local brewpub. Some friends joined me every once in a while too!

One of the many destinations – where the Rideau Canal meets the Ottawa River.

Investing and personal finance have become a hobby of mine, and ever more important as I get older. It’s better to learn the do’s and don’ts of personal finance earlier rather than later. A year of listening to related podcasts, plenty of reading, and managing my savings has enabled me to go from zero to pretty competent. I’m lucky to have a representative group of people to bounce ideas and plans off of.

I went to my first conference, BSides Ottawa, which was quite fun. I met a number of colleagues and played in my first capture the flag event. I found out that I can defend, but am not too good at attacking. I’ll try again to attend this year!

December was when a few others and I started a rewrite of Shopify’s Help Centre. Unknown to us, there was quite a lot of feature creep, ranging from “that one little feature that’s existed forever” to adding multiple language support. This resulted in the project taking seven months, but we’re glad to have done it. Throughout the process we started and built our own kick-ass team. When the rewrite shipped it went off without a hitch! 🎇 Now all of our current projects hinge on the benefits that this rewrite brought.

Some of the team which traveled to Montreal to launch the Help Centre.

I attended a few training sessions that should benefit my career – Visualizing Software Architecture with the C4 Model, as well as Agile Scrum training. The latter has definitely transformed my team for the better.

There were a lot of work events, planned or unplanned, official or unofficial, which I’m pretty grateful to have experienced with friends and colleagues. Alas, there are too many to mention.

For example: that one time we had a marching band…

Here’s to another year of learning, growth, exploration, and good times.

Hunting for segmentation faults in Ruby programs

I was working on building a content management engine for Shopify’s next generation Help Centre. Code named Brodie, it was equivalent to the Jekyll static site generator added to Ruby on Rails, but instead of rendering all the pages up front at compile time, each page is generated when it is requested by the client.

Brodie used a Ruby Gem called Redcarpet for the Markdown rendering. Redcarpet worked wonderfully, but Brodie ended up having a severe bug due to the extensive usage of it. The way Redcarpet was being used in Brodie resulted in periodic segmentation faults (segfaults) while rendering Markdown. These segfaults were causing many 502 and 503 errors when some unknown pages were being visited. It was such an issue that all the web servers in the cluster would go down for some time until they restarted automatically.

How do I Redcarpet?

To better explain the issue and its resolution, it is best to have an understanding of how Redcarpet, and really any other text renderer works. Here is a simple example:

In the above example, the code defines the Markdown that is to be rendered to HTML, sets up the Redcarpet::Markdown configuration object, and then finally parses and renders the Markdown to HTML.

But wait! There’s more. Jekyll and Brodie both use the Liquid language (made by Shopify!) to make it easier to write and manage content. Liquid provides control flow structures, variables, and functions. One useful function allows including the contents of other files into the current file (the equivalent of partials in Rails). Here is an example that uses the Liquid include function:

As we can see in the example above, the code renders the Liquid and Markdown to HTML. This is achieved by rendering the Liquid first, then passing the result of that into the Markdown renderer. Additionally, the Liquid include function injected the contents of _included.liquid exactly where the include function was called in main.md.

Now that the basics of Markdown and Liquid rendering have been explained, it is now possible to understand this segfault issue.

“Where is this segfault coming from???”

When my team and I were close to launching the new Help Centre that used Brodie, the custom-built Liquid and Markdown rendering engine, the app would crash due to segmentation faults. When the servers were put under load with many requests coming in, the segfaults and resulting downtime were magnified. It was clear from load testing that a small amount of traffic would bring down the entire site and keep it down.

The segfaulting would lead to servers becoming unavailable until Kubernetes, the cluster manager, detected that those servers were unhealthy and restarted them. It took 30-60 seconds for a pod to come back online. With the system under load, it was only a couple of minutes before all the servers in the cluster were down. When this happened, the app returned HTTP 502 and 503 errors to any client requesting a page, which is never a good sign.

The only message that was present in the logs before the app died was the following:

Assertion failed: (md->work_bufs[BUFFER_BLOCK].size == 0), function sd_markdown_render, file markdown.c, line 2544.

Apparently, Ruby crashed in a random Redcarpet C function call. No sort of stacktrace or helpful logging followed this message. The logs did not even include which page the client requested, as the usual Rails request logging was created after the HTTP request finished. This Assertion failed message was a lead, but didn’t help much since it does not reference what caused it.

I have dealt with other Redcarpet issues in the past, where methods that have been extended in Redcarpet to add custom behaviour have thrown exceptions. Sometimes these exceptions have caused the request to fail and a stacktrace of the issue to show up. Other times it has resulted in a segfault with a similar Redcarpet C function in the message. Ultimately, writing better code fixed this earlier situation.

My intuition told me that an error was being thrown while rendering the page, causing this segfault to occur. I attempted an experiment where I added some rescue blocks to the Redcarpet methods that we extended. This would prevent the potential exceptions from being raised in the buggy code that was causing it, hopefully resulting in no segfaults. If that fix succeeded, I could safely assume that fixing the code which raised the error would be the end of the story.

This experiment was shipped to production. Things went well until the next day. Sometime overnight the page that caused the segfaults was hit, and the operational dashboards recorded the cluster going down and rebooting. At least this confirmed that the Redcarpet extensions were not at fault.

Getting lucky

While playing around, I found a page, out of sheer luck, that could cause the app to segfault repeatedly. Visiting this page once did not cause the server to crash or the response to 500, but refreshing the page multiple times did cause the server to crash. Since this app was running multiple threads in the local development and production environments to answer requests in parallel, it was possible that a shared Redcarpet data structure was getting clobbered by multiple threads writing to it at the same time. This is actually a recurring issue according to the community:

Recursive rendering

While discussing the issue further with my larger team of developers, the idea came up of removing any sort of cross-thread sharing of Redcarpet’s configuration object. One of the other developers shipped a PR which gave each thread its own Redcarpet configuration object, but this did not end up fixing the problem.

A tree, showing the order in which nodes are traversed using the depth-first search algorithm. (CC 3.0)

Building on top of this developer’s work, I knew that it was possible for the Redcarpet renderer to be called recursively based on the nature of the Liquid and Markdown content files. As described earlier, it is possible for one content file to include another content file. As we saw in the examples earlier in this article, when a content file is being rendered, the rendering pauses on that content file and descends into the included file to render it, then returns to where it left off in the original content file. This behaviour is exactly like the depth-first search algorithm from graph theory.
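The depth-first behaviour can be seen in a toy version of include expansion. Everything in this sketch is a hypothetical stand-in: the in-memory CONTENT_FILES hash replaces real content files, and a regular expression replaces Liquid’s own parsing of the include tag. There is no cycle detection, so a file that includes itself would recurse forever.

```ruby
# Hypothetical in-memory content files keyed by name.
CONTENT_FILES = {
  "main.md" => "Intro {% include 'a.md' %} Outro",
  "a.md" => "[A {% include 'b.md' %}]",
  "b.md" => "(B)"
}.freeze

# Matches a simplified include tag; a stand-in for Liquid's parser.
INCLUDE_TAG = /\{%\s*include\s+'([^']+)'\s*%\}/

# Expands a file, descending into each include before continuing with
# the rest of the parent file (depth-first). `visited` records the order.
def expand(name, files, visited = [])
  visited << name
  files.fetch(name).gsub(INCLUDE_TAG) { expand(Regexp.last_match(1), files, visited) }
end

order = []
puts expand("main.md", CONTENT_FILES, order)
puts order.join(" -> ")
```

The important detail is that the parent file is still mid-render while each child is rendered, so whatever does the rendering is entered again before the first call has returned.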

After making this breakthrough it was simple to understand what to try next: each time Redcarpet is called to render some Markdown, always create a new Redcarpet configuration object. This should solve the issue of multiple thread writes, as well as the recursive writes. Even though there is extra overhead in creating a new Redcarpet configuration object each time a content file is rendered, it is a reliable workaround that bypasses Redcarpet’s single-thread, single-writer limitation.
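The difference between sharing one renderer and creating a fresh one per render can be sketched with a toy class. ToyRenderer is hypothetical and is not Redcarpet: it raises a Ruby exception when re-entered, whereas the real failure was a C-level assertion and a segfault. The SNIPPETS hash and the "include:NAME" line syntax are also made up for the example.

```ruby
# Hypothetical renderer that cannot be re-entered while it is mid-render.
class ToyRenderer
  class BusyError < StandardError; end

  def initialize
    @in_use = false
  end

  # Upcases plain lines; yields the name on "include:NAME" lines so the
  # caller can render the included content.
  def render(text)
    raise BusyError, "renderer is already mid-render" if @in_use

    @in_use = true
    begin
      text.each_line(chomp: true).map do |line|
        name = line[/\Ainclude:(.+)\z/, 1]
        name ? yield(name) : line.upcase
      end.join("\n")
    ensure
      @in_use = false
    end
  end
end

SNIPPETS = { "main" => "hello\ninclude:partial", "partial" => "world" }.freeze

# Shared instance: the nested render re-enters the same object.
def render_shared(name, renderer)
  renderer.render(SNIPPETS.fetch(name)) { |child| render_shared(child, renderer) }
end

# Fresh instance per call: every render, nested or not, has its own state.
def render_fresh(name)
  ToyRenderer.new.render(SNIPPETS.fetch(name)) { |child| render_fresh(child) }
end

begin
  render_shared("main", ToyRenderer.new)
rescue ToyRenderer::BusyError => e
  puts "shared renderer failed: #{e.message}"
end

puts render_fresh("main")
```

A renderer per thread would not help in this toy either, because the nested call happens on the same thread as the outer one.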

I coded and shipped this fix, and it worked!

Refreshing that problematic page multiple times, no matter how many times, never crashed the app. The production servers were back to handling their original capacity and one developer was feeling very relieved.

Takeaways

I learned a considerable number of things from this debugging experience. Even when using battle-tested software (like Redcarpet), there may be use cases which are not supported, or which are not documented as unsupported. Additionally, the Redcarpet library is now rarely maintained. Knowing the limitations up front can save time and frustration. One of the main reasons why this article was written was that there was no other writing about this issue and the workarounds. Hopefully it will help save time for developers in the future who run into similar issues.

It was valuable to bounce ideas off of other team members. If I had not put out my ideas and had these discussions, I would not have understood the problem as well as I did. Even the potential fix that a teammate of mine shipped but did not end up working helped me understand the problem better.

Drawing out parts of the control flow on paper to really understand how the app renders content files builds a better mental model of what actually goes on inside the app. It is one thing to have a high level overview of how different components interact with each other, but it is an entirely different level of understanding to factually know what exactly happens. This can be extended to the intricacies of the software libraries being used. In this situation, knowing the internals and behaviour of Ruby on Rails, Liquid, and Redcarpet made it a lot easier to understand what was going on.

Lastly, you always feel like a boss when you fix big, complicated problems such as this one.

Twilio’s TaskRouter Quickstart

My team and I are exploring different services and technologies in the area of contact centres. We develop and maintain the tools for over 1000 support agents, with the number rapidly rising. Making smart, long-term business and technology decisions is paramount. One of the technologies we looked into was Twilio and its ecosystem, specifically TaskRouter.

Twilio’s TaskRouter provides a clean interface for building contact centres. Its goal is to take the tedious infrastructure and plumbing work out of building a custom contact centre, exposing the right APIs to implement domain logic. TaskRouter is a high-level service since it orchestrates voice, SMS, and other communication channels with the ability to assign incoming interactions across a workforce of agents ready to take those interactions.

Twilio-Ruby

To get a head start at understanding how TaskRouter works, I spent a day looking at Twilio’s Ruby quickstart guide for TaskRouter. Wow, was I in for a frustrating time.

The quickstart guide takes the reader through a number of steps, both inside of the Twilio Console as well as building a small Ruby Sinatra app. After completing the quickstart the reader should have a fully functioning call centre with an interactive voice response (IVR) to greet and queue any user that calls in.

One thing that made the quickstart harder to complete was that the Ruby code examples included throughout used an older version of the twilio-ruby gem. Because of this, the code examples didn’t work with the latest version. This was both a bad and a good thing: bad in that the existing code examples wouldn’t work out of the box, but good in that I had to put extra effort into learning where the docs and other sources of help live, and came away with a deeper understanding of how the Twilio API works.

I compiled a list of resources that would assist anyone going through the same or a similar situation. It certainly helped me complete the TaskRouter quickstart.

  • The README for the twilio-ruby gem provided a great overview of what functionality it provides and how the gem is to be used
  • The v4 to v5 upgrade guide for the twilio-ruby gem showed that there was some sense to this chaos by providing the rationale and examples for updating old versions of the twilio-ruby code to the latest (v5). This was where I had my moment of understanding for the quickstart code examples.
  • Using JWT tokens was part of the last section of the quickstart. Since twilio-ruby changed the way it uses tokens, its code examples had to be updated too. The main Twilio docs on JWT go into the intricacies of building the policies contained within JWT tokens.
  • My lead/manager was quite happy when I mentioned to him that the twilio-ruby gem no longer uses title case in places where camel case or snake case would better suit Ruby style. TwiML was affected by this for a number of gem versions up until v5. Since TwiML is used frequently throughout the quickstart, the docs for using TwiML in twilio-ruby helped during those times.
  • Lastly, if all else fails, feel free to reference my resulting code from the TaskRouter quickstart. It’s available here on GitHub.

How Does Symmetric and Public Key Encryption Work?

With the release of Rails 5.2 and the changes to how secrets are securely stored, I thought it would be timely to write about the benefits and downsides of secrets management in Rails. It would be valuable to compare how Rails handles secrets, how Shopify handles secrets, and a few other methods from the open source community. On my journey to write about this I got caught up in explaining how symmetric and public key encryption work. So the comparison of different Rails secret management gems will have to wait for another post.

Managing secrets is now more challenging

A majority of applications created these days integrate with other applications – whether it’s for communicating with other business-critical systems, or purely operational such as log aggregation. Secrets such as usernames, passwords, and API keys are used by these apps in production to communicate with other systems securely.

The configuration management movement, and later the DevOps movement, popularized a wide array of methodologies and tools for managing secrets in production. Moving from a small, artisanal, hand-crafted set of long-running servers to the modern short-lifetime cloud instance paradigm now requires the discipline to manage secrets securely and repeatably, with the agility to revoke and update credentials in a matter of hours if not minutes.

While there are many ways to handle secrets while developing, testing, and deploying Rails applications, it’s important to bring up the benefits and downsides of the different methods, particularly around production. Different technologies offer different levels of security, usability, and adoption. Public/private key encryption, of which RSA is a well-known example, is one of these technologies. Symmetric key encryption is another common one.

There exist many ways to handle secrets within Rails and webapps in general. It’s important to understand the underlying concepts before settling on one method or another because making the wrong decision may result in secrets being insecure, or the security being too hard to use.

Let’s first discuss the different types of encryption that are characteristic of the majority of secret management libraries and products out there.

Symmetric Key Encryption

Symmetric key encryption may be the simplest form of encryption to understand, but don’t let that trick you into thinking that it’s not secure. Symmetric key encryption involves one key used to both encrypt and decrypt data. This key will have to be kept secret and only be shared with trusted people and systems. Once secrets are encrypted with the key, that encrypted data can be readily shared and transferred without worry of the unencrypted data being read.

Symmetric key encryption can be explained with a simple example. The most straightforward method uses the binary XOR function. (This example is not representative of the state-of-the-art symmetric key encryption algorithms in use, but it does get the point across.) The binary XOR function means “one or the other, but not both”. Here is an example that shows the complete set of inputs and outputs for one binary digit:

1 XOR 1 = 0
1 XOR 0 = 1
0 XOR 1 = 1
0 XOR 0 = 0

A more complicated example would be:

10101010 XOR 01010101 = 11111111
11111111 XOR 11111111 = 00000000
11111111 XOR 01010101 = 10101010

Note that lines 1 and 3 are related. The output of line 1 is the first input of line 3, and the second input of line 1 is reused as the second input of line 3. Notice that the output of line 3 is the same as the first input of line 1. As demonstrated here, XORing a value with a second input and then XORing the result with that same second input returns the original value. A further example will show this property.
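The round trip between lines 1 and 3 can be checked in a few lines of Ruby. This is only a sketch: the helper name xor_bits is made up for illustration, and it works on strings of binary digits so the output lines up with the example.

```ruby
# Hypothetical helper: XOR two equal-length strings of binary digits.
def xor_bits(a, b)
  # String#to_i(2) parses base 2; ^ is Ruby's bitwise XOR operator.
  result = a.to_i(2) ^ b.to_i(2)
  # Pad with leading zeros so the result keeps the original width.
  result.to_s(2).rjust(a.length, "0")
end

scrambled = xor_bits("10101010", "01010101") # line 1 of the example
restored  = xor_bits(scrambled, "01010101")  # line 3 of the example

puts scrambled # 11111111
puts restored  # 10101010
```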

Given the property that any higher form of data representation can be broken down to binary, we can then show the example of hexadecimal digits being XOR’ed with another parameter.

12345678 XOR deadbeef = cc99e897

Given the key is the hexadecimal characters deadbeef and the data to be encrypted is 12345678, the result of the XOR is the incomprehensible cc99e897. Guess what? This cc99e897 is encrypted. It can be saved and passed around freely. The only way to get the secret input (i.e. 12345678) back is to XOR the encrypted value again with the key deadbeef. Let’s see this happen!

cc99e897 XOR deadbeef = 12345678

Fact check it yourself if you don’t believe me, but we just decrypted the data! This is the simplest example of course, so there’s a lot more that goes into symmetric key encryption that keeps it secure. Things like block-based, and stream-based algorithms, and even larger key sizes augment the simple XOR algorithm to make it more secure. It may be simple for someone who wants to break the encryption to guess the key in this example, but it becomes much harder the longer the key size is.
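For anyone who does want to fact check it, here is a minimal Ruby sketch of the same encrypt and decrypt steps. The method names are invented for this example, and the xor_bytes helper (which repeats a short key over a longer string) is an assumed extension that is not part of the example above. Like the example itself, it illustrates the idea and is not something to protect real secrets with.

```ruby
KEY = 0xdeadbeef

# XOR is its own inverse, so one method both encrypts and decrypts.
def xor_with_key(value, key = KEY)
  value ^ key
end

# Hypothetical extension to strings: XOR each byte of the data with the
# matching byte of the key, repeating the key as often as needed.
def xor_bytes(data, key)
  key_bytes = key.bytes
  mixed = data.bytes.each_with_index.map do |byte, i|
    byte ^ key_bytes[i % key_bytes.length]
  end
  mixed.pack("C*")
end

ciphertext = xor_with_key(0x12345678)
puts ciphertext.to_s(16)               # cc99e897
puts xor_with_key(ciphertext).to_s(16) # 12345678

secret = xor_bytes("api-key-123", "deadbeef")
puts xor_bytes(secret, "deadbeef")     # api-key-123
```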

This is what makes symmetric key encryption so powerful – the ability to encrypt and decrypt data with a single key. With this property comes the need to keep this single key secret and separate from the data. When symmetric key encryption is used in practice, the fewer people and systems that have the key, the better. Humans can easily lose the key, leave jobs, or worse: share the key with people of malicious intent.

Public Key Encryption

Quite opposite to how symmetric key encryption works, public key encryption (also called asymmetric key encryption, with RSA being a widely used algorithm) uses two distinct keys. In its simplest form the public key is used for encryption and the private key is used for decryption. This separates the ability to encrypt data from the ability to decrypt it. Put plainly, it allows anyone to encrypt data with the public key while the owner of the private key is the only one able to decrypt the data. The public key can be shared with anyone without compromising the security of the encrypted data.

One tradeoff between symmetric and public key encryption is that the private key (the key used to decrypt data) is never shared with other parties, whereas in symmetric key encryption the same shared key is used to both encrypt and decrypt. A downside of public key encryption is that there are multiple keys to manage, which brings more overhead than symmetric key encryption.

Let’s dig into a simple example. Given a public key (n=55, e=3) and a private key (n=55, d=27) we can show the math behind public key encryption. (These numbers were fetched from here).

Encrypting

To encrypt data the function is:

c = m^e mod n

Where m is the data to encrypt, e is part of the public key, mod is the modulus function, n is the shared modulus, and c is the encrypted data.

For the number 42 to be encrypted we can plug it into the formula quite simply:

c = 42^3 mod 55
c = 3

c = 3 is our encrypted data.

Decrypting

Decrypting takes a similar route. For this a similar formula is used:

m = c^d mod n

Where c is the encrypted data, d is part of the private key, mod is the modulus function, n is the shared modulus, and m is the decrypted data. Let’s decrypt the encrypted data c = 3:

m = 3^27 mod 55
m = 42

And there we have it, our decrypted data is back!

As we can see, a separate key is used for encryption and decryption. It’s worth restating that this example here is very simplified. Many more mathematical optimizations, and larger key sizes are used to make public key encryption secure.
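The two formulas can be tried out in Ruby, where Integer#pow takes an optional modulus. This is a sketch using the toy key pair from the example. The method names are made up, and the factorization 55 = 5 × 11 used in the final check is not stated above; it is simply how this particular modulus happens to break down.

```ruby
# Toy key pair from the example. Real keys are thousands of bits long.
N = 55 # shared modulus
E = 3  # public exponent
D = 27 # private exponent

# c = m^e mod n. The message must be smaller than the modulus.
def encrypt(message, e = E, n = N)
  message.pow(e, n)
end

# m = c^d mod n
def decrypt(ciphertext, d = D, n = N)
  ciphertext.pow(d, n)
end

ciphertext = encrypt(42)
puts ciphertext          # 3
puts decrypt(ciphertext) # 42

# Why these exponents pair up: e * d leaves a remainder of 1 when divided
# by (5 - 1) * (11 - 1) = 40.
puts((E * D) % 40)       # 1
```

Only e and n are needed to encrypt, so those are the values that can be published. Decrypting requires d, which stays with the key owner.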

Signing – a freebie with public key encryption

Another benefit to using RSA public and private keys is that given the private key is only held by one user, that user can sign a piece of data to verify that it was them who actually sent it. Anyone who has the matching public key can verify that the data was signed by the private key and that the data was not tampered with during transit.

When Bob needs to receive data from Alice and Bob needs to be sure it was sent by Alice, as well as not tampered with while being sent, Alice can hash the data and then encrypt that hash with her private key. This encrypted hash is then sent along with the data to Bob. Bob can then use Alice’s public key to decrypt the hash and compare it to a hash of the data that he performs. If both of the hashes match, Bob knows that the data was truly from Alice and was not tampered with while being sent to him.
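The exchange between Alice and Bob can be sketched with the OpenSSL bindings that ship with Ruby. The method names, the 2048-bit key size, and the choice of SHA-256 are assumptions made for this example. Here the library takes care of hashing the data and padding the hash before the private key is applied.

```ruby
require "openssl"

# Alice signs: the data is hashed and the hash is transformed with her
# private key.
def sign(private_key, data)
  private_key.sign(OpenSSL::Digest.new("SHA256"), data)
end

# Bob verifies with Alice's public key. This returns true only when the
# signature matches this exact data.
def valid_signature?(public_key, data, signature)
  public_key.verify(OpenSSL::Digest.new("SHA256"), signature, data)
end

alice_key = OpenSSL::PKey::RSA.new(2048) # private key, kept by Alice
alice_public_key = alice_key.public_key  # safe to hand to Bob

message = "transfer 10 units to Bob"
signature = sign(alice_key, message)

puts valid_signature?(alice_public_key, message, signature)                     # true
puts valid_signature?(alice_public_key, "transfer 99 units to Bob", signature) # false
```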

Wrapping up

To pick one method of encryption as the general winner at this abstract level is nonsensical. It makes sense to have a use case and pick the best encryption method for it by finding the best fit at the abstract level first, then finding a library which offers that method of encryption.

A following post will go into the tradeoffs between different encryption methods in relation to keeping secrets in Ruby on Rails applications. It will take a practical approach, explaining some of the benefits of one encryption method over another, and then give some examples of well-known libraries for each category.

Parallel GraphQL Resolvers with Futures

My team and I are building a GraphQL service that wraps multiple RESTful JSON services. The GraphQL server connects to backend services such as Zendesk, Salesforce, and even Shopify itself.

Our use case involves returning results from these backend services all from the same GraphQL query. When the GraphQL server goes out to query all of these backend services, each backend service can take multiple seconds to respond. Queries that take many seconds to complete make for a terrible experience.

Since we’re running the GraphQL server in Ruby, we don’t get the nice asynchronous IO that would come with the NodeJS version of GraphQL. Because of this, the GraphQL resolvers run serially instead of in parallel. A GraphQL query to five backend services that each take one second to respond will therefore take five seconds to run.

For our use case, having a GraphQL query that takes five seconds is a bad experience. What we would prefer is 2 seconds or less. This means performing some optimizations when GraphQL goes to do the HTTP requests to the backend services. Our idea is to parallelize those HTTP requests.

First Approaches

To parallelize those HTTP requests we took a look at non-blocking HTTP libraries, different GraphQL resolvers, and Ruby concurrency primitives.

Typhoeus

Knowing that running the HTTP requests in parallel is the direction to explore, we first took a look at the Ruby library Typhoeus. Typhoeus offers a simple abstraction for performing parallel HTTP requests by wrapping the C library libcurl. Below is one of the many possible ways to use Typhoeus.

After playing around with Typhoeus, we quickly found out that it wasn’t going to work without extending the GraphQL Ruby library. It became clear that it was nontrivial to wrap a GraphQL resolver’s life cycle with a Hydra from Typhoeus. A Hydra is basically a Future that runs multiple HTTP requests in parallel and returns when all requests are complete.

Lazy Execution

We also took a look at GraphQL Ruby’s lazy execution features. We had a hope that the lazy execution would automatically optimize by running resolvers in parallel. It didn’t. Oh well.

We also tried a perverted version of lazy execution. I can’t remember why or how we came up with this method, but it was obviously overcomplicated for no good reason and didn’t work 😆

Threads and Futures

We looked back and understood the shortcomings of the earlier methods – namely, we had to find a concurrency method that would allow us to do the HTTP requests in the background without blocking the main thread until it needed the data. Based on this understanding we took a look at some Ruby concurrency primitives – both Futures (from the Concurrent Ruby library), and Threads.

I highly recommend using higher-order concurrency primitives such as Futures because of their well-defined and simple APIs, but for hastily hacking something together to see if it would work I experimented with Threads.

My teammate ended up figuring out a working example of Futures faster than I could hack my threads example together. (I’m glad they did, since we’ll see why next.) Here is a simple use of Futures in GraphQL:

It’s not clear at first, but according to the GraphQL Ruby docs, any GraphQL resolver can return the data or can return something that can then return the data. In the code example above, we use the latter by returning a Concurrent::Future in each resolver, and having the lazy_resolve(Concurrent::Future, :value!) in the GraphQL schema. This means that when a resolver returns a Concurrent::Future, the lazy_resolve part tells GraphQL Ruby to call :value! on the future when it really needs the data.

What does all of this mean? When GraphQL goes to fulfill a query, all the resolvers involved with the query quickly spawn Futures that start executing in the background. GraphQL then moves to the phase where it builds the result. Since it now needs the data from the Futures, it calls the potentially blocking operation value! on each Future.

The beautiful thing here is that we don’t have to worry about whether the Futures have finished fetching their data yet. This is because of the powerful contract we get with using Futures – the call to value! (or even just value) will block until the data is available.
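The contract can be seen without any gems. The sketch below is a stand-in, not the code we shipped: SimpleFuture is a made-up class that wraps a plain Thread to imitate the small part of Concurrent::Future that matters here, fetch_from fakes a slow backend with sleep, and resolve_all plays the role of GraphQL collecting the resolver results.

```ruby
# Stand-in for a future built on a plain Thread. The work starts as soon as
# the object is created, and value! blocks only if the work is unfinished.
class SimpleFuture
  def initialize(&work)
    @thread = Thread.new(&work)
    # Errors are surfaced by value! instead of being logged by the thread.
    @thread.report_on_exception = false
  end

  # Thread#value waits for the thread and returns the block's result,
  # re-raising any exception the block raised.
  def value!
    @thread.value
  end
end

# Hypothetical backend call: sleeps to simulate a slow HTTP request.
def fetch_from(service, delay)
  sleep(delay)
  "#{service} data"
end

def resolve_all(services, delay)
  # Phase 1: every "resolver" returns immediately with a future.
  futures = services.map { |name| SimpleFuture.new { fetch_from(name, delay) } }
  # Phase 2: building the result is where the blocking value! calls happen.
  futures.map(&:value!)
end

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
results = resolve_all(%w[zendesk salesforce shopify], 0.2)
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts results.inspect
puts format("took %.1fs", elapsed)
```

Because the three simulated requests wait at the same time, the whole call should take roughly as long as the slowest one (about 0.2 seconds here) rather than the sum of all three.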

Conclusion

We ended up settling on the last design: using Futures so that the main thread can push as much asynchronous work as possible into the background.

As seen through our thought process, all that we needed was to find a way to start execution of a long-running HTTP request, and give back control to the main thread as fast as possible. It was quite clear throughout the early ideas of utilizing concurrent HTTP request libraries (Typhoeus) that we were on the right track, but weren’t understanding the problem perfectly.

Part of that was not understanding the GraphQL Ruby library. Part of it was also being fuzzy on our concurrency primitives and libraries. Once we had taken a look at GraphQL Ruby’s lazy loading features, it became clear to us that we needed to kick off the HTTP request and immediately give back control to the GraphQL Ruby library. Once we understood this, the solution became clear and we became confident after some prototypes that used Futures.

I enjoyed the problem solving process we went through, as well as this writing that resulted from it. The problem solving process ended up teaching the both of us some valuable engineering lessons about collaborative, up-front prototyping and design since we couldn’t have achieved this outcome on our own. Additionally, writing about this success can help others with our direct problem, not to mention learning about the different techniques that we met along the way.

Zero to One Hundred in Six Months: Notes on Learning Ruby and Rails

When they say you’ll like learning and programming in Ruby, they really mean it. In my experience, learning and professionally using Ruby and Ruby on Rails day-to-day has been quite straightforward and friendly. The rate of learning Ruby and Rails is limited by how fast you’re able to obtain and use that knowledge from resources online, from books, or from other people.

It’s common for people who join Shopify to not know how to program in Ruby, yet be required to. Ruby’s community has produced a great many beginner to intermediate guides for newcomers to quickly get up to speed at programming in Ruby. At Shopify, since the feedback is so fast, you’re able to get into an intense virtuous cycle of learning. Since you’re able to code and ship fast, you’re able to learn faster.

From personal experience I found it quite useful to do a deep-dive into Rails before starting to learn the full Ruby language. I focused on reading the entire Agile Web Development with Rails 5 book, which consists of a short primer on Ruby, then the bulk being how to develop an online store using Rails, and lastly an in-depth look into each Rails module. I completed this book over the two weeks before starting at Shopify to give me a head-start at learning.

Roughly the first two months spent at Shopify were a split of working on small tasks by myself, pair-programming with others, and reading a number of Ruby and Rails articles. At the end of two months I found myself being able to take pieces of work from our weekly sprints and completing them to the end without feeling like I was slow, and not requiring a team member to guide me through the entire change.

Code reviews over GitHub on the changes that I and others have made provided a strong signal on how well my Ruby and Rails knowledge and style have progressed. Code reviews for my code at the start consisted of a lot of comments on style and better methods to use. As more and more code reviews were performed over time my intuition and knowledge increased, resulting in better code and fewer review comments. The bite-sized improvements gained in each code review slowly built up my knowledge and helped guide me towards areas of further learning.

Mastery of Ruby and Rails is gained over months to years of constant use. This is where the lesser-known to obscure language features are understood and put to use, or explicitly not put to use (I’m looking at you, metaprogramming!). Some examples are the unary & operator, bindings, and even Ruby internals such as object allocation and how Ruby programs are interpreted.
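As a small taste of one of those features, here is a sketch of the unary & operator. The method names are invented for the example.

```ruby
# &:upcase turns the symbol into a block via Symbol#to_proc, so this is
# equivalent to words.map { |word| word.upcase }.
def upcase_all(words)
  words.map(&:upcase)
end

# & also passes any existing Proc, lambda, or Method object as the block.
def apply_each(words, callable)
  words.map(&callable)
end

exclaim = ->(word) { "#{word}!" }

puts upcase_all(%w[ruby rails]).inspect          # ["RUBY", "RAILS"]
puts apply_each(%w[ruby rails], exclaim).inspect # ["ruby!", "rails!"]
```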

Coming from the statically-typed Java world, Ruby and Rails is INSANE with the things you can do since it is a dynamic language. My favourite dynamic language related features so far are Module#prepend for inheritance, and the ability to live-reload code.
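For readers who have not met Module#prepend, a minimal sketch looks like the following. Greeter and Shouting are hypothetical names used only for illustration.

```ruby
class Greeter
  def greet(name)
    "Hello, #{name}"
  end
end

# A prepended module sits in front of the class in the method lookup chain,
# so its greet runs first and can reach the original through super.
module Shouting
  def greet(name)
    super.upcase + "!"
  end
end

Greeter.prepend(Shouting)

puts Greeter.new.greet("Ruby")          # HELLO, RUBY!
puts Greeter.ancestors.first(2).inspect # [Shouting, Greeter]
```

The class can be reopened and changed at runtime like this without editing its original definition, which is the kind of thing that feels wild after Java.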

After a sufficient amount of time gathering knowledge and experience, you gain the ability to help others along their path of learning. This not only benefits their understanding, but it also reinforces your knowledge of the subject.

Some of the things I look forward to in the future are learning about optimizing Rails apps, demystifying metaprogramming, reading Ruby Under a Microscope, and digging into the video back-catalogue of Destroy All Software. I hope you have a good journey too!

What the hack? – or how my first capture the flag went

The 2017 BSides Security Conference, just outside of Ottawa, was a two-day event from October 5th to 6th. It was packed with talks, lock picking, and a capture the flag (CTF) competition. Pretty great for a free conference.

On the second day of the conference I decided to join one of the Shopify CTF teams since it looked like a ton of fun. Actually, I think the deep house playing 24/7 was what lured me into the dim and crowded CTF room of the conference centre. I subbed in for one of my friends on Shopify’s Red Team, a suitable name for Shopify’s second CTF team. Shopify’s first team was named the Blue Team.

I thought I knew what CTFs were all about – hacking challenges, they say. But I was completely unprepared. My so-called 10+ years of listening to the Security Now podcast didn’t exactly prepare me for the hands-on experience required for CTFs. It was quite the learning experience since most of the flags remaining on day 2 were difficult for a newbie to capture.

Having some background in security and hacking helps, though it doesn’t bridge the gap to the hands-on hacking experience and intuition required to solve CTF challenges. These challenges require experience and practice in thinking like an attacker.

For example, it’s one thing to understand that data can be hidden in images via steganography, but it’s another thing completely to actually extract the hidden data from an image.

Instead of wasting time on finding unknown flags, I focused on the topics I have experience with. Most of the flags I focused on were WEP and WPA cracking with aircrack-ng and its associated collection tools. I was not able to inject packets with my setup, but luckily some other competitors did the hard work for me. After a few hours of unsuccessful attempts to crack the Wi-Fi networks I conceded that my attempts weren’t working.

I moved on to a new flag that involved breaking into an old, exploitable version of Joomla. After asking for some help from a teammate we found a script on exploit-db that would raise privileges to admin for any user. After running the exploit it took me a bit to figure out that it ran successfully, since the flag was inside a Page that was locked for editing by someone else. The ‘locked for editing’ state didn’t allow reading the Page, but once I figured out that the Page had a context menu to unlock it, I was able to view the flag. That made me facepalm both at Joomla’s UI and my inability to figure that out sooner.

After a day filled with a couple of muffins, a few slices of pizza, and countless teas, the CTF concluded around 5pm. Winners were announced and thankfully our team didn’t fail too hard. I came out of the competition having met a bunch of colleagues from different parts of the company, and with a sense of what to expect in future CTFs. I’ll definitely be attending another CTF.

My team, the Red Team, placed somewhere around 5th or 6th. Not too bad for a team carrying a handicap like me. I’ve got to hand it to Shopify, they have some seriously talented Security folks! No wonder Shopify’s Blue Team came in first!