Nix-ify your environment

Over some vacation I put a bunch of effort into rebuilding my dotfiles and other environment configuration using home-manager, Nix, and the wealth of packages available in Nixpkgs. Previously, all of my system’s configuration was bespoke: unversioned files and random commands run to bring it to its current state. This worked fine for a number of years, but it had drawbacks, such as not being easily reproducible or portable to other systems.

Our developer acceleration team at Shopify is exploring the feasibility of Nix for solving a number of problems that come with supporting development environments for hundreds of software developers. Burke Libbey, who is spearheading much of the Nix exploration on the team, has produced a number of excellent resources, two of which are public and inspired me to look into Nix on my own and write this article: a number of Nix-related YouTube videos, and an article on the Shopify Engineering blog diving into what Nix is. I won’t go into detail about what Nix is in this article, as those resources cover it. Instead, I’ll focus on some learnings from my time switching to home-manager, using the Nix language, and using the Nix package manager.

home-manager

Home-manager is a program built on top of Nix which makes it simple for a user to manage their environment. It natively supports configuring a long list of applications, and has the flexibility to configure programs it doesn’t yet manage. Configuring home-manager is generally as simple as setting a number of key-value pairs in a file; home-manager then deals with installing, uninstalling, and configuring everything for you from a few simple commands. For example, here’s a simplified home-manager config along the lines of mine, which installs and configures a few packages and plugins (the specific packages and plugins below are illustrative):
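# A sketch of a home-manager config (illustrative, not my exact file).
{ config, pkgs, ... }:

{
  # Let home-manager install and manage itself.
  programs.home-manager.enable = true;

  # Standalone packages pulled from Nixpkgs.
  home.packages = [ pkgs.ripgrep pkgs.jq ];

  # fzf shell integration in two options.
  programs.fzf = {
    enable = true;
    enableZshIntegration = true;
  };

  # zsh, with an escape hatch for custom zshrc contents.
  programs.zsh = {
    enable = true;
    initExtra = builtins.readFile ./zshrc;
  };

  # neovim, with plugins installed by name from Nixpkgs.
  programs.neovim = {
    enable = true;
    plugins = with pkgs.vimPlugins; [ vim-surround fzf-vim ];
    extraConfig = builtins.readFile ./init.vim;
  };
}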

Here is my full home-manager config for reference.

Some of the biggest factors that sold me on home-manager were: easily versioning my environment’s configuration, installing neovim and all the plugins I use by specifying only the plugin names, integrating fzf into my shell with only two config options, getting zsh installed and configured with plugins, and lastly having an escape hatch to specify custom options in my zshrc and neovim config.

All of this configuration is now versioned, and any edits I make to my home-manager config or associated config files require only a single home-manager switch to update my entire environment. If I want to try out some new vim plugins, a different shell, or someone else’s home-manager configuration, I can safely modify my configuration and know that I can revert to the version I have stored in Git.

home-manager tips

I found the man page for home-manager greatly useful for seeing which configuration options exist, what they do, what types they take, and so on. It can be accessed via man home-configuration.nix. I would always have it open when modifying my home-manager configuration.

By default home-manager stores its configuration in ~/.config/nixpkgs/home.nix. Home-manager provides an easy way to jump right into editing this file: home-manager edit. Since this configuration file isn’t in the nicest of places, we can change its location and still have home-manager pick it up. The best way to configure this is to use home-manager to configure itself by setting programs.home-manager.path = "~/src/github.com/jonniesweb/dotfiles/home-manager/home.nix";, or wherever your configuration file exists. If needed, the HOME_MANAGER_CONFIG environment variable can be set to the same value to tell the home-manager command where the config exists if something goes wrong.
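For example, a line like the following in your shell’s rc file points the home-manager command at that config (the path is just the one from above):

# e.g. in ~/.zshrc
export HOME_MANAGER_CONFIG="$HOME/src/github.com/jonniesweb/dotfiles/home-manager/home.nix"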

In the switchover I challenged myself to change over from vim to neovim. This didn’t involve too much effort as my vim config needed a few updates to be compatible with neovim. A large amount of time was saved by the automatic install of the various vim plugins I use.

In the process I also moved away from oh-my-zsh to plain old zsh. A fair amount of time was spent understanding the different zsh shell options and which ones oh-my-zsh had been providing me. More time was spent configuring my shell’s prompt to use a plugin offering git information and its own theme. Oh-my-zsh does a fair amount of magic in the background for plugin and theme loading, but when looking at its source code, it’s actually incredibly simple.

A lot of languages, tools, and other dependencies were left out of my home-manager config since Shopify’s internal dev tools handle the majority of this for us on a per-project basis.

If you’re having home-manager manage your shell, don’t forget to set the xdg.enable = true; option in your config. Some programs depend on the XDG_*_HOME environment variables being present. I can see why this option isn’t enabled by default, as many operating systems may have values that differ from home-manager’s defaults.

My main development environment is on OS X and therefore differs from Linux in some areas. One of the projects I’m going to keep my eye on is nix-darwin, which appears to solve for macOS the problem that NixOS solves for Linux: complete system configuration.

Conclusion

Across the Docker, Snap, and Nix ecosystems, we’re going to see a steady increase in the number of companies and individuals using these technologies to explicitly define what software runs on their systems. Docker has already gained critical mass throughout enterprises, Canonical’s Snap packages are slowly picking up steam on Ubuntu-based systems, and Nix appears to be breaking into the scene. I’m rooting for Nix as it has a leg up on the others with its complete and explicit control over all the components which go into making up a program, or even a complete system. I’m excited to see how much it catches on.

Replacing the engine while driving: planning safety measures into the project

When a project is created to replace an existing dependency with another one, there are many factors to consider. If the risk of something going wrong is high, more caution needs to be taken. If a lot of effort is involved in adding the new dependency, it may make sense to incrementally ship the new dependency alongside the old one. Lastly, depending on the complexity, will a lot of work be done up front to integrate the new dependency, or will it be hacked in and then cleaned up after the launch?

These are some of the questions that can be asked to help guide the planning for a project which changes out one dependency for another. The same pattern can be used for similar types of work besides dependencies such as service calls, database queries, integrations, and components, to name a few.

Generally, a project like this can have the following steps:

  • Investigate
  • Refactor
  • Add the new dependency
  • Launch
  • Remove the old dependency
  • Refactor
  • Celebrate

Not all of these steps need to be performed in this order, or at all. For example, launching, adding the new dependency, and removing the old dependency can all be completed at once if the risk and simplicity allow for it.

Refactoring may be redundant to mention, given that developers should already be refactoring as they make changes to the codebase. I am being explicit about mentioning it here so that the team can use deliberate refactoring to their advantage: making the new dependency easy to add, and removing any unnecessary abstractions or code left over from the old dependency.

A special shoutout goes to the celebrate step, since I certainly know that teams can be eager to move onto the next project and forget to appreciate the hard work put into achieving its success.

Now let’s get into the different concerns that can apply to projects like this, and the practices that can help them succeed.

Concerns and Practices

The investigation step is the most important step every project should have, as it informs how the rest of the project should work. During this step, enough information should be gathered to produce a plan confident enough for the rest of the project to proceed. Some of the actions taken during this step are understanding how the current system works, how the dependency being replaced is integrated, how critical this dependency is to the business’s operations, how the new dependency should be integrated, and any cleanup and refactorings that should be made before, during, and after the new dependency is added and launched.

A big topic to explore is what the system would look like if it had originally been built with the new dependency. This mental model forces the envisioning of a clean integration with the new dependency, ignoring any legacy code and free of any constraints within the system. It aligns the team on the ultimate design of the system using the new dependency. The team should strive for reaching this state at the end of the project, since it can result in the cleanest and most maintainable code. If this is not part of the end goal for the project, the system may carry forward unnecessary code and bad abstractions that pile up as tech debt for future developers.

Another big consideration is what the system will look like when it is launched. Will the new and old dependencies have to coexist in the system so that there can be a gradual transition? Or is the change small enough that the old dependency can be removed and the new one added in a single change, with no user interruption? If there is a period where the two dependencies need to coexist, how can this behaviour be implemented to fit the ultimate design discussed earlier? This may involve a lot of refactoring up front to get the system into a state that fits that design at the end of the project. There is also the tradeoff that the new dependency can be hastily integrated into the system alongside the old one, and all of the hacks undone with a sufficient cleanup and refactoring after the launch. Both methods have different tradeoffs: with the up-front refactoring, the new dependency is integrated the right way into the system, but this may require a lot of refactoring of the surrounding code and the old dependency’s code. On the other hand, hacking the new dependency into the same pattern the old dependency uses gets the project to launch faster, but can lead to bugs and integration troubles if the interfaces or behaviour differ. Regardless, at the end of the project the system should look as if the old dependency never existed.

How much risk is involved in switching over to the new dependency? If it is very risky, more precautions should be taken to reduce the risk. If there is very little risk, few or no precautions are needed and the project can move much faster. One method I have used to reduce risk is collecting metrics on the different types of requests and responses between the two dependencies; these metrics can act as early warning systems for the new dependency behaving incorrectly. Being able to roll back to the old dependency via a deploy or a feature flag provides the flexibility to quickly switch back in case things go wrong. Dark launching the new dependency to production is a practice I often encourage my team to use, since it allows for testing the new code in the production environment without affecting real users. Lastly, beta testing with a percentage of users can also reduce the impact, since if something goes wrong with the new dependency, only a fraction of users are affected. Many of these practices are complementary and can be used together to achieve the best mix of handling risk.
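As a sketch of how a couple of these practices can fit together in code (hypothetical Ruby; the class, flag, and metric names are all illustrative):

# Hypothetical sketch: wrap both dependencies behind one interface.
class DependencySwitch
  def initialize(old_impl:, new_impl:, flags:, metrics:)
    @old_impl = old_impl
    @new_impl = new_impl
    @flags = flags
    @metrics = metrics
  end

  # Flag-gated switchover: rolling back is a flag flip, not a deploy.
  def quote(order)
    impl = @flags.enabled?(:use_new_quoter) ? @new_impl : @old_impl
    impl.quote(order)
  end

  # Dark launch: run the new dependency alongside the old one and compare,
  # but always serve the old dependency’s result to real users.
  def dark_quote(order)
    old_result = @old_impl.quote(order)
    begin
      new_result = @new_impl.quote(order)
      @metrics.increment("quote.mismatch") if new_result != old_result
    rescue StandardError
      @metrics.increment("quote.new_error")
    end
    old_result
  end
end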

How much effort is involved in adding the new dependency? Effort could mean the number of changes to the code or the time involved. If there is a significant amount of effort, it absolutely makes sense to incrementally ship small changes to production. Even if the changes are not live code paths, the team can at least review each change, provide feedback, and course correct if needed. On small-effort projects I have seen all of the work deployed to production in one pull request. On large-effort projects the work was split across many pull requests written and deployed over time by a team. This latter case enabled the team to dark launch the new dependency to production, with the added safety of switching back to the old dependency if needed.

Given the considerations and practices discussed throughout this article, it is best to validate that they will actually work when it comes time to execute. If a team is experienced with these kinds of projects and practices, the investigation period can be quicker; if the team is less experienced, the investigation should be more substantial. Building a prototype can help build confidence in the assumptions made about running the project and guide the team going forward. A good prototype proves or disproves assumptions in as little time as possible. Once the team is confident in its plan, start the project, and do not be afraid to reevaluate choices already made as the project goes on.

Key-Value Pairs in GraphQL

Today I was pair programming with a member of my team on a new GraphQL mutation. We were trying to figure out how to represent returned data that included a list of key-value pairs – aka a Map datatype. These pairs weren’t constant, since they were being returned from a third-party API, so hardcoding the key names in a type wouldn’t work.

We toyed around with the idea of using an array where the first value would represent the key, and the second value would represent the value. We also wondered if the key-value would best be represented as its own type – that way the array method would never be misconstrued.

We ended up delaying our decision to choose one method over another by mocking out what the resulting mutation response would look like to the caller. For example, here’s what the response would look like for using arrays to represent the key-value pairs:

{
  "data": {
    "fields": [
      ["key1", "value1"],
      ["key2", "value2"],
      ["key3", "value3"],
    ]
  }
}

And here’s what the response would look like if a GraphQL type was used for holding key-value pairs:

{
  "data": {
    "fields": [
      {"key": "key1", "value": "value1"},
      {"key": "key2", "value": "value2"},
      {"key": "key3", "value": "value3"}
    ]
  }
}

We quickly realized that the array-based method has the disadvantage that the client needs to implicitly know which position in the array holds the key and which holds the value. There’s also the possibility of more or fewer than two elements in the array, even though the user would expect there to be exactly two. GraphQL and its schema provide a concise and explicit contract, and using this array method bypasses that benefit.

Therefore, we went forth with adding a generic PairType to our GraphQL app. This worked perfectly for our use case.
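A minimal sketch of what such a type can look like (using graphql-ruby’s class-based API; these aren’t our exact class names):

# A generic key-value pair type; both sides are plain strings.
class Types::PairType < Types::BaseObject
  field :key, String, null: false
  field :value, String, null: false
end

# The mutation payload then exposes the pairs as a list:
# field :fields, [Types::PairType], null: false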

But this raises the question: why doesn’t the GraphQL spec support key-value pairs as a first-class type?

It appears that it’s a long-standing feature request.

Brodie: Building Shopify’s new Help Centre

One of the primary projects which has defined the existence of my team at Shopify was a complete rebuild of the Help Centre’s platform. The prior Help Centre used Jekyll (the static site generator), with a number of features added over the past five years, to provide documentation to our merchants, partners, and prospective customers.

The rebuild took about six months, and successfully launched with multiple languages in July 2018.

Deacon Brodie

This post first discusses the limitations we encountered after using Jekyll for a number of years on a Help Centre that has grown to 15 technical writers and 1600 pages. Next, a number of upcoming features are outlined which the new platform should easily accommodate. Following that is a high-level overview of Brodie, the library we built to replace Jekyll. Then Brodie’s internals are explained, with details on how it integrates with Ruby on Rails. The post ends with links to related code discussed throughout.

Jekyll’s Limitations

As of February 2018, Shopify’s Help Centre consisted of 1600 pages, 3000 images, and 300 partials/includes. This amount of content really slows down Jekyll’s build time: a clean build takes 80 seconds, while changing a single character on a page takes 15 seconds for a partial rebuild. This slows down the workflow of our technical writers, as well as of the developers who maintain the heavily Javascript-based Support page.

Static sites, where a server serves up HTML files, can only get you so far. Features considered dynamic must be implemented using client-side Javascript. This has proven difficult and even restrictive to the features that could be added to the site, especially features which require running on a server and not in the user’s browser. Things such as authenticating Shopify merchants before they contact Support are more difficult when all of the functionality has to live in Javascript or be delegated to another app.

The original Deacon Brodie’s Tavern in Edinburgh

Other companies have blogged about the hoops they’ve jumped through to scale Jekyll, too.

Upcoming Features

Allowing users to log in to the Help Centre with their Shopify credentials can provide a more personalized experience. Based on the shops the merchant has access to, the pages in the Help Centre can be tailored to their country, the features they use, and the growth stage of their business. The API documentation can be extended to give the logged-in user the ability to query their shop’s API.

Enabling merchants to log in to the Help Centre can simplify the process of talking with Support. Once logged in, users can bypass verifying their identity to a Support representative, since they’ve already proven who they are by logging in to the Help Centre. This saves time on both ends of the conversation and keeps the user focused on their problem.

A short history of Deacon Brodie’s life

Features could also be added to enhance the workflow of our technical writers. For a logged-in technical writer, a few features could be enabled, such as showing all pages regardless of whether they are hidden or document an early-release feature, a link to view the page on GitHub, or even a link to view the current page in Google Analytics. Improvements such as these make it much quicker to access relevant data.

Being able to correlate the Help Centre pages visited by a user before they contact Support can help infer how successful pages are at answering the user’s question. Pages which do poorly can be updated, and pages which are successful can be studied for trends. Resources can then be focused on the areas of the Help Centre which need them most. Additionally, connecting the specific pages visited with Support interactions opens the opportunity to perform A/B tests: a Help Centre page can have two or more versions, and the version which results in the fewest Support interactions could be considered the winner. Currently there is no way to definitively correlate the two.

Many Support organizations gauge the effectiveness of their Help Centre content (self-help) by comparing potential Support interactions solved by Help Centre pages to the number of actual Support interactions: a so-called deflection ratio, where a higher self-help-to-support-interaction ratio is better. This ratio can be calculated more accurately by better tracking the user’s journey through these various Shopify properties before they contact Support.

Lastly, internationalization (aka I18n) and localization mean translating pages into different languages and cultural norms. I18n would make the Help Centre usable by people who don’t know English, or who prefer reading in a language they understand better. I18n support can be hacked into Jekyll, but as discussed earlier, 1600 pages already slow down the build times; Jekyll would be absolutely crippled once multiple localized versions of each page exist. Therefore, having an app that can scale to a much larger number of pages is required for I18n and localization to even be considered.

The Solution

To enable our Help Centre to scale way past 1600 pages, and support complex server-side features, a scrappy team was formed to rebuild the Help Centre platform in Ruby on Rails.

Rewriting any of the content pages or partials wouldn’t be feasible with the time or resources we had, so maintaining compatibility with the existing content files was paramount.

Allowing the number of pages in the Help Centre to keep growing while dramatically reducing the 80-second clean build time and the 15-second page rebuild time required an architectural shift: moving away from Jekyll’s model of pre-rendering all pages at build time to a model of rendering only what’s needed at request time. Instead of performing all of the computational work up front, performing smaller batches of work at request time spreads out the cost.

The Deacon Brodies Pub in Ottawa, steps away from Shopify HQ

Ruby on Rails was chosen as the new technology stack for the Help Centre for a few reasons. The limits of Jekyll were being reached, so we technically couldn’t continue using it. Shopify’s internal tooling and production systems heavily integrate with Rails applications, so building on Rails to utilize these would save a lot of developer time. Shopify also employs one of the largest bases of Rails developers, so tapping into that workforce and knowledge base is very beneficial for future development.

Ruby on Rails brings a number of complementary features such as a solid MVC framework, simple caching abstractions for application code and views, as well as a strong and healthy community of libraries and users. These benefits make Rails a great selling point for building new features faster and easier than the prior Jekyll system.

One of the things that has been working quite well over the past few years has been the workflow for our technical writers. It consists of using a text editor (such as Atom) to edit Markdown and Liquid code, then using Git and GitHub to open a Pull Request for peer review of the changes. Automated tests check for broken links, missing images, and incorrectly formed HTML and Javascript. Once the changes are approved and all tests have passed, the Pull Request is merged and shipped to production.

Since there isn’t a good reason to change the technical writers’ current workflow, we were more than happy to design the new documentation site with the existing workflow in mind.

One of the main features of the platform my team built is its flexible content rendering engine. It’s equivalent to Jekyll on steroids. Here I’ll discuss the heart of the system: Brodie, the ERB-Liquid-Markdown rendering engine.

Brodie

Brodie is the library we purpose-built for Shopify’s new Help Centre. It renders any file containing ERB, Liquid, Markdown, or a combination of the three into HTML.

Brodie is named after Deacon Brodies, an Ottawa pub which is itself named after Deacon William Brodie, an 18th-century city councillor in Edinburgh who moonlighted as a burglar and gambler.

Deacon Brodie’s double life inspired the Robert Louis Stevenson story Strange Case of Dr Jekyll and Mr Hyde.

Brodie, and the custom extensions built on-top of it, enable a smooth transition from Jekyll to Rails. Shopify’s 1600 pages, 3000 images, and 300 partials/includes can be rendered by Brodie without modification. Additionally, the workflow of the technical writers is not disturbed. They continue to use their favourite text editor to modify content files, Git and GitHub to perform reviews, and to utilize the existing Continuous Delivery pipeline for fast validation and shipping.

Views in Rails are rendered using templates. A template is a file consisting of code that defines what the user will see. In a Rails app the template file usually consists of ERB mixed into HTML. A template file like this belongs in the app/views/ directory and has a descriptive name such as homepage.html.erb.

The magic in Rails treats templates differently based on their filenames. Let’s break it down: homepage is the template’s name, which Rails uses to look up the template. The html part is the format the template outputs. Lastly, erb specifies the language the template file is written in. This naming convention enables Rails to dynamically render views just by looking at the filename.

Rails provides template handlers to render ERB to HTML, as well as JSON and a few others. Rails offers the ability to extend its rendering system by plugging in new template handlers. This is where Brodie integrates with Rails applications. Brodie provides its own template handler to take content files and convert the ERB, Liquid, and Markdown to HTML.

Rails exposes this via ActionView::Template.register_template_handler(:md, Content), where :md is the file extension to act on and Content is the class to use as the template handler. Next we’ll go over how a template handler works.

Rendering Templates

The only interface a template handler is required to respond to is call, with one parameter: the template to render. This method should return a string of Ruby code that will render the view. This string is later eval’ed by the template. Returning a string of code is a Rails optimization which inlines much of the code required to render the template, reducing the number of method calls and speeding up the already time-consuming rendering process.

When Rails needs to render a view, it takes the specified template and calls the appropriate template handler with it. The handler returns a string containing the code that renders the template. The Template class combines that code with other code, then evals the stringified code.

For example, the ERB-Liquid-Markdown renderer has a call method like the following:

def call(template)
  compiled_source = erb_handler.call(template)
  "Brodie::Handlers::Content.render(begin;#{compiled_source};end, local_assigns)"
end

Brodie first renders the ERB present in the template’s content with the existing ERB handler that comes with Rails. It then returns a string of code which calls the render method on itself. That render method is shown next:

def render(source, local_assigns = {})
  markdown.call(
    liquid.call(source, local_assigns)
  ).html_safe
end

Here is where the actual rendering of the Liquid and Markdown occurs. When this code is eval’ed, the local_assigns parameter is included for passing in variables when rendering a view. This is how variables are magically passed from Rails controllers into views.

Left: The old Jekyll site. Right: The new Rails site. The Help Centre rebuild looks the same but has a completely new backend

It’s as straightforward as that to render ERB, Liquid, and Markdown together. The early days of Brodie were spent understanding the ins and outs of ActionView well enough to validate that this approach was sane and didn’t break in edge cases.

Further Reading

The current documentation is really limited when it comes to templates and template handlers. I would suggest building a small template handler, setting breakpoints, and walking through the source. Here’s a great example of a template handler for Markdown.
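A minimal handler along those lines might look like this (a sketch assuming the redcarpet gem; it is not the linked example verbatim):

require "redcarpet"

# A tiny template handler that renders .md views to HTML.
module MarkdownHandler
  def self.call(template)
    # Return a string of Ruby code; Rails evals it to produce the view.
    "Redcarpet::Markdown.new(Redcarpet::Render::HTML.new)" \
      ".render(#{template.source.inspect}).html_safe"
  end
end

ActionView::Template.register_template_handler(:md, MarkdownHandler)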

Additionally, looking over the source code and comments is the best way to get an understanding of the ActionView internals. The main entry point into ActionView is the render method from TemplateRenderer. Template would be best to check out next, as it concerns itself with actually rendering templates. Lastly, Handlers is good to check out to see how Rails registers and fetches template handlers.

Hunting for segmentation faults in Ruby programs

I was working on building a content management engine for Shopify’s next-generation Help Centre. Code-named Brodie, it was equivalent to the Jekyll static site generator added to Ruby on Rails, except that instead of rendering all the pages up front at compile time, each page is generated when it is requested by the client.

Brodie used a Ruby gem called Redcarpet for Markdown rendering. Redcarpet worked wonderfully, but Brodie’s extensive use of it surfaced a severe bug: the way Redcarpet was being used resulted in periodic segmentation faults (segfaults) while rendering Markdown. These segfaults caused many 502 and 503 errors when certain, then-unknown pages were visited. It was such an issue that all the web servers in the cluster would go down for some time until they restarted automatically.

How do I Redcarpet?

To better explain the issue and its resolution, it is best to have an understanding of how Redcarpet, and really any other text renderer, works. Here is a simple example (a minimal sketch of typical Redcarpet usage; the input text is illustrative):
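require "redcarpet"

# The Markdown to be rendered to HTML (any string works here).
source = "# Hello\n\nSome *Markdown* text."

# Set up the Redcarpet::Markdown configuration object with an HTML renderer.
markdown = Redcarpet::Markdown.new(Redcarpet::Render::HTML.new)

# Parse and render the Markdown to HTML.
puts markdown.render(source)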

In the above example, the code defines the Markdown that is to be rendered to HTML, sets up the Redcarpet::Markdown configuration object, and then finally parses and renders the Markdown to HTML.

But wait! There’s more. Jekyll and Brodie both use the Liquid language (made by Shopify!) to make it easier to write and manage content. Liquid provides control flow structures, variables, and functions. One useful function allows including the contents of other files in the current file (the equivalent of partials in Rails). Here is an example that uses the Liquid include function (again a sketch, with file names matching the description below):
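require "liquid"
require "redcarpet"

# main.md – Markdown containing a Liquid include tag:
main_md = <<~MD
  # Main page

  {% include 'included' %}
MD

# Tell Liquid where to find partials; _included.liquid lives on disk and is
# resolved from the include tag's name via the "_%s.liquid" pattern.
Liquid::Template.file_system = Liquid::LocalFileSystem.new("./", "_%s.liquid")

# Render the Liquid first...
liquid_output = Liquid::Template.parse(main_md).render

# ...then pass the result into the Markdown renderer.
markdown = Redcarpet::Markdown.new(Redcarpet::Render::HTML.new)
puts markdown.render(liquid_output)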

As we can see in the example above, the code renders the Liquid and Markdown to HTML. This is achieved by rendering the Liquid first, then passing the result of that into the Markdown renderer. Additionally, the Liquid include function injected the contents of _included.liquid exactly where the include function was called in main.md.

Now that the basics of Markdown and Liquid rendering have been explained, it is possible to understand the segfault issue.

“Where is this segfault coming from???”

When my team and I were close to launching the new Help Centre that used Brodie, the custom-built Liquid and Markdown rendering engine, the app would crash due to segmentation faults. When the servers were put under load with many requests coming in, the segfaults and resulting downtime were magnified. It was clear from load testing that a small amount of traffic could bring down the entire site and keep it down.

The segfaulting would lead to servers becoming unavailable until Kubernetes, the cluster manager, detected that those servers were unhealthy and restarted them. The time it took for a pod to come back online was 30-60 seconds. With the system under load, it took only a couple of minutes before all the servers in the cluster were down. When this happened, the app returned HTTP 502 and 503 errors to any client requesting a page, never a good sign.

The only message that was present in the logs before the app died was the following:

Assertion failed: (md->work_bufs[BUFFER_BLOCK].size == 0), function sd_markdown_render, file markdown.c, line 2544.

Apparently, Ruby crashed in a random Redcarpet C function call. No stacktrace or helpful logging followed this message. The logs did not even include which page the client requested, as the usual Rails request logging is written after the HTTP request finishes. This Assertion failed message was a lead, but didn’t help much since it doesn’t reference what caused it.

I have dealt with other Redcarpet issues in the past, where methods we extended in Redcarpet to add custom behaviour threw exceptions. Sometimes these exceptions caused the request to fail and a stacktrace of the issue to show up; other times they resulted in a segfault with a similar Redcarpet C function in the message. Ultimately, writing better code fixed those earlier situations.

My intuition told me that an error being thrown while rendering the page was causing this segfault to occur. I attempted an experiment where I added some rescue blocks to the Redcarpet methods that we extended. This would prevent the potential exceptions from being raised out of the buggy code, hopefully resulting in no segfaults. If that fix succeeded, I could safely assume that fixing the code which raised the error would be the end of the story.
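The experiment looked something like this (a hypothetical sketch: block_code is a real Redcarpet render callback, but the helper and fallback behaviour are illustrative):

require "redcarpet"

# Sketch: rescue inside our extended Redcarpet callbacks so exceptions
# cannot propagate out of the render. Names are illustrative.
class CustomRenderer < Redcarpet::Render::HTML
  def block_code(code, language)
    fancy_highlight(code, language) # hypothetical custom behaviour
  rescue StandardError => e
    Rails.logger.warn("block_code failed: #{e.message}")
    "<pre>#{CGI.escapeHTML(code)}</pre>"
  end
end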

The experiment was shipped to production. Things went well until the next day: sometime overnight the page that caused the segfaults was hit, and the operational dashboards recorded the cluster going down and rebooting. At least this confirmed that the Redcarpet extensions were not at fault.

Getting lucky

Playing around with things, out of sheer luck a page was found that could cause the app to segfault repeatedly. Visiting this page once did not cause the server to crash or the response to 500, but refreshing the page multiple times did crash the server. Since this app runs multiple threads to answer requests in parallel, in both local development and production, it seemed possible that a shared Redcarpet data structure was getting clobbered by multiple threads writing to it at the same time. This is actually a recurring issue according to the community.

Recursive rendering

Discussing the issue more with my larger team of developers, the idea came up of removing any cross-thread sharing of Redcarpet’s configuration object. One of the other developers shipped a PR which gave each thread its own Redcarpet configuration object, but this did not end up fixing the problem.

A tree, showing the order in which nodes are traversed using the depth-first search algorithm. (CC 3.0)

Building on top of this developer’s work, I knew that it was possible for the Redcarpet renderer to be called recursively given the nature of the Liquid and Markdown content files. As described earlier, it is possible for one content file to include another. When a content file is being rendered, the rendering pauses on that file, descends into the included file to render it, then returns to where it left off in the original file. This behaviour is exactly like the depth-first search algorithm from graph theory.

After making this breakthrough it was simple to understand what to try next: each time Redcarpet is called to render some Markdown, always create a new Redcarpet configuration object. This should solve the issue of multiple threads writing, as well as the recursive writes. Even though there is extra overhead in creating a new Redcarpet configuration object each time a content file is rendered, it is a reliable workaround that bypasses Redcarpet’s single-thread, single-writer limitation.
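In code, the fix amounts to something like this (a sketch; the module name is illustrative, not Brodie’s exact code):

require "redcarpet"

module MarkdownRendering
  # Build a fresh Redcarpet::Markdown configuration object per call so no
  # renderer state is shared across threads or recursive include renders.
  def self.render(source)
    Redcarpet::Markdown.new(Redcarpet::Render::HTML.new).render(source)
  end
end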

After coding and shipping this fix, it worked!

No matter how many times the problematic page was refreshed, the app never crashed. The production servers were back to handling their original capacity, and one developer was feeling very relieved.

Takeaways

I learned a considerable number of things from this debugging experience. Even when using battle-tested software like Redcarpet, there may be use cases which are not supported, yet not documented as unsupported. Additionally, the Redcarpet library is now rarely maintained. Knowing the limitations up front can save time and frustration. One of the main reasons this article was written is that there was no other writing about this issue and its workarounds; hopefully it will save time for developers who run into similar issues in the future.

It was valuable to bounce ideas off of other team members. If I had not put out my ideas and had these discussions, I would not have understood the problem as well as I did. Even the potential fix that a teammate of mine shipped but did not end up working helped me understand the problem better.

Drawing out parts of the control flow on paper to really understand how the app renders content files builds a better mental model of what actually goes on inside the app. It is one thing to have a high-level overview of how different components interact, but it is an entirely different level of understanding to know exactly what happens. This extends to the intricacies of the software libraries being used: in this situation, knowing the internals and behaviour of Ruby on Rails, Liquid, and Redcarpet made it a lot easier to understand what was going on.

Lastly, you always feel like a boss when you fix big, complicated problems such as this one.