I was working on building a content management engine for Shopify’s next generation Help Centre. Code-named Brodie, it was like the Jekyll static site generator grafted onto Ruby on Rails, except that instead of rendering every page up front at compile time, each page is generated when a client requests it.
Brodie used a Ruby gem called Redcarpet for Markdown rendering. Redcarpet worked wonderfully, but the way Brodie used it led to a severe bug: periodic segmentation faults (segfaults) while rendering Markdown. These segfaults caused many 502 and 503 errors whenever certain, initially unknown, pages were visited. It was bad enough that all the web servers in the cluster would go down for a while until they restarted automatically.
How do I Redcarpet?
To better explain the issue and its resolution, it helps to understand how Redcarpet, and really any text renderer, works. Here is a simple example:
In the above example, the code defines the Markdown that is to be rendered to HTML, sets up the Redcarpet::Markdown configuration object, and finally parses and renders the Markdown to HTML.
But wait! There’s more. Jekyll and Brodie both use the Liquid language (made by Shopify!) to make it easier to write and manage content. Liquid provides control flow structures, variables, and functions. One useful function allows including the contents of other files into the current file (the equivalent of partials in Rails). Here is an example that uses the Liquid include function:
As we can see in the example above, the code renders the Liquid and Markdown to HTML by rendering the Liquid first, then passing the result into the Markdown renderer. Additionally, the Liquid include function injected the contents of _included.liquid exactly where the include function was called in the parent file.
Now that the basics of Markdown and Liquid rendering have been covered, it is possible to understand the segfault issue.
“Where is this segfault coming from???”
When my team and I were close to launching the new Help Centre that used Brodie, the custom-built Liquid and Markdown rendering engine, the app would crash due to segmentation faults. When the servers were put under load with many requests coming in, the segfaults and resulting downtime were magnified. It was clear from load testing that even a small amount of traffic would bring down the entire site and keep it down.
The segfaults left servers unavailable until Kubernetes, the cluster manager, detected that they were unhealthy and restarted them. It took 30-60 seconds for a pod to come back online, so with the system under load it was only a couple of minutes before every server in the cluster was down. When that happened, the app returned HTTP 502 and 503 errors to any client requesting a page, never a good sign.
The only message that was present in the logs before the app died was the following:
Assertion failed: (md->work_bufs[BUFFER_BLOCK].size == 0), function sd_markdown_render, file markdown.c, line 2544.
Apparently, Ruby crashed in a Redcarpet C function call. No stacktrace or helpful logging followed this message. The logs did not even include which page the client had requested, since the usual Rails request logging happens only after the HTTP request finishes. The Assertion failed message was a lead, but it didn’t help much since it does not point to what caused it.
I had dealt with other Redcarpet issues in the past, where methods we extended in Redcarpet to add custom behaviour threw exceptions. Sometimes those exceptions caused the request to fail with a stacktrace of the issue; other times they resulted in a segfault with a similar Redcarpet C function in the message. In those earlier cases, fixing the buggy code resolved the problem.
My intuition told me that an exception raised while rendering the page was causing the segfault. As an experiment, I added rescue blocks to the Redcarpet methods we had extended, so that exceptions from the buggy code would be swallowed, hopefully resulting in no segfaults. If that fix succeeded, I could safely assume that fixing the code which raised the error would be the end of the story.
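The experiment looked roughly like the following sketch. The module and renderer names here are hypothetical, not Brodie’s actual code:

```ruby
# Hypothetical sketch of the experiment: wrap each extended renderer
# method so an exception is swallowed instead of unwinding through
# Redcarpet's C call stack.
module RescueWrapper
  def self.wrap(klass, *method_names)
    method_names.each do |name|
      original = klass.instance_method(name)
      klass.define_method(name) do |*args, &block|
        begin
          original.bind(self).call(*args, &block)
        rescue StandardError => e
          warn "render hook #{name} raised: #{e.class}: #{e.message}"
          # Returning nil tells Redcarpet to fall back to its default
          # rendering for this node.
          nil
        end
      end
    end
  end
end

# Example: a custom renderer hook that raises no longer blows up.
class CustomRenderer
  def block_code(code, language)
    raise "bug in custom highlighting"
  end
end

RescueWrapper.wrap(CustomRenderer, :block_code)
CustomRenderer.new.block_code("puts 1", "ruby")  # => nil, logs a warning
```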
The experiment shipped to production, and things went well until the next day. Sometime overnight the page that caused the segfaults was hit, and the operational dashboards recorded the cluster going down and rebooting. At least this confirmed that our Redcarpet extensions were not at fault.
Playing around with things, I found, out of sheer luck, a page that could make the app segfault repeatedly. Visiting it once did not crash the server or return a 500, but refreshing it several times did crash the server. Since the app ran multiple threads in both the local development and production environments to answer requests in parallel, it seemed possible that a shared Redcarpet data structure was getting clobbered by multiple threads writing to it at the same time. This is actually a recurring issue according to the community:
Discussing the issue with my larger team of developers, we hit on the idea of removing any cross-thread sharing of Redcarpet’s configuration object. Another developer shipped a PR that gave each thread its own Redcarpet configuration object, but this did not fix the problem.
Building on top of this developer’s work, I knew that the Redcarpet renderer could be called recursively, given the nature of the Liquid and Markdown content files. As described earlier, one content file can include another. When a content file is being rendered, rendering pauses on that file, descends into the included file to render it, then returns to where it left off in the original. This behaviour is exactly like the depth-first search algorithm from graph theory.
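This depth-first descent can be illustrated with a toy resolver. The code is hypothetical; in the real app, Liquid’s include tag does this work:

```ruby
# Toy include resolver illustrating the depth-first traversal:
# rendering pauses at each {% include '...' %}, descends into the
# included file, then resumes where it left off.
INCLUDE_TAG = /\{%\s*include\s+'([^']+)'\s*%\}/

def render_file(name, files, stack = [])
  raise "include cycle: #{(stack + [name]).join(' -> ')}" if stack.include?(name)

  files.fetch(name).gsub(INCLUDE_TAG) do
    render_file(Regexp.last_match(1), files, stack + [name])
  end
end

files = {
  "index"     => "start {% include 'partial' %} end",
  "partial"   => "middle {% include 'innermost' %}",
  "innermost" => "deepest",
}

puts render_file("index", files)
# => start middle deepest end
```

Each nested include pushes the renderer one level deeper before it can finish the file above it, which is exactly the situation that matters for the bug.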
After making this breakthrough, the next thing to try was simple: every time Redcarpet is called to render some Markdown, create a new Redcarpet configuration object. This should solve both the multi-threaded writes and the recursive writes. Even though creating a new configuration object for every content file adds some overhead, it is a reliable workaround for Redcarpet’s single-threaded, single-writer limitation.
I coded and shipped this fix, and it worked! Refreshing the problematic page, no matter how many times, never crashed the app again. The production servers were back to handling their original capacity, and one developer was feeling very relieved.
I learned a considerable number of things from this debugging experience. Even when using battle-tested software (like Redcarpet), there may be use cases that are neither supported nor documented as unsupported. Additionally, the Redcarpet library is now only rarely maintained. Knowing these limitations up front can save time and frustration. One of the main reasons I wrote this article is that there was no other writing about this issue and its workarounds; hopefully it saves time for developers who run into similar issues in the future.
It was valuable to bounce ideas off of other team members. If I had not put my ideas out there and had those discussions, I would not have understood the problem as well as I did. Even the fix that a teammate shipped, which did not end up working, helped me understand the problem better.
Drawing out parts of the control flow on paper to really understand how the app renders content files builds a better mental model of what actually goes on inside the app. It is one thing to have a high-level overview of how different components interact; it is an entirely different level of understanding to know exactly what happens. The same goes for the intricacies of the software libraries being used: in this situation, knowing the internals and behaviour of Ruby on Rails, Liquid, and Redcarpet made it much easier to understand what was going on.
Lastly, you always feel like a boss when you fix big, complicated problems such as this one.