Hunting for segmentation faults in Ruby programs

I was working on building a content management engine for Shopify’s next generation Help Centre. Code-named Brodie, it was essentially the Jekyll static site generator grafted onto Ruby on Rails, except that instead of rendering all the pages up front at compile time, each page is generated when a client requests it.

Brodie used a Ruby gem called Redcarpet for Markdown rendering. Redcarpet worked wonderfully, but Brodie’s extensive use of it exposed a severe bug: the way Redcarpet was being called resulted in periodic segmentation faults (segfaults) while rendering Markdown. These segfaults caused many 502 and 503 errors whenever certain pages were visited, to the point that all the web servers in the cluster would go down for some time until they restarted automatically.

How do I Redcarpet?

To better explain the issue and its resolution, it is best to have an understanding of how Redcarpet (and really any other text renderer) works. Here is a simple example:
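(A minimal sketch using the standard Redcarpet API; the Markdown text itself is illustrative.)

require "redcarpet"

# The Markdown that is to be rendered to HTML.
markdown = "# Hello\n\nThis is **Markdown**."

# Set up the Redcarpet::Markdown configuration object with an HTML renderer.
renderer = Redcarpet::Markdown.new(Redcarpet::Render::HTML.new)

# Parse and render the Markdown to HTML.
puts renderer.render(markdown)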

In the above example, the code defines the Markdown that is to be rendered to HTML, sets up the Redcarpet::Markdown configuration object, and then finally parses and renders the Markdown to HTML.

But wait! There’s more. Jekyll and Brodie both use the Liquid language (made by Shopify!) to make it easier to write and manage content. Liquid provides control flow structures, variables, and functions. One useful function allows including the contents of other files into the current file (the equivalent of partials in Rails). Here is an example that uses the Liquid include function:
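Below is a sketch of how this looks in practice, assuming the standard Liquid gem API; the content directory layout is illustrative, while main.md and _included.liquid follow the example described here.

require "liquid"
require "redcarpet"

# Point Liquid at the directory holding partials. With the default
# file pattern, {% include 'included' %} resolves to _included.liquid.
Liquid::Template.file_system = Liquid::LocalFileSystem.new("content")

# main.md contains Markdown plus a Liquid include tag, e.g.:
#   # Main page
#   {% include 'included' %}
source = File.read("content/main.md")

# Render the Liquid first...
expanded_markdown = Liquid::Template.parse(source).render

# ...then pass the result into the Markdown renderer.
renderer = Redcarpet::Markdown.new(Redcarpet::Render::HTML.new)
puts renderer.render(expanded_markdown)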

As we can see in the example above, the code renders the Liquid and Markdown to HTML. This is achieved by rendering the Liquid first, then passing the result into the Markdown renderer. Additionally, the Liquid include function injects the contents of _included.liquid exactly where the include function was called in main.md.

With the basics of Markdown and Liquid rendering explained, it is now possible to understand this segfault issue.

“Where is this segfault coming from???”

When my team and I were close to launching the new Help Centre that used Brodie, the custom-built Liquid and Markdown rendering engine, the app would crash due to segmentation faults. When the servers were put under load with many requests coming in, the segfaults and resulting downtime were magnified. It was clear from load testing that a small amount of traffic would bring down the entire site and keep it down.

The segfaulting would lead to servers becoming unavailable until Kubernetes, the cluster manager, detected that those servers were unhealthy and restarted them. It took 30-60 seconds for a pod to come back online, so with the system under load it was only a couple of minutes before all the servers in the cluster were down. When this happened, the app returned HTTP 502 and 503 errors to any client requesting a page – never a good sign.

The only message that was present in the logs before the app died was the following:

Assertion failed: (md->work_bufs[BUFFER_BLOCK].size == 0), function sd_markdown_render, file markdown.c, line 2544.

Apparently, Ruby crashed in a random Redcarpet C function call. No stacktrace or helpful logging followed this message. The logs did not even include which page the client requested, as the usual Rails request logging is only written after the HTTP request finishes. This Assertion failed message was a lead, but it didn’t help much since it does not reference what caused it.

I have dealt with other Redcarpet issues in the past, where methods that have been extended in Redcarpet to add custom behaviour have thrown exceptions. Sometimes these exceptions have caused the request to fail and a stacktrace of the issue to show up. Other times it has resulted in a segfault with a similar Redcarpet C function in the message. Ultimately, writing better code fixed this earlier situation.

My intuition told me that an error was being thrown while rendering the page, causing this segfault to occur. As an experiment, I added some rescue blocks to the Redcarpet methods that we extended. This would stop exceptions raised in the buggy code from propagating, hopefully resulting in no segfaults. If that fix succeeded, I could safely assume that fixing the code which raised the error would be the end of the story.
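A sketch of what that experiment looked like, assuming a custom renderer subclass (the class name and the render_highlighted helper are illustrative, not Brodie’s actual code):

class BrodieRenderer < Redcarpet::Render::HTML
  # One of the methods we extended; wrap the custom behaviour in a
  # rescue so an exception cannot propagate back into Redcarpet's C code.
  def block_code(code, language)
    render_highlighted(code, language)  # hypothetical custom behaviour
  rescue StandardError => e
    Rails.logger.error("Markdown extension failed: #{e.message}")
    "<pre>#{code}</pre>"
  end
end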

The experiment was shipped to production, and things went well until the next day. Sometime overnight the page that caused the segfaults was hit, and the operational dashboards recorded the cluster going down and rebooting. At least this confirmed that the Redcarpet extensions were not at fault.

Getting lucky

Playing around with things, I found a page out of sheer luck that could cause the app to segfault repeatedly. Visiting this page once did not cause the server to crash or the response to 500, but refreshing the page multiple times did crash the server. Since this app was running multiple threads in both the local development and production environments to answer requests in parallel, it seemed possible that a shared Redcarpet data structure was getting clobbered by multiple threads writing to it at the same time. This is actually a recurring issue according to the community.

Recursive rendering

Discussing the issue with my larger team of developers raised the idea of removing any cross-thread sharing of Redcarpet’s configuration object. One of the other developers shipped a PR which gave each thread its own Redcarpet configuration object, but this did not end up fixing the problem.

A tree, showing the order in which nodes are traversed using the depth-first search algorithm. (CC 3.0)

Building on top of this developer’s work, I knew that the Redcarpet renderer could be called recursively because of the nature of the Liquid and Markdown content files. As described earlier, one content file can include another. When a content file is being rendered and an include is reached, rendering pauses on that file, descends into the included file to render it, then returns to where it left off in the original. This behaviour is exactly like the depth-first search algorithm from graph theory.

After making this breakthrough it was simple to see what to try next: create a new Redcarpet configuration object each time Redcarpet is called to render some Markdown. This should solve the issue of multiple threads writing to the object, as well as the recursive writes. Even though there is extra overhead in creating a new Redcarpet configuration object for every content file rendered, it is a reliable workaround that bypasses Redcarpet’s single-thread, single-writer limitation.
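The change itself is small. A minimal sketch, with illustrative names:

# Before: a single Redcarpet::Markdown instance shared across threads and
# recursive include renders. After: a fresh configuration object per call,
# so no render can clobber another render's internal buffers.
def render_markdown(text)
  Redcarpet::Markdown.new(Redcarpet::Render::HTML.new).render(text)
end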

After coding and shipping this fix, it worked!

Refreshing that problematic page, no matter how many times, never crashed the app. The production servers were back to handling their original capacity and one developer was feeling very relieved.

Takeaways

I learned a considerable number of things from this debugging experience. Even when using battle-tested software (like Redcarpet), there may be use cases that are not supported but are not documented as unsupported. Additionally, the Redcarpet library is now rarely maintained. Knowing the limitations up front can save time and frustration. One of the main reasons I wrote this article is that there was no other writing about this issue and its workarounds. Hopefully it will save time for developers who run into similar issues in the future.

It was valuable to bounce ideas off of other team members. If I had not put out my ideas and had those discussions, I would not have understood the problem as well as I did. Even the potential fix that a teammate shipped, which did not end up working, helped me understand the problem better.

Drawing out parts of the control flow on paper to really understand how the app renders content files builds a better mental model of what actually goes on inside the app. It is one thing to have a high-level overview of how different components interact, but it is an entirely different level of understanding to know exactly what happens. This extends to the intricacies of the software libraries being used. In this situation, knowing the internals and behaviour of Ruby on Rails, Liquid, and Redcarpet made it a lot easier to understand what was going on.

Lastly, you always feel like a boss when you fix big, complicated problems such as this one.

Better Cilk Development With Docker

I’m taking a course that focuses on parallel and distributed computing. We use a compiler extension for GCC called Cilk to develop parallel programs in C/C++. Cilk offers developers a simple method for developing parallel code, and as a plus it has been included in GCC since version 4.9.

The unfortunate thing about this course is that the professor provides a hefty 4GB Ubuntu virtual machine just for running the GNU compiler with Cilk. No sane person would download an entire virtual machine image just to run a compiler.

Docker comes to the rescue. It couldn’t be more space-efficient and convenient to use Cilk from a Docker container. I’ve created a simple Dockerfile containing the latest GNU compiler for Ubuntu 16.04, along with Gists showing how to build and run an image that contains the dependencies needed to build and run Cilk programs.
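A minimal sketch of the idea (assuming the stock GCC on Ubuntu 16.04, which ships with Cilk Plus support; the image tag and file names are illustrative):

# Dockerfile
FROM ubuntu:16.04
RUN apt-get update && \
    apt-get install -y build-essential && \
    rm -rf /var/lib/apt/lists/*
WORKDIR /src

# Build the image, then compile and run a Cilk program from the host's
# current directory:
#   docker build -t cilk .
#   docker run --rm -v "$PWD":/src cilk gcc -fcilkplus -lcilkrts -o fib fib.c
#   docker run --rm -v "$PWD":/src cilk ./fib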

Implementing Agile Databases with Liquibase

We have an inconvenient problem. Our development databases are all snowflakes – snowflakes in the sense that each developer’s database has been hand-updated and maintained at the leisure of the developer, so that no two databases are alike.

We version our database changes into scripts with the creation date included in the name. But that’s where the database script organization and automation ends. There’s nothing to take those scripts and apply them to a local developer’s database – just plain old copying and pasting to run new scripts. Adding to the pain, the database scripts don’t go back to day 1 of the database. Instead, the development databases are passed around and copied whenever someone breaks their database and needs a new one, or a new employee comes on board and needs to set up their development environment.

Manually updating our personal development databases is problematic. Forgetting to run scripts can result in unknown side effects. Usually we don’t bother updating our database with the latest scripts until we really have to – that is, whenever we launch our app. Once the app starts complaining about missing tables or fields, we’re on the hunt for the one script out of hundreds that would fix the problem.

As you can see, it is a system that wastes the productivity of all developers, not to mention the frustration of catching up after being behind for a long time. For a while now we’ve acknowledged that it’s a problem that should be fixed. A few of us looked into it and talked about using FlywayDB or Liquibase; Liquibase seemed the best choice for us since it is more feature complete. Since that discussion one of our team members started experimenting with Liquibase and pushed that code to a branch, but it’s remained dormant for a while. I wouldn’t say integrating Liquibase into our development environment was abandoned because it was tough to do. Rather, I’m realizing it is a common trend for developer tooling and continuous improvement to make way for feature development, bug fixing, and level 3 support. Maybe our development team is just too small and busy to tackle these extra tasks, or our bosses don’t see the productivity sinkholes as significant and don’t allocate any time for fixing them. I would like to spur some discussion around this.

Anyways, on with the rest of the post.

Look! The Proof of Concept is Working!

I spent the greater part of my Good Friday getting Liquibase working with our app. Partway through the day I got the production database schema into the Liquibase XML format and checked it into source control. A few more hours were put into fixing minor SQL vs. MySQL issues with Liquibase’s import. (Who knew the BIT(1) type could have an auto increment? Liquibase disagrees.)

Some time was spent creating a script at script/db (in the style of GitHub) for bootstrapping the Liquibase command with the developer database configuration.
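Something along these lines (a sketch; the connection details and paths are illustrative):

#!/bin/sh
# script/db: run Liquibase with the developer database configuration,
# passing the subcommand through (e.g. script/db update).
exec liquibase \
  --driver=com.mysql.jdbc.Driver \
  --changeLogFile=db/changelog.xml \
  --url="jdbc:mysql://localhost:3306/app_dev" \
  --username=dev \
  --password=dev \
  "$@"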

Next I’ll mention some of the incompatibilities that I ran into while generating a Liquibase change log from the existing production database.

Generating a Change Log From an Existing Database

Liquibase offers a very helpful feature: taking an existing database schema and turning it into an XML change log that it can work with. The Liquibase website has documentation on the topic, but it doesn’t mention the slight incompatibilities that you may run into, particularly with data types.

Once the production database schema was converted into a Liquibase change log, I pointed Liquibase at a fresh MySQL server running locally. Running the validate and update commands on the change log resulted in some SQL errors during execution, all related to data type conversions. These problems were fixed by modifying the change log XML file manually.

The first issue was that the NOW() function wasn’t being recognized. Simple enough, just replace it with CURRENT_TIMESTAMP.

Next was Liquibase turning all of the timestamp data types into TIMESTAMP(19). Doing a search and replace for TIMESTAMP(19) to TIMESTAMP did the trick.

The same issue as above happened to all datetime data types. Doing a search and replace for datetime(6) to datetime worked as expected.

In the production database, one table had a primary key with the data type TINYINT(1). When Liquibase read this, it converted the data type to BIT. It’s a known issue at the moment, but the fix is simple: change the type in the change log to some other data type like TINYINT (or TINYINT UNSIGNED). If this is a primary key, make sure you also update the foreign keys in the other tables, otherwise you’ll get errors when the foreign keys are applied.

This one was the weirdest. In the production database, an index was defined as FULLTEXT on a column of type mediumtext with no explicit length. When Liquibase created the database, it would fail when creating this index. After some googling, it appears that a FULLTEXT index requires a length when operating on mediumtext. In the end, adding a length such as (255) to the indexed column in the FULLTEXT index definition fixes it.

Lastly, the tables in the production database were set to use the UTF-8 encoding and the InnoDB engine, but Liquibase doesn’t pick this up. The workaround was to append the following to every table definition in the Liquibase change set XML:
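(A sketch using Liquibase’s modifySql tag inside each createTable change set; the appended value mirrors MySQL’s table options.)

<modifySql dbms="mysql">
    <append value=" ENGINE=InnoDB DEFAULT CHARSET=utf8"/>
</modifySql>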

Next Steps

Because we provide a multitenant SaaS offering, we drive a lot of our app’s behaviour from the database. Whether it’s per-customer feature toggles, a list of selectable fields, or email templates, a lot of data needs to be prepopulated in the database for the app to fully function.

The next bit of work in moving towards an agile database is to find all of the tables containing data that the app needs to function. Liquibase offers methods of loading this data into the database, either from a CSV file or by specifying the data in a change log.
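For example, a change set using the CSV approach might look like this (a sketch; the file, table, and author names are illustrative):

<changeSet id="load-feature-toggles" author="dev">
    <loadData file="db/data/feature_toggles.csv"
              tableName="feature_toggles"/>
</changeSet>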

Another important part of the database that needs to be checked in with Liquibase is the triggers and procedures. Liquibase doesn’t automatically extract the triggers and procedures so you’ll have to locate and export them manually.

Additionally, improving the developer experience by reducing the number of things developers have to do and know eases adoption and makes them more productive. Things like providing the configuration needed to run Liquibase, creating new change logs from a template, and documenting usage and best practices can all bring a developer up to speed and make their life easier.

Lastly, there exists a Liquibase plugin for the Gradle build tool which makes it straightforward to orchestrate Liquibase with your Gradle tasks. This would come in handy when Gradle is used to perform integration testing, or any other form of automated testing, in an environment that uses the database. Test data could be loaded in and cleaned up based on the type of testing.

Conclusion

No developer likes to perform repetitive tasks, so minimize the pain by automating all the things. Developer tooling is often overlooked. As a developer, do yourself and your colleagues a favour and automate the tedious tasks into oblivion. As a manager, recognize the inefficiencies and prioritize fixing them. Attack the tasks that take the most time or would provide the most value if automated, then just start picking at it piece by piece.

Liquibase was discussed and acknowledged as the solution to our developer database woes. Following through with integrating Liquibase into our development environment, and going a few steps further to make it easy to use, leads to more time saved for actual work. Delaying the implementation of the solution means losing out on productivity gains that you’re well aware of. Any such gain is better for the developer’s productivity, the developer’s happiness, and the business as a whole.

Quick Eclipse Tip

When programming Java, it’s always a good idea to log what the program is doing. Log this error, log that object’s value – it’s a constant occurrence. More often than not, I’m adding this simple line to the fields of every class I write:

private final Log log = LogFactory.getLog(ThisClass.class);

Not only is this repetitive and time-consuming, it is also easily automated in the form of a Template from within Eclipse.

The Solution

Adding a text expansion in the form of an Eclipse Template allows you to type log, press the content assist hotkey, and then press Enter to automatically insert the Log declaration statement and add the necessary imports. Wow, that was fast, was my initial response. No way am I ever going to type that out manually again.

How to do it

In Eclipse, navigate to Window -> Preferences. In the tree on the left-hand side, go under Java -> Editor -> Templates. This is the screen for defining text expansions that will be available when using the editor. Click New and enter “log” (or a name of your choosing) as the name to expand from. Set the context drop-down to “Java type members”. Finally, copy and paste the following into the Pattern field:

private final Log log = LogFactory.getLog(${enclosing_type}.class);
${imp:import(org.apache.commons.logging.Log, org.apache.commons.logging.LogFactory)}

Save it, apply changes, and exit the Preferences window. You are all good to go now!

Note

This assumes that you’re using the Apache Commons Logging library for all of your logging tasks. The above template can easily be adapted to declare your specific logger of choice.

Svn to Git Migration

At my workplace ZDirect, we have a decade-old SVN repository hosting about twenty projects and totalling about 13,000 commits. Recently, we decided to switch from SVN to Git because SVN is slowly becoming antiquated and has various productivity slowdowns that are not seen in newer version control systems.

Some immediate goals
  • Speed up the time it takes to clone a repo
  • Simple branching and conflict handling
  • More code reviews

That last point is about pull requests, which are actually a feature of the web-based software hosting system. We chose GitLab as our solution, but more on that will come in a later post.

Some long-term goals
  • Move towards continuous integration
  • Use advanced Git workflows

Being the most comfortable with Git, I volunteered myself as the “Migration Lead”, coordinating both the technical side and the human side. There is an incredible number of articles out on the web about how company X or average Joe Y moved their SVN codebase to Git. What has really helped me along the way so far is Atlassian’s Git articles and tutorials; they outline a standard workflow that makes it trivial for anyone else to do the same.
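The core of that workflow is git svn. A sketch of the clone step (the repository URL and file names are illustrative):

# Map SVN committer names to Git authors, then clone the standard
# trunk/branches/tags layout with full history.
git svn clone --stdlayout --authors-file=authors.txt \
    https://svn.example.com/repo my-repo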

Hacking Banshee

Since syncing music to my Samsung Galaxy S3 doesn’t work with Linux (for the most part), I’ve constructed a method of transferring the music over ssh from my laptop to my phone using a cool little program called Unison. The solution is flawless and allows two-way syncing.
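The Unison invocation is roughly this (a sketch; the host name and phone-side path are illustrative, and the phone needs an ssh server running):

# Two-way sync of the local music directory with the phone over ssh.
unison ~/Music ssh://phone//sdcard/Music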

One problem (or challenge, depending on how you think about it) is that the playlists managed inside Banshee Media Player on my laptop have to be exported to files individually and manually when I want to transfer them over to my phone. Having a way to automatically export all of my playlists to some predefined directory would be very helpful for automating my music syncing. After some Googling, it seems no one has solved this problem yet.

I grabbed the Banshee source and started looking over the files associated with playlist exporting. Bingo! Shortly thereafter, in the file Banshee.ThickClient - Banshee.Gui.SourceActions.cs, I found the method OnExportPlaylist(), which handles the user interaction for exporting a playlist, and the holy grail: the playlist.Save() method call.

The next logical step for me would be to figure out whether this functionality can be encapsulated into an extension, or if that’s not possible, a patch. I’ll definitely be following up on this.