Hunting for segmentation faults in Ruby programs

I was working on building a content management engine for Shopify’s next generation Help Centre. Code-named Brodie, it was the equivalent of the Jekyll static site generator combined with Ruby on Rails, but instead of rendering all the pages up front at compile time, each page was generated when it was requested by the client.

Brodie used a Ruby gem called Redcarpet for the Markdown rendering. Redcarpet worked wonderfully, but Brodie ended up having a severe bug due to its extensive use of it. The way Redcarpet was being used in Brodie resulted in periodic segmentation faults (segfaults) while rendering Markdown. These segfaults were causing many 502 and 503 errors whenever certain, initially unknown, pages were visited. It was such an issue that all the web servers in the cluster would go down for some time until they restarted automatically.

How do I Redcarpet?

To better explain the issue and its resolution, it is best to have an understanding of how Redcarpet, and really any other text renderer, works. Here is a simple example:

In the above example, the code defines the Markdown that is to be rendered to HTML, sets up the Redcarpet::Markdown configuration object, and then finally parses and renders the Markdown to HTML.

But wait! There’s more. Jekyll and Brodie both use the Liquid language (made by Shopify!) to make it easier to write and manage content. Liquid provides control flow structures, variables, and functions. One useful function allows including the contents of other files into the current file (the equivalent of partials in Rails). Here is an example that uses the Liquid include function:

As we can see in the example above, the code renders the Liquid and Markdown to HTML. This is achieved by rendering the Liquid first, then passing the result of that into the Markdown renderer. Additionally, the Liquid include function injected the contents of _included.liquid exactly where the include function was called in main.md.

Now that the basics of Markdown and Liquid rendering have been explained, it is possible to understand this segfault issue.

“Where is this segfault coming from???”

When my team and I were close to launching the new Help Centre that used Brodie, the custom-built Liquid and Markdown rendering engine, the app would crash due to segmentation faults. When the servers were put under load with many requests coming in, the segfaults and resulting downtime were magnified. It was clear from load testing that a small amount of traffic would bring down the entire site and keep it down.

The segfaulting would lead to servers becoming unavailable until Kubernetes, the cluster manager, detected that those servers were unhealthy and restarted them. It took 30-60 seconds for a pod to come back online. With the system being under load, it was only a couple of minutes before all the servers in the cluster were down. When this happened, the app returned HTTP 502 and 503 errors to any client requesting a page – never a good sign.

The only message that was present in the logs before the app died was the following:

Assertion failed: (md->work_bufs[BUFFER_BLOCK].size == 0), function sd_markdown_render, file markdown.c, line 2544.

Apparently, Ruby crashed in a seemingly random Redcarpet C function call. No stacktrace or helpful logging followed this message. The logs did not even include which page the client requested, as the usual Rails request log line is only written after the HTTP request finishes. This Assertion failed message was a lead, but didn’t help much since it does not reference what caused it.

I have dealt with other Redcarpet issues in the past, where methods that have been extended in Redcarpet to add custom behaviour have thrown exceptions. Sometimes these exceptions have caused the request to fail and a stacktrace of the issue to show up. Other times it has resulted in a segfault with a similar Redcarpet C function in the message. Ultimately, writing better code fixed this earlier situation.

My intuition told me that an error was being thrown while rendering the page, causing this segfault to occur. I attempted an experiment where I added some rescue blocks to the Redcarpet methods that we extended. These would catch any exceptions raised by our potentially buggy extension code before they could escape, hopefully resulting in no segfaults. If that fix succeeded, I could safely assume that fixing the code which raised the error would be the end of the story.

We shipped the experiment to production. Things went well until the next day. Sometime overnight the page that caused the segfaults was hit, and the operational dashboards recorded the cluster going down and rebooting. At least this confirmed that the Redcarpet extensions were not at fault.

Getting lucky

While playing around with things, I found, out of sheer luck, a page that could cause the app to segfault repeatedly. Visiting this page once did not cause the server to crash or return a 500 response, but refreshing the page multiple times did cause the server to crash. Since this app was running multiple threads in the local development and production environments to answer requests in parallel, it was possible that a shared Redcarpet data structure was getting clobbered by multiple threads writing to it at the same time. This is actually a recurring issue according to the community:

Recursive rendering

Discussing the issue more with my larger team of developers, there was the idea of removing any sort of cross-thread sharing of Redcarpet’s configuration object. One of the other developers shipped a PR which gave each thread its own Redcarpet configuration object, but this did not end up fixing the problem.

A tree, showing the order in which nodes are traversed using the depth-first search algorithm. (CC 3.0)

Building on top of this developer’s work, I knew that it was possible for the Redcarpet renderer to be called recursively based on the nature of the Liquid and Markdown content files. As described earlier, it is possible for one content file to include another content file. As we saw in the examples earlier in this article, when a content file is being rendered, the rendering pauses on that content file and descends into the included file to render it, then returns to where it left off in the original content file. This behaviour is exactly like the depth-first search algorithm from graph theory.
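The depth-first behaviour can be illustrated with a small sketch. It does not use Liquid or Redcarpet at all: the FILES hash, the render_file method, and the regular expression that matches include tags are made up for this illustration, so treat it as a toy model of the traversal order rather than how Brodie was implemented.

```ruby
# Toy content "files"; the names and contents are hypothetical.
FILES = {
  "main"  => "[main {% include 'intro' %} {% include 'outro' %}]",
  "intro" => "(intro {% include 'note' %})",
  "note"  => "note",
  "outro" => "outro"
}.freeze

# Simplified pattern for an include tag (an assumption, not Liquid's parser).
INCLUDE_TAG = /\{%\s*include\s+'(\w+)'\s*%\}/

# Renders one file. Each include pauses the current file, renders the
# included file completely, then resumes where it left off.
# `visited` records the order in which files are entered.
def render_file(name, visited = [])
  visited << name
  FILES.fetch(name).gsub(INCLUDE_TAG) do
    render_file(Regexp.last_match(1), visited)
  end
end

visited = []
output = render_file("main", visited)

puts output           # [main (intro note) outro]
puts visited.inspect  # ["main", "intro", "note", "outro"]
```

The visit order (main, intro, note, outro) is a depth-first traversal: note is fully rendered before outro is even opened.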

After making this breakthrough it was simple to understand what to try next: each time Redcarpet is called to render some Markdown, always create a new Redcarpet configuration object. This should solve the issue of multiple thread writes, as well as the recursive writes. Even though there is extra overhead in creating a new Redcarpet configuration object each time a content file is rendered, it is a reliable workaround that bypasses Redcarpet’s single-thread, single-writer limitation.
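A rough sketch of the difference between a shared renderer and a renderer created per call is shown below. ToyRenderer is a hypothetical stand-in, not Redcarpet: it simply refuses to be re-entered while a render is in progress, which loosely mimics a renderer with shared internal working state and says nothing about Redcarpet’s actual internals. The PAGES hash and the `<<name>>` include marker are also invented for the example.

```ruby
# Hypothetical stand-in for a renderer that keeps internal working state
# and therefore cannot be re-entered while a render is in progress.
class ToyRenderer
  def initialize
    @busy = false
  end

  # Upcases the text. Each "<<name>>" marker is replaced by whatever the
  # block returns, which is where a nested render can be triggered.
  def render(text)
    raise "renderer re-entered while busy" if @busy

    @busy = true
    begin
      text.gsub(/<<(\w+)>>/) { yield Regexp.last_match(1) }.upcase
    ensure
      @busy = false
    end
  end
end

PAGES = { "main" => "hello <<part>>", "part" => "world" }.freeze

# Shared instance: the nested render re-enters the same object.
SHARED = ToyRenderer.new

def render_shared(name)
  SHARED.render(PAGES.fetch(name)) { |inner| render_shared(inner) }
end

# Workaround described in the text: a new renderer object for every render.
def render_fresh(name)
  ToyRenderer.new.render(PAGES.fetch(name)) { |inner| render_fresh(inner) }
end

begin
  render_shared("main")
rescue RuntimeError => e
  puts "shared renderer failed: #{e.message}"
end

puts render_fresh("main") # HELLO WORLD
```

Because every call builds its own object, neither a nested include nor another thread can reach the state of a render that is already in progress.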

I coded and shipped this fix, and it worked!

Refreshing that problematic page multiple times, no matter how many times, never crashed the app. The production servers were back to handling their original capacity and one developer was feeling very relieved.

Takeaways

I learned a considerable number of things from this debugging experience. Even when using battle-tested software (like Redcarpet), there may be use cases which are not fully supported, or which fail without being documented as unsupported. Additionally, the Redcarpet library is now rarely maintained. Knowing the limitations up front can save time and frustration. One of the main reasons this article was written was that there was no other writing about this issue and its workarounds. Hopefully it will help save time for developers in the future who run into similar issues.

It was valuable to bounce ideas off of other team members. If I had not put out my ideas and had these discussions, I would not have understood the problem as well as I did. Even the potential fix that a teammate of mine shipped, which did not end up working, helped me understand the problem better.

Drawing out parts of the control flow on paper to really understand how the app renders content files builds a better mental model of what actually goes on inside the app. It is one thing to have a high level overview of how different components interact with each other, but it is an entirely different level of understanding to factually know what exactly happens. This can be extended to the intricacies of the software libraries being used. In this situation, knowing the internals and behaviour of Ruby on Rails, Liquid, and Redcarpet made it a lot easier to understand what was going on.

Lastly, you always feel like a boss when you fix big, complicated problems such as this one.

Twilio’s TaskRouter Quickstart

My team and I are exploring different services and technologies in the area of contact centres. We develop and maintain the tools for over 1000 support agents, with the number rapidly rising. Making smart, long-term business and technology decisions is paramount. One of the technologies we looked into was Twilio and its ecosystem – specifically TaskRouter.

Twilio’s TaskRouter provides a clean interface for building contact centres. Its goal is to take the tedious infrastructure and plumbing work out of building a custom contact centre, exposing the right APIs to implement domain logic. TaskRouter is a high-level service since it orchestrates voice, SMS, and other communication channels with the ability to assign incoming interactions across a workforce of agents ready to take those interactions.

Twilio-Ruby

To get a head start at understanding how TaskRouter works, I spent a day looking at Twilio’s Ruby quickstart guide for TaskRouter. Wow, was I in for a frustrating time.

The quickstart guide takes the reader through a number of steps, both inside of the Twilio Console as well as building a small Ruby Sinatra app. After completing the quickstart the reader should have a fully functioning call centre with an interactive voice response (IVR) to greet and queue any user that calls in.

One of the things that made the quickstart harder to complete was that the Ruby code examples included throughout used an older version of the twilio-ruby gem. Because of this, the code examples didn’t work with the latest version. This was both a bad and a good thing. Bad in that the existing code examples wouldn’t work out of the box, but good in that I had to put some extra effort into learning where the docs and other sources of help exist, and came away with a deeper understanding of how the Twilio API works.

I compiled a list of resources that would assist anyone going through the same or a similar situation. It certainly helped me complete the TaskRouter quickstart.

  • The README for the twilio-ruby gem provided a great overview of what functionality it provides and how the gem is to be used
  • The v4 to v5 upgrade guide for the twilio-ruby gem showed that there was some sense to this chaos by providing the rationale and examples for updating old versions of the twilio-ruby code to the latest (v5). This was where I had my moment of understanding for the quickstart code examples.
  • Using JWT tokens was part of the last section of the quickstart. Since twilio-ruby changed the way it uses tokens, its code examples had to be updated too. The main Twilio docs on JWT go into the intricacies of building policies contained within JWT tokens.
  • My lead/manager was quite happy when I mentioned to him that the twilio-ruby gem no longer uses title case in situations where camel case or snake case would have better suited Ruby style. TwiML was affected by this for a number of gem versions up until v5. Since TwiML is used frequently throughout the quickstart, the docs for using TwiML in twilio-ruby helped during those times.
  • Lastly, if all else fails, feel free to reference my resulting code from the TaskRouter quickstart. It’s available here on GitHub.

How Does Symmetric and Public Key Encryption Work?

With the release of Rails 5.2 and the changes to how secrets are securely stored, I thought it would be timely to write about the benefits and downsides of secrets management in Rails. It would be valuable to compare how Rails handles secrets, how Shopify handles secrets, and a few other methods from the open source community. On my journey to write about this I got caught up in explaining how symmetric and public key encryption work. So the comparison of different Rails secret management gems will have to wait for another post.

Managing secrets is now more challenging

A majority of applications created these days integrate with other applications – whether it’s for communicating with other business-critical systems, or purely operational such as log aggregation. Secrets such as usernames, passwords, and API keys are used by these apps in production to communicate with other systems securely.

The Configuration Management movement, and later the DevOps movement, rallied around and popularized a wide array of methodologies and tools for managing secrets in production. Moving from a small, artisanal, hand-crafted set of long-running servers to the modern short-lifetime cloud instance paradigm now requires the discipline to manage secrets securely and repeatably, with the agility to revoke and update credentials in a matter of hours if not minutes.

While there are many ways to handle secrets while developing, testing, and deploying Rails applications, it’s important to bring up the benefits and downsides of the different methods, particularly around production. Different levels of security, usability, and adoption exist with different technologies. Public/private key encryption, of which RSA is a well-known example, is one of the technologies. Symmetric key encryption is another common encryption technology.

There exist many ways to handle secrets within Rails and webapps in general. It’s important to understand the underlying concepts before settling on one method or another because making the wrong decision may result in secrets being insecure, or the security being too hard to use.

Let’s first discuss the different types of encryption that are characteristic of the majority of secret management libraries and products out there.

Symmetric Key Encryption

Symmetric key encryption may be the simplest form of encryption to understand, but don’t let that trick you into thinking that it’s not secure. Symmetric key encryption involves one key used to both encrypt and decrypt data. This key will have to be kept secret and only be shared with trusted people and systems. Once secrets are encrypted with the key, that encrypted data can be readily shared and transferred without worry of the unencrypted data being read.

A simple example makes symmetric key encryption easier to explain. The most straightforward method uses the binary XOR function. (This example is not representative of the state of the art symmetric key encryption algorithms in use, but it does get the point across.) The binary XOR function means “one or the other, but not both”. Here is an example that shows the complete set of inputs and outputs for one binary digit:

1 XOR 1 = 0
1 XOR 0 = 1
0 XOR 1 = 1
0 XOR 0 = 0

A more complicated example would be:

10101010 XOR 01010101 = 11111111
11111111 XOR 11111111 = 00000000
11111111 XOR 01010101 = 10101010

Note that lines 1 and 3 are related. The output of line 1 is the first input of line 3, and the second parameter of line 1 is also the second parameter of line 3. Notice that the output of line 3 is the same as the first input of line 1. As demonstrated here, XOR-ing a value with the same parameter twice returns the original value. A further example will show this property.

Given the property that any higher form of data representation can be broken down to binary, we can then show the example of hexadecimal digits being XOR’ed with another parameter.

12345678 XOR deadbeef = cc99e897

Given the key is the hexadecimal characters deadbeef and the data to be encrypted is 12345678, the result of the XOR is the incomprehensible result cc99e897. Guess what? This cc99e897 is encrypted. It can be saved and passed around freely. The only way to get the secret input (i.e. 12345678) back is to XOR the encrypted value again with the key deadbeef. Let’s see this happen!

cc99e897 XOR deadbeef = 12345678

Fact check it yourself if you don’t believe me, but we just decrypted the data! This is the simplest example of course, so there’s a lot more that goes into symmetric key encryption to keep it secure. Things like block-based and stream-based algorithms, and even larger key sizes, augment the simple XOR algorithm to make it more secure. It may be simple for someone who wants to break the encryption to guess the key in this example, but it becomes much harder the longer the key is.
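For anyone who does want to fact check it, the round trip can be reproduced in a few lines of Ruby using the integer ^ (XOR) operator. The xor_cipher method and its sample message and key are illustrative additions that extend the same idea to arbitrary strings by repeating the key; like the rest of this section, it is a teaching sketch and not a secure cipher.

```ruby
# The hexadecimal example from the text.
data = 0x12345678
key  = 0xdeadbeef

encrypted = data ^ key
decrypted = encrypted ^ key

puts encrypted.to_s(16) # cc99e897
puts decrypted.to_s(16) # 12345678

# XORs every byte of `text` with the key bytes, repeating the key as needed.
# Calling it a second time with the same key undoes the first call.
def xor_cipher(text, key)
  key_bytes = key.bytes
  xored = text.bytes.each_with_index.map do |byte, i|
    byte ^ key_bytes[i % key_bytes.size]
  end
  xored.pack("C*")
end

secret = xor_cipher("attack at dawn", "k3y")
puts xor_cipher(secret, "k3y") # attack at dawn
```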

This is what makes symmetric key encryption so powerful – the ability to encrypt and decrypt data with a single key. With this property comes the need to keep this single key secret and separate from the data. When symmetric key encryption is used in practice, the fewer people and systems that hold the key, the better. Humans can easily lose the key, leave jobs, or worse: share the key with people of malicious intent.

Public Key Encryption

Quite opposite to how symmetric key encryption works, public key encryption (also called asymmetric key encryption, with RSA being a well-known algorithm) uses two distinct keys. In its simplest form the public key is used for encryption and the private key is used for decryption. With this method, the user who encrypts the data does not need the ability to decrypt it. Put plainly, it allows anyone to encrypt data with the public key while the owner of the private key is the only one able to decrypt the data. The public key can be shared with anyone without compromising the security of the encrypted data.

One tradeoff between symmetric and public key encryption is that the private key (the key used to decrypt data) is never shared with other parties, whereas in symmetric key encryption the same key is shared by everyone who encrypts or decrypts. A downside of public key encryption is that there are multiple keys to manage, so it brings a higher level of overhead than symmetric key encryption.

Let’s dig into a simple example. Given a public key (n=55, e=3) and a private key (n=55, d=27) we can show the math behind public key encryption. (These numbers were fetched from here).

Encrypting

To encrypt data the function is:

c = m^e mod n

Where m is the data to encrypt, e is the public exponent (part of the public key), mod is the modulus function, n is the shared modulus, and c is the encrypted data.

For the number 42 to be encrypted we can plug it into the formula quite simply:

c = 42^3 mod 55
c = 3

c = 3 is our encrypted data.

Decrypting

Decrypting takes a similar route. For this a similar formula is used:

m = c^d mod n

Where c is the encrypted data, d is part of the private key, mod is the modulus function, n is the shared modulus, and m is the decrypted data. Let’s decrypt the encrypted data c = 3:

m = 3^27 mod 55
m = 42

And there we have it, our decrypted data is back!
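The arithmetic above can be checked with Ruby’s built-in Integer#pow, which accepts a modulus as its second argument. The variable names are chosen for this sketch. The final lines add one fact the text does not spell out: with n = 5 × 11, the exponents e and d pair up because e × d leaves a remainder of 1 when divided by (5 − 1) × (11 − 1) = 40.

```ruby
# Toy key pair from the text: public key (n, e), private key (n, d).
n = 55
e = 3
d = 27

message   = 42
encrypted = message.pow(e, n)    # m^e mod n
decrypted = encrypted.pow(d, n)  # c^d mod n

puts encrypted # 3
puts decrypted # 42

# Why e and d undo each other for this tiny key: n = 5 * 11.
phi = (5 - 1) * (11 - 1)
puts((e * d) % phi) # 1
```

Real keys use moduli that are thousands of bits long, which is why the modular form of pow matters: it never materialises the enormous intermediate power.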

As we can see, a separate key is used for encryption and decryption. It’s worth restating that this example is very simplified. Many more mathematical optimizations, along with much larger key sizes, are used to make public key encryption secure.

Signing – a freebie with public key encryption

Another benefit to using RSA public and private keys is that given the private key is only held by one user, that user can sign a piece of data to verify that it was them who actually sent it. Anyone who has the matching public key can verify that the data was signed by the private key and that the data was not tampered with during transit.

When Bob needs to receive data from Alice and Bob needs to be sure it was sent by Alice, as well as not tampered with while being sent, Alice can hash the data and then encrypt that hash with her private key. This encrypted hash is then sent along with the data to Bob. Bob can then use Alice’s public key to decrypt the hash and compare it to a hash of the data that he performs. If both of the hashes match, Bob knows that the data was truly from Alice and was not tampered with while being sent to him.
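The Alice and Bob exchange can be sketched with the OpenSSL bindings that ship with Ruby. The key is generated on the fly and the messages are invented for the example; in practice Alice’s public key would be distributed to Bob ahead of time. Note that the sign method performs the hashing and the private key operation in one call.

```ruby
require "openssl"

# Alice's key pair, generated here only for the sake of the sketch.
alice_key    = OpenSSL::PKey::RSA.new(2048)
alice_public = alice_key.public_key # the only part Bob needs

data = "Meet at noon"

# Alice: hash the data with SHA-256 and sign that hash with her private key.
signature = alice_key.sign(OpenSSL::Digest.new("SHA256"), data)

# Bob: check the signature against the data using Alice's public key.
genuine  = alice_public.verify(OpenSSL::Digest.new("SHA256"), signature, data)
tampered = alice_public.verify(OpenSSL::Digest.new("SHA256"), signature, "Meet at nine")

puts genuine  # true
puts tampered # false
```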

Wrapping up

To pick one method of encryption as the general winner at this abstract level is nonsensical. It makes sense to have a use case and pick the best encryption method for it by finding the best fit at the abstract level first, then finding a library which offers that method of encryption.

A following post will go into the tradeoffs between different encryption methods in relation to keeping secrets in Ruby on Rails applications. It will take a practical approach, explaining some of the benefits of one encryption method over another, and then give some examples of well-known libraries for each category.

Parallel GraphQL Resolvers with Futures

My team and I are building a GraphQL service that wraps multiple RESTful JSON services. The GraphQL server connects to backend services such as Zendesk, Salesforce, and even Shopify itself.

Our use case involves returning results from these backend services all from the same GraphQL query. When the GraphQL server goes out to query all of these backend services, each backend service can take multiple seconds to respond. This is a terrible experience if queries take many seconds to complete.

Since we’re running the GraphQL server in Ruby, we don’t get the nice asynchronous IO that would come with the Node.js version of GraphQL. Because of this, the GraphQL resolvers run serially instead of in parallel – thus a GraphQL query to five backend services which take one second each to fetch data from will result in the query taking five seconds to run.

For our use case, having a GraphQL query that takes five seconds is a bad experience. What we would prefer is 2 seconds or less. This means performing some optimizations when GraphQL goes to do the HTTP requests to the backend services. Our idea is to parallelize those HTTP requests.

First Approaches

To parallelize those HTTP requests we took a look at non-blocking HTTP libraries, different GraphQL resolvers, and Ruby concurrency primitives.

Typhoeus

Knowing that running the HTTP requests in parallel is the direction to explore, we first took a look at the Ruby library Typhoeus. Typhoeus offers a simple abstraction for performing parallel HTTP requests by wrapping the C library libcurl. Below is one of the many possible ways to use Typhoeus.

After playing around with Typhoeus, we quickly found out that it wasn’t going to work without extending the GraphQL Ruby library. It became clear that it was nontrivial to wrap a GraphQL resolver’s life cycle with a Hydra from Typhoeus, a Hydra basically being a Future that runs multiple HTTP requests in parallel and returns when all requests are complete.

Lazy Execution

We also took a look at GraphQL Ruby’s lazy execution features. We had hoped that lazy execution would automatically optimize by running resolvers in parallel. It didn’t. Oh well.

We also tried a perverted version of lazy execution. I can’t remember why or how we came up with this method, but it was obviously overcomplicated for no good reason and didn’t work 😆

Threads and Futures

We looked back and understood the shortcomings of the earlier methods – namely, we had to find a concurrency method that would allow us to do the HTTP requests in the background without blocking the main thread until it needed the data. Based on this understanding we took a look at some Ruby concurrency primitives – both Futures (from the Concurrent Ruby library), and Threads.

I highly recommend using higher-order concurrency primitives such as Futures because of their well-defined and simple APIs, but for hastily hacking something together to see if it would work I experimented with Threads.

My teammate ended up figuring out a working example of Futures faster than I could hack my threads example together. (I’m glad they did, since we’ll see why next.) Here is a simple use of Futures in GraphQL:

It’s not clear at first, but according to the GraphQL Ruby docs, any GraphQL resolver can return the data or can return something that can then return the data. In the code example above, we use the latter by returning a Concurrent::Future in each resolver, and having the lazy_resolve(Concurrent::Future, :value!) in the GraphQL schema. This means that when a resolver returns a Concurrent::Future, the lazy_resolve part tells GraphQL Ruby to call :value! on the future when it really needs the data.

What does all of this mean? When GraphQL goes to fulfill a query, all the resolvers involved with the query quickly spawn Futures that start executing in the background. GraphQL then moves to the phase where it builds the result. Since it now needs the data from the Futures, it calls the potentially blocking operation value! on each Future.

The beautiful thing here is that we don’t have to worry about whether the Futures have finished fetching their data yet. This is because of the powerful contract we get with using Futures – the call to value! (or even just value) will block until the data is available.
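The start-everything-first, block-only-when-needed pattern can be tried without any gems, because Ruby’s built-in Thread#value also waits for the thread to finish before returning its result. The sketch below is a standard-library stand-in for the idea, not the Concurrent::Future and GraphQL Ruby code described above; the fetch method, the service names, and the 0.2 second delay are invented for the example.

```ruby
# Stand-in for a slow backend call; the delay and return value are made up.
def fetch(service)
  sleep 0.2
  "#{service} data"
end

services = %w[zendesk salesforce shopify]

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

# Start every request in the background straight away...
pending = services.map { |service| Thread.new { fetch(service) } }

# ...and block only at the point where the results are actually needed.
results = pending.map(&:value)

elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts results.inspect
puts format("took %.2f seconds", elapsed)
```

Run serially, the three calls would take roughly 0.6 seconds. Started up front, they overlap, so the total is close to the duration of the slowest single call.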

Conclusion

We ended up settling on the last design – utilizing Futures to allow the main thread to push as much asynchronous work as possible into the background.

As seen through our thought process, all that we needed was to find a way to start execution of a long-running HTTP request, and give back control to the main thread as fast as possible. It was quite clear throughout the early ideas of utilizing concurrent HTTP request libraries (Typhoeus) that we were on the right track, but weren’t understanding the problem perfectly.

Part of that was not understanding the GraphQL Ruby library. Part of it was also being fuzzy on our concurrency primitives and libraries. Once we had taken a look at GraphQL Ruby’s lazy loading features, it became clear to us that we needed to kick off the HTTP request and immediately give back control to the GraphQL Ruby library. Once we understood this, the solution became clear and we became confident after some prototypes that used Futures.

I enjoyed the problem solving process we went through, as well as this writing that resulted from it. The problem solving process ended up teaching the both of us some valuable engineering lessons about collaborative, up-front prototyping and design since we couldn’t have achieved this outcome on our own. Additionally, writing about this success can help others with our direct problem, not to mention learning about the different techniques that we met along the way.

Zero to One Hundred in Six Months: Notes on Learning Ruby and Rails

When they say you’ll like learning and programming in Ruby, they really mean it. In my experience, learning and professionally using Ruby and Ruby on Rails day-to-day has been quite straightforward and friendly. The rate of learning Ruby and Rails is limited by how fast you’re able to obtain and use that knowledge from resources online, books, or other people.

It’s common for people who join Shopify to not know how to program in Ruby, yet be required to learn it. Ruby’s community has produced a great number of beginner to intermediate guides that help newcomers quickly get up to speed with programming in Ruby. At Shopify, since the feedback is so fast, you’re able to get into an intense virtuous cycle of learning. Since you’re able to code and ship fast, you’re able to learn faster.

From personal experience I found it quite useful to do a deep-dive into Rails before starting to learn the full Ruby language. I focused on reading the entire Agile Web Development with Rails 5 book, which consists of a short primer on Ruby, then the bulk being how to develop an online store using Rails, and lastly an in-depth look into each Rails module. I completed this book over the two weeks before starting at Shopify to give me a head-start at learning.

Roughly the first two months spent at Shopify were a split of working on small tasks by myself, pair-programming with others, and reading a number of Ruby and Rails articles. At the end of two months I found myself being able to take pieces of work from our weekly sprints and completing them to the end without feeling like I was slow, and not requiring a team member to guide me through the entire change.

Code reviews over GitHub on the changes that I and others made provided a strong signal of how well my Ruby and Rails knowledge and style had progressed. Code reviews for my code at the start consisted of a lot of comments on style and better methods to use. As more and more code reviews were performed over time my intuition and knowledge increased, resulting in better code and fewer review comments. The bite-sized improvements gained in each code review slowly built up my knowledge and helped guide me towards areas of further learning.

Mastery of Ruby and Rails is gained over months to years of constant use. This is where the lesser-known to obscure language features are understood and put to use, or explicitly not put to use (I’m looking at you, metaprogramming!). Some examples are the unary & operator, bindings, and even Ruby internals such as object allocation and how Ruby programs are interpreted.

Coming from the statically-typed Java world, Ruby and Rails is INSANE with the things you can do since it is a dynamic language. My favourite dynamic language related features so far are Module#prepend for inheritance, and the ability to live-reload code.
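For readers who have not met these features yet, here is a brief sketch of Module#prepend and the unary & operator. The Greeter class and Shouting module are hypothetical names invented for the example.

```ruby
# Hypothetical class for illustration.
class Greeter
  def greet(name)
    "Hello, #{name}"
  end
end

# A prepended module sits in front of the class in the method lookup chain,
# so its method runs first and can reach the original through `super`.
module Shouting
  def greet(name)
    super.upcase + "!"
  end
end

Greeter.prepend(Shouting)

puts Greeter.new.greet("Ruby") # HELLO, RUBY!
puts Greeter.ancestors.first   # Shouting

# The unary & converts a symbol into a block via Symbol#to_proc.
puts %w[ruby rails].map(&:capitalize).inspect # ["Ruby", "Rails"]
```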

After a sufficient amount of time gathering knowledge and experience, you gain the ability to help others along their path of learning. This not only benefits their understanding, but it also reinforces your knowledge of the subject.

Some of the things I look forward to in the future are learning about optimizing Rails apps, demystifying metaprogramming, reading Ruby Under a Microscope, and digging into the video back-catalogue of Destroy All Software. I hope you have a good journey too!