Writing tests for code that generates HTML surfaced a peculiarity in how HTML encodes URLs. The valid URL
https://example.com?a=b&c=d would always get modified when inserted into HTML, like so: https://example.com?a=b&amp;c=d
One of my teammates asked about this during a code review: why was the & character converted to &amp; in the resulting HTML? The URL didn’t look right, since the &amp; looked like it would break the URL query string.
Even more confusing, the URL in the HTML still worked, since Google Chrome and other browsers converted it from its &amp; form back to &. Were the browsers just being helpful by handling these developer mistakes, much like they already do when closing unclosed HTML elements?
The fake bug hunt
Over two hair-pulling days of reading GitHub issues, StackOverflow, HTML standards, source code, and more, it became clear that there was a divide in understanding. One group of people saw this as a bug in their library of choice, and another group understood that it wasn’t a bug at all.
I was definitely in the former group until I finally found a helpful blog post clearing up the confusion. This StackOverflow answer also concisely sums up why this happens in a few quick sentences.
Simply stated, lone & characters in HTML are invalid and must be escaped to &amp;.
HTML and XML both descend from SGML, and they share many of its rules. One of those rules is that a & character by itself is invalid, since & is the escape character that begins a character entity reference (e.g. &amp;).
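To make the rule concrete, here is a minimal sketch using Python’s standard library. It is an assumption on my part that Python stands in for whatever generates the HTML; the libraries mentioned in this post apply the same escaping. The URL is the one from the top of the post.

```python
import html

# The example URL from the top of this post.
url = "https://example.com?a=b&c=d"

# html.escape replaces &, < and > (and quotes, by default) with entity references.
escaped = html.escape(url)
print(escaped)  # https://example.com?a=b&amp;c=d

# The escaped value can be placed inside an attribute safely.
anchor = f'<a href="{escaped}">link</a>'
print(anchor)

# html.unescape reverses the step, which is roughly what a parser does on read.
assert html.unescape(escaped) == url
```

The round trip at the end is the important part: escaping changes how the URL is written in the markup, not which URL it is.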
The confusion arises when people don’t know that this rule exists. Many, including myself, were blaming their HTML parsing libraries, such as Nokogiri and libxml2. Others blamed their web app of choice for sending them invalid HTML or XML that their HTML parser didn’t know how to deal with.
Another way of understanding the same problem is that a URL has its own rules about which characters must be encoded, and HTML has a separate set of escaping rules. When a URL is embedded in HTML, HTML’s rules are applied on top of the URL’s, so the URL may look invalid when read in the raw markup. This can lead to funky-looking URLs, but rest assured that an HTML parsing library or a browser will properly encode and decode any data stored within HTML.
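The two layers can be sketched in Python as follows. The query parameters here are made up for illustration and do not come from the original code; the point is only that URL encoding is applied first and HTML escaping second, and that decoding runs in the opposite order.

```python
import html
from urllib.parse import parse_qs, urlencode, urlsplit

# Illustrative parameters (hypothetical, not from the original post).
params = {"q": "cats & dogs", "page": "2"}

# Layer 1: URL encoding. The & inside the value becomes %26,
# while the & between parameters stays a literal separator.
url = "https://example.com?" + urlencode(params)
print(url)

# Layer 2: HTML escaping. Only the separator & is left to escape.
href = html.escape(url)
print(href)

# Decoding runs in reverse: undo the HTML layer, then the URL layer.
decoded_url = html.unescape(href)
decoded_params = parse_qs(urlsplit(decoded_url).query)
print(decoded_params)
```

Note that the & inside the parameter value never reaches the HTML layer as a bare &, because URL encoding has already turned it into %26.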
This explains why our browsers see &amp; in the raw HTML and know to convert it back to &. It also confirms that it is completely fine to see &amp; in tests comparing HTML.
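As a rough sketch of what such a test might look like, the snippet below checks both the raw markup and the decoded attribute value. It assumes Python’s built-in html.parser in place of Nokogiri, and HrefCollector and extract_hrefs are hypothetical helper names introduced only for this example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the decoded href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser hands over attribute values with entities already decoded.
        for name, value in attrs:
            if tag == "a" and name == "href":
                self.hrefs.append(value)


def extract_hrefs(markup):
    parser = HrefCollector()
    parser.feed(markup)
    parser.close()
    return parser.hrefs


generated = '<a href="https://example.com?a=b&amp;c=d">link</a>'

# Comparing raw HTML: the escaped form is what we expect to see.
assert "a=b&amp;c=d" in generated

# Comparing parsed values: the parser gives back the original URL.
assert extract_hrefs(generated) == ["https://example.com?a=b&c=d"]
```

Asserting on the parsed value keeps the test focused on the URL itself, while asserting on the raw string documents that the escaped form is intentional.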