Jason Sultana

Hi, I'm Jason! I'm the author of this blog and am a certified Microsoft Developer / AWS Architect that's addicted to good steak, good coffee and good code!

Removing a BOMB from your text!

03 May 2021 » dotnet

G’day guys!

Don’t be alarmed, there isn’t really a BOMB in your text. There may be a BOM though. Yep, you read that right.

While I was working on an XML Transformer project, which involves reading an XML configuration file and applying one or more XML Transformations to it, I ran into a rather peculiar excpetion on the second parse of the file.

System.Xml.XmlException: Data at the root level is invalid. Line 1, position 1.

The above exception was being thrown when calling LoadXml(string) on an instance of XmlTransformableDocument. Upon inspection though, the XML string being parsed didn’t differ significantly between the first and second pass, and I couldn’t see any characters before the expected < character at the start of the string. So what gives!?

Well, if you simply Google the exception, you’ll end up at a couple of interesting discussions of the issue, both on Stack Overflow and on a few developer blogs. Essentially, my problem was that after applying the first XML transformation, something was adding a BOM (Byte Order Mark) to the start of the string, and that was causing the parser to blow up on the second parse.

Wait, a what!?

Yep, that was my first thought exactly. Sorry Joel, I failed your Unicode standards expectations! Anyway, a BOM (Byte Order Mark) is firstly a zero-width, non-breaking space. This means that it should never be rendered, which is why it wasn’t showing up in my debugger when I examined the string about to be parsed in my IDE. Under the hood, the BOM is a Unicode character that’s expressed as U+FEFF.

That’s great, I hear you say, but what does the bloody thing do? Well hold on mate, I’ll get to that. Essentially, those first two bytes of the file (or the last two bytes of the file, if you read it backwards - hint-hint) tell you the Endianness of the text. In other words, are the bytes of the file Big Endian (from left to right, where the BOM is U+FEFF) or Little Endian (from right to left, where the BOM is U+FFFE). Those two bytes tell a reader the order of the rest of the bytes - hence the name.

That’s sensational Sherlock, now how do I parse an XML document?

There are a million and one answers to this. If you’re loading a single file from disk, you could consider using Andrew’s approach of reading the text using a StreamReader, or even just use the Load(path) method on XmlTransformableDocument, instead of LoadXml(string). But if you’re dealing with an XML string in memory (like I was), unfortunately this won’t help you much.

One fairly popular StackOverflow answer suggested the following:

private string RemoveBOM(string xml)
{
    // https://stackoverflow.com/questions/17795167/xml-loaddata-data-at-the-root-level-is-invalid-line-1-position-1
    var preamble = Encoding.UTF8.GetPreamble();
    string byteOrderMarkUtf8 = Encoding.UTF8.GetString(preamble);
    if (xml.StartsWith(byteOrderMarkUtf8))
    {
        xml = xml.Remove(0, byteOrderMarkUtf8.Length);
    }

    return xml;
}

The problem I found with this approach (which is mentioned in its comments) is that Encoding.UTF8.GetString(preamble) returns an empty string, and string.StartsWith("") will always evaluate to true. So while this will work if you’re certain that there is a BOM in your text, it will fail if there isn’t, since you’re mistakenly removing the first two bytes of the XML document otherwise.

I opted for the following implementation (originally posted on SO by Vivek Ayer).

private string RemoveBOM2(string xml)
{
    int index = xml.IndexOf('<');
    if (index > 0)
    {
        xml = xml.Substring(index, xml.Length - index);
    }

    return xml;
}

I like this approach, since it covers any whitespace characters, including non-breaking spaces and BOMs, and any other rubbish that might come before the opening tag.

Have you guys ever found a BOMB in your own text that you had to defuse? What’s your preferred approach? Let me know in the comments!

Catch ya!