Byte Positions Are Better Than Line Numbers

It would be great if more utilities and editors supported byte positions in addition to line numbers for error messages.

Jan 09, 2024

At my own peril, I’m about to talk about something extremely simple. It’s so simple that it’s hard for me to imagine anyone objecting to it. Naturally, this fills me with dread, because it is precisely when a technical concept is simple that it causes the most controversy[1].

Nonetheless, I think what I’m proposing is worth it. So, with a dismissive wave toward the effluvium that will inevitably bubble up in the Hacker News comment section, let’s get to it:

Reporting errors as byte positions instead of line numbers is better for most uses. This error message:

/dev/stuff/foo.cpp[90892]: Missing semicolon

is usually better than this error message:

/dev/stuff/foo.cpp(4201,14): Missing semicolon

The first says the error is on the 90,892nd byte of the file. The second says the error is at line 4201, column 14. Both indicate the same position in the file, but the byte position version is more scalable, more versatile, and easier to implement. By contrast, the line/column version has only one benefit: invariance to text encoding.

Scalability

There are no subsystems in a modern computer that use line numbers as an indexing scheme. The network, disk, file system, memory, address translation, and cache hierarchy are all organized around blocks of binary data, usually (but not always) with a power-of-two size.

So, if you’d like to open an editor or viewer utility, and jump to the spot where an error occurred, “it’s at byte 90,892” is effectively an O(1) operation. Byte positions address files the same way as do operating systems and computer hardware. Regardless of whether the file is on the network or on a local file system, the application can just send a request for, say, 16k of data around the 90,892 mark. It’s guaranteed to get the location of the error, and more than enough surroundings to provide the user with context.
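As a rough sketch of what that request looks like in practice, here is one way a viewer could fetch the surroundings of a byte position. The function name `read_window` and the 16 KiB default are my own illustrative choices, not anything from a particular tool, and the window is simply centred on the reported position:

```python
import os

def read_window(path, byte_pos, window=16 * 1024):
    """Return (start, data): up to `window` bytes roughly centred on byte_pos.

    Hypothetical helper for illustration. The cost depends on `window`,
    not on how large the file is.
    """
    size = os.path.getsize(path)
    if not 0 <= byte_pos < size:
        raise ValueError(f"byte position {byte_pos} is outside the file (size {size})")
    # Clamp so a position near the start of the file still works.
    start = max(0, byte_pos - window // 2)
    with open(path, "rb") as f:
        f.seek(start)          # jump straight to the region of interest
        data = f.read(window)  # read only the window, never the whole file
    return start, data
```

The error itself sits at `data[byte_pos - start]`, with everything else in `data` available as context for display.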

On the other hand, if the error report is “it’s at line 4201, column 14”, the application has no idea what to request. Theoretically, the error could occur as early as byte 4213 (if the first 4200 lines were blank, and the line endings were single-byte encoded), or as late as the last byte in the entire file. Well, perhaps not quite that far if the programming language in question has a line length limit. But if it doesn’t, then it really is that bad.

Thus the problem goes from being trivially O(1) to being O(n) in the size of the file. An application has no choice but to read through the entire file, starting at the beginning, and count each newline. It can never randomly access the file, because it simply can’t know ahead of time where the start of a particular line occurs.
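For comparison, here is a minimal sketch of the work a line/column position forces on the reader. It assumes single-byte LF line endings and one byte per column, which is the simplest possible case; `line_col_to_offset` is a hypothetical name used only for this illustration:

```python
def line_col_to_offset(path, line, col, chunk_size=64 * 1024):
    """Convert a 1-based line/column to a 0-based byte offset by scanning.

    Assumes LF line endings and single-byte characters. Every byte before
    the target line has to be read, so the cost is O(n) in the file size.
    """
    newlines_needed = line - 1
    offset = 0
    with open(path, "rb") as f:
        while newlines_needed > 0:
            chunk = f.read(chunk_size)
            if not chunk:
                raise ValueError("file has fewer lines than requested")
            idx = -1
            while newlines_needed > 0:
                idx = chunk.find(b"\n", idx + 1)
                if idx == -1:
                    break
                newlines_needed -= 1
            if newlines_needed == 0:
                offset += idx + 1      # target line starts just past this newline
            else:
                offset += len(chunk)   # keep scanning the next chunk
    return offset + (col - 1)
```

Even in this best case, there is no way to skip ahead: the loop has to touch every chunk before the target line.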

Now, you might say, “well, I don’t care much about this sort of thing, because my source files are always small!” But, consider that error reporting is not just for simple source files.

First, you may encounter situations where many source files get concatenated together and feed into a tool for some reason, leading to (sometimes surprisingly) large files.

Second, most text processing utilities and editors tend to be used on both source files and other kinds of text files, some of which can be massive. Logs, traces, JSON data, and other common kinds of non-code text files can easily be gigabytes worth of data.

Because of this O(1) vs. O(n) difference, referring to byte locations instead of lines can lead to dramatic improvements in processing time and response times when files become large. Processing utilities that only need to extract some data can do so without ever touching the rest of the file, and viewers that only need to show a small window around the location can usually do so by reading only a small fraction of the file contents. This allows them to instantly handle many-gigabyte files instead of having noticeable pauses while entire files are ingested and their lines counted.

Versatility

Performance and responsiveness aren’t the only reasons to prefer byte positions. A perhaps more compelling argument is that byte positions unify text processing with binary processing[2].

Consider the case of, say, an art pipeline. The artist saves a Photoshop file — a “PSD” as they are called, due to the file extension — and it is ingested into the pipeline. Once parsed, it’s processed into various resources for the shipping product: PNGs are generated, or texture files, or icons, etc.

If your PSD parser encounters a problem — even if it’s just a bug in your parsing code — it has to report this somehow. But how? The standard way to report errors is with a line number and column, but binary files don’t have such things. There is no useful way to report the location of the error, only the file in which it occurs.

With byte positions, error reporting works exactly as it does in the text case. This:

/dev/stuff/bar.psd[3458291]: Unrecognized block type

is the exact same format of error message, with the exact same meaning, as in the text case — but it works perfectly on binary data. You can trivially open up your favorite binary editor directly to the offending block, or set a breakpoint in your code for when it is encountered.
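To make that concrete, here is a small sketch of a binary parser that reports errors this way. The block layout (a 4-byte tag followed by a 4-byte big-endian length and a payload) is invented for the example and is not the real PSD format; `KNOWN_TAGS`, `ParseError`, and `parse_blocks` are likewise hypothetical names:

```python
import struct

KNOWN_TAGS = {b"HEAD", b"DATA"}  # hypothetical block types, not real PSD tags

class ParseError(Exception):
    """Error carrying a byte position, formatted with the bracket syntax."""
    def __init__(self, path, byte_pos, message):
        super().__init__(f"{path}[{byte_pos}]: {message}")
        self.byte_pos = byte_pos

def parse_blocks(path, data):
    """Walk a made-up 'tag + length + payload' layout over `data` (bytes)."""
    pos = 0
    blocks = []
    while pos < len(data):
        if pos + 8 > len(data):
            raise ParseError(path, pos, "Truncated block header")
        tag, length = struct.unpack_from(">4sI", data, pos)
        if tag not in KNOWN_TAGS:
            # `pos` is the variable the parser already needs in order to read.
            raise ParseError(path, pos, "Unrecognized block type")
        payload_start = pos + 8
        if payload_start + length > len(data):
            raise ParseError(path, payload_start, "Truncated block payload")
        blocks.append((tag, pos, length))
        pos = payload_start + length
    return blocks
```

Note that the error path adds no bookkeeping of its own. It only prints the read position the loop was already maintaining.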

The benefits of byte positions become even more apparent in mixed mode files. A PSD, for example, also contains lots of embedded text!

For example, if there is a syntax error in the XML chunks often embedded in PSD files, what use is a line/column error report? Since the line/column will only be relative to the text portion, there is no way for a hex utility or debugger to actually show you where the error is. You would first need to extract the text part of the chunk using additional information not provided by the error message, and then find the line and column inside it using the error message.

Again, by contrast, the byte position just works. Not only could you open up any hex editor directly to the offending syntax error, but if (as I advocate later) text editors add support for jumping to byte positions, you could even open up your text editor directly to that part of the PSD file, too! The only reason you can’t do that currently is because most text editors don’t have a “jump to byte number” command yet.

Ease of Implementation

Normally, something that is better in (almost) all cases comes with the caveat that it is harder to implement. Thankfully, that is not the case with byte position error messages.

Having long ago implemented byte position error messages in both our text editor (for jumping to errors) and our build tools (for tracking and reporting errors), I can report that it was never harder to implement, and was often significantly easier.

In our case, for our parsing code (which typically deals with C++, PSD, PNG, WAV, CSV, and so on), it usually required no code at all to report byte-based errors. Since the reading code typically has to track where it is in order to read the file, all you are doing when you report byte position errors is printing the location variable you already have handy.

By contrast, in older versions of our build tools which reported line/column-based errors for our C++ and CSV parsing, we had additional code in a few places that did nothing but count newlines. Though simple, it was actually cumbersome code that complicated loops which otherwise wouldn’t have differentiated newlines at all.

As a simple example, an operation like “skip whitespace” in a line/column-reporting parser typically has to include a special check for “\r\n” et al and handle it accordingly. Alternatively, it can opt to defer line counting until an error is encountered, and then use the current position to reread the file up to that point, counting newlines to compute the error position. Regardless of which you choose, both approaches to line/column reporting require extra code that serves no other purpose.

In the byte position case, on the other hand, all whitespace characters can be considered uniformly everywhere. No special cases need to be added. The straightforward file position already used to determine the next character is all that is needed for error display, and no reprocessing of the file is ever necessary.
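The difference is easy to see side by side. The following is only a sketch of the two styles, not our actual parsing code, and both function names are made up for the comparison:

```python
WHITESPACE = b" \t\r\n"

def skip_whitespace(data, pos):
    """Byte-position version: every whitespace byte is treated the same."""
    while pos < len(data) and data[pos] in WHITESPACE:
        pos += 1
    return pos

def skip_whitespace_tracking_lines(data, pos, line, col):
    """Line/column version: the same loop, plus newline special cases."""
    while pos < len(data) and data[pos] in WHITESPACE:
        if data[pos:pos + 2] == b"\r\n":
            # A two-byte line ending must count as a single new line.
            pos += 2
            line += 1
            col = 1
        elif data[pos] in b"\r\n":
            pos += 1
            line += 1
            col = 1
        else:
            pos += 1
            col += 1
    return pos, line, col
```

Both functions end at the same `pos`. Everything extra in the second one exists only to keep `line` and `col` up to date.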

So, if our codebase is any indication, all you really have to do to switch from line-based error reporting to byte-based error reporting is delete a bunch of code. You typically don’t have to add any new code at all.

One Potential Downside

Using byte position errors was strictly a win for us, and we don’t use line numbers for anything anymore except reading messages from compilers we didn’t write (MSVC and Clang). Everywhere else, we’re byte-position only.

However, when I tried to think of external situations where line numbers would still be preferable, I was able to come up with an important case: user-to-user communication about the same text file, but where each user has a different encoding of the file.

For example, suppose you are using a source code control utility that rewrites text files on checkout to conform to the operating system’s default line ending. On Linux or MacOS (both classic and modern), there will be no change to a byte-position reported error in a text file, because the line endings are always encoded with a single character — thus, byte positions in the file don’t change.

However, on Windows — which oh-so-helpfully chose to encode line endings as two characters — byte positions in the file do change if a text file is reencoded from Linux or MacOS.

While this poses no problems for byte position error messages on a single machine, nor on a homogeneous set of machines, it does create a problem when two users need to communicate about an error position if one is on Windows and one is not. Their byte positions will not line up, whereas a line/column position would.

Similarly, you could imagine this happening across text encodings. If one user has a source file encoded as UTF-8, but another user for some reason encoded the file as UTF-16, their byte positions won’t line up. But — at least most of the time — their line/column positions presumably would.
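A tiny example shows how far the positions drift. The helper below is hypothetical and just measures how many bytes precede a given character once the text has been written out with a particular line ending and encoding:

```python
def byte_offset(text, char_index, encoding="utf-8", newline="\n"):
    """Byte offset of text[char_index] after re-encoding the file.

    Illustrative helper: `text` uses "\n" internally, and `newline` is the
    line ending that ends up on disk.
    """
    prefix = text[:char_index].replace("\n", newline)
    return len(prefix.encode(encoding))

source = "int a;\nint b\n"
error_at = source.index("int b")  # line 2, column 1 in every variant

lf_utf8 = byte_offset(source, error_at)                                    # 7
crlf_utf8 = byte_offset(source, error_at, newline="\r\n")                  # 8
lf_utf16 = byte_offset(source, error_at, encoding="utf-16-le")             # 14
crlf_utf16 = byte_offset(source, error_at, "utf-16-le", newline="\r\n")    # 16
```

The same character is "line 2, column 1" for everyone, but lands at byte 7, 8, 14, or 16 depending on how the file was stored.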

Therefore, sadly, I can’t say that there are no benefits to line/column reporting. The ability to have a more “encoding-invariant” error position isn’t nothing, especially when you consider things like developer mailing lists or Stack Overflow, where people often post error messages they are getting, to report bugs or ask for assistance.

Technically, of course, this is already an ambiguous thing to do without further information. The user at least has to specify which version of the source code they were using if they expect to refer to a specific line/column. So in theory, one could argue that the need to know the platform the user was using wouldn’t be all that different.

But even so, at least until Windows line endings are more firmly relegated to obscurity, switching to byte position errors exclusively everywhere might not be a pure win for this reason.

Recommendations

So where does that leave us?

Personally, my hope would be that utilities and applications would consider supporting byte positions in the same places we did in our internal tools:

Add the ability to report errors as byte positions, preferably with bracket syntax or some other standardized syntax that clearly delineates it from line/column reporting so the two will not be confused.
Add parsing of byte-position errors to IDEs.
Add “jump to byte location” to text editors that don’t have it.

Although we no longer support line/column reporting in most of our code, that kind of hard break with the existing paradigm only makes sense for internal tools. Externally, even if byte position errors become commonplace in the future, the transition period from line/column errors will be very long. Everything will have to interoperate with line/column reporting for the foreseeable future.

So, since it is usually trivial to implement byte position reporting, the best approach I can promote for error reporting code is to support a switch — say,

--byte-position-errors

or something like that. When enabled, it would use the bracket syntax instead of the parenthetical or “multi-colon” syntax typically used to indicate line/column errors.
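Here is a minimal sketch of how such a switch might be wired up in a tool that already tracks both kinds of position. The helper names are hypothetical, and the flag spelling is just the one suggested above rather than an option any existing tool is known to accept:

```python
import argparse

def format_error(path, message, byte_pos, line, col, byte_positions=False):
    """Format one error in either bracket (byte) or parenthetical (line/col) style."""
    if byte_positions:
        return f"{path}[{byte_pos}]: {message}"
    return f"{path}({line},{col}): {message}"

def build_arg_parser():
    """Command-line parser for an imaginary tool with the proposed switch."""
    parser = argparse.ArgumentParser(description="hypothetical example tool")
    parser.add_argument(
        "--byte-position-errors",
        action="store_true",
        help="report errors as file[byte] instead of file(line,col)",
    )
    return parser

args = build_arg_parser().parse_args(["--byte-position-errors"])
report = format_error(
    "/dev/stuff/foo.cpp", "Missing semicolon",
    byte_pos=90892, line=4201, col=14,
    byte_positions=args.byte_position_errors,
)
```

With the flag set, `report` uses the bracket form; without it, the same call falls back to the familiar line/column form.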

I don’t think this would create much fuss in most codebases, since the information necessary to print the byte position is usually readily available. Similarly, in error parsers, checking for brackets instead of parentheses seems straightforward.
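On the parsing side, a sketch along these lines would probably be enough for an editor or IDE to accept both forms. The regular expressions assume exactly the two message shapes used in this article, and `parse_error` is a made-up name:

```python
import re

# file[byte]: message
BYTE_ERROR = re.compile(r"^(?P<path>.+?)\[(?P<byte>\d+)\]: (?P<message>.*)$")
# file(line,col): message
LINE_ERROR = re.compile(r"^(?P<path>.+?)\((?P<line>\d+),(?P<col>\d+)\): (?P<message>.*)$")

def parse_error(text):
    """Return a dict describing the error location, or None if unrecognized."""
    m = BYTE_ERROR.match(text)
    if m:
        return {"path": m["path"], "byte": int(m["byte"]), "message": m["message"]}
    m = LINE_ERROR.match(text)
    if m:
        return {
            "path": m["path"],
            "line": int(m["line"]),
            "col": int(m["col"]),
            "message": m["message"],
        }
    return None
```

A caller can then check for the `"byte"` key to decide whether to seek directly or fall back to counting lines.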

In exchange for what is likely very little code, we would be able to unify error reporting across binary and mixed-mode files, and gain the ability to write utilities and viewers that handle error messages with O(1) performance instead of O(n).

As I said in the opening, this is all very simple. But at least for us, it was a nice simplification with a lot of benefits, and I suspect it might be beneficial for a lot of other projects out there as well.

[1] As has been observed for well over half a century.

[2] Byte positions also allow proper error reporting when the encoding of a file isn’t known to the reporting or viewing application (or both). Technically, in order to work properly, a line/column error reporting scheme requires both to have knowledge of the text format and line terminator character(s). On the other hand, for the same file, byte positions can always be counted accurately, and jumped to accurately, regardless of whether either application knows how a file is encoded.

vfig

Jan 9, 2024

i like this idea, with one modification: report byte positions in hex, e.g. 0x23ab5. it is more obviously not a line number (since we conventionally use decimal for line numbers); it is already commonplace to use hex for byte offsets in hex editors, debuggers, and so on; and as a side benefit, in large files it will end up being more compact anyway.

for tools talking to tools byte offsets is a clear win. but i would prefer to see tools reporting byte offsets in addition to line numbers, as a default (with switches if you like to choose only line numbers, or only byte offsets). because when a human has to be in the loop interpreting the result, line numbers have the convenience of being shorter (fewer errors in reading and typing them); easy to correlate by eye when an editor shows line numbers in the gutter (such as when having to make do with less-capable text editors that cant parse the output); and often easier to communicate to another human (such as when helping someone over screenshare, referring to line numbers is very valuable in a way that byte offsets cannot be).


The Sandvich Maker

Jan 9, 2024 (edited)

A small benefit for line numbers is that if I don’t have my editor hooked up to read compiler messages, it’s easier to type in a line number than a byte position, especially if I’m not looking at the errors and the code side by side. Perhaps that speaks more to a workflow problem though…


Computer, Enhance!
