Who Wants To Know?

Thursday June 04, 2009
Alternate Titles:

I Have No Log File, Yet I Must Scream

or

So You Logged It, Now What?

or

This is why I don't want to have LOG_DEBUG in Twisted

Sometimes, when writing a program, you feel compelled to make the program emit some output which is peripheral to its operation.  The question is - who wants to know about that information?

Maybe you're debugging the program, and you insert a simple 'print' statement to get some information about it.  Maybe your program is a network server, and you are recording the fact that a message was received and processed.  Maybe you're maintaining an old library routine, and you want it to emit a message that points to a newer, better version of that routine which is now preferred.  Finally, regardless of what kind of program you're writing, maybe it has produced an error that a user or administrator will need to deal with, and you would like to show it to them.

This activity is referred to in several different contexts depending on how the messages are delivered, but it is most commonly known as "logging".  It is critical to the operation of many, many different kinds of programs.  Unfortunately, it is one of the most poorly-understood and poorly-implemented areas of software in general.  Software is a veritable cornucopia of poorly-understood and poorly-implemented ideas, so that's really saying something.  You can see some of the more hilarious and visible examples of developers getting this wrong in the "Pop-Up Potpourri" series on the Daily WTF.

It might seem odd that I lump together funny dialog boxes with "logging".  A dialog box is a little square on your screen; a log message is some text in a file somewhere.  But they are very much the same thing, and they fail in very much the same way.  Log files just do it less visibly.

The point that I hope to communicate here is that for every producer of information, there is a consumer.  When most programmers need to produce a "log message", however, they are thinking only of getting the information out of their program in some format, any format; not how that information is going to be used later.

When I say "most programmers", I most definitely include myself.  I'm probably guiltier than most. one of the reasons I'm writing about this in the first place is to work out some better approaches to the problem.

Consider this output from the "tomboy" desktop sticky-notes program on Ubuntu Hardy.  If I start it from the command-line, I see this:
[DEBUG]: NoteManager created with note path "/home/glyph/.tomboy".
[INFO]: Initializing Mono.Addins
[DEBUG]: AddinManager.OnAddinLoaded: Tomboy.Tomboy
[DEBUG]:            Name: Tomboy.Tomboy,0.10
[DEBUG]:     Description:
[DEBUG]:       Namespace: Tomboy
[DEBUG]:         Enabled: True
[DEBUG]:            File: /usr/lib/tomboy/Tomboy.exe
[DEBUG]: Updating note XML to newest format...
It goes on for several hundred more lines just at startup, and continues to produce messages as the program runs.  These messages are diligently classified into categories: DEBUG and INFO.  I'm sure they're useful to someone.  But why am I seeing them?  I just wanted to start a program to put some sticky notes on my desktop, and none of this information is useful to that task.

I have to imagine that pretty much all of these messages are useful only to Tomboy's developers.  But, worse than the fact that I see them is the fact that if something really interesting happened — I discovered a critical bug, let's say — all of that log output which is being splatted onto my screen is going nowhere.  It is a book written on water.  (Well, a book written on video memory, which is pretty much the same thing.)  Meanwhile, thanks to the bug-reporting facilities in Ubuntu, I'm sure that I could opt to give the Tomboy developers a huge ton of mostly useless information, like the contents of my registers at the time that it crashed.

Consider not just the placement of the messages (on my screen, where I certainly don't care about them) but their formatting.  Who is that elaborate right-justification of labels in the "DEBUG" output for, anyway?  It isn't for me, I don't want to see these messages in the first place.  I doubt it helps the developers, either; rather than just grepping for '[DEBUG]: File', now they need to put in a regular expression to collapse whitespace, or count the number of spaces that the justification happens to put in.  Presumably if this output is useful at all, it is useful in a search.

Text Formatting and the Inevitable Descent into Log-Level Hell

The right-aligned pretty-printing is a beautiful illustration of a very common anti-pattern in logging: trying to convey structured information by messing with a textual format.  A developer wanting to write a message indicating that there is a problem with the program, left with the extremely narrow confines of a logging API which just takes a string, will often do something like this:
log("*****")
log("THIS SHOULD NEVER HAPPEN!  HELP!!!")
log("*****")
Of course, this frantic wording doesn't help the output go anywhere but silently into a log file where it will be ignored.  But, perhaps if this is some server software, an administrator will notice this message and set up an alert that makes their blackberry buzz when they notice those particular words show up in the log file so they can ssh in and look for problems.

Then the developer gets chastised by his manager for his un-informative error message, and updates it to be something clearer:
log("Serious Error: phase inducers have been depolarized.  Contact engineering immediately.")
Of course this breaks the administrators' alerts, so after much discussion between programmers and admins, log levels are added so that admins only get alerts when something "really bad" happens, where "really bad" is an agreed upon flag:
log2(SERIOUS_ERROR, "phase inducers have been depolarized.  Contact engineering immediately.")
Okay.  Now we've got a log level so admins can tell when their pagers should go off.  Except, different developers have different ideas about what "serious" means.
log2(SERIOUS_ERROR, "OMG I lost my cat Mittens.  Where is my cat?")
Clearly this is an abuse of the new "severity" flag that was added, but the cat-engineering team thinks that loss of a cat is pretty serious, so we add a new thing, a log "system".
log3(SYSTEM_CATS, SERIOUS_ERROR, "OMG I lost my cat Mittens.  Where is my cat?")
Most logging systems stop in this general vicinity, but we still haven't solved the problem, which is that the log message has no structure and you can't tell what's going on without groveling around in a bunch of text files with regular expressions or manually reading each message.  Which cat was lost?  Which phase inducer was depolarized?  How do we get from a log message or alert to this information?  The 'log levels' solution to this problem is clearly untenable:
logRidiculous(SYSTEM_CATS, ALERT_IF_YOU_LIKE_CATS, O_RLY, YA_RLY, SERIOUS_BUSINESS, BUT_NOT_TOO_SERIOUS, CAT_LEVEL("Mittens"), "OMG I lost my cat Mittens.  Where is my cat?")
More importantly, if you're writing a library, you have a bunch of other problems.  This diagnostic information needs to be logged somewhere, but what if this library is being used on a user's desktop machine?  Some of these messages are relevant to them as well.  How do you tell the user who is using a GUI that a cat has been lost?  How do you show them the picture of Mittens so they will recognize her if they see her?

Everyone agrees that log messages need some "small amount" of information associated with them, but very few people can agree on what that information should be.  Even at the simplest layer, the idea of a "level", there are lots of open questions.  Is the "debug" level for a programmer trying to debug something on their test rig, or is it for administrators trying to debug something in production?  Should there be a difference between those two things?  How serious does a problem have to be before warranting a "critical" classification?

Once you're using logging code written by more than one programmer, or worse yet, more than one team, you're going to be facing this problem.

The Particular Problem of Libraries

This is, of course, my main interest, since this is where the rubber meets the road for Twisted.  Libraries need to communicate to several different audiences:
  1. We need to tell developers using the library about the correct way to use the library at runtime.
  2. We need to tell administrators of systems using the library about the status of the library and tasks they may need to perform to keep it functioning well.  (Clear your caches, restart the server, install a security update...)  We also need to provide administrators with information they can mine for statistics about how the library is performing; how many requests handled, where its resources are going, etc.
  3. We need to notify users of applications using the library about things that the library is doing which may be relevant to them.  (A new message has arrived, a new printer is available... obviously this depends heavily on what the library does.)
Libraries, especially event-driven engines such as Twisted, libevent, and glib, have a particularly difficult time because they have to deal with all of these audiences simultaneously.  However, I think that any application or server which needs to do some kind of logging or user notification needs a subset of these features, so if any logging system could solve this problem, it could solve pretty much all logging problems.

Type of Information by Type of Audience

Developers, Developers, Developers

Many languages don't have a solution to developer communication at all.  Python has one — the warnings module — but it is in many ways inadequate.

The warnings module doesn't easily let you selectively see which libraries you want to see warnings for.  If I'm developing an application A using libraries M, N, and O, which themselves have dependencies on X, Y, and Z respectively, I don't want to see warnings that M caused in X or that O caused in Z; those are problems for the maintainers of M and O.

I am maintaining only A, so I want to see warnings caused by my application in M, N, and O.  I can try to filter specifically by module, but unfortunately the
only way of determining which library caused which issue is by directly
examining stack depth, which is unreliable at best and misleading at
worst.  Even if I could filter very accurately, it's hard to get a stand-alone report of warnings and deal with them as they're supposed to be.  Warnings show up to end-users as well, and to administrators looking at applications in production.  It's worth putting up with that to have at least some solution for communication with developers, but it would certainly be better if it didn't happen.

Finally, it's easy to generate a huge amount of warning noise (and, especially as of Python 2.6, many libraries do).  With that much noise and no reporting functionality it's hard to the warnings you care about.

A better solution for communicating for developers would be one which:
  • allowed developers to declare somewhere what code they are working on and what code they are just using
  • recorded relevant warnings to a log file which was optimized, perhaps with an associated tool, for locating and removing the sources of the warnings
  • allowed end-users to easily communicate their warning data to developers without inundating them with irrelevant noise while using the application

Administrators

To communicate with administrators there is a huge variety of options, but many of them depend on a lot of ad-hoc hackery by the admins themselves, which means they are inconsistent and therefore there is little reusable technology or standardized APIs available.

Right now the gold standard for talking to admins seems to be just writing strings into a text file and hoping they have some facility to read it.

A better solution for communicating with administrators would be one which:
  • preserved structured data in an analysis-friendly format, rather than formatting it in human readable messages. (For most UNIX admins, I imagine some kind of structured text would be best, so "grep" would still work but more advanced tools could also be brought to bear.  I'm not sure what the tools in the Windows world look like.  The "Event Viewer" looks like maybe it's a step in the right direction, but its UI is incredibly primitive.)
  • provided easily-accessible hooks for dispatching different types of events to ad-hoc code to wire up to existing notification systems - without significantly altering the behavior of the system doing the logging, if the logging hooks were broken, as admin-written code tends to be a bit flaky
  • included an enumerated list of events which administrators could inspect before they happened to run across them in log files
Although it's a crappy format in many ways, the Common Log Format for HTTP might serve as a good example.  Unfortunately it's too purpose-specific to extend to do more than what it already does, but lots of tools have been written to produce lots of interesting data from even that very simple standard.

Desktop Users

There are two popluar cases for communicating with end-users.  One case is that you're actually running a program on their desktop and you want to tell them something.  Another is that you've got some code running in a web application which wants to tell them something "out of bounds".

On the desktop, there are fairly standard "notification" APIs for popping up little bubbles.  On the web, there are emerging conventions for these notifications, like a bar that descends from the top of the page to mimic the firefox 'do you want to remember this password?' UI element.  A good example of this is Stack Overflow's notification banner.

Unfortunately both of these have a problem with scale and with timing.  If your application suddenly encounters a large number of errors, it will flood the user's screen.  If the user isn't present when a notification occurs (or navigates away), the bubbles or banners may disappear.

A better way to talk to end-users would be:
  • for desktop applications, a mini-email interface, which records notifications in a scrolling list so that users can inspect notifications that occur while they're away.
  • for web applications, a standard API so that multiple applications on a single site (or even, potentially, on different sites) can drop notifications into a queue which can be displayed appropriately.  (Since websites tend to have strong preferences to control their own design, an actual standard widget might not be possible, but it would be nice.)
  • for both of these, a standard protocol which would enable notifications to be easily streamed to different computers or mobile devices without needing to reconfigure
  • a specific classification of messages at the API level, saying, "I want to tell the end-user about this".  Messages about crashes, etc, should be displayed as an option to send information to the developer.  In the context of a web application this can be done automatically and silently; in a desktop app there would need to be some channel set up for sending that information.
Let's not forget that administrators are users too.  Everything that happens in a server's notification APIs should be able to be trivially filtered and redirected to administrator's desktop machine (or their phone) so they can immediately notice when something has gone wrong.

Appendix A: Optimization and Dynamic Instrumentation

This doesn't fit into the "figure out what you need to say and who you need to say it to" theme of the rest of the requirements here, but it is nevertheless important.  If you make heavy use of a logging system, especially one where you have lots of messages that are logged "just in case" and rarely displayed, you will quickly discover that it's consuming a lot of resources.

The work to calculate what goes into a log message shouldn't be done unless the message is actually going to be emitted.  Similarly, it should be easy to dynamically add and remove log events from particular methods or particular objects without modifying their code.  The best way to do this is to keep the logging entirely separate and add it at the level of methods and functions, rather than ad-hoc in the middle of application code.  Sometimes, of course, you want to emit log messages at a very particular point, but in general it should be easy to add instrumentation to a method without modifying its code, to avoid cluttering up application logic with lots of "just in case" debug messages.

What Can Do This?

I'm not aware of any logging system that can already do these things.  We'd have to write a new one.  This essay was largely composed due to my desire to understand what I thought a "good" logging system would do.  It might be too ambitious.

There are a few fundamental units which are missing from most logging systems.  While lots of logging systems have various ways to indicate subtle and nuanced levels of urgency, few have a way to indicate who a message might be relevant to.  Logging systems also generally don't have a good way to associate structured information along with a log message.

Twisted's logging mechanism does have a free-form dictionary associated with each message, but that's not much help unless you impose your own structure on it, which means you have to build all this infrastructure anyway.  It is, at least, possible to treat the Twisted system as a kernel which this could be based upon.

In order to produce an enumeration of events before the events are actually logged, this API will require pre-declaration of log messages.  This might be too much of a burden, since sometimes you want to just stick a log message into the middle of some code so that you can see if it happens.  So, in practice, it's more likely that pre-declaration will need to be optional and you'll need to be able to associate ad-hoc data and have it still be persisted along with your message.

There will need to be some way for communicating both the structured data of an event and the human-readable text associated with that event — preferably in a way which can be internationalized.

There's also a bunch of UI, web, and protocol standardization work that would need to be done.  Luckly, that's independent of the actual log machinery; if it already existed it would be a trivial matter to hook it up.  In the meanwhile, something that did all of this but just used existing facilities, like the desktop notification spec, status icons and email, would still be immensely useful.