Oh <what> a.tangled {web, we} WEAVE FROM

The always entertaining Jacob Kaplan-Moss recently posted a missive, "Snakes on the Web", which, if you haven't already read it, is a highly edifying trip through a variety of Python web technologies and history.  He begins with a simple statement — "Web development sucks." — and goes on to ask a number of interesting questions about that.

What sucks about web development?  How will we fix it?  How has Python fixed it, and how will Python fix it in the future?  While I can't say I agree with every answer, I found myself nodding quite a bit, and he has something useful to say on just about every point.

I noticed one very important question he leaves out of the mix, though, which seems more fundamental than the others: why does web development suck?  In particular, why do so many people who are familiar with multiple styles of development feel like developing for the web is particularly painful by comparison, while so much of software development moves to the web?  And, why does web development in Python suck, despite the fact that otherwise, Python mostly rocks?

Programming for the web lacks an important component, one that Fred Brooks identified as crucial for all software as early as 1975: conceptual integrity.  Put more simply, it is difficult to make sense of "web" programs.  They're difficult to read, difficult to write and difficult to modify, because none of the pieces fits together in a way which can be understood using a simple conceptual model.

Rather than approach this head on, from the perspective of a working web programmer, let's start earlier than that.  Let's say someone approached you with a simple programming task: write an accounting system that includes point-of-sale software to run a small business.  Now, considering some imagined requirements for such a system, how many languages would you recommend that it be written in?

Most working programmers would usually say "one" without a second thought.  A too-clever-by-half language nerd might instead answer "two, a general-purpose programming language for most things and a domain specific language to describe accounting rules and promotions for the business".  Why this number?  Simply put, there's no reason to use more, and introducing additional languages means mastering additional skills and becoming familiar with additional quirks, all of which add to initial development time and maintenance overhead.  Modern programming languages are powerful enough to perform lots of different types of tasks, and are portable across both different computer architectures and different operating systems, so other concerns rarely intrude.

But, in the practical, working programmer's world, what's the web's answer to this question?  Six.  You have to learn six languages to work on the web:
  1. HTML.  This isn't really a programming language, but in web development you do end up reading and writing quite a lot of it.
  2. CSS.  In order to apply visual styles to your HTML so that it actually looks nice in a browser, you need to understand a different language (with a different conceptual model for how documents are laid out than the HTML itself).
  3. JavaScript.  In today's competitive AJAX-y world, you need to be able to react instantly in the browser, writing a real client application.
  4. SQL, so that you can store your data in a database.
  5. Your "middle-tier" language: in my case and Jacob's, that would be Python.  This is where people tend to spend the bulk of their programming time, but not all of it.
  6. A templating language; in Jacob's case, the Django template language.
If you're unlucky, you might need to learn XML, more than one back-end language, a deployment language (UNIX shell scripting or Windows's "batch" language), and ActionScript.  You'll probably need to learn a smattering of some awful web-server configuration language though, like the not-quite-XML-not-quite-HTML used to configure Apache.

Of course, Jacob lists a pile of related technologies too, and rightly points out that it's a lot to keep in your head.  But he is talking about a problem of needing extensive technical knowledge, something which all programmers working in a particular technology ecosystem learn sooner or later.  I'm talking about a different, more fundamental problem: in addition to the surface problem of being complex and often broken, these technologies are fundamentally conceptually incompatible, which leads to a whole host of other problems.  Furthermore, the only component which is really complete is the "middle-tier" language, although bespoke web-only languages like PHP and Arc manage to screw that up too.

Here are a few simple example problems that are made depressingly complex by the impedance mismatch between two of these components, but which are incredibly easy using a different paradigm.

How do you place two boxes with text in them side-by-side?  Using a GUI toolkit, like my favorite PyGTK, it often goes something like this:
from gtk import HBox, Label  # PyGTK

left = Label("some text")
right = Label("some other text")
box = HBox()    # a horizontal container
box.add(left)   # children are laid out left to right
box.add(right)
The conceptual model here is simple: the HBox() is a container, and the "left" and "right" things are widgets in that container.  You can add them, remove them, swap them, or handle events on them easily.  You can discover how these things are done by reading the API references for the appropriate classes of object.

However, there's no right answer to this question on the web.  You can use a <table> tag, and then some <tr>s and <td>s to make a single-row table with two cells, but that has a variety of limitations; plus, it's considered somehow gauche by most web designers to use tables for layout these days.  Or, you could cook up a collection of CSS classes.

So there's the first impedance mismatch: do you do layout in HTML, or CSS?  Of course most design gurus would like to tell you that "always and only CSS" is the right answer here, but more practically-minded web developers who actually write code will often prefer HTML, partly because it's simpler and partly because CSS's feature set is incomplete: there are some things you can still only do with HTML, or only do portably with HTML.

Plus, how do you discover how these layouts work?  There are a variety of reference materials, but no canonical guide that says "this is exactly what a <table> tag should do, and how it should look".  There are different forms of documentation for both.

If you have a variable number of elements, you quickly run into another problem.  Should this be the responsibility of the HTML, the CSS, or some code (in the templating layer) that emits some HTML or some CSS?  Should the code in the templating layer be written as an invocation of your middle-tier language, or should the template language itself have some code in it?  Reasonable people of good conscience disagree with each other in every possible way over every one of these details.

This is all part of a very complex problem though.  For all of these crazy hoops you have to jump through, HTML and CSS do provide a layout model that allows you to do some very pretty and very flexible things with layout, especially if you have large amounts of text.  Perhaps not as good as even the most basic pre-press layout engine, but still better than the built-in stuff that most GUI toolkits allow you.  So there is an argument that this complexity is a trade-off, where you get functionality in exchange for the confusion.  So let's look at a much simpler problem.

Let's say that, in our hypothetical accounting application, you have a list of items in a retail transaction, and you want to process the list and produce a sum.  Where is the right place to do that?  It turns out you have to write the code to do that three times.

First, you have to write it in JavaScript.  After all, the numbers are all already in the client / browser, and you want to update the page instantaneously, not wait for some potentially heavily-loaded server to get back to you each time the user presses a keystroke.  And why not?  You've got plenty of processing power available on the client.

Then you have to write it in Python.  That's where the real brain of the application lives, after all, and if you're going to do something like send a job to a receipt printer or email a customer or sales representative some information in response to a sale, the number has to be located in the middle tier.

Finally you have to do it in SQL.  Since this is a traditional web application, your Python code is going to be spread out among multiple servers, and the database is the ultimate arbiter of recorded truth.  So you need to have transactions around the appropriate points and execute any interesting aggregate functions (such as SUM()) in the database tier.

So, you've got three times as much work to do in your fancy new web application as you would in a simple record-based application with a GUI.  A worthy price to pay to run in the brave new world of tomorrow rather than on some crusty old client/server system, right?

Well, as it turns out, the problem is somewhat deeper than that.  It turns out that JavaScript, Python, and SQL actually have slightly different numerical models (in fact Python implements at least 4 itself: fixed-point decimal, floating-point decimal, IEEE 754 floating-point binary, and integer math; you should really only use decimal for money, but this isn't available in JavaScript and its availability in SQL is spotty).  After applying some discounts, your register might read $19.74 but your receipt will read $19.75; and the reports sent to the accounting department will read $19.74898989898989.
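
To make the mismatch concrete, here is a minimal Python sketch that totals the same line items two ways: once with binary floating point (the only numeric model JavaScript gives you) and once with the standard library's decimal module.  The function names and the sample prices are made up for illustration; this is not code from any real point-of-sale system.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def total_as_float(prices):
    """Sum prices using IEEE 754 binary floating point."""
    return sum(float(p) for p in prices)


def total_as_decimal(prices):
    """Sum prices using decimal arithmetic, rounded to whole cents."""
    total = sum((Decimal(p) for p in prices), Decimal("0"))
    return total.quantize(CENT, rounding=ROUND_HALF_UP)


prices = ["0.10", "0.10", "0.10"]  # hypothetical line items

print(total_as_float(prices) == 0.30)               # False
print(total_as_decimal(prices) == Decimal("0.30"))  # True

# Rounding to cents diverges too: the float closest to 2.675 is slightly
# below it, so it rounds down, while the decimal value rounds up.
print(round(2.675, 2))                                          # 2.67
print(Decimal("2.675").quantize(CENT, rounding=ROUND_HALF_UP))  # 2.68
```

Neither answer is "wrong" by the rules of its own numerical model, which is exactly why three tiers using three models can each report a different total.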

Even if you know a lot about math on computers and the limitations of each of these runtimes, and you happen to get all of that just right, you still have another problem to contend with: what happens when somebody else needs to change the logic in question?  How do you test that the Python, the JavaScript, and the SQL are all still in sync?  It's possible, but you have to go above and beyond the usual discipline of test-driven development, because you need to have integration tests that verify that different, almost unrelated code, in different languages, in different environments is all executing properly in lock-step.  Just getting the code from SQL and JavaScript to run in your Python test suite at all is a major challenge; in a language like PHP it's borderline impossible.
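
As a rough sketch of what one such lock-step check might look like, the following compares a Python total against a SQL SUM() using the sqlite3 module from the standard library.  The table layout, the function names, and the choice to store amounts as integer cents are all assumptions for the sake of the example, and it only covers two of the three tiers; the JavaScript copy would still need its own harness.

```python
import sqlite3


def python_total_cents(items):
    """Middle-tier version of the sum: plain Python integer math."""
    return sum(cents for _name, cents in items)


def sql_total_cents(items):
    """Database version of the sum: SUM() over an in-memory table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE line_item (name TEXT, cents INTEGER)")
        conn.executemany("INSERT INTO line_item VALUES (?, ?)", items)
        # COALESCE turns the NULL that SUM() gives for no rows into 0.
        (total,) = conn.execute(
            "SELECT COALESCE(SUM(cents), 0) FROM line_item"
        ).fetchone()
        return total
    finally:
        conn.close()


def totals_in_sync(items):
    """True when both tiers agree on the total for these items."""
    return python_total_cents(items) == sql_total_cents(items)


sale = [("widget", 1999), ("gadget", 575)]  # hypothetical transaction
print(totals_in_sync(sale))  # True
```

Even this toy version needs a database connection inside the test, which hints at how much machinery the real thing requires.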

This is all even worse when it comes to security, because every part of the application exposes an attack surface, and because you can't use the same language or the same libraries to do any of the work, they all expose a different attack surface.

In his talk, Jacob notes that "frameworks suck at inter-op", but the problem is much deeper than that.  As I've shown here, a single page from a single application written using a single framework, which has only one task to do, can't even inter-operate with itself cleanly, at least not at the level that Jacob wants — or that I want.  He says, "gateways aren't APIs", and he's right: the correct way to inter-operate is through well-defined APIs.  APIs can be discovered through a single, consistent process.  Their implementations can be debugged using a single set of development tools.

CSS isn't an API.  HTML isn't an API.  Strings containing a hodgepodge of SQL and data aren't an API either.
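
To illustrate that last point, here is a small sketch, again using Python's built-in sqlite3 module, of what happens when data is pasted into a SQL string versus passed alongside it as a parameter.  The table, the function names, and the hostile input are all hypothetical.

```python
import sqlite3

# Input crafted so that, pasted into a query, it changes the query's meaning.
HOSTILE = "nobody' OR '1'='1"


def make_db():
    """Create a throwaway in-memory database with two customers."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (name TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_by_pasting(conn, name):
    """Hodgepodge approach: the data becomes part of the SQL text."""
    query = "SELECT name FROM customer WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


def find_by_parameter(conn, name):
    """Placeholder approach: the data travels separately from the SQL."""
    query = "SELECT name FROM customer WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


conn = make_db()
print(sorted(find_by_pasting(conn, HOSTILE)))  # ['alice', 'bob']
print(find_by_parameter(conn, HOSTILE))        # []
```

The placeholder version is safer, but notice that the query itself is still an opaque string that none of your Python tools can inspect.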

It's not all doom and gloom, but my ideas for a future solution to this problem will have to wait for another post.

Threat 2: Attacks via E-Mail

Continuing my series on simple threat models for internet users, I'll now address the second threat I mentioned: threats via e-mail.

There are two kinds of e-mail attacks: direct attacks, and trojan horses.  First let's talk about direct attacks.

The basic idea behind a direct e-mail attack is that the program you use to read your e-mail might have flaws in it, which a specially-crafted message will exploit.  That message will have a program in it, and a mistake by the programmers who wrote your e-mail client will cause that program to be executed.

Unlike attacks from the outside, which you can very simply protect against by denying outside attackers access to your computer entirely, there is no fool-proof method to protect against this kind of threat.  E-mail formats are highly complex, and messages can contain multiple parts, including images, etc.  The code that decodes images is notoriously prone to security problems.  Even e-mail programs which don't process images are occasionally prone to security problems dealing with the structure of certain messages.
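
For the more technical readers, here is a small sketch of that structure using the email package from Python's standard library.  The addresses, the subject, and the attachment bytes are placeholders I made up; the point is only that one "message" is really a tree of parts, each of which needs its own decoder.

```python
from email.message import EmailMessage


def build_sample_message():
    """Build a message with a text body and one image attachment."""
    msg = EmailMessage()
    msg["From"] = "Bob Dobbs <bobdobbs@example.com>"
    msg["To"] = "you@example.com"
    msg["Subject"] = "Vacation photo"
    msg.set_content("Here's the picture I mentioned.")
    # Placeholder bytes; a real client would hand these to an image decoder.
    msg.add_attachment(
        b"not really a PNG",
        maintype="image",
        subtype="png",
        filename="photo.png",
    )
    return msg


def part_types(msg):
    """List the content type of the message and every part nested in it."""
    return [part.get_content_type() for part in msg.walk()]


print(part_types(build_sample_message()))
```

Every content type in that list is handled by different code in your e-mail program, and each of those handlers is a place where a flaw could hide.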

Chances are that you are going to want to read e-mail somewhere, and you probably want to be able to see images and download attachments; shutting off e-mail completely isn't really an option.  The more general advice I gave against the first threat still applies, though: keep all your software up to date, including your e-mail client.  People who make e-mail software take these kinds of threats very seriously and release updates very quickly when problems are discovered.

One way you can mitigate this risk, and reduce the amount of work required to keep up to date (and therefore the opportunity for you to forget to do so) is to use a web-based e-mail client like Gmail.  If you use Gmail, the potentially vulnerable program running on your computer is just your web browser, and you already need to keep your browser up to date for other reasons.  The code which deals with the structure of messages is all run on the server, and constantly kept up to date by the fine folks at Google.  Similarly, they take steps to protect your browser, stripping out harmful attachments and filtering spam for you so that potentially dangerous messages never reach you.

The much more common form of e-mail attack is easier to defend against, but is attacking something more potentially vulnerable than your e-mail software: it's attacking you.  A trojan horse is a program which doesn't do anything tricky to get itself run automatically, but instead elicits your cooperation in making it run.  Whether you run a web-based email client or the oldest, buggiest version of Microsoft Outlook, you are equally vulnerable to these kinds of attacks.

The key to defending yourself against a trojan horse is to understand what you are double-clicking on.  Look inside that trojan horse before you open it; there may be a bunch of armed Greeks inside.  Before you open any document or run any program that was attached to an e-mail, very carefully read the message that it came from.  Ask yourself a few questions:
  1. Were you expecting this message?
    If you weren't expecting the message, you should double-check to make sure.  In the best case, use some mechanism other than e-mail to check.  Give the sender a phone call.  Ask if they actually sent you the message in question.
  2. Is the message really from who it says it's from?
    It's very easy to fake e-mail addresses, so if you are used to receiving messages from Bob Dobbs and you see "From: Bob Dobbs <bobdobbs@example.com>", you shouldn't necessarily believe it.  Does the text of the message read like Bob wrote it?  Does Bob usually send you these kinds of attachments?  Is the "To" line correct?  Does he use your real name?  A lot of spam which includes viruses is very generic, but it is increasingly cleverly disguised as coming from people in your address book.
  3. Is an attachment trying to disguise itself?
    Sometimes, even messages you are expecting, from people that you know, will contain evil attachments.  If Bob's computer is infected with a virus, he may well have actually legitimately written you the message but a trojan horse packed itself along for the ride.  In this case, you need to see if the attachment is trying to look like something different than it is.  Does the file's name have multiple extensions?  For example, "business-plan.doc" is a Word document, but "business-plan.doc.exe" is an executable program, with its name changed to pretend to be a Word document to fool you.
  4. Is anything trying to warn you?
    Most browsers and operating systems these days will double-check with you before opening executables which you've downloaded.  If a box pops up saying "Are you sure you want to do that?", don't just click past it immediately; read it completely and try to understand what it's telling you.  Even if you don't understand a word, pausing for a moment to reflect on whether the warning is serious or not will often help you realize that something might be amiss.
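
For the programmers in the audience, the "multiple extensions" check from question 3 can be sketched in a few lines of Python.  The list of executable extensions below is illustrative and far from complete, and the function name is my own invention; treat this as a way of thinking about the check, not as a virus scanner.

```python
from pathlib import PurePath

# A few extensions Windows will run as programs (illustrative, not exhaustive).
EXECUTABLE_SUFFIXES = {".exe", ".scr", ".bat", ".com", ".pif"}


def looks_disguised(filename):
    """True if the name ends in an executable extension hidden behind another."""
    suffixes = [s.lower() for s in PurePath(filename).suffixes]
    return len(suffixes) >= 2 and suffixes[-1] in EXECUTABLE_SUFFIXES


print(looks_disguised("business-plan.doc"))      # False
print(looks_disguised("business-plan.doc.exe"))  # True
```

A file named plainly "setup.exe" is not flagged here, because it isn't pretending to be anything else; whether you should run it is a separate question.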
If you're careful and look for details which seem out of place, you don't need to be an expert to spot e-mails that look wrong.  The most basic task here is to recognize genuine human communication, and not to scan for any particular technical trick.  That's not all, of course; as I mentioned, there are ways that programs can hijack legitimate communications, but these are much more sophisticated, and much rarer than the much more common type of message, which is one that simply says "hey buddy, click this" and expects you to click on it without thinking.  If you can recognize those you will be safe 99% of the time.

In using the Internet, this is a generally useful skill, and particularly important when it comes to security.  It will be particularly useful when I discuss threat #4, phishing attacks.


Goodbye, Divmod. Hello, World!

At the end of this month, Divmod will lay off its last employee and cease to be.

As some of you know, I've been on hiatus for several months now.  The idea was originally that I would take a break, allow the company to build up a small operating buffer to deal with our cash-flow issues, and heal a psyche damaged by many months of intense stress (caused largely by those same cash-flow issues).

The psyche-healing worked out okay.  I'm feeling much better than I was when my break started.  The cash-flow issues, not so much.  The reality turned out to be that much of the new consulting business we were counting on just didn't materialize.  We managed to get quite a bit of maintenance done on our infrastructure — I continued to help out intermittently, interleaving some reviews and bugfixes with hobby projects — but it was no longer really clear what business purpose that infrastructure was serving.  We didn't have any product that generated a revenue stream and we certainly didn't have the resources to build a new one.

Users of Divmod email: I'm not exactly sure what the plan is, but JP and I will personally make sure that you can get your email in some form and we'll work out some way to keep at least a forwarding service running.

Users of Divmod open source projects: we will figure out some way to continue to host and maintain the code.  I'm not sure what we're going to do about official stewardship, but it was years before Twisted needed any official legal structure, so I'm sure we'll make do.

The Divmod Fan Club, which deposits money into my personal PayPal account rather than a business one (for stupid technical reasons which are now extremely convenient), is generating enough money that we may be able to afford some hosting, assuming those of you who supported Divmod-the-company would like to continue supporting Divmod-the-ambiguously-defined-collection-of-open-source-projects.  Regardless of whether you decide to cancel your subscriptions now (you can do so in the UI for your PayPal account; nothing to do with us, happily), thank you all, very much.  You enabled us to do a lot more with our open-source work than we would otherwise have been able to, and you helped us get through a number of crunches in the past.

The fan club might enable us to host the collection of open source projects, and possibly also host versions of Mantissa and Quotient, and Sine.  I think that having some users would help keep those projects alive in the absence of a corporate sponsor.  I'm not really sure what's going to happen to Blendix, though, and as a proprietary thing it requires more thinking.  If you care deeply about it, please get in touch with me.  Also, if you are a member of the Divmod community who might like to help out with administration, we might need help with mundane things like keeping our Trac instance running.

Now, on to the more personal stuff.

Thanks in advance for your condolences, but I'm feeling okay about this.  Not to say that I don't wish Divmod had ended with more success, but I spoke to Amir and JP yesterday, and we all agreed: it's time to move on.  We tried everything we could think of.  It's time to do something different.

More importantly, I'm not really sure what I'm going to do next.

Right now I'm considering a few things.  I have a couple of job offers, and I have a few ideas for new businesses that I might want to start myself.  Some of those ideas are things I would bootstrap myself; some would require funding.

Some of you reading this right now have intimated that you'd like to offer me a job, if I were available.  Some have speculated that you might want to fund some other company that was less ambitious than Divmod.  Well, now's your chance.  Get in touch, and let's talk.

If you can, please do it soon, though.  Some of the offers I'm already considering need a decision soon, but I'd really like an opportunity to consider my options before I jump into the next thing.


Threat 1: Attacks From The Outside

This article continues my series on my personal threat model for the internet.  In this article, I'm going to talk about the threat of automated attacks coming in to your computer over the internet, while it is connected to the internet.

The basic problem underlying this threat is the same as that underlying threats #2 (malicious e-mail messages which attack your e-mail program) and #3 (malicious web pages which attack your web browser): the software you are running on your computer, which you need to do your job, play your games, or otherwise get value out of your computer, is full of bugs.  Some of those bugs are security problems.  The most dangerous type of security problem is one that allows some data which a program is reading, which is supposed to just be processed by the program, to overwrite portions of that program's memory such that it takes over the program.  That data is then itself a program, and can take over your computer.  Unfortunately, this type of problem is very common.
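
For readers who program, here is a toy simulation of that kind of flaw.  Python itself won't let you overwrite memory this way, so the "memory" below is just a byte array I've laid out by hand, and every name in it is invented; real attacks target compiled programs and are considerably messier.

```python
BUFFER_SIZE = 8  # bytes 0-7 hold incoming data


def make_memory():
    """Lay out 16 bytes: an input buffer, then the routine to run next."""
    memory = bytearray(16)
    memory[8:16] = b"print_ok"  # bytes 8-15 are part of the program
    return memory


def unchecked_copy(memory, data):
    """The bug: copies the data without checking that it fits the buffer."""
    for offset, byte in enumerate(data):
        memory[offset] = byte


def checked_copy(memory, data):
    """The fix: refuse data that is larger than the buffer."""
    if len(data) > BUFFER_SIZE:
        raise ValueError("data does not fit in the buffer")
    unchecked_copy(memory, data)


def next_routine(memory):
    """Read back which routine the program will run next."""
    return bytes(memory[8:16]).decode("ascii")


memory = make_memory()
unchecked_copy(memory, b"AAAAAAAA" + b"run_evil")  # 16 bytes into an 8-byte buffer
print(next_routine(memory))  # run_evil
```

The incoming data was only supposed to be processed, but because it spilled past the buffer, it ended up choosing what the program does next.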

The first thing you need to do to protect against these threats is to regularly install security updates for your computer.  On Windows you can do this by using Automatic Updates; on Mac OS X it will be done for you by Software Update; and on Ubuntu, by Update Manager.

When updates are available, make sure to install them as soon as you can!  By the time an update is available, the problem that the update is intended to fix has often been made public already.  The publication of the problem allows the update to be created in the first place, but it also allows malicious individuals to create attacks from it.  The longer you wait, the longer you are vulnerable to problems which have been made public, and thus can be exploited by the largest population of attackers.

However, even if all of your software is fully up-to-date, it still isn't perfect.  The general strategy for dealing with this type of problem, then, is to make sure that only data from sources you trust will ever be allowed into that software.  This limits your exposure to attacks.

In later posts I'll talk about limiting your exposure to malicious data that you have specifically requested, but right now I'm just going to talk about preventing unsolicited data getting to your computer directly over the internet.  The best way to do this is to get a commodity hardware router, and put it between your computer and the internet.  Devices such as this are made by vendors such as Linksys, Belkin, Buffalo, or Netgear.

You don't need to get a router with fancy "security" features like an "SPI firewall" or "intrusion detection".  In my opinion these features don't add a lot - in fact, they will often cause difficult-to-diagnose problems for home users.  Of course, the people who sell these devices love to put the word "security" on the box as many times as possible, but you really only need the most basic security feature, and that's the one that isn't really a "security" feature at all.

The basic feature that a router adds is a separate layer of protection, independent from anything you can do to your computer itself.  If your home computer is hooked up directly to the internet, it looks like this:

[Diagram: your computer connected to the internet directly through your modem.]

That is, whenever your computer tries to contact another computer on the internet, it sends a request directly via your modem.  Whenever another computer tries to connect to you, it goes directly to your computer.  This means that if there are programs that you don't know about, which your operating system vendor, or some application has left running on your computer, anyone on the internet will be able to access them.

If those programs were all perfectly secure, that would be fine.  Unfortunately, programmers make mistakes, and mistakes lead to bugs, and bugs sometimes lead to security problems.

When you have a router, the picture looks more like this:

[Diagram: a router sitting between your computer and the internet.]

which is to say, when your computer submits a request to another computer on the internet, the router sees that the request is coming from inside the network, and transparently forwards it to the outside, establishing a channel of communication.  However, when another computer tries to talk to the IP address that your ISP gives you, the device they find is the router.  The router itself is a very simple device, and, unless you've done something unusual to it, will never be running any programs beyond the ones necessary to move traffic between your network and the internet.  Because one of the functions of a router is to allow multiple computers on your home network, when connections come in from the internet, the router doesn't know which computer it should go to, even if you only have one.  So the incoming connection will be refused, never having a chance to get to your computer.
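
If it helps to see that behavior as code, here is a deliberately oversimplified model of the router's bookkeeping.  A real router tracks ports and protocols rather than just host names, and the class and host names here are invented for the example.

```python
class ToyRouter:
    """Remembers which inside computer started each conversation."""

    def __init__(self):
        # Maps a remote host to the inside computer that contacted it.
        self._conversations = {}

    def outbound(self, inside_host, remote_host):
        """A request from inside: note who asked, then let it through."""
        self._conversations[remote_host] = inside_host
        return "forwarded"

    def inbound(self, remote_host):
        """Traffic from outside: deliver only if someone inside asked first."""
        # None means the router has nowhere to send it, so it is refused.
        return self._conversations.get(remote_host)


router = ToyRouter()
router.outbound("my-laptop", "www.example.com")

print(router.inbound("www.example.com"))   # my-laptop
print(router.inbound("attacker.example"))  # None
```

Notice that nothing in this model is a "security" feature as such; the refusal falls out of the router simply not knowing where unsolicited traffic should go.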

This is preferable to running "firewall" software on your computer, for two reasons:
  1. Firewall software is still running on your computer, and thus on your operating system.  If your operating system itself has a flaw in it, the firewall can't protect you.
  2. Software which listens for incoming connections is doing so for a reason.  Different components of the same program will sometimes communicate with each other over a network connection internal to the same computer - as a user of those programs, you really shouldn't need to know this.  Firewall software will present you with prompts to allow or deny permission for programs: these prompts often boil down to "do you want this to work?"  If you say yes, your computer will be exposed to a potential threat; if you say no, the program will break.
Of course, if you've prevented other people's computers from accessing yours, there are some programs which will now be unable to connect to your computer.  BitTorrent, for example, is notorious for performing poorly if other users can't connect to you directly.  Certain voice-over-IP programs will also have problems.  To address this, you can add rules to your router to allow specific incoming connections, without opening the floodgates to everything.  This is referred to as "port forwarding", and portforward.com is a good resource.  If installing a router causes any problems with network applications that you use, consult their documentation: port-forwarding issues are usually prominently covered early on.
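
Conceptually, a port-forwarding rule is just an entry in a small lookup table on the router.  The sketch below assumes a made-up internal address and uses port 6881, which BitTorrent clients have traditionally used, purely as an example; your router's configuration screen will look nothing like this, but the decision it makes is the same.

```python
def route_incoming(port, forwarding_rules):
    """Return the inside (address, port) for an incoming connection, or None."""
    # Anything without an explicit rule is refused.
    return forwarding_rules.get(port)


# One hypothetical rule: send incoming port 6881 to a single inside computer.
rules = {6881: ("192.168.1.10", 6881)}

print(route_incoming(6881, rules))  # ('192.168.1.10', 6881)
print(route_incoming(445, rules))   # None
```

Each rule opens exactly one door to exactly one computer, which is why adding a few of them doesn't undo the protection described above.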

My Threat Model

As a "computer guy", I am sometimes called upon by friends and family to opine on what makes a computer or a network secure.  Many of my colleagues are in the same situation.  As a "networking guy", I get similar questions even from experienced "computer guys".

Users have very peculiar ideas about security.  Users — and I include myself in this grouping — will become confused even in areas of the computing experience where billions of dollars have been spent trying to make the experience as easy and comprehensible as possible.  So it stands to reason that users will often be confused in the area of security, by its nature the least usable and comprehensible area of computing.  Attacks are arcane, and, by definition, unexpected ways that software can be manipulated.  Yet, these attacks are very relevant to users, who want to understand what, exactly, they are vulnerable to and how to defend against it.

It's basically impossible to try to understand computer security this way, let alone explain it.

The important thing to remember in any security situation is this: what do you have of value, and what is the threat to it?  Computer security professionals call the answer to this question the "threat model".  Stephen Colbert calls it the ThreatDown.  No matter what you call it, it's important to enumerate the threats that you're defending against.  Any security measure that you take which is not designed to protect you from a threat which you can, at the very least, imagine and describe, is just extra cost.

In my case, people ask me about three broad classes of user:
  1. users who have networked computers in a home, and use them for checking email, browsing the web, online shopping, and games,
  2. users who have networked desktop computers in a business, and use them for email, web, and business applications, and
  3. users who have networked server computers that are running server applications.
These users all have roughly similar threat models, so I'm going to lump them together for the sake of simplicity, with a nod to a few specific situations.

I believe there are five major types of attacks which threaten average users on the internet today.
  1. Automated attacks that attempt to connect to your computer and exploit a flaw in its operating system or in software that is running a server, and install malicious software on your computer.
  2. E-mail attacks, which attempt to deliver a message which will exploit a flaw in your desktop e-mail client to install malicious software on your computer.
  3. Browser attacks, which attempt to get your browser (either with or without your consent) to visit a site which will exploit a flaw in your browser software to install malicious software on your computer.
  4. Phishing attacks, which attempt to convince you to disclose information about yourself, such as bank account numbers, passwords, or personal details that can be used to access those other things.
  5. Snooping attacks, which attempt to read information in transit between you and another computer.  Usually snooping attacks read passwords in an attempt to allow the attacker to impersonate you later.
Attacks 1-3 are all based on the same premise: software is flawed, and sometimes the flaws in it can be exploited to get it to do things that it should not do.  There are multiple resources under threat here: your computer itself (i.e. its processing power), your network connection, and the data stored on your computer.

Attacks 4 and 5 are in a different class.  They're attempting to get you to reveal information over the network, either with or without your knowledge.  The resource under threat here is the information you are transmitting - in most cases, the information being sought is a token which allows you access to some resource; anything from a username and password to your Facebook account (which allows for stealing your personal information or impersonating you) to a debit card number (which allows attackers access to the money in your bank account).

I have fairly simple ways to protect yourself against each of these types of attack.  In a series of follow-up articles, I'll cover each of those strategies.  They should cover a wide variety of attacks with a minimum of effort and cost.  Of course, these defenses aren't perfect.  It's possible that someone who knows much more about security than I do will correct me, but if so, that's so much the better.

More importantly, I will try to provide simple abstractions that allow you to reason about each type of attack without understanding the intricacies of the technology involved.  A major reason I've decided to try to write about this is that security vendors play upon the intuitive (and wrong) understanding that most people have about computer security: equating it with physical security, making their security widget the digital "lock" for the digital "house" of your computer.

I am targeting this series at a fairly nontechnical audience.  I realize that my audience here mostly rates pretty high on the nerd spectrum; my hope is that you will agree with what I say sufficiently that this will be a useful resource for you to refer your less technical friends and family.  To maintain your interest, however, I'll also be embedding some details about the reasoning behind my own security practices.  See you next time!

Update: I accidentally posted a draft of this rather than a final copy; some of the sentences and paragraphs were incomplete.  I hope that I've now corrected this.