The Web, Untangled

In my previous post, I outlined some reasons that web development is worse than other kinds of development (specifically: traditional client-server development).  I left off there saying that I had some prescriptions for the web's ailments, and now I'll describe those.  Given that we're stuck with web development for the foreseeable future, how can we make it a tolerable experience?

First, let me tell you what the answer isn't.  It isn't a continuation of the traditional "web framework" strategy.  These have been important tools in dredging the conceptual mire of the web for useful patterns, and at this point in history they have a long life ahead of them.  I'm not predicting the death of Django or Rails any time soon.  Django and Rails are the stucco of the web.  An important architectural innovation, to be sure: they let you cover over the materials underneath, allowing you to build structures that are appealing without fundamentally changing the necessarily ugly underpinnings.  But you can't build a skyscraper out of stucco.

As Jacob covered in great detail in his talk, innovations in the "framework" space generally involve building more and more abstractions, creating more and more new concepts to simplify the underlying concepts.  Eventually you run out of brain-space for new concepts, though, and you have to start over.

I started here by saying that we're stuck with the web.  If we can understand why we're stuck with the web, we can make it a pleasant place to be.  Of course everybody has their own ideas about what makes the web great, but it's important to remember that none of that is what makes the web necessary.

What makes the web necessary is very simple: a web browser is a Turing-complete execution environment, and everyone has one.  It's also got a feature-complete (if highly idiosyncratic) widget set, so you can display text, images, buttons, scrollbars, and menus, and compose your own widgets (sort of).  Most importantly, it executes code without prompting the user, which means the barrier to adoption of a new application is effectively zero.  Not to mention that, thanks to the huge ecosystem of existing applications, the user is probably already running a web browser.

I feel it's important to emphasize this point.  When developing an application, delivery is king.  It doesn't matter how great your application is if no users ever run it, and given how incredibly cheap in terms of user effort it is to run an application in a web browser, your application has to be really, really awesome to get them to do more work than clicking on a link.  I can't find the article, but I believe Three Rings once did an interview where they explained that some huge percentage of users (if I remember correctly, something like 90%) will leave immediately if you make them click on a "download" link to play the game, but they'll stick around if you can manage to keep it in the browser without making them download a plugin.

Improvements to ECMAScript and HTML sound fun, but if, tomorrow morning, somebody figured out how to securely execute x86 machine code in web browsers, and distribute that capability to every browser on the internet, developers would start using it almost immediately.  HTML-based applications would slowly die out, as their UIs would be comparatively slow, clunky, and limited.

Tools like the Google Web Toolkit (and Pyjamas, its Python clone) recognized this fact early on.  They treat the browser as what the browser should be: a dumb run-time.  A deployment target, not a development environment.  Seen in this light, it's possible to create layers for integration and inter-op above the complexity soup of DOM and JavaScript: despite the fact that the browser itself has no "linker" to speak of, and no direct support for library code, with GWT you get Java's library mechanism.
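
To make that concrete, here is roughly what the canonical Pyjamas hello-world looks like (quoted from memory, so treat the exact module paths as an assumption rather than gospel): you write ordinary Python against a widget API, and the toolchain compiles it into JavaScript that runs in the browser.

from pyjamas import Window
from pyjamas.ui.RootPanel import RootPanel
from pyjamas.ui.Button import Button

def greet(sender):
    # This runs in the browser, but it was written (and can be read) as plain Python.
    Window.alert("Hello from Python!")

if __name__ == '__main__':
    RootPanel().add(Button("Click me", greet))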

Although it's not particularly well-maintained, PyPy also has a JavaScript back-end, which allows you to run a restricted subset of Python ("RPython") in a web browser; I hope that in the future this will be expanded to give us a more realistic, full-featured Python VM in the browser than Pyjamas' fairly simplistic translation currently provides.  In opposition to the "worrying trend" that Jacob noted, of individual applications needing to write new, custom run-times, these tools leverage an existing language ecosystem rather than inventing something new.

Using tools like these, you can write code in the same language client-side and server-side.  This simplifies testing.  You can at least get basic test coverage in one pass, in one runtime, even if some of that code will actually run in a different runtime later.  It simplifies implementation and maintenance, too.  You can write functions and then decide to run them somewhere else based on deployment, security, or performance concerns without necessarily rewriting them from scratch.
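
As a small, hypothetical sketch of what that looks like (the module and function names here are invented for illustration): a pure-Python pricing function with no DOM or database dependencies can be imported directly by the server code, exercised by an ordinary unit test, and handed to a Python-to-JavaScript translator for the client.

# pricing.py: plain Python with no browser or database dependencies.
def line_total(quantity, unit_price_cents, discount_percent=0):
    """Total for one line item, in integer cents."""
    total = quantity * unit_price_cents
    return total - (total * discount_percent) // 100

# The server imports this module directly; a translator like Pyjamas can turn
# the very same module into JavaScript for the browser.
assert line_total(3, 499, discount_percent=10) == 1348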

If toolkits like these gained more traction, it would go a long way towards interop, too.  It would be a lot easier to have an FFI between Python-in-the-browser and Java-in-the-browser than to try to wrangle every possible JavaScript hack in the book.  Similarly on the server side: once a few frameworks can standardize on rich client-server communication channels, it will be easier to have a high-level abstraction over those than over the mess of XmlHttpRequest and its various work-alikes.

There's still an important component missing, though.  Web applications almost always have 3 tiers.  I've already discussed what should happen on the first tier, the browser.  And, as GWT, NaCl and Pyjamas indicate, there are folks already hard at work on that.  The middle tier is basically already okay; server-side frameworks allow you to work with "business logic" in a fairly sane manner.  What about the database tier?

The most common complaint about the database tier is security.  Since half the time your middle tier needs to be generating strings of SQL to send to the database, there are a plethora of situations where an accidental side-channel is created, allowing users to directly access the database (the classic SQL injection attack).

This is a much more tractable problem than the front-end problem.  For one thing, a really well-written framework, one which doesn't encourage you to execute SQL directly, can comprehensively deal with the security issue.  Similarly, a good ORM will allow you complete access to the useful features of your database without forcing you to write code in two different programming languages.
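
The heart of that fix is refusing to build queries by gluing strings together.  Parameter binding, shown here with Python's built-in sqlite3 module purely as an illustration, keeps user input out of the SQL text entirely:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")

name = "Robert'); DROP TABLE users; --"   # hostile input from a web form

# Dangerous: interpolating the input into the SQL string itself.
#     conn.execute("SELECT email FROM users WHERE name = '%s'" % name)

# Safe: the driver binds the value as data, so it can never become SQL.
rows = conn.execute("SELECT email FROM users WHERE name = ?", (name,)).fetchall()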

Still, there's a huge amount of wasted effort on the database side of things.  Pretty much every major database system has a sophisticated permission system that nobody really uses.  If you want to write stored procedures, triggers, or constraints in a language like Python, it is at worst impossible and at best completely non-standard and very confusing.  Finally, if you want to test anything... you're not entirely on your own, but it's going to be quite a bit harder than testing your middle-tier code.

One part of the solution to this problem comes, oddly enough, from Microsoft: LINQ, the Language Integrated Query component, provides a standard syntax and runtime model for queries executed in multiple different languages.  More than providing a nice wrapper over database queries, it allows you to use the same query over in-memory objects with no "database engine" to speak of.  So you can write and test your LINQ code in such a way that you don't need to talk to a database.  When you hook it up to a database, your application code doesn't even really need to know.
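
Python has nothing quite like LINQ, but the underlying idea translates: write the query as ordinary code over plain in-memory objects, test it there, and only later point it at a real database.  A hypothetical sketch, not any existing library's API:

# A "query" expressed as ordinary Python over any iterable of (sku, price) pairs.
def expensive_items(items, threshold):
    return sorted(sku for sku, price in items if price > threshold)

# Tested against plain in-memory data, with no database engine involved.
assert expensive_items([("apple", 3), ("tv", 499)], 100) == ["tv"]

# Later the same function can consume rows from a database cursor; the caller
# doesn't need to know where the iterable came from.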

The other part of the solution comes from SQLite.  Right now, managing the deployment of and connection to a database is a hugely complex problem.  You have to install the software, write some config files, initialize your database, grant permissions to your application user, somehow get credentials from the database to the application, connect from the application to the database, and verify that the database's schema is the same as what the application expects.  And that's before you can even do anything!  Once you're up and running, you need to manage upgrades, schedule downtime for updating the database software (independently of upgrading the application software).  Note that the database can't be a complete solution for the application's persistence needs, either, because in order to tell the application where it needs to find the rest of its data, you need, at the very least, a hostname, username, and password for the database server.

All of this makes testing more difficult - with all those manual steps, how can you really know if your production configuration is the same as your test configuration?  It also makes development more difficult: if automatically spinning up a new database instance is hard, then you end up with a slightly-nonstandard manual database setup for each developer.  With SQLite, you can just say "database, please!" from your application code, specifying all the interesting configuration right there.
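
That "database, please!" moment really is about this small; a minimal sketch, with an arbitrarily chosen file name:

import sqlite3

# No server process, no credentials, no config files: the file is created on
# first use, right next to the application that owns it.
conn = sqlite3.connect("accounting.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (item TEXT, cents INTEGER)")
conn.commit()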

Finally, SQLite allows you to very easily write stored procedures and triggers in your "native" language.  You also don't need them quite as much, because your application can much more easily completely control access to its database, but if you want to work in the relational model it's fairly simple.  The stored procedures are just in memory, and are called like regular functions, not in an obscure embedded database environment.
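
For instance, Python's sqlite3 module will register an ordinary Python function so that SQL can call it, which is about as close to "a stored procedure in your native language" as it gets; a minimal sketch:

import sqlite3

def sales_tax(cents):
    # Plain Python: debuggable, profilable, and testable like any other function.
    return cents * 8 // 100

conn = sqlite3.connect(":memory:")
conn.create_function("sales_tax", 1, sales_tax)
conn.execute("CREATE TABLE sales (item TEXT, cents INTEGER)")
conn.execute("INSERT INTO sales VALUES ('widget', 1999)")
print(conn.execute("SELECT item, sales_tax(cents) FROM sales").fetchall())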

In other words, for modern web applications, a database engine is really just a library.  The easier it is to treat it like one, the easier it is to deploy and manage your application.

In the framework of the future, I believe you'll be able to write UI code in Python, model code in Python, and data-manipulation code in Python.  When you need to round a number to two digits, you'll just round it off, and it'll come out right.

Oh <what> a.tangled {web, we} WEAVE FROM

The always entertaining Jacob Kaplan-Moss recently posted a missive, "Snakes on the Web", which, if you haven't already read it, is a highly edifying trip through a variety of Python web technologies and history.  He begins with a simple statement — "Web development sucks." — and goes on to ask a number of interesting questions about that.

What sucks about web development?  How will we fix it?  How has Python fixed it, and how will Python fix it in the future?  While I can't say I agree with every answer, I found myself nodding quite a bit, and he has something useful to say on just about every point.

I noticed one very important question he leaves out of the mix, though, which seems more fundamental than the others: why does web development suck?  In particular, why do so many people who are familiar with multiple styles of development feel like developing for the web is particularly painful by comparison, while so much of software development moves to the web?  And, why does web development in Python suck, despite the fact that otherwise, Python mostly rocks?

Programming for the web lacks an important component, one that Fred Brooks identified as crucial for all software as early as 1975: conceptual integrity.  Put more simply, it is difficult to make sense of "web" programs.  They're difficult to read, difficult to write and difficult to modify, because none of the pieces fits together in a way which can be understood using a simple conceptual model.

Rather than approach this head on, from the perspective of a working web programmer, let's start earlier than that.  Let's say someone approached you with a simple programming task: write an accounting system that includes point-of-sale software to run a small business.  Now, considering some imagined requirements for such a system, how many languages would you recommend that it be written in?

Most working programmers would usually say "one" without a second thought.  A too-clever-by-half language nerd might instead answer "two, a general-purpose programming language for most things and a domain specific language to describe accounting rules and promotions for the business".  Why this number?  Simply put, there's no reason to use more, and introducing additional languages means mastering additional skills and becoming familiar with additional quirks, all of which add to initial development time and maintenance overhead.  Modern programming languages are powerful enough to perform lots of different types of tasks, and are portable across both different computer architectures and different operating systems, so other concerns rarely intrude.

But, in the practical, working programmer's world, what's the web's answer to this question?  Six.  You have to learn six languages to work on the web:
  1. HTML.  This isn't really a programming language, but in web development you do end up reading and writing quite a lot of it.
  2. CSS.  In order to apply visual styles to your HTML so that it actually looks nice in a browser, you need to understand a different language (with a different conceptual model for how documents are laid out than the HTML itself).
  3. JavaScript.  In today's competitive AJAX-y world, you need to be able to react instantly in the browser, writing a real client application.
  4. SQL, so that you can store your data in a database.
  5. Your "middle-tier" language: in my case and Jacob's, that would be Python.  This is where people tend to spend the bulk of their programming time, but not all of it.
  6. A templating language; in Jacob's case, the Django template language.
If you're unlucky, you might need to learn XML, more than one back-end language, a deployment language (UNIX shell scripting or Windows's "batch" language), and ActionScript.  And you'll probably need to learn a smattering of some awful web-server configuration language, like the not-quite-XML, not-quite-HTML dialect used to configure Apache.

Of course, Jacob lists a pile of related technologies too, and rightly points out that it's a lot to keep in your head.  But he is talking about a problem of needing extensive technical knowledge, something which all programmers working in a particular technology ecosystem learn sooner or later.  I'm talking about a different, more fundamental problem: in addition to the surface problem of being complex and often broken, these technologies are fundamentally conceptually incompatible, which leads to a whole host of other problems.  Furthermore, the only component which is really complete is the "middle-tier" language, although bespoke web-only languages like PHP and Arc manage to screw that up too.

Here are a few simple example problems that are made depressingly complex by the impedance mismatch between two of these components, but which are incredibly easy using a different paradigm.

How do you place two boxes with text in them side-by-side?  Using a GUI toolkit, like my favorite PyGTK, it often goes something like this:
import gtk

# Two labels packed side-by-side in a horizontal container.
left = gtk.Label("some text")
right = gtk.Label("some other text")
box = gtk.HBox()
box.add(left)
box.add(right)
The conceptual model here is simple: the HBox() is a container, the "left" and "right" things are widgets, which are in that container.  You can add them, remove them, swap them, or handle events on them easily.  You can discover how these things are done by reading the API references for the appropriate classes of object.  However, there's no right answer to this question on the web.  You can use a <table> tag, and then some <tr>s and <td>s to make a single-row table with two cells, but that has a variety of limitations; plus, it's considered somehow gauche by most web designers to use tables for layout these days.  Or, you could cook up a collection of CSS classes.  So there's the first impedance mismatch: do you do layout in HTML, or CSS?  Of course most design gurus would like to tell you that "always and only CSS" is the right answer here, but more practically-minded web developers who actually write code will often prefer HTML, partly because it's simpler and partly because CSS's feature set is incomplete: there are some things you can still only do with HTML, or only do portably with HTML.

Plus, how do you discover how these layouts work?  There are a variety of reference materials, but no canonical guide that says "this is exactly what a <table> tag should do, and how it should look".  There are different forms of documentation for both.

If you have a variable number of elements, you quickly run into another problem.  Should this be the responsibility of the HTML, the CSS, or some code (in the templating layer) that emits some HTML or some CSS?  Should the code in the templating layer be written as an invocation of your middle-tier language, or should the template language itself have some code in it?  Reasonable people of good conscience disagree with each other in every possible way over every one of these details.

This is all part of a very complex problem though.  For all of these crazy hoops you have to jump through, HTML and CSS do provide a layout model that allows you to do some very pretty and very flexible things with layout, especially if you have large amounts of text.  Perhaps not as good as even the most basic pre-press layout engine, but still better than the built-in facilities that most GUI toolkits give you.  So there is an argument that this complexity is a trade-off, where you get functionality in exchange for the confusion.  So let's look at a much simpler problem.

Let's say that, in our hypothetical accounting application, you have a list of items in a retail transaction, and you want to process the list and produce a sum.  Where is the right place to do that?  It turns out you have to write the code to do that three times.

First, you have to write it in JavaScript.  After all, the numbers are all already in the client / browser, and you want to update the page instantaneously, not wait for some potentially heavily-loaded server to get back to you each time the user presses a key.  And why not?  You've got plenty of processing power available on the client.

Then you have to write it in Python.  That's where the real brain of the application lives, after all, and if you're going to do something like send a job to a receipt printer or email a customer or sales representative some information in response to a sale, the number has to be located in the middle tier.

Finally you have to do it in SQL.  Since this is a traditional web application, your Python code is going to be spread out among multiple servers, and the database is the ultimate arbiter of recorded truth.  So you need to have transactions around the appropriate points and execute any interesting aggregate functions (such as SUM()) in the database tier.
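
To make the duplication concrete, here is that same aggregation written twice just within the back half of the stack (table and column names invented for the example); the JavaScript copy running in the browser would be the third:

import sqlite3

def transaction_total(line_item_cents):
    # The middle-tier copy of the logic, in Python.
    return sum(line_item_cents)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE line_items (txn_id INTEGER, cents INTEGER)")
conn.executemany("INSERT INTO line_items VALUES (?, ?)",
                 [(1, 499), (1, 1250), (1, 25)])

# The database copy of the same logic, in SQL.
total, = conn.execute(
    "SELECT SUM(cents) FROM line_items WHERE txn_id = ?", (1,)).fetchone()

assert total == transaction_total([499, 1250, 25])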

So, you've got three times as much work to do in your fancy new web application as you would in a simple record-based application with a GUI.  A worthy price to pay to run in the brave new world of tomorrow rather than on some crusty old client/server system, right?

Well, as it turns out, the problem is somewhat deeper than that.  It turns out that JavaScript, Python, and SQL actually have slightly different numerical models (in fact Python implements at least 4 itself: fixed-point decimal, floating-point decimal, IEEE 754 floating-point binary, and integer math; you should really only use decimal for money, but this isn't available in JavaScript and its availability in SQL is spotty).  After applying some discounts, your register might read $19.74 but your receipt will read $19.75; and the reports sent to the accounting department will read $19.74898989898989.
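
You don't even have to leave Python to watch the models disagree: binary floating point (the only numeric type classic JavaScript offers) and decimal arithmetic round the same price differently.  The figures below are illustrative, not taken from a real register:

from decimal import Decimal, ROUND_HALF_UP

print(repr(0.1 + 0.2))   # 0.30000000000000004: binary floating-point noise
print(round(2.675, 2))   # 2.67, because the nearest float to 2.675 is a hair below it

cents = Decimal("0.01")
print(Decimal("2.675").quantize(cents, rounding=ROUND_HALF_UP))  # 2.68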

Even if you know a lot about math on computers, the limitations of each of these runtimes, and you happen to get all of that just right, you still have another problem to contend with: what happens when somebody else needs to change the logic in question?  How do you test that the Python, the JavaScript, and the SQL are all still in sync?  It's possible, but you have to go above and beyond the usual discipline of test-driven development, because you need to have integration tests that verify that different, almost unrelated code, in different languages, in different environments is all executing properly in lock-step.  Just getting the code from SQL and JavaScript to run in your Python test suite at all is a major challenge; in a language like PHP it's borderline impossible.

This is all even worse when it comes to security, because every part of the application exposes an attack surface, and because you can't use the same language or the same libraries to do any of the work, they all expose a different attack surface.

In his talk, Jacob notes that "frameworks suck at inter-op", but the problem is much deeper than that.  As I've shown here, a single page from a single application written using a single framework, which has only one task to do, can't even inter-operate with itself cleanly, at least not at the level that Jacob wants — or that I want.  He says, "gateways aren't APIs", and he's right: the correct way to inter-operate is through well-defined APIs.  APIs can be discovered through a single, consistent process.  Their implementations can be debugged using a single set of development tools.

CSS isn't an API.  HTML isn't an API.  Strings containing a hodgepodge of SQL and data aren't an API either.

It's not all doom and gloom, but my ideas for a future solution to this problem will have to wait for another post.

Threat 2: Attacks via E-Mail

Continuing my series on simple threat models for internet users, I'll now address the second threat I mentioned: threats via e-mail.

There are two kinds of e-mail attacks: direct attacks, and trojan horses.  First let's talk about direct attacks.

The basic idea behind a direct e-mail attack is that the program you use to read your e-mail might have flaws in it, which a specially-crafted message will exploit.  That message will have a program in it, and a mistake by the programmers who wrote your e-mail client will cause that program to be executed.

Unlike attacks from the outside, which you can very simply protect against by denying outside attackers access to your computer entirely, there is no fool-proof method to protect against this kind of threat.  E-mail formats are highly complex, and messages can contain multiple parts, including images, etc.  The code that decodes images is notoriously prone to security problems.  Even e-mail programs which don't process images are occasionally prone to security problems dealing with the structure of certain messages.

Chances are that you are going to want to read e-mail somewhere, and you probably want to be able to see images and download attachments; shutting off e-mail completely isn't really an option.  The more general advice I gave against the first threat still applies, though: keep all your software up to date, including your e-mail client.  People who make e-mail software take these kinds of threats very seriously and release updates very quickly when problems are discovered.

One way you can mitigate this risk, and reduce the amount of work required to keep up to date (and therefore the opportunity for you to forget to do so), is to use a web-based e-mail client like GMail.  If you use GMail, the potentially vulnerable program running on your computer is just your web browser, and you already need to keep your browser up to date for other reasons.  The code which deals with the structure of messages is all run on the server, and constantly kept up to date by the fine folks at Google.  Similarly, they take steps to protect your browser, stripping out harmful attachments and filtering spam for you so that potentially dangerous messages never reach you.

The much more common form of e-mail attack is easier to defend against, but is attacking something more potentially vulnerable than your e-mail software: it's attacking you.  A trojan horse is a program which doesn't do anything tricky to get itself run automatically, but instead elicits your cooperation in making it run.  Whether you run a web-based email client or the oldest, buggiest version of Microsoft Outlook, you are equally vulnerable to these kinds of attacks.

The key to defending yourself against a trojan horse is to understand what you are double-clicking on.  Look inside that trojan horse before you open it; there may be a bunch of armed Greeks inside.  Before you open any document or run any program that was attached to an e-mail, very carefully read the message that it came from.  Ask yourself a few questions:
  1. Were you expecting this message?
    If you weren't expecting the message, you should double-check to make sure.  In the best case, use some mechanism other than e-mail to check.  Give the sender a phone call.  Ask if they actually sent you the message in question.
  2. Is the message really from who it says it's from?
    It's very easy to fake e-mail addresses, so if you are used to receiving messages from Bob Dobbs and you see "From: Bob Dobbs <bobdobbs@example.com>", you shouldn't necessarily believe it.  Does the text of the message read like Bob wrote it?  Does Bob usually send you these kinds of attachments?  Is the "To" line correct?  Does he use your real name?  A lot of spam which includes viruses is very generic, but it is increasingly cleverly disguised as coming from people in your address book.
  3. Is an attachment trying to disguise itself?
    Sometimes, even messages you are expecting, from people that you know, will contain evil attachments.  If Bob's computer is infected with a virus, he may well have actually legitimately written you the message but a trojan horse packed itself along for the ride.  In this case, you need to see if the attachment is trying to look like something different than it is.  Does the file's name have multiple extensions?  For example, "business-plan.doc" is a Word document, but "business-plan.doc.exe" is an executable program, with its name changed to pretend to be a Word document to fool you.
  4. Is anything trying to warn you?
    Most browsers and operating systems these days will double-check with you before opening executables which you've downloaded.  If a box pops up saying "Are you sure you want to do that?", don't just click past it immediately; read it completely and try to understand what it's telling you.  Even if you don't understand a word, pausing for a moment to reflect on whether the warning is serious or not will often help you realize that something might be amiss.
If you're careful and look for details which seem out of place, you don't need to be an expert to spot e-mails that look wrong.  The most basic task here is to recognize genuine human communication, and not to scan for any particular technical trick.  That's not all, of course; as I mentioned, there are ways that programs can hijack legitimate communications, but these are much more sophisticated, and much rarer than the much more common type of message, which is one that simply says "hey buddy, click this" and expects you to click on it without thinking.  If you can recognize those you will be safe 99% of the time.

This is a generally useful skill on the Internet, and an especially important one when it comes to security.  It will be particularly useful when I discuss threat #4, phishing attacks.


Goodbye, Divmod. Hello, World!

At the end of this month, Divmod will lay off its last employee and cease to be.

As some of you know, I've been on hiatus for several months now.  The idea was originally that I would take a break, allow the company to build up a small operating buffer to deal with our cash-flow issues, and heal a psyche damaged by many months of intense stress (caused largely by those same cash-flow issues).

The psyche-healing worked out okay.  I'm feeling much better than I was when my break started.  The cash-flow issues, not so much.  The reality turned out to be that much of the new consulting business we were counting on just didn't materialize.  We managed to get quite a bit of maintenance done on our infrastructure — I continued to help out intermittently, interleaving some reviews and bugfixes with hobby projects — but it was no longer really clear what business purpose that infrastructure was serving.  We didn't have any product that generated a revenue stream and we certainly didn't have the resources to build a new one.

Users of Divmod email: I'm not exactly sure what the plan is, but JP and I will personally make sure that you can get your email in some form and we'll work out some way to keep at least a forwarding service running.

Users of Divmod open source projects: we will figure out some way to continue to host and maintain the code.  I'm not sure what we're going to do about official stewardship, but it was years before Twisted needed any official legal structure, so I'm sure we'll make do.

The Divmod Fan Club, which deposits money into my personal paypal account rather than a business one (for stupid technical reasons which are now extremely convenient), is generating enough money that we may be able to afford some hosting, assuming those of you who supported Divmod-the-company would like to continue supporting Divmod-the-ambiguously-defined-collection-of-open-source-projects.  Regardless of whether you decide to cancel your subscriptions now (you can do so in the UI for your PayPal account; nothing to do with us, happily), thank you all very much.  You enabled us to do a lot more with our open-source work than we would otherwise have been able to, and you helped us get through a number of crunches in the past.

The fan club might enable us to host the collection of open source projects, and possibly also host versions of Mantissa, Quotient, and Sine.  I think that having some users would help keep those projects alive in the absence of a corporate sponsor.  I'm not really sure what's going to happen to Blendix, though, and as a proprietary thing it requires more thinking.  If you care deeply about it, please get in touch with me.  Also, if you are a member of the Divmod community who might like to help out with administration, we might need help with mundane things like keeping our Trac instance running.

Now, on to the more personal stuff.

Thanks in advance for your condolences, but I'm feeling okay about this.  Not to say that I don't wish Divmod had ended with more success, but I spoke to Amir and JP yesterday, and we all agreed — it's time to move on.  We tried everything we could think of.  It's time to do something different.

More importantly, I'm not really sure what I'm going to do next.

Right now I'm considering a few things.  I have a couple of job offers, and a few ideas for new businesses that I might want to start myself.  Some of those ideas I would bootstrap myself; some would require funding.

Some of you reading this right now have intimated that you'd like to offer me a job, if I were available.  Some have speculated that you might want to fund some other company that was less ambitious than Divmod.  Well, now's your chance.  Get in touch, and let's talk.

If you can, please do it soon, though.  Some of the offers I'm already considering need a decision soon, but I'd really like an opportunity to consider my options before I jump into the next thing.


Threat 1: Attacks From The Outside

This article continues my series on my personal threat model for the internet.  In this article, I'm going to talk about the threat of automated attacks coming in to your computer over the internet, while it is connected to the internet.

The basic problem underlying this threat is the same as that underlying threats #2 (malicious e-mail messages which attack your e-mail program) and #3 (malicious web pages which attack your web browser): the software you are running on your computer, which you need to do your job, play your games, or otherwise get value out of your computer, is full of bugs.  Some of those bugs are security problems.  The most dangerous type of security problem is one that allows data which a program is reading, and which is only supposed to be processed by that program, to overwrite portions of the program's memory and take it over.  That data is then itself a program, and can take over your computer.  Unfortunately, this type of problem is very common.

The first thing you need to do to protect against these threats is to regularly install security updates for your computer.  On Windows you can do this by using Automatic Updates; on Mac OS X it will be done for you by Software Update; and on Ubuntu, by Update Manager.

When updates are available, make sure to install them as soon as you can!  By the time an update is available, the problem that the update is intended to fix has often been made public already.  The publication of the problem allows the update to be created in the first place, but it also allows malicious individuals to create attacks from it.  The longer you wait, the longer you are vulnerable to problems which have been made public, and thus can be exploited by the largest population of attackers.

However, even if all of your software is fully up-to-date, it still isn't perfect.  The general strategy for dealing with this type of problem, then, is to make sure that only data from sources you trust will ever be allowed into that software.  This limits your exposure to attacks.

In later posts I'll talk about limiting your exposure to malicious data that you have specifically requested, but right now I'm just going to talk about preventing unsolicited data from getting to your computer directly over the internet.  The best way to do this is to get a commodity hardware router, and put it between your computer and the internet.  Devices like this are made by vendors such as Linksys, Belkin, Buffalo, and Netgear.

You don't need to get a router with fancy "security" features like an "SPI firewall" or "intrusion detection".  In my opinion these features don't add a lot - in fact, they will often cause difficult-to-diagnose problems for home users.  Of course, the people who sell these devices love to put the word "security" on the box as many times as possible, but you really only need the most basic security feature, and that's the one that isn't really a "security" feature at all.

The basic feature that a router adds is a separate layer of protection, independent from anything you can do to your computer itself.  If your home computer is hooked up directly to the internet, it looks like this:

[diagram: your computer connected directly to the internet through your modem]

That is, whenever your computer tries to contact another computer on the internet, it sends a request directly via your modem.  Whenever another computer tries to connect to you, it goes directly to your computer.  This means that if there are programs you don't know about, which your operating system vendor or some application has left running on your computer, anyone on the internet will be able to access them.

If those programs were all perfectly secure, that would be fine.  Unfortunately, programmers make mistakes, and mistakes lead to bugs, and bugs sometimes lead to security problems.

When you have a router, the picture looks more like this:

[diagram: your computer behind a router, which sits between your home network and the internet]

which is to say, when your computer submits a request to another computer on the internet, the router sees that the request is coming from inside the network, and transparently forwards it to the outside, establishing a channel of communication.  However, when another computer tries to talk to the IP address that your ISP gives you, the device it finds is the router.  The router itself is a very simple device, and, unless you've done something unusual to it, will never be running any programs beyond the ones necessary to move traffic between you and your network.  Because one of the functions of a router is to allow multiple computers onto your home network, when connections come in from the internet, the router doesn't know which computer they should go to, even if you only have one.  So the incoming connection will be refused, never having a chance to get to your computer.

This is preferable to running "firewall" software on your computer, for two reasons:
  1. Firewall software is still running on your computer, and thus on your operating system.  If your operating system itself has a flaw in it, the firewall can't protect you.
  2. Software which listens for incoming connections is doing so for a reason.  Different components of the same program will sometimes communicate with each other over a network connection internal to the same computer - as a user of those programs, you really shouldn't need to know this.  Firewall software will present you with prompts to allow or deny permission for programs, and these prompts often boil down to "do you want this to work?"  If you say yes, your computer will be exposed to a potential threat; if you say no, the program will break.
Of course, if you've prevented other people's computers from accessing yours, there are some programs which will now be unable to connect to your computer.  BitTorrent, for example, is notorious for performing poorly if other users can't connect to you directly.  Certain voice-over-IP programs will also have problems.  To address this, you can add rules to your router to allow specific incoming connections, without opening the floodgates to everything.  This is referred to as "port forwarding", and portforward.com is a good resource.  If installing a router causes any problems with network applications that you use, consult their documentation: port-forwarding issues are usually prominently covered early on.