Even without accounting for the sizeable overhead of spawning an OS process that, on average, twiddles its thumbs for a minute before reporting that no one has sent the user a message, the waiting time could be spent servicing 60-some requests for regular Facebook pages. The result of running out of Apache processes over the entire Facebook web tier is not pretty, nor is the dynamic configuration of the Apache process limits enjoyable.

The project I'm currently working on, Facebook Chat, offered a nice set of software engineering challenges: The most resource-intensive operation performed in a chat system is not sending messages.

It is rather keeping each online user aware of the online-idle-offline states of their friends, so that conversations can begin.
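To make the asymmetry concrete, here is a minimal sketch of presence fan-out. The names `friends_of`, `online_users`, and `presence_recipients` are hypothetical and the data is made up for illustration; this is not how Facebook Chat stores friend lists. A chat message has one recipient, while a single presence change has to reach every online friend.

```python
def presence_recipients(user, friends_of, online_users):
    """Return the online friends who must hear about `user`'s state change."""
    return sorted(f for f in friends_of.get(user, ()) if f in online_users)


# Hypothetical data: alice has three friends, two of whom are online.
friends_of = {"alice": {"bob", "carol", "dave"}}
online_users = {"alice", "bob", "dave"}

# One state change by alice turns into one notification per online friend.
recipients = presence_recipients("alice", friends_of, online_users)
```

With friend lists in the hundreds, every online/idle/offline transition multiplies into that many deliveries, which is why presence tends to dominate over message sending.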

Fault tolerance is a desirable characteristic of any big system: if an error happens, the system should try its best to recover without human intervention before giving up and informing the user.

The results of inevitable programming bugs, hardware failures, and the like should be hidden from the user as much as possible and isolated from the rest of the system.

This isn't by any means a new technique: it's a variation of Comet, specifically XHR long polling, and/or BOSH.
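For readers who haven't met long polling before, the server side can be sketched roughly as follows: the request is held open until a message arrives or a timeout expires, and the client immediately re-issues it. This is an illustration using Python's `asyncio`, not the production implementation; the `Mailbox` class and its method names are assumptions made for the example.

```python
import asyncio


class Mailbox:
    """Per-user message queue that a long-poll request waits on (hypothetical)."""

    def __init__(self):
        self._queue = asyncio.Queue()

    def deliver(self, message):
        # Called when someone sends this user a message.
        self._queue.put_nowait(message)

    async def long_poll(self, timeout):
        """Wait up to `timeout` seconds; return pending messages or []."""
        try:
            first = await asyncio.wait_for(self._queue.get(), timeout)
        except asyncio.TimeoutError:
            return []  # nothing arrived; the client simply polls again
        messages = [first]
        # Drain anything else that queued up so one response carries it all.
        while not self._queue.empty():
            messages.append(self._queue.get_nowait())
        return messages


async def demo():
    box = Mailbox()
    empty = await box.long_poll(timeout=0.01)
    # Simulate a message arriving while the request is parked.
    asyncio.get_running_loop().call_later(0.01, box.deliver, "hi")
    received = await box.long_poll(timeout=1.0)
    return empty, received
```

The point of the sketch is that a parked request costs only a suspended coroutine rather than a whole OS process.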

Having a large number of long-running concurrent requests makes the Apache part of the standard LAMP stack a dubious implementation choice.

Surfacing connected users' idleness greatly enhances the chat user experience but further compounds the problem of keeping presence information up-to-date.

Each Facebook Chat user now needs to be notified whenever one of his/her friends (a) takes an action such as sending a chat message or loading a Facebook page (if tracking idleness via a last-active timestamp) or (b) transitions between idleness states (if representing idleness as a state machine with states like "idle-for-1-minute", "idle-for-2-minutes", "idle-for-5-minutes", "idle-for-10-minutes", etc.).
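Option (b) can be sketched as a small state machine that only emits a notification when a user crosses from one idleness bucket to the next. The thresholds below mirror the example states above, but the `"active"` state, the `IdleTracker` class, and its method names are assumptions for illustration, not the actual design.

```python
from bisect import bisect_right

# Bucket boundaries in seconds, matching the example states in the text.
IDLE_THRESHOLDS = [60, 120, 300, 600]
IDLE_STATES = [
    "active",  # assumed name for "idle for less than a minute"
    "idle-for-1-minute",
    "idle-for-2-minutes",
    "idle-for-5-minutes",
    "idle-for-10-minutes",
]


def idle_state(idle_seconds):
    """Map seconds since last activity to a coarse idleness state."""
    return IDLE_STATES[bisect_right(IDLE_THRESHOLDS, idle_seconds)]


class IdleTracker:
    """Tracks last activity per user and reports only state transitions."""

    def __init__(self):
        self._last_active = {}
        self._last_state = {}

    def record_activity(self, user, now):
        # The user did something (sent a message, loaded a page).
        self._last_active[user] = now
        return self._transition(user, now)

    def tick(self, user, now):
        # Periodic check; `user` must have recorded activity at least once.
        return self._transition(user, now)

    def _transition(self, user, now):
        state = idle_state(now - self._last_active[user])
        if self._last_state.get(user) == state:
            return None  # no change, so friends need not be notified
        self._last_state[user] = state
        return state
```

Compared with pushing a fresh last-active timestamp on every action, this trades precision for far fewer notifications: friends hear about a user only when the bucket changes.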

