Differences and similarities in web servers

If you asked a new grad "how can I achieve concurrency in code?", there are two obvious answers: multi-threading and multi-processing. Fundamentally, both approaches are schemes for mapping instructions onto the CPU/bare metal. Although from the hardware's standpoint they are very different, from an engineer's standpoint many use cases end up looking fairly similar. This matters for web servers.
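To make that concrete, here's a minimal C sketch of both primitives side by side (error handling is omitted, and the printf stands in for real request-handling work):

```c
/* Minimal sketch: the two classic concurrency primitives. */
#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

void *handle_in_thread(void *arg) {
    /* Threads share the parent's address space: cheap to create,
     * but shared state needs synchronization. */
    printf("thread handling work\n");
    return NULL;
}

int main(void) {
    /* Multi-processing: fork() clones the process; the child gets
     * its own copy-on-write address space. */
    pid_t pid = fork();
    if (pid == 0) {
        printf("process %d handling work\n", (int)getpid());
        _exit(0);
    }
    waitpid(pid, NULL, 0);

    /* Multi-threading: same job, shared memory instead. */
    pthread_t t;
    pthread_create(&t, NULL, handle_in_thread, NULL);
    pthread_join(t, NULL);
    return 0;
}
```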

In fact, for a long time, those two methods (multi-processing and multi-threading) were the primary ways to map HTTP requests onto a web server's hardware: when a request comes in, it maps either to a process or to a thread. In Apache HTTPD-speak, you can control which scheme you use; the component responsible for that mapping is called the "Multi-Processing Module", or MPM. This is all fine and dandy as long as requests are short-lived and don't do anything blocking or asynchronous. However, nearly all modern web applications talk to data stores, and those calls block while the process or thread sits idle waiting on I/O.
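As an illustration, here's roughly what selecting and tuning the prefork MPM (one process per concurrent request) looks like in Apache 2.4. The directive names are real; the module path and the numbers are placeholders you'd tune per workload:

```apache
# httpd.conf excerpt — module path varies by distribution
LoadModule mpm_prefork_module modules/mod_mpm_prefork.so

<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    MaxRequestWorkers   150   # one process per concurrent request
</IfModule>
```

Swap in mpm_worker and each child process runs a pool of threads instead of handling a single request.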

Traditionally, at least on Linux (~96% of web servers[1]), this is a problem. In the old world, web servers depended on two system calls: select() and poll(). In both cases, your listening process/dispatcher repeatedly polls a set of file descriptors representing socket I/O. A socket is created for each connection and stays alive until it is closed (and reads and writes on it can block). Internally, web servers run a loop that calls select()/poll(), so every new connection adds to an O(n) scan that happens on every iteration of that loop, whether or not the descriptors are ready to be processed. As you can imagine, when you've got a ton of file descriptors you're waiting on, a lot of time is wasted just polling.

This all changed with the introduction of epoll() on Oct 19, 2002 (Linux kernel version 2.5.44 [2]). epoll() essentially turns that O(n) scan into an O(1) operation, because epoll_wait() returns only the file descriptors you actually need to operate on. This is the latest and greatest mechanism (on Linux) that modern web servers use to monitor socket I/O. There's a lot of misinformation on the Internet where engineers say "Node.js/Nginx is different because of their use of epoll()". This just isn't true: Apache (with mpm_event), Node, Nginx, and Lighttpd can all use epoll() on Linux.
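To see the difference, here's a minimal C sketch of an epoll()-based accept loop (error handling trimmed, port 8080 arbitrary). The key call is epoll_wait(): it hands back only the ready descriptors, whereas select()/poll() make you re-submit and re-scan the entire set on every iteration:

```c
/* Minimal epoll() accept loop — Linux only, error handling omitted. */
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <unistd.h>

#define MAX_EVENTS 64

int main(void) {
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(8080);
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, SOMAXCONN);

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        /* Blocks until something is ready; returns ONLY ready fds,
         * no O(n) scan over everything we've registered. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                /* New connection: register it with the same epoll set. */
                int conn_fd = accept(listen_fd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn_fd };
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn_fd, &cev);
            } else {
                /* Existing connection has data: read the request here. */
                char buf[4096];
                ssize_t r = read(events[i].data.fd, buf, sizeof(buf));
                if (r <= 0)
                    close(events[i].data.fd); /* closing also deregisters it */
                /* ...parse request, write response... */
            }
        }
    }
}
```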

I'd like to preface this section with the fact that the following paragraph wasn't always true: Apache has, at points in its history, been significantly slower than other web servers. Today, however, Apache with mpm_event (the most modern configuration of Apache) uses an event-driven model to dispatch to a thread pool. That is, Apache accepts a request from the kernel using epoll(), pushes it onto a queue, and has a pool of threads process that queue. Nginx does essentially the same thing. So if we're talking about web applications based on externally executed code, Nginx and Apache will perform very similarly. The reason is that, while the underlying architectures differ, the blocking component is the application code, and that code waits on things like databases. If we're using PHP, Python, etc., the difference between Apache and Nginx is minimal when both web servers are tuned correctly. The story is different with mpm_prefork or mpm_worker, where Apache will most likely underperform Nginx and other HTTP servers created after 2002.
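For reference, enabling that model looks something like this (again, real directive names, placeholder numbers): a listener thread watches sockets with epoll() and hands ready connections to the worker threads.

```apache
# httpd.conf excerpt — module path varies by distribution
LoadModule mpm_event_module modules/mod_mpm_event.so

<IfModule mpm_event_module>
    StartServers            3
    MinSpareThreads        75
    MaxSpareThreads       250
    ThreadsPerChild        25
    MaxRequestWorkers     400   # total worker threads across all children
</IfModule>
```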

Where they perform differently, and where Nginx's architecture shines, is static files. Even against Apache with mpm_event, Nginx will, on average, outperform Apache.
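A minimal sketch of nginx serving static content (the paths and numbers are placeholders; use, sendfile, and tcp_nopush are real directives):

```nginx
# nginx.conf excerpt — paths are placeholders
events {
    use epoll;                # explicit; nginx auto-detects epoll on Linux anyway
    worker_connections 1024;
}

http {
    server {
        listen 80;
        root /var/www/html;   # static files served straight off disk
        sendfile   on;        # kernel copies file to socket, no user-space buffer
        tcp_nopush on;        # fill packets before flushing them out
    }
}
```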

You can see (somewhat recent) benchmark results comparing the two here:

[benchmark chart comparing Apache and Nginx; original link/image missing]
You might think that, given Nginx outperforms Apache on static files, we should always reach for Nginx. That's not entirely true. When we're nowhere near Apache's throughput threshold, the differences between Nginx and Apache mostly don't matter, because either server would be under capacity. When we are close to Apache's threshold, the equivalent Nginx server would be running at roughly 50% capacity, and at that point an engineer should start thinking about distributed delivery (i.e. a CDN, or a cluster behind a load balancer), where the ability to horizontally scale content delivery matters far more than which web server sits on each box. So really, the choice between Apache and Nginx is arguably moot in most of the cases that matter.
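As a sketch of that last idea (the backend hostnames are made up), an nginx upstream block is all it takes to spread delivery across a cluster:

```nginx
# nginx.conf excerpt — backend hostnames are placeholders
http {
    upstream static_cluster {
        server static1.example.com;
        server static2.example.com;
        server static3.example.com;
    }

    server {
        listen 80;
        location / {
            proxy_pass http://static_cluster;   # round-robin by default
        }
    }
}
```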

Today, we're seeing the emergence of web servers that are specific to a programming language and let the language talk directly to a socket. Examples are Node.js, Tornado, and Netty. I will contend that you will almost always want to put this kind of server behind a frontend web server (either Nginx or Apache). The reason is that the Internet is a hostile place. Both Apache and Nginx have wonderfully convenient ways of handling DoS, ensuring SSL is used properly, logging, and so on. Without a frontend web server, all of that logic has to live in application code instead of in mature, battle-tested, optimized web server code (typically written in C). Undoubtedly, hardening a web application requires going beyond web server configuration and taking a look at the application code itself. But you do get an awful lot for free by putting a frontend web server in front of whatever server is hosting your application code.
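Here's a minimal sketch of that setup, with nginx terminating SSL and proxying to a local app server (the domain, certificate paths, log path, and port 3000 are all placeholders):

```nginx
# nginx.conf excerpt — domain, cert paths, and upstream port are placeholders
server {
    listen 443 ssl;
    server_name example.com;
    ssl_certificate     /etc/nginx/certs/example.com.pem;
    ssl_certificate_key /etc/nginx/certs/example.com.key;
    access_log /var/log/nginx/app.log;        # logging handled at the front

    location / {
        proxy_pass http://127.0.0.1:3000;     # Node.js/Tornado/Netty app
        proxy_set_header Host            $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```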

[1] http://www.solvedns.com/statistics/
[2] https://en.wikipedia.org/wiki/Epoll
