
Differences and similarities in web servers

Ask a new grad "how can I achieve concurrency in code?" and you'll get two obvious answers: multi-threading and multi-processing. Both approaches are, fundamentally, schemes for mapping instructions onto the CPU. Although they are very different from the hardware's standpoint, from an engineer's standpoint many use cases end up looking fairly similar. This matters for web servers.

In fact, for a long time those two methods (multi-processing and multi-threading) were the primary ways to map HTTP requests onto a web server's hardware: when a request comes in, it gets assigned to a process or to a thread. In Apache HTTPD-speak, the component that controls which scheme you use is called the "Multi-Processing Module," or MPM. This is all fine and dandy if requests are short-lived and never block. However, nearly all modern web applications depend on data stores, and waiting on those is anything but quick and synchronous.
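The request-to-worker mapping described above can be sketched in a few lines. This is a toy, not how Apache is implemented (Apache's MPMs are C): the handler and requests are made up, and a thread pool stands in for either the thread- or process-per-request scheme.

```python
# Sketch of the classic "one worker per request" model.
from concurrent.futures import ThreadPoolExecutor

def handle_request(request: str) -> str:
    # In a real server this would parse HTTP, run application code,
    # and (crucially) block on I/O such as database calls.
    return f"HTTP/1.1 200 OK\n\n{request.upper()}"

def serve(requests):
    # Each incoming request is mapped onto a worker from a fixed pool,
    # the thread analogue of Apache's process-per-request prefork model.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(handle_request, requests))
```

The trouble the next paragraph describes shows up when `handle_request` blocks: every blocked request pins a whole worker, so concurrency is capped by pool size.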

Traditionally, at least on Linux (~96% of web servers[1]), this is a problem. In the old world, web servers depended on two system calls: select() and poll(). In both cases, your listening process/dispatcher repeatedly polls a set of file descriptors representing socket I/O. A socket is created for each connection and stays alive until closed (so it can block). Internally, web servers run a loop that calls select()/poll(), so each new connection adds to an O(n) scan that happens on every iteration, whether the connections are ready to be processed or not. As you can imagine, when you've got a ton of file descriptors being waited on, a lot of time is wasted just polling. This all changed with the introduction of the epoll() system call on Oct 19, 2002 (Linux kernel version 2.5.44 [2]). epoll() essentially turns the O(n) scan into an O(1) operation, because it returns only the file descriptors that are ready to be operated on. It is the latest and greatest system call (on Linux) that modern web servers use to monitor socket I/O. There's a lot of misinformation on the Internet, with many engineers claiming "Node.js/Nginx is different because of their use of epoll()". This just isn't true: Apache (with mpm_event), Node, Nginx, and lighttpd can all use epoll() on Linux.
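To make the readiness-notification idea concrete, here is a minimal sketch using Python's selectors module, which wraps exactly these system calls: on Linux, `DefaultSelector` is backed by epoll(), falling back to select()/poll() elsewhere. The function name is mine, not any library's.

```python
# Minimal readiness check: register a socket, ask the kernel which
# descriptors are ready, and return only those. With epoll() underneath,
# the kernel hands back ready descriptors directly instead of making
# us scan every registered one as select() does.
import selectors
import socket

sel = selectors.DefaultSelector()

def wait_ready(sock):
    sel.register(sock, selectors.EVENT_READ)
    events = sel.select(timeout=1)   # blocks until something is ready
    sel.unregister(sock)
    return [key.fileobj for key, mask in events]
```

A quick way to see it work is with a socket pair: write to one end, and `wait_ready` reports the other end readable.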

I'd like to preface this section with the fact that the following paragraph wasn't always true: Apache has, at points in its history, been significantly slower than other web servers. Today, however, Apache with mpm_event (the most modern Apache configuration) uses an event-driven model to dispatch to a thread or process pool. That is, Apache will accept a request from the kernel using epoll(), push it onto a queue, and have a pool of threads process that queue. Nginx does essentially the same thing. So if we're talking about web applications based on externally executed code, Nginx and Apache will perform very similarly. The reason is that, while the underlying architectures may differ, the blocking component is the application code, and that code waits on things like databases. If we're using PHP, Python, etc., the difference between Apache and Nginx is minimal when both web servers are tuned correctly. The story is different with mpm_prefork or mpm_worker, where Apache will most likely underperform Nginx and other HTTP servers created after 2002.
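The accept-then-dispatch pattern described above (an event loop feeding a queue that a worker pool drains) can be sketched as follows. This is an illustration under stated simplifications, not Apache's or Nginx's actual implementation: the "requests" are plain strings standing in for accepted connections, and the epoll() accept loop is reduced to iterating over them.

```python
# Sketch of event-driven dispatch: one acceptor pushes work onto a
# queue; a small pool of worker threads drains it.
import queue
import threading

def run_dispatcher(incoming, num_workers=2):
    work = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:          # shutdown sentinel
                break
            # The blocking part (app code, database calls) lives here,
            # off the accept loop, which is the point of the design.
            with lock:
                results.append(f"handled {item}")

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for req in incoming:              # stands in for the epoll() accept loop
        work.put(req)
    for _ in threads:                 # one sentinel per worker
        work.put(None)
    for t in threads:
        t.join()
    return results
```

The key property is that the acceptor never blocks on application work; only the pool does, so a slow request stalls one worker rather than the whole server.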

Where they perform differently, and where Nginx's architecture shines, is static files. Even with Apache using mpm_event, Nginx will on average outperform Apache.

You can see (somewhat recent) results comparing the two here:


You might think that, since Nginx outperforms Apache on static files, we should always reach for Nginx. That's not entirely true. When we're nowhere near Apache's throughput ceiling, the differences between Nginx and Apache most likely don't matter, because we're under capacity either way. When we are close to Apache's ceiling, the equivalent Nginx server is only running at roughly 50% capacity. At that point, an engineer should start thinking about distributed delivery (i.e. a CDN, or a cluster behind a load balancer), where the question of which web server to use matters far less than the ability to horizontally scale content delivery. So really, the choice between Apache and Nginx is arguably moot in most cases that matter.
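The capacity reasoning above is just arithmetic, and a back-of-envelope check makes it concrete. The throughput ceilings below are made-up illustrative numbers (the 2x gap loosely mirrors the static-file difference discussed above), not benchmark results.

```python
# Hypothetical req/s ceilings; the point is the headroom math, not the numbers.
def headroom(current_rps, max_rps):
    """Fraction of capacity still available."""
    return 1 - current_rps / max_rps

apache_max = 10_000   # assumed ceiling for the Apache box
nginx_max = 20_000    # assumed ceiling for an equivalent Nginx box
load = 9_000          # current traffic
```

At 9,000 req/s the Apache box has 10% headroom left and the Nginx box 55%; either way, traffic growth soon forces the horizontal-scaling conversation.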

Today, there's an emergence of web servers that are specific to a programming language and allow the language to talk directly to a socket. Examples of this are Node.js, Tornado, and Netty. I will contend that you will almost always want to put this kind of server behind a frontend web server (either Nginx or Apache). The reason is that the Internet is a hostile place. Both Apache and Nginx have wonderfully convenient ways of handling DoS, ensuring SSL is used properly, logging facilities, and so on. Without a frontend web server, all of this logic must happen in application code, as opposed to using mature, battle-tested, and optimized web server code (typically written in C). Undoubtedly, hardening a web application requires an approach beyond just web server configuration, including a hard look at the application code itself. But you do get an awful lot for free by putting a frontend web server in front of whatever server is hosting your application code.
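As a rough sketch of what that frontend layer looks like in practice, here is a minimal nginx reverse-proxy fragment. The hostname, certificate paths, and backend port are all placeholders; a production configuration would add rate limiting, timeouts, and stricter TLS settings.

```nginx
# nginx terminates TLS and forwards requests to the app server
# (e.g. a Node.js process) listening on localhost.
server {
    listen 443 ssl;
    server_name example.com;                         # placeholder
    ssl_certificate     /etc/ssl/example.com.crt;    # placeholder
    ssl_certificate_key /etc/ssl/example.com.key;    # placeholder

    location / {
        proxy_pass http://127.0.0.1:3000;            # assumed app port
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

With this in place, the application process never faces the open Internet directly and inherits nginx's connection handling, logging, and TLS for free.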

[1] http://www.solvedns.com/statistics/
[2] https://en.wikipedia.org/wiki/Epoll
