Lunch and learn – Topics in Scaling Web Applications – From tens to hundreds of concurrent users

Topics in Scaling Web Applications

From tens to hundreds of concurrent users

About the Author

Victor Piousbox is a senior full-stack software engineer. He draws on eight years of development experience to recommend and implement non-trivial technical solutions. He likes to find and address performance bottlenecks in applications, and he works hard to be able to recommend the best tools and the right approach to a challenge.

Intro

In this episode we’ll talk about the particular challenges we faced in February 2017 at Operaevent, when we were addressing the resilience, performance, and scalability of our infrastructure.

Situation

We have a hybrid stack that makes heavy use of Ruby and JavaScript. We use an N-tier architecture, with RESTful and socket communication between the back end and the frontends. MongoDB provides persistent storage, Redis provides in-memory storage, S3 holds files, and caching is done on disk.

The frontends are: the chatbot interface (which speaks IRC), a jQuery-heavy web UI, a React.js Chrome extension, and the jQuery-heavy OBS layer.

The middle tier consists of the Ruby API, a Node.js socket emitter, a number of Ruby services, and a number of background and periodic workers.

Problem

After reaching a certain usage threshold, our services, particularly the chatbot interface, started crashing frequently. The worst of it was that the chatbot would consistently go offline at night, outside business hours, when nobody was in the office to fix it or at least restart it. This happened every night for several weeks, at the most inconvenient of times: at 4 am or around midnight. It became critical to guarantee much better uptime in order for our service to be usable.

Action

We spent a non-trivial amount of effort troubleshooting the issues. We would find a performance bottleneck, address it, and then be well positioned to find the next one. With this iterative approach we implemented tens of changes, and the end result of the process was a resilient application. When we were done, the application stopped falling over altogether. We don’t have exact metrics on stability, but it was well within the requirements to consider our services stable.

The first step was to take a look at the logs: Apache logs, application logs (each service has its own log), and error and access logs.

Additionally, we implemented services that collected the metrics we were interested in, so we had custom logs on the performance of our boxes.

We installed a number of monitoring agents, including the MongoDB monitoring agent.

There was a particular error message in the logs that preceded downtime: “unable to get a db connection within {timeout} seconds.” We built a simple stress test that could reproduce the exact error on a small scale. Once the error was consistently reproducible, it was much easier to find the exact numbers and configuration parameters that were causing it. To address this bottleneck, we increased the number of database connections in the application’s connection pool and adjusted the timeout interval to a sensible value.
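
As a rough sketch of the kind of change involved (the pool size, timeout, and database name below are made up for illustration; exact option names depend on the driver version), the Mongo Ruby driver lets you set both values when the client is created:

    require 'mongo'

    # Hypothetical values: the goal is a pool larger than the number of
    # threads that may need a connection at once, plus a sensible wait
    # timeout instead of the default.
    client = Mongo::Client.new(
      ['127.0.0.1:27017'],
      database:           'chatbot_production',  # made-up name
      max_pool_size:      25,  # connections kept in the pool
      wait_queue_timeout: 5    # seconds to wait for a free connection
    )

If the client is managed through Mongoid, the same driver options can go in the options: section of mongoid.yml.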

Next was an error having to do with file descriptors: the kernel would complain that too many file descriptors were open, and we would experience downtime.

The basic fix was to increase the number of file descriptors that a process can hold open at any time. This limit exists per user as well as system-wide.

It took us a while to discover that upstart, the service manager on Ubuntu 14.04, does not honor ulimit settings. This is because ulimit settings are per-session, and services are not run in sessions; upstart has its own mechanism for defining those limits. Furthermore, the number of file descriptors can be set at the system level, which is what we did in the end.
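
For reference, the three layers of that change look roughly like this (the username and numbers are illustrative, not our exact values):

    # /etc/security/limits.conf -- per-user limits, applied to login sessions
    deploy  soft  nofile  65536
    deploy  hard  nofile  65536

    # upstart job file, e.g. /etc/init/chatbot.conf -- upstart ignores ulimit,
    # so the limit has to be declared in the job definition itself
    limit nofile 65536 65536

    # /etc/sysctl.conf -- system-wide ceiling on open files
    fs.file-max = 2097152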

In addition to increasing the limits on open file descriptors, we separated the services into individual users. Each service is now run by its own user, as opposed to one user running all the services. This allows us better scaling and better separation of services.

The next step was a manual code review. We looked at what the code was doing, to see if any areas of it looked problematic. There were several safeguards, checks that were computationally expensive. We refactored them so that each check is either fast, happens less often, happens at a later time, or happens in the background.
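
As an illustration of the “happens in the background” variant (a sketch only: the worker library and the model and check names here are assumptions, not our exact code), an expensive safeguard that used to run inline on every message can be pushed to a background job:

    require 'sidekiq'

    # Hypothetical example: an expensive check that used to block the
    # message path now runs asynchronously in a worker process.
    class SafeguardWorker
      include Sidekiq::Worker

      def perform(message_id)
        message = Message.find(message_id)          # hypothetical model
        message.update(flagged: true) if ExpensiveCheck.call(message)
      end
    end

    # In the hot path we only enqueue, which is a single cheap push:
    # SafeguardWorker.perform_async(message.id.to_s)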

We looked at the database queries to see which took the most time. Unsurprisingly, there were some optimizations to be made there. We denormalized some data to reduce the number of queries executed for each chat message. Overall, we probably halved the number of queries per chat message.
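
A sketch of what that kind of denormalization could look like (assuming Mongoid-style models; the field and model names are made up): instead of fetching the user document on every read of a chat message, the handful of fields needed for display are copied onto the message when it is created.

    require 'mongoid'

    # Hypothetical models: one extra query at write time saves a user
    # lookup on every subsequent read of the message.
    class ChatMessage
      include Mongoid::Document

      field :body,       type: String
      field :user_id,    type: BSON::ObjectId
      field :username,   type: String   # denormalized from User
      field :user_badge, type: String   # denormalized from User

      before_create do
        user = User.find(user_id)
        self.username   = user.username
        self.user_badge = user.badge
      end
    end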

We implemented a watchdog on the service: if the service does not respond within a set time (60 seconds), we get a notification. We could have made the watchdog restart the service automatically, but instead we opted to receive a notification only and restart manually as necessary.
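
A minimal version of such a watchdog might look like the following (the health endpoint URL and the notifier are assumptions, not our actual setup):

    require 'net/http'

    HEALTH_URL = URI('http://localhost:3000/health')  # hypothetical endpoint
    TIMEOUT    = 60                                   # seconds

    loop do
      begin
        Net::HTTP.start(HEALTH_URL.host, HEALTH_URL.port,
                        open_timeout: TIMEOUT, read_timeout: TIMEOUT) do |http|
          http.get(HEALTH_URL.path)
        end
      rescue StandardError => e
        Notifier.alert("chatbot unresponsive: #{e.class}")  # hypothetical notifier
      end
      sleep 60
    end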

We refactored the application to cleanly separate message sending and receiving from message processing. With the conversion to background workers for every chatbot command, we are better positioned to scale: we can increase the hardware resources allocated to message processing without producing duplicate messages, and we can fail over message sending and receiving without affecting message processing. This gives us the ability to scale each individual component as needed. Apart from tweaking configuration parameters, this was the single most important change we introduced.
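
Conceptually, the receiving side now does little more than enqueue, and a pool of workers does the actual command processing. Again a sketch, with the worker library and command registry as assumptions:

    require 'sidekiq'

    # Hypothetical worker: one job per chatbot command. Because each
    # command becomes a single job, adding worker processes scales
    # processing without producing duplicate messages.
    class CommandWorker
      include Sidekiq::Worker

      def perform(channel, command, args)
        handler = CommandRegistry.fetch(command)    # hypothetical lookup
        handler.call(channel: channel, args: args)
      end
    end

    # In the IRC receive loop, parsing and enqueueing is all that happens:
    # CommandWorker.perform_async('#somechannel', '!uptime', [])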

For sending and receiving messages, we converted from using a database to using an in-memory queue: we went from MongoDB to Redis. Additionally, we went from polling to a callback architecture for that piece. Instead of polling the database every second or two, we now register a callback with Redis that is triggered when something is pushed onto the queue. While I believe this did not directly affect resilience and stability, it did produce a noticeable performance improvement.
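
In Ruby terms the change looks roughly like this (the queue name and sender are made up): a blocking pop on a Redis list wakes up as soon as something is pushed, which is what replaced the one-to-two-second database poll.

    require 'redis'

    redis = Redis.new(host: 'localhost', port: 6379)

    loop do
      # BLPOP blocks until a message is pushed onto the list; a timeout of
      # 0 means wait indefinitely, so there is no polling interval at all.
      _queue, payload = redis.blpop('chat:outgoing', timeout: 0)
      ChatSender.deliver(payload)   # hypothetical sender
    end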

Findings and Changes

We implemented about a dozen changes, with the cumulative result that our infrastructure became stable.

Tools we used
  • log analysis, better log collection
  • more monitoring, custom monitoring and log collection
  • custom stress tests
  • code review and optimizations, db queries review and optimizations
  • more caching
  • moving storage in memory (redis)
  • converting timed polling to event callbacks
  • introducing a watchdog, better use of background workers
  • denormalization of data in the db
  • security settings tweaking, application configuration tweaking.

Planning Ahead

We can still separate the services further. At this time, a single virtual box can be running several services: it can be an API app at the same time as it is a websocket app. However, we anticipate that all the services will be separated out into individual boxes. Furthermore, we can cluster each service and have several machines powering a single service. The architectural decisions we have made so far in this stack would accommodate that.

We can add utility boxes that do the heavier data-processing operations. One of the computationally expensive things we do is report generation, which currently happens on production boxes. We can offload that work to utility boxes so that production boxes do not see a usage spike.
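
One possible way to set that up (sketched here with a Sidekiq-style queue; the names are hypothetical) is to route report jobs onto a dedicated queue that only the utility boxes consume:

    require 'sidekiq'

    class ReportWorker
      include Sidekiq::Worker
      # Only utility boxes run a worker process listening on this queue,
      # so production boxes never pick these jobs up.
      sidekiq_options queue: 'reports'

      def perform(report_id)
        Report.find(report_id).generate!   # hypothetical model
      end
    end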
