
May 6, 2019 – Optimizing a web app for a 400x traffic increase

Running my service/hobby project Webhook.site (GitHub page) has presented me with quite a few challenges in optimizing the VPS (virtual private server) the site runs on and squeezing as much performance as possible out of it.

Originally, when it launched, Webhook.site used a completely different datastore and different Web server software than it does now.

In this post I'll try to chronicle some of the challenges of keeping up with traffic that has grown from 20 to almost 8000 weekly users – a 400x increase!

March 2016

The first code was committed to Git on 21 March 2016: the basic Laravel framework files, database migrations, models, etc.

SQLite was chosen as the datastore since it was easy to get started with (I planned on migrating to either MariaDB or Postgres later on, but – spoiler alert – that never happened.)

For the Websocket/push functionality, I chose Pusher, a SaaS that specialises in hosting this kind of service. The free plan was more than enough!

For hosting, I placed it on my personal server, a 1 GB DigitalOcean VPS which hosts a bunch of other stuff.

November 2016 – Pusher running out of connections

I posted the site on Hacker News, and it got its first surge in traffic (around 4000 unique users), which it handled acceptably (considering almost all visitors just loaded the front page and closed it after a couple of seconds). I got a good amount of feedback, and I also noticed that my free Pusher plan was running out of its allotted connections. Fortunately, someone at Pusher was reading Hacker News and bumped up my connection count!

Early 2017 – Lawyers & moving out for the first time

Traffic was growing, and I implemented a cap of 500 requests per URL, since the SQLite database was getting very large (several gigabytes) – users often forgot to remove the URL after they were done testing and had moved to production! Around this time I also noticed that the site was getting slow, and discovered I had forgotten to add some indexes in the database. After adding those, the site was a lot faster.
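
As an illustration, adding such indexes in a Laravel migration looks something like the following – a sketch only, since the table and column names here are my guesses, not the actual schema:

<?php
// Hypothetical migration sketch – real table/column names may differ.
use Illuminate\Support\Facades\Schema;
use Illuminate\Database\Schema\Blueprint;

Schema::table('requests', function (Blueprint $table) {
    // Look up all requests belonging to a URL token without a full table scan.
    $table->index('token_id');
    // Sort and prune requests by creation time efficiently.
    $table->index('created_at');
});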

Around this time, I was also contacted by a lawyer representing a company whose developer had used Webhook.site to test a credential validation and upload service. They forgot the webhook subscription was still in place when the service went to production, which had resulted in some very personal data being uploaded, and they wanted me to make sure it was removed. I removed the URL and related data and never heard from them again.

On the server side, I decided it was time to move the site to its own server, since it had started interfering with my primary one. I chose the smallest DigitalOcean size (512 MB RAM), on which I installed Debian. Nginx was chosen as the Web server, serving PHP 7.1 via FPM.
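
For reference, a minimal Nginx-to-FPM pairing of that era looked roughly like this (the paths and server block are assumptions, not the actual config):

# /etc/nginx/sites-available/webhook.site (sketch)
server {
    listen 80;
    server_name webhook.site;
    root /var/www/webhook.site/public;

    location / {
        try_files $uri $uri/ /index.php?$query_string;
    }

    location ~ \.php$ {
        include snippets/fastcgi-php.conf;
        # Hand PHP requests to the FPM pool over a UNIX socket.
        fastcgi_pass unix:/run/php/php7.1-fpm.sock;
    }
}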

At this point, Webhook.site had around 70 daily users.

Late 2017 – Caching issues

I started running into various performance problems regarding Laravel, caching and Redis connectivity.

First, I couldn't figure out why the site was so slow. I had enabled Laravel's rate limiting feature, which caches each visitor's IP address along with their number of recent requests so it can throttle the connections.
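
Laravel ships this as the throttle middleware; enabling it on a route is a one-liner. The limits and controller below are illustrative, not the values the site actually used:

<?php
// routes/web.php – Laravel's built-in rate limiting middleware.
use Illuminate\Support\Facades\Route;

// Allow at most 60 requests per minute per client.
Route::middleware('throttle:60,1')->group(function () {
    Route::get('/', 'HomeController@index'); // hypothetical controller
});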

As it turns out, I had forgotten to change the default caching mechanism, which was the file (disk) cache. Each visit to the site therefore caused Laravel to read and write a file on disk, which generated a lot of I/O.

To fix it, I installed Redis and pointed the cache at it, which immediately improved performance.
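
In Laravel, that change is mostly a matter of switching the cache driver in the environment file, something like:

# .env – point Laravel's cache at Redis instead of the default file store.
CACHE_DRIVER=redis
REDIS_HOST=127.0.0.1
REDIS_PORT=6379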

Early 2018 – New year, new datastore

Traffic was now growing steadily, and the site had around 300-400 daily users at this point.

From the commit logs, I can see that I switched Redis from TCP to a UNIX socket. This improved performance quite a bit, since Redis traffic no longer consumed TCP connections that could otherwise be used for serving the site.
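
In practice that means enabling the socket in redis.conf and pointing the client at the socket path instead of a host and port; roughly like this (the paths are typical Debian defaults, not necessarily the ones I used):

# /etc/redis/redis.conf – serve clients over a UNIX socket, not TCP.
port 0
unixsocket /var/run/redis/redis.sock
unixsocketperm 770

And on the Laravel side, assuming the then-default predis client:

// config/database.php (excerpt)
'redis' => [
    'default' => [
        'scheme'   => 'unix',
        'path'     => '/var/run/redis/redis.sock',
        'database' => 0,
    ],
],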

In March 2018, the number of users had doubled again – to around 600 daily active users – and SQLite just wasn't cutting it anymore; the database file was getting very large.

I had to switch to a different datastore. After considering the options – moving the data from SQLite to MySQL wasn't as straightforward as I'd hoped – I chose Redis: I calculated that the total amount of data could fit in RAM if I only stored the last few days' worth, which would be easy since Redis supports expiring keys.
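
The core trick is that every write can carry a TTL, so old requests delete themselves. A minimal sketch of the idea using the phpredis extension – the key layout, TTL and data here are made up for illustration:

<?php
// Store an incoming request under its URL's token, with an expiry.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$token   = '1f70b23f-9f17-4755-91f8-de2f41d2f3d0'; // hypothetical URL token
$request = json_encode(['method' => 'POST', 'body' => '{"hello":"world"}']);

// SETEX writes the value and its expiry atomically; once the TTL
// elapses, Redis deletes the key and reclaims the memory on its own.
$ttl = 3 * 24 * 60 * 60; // keep requests for ~3 days
$redis->setex("request:{$token}:1", $ttl, $request);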

I rewrote a large portion of the application (including a migration script) and temporarily upgraded the DigitalOcean VPS to a 16 GB instance so it could hold all of the data in memory until the bulk of it expired. After a week or so, I downgraded to 1 GB of RAM, since the remaining data now fit in memory.

Mid 2018 – Self-hosted Socket.io

The number of users doubled again, and even the increased connection limit Pusher had granted me was no longer enough, which meant users had to reload the app manually instead of streaming new requests in real time.

I switched over to Laravel Echo Server, a Socket.io server written in Node.js, which has served me well to this day. I proxied it through Nginx and the site kept humming along.
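
Proxying Socket.io through Nginx mostly comes down to forwarding the WebSocket upgrade headers; a typical location block looks something like this (port 6001 is Laravel Echo Server's default, and the path is an assumption):

# Nginx: forward websocket traffic to Laravel Echo Server.
location /socket.io {
    proxy_pass http://127.0.0.1:6001;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
}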

Late 2018 – Accidental DDoS & HAProxy to the rescue

Keeping the site running started taking up a lot of my time, mainly due to a few users who accidentally deployed their URLs to production, causing the site to be hit by large amounts of requests from many different IP addresses at once (or a few IP addresses sending tons of traffic).

Laravel's rate limiter worked fine, but I wanted something that worked at the firewall level, so connections would be discarded before they hit PHP and ate up resources. So I tried to implement a blocking mechanism in the firewall.
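
At that level, blocking an abusive source is a single rule; for example (the address below is a documentation placeholder, standing in for a real offender):

# Drop all packets from an abusive IP before they ever reach Nginx/PHP.
iptables -I INPUT -s 203.0.113.50 -j DROP
# Or, equivalently, with UFW:
ufw deny from 203.0.113.50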

I also upgraded the server in December to 4 GB of RAM, which was just barely able to fit all the data. It mostly ran at close to 100% CPU.

The firewall blocking mechanism worked for a little while, but turned out to be very buggy – I was running iptables and UFW commands from PHP! – so I disabled it and started thinking about alternative solutions.

At around this time, a user had apparently forgotten to remove a URL from a test version of some sort of mobile advertising solution. Tens of thousands of mobile phones were constantly requesting the same endpoint, and it was ruining the site for everybody. Something had to be done: I needed a way to quickly drop incoming connections that matched the URL in question.

Having had experience with HAProxy in the past, I was well aware of how efficient it is as a proxy compared to Nginx – up until now I had also been using Nginx to forward Laravel Echo Server traffic. So I put HAProxy in front of Nginx, moved the Echo Server proxying to HAProxy, and added rules to the HAProxy configuration that immediately dropped requests to the URL causing trouble. That turned out to be a good decision.
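
The relevant part of such a setup is just an ACL plus a deny rule. Here's a sketch, not the actual production config – the token in the path is a placeholder, and the backend ports assume Nginx was moved off port 80 to make room for HAProxy:

# haproxy.cfg (excerpt)
frontend http-in
    mode http
    bind *:80

    # Match the URL that is being hammered (placeholder token).
    acl abusive_url path_beg /00000000-0000-0000-0000-000000000000

    # Answer with a cheap error response instead of passing it to Nginx.
    http-request deny if abusive_url

    # Websocket traffic goes to Laravel Echo Server, the rest to Nginx.
    acl is_websocket path_beg /socket.io
    use_backend echo if is_websocket
    default_backend nginx

backend nginx
    mode http
    server web 127.0.0.1:8080

backend echo
    mode http
    server echo 127.0.0.1:6001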

CPU usage fell to under 30% immediately, resulting in the site loading instantly even though it was getting bombarded by requests – success.

April 2019 "DDoS" – nitty-gritty kernel tweaking

Around April 11, another surge of accidental traffic hit the server, apparently also from some sort of advertising software, where a bunch of popular apps were loading the same Webhook.site URL. I assumed this was just forgetfulness; someone forgot to remove a snippet of code before pushing to production.

Still running on a relatively small 4 GB/2-core VPS, I took a look at tweaking various default Linux configuration files, namely /etc/sysctl.conf and /etc/security/limits.conf. At the bottom of this post, I've listed some links to resources that helped me find the right values.

Here's what I ended up with (I'll save a deeper explanation for a later post):

/etc/sysctl.conf:

vm.overcommit_memory = 1
fs.file-max = 2097152
net.core.somaxconn = 4096
net.ipv4.ip_local_port_range = 2000 65000
net.ipv4.tcp_rfc1337 = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.netfilter.nf_conntrack_generic_timeout = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 54000
net.core.netdev_max_backlog = 4000
net.ipv4.tcp_max_syn_backlog = 2048
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 20
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 20
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 20
net.netfilter.nf_conntrack_max = 524288
net.ipv4.tcp_mem = 65536 131072 262144
net.ipv4.udp_mem = 65536 131072 262144
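
The new values can be applied without a reboot:

# Load the values from /etc/sysctl.conf into the running kernel.
sysctl -p

One caveat worth noting: net.ipv4.tcp_tw_recycle is known to break connections for clients behind NAT and was removed entirely in Linux 4.12, so that line only has an effect on older kernels.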

Then, around 10 days later, the same thing happened, but this time the site was able to keep serving visitors even while getting hammered. I had blocked the URL directly in HAProxy – which basically means returning a short error response, saving CPU cycles and bandwidth – and the server was able to keep up even as the incoming traffic saturated the 100 Mbit network connection.

Bandwidth graph during the days Webhook.site experienced very large amounts of traffic

After a few days, someone must have realised that they were bombarding Webhook.site with traffic and shut it off. So far, it hasn't happened again, and the site consumes its usual few megabits per second in traffic.

As of writing this article, Webhook.site now runs on a 4-core 8GB VPS and handles thousands of connections per second without breaking a sweat.

Daily and weekly unique users on Webhook.site from Google Analytics

Future plans

Thanks to the very generous supporters of the site on Patreon, I've been able to put some money toward upgrading the server to keep handling more traffic instead of just shutting the site down. That, along with the various optimizations, has kept the site online, helping tens of thousands of visitors a month while remaining fairly inexpensive to run as a hobby project.

It's also worth mentioning DigitalOcean again – here's my referral link where you can get $100 in credit – during all of this I've never heard a single complaint from them, even when the server was pushing 100 Mbit/s of traffic for days!

With that being said, it has become quite clear that lots of people simply forget they subscribed something to Webhook.site and, as a result, accidentally spam the service in a manner that is basically a DDoS. The longer Webhook.site keeps running, the more of these old, long-forgotten webhook subscriptions the server will keep receiving. My plan is to switch, at some point, to a wildcard CNAME record so that URLs take the format https://<UUID>.webhook.site. This will let me create an A record pointing an individual URL's hostname to 127.0.0.1 (redirecting the traffic back to the sender) on a case-by-case basis, somewhat sidestepping the issue.
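
In DNS zone-file terms, the idea looks roughly like this (the records and UUID are illustrative only):

; The wildcard sends every per-URL subdomain to the main server:
*.webhook.site.    IN CNAME webhook.site.

; A specific record overrides the wildcard for one forgotten URL,
; pointing its traffic back at the sender's own machine:
1f70b23f-9f17-4755-91f8-de2f41d2f3d0.webhook.site.    IN A 127.0.0.1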

Additionally, lots can be done for scalability on the infrastructure side: I've kept everything on a single, smaller server basically out of stubbornness and a desire to see how far I can push a single VPS. It would probably be more efficient to separate the services so that HAProxy, Redis, Nginx and Echo Server aren't competing for resources.

Finally, this has already taught me a lot, and I'm looking forward to seeing what else I can do to keep Webhook.site humming along if the visitor count keeps increasing.

Appendix: Resources for tweaking sysctl.conf