I kept running into a problem where the CPU on our nginx webapp servers was not being fully utilized: requests timed out whenever CPU usage went above about 10%, and memory was hardly being used. I had tried changing the nginx configuration in the past with no success, so things were getting out of hand. When our traffic spiked yesterday morning because of the Google Cloud Developers Conference, where they are using Tint, we went down, and I had to increase our server count to 20 servers with 8GB of RAM and 8 vCPUs each.
Twenty servers to handle 20 RPS seemed ridiculous to me, since nginx can handle thousands of RPS on a tuned machine. So I spent a couple of hours yesterday working out a guess-and-check process for testing the effect of each server configuration change, in order to find out what was causing the issue.
- Isolate a single production server by removing it from all load balancers.
- Set up a Blitz.io account and validate the isolated server using the methods outlined within Blitz.
- Load test the server to see its performance.
- Shell into the server and change the server configuration. I was experimenting with /etc/nginx/nginx.conf and /etc/php5/fpm/pool.d/www.conf (don’t forget to restart the server afterwards).
- Load test the server while running ‘top’ and see if the performance changed.
Repeating those five steps finally let me find a combination of settings that allowed nginx and PHP to make better use of the CPU.
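The load-test step above can be roughly approximated with a small script if you want a quick check between configuration changes. This is a minimal sketch, not a replacement for Blitz.io and not the tool used above: the function names (`fetch`, `load_test`), the request counts, and the URL in the usage note are all assumptions made for illustration. It uses only the Python standard library.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=5.0):
    """Request `url` once and return (ok, seconds_elapsed)."""
    start = time.perf_counter()
    try:
        with urlopen(url, timeout=timeout) as resp:
            resp.read()
            ok = 200 <= resp.status < 400
    except Exception:
        # Timeouts, refused connections and HTTP errors all count as failures.
        ok = False
    return ok, time.perf_counter() - start


def load_test(url, requests=200, concurrency=20, fetch_fn=fetch):
    """Send `requests` GETs using `concurrency` threads and summarize the run."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(fetch_fn, [url] * requests))
    wall = time.perf_counter() - started

    # Only successful requests contribute to the latency figures.
    latencies = sorted(sec for ok, sec in results if ok)
    return {
        "requests": requests,
        "errors": sum(1 for ok, _ in results if not ok),
        "rps": requests / wall if wall > 0 else 0.0,
        "median_s": latencies[len(latencies) // 2] if latencies else None,
        "max_s": latencies[-1] if latencies else None,
    }
```

Calling something like `load_test("http://your-isolated-server/", requests=200, concurrency=20)` (a placeholder URL) before and after each configuration change, with `top` running on the server, gives a comparable set of numbers for each attempt.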
Server Configuration Changes
In the PHP-FPM pool configuration (/etc/php5/fpm/pool.d/www.conf), I changed pm.max_children = 5 to pm.max_children = 375.
See the links below for more details on what these settings mean.
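For anyone picking their own value for pm.max_children, a common rule of thumb is to divide the memory you can spare for PHP by the average memory one PHP-FPM worker uses. The sketch below is only that rule of thumb, not how the 375 above was chosen. The function names, the 1024 MB reserve, and the 30 MB per-worker figure are made-up assumptions, so measure your own workers before trusting any output.

```python
def estimate_max_children(total_ram_mb, reserved_mb, avg_child_mb):
    """Rule-of-thumb worker count: usable RAM divided by RAM per worker."""
    if avg_child_mb <= 0:
        raise ValueError("avg_child_mb must be positive")
    usable_mb = total_ram_mb - reserved_mb
    return max(1, int(usable_mb // avg_child_mb))


def implied_child_budget_mb(total_ram_mb, reserved_mb, max_children):
    """Inverse check: how much RAM each worker may use for a given setting."""
    if max_children <= 0:
        raise ValueError("max_children must be positive")
    return (total_ram_mb - reserved_mb) / max_children


# Hypothetical inputs: 8GB server, 1GB kept for the OS and nginx, 30MB per worker.
print(estimate_max_children(8192, 1024, 30))

# With nothing reserved, 375 workers on 8GB leaves roughly this many MB each.
print(round(implied_child_budget_mb(8192, 0, 375), 1))
```

The second function is a useful sanity check in the other direction: if your workers use more memory than the budget it reports, the server can start swapping under full load.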
- A single server with the new configuration can handle all of our traffic (~1600 concurrent users in the Google Analytics realtime overview). The CPU of that single server was at ~40%.
- With 6 of these servers behind a load balancer, we could handle an average of 53 RPS while keeping response time under 1s. Our RPS is usually around 5-15.
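To put those numbers in perspective, here is the arithmetic as a small sketch. The function names are made up for illustration, and 53 RPS is the rate measured with response times under 1s, not necessarily a hard ceiling.

```python
def per_server_rps(total_rps, servers):
    """Average request rate each server handled."""
    return total_rps / servers


def headroom(capacity_rps, typical_rps):
    """How many times the typical load fits into the measured capacity."""
    return capacity_rps / typical_rps


# 53 RPS spread over 6 servers.
print(round(per_server_rps(53, 6), 1))

# Headroom against the busy (15 RPS) and quiet (5 RPS) ends of normal traffic.
print(round(headroom(53, 15), 1))
print(round(headroom(53, 5), 1))
```

In other words, the 6-server setup handled roughly 3.5 times our usual peak load.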