Currently, I spend my days working at Wooga. In the project I am part of, we are dealing with millions of daily users hitting a Rails-based application backed by a super-optimized MySQL instance and a master-slave redis setup. This last item is what this post is about.
‘One shall reconnect after forking’
This becomes very obvious once you read about it, but it wasn’t obvious to us (@diegoeche and myself) until the issue decided to block a recent release of a, let’s say, “long-awaited feature” of our project.
In an initializer we define a couple of global variables to access the different redis DBs. For example we can access a hash like:
$redis.hget 'this_hash', 'field'
In this particular case, one would expect the hash get operation to return nil. However, after some debugging we encountered the terror, the panic, the horror of … A LIST!
Since unicorn is a process-based app server, unless you reconnect after forking, the unicorn master and all of its workers (forks of the master) share the exact same file descriptor (socket) to communicate with redis. A fork is basically a clone of the parent process, so a reply meant for one worker can be read by another. Under high traffic, this concurrency issue occurs way too often.
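To make the shared descriptor concrete, here is a minimal sketch that uses only the Ruby standard library. A UNIXSocket pair stands in for the connection to redis; the names app_side and redis_side and the fake request and reply strings are made up for illustration, and no real redis is involved. It assumes a platform where fork is available.

```ruby
require 'socket'

# A connected socket pair stands in for the app <-> redis connection
# (hypothetical names; no real redis involved).
app_side, redis_side = UNIXSocket.pair

# The "worker" is forked AFTER the connection was opened, so it inherits
# a copy of the very same descriptor.
worker_pid = fork do
  redis_side.close
  app_side.write("HGET this_hash field\n") # the worker sends a request...
  exit!(0)                                  # ...and never reads the reply
end
Process.wait(worker_pid)

# The fake server answers the request it received.
request = redis_side.gets
redis_side.write("reply for worker #{worker_pid}\n")

# The master never sent anything, yet the reply lands in its hands.
stolen_reply = app_side.gets
puts "master (pid #{Process.pid}) read: #{stolen_reply}"
```

With several workers reading and writing on the same socket at once, which process ends up with which reply depends purely on timing, and that is how a hash lookup can come back looking like something else entirely.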
[redis-rb] recommends reconnecting after forking by calling:
after_fork do
  Redis.current.client.quit
end
Redis.current is suitable when you use a single connection and always reference it through Redis.current. Otherwise, the call will create a new connection (kept in an instance variable), connect to redis, and disconnect from it right afterwards. In our case, we forced every global variable to reconnect by disconnecting each of them, something like:
after_fork do
  redis_reconnect_after_initialize!
end

def redis_reconnect_after_initialize!
  $redis.client.quit!
  $redis_connection2.client.quit!
  ...
end
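The effect of reconnecting in the worker can be sketched without redis or unicorn. In this hedged example a UNIXServer on a temporary path plays the role of redis, and reconnect_after_fork! is a hypothetical helper standing in for the body of the after_fork hook; neither is part of our actual setup.

```ruby
require 'socket'
require 'tmpdir'

# Stand-in for redis: a UNIX socket server on a temporary path.
dir = Dir.mktmpdir
path = File.join(dir, 'fake-redis.sock')
server = UNIXServer.new(path)

# Connection opened in the master before forking (like in an initializer).
$conn = UNIXSocket.new(path)
master_side = server.accept

# Hypothetical helper playing the role of the after_fork hook body.
def reconnect_after_fork!(path)
  $conn.close                  # drop the descriptor inherited from the master
  $conn = UNIXSocket.new(path) # open a connection owned by this process only
end

worker_pid = fork do
  reconnect_after_fork!(path)
  $conn.write("hello from worker\n")
  exit!(0)
end
Process.wait(worker_pid)

# The worker's message arrives on a brand-new connection...
worker_side = server.accept
message = worker_side.gets
# ...while nothing shows up on the master's original connection.
master_idle = IO.select([master_side], nil, nil, 0.1).nil?
puts "worker said: #{message.chomp}, master connection idle: #{master_idle}"

# Clean up the sockets and the temporary directory.
[worker_side, master_side, $conn, server].each(&:close)
File.unlink(path)
Dir.rmdir(dir)
```

Note the cost that bit us later: every worker now holds its own connection, so the server has to be able to accept one descriptor per worker and per redis variable.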
As a side note: in the beginning we relied on the Ruby GC by simply reassigning the global redis variables. But the GC took way too long to clean up the orphaned objects, and we ended up with a lot more file descriptors (FDs) pointing to redis than we wanted. To count the number of open connections (FDs) we used the not-so-loved ObjectSpace module. Each of these connections had to be intentionally disconnected and reconnected.
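Counting live connections with ObjectSpace can look roughly like the sketch below. It counts open BasicSocket objects from the standard library so that it runs without the redis gem; in our app the class passed to each_object would have been the redis one instead, which is an assumption about your setup. The helper name open_socket_count is made up for this example.

```ruby
require 'socket'

# Count the socket objects in this process that are still open.
def open_socket_count
  ObjectSpace.each_object(BasicSocket).count { |socket| !socket.closed? }
end

before = open_socket_count
a, b = UNIXSocket.pair         # two new open sockets
during = open_socket_count
a.close                        # closing explicitly releases the descriptors
b.close                        # right away, without waiting for the GC
after = open_socket_count

puts "before: #{before}, during: #{during}, after: #{after}"
```

Had we only set a and b to nil instead of closing them, the count would have stayed up until the GC eventually got around to finalizing the objects.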
After the first release of the patch and a restart of every app server, we had more issues than ever before. Every worker on our physical app servers was trying to connect to redis, and an increasing number of timeouts started to happen.
After a good Stack Overflow search and a 15-minute backend discussion in Spanish, we came up with our first hypothesis.
The number of max open files for the redis-server process was reached.
It took us a while to realize it, until:
$ ls /proc/`pidof redis-server`/fd | wc -l
1024
$ ulimit -n
1024
Bingo! So there it was: we had increased the number of redis clients by a factor of X, so no new connection could succeed, since every available file descriptor was already in use.
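The same check can be scripted. The following is a Linux-only sketch that reads /proc for a given pid and compares the number of open descriptors with the "Max open files" limit; fd_usage is a hypothetical helper name, and inspecting another user's process (such as redis-server) generally requires running as that user or as root.

```ruby
# Linux-only sketch: relies on the /proc filesystem.
def fd_usage(pid = Process.pid)
  # One entry per open descriptor. When inspecting our own process this
  # includes the descriptor used to read the directory itself.
  open_fds = Dir.children("/proc/#{pid}/fd").size

  # Line format: "Max open files  <soft>  <hard>  files"
  limits_line = File.readlines("/proc/#{pid}/limits")
                    .find { |line| line.start_with?('Max open files') }
  soft, hard = limits_line.split[3, 2].map do |value|
    value == 'unlimited' ? Float::INFINITY : Integer(value)
  end

  { open: open_fds, soft_limit: soft, hard_limit: hard }
end

usage = fd_usage
puts "open: #{usage[:open]}, soft limit: #{usage[:soft_limit]}, hard limit: #{usage[:hard_limit]}"
```

Once the open count equals the soft limit, the process cannot accept any further connections, which matches the timeouts we were seeing.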
We could see that even trying a
How to increase the number of files opened for a daemon process?
There are many ways to set the maximum number of open files for a process. We could have used limits.conf or ../../ to set the limits on a per-user basis and applied them to our ‘redis’ user. However, for us the answer was prlimit.
The prlimit legendary awesomeness lies on the fact one can change a process limits ON THE FLY as long as the kernel you use supports it. – @diegoeche –
We emerged ‘util-linux’ on our Gentoo-based redis server and were finally able to change --nofile to something more 21st century than 1024.
- Concurrency issues and high loads are best friends.
- When using unicorn or any other app server that forks to run multiple processes, be careful: forks are process clones.
- Servers used as databases, or expected to handle many incoming connections, should be tuned accordingly, e.g.:
prlimit --pid `pidof redis-server` --nofile=10000
- Don’t rely on the GC for cleaning up your mess. It might work in Java, not so much in Ruby.