Hassy Veldstra (@hhv1) was talking about Anatomy of a million user realtime Node.js application at Scotland.js, these are my notes from his talk.
App is heyso.im - it's a chat service. "Like chat roulette with music and without the dicks." Chances are that someone else in the world is listening to the same song as you - runs in Spotify.
Spotify has very few apps so new apps receive a lot of attention. How much? [Ted Dziuba "I'm going to scale my foot up your ass"] Picked the target value of 1 million and used Node as it's "web scale".
Build something simple then load test it until it breaks, fix and repeat until it gets to 1 million users. First draft was about 30k users. Realistic load testing is hard, difficult to simulate 1 million real users.
What do we mean by 1 million? Peak or sustained? Chose peak so this works out to 5,555 users per second for 6 minutes. Load testing cluster was 34 machines split between EC2 and Digital Ocean simulating realistic user sessions.
Used web sockets (sockjs) to communicate with the browser. Sockets should run over port 443 to get round proxies and firewalls, even if it's not SSL.
Why not use Jabber? It's way to complex, has too many features I didn't need.
Number of app servers that talk to a matchmaker server which sits on top of a Redis pubsub server. 15 app servers handled 1 million clients ~70k clients per app server. 500k chats at any one time => 2^19 << 2^32 (Redis limit). 8GB RAM 4 core app servers. 16GB RAM 8 core Redis server. Load balancing done in the client with DNS in dev, production will use HAPROXY. Cost about £920.
Need to tweak the server settings to cope with this many connections - ports and open files. Can tune TCP stack and custom compile items.
Used TSUNG for load testing. Distributed, protocol independent, realistic load. Limited to ~64k connections on a single IP. Define load testing job in XML: send requests based on files, code or previous requests - pause, loop, branch, phases, arrival rates.
Tested for: length for session, number of searches, messages per chat, message length, good/bad clients (disconnections), distribution of tracks (90% searching for 10% of tracks).
Built in monitoring, reporting, open source, mature (2001). Can be run as a distributed cluster - controller takes care of distributing the load between test runners to match load profile defined.
Uptime Genius - let me break your site before Reddit does.