[reSIProcate] Notes from discussion around proxies
I think all the stuff below is out of scope of what we want to accomplish in
the first version, but it is relevant to the long-term design. This is a
summary of the notes I have from discussions with several people (Jason, Dan
P, Rohan, Adam, Robert and others) about how to do it. This is just to save
some of the ideas - it's a very rough cut.
To define some terminology:
The domain example.com has many users with AORs like fluffy@xxxxxxxxxxx
They operate servers at multiple sites. Devices at the same site are assumed
to be in the same data center, with fast communications between them but no
geographic redundancy, while different sites are assumed to have somewhat
more limited, higher-latency communications between them.
The users are organized into partitions so that a user only exists in one
partition and a registrar only deals with some set of partitions.
The registrar servers are organized into groups where all the servers in a
single group share the registration data for some set of partitions.
The proxy function may be separated from the registrar/redirect server
function.
Basic deployment:
There is a single proxy that does everything.
Simple deployment:
There are two to ten proxies that do everything and share registrations;
DNS for example.com points at them.
Advanced deployment:
There are many proxies that do everything, but the users are arranged into
several partitions. Each partition is handled by one group, which has three
servers in it.
Complex deployment:
The users are arranged into several partitions. Each partition is handled by
two groups of servers. The first group is at a primary site while the second
is at a secondary site. The servers at the secondary site will also be the
primary group for some other partition. There are edge proxies that are only
transaction stateful and that do authentication and forward to the correct
group. UAs form two connections to their primary group and two to their
backup group.
Design Issues: How to put users in partitions
Proposal 1) A double hash function. You hash once and go to that group. If
the user is not there, you hash again with a second function and go there (a
sketch follows after these proposals). By changing the modulus of the hash
function, one can rebalance the users when the number of partitions is
changed.
Proposal 2) A provisioned static lookup for every user. Edge proxies know
about the new and old group for each user; try the new one and, if that
fails, try the old one.
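Here is a minimal C++ sketch of the Proposal 1 lookup (the groupHasUser()
check and the salted second hash are hypothetical stand-ins, not anything in
resip):

#include <functional>
#include <string>

// Stub: in a real system this would ask the group whether it holds the AOR.
static bool groupHasUser(unsigned /*group*/, const std::string& /*aor*/)
{
   return false;
}

// Two different hash functions over the same modulus. Changing the modulus
// is what rebalances users when the number of partitions changes.
static unsigned hash1(const std::string& aor, unsigned modulus)
{
   return static_cast<unsigned>(std::hash<std::string>{}(aor) % modulus);
}

static unsigned hash2(const std::string& aor, unsigned modulus)
{
   // Second hash illustrated by hashing a salted copy of the AOR.
   return static_cast<unsigned>(std::hash<std::string>{}(aor + "#2") % modulus);
}

unsigned findGroup(const std::string& aor, unsigned modulus)
{
   unsigned g = hash1(aor, modulus);
   if (groupHasUser(g, aor))
   {
      return g;
   }
   return hash2(aor, modulus); // user was not in the first group; try again
}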
Design Issues:
Replication and storage of registration data in the registrar/redirect
server.
Proposal 1) Each server in the group has a TCP connection to the other
servers. When a server gets a new registration, it writes the key binding
information across to each of the other servers. On conflicting data, the
most recent data wins (sketched below, after Proposal 2). All data has a
timestamp based on the time of the server that got the data, and servers are
loosely time synchronized. When a server restarts, before it becomes active,
it first downloads the complete registration DB from all the other servers
it connects to. If A sends data to B, and B decides its data is more recent,
it should promptly send its data back to A.
Proposal 2) Like Proposal 1, but do some sort of delta change storage on
both sides so that when a server starts up, it can decide whether to get a
full snapshot or just the deltas.
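A hedged sketch of the "most recent wins" rule from Proposal 1 (the Binding
and PeerLink types are illustrative placeholders, not resip API):

#include <cstdint>
#include <map>
#include <string>

// One replicated binding: the contact plus a timestamp taken on the server
// that originally accepted the registration (servers are loosely time
// synchronized).
struct Binding
{
   std::string contact;
   std::uint64_t timestamp;
};

// Stand-in for the TCP connection back to the peer that sent us the data.
struct PeerLink
{
   void send(const std::string& /*aor*/, const Binding& /*b*/) {}
};

// Apply a binding received from peer A: most recent data wins. If our copy
// is newer, promptly send it back so the peer converges as well.
void applyReplicated(std::map<std::string, Binding>& db,
                     const std::string& aor,
                     const Binding& incoming,
                     PeerLink& peerA)
{
   auto it = db.find(aor);
   if (it == db.end() || incoming.timestamp >= it->second.timestamp)
   {
      db[aor] = incoming;          // remote copy is newer, or we had none
   }
   else
   {
      peerA.send(aor, it->second); // our copy is newer; echo it back
   }
}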
Design Issues: Storage of registration data
Proposal: Store in memory - use some sort of tree to index.
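For example (purely illustrative, not the actual resip data structures), the
tree index could simply be an ordered map keyed by AOR:

#include <cstdint>
#include <map>
#include <string>
#include <vector>

// In-memory registration DB indexed by a balanced tree (std::map), so
// lookup and insert are O(log n) in the number of AORs. Each AOR maps to
// its current set of contact bindings.
struct ContactBinding
{
   std::string contact;
   std::uint64_t expires; // registration expiry, seconds
};

using RegistrationDb = std::map<std::string, std::vector<ContactBinding>>;

// Usage sketch:
//   RegistrationDb db;
//   db["sip:fluffy@example.com"].push_back({"sip:fluffy@192.0.2.1", 3600});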
Design Issues: Storage of user AOR & password
Would like to eventually be able to integrate with AAA systems. Assume the
data is in a DB; the default DB is an embedded in-memory DB unless a
connection has been configured to an external DB system. Assume all users
are loaded on startup. The DB could have triggers for updates, or we could
assume a cache timeout and reload users that are accessed after that.
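A small sketch of the cache-timeout option (the UserRecord fields and the
reload hook are assumptions for illustration; the external DB/AAA query is
not shown):

#include <chrono>
#include <map>
#include <string>

// Users are held in memory; a record older than the cache timeout is
// reloaded from the external DB the next time it is accessed. New entries
// start at the clock epoch, so they also get loaded on first access.
struct UserRecord
{
   std::string passwordHash;
   std::chrono::steady_clock::time_point loadedAt;
};

class UserCache
{
 public:
   explicit UserCache(std::chrono::seconds timeout) : mTimeout(timeout) {}

   const UserRecord& lookup(const std::string& aor)
   {
      UserRecord& rec = mUsers[aor];
      const auto now = std::chrono::steady_clock::now();
      if (rec.loadedAt + mTimeout < now)
      {
         rec = reloadFromDb(aor); // stand-in for the external DB query
         rec.loadedAt = now;
      }
      return rec;
   }

 private:
   UserRecord reloadFromDb(const std::string&) { return UserRecord{}; }

   std::map<std::string, UserRecord> mUsers;
   std::chrono::seconds mTimeout;
};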
Design Issues: Routing logic
can route on domainIsMe, method, and event type
can route to the correct group based on the user's partition
can route based on the longest match of the left-most characters; these
routes can include a rewrite of the RURI (see the sketch after this list)
the data structures should work for 200,000 routes
targets or routes can have multiple entries with priorities and weights for
statistical distribution
routes can have time-of-day qualifiers
The registrar/redirect server can be configured to redirect or to proxy.
Proxies can be configured so that when they receive a 3xx, they either
recursively process it or forward it upstream.
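A sketch of the longest-match routing above (the Route fields are
placeholders; priorities, weights, and time-of-day qualifiers are left out
to keep it short):

#include <cstddef>
#include <map>
#include <string>

// One route entry: where to send the request, plus an optional rewrite of
// the request-URI (empty means no rewrite).
struct Route
{
   std::string target;
   std::string rewrittenRuri;
};

// Longest match on the left-most characters of the key (e.g. the user part
// of the RURI). A std::map comfortably holds a few hundred thousand routes;
// each lookup below costs O(len(key) * log(number of routes)).
const Route* matchRoute(const std::map<std::string, Route>& routes,
                        const std::string& key)
{
   for (std::size_t len = key.size(); len > 0; --len)
   {
      auto it = routes.find(key.substr(0, len));
      if (it != routes.end())
      {
         return &it->second; // longest matching prefix wins
      }
   }
   return nullptr; // no route matched
}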
Design Issues: (big one) Edge-based or center-based presence
In an edge-based proposal, the center could send neutral state if there were
no registered UAs, but it would rely on the edge for the bulk of the
presence work.
Some other assumptions on the scale of a large system
100 M users, each with two connections
2000 proxies
6 hour re-registration
5 BHCA (busy hour call attempts) per user
3 Presence updates per hour
20 buddies per user
0.5 registrations per hour per user due to going offline and back online
want hardware cost to be less than 1 cent per user
designed for hardware estimated to exist in 2008
In an avalanche restart, UAs register with their primary group within 10
minutes and with their secondary group within 6 hours.
Some rough budgets for various parts of the system (a back-of-the-envelope
check follows the budgets):
Memory
2.4G Map of AOR to group (assuming not hash based)
0.1G AOR -> contact (assume 1k per registration)
1.0G DTLS - assume 10k/connection
0.5G Presence Sub - assume 0.5k/subscription, non-edge-based
0.1G Presence Data - 2k/user
CPU
100 Reg/Sec - avalanche restart
8 Reg/Sec - normal registration
100 call/sec
20 reg/sec - users coming online
1500 notify/sec
Bandwidth
the NOTIFYs are the main contributor, at about 24 Mbps
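A back-of-the-envelope check of the budgets above, under my assumption that
the CPU and per-connection memory figures are per proxy and that the 100M
users are spread evenly over the 2000 proxies (the mapping of each computed
value to a budget line is a guess):

#include <cstdio>

int main()
{
   const double users = 100e6;
   const double proxies = 2000;
   const double perProxy = users / proxies; // 50,000 users per proxy

   // Registrations: an avalanche restart re-registers everyone with the
   // primary within 10 minutes; steady state is one re-REGISTER per 6
   // hours plus 0.5 registrations/hour of offline churn per user.
   std::printf("avalanche reg/sec: %.0f\n", perProxy / 600);           // ~83
   std::printf("steady reg/sec:    %.1f\n",
               perProxy / (6 * 3600) + perProxy * 0.5 / 3600);         // ~9

   // Calls: 5 busy hour call attempts per user.
   std::printf("calls/sec:         %.0f\n", perProxy * 5 / 3600);      // ~69

   // Presence: 3 updates/hour, each fanned out to 20 buddies.
   std::printf("notify/sec:        %.0f\n", perProxy * 3 * 20 / 3600); // ~833

   // Bandwidth: at roughly 2k per NOTIFY, 1500/sec is about 24 Mbit/s.
   std::printf("notify Mbit/s:     %.0f\n", 1500 * 2e3 * 8 / 1e6);     // 24

   // Memory, per proxy: 100k bindings/connections (2 per user).
   std::printf("AOR->contact GB:   %.1f\n", perProxy * 2 * 1e3 / 1e9);    // 0.1
   std::printf("DTLS GB:           %.1f\n", perProxy * 2 * 10e3 / 1e9);   // 1.0
   std::printf("presence sub GB:   %.1f\n", perProxy * 20 * 0.5e3 / 1e9); // 0.5
   return 0;
}

The 2.4G AOR-to-group map is the one global item; it works out to roughly 24
bytes per user across the full 100M users.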
Rules of thumb:
We would like a single server to do 1000 transactions per second.
We would like a single server to deal with 100k users with no presence.
Center-based presence probably generates 10-100 times the load of basic
calling alone.