Carl Waldbieser, an active member of the CAS community, was kind enough to share this analysis.
Lafayette College has an active user base of XXX and regularly records an average of 78 CAS authentication events/minute, with peaks of 220 events/minute. In preparation for deploying CAS 5.1.x, locust.io was used to put CAS through load, soak, and stress tests. Results indicate that CAS 5.1.x, deployed on reasonable hardware in a multi-node architecture using Nginx+ and Hazelcast, sustained event rates well above those observed in production with no noticeable degradation. The deployment architecture, testing scenarios, and results are detailed in the rest of this blog post.
In preparation for a service upgrade from CAS server version 5.0.x to version 5.1.x, load testing trials were conducted on the CAS stage environment. All trials were carried out against the same deployment architecture, with all nodes configured identically. The deployment architecture and nodes have not changed since the last load test was conducted around April 25, 2017.
The deployment architecture itself consists of 3 virtual machine nodes. Each node has 3.7 GiB of real memory available to it and 2 CPUs. The characteristics of the CPUs are as follows:
- Architecture: x86_64
- CPU op-mode(s): 32-bit, 64-bit
- Byte Order: Little Endian
- CPU(s): 2
- On-line CPU(s) list: 0,1
- Thread(s) per core: 1
- Core(s) per socket: 2
- Socket(s): 1
- NUMA node(s): 1
- Vendor ID: GenuineIntel
- CPU family: 6
- Model: 42
- Model name: Intel Xeon E312xx (Sandy Bridge)
- Stepping: 1
- CPU MHz: 1899.999
- BogoMIPS: 3799.99
- Hypervisor vendor: KVM
- Virtualization type: full
- L1d cache: 32K
- L1i cache: 32K
- L2 cache: 4096K
- NUMA node0 CPU(s): 0,1
The nodes are deployed behind an Nginx+ proxy in an active-active-active configuration. The nodes share ticket information using encrypted Hazelcast messages, so application state is shared across all nodes.
The Test Swarm
The testing framework used was locust.io, a Python-based load testing framework. The test suite deploys a fixed number of “locusts” against a web site. The initial population ramps up at a configurable “hatch rate”. In the tests, locusts were conceptually divided into 3 “lifetime” categories:
- Short-lived locusts live approximately 60 seconds.
- Medium-lived locusts last for approximately 5 minutes.
- Long-lived locusts exist for approximately 2 hours.
The category to which a given locust is assigned is randomly determined, with a short : medium : long ratio of 7:2:1. Ideally, 70% of the population is short-lived, 20% is medium-lived, and 10% is long-lived.
The lifetime of a locust determines how long it will retain and make use of a single web SSO session. Short-lived locusts discard their sessions quickly. Long-lived locusts hold on to them for a considerable time. All locusts request and validate service tickets every 5-15 seconds throughout their lives.
Each locust has only a 25% chance of logging out upon its death. The CAS service must continue to track the TGTs of locusts that have not logged out until the tickets expire, so this behavior can put pressure on the memory storage resources of the nodes.
Each locust uses credentials taken randomly from one of 9 test accounts. Each locust has a 1% chance of entering an erroneous password for an account. Locusts that fail to authenticate will die immediately.
When a locust dies, it is reborn immediately. Its lifetime category remains the same, but its SSO session and all other random parameters are reset.
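The random parameters described above can be sketched in plain Python. This is only an illustration of the stated probabilities, not the actual test suite: it does not use the locust.io API, and the account names and function names are hypothetical.

```python
import random

# Approximate lifetimes in seconds, taken from the category descriptions.
LIFETIMES = {"short": 60, "medium": 5 * 60, "long": 2 * 60 * 60}
# Ratio of short : medium : long locusts (7:2:1).
WEIGHTS = {"short": 7, "medium": 2, "long": 1}
# Hypothetical names standing in for the 9 test accounts.
TEST_ACCOUNTS = [f"testuser{i}" for i in range(1, 10)]


def pick_category(rng):
    """Assign a lifetime category once; it is kept across rebirths."""
    names = list(WEIGHTS)
    return rng.choices(names, weights=[WEIGHTS[n] for n in names], k=1)[0]


def new_life(category, rng):
    """Draw the parameters that are reset each time a locust is reborn."""
    return {
        "category": category,
        "lifetime_s": LIFETIMES[category],
        "account": rng.choice(TEST_ACCOUNTS),
        # 1% chance of an erroneous password; such a locust dies immediately.
        "bad_password": rng.random() < 0.01,
        # 25% chance of logging out (destroying the TGT) at death.
        "logs_out_at_death": rng.random() < 0.25,
    }


def request_interval(rng):
    """Seconds to wait between service ticket request/validate cycles."""
    return rng.uniform(5, 15)
```

Passing in a seeded `random.Random` instance keeps a simulated swarm reproducible, which is convenient when checking that the observed mix of categories matches the 7:2:1 ratio.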
SSO Session Tracking
SSO sessions are tracked by the TGTs they produce. Any event that creates or destroys a TGT is logged, and these observations are plotted after the fact. Because only 25% of locusts will explicitly end a session, many sessions will accumulate and consume storage in the CAS ticket registry until they time out. Using the proportions of long-, medium-, and short-lived locusts in the population, the actual number of active sessions at any time is estimated. The charts produced should provide a reasonable estimate of how many simultaneous sessions are being managed by the CAS service at any given time.
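As a rough illustration of how logged TGT events can be turned into a net session count, the sketch below keeps a running total of creations minus destructions per minute. The event format `(minute, kind)` and the function name are assumptions made for this example; the real log format is not shown in this post, and the sketch does not model sessions that silently expire.

```python
from collections import Counter
from itertools import accumulate


def net_sessions_per_minute(events):
    """Return [(minute, net_sessions)] from (minute, kind) events.

    kind is "created" or "destroyed" (a hypothetical encoding of the
    logged TGT events).
    """
    delta = Counter()
    for minute, kind in events:
        if kind == "created":
            delta[minute] += 1
        elif kind == "destroyed":
            delta[minute] -= 1
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    minutes = sorted(delta)
    # Running total of creations minus destructions, in time order.
    totals = accumulate(delta[m] for m in minutes)
    return list(zip(minutes, totals))
```

For example, two TGTs created in minute 0, one destroyed in minute 1, and one created in minute 2 yields `[(0, 2), (1, 1), (2, 2)]`.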
|Date / duration||2017-09-05 from 09:30:00-04:00 until 16:44:00-04:00 (7h 14m)|
|Number of locusts||150|
The first trial produced authentication events at a rate of 1,800.11 events/minute. The majority of these were service ticket creation and validation events. The trial was concluded with no noticeable degradation in performance.
Net SSO sessions increased at a rate of 73.5 sessions per minute until the idle session timeout duration was reached.
|Date / duration||2017-09-20, 09:00:00-04:00 - 17:00:00-04:00 (8 hours)|
|Number of locusts||50|
An average of 600.46 events per minute were handled by the CAS service under load during this trial. There were no noticeable service disruptions.
Net SSO sessions increased at a rate of 27.4 sessions per minute, until the session idle timeout duration was reached.
|Date / duration||2017-09-22, 09:05:00-04:00 - 09:33:00-04:00 (28 minutes)|
|Number of locusts||175|
Net SSO sessions increased at a rate of 82.9 sessions per minute.
|Date / duration||2017-09-22, 11:49:00-04:00 - 12:30:00-04:00 (41 minutes)|
|Number of locusts||200|
Net SSO sessions increased at a rate of 93.0 sessions per minute.
|Date / duration||2017-09-22, 15:10:00-04:00 - 15:47:00-04:00 (37 minutes)|
|Number of locusts||125|
Net SSO sessions increased at a rate of 64.0 sessions per minute.
|Date / duration||2017-09-22, 16:35:00-04:00 - 16:50:00-04:00 (20 minutes)|
|Number of locusts||250|
Net SSO sessions increased at a rate of 124.6 sessions per minute.
Effect of Number of Locusts on Mean Rate of Events
Observations from the previous trial and the current trial were plotted in order to give some sense of the influence the number of locusts in the test swarm would have on the mean rate of events processed by the service each minute. The data suggest that for each additional locust added, there are approximately 12 more events generated per minute.
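The figure of roughly 12 events per minute per locust can be checked against the two mean rates reported for the 50-locust and 150-locust trials. The sketch below treats both figures as events per minute and fits a straight line through the two points; the variable and function names are illustrative.

```python
# (number of locusts, mean events per minute) from the two trials
# for which a mean event rate is reported.
observations = [(50, 600.46), (150, 1800.11)]

(x0, y0), (x1, y1) = observations
# Slope of the line through the two points: extra events/minute per locust.
events_per_locust = (y1 - y0) / (x1 - x0)
intercept = y0 - events_per_locust * x0


def predicted_rate(locusts):
    """Predicted mean events per minute for a swarm of the given size."""
    return events_per_locust * locusts + intercept


print(f"{events_per_locust:.2f} events/minute per locust")
```

With only two observations the line passes through both points exactly, so this confirms the slope but says nothing about how well the linear model holds at other swarm sizes.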
Observed and Predicted Mean Rates
Effect of Number of Locusts on Increase in SSO Sessions
The rate at which net new SSO sessions are created during the period from the beginning of a trial until the discarded TGTs begin to time out is also useful. Since it appears to be a linear function of the number of locusts, this figure can be used to predict the number of SSO sessions that would be present if a trial were to reach the session timeout mark.
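A minimal least-squares sketch of that linear relationship, using the session growth rates reported for the six trials, is shown below. The `timeout_minutes` parameter is a placeholder: the actual session timeout used in the trials is not stated in this post, so any value passed in is an assumption.

```python
# (number of locusts, net new SSO sessions per minute) from the six trials.
trials = [(50, 27.4), (125, 64.0), (150, 73.5),
          (175, 82.9), (200, 93.0), (250, 124.6)]


def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(trials)


def sessions_at_timeout(locusts, timeout_minutes):
    """Estimated sessions accumulated by the time TGTs start timing out.

    timeout_minutes is an assumed input; it is not given in the post.
    """
    return (slope * locusts + intercept) * timeout_minutes


print(f"about {slope:.2f} net new sessions/minute per locust")
```

The estimate only applies up to the timeout mark; after that, expiring tickets offset new ones and the session count should level off.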
Measurements taken from the production CAS service from September 1-22, 2017 during normal business hours (9am to 5pm) have the following characteristics (see footnote 1):
|mean||78 events / minute|
|median||75 events / minute|
|mode||59 events / minute|
|max||220 events / minute|
|min||8 events / minute|
The data suggests that the production CAS service is operating well under the maximum sustainable load, and should have plenty of capacity to spare for temporary spikes in utilization.
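To put the headroom in numbers, the short calculation below compares the production figures with the mean rate of 1,800.11 events/minute from the 150-locust trial, which completed with no noticeable degradation. That trial rate is used here as a stand-in for a known-sustainable load; it is not a measured maximum.

```python
# Production figures (events/minute) from the table above.
production = {"mean": 78, "max": 220}
# Mean rate sustained in the 150-locust trial (events/minute).
trial_rate = 1800.11

headroom_vs_peak = trial_rate / production["max"]
headroom_vs_mean = trial_rate / production["mean"]

print(f"{headroom_vs_peak:.1f}x the production peak")
print(f"{headroom_vs_mean:.1f}x the production mean")
```

By this measure the tested load was roughly 8 times the observed production peak and roughly 23 times the production mean.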
1 Splunk query for Sep 1-21, 2017:
index=auth_cas (sourcetype=cas OR sourcetype=cas5) action=* date_hour >= 9 date_hour <= 16 date_wday!="saturday" date_wday!="sunday" | bin _time span=1m | stats count by _time | stats min(count) max(count) mean(count) mode(count) median(count) stdev(count)