CAS 5.1.x Load Tests by Lafayette College


Contributed Content
Carl Waldbieser, an active member of the CAS community, was kind enough to share this analysis.

Lafayette College has an active user base of XXX and regularly records an average of 78 CAS authentication events/minute, with peaks of 220 events/minute. In preparation for deploying CAS 5.1.x, locust.io was used to put CAS through load, soak, and stress tests. Results indicate that CAS 5.1.x, deployed on reasonable hardware in a multi-node architecture using Nginx+ and Hazelcast, handles loads well above production levels with no noticeable degradation in performance. The deployment architecture, testing scenarios, and results are detailed in the rest of this blog post.

In preparation for a service upgrade from CAS server version 5.0.x to version 5.1.x, load testing trials were conducted on the CAS stage environment. All trials were carried out against the same deployment architecture, with all nodes configured identically. The deployment architecture and nodes have not changed since the last load test was conducted around April 25, 2017.

Overview

The deployment architecture itself consists of 3 virtual machine nodes:

  • cas3.stage.lafayette.edu
  • cas4.stage.lafayette.edu
  • cas5.stage.lafayette.edu

[Figure: deployment architecture]

Each node has 3.7 GiB real memory available to it and 2 CPUs. The characteristics of the CPUs are as follows:

  • Architecture: x86_64
  • CPU op-mode(s): 32-bit, 64-bit
  • Byte Order: Little Endian
  • CPU(s): 2
  • On-line CPU(s) list: 0,1
  • Thread(s) per core: 1
  • Core(s) per socket: 2
  • Socket(s): 1
  • NUMA node(s): 1
  • Vendor ID: GenuineIntel
  • CPU family: 6
  • Model: 42
  • Model name: Intel Xeon E312xx (Sandy Bridge)
  • Stepping: 1
  • CPU MHz: 1899.999
  • BogoMIPS: 3799.99
  • Hypervisor vendor: KVM
  • Virtualization type: full
  • L1d cache: 32K
  • L1i cache: 32K
  • L2 cache: 4096K
  • NUMA node0 CPU(s): 0,1

The nodes are deployed behind an Nginx+ proxy in an active-active-active configuration. The nodes share ticket information using encrypted Hazelcast messages, so application state is shared across all three nodes.

The Test Swarm

The testing framework used was locust.io, a Python-based load testing framework. The test suite deploys a fixed number of “locusts” against a web site. The initial population ramps up at a configurable “hatch rate”. In the tests, locusts were conceptually divided into 3 “lifetime” categories:

  • Short-lived locusts live approximately 60 seconds.
  • Medium-lived locusts last for approximately 5 minutes.
  • Long-lived locusts exist for approximately 2 hours.

The category to which a given locust is assigned is determined randomly, with a short : medium : long ratio of 7:2:1. Ideally, 70% of the population is short-lived, 20% is medium-lived, and 10% is long-lived.

The lifetime of a locust determines how long it will retain and make use of a single web SSO session. Short-lived locusts discard their sessions quickly. Long-lived locusts hold on to them for a considerable time. Throughout their lives, all locusts request and validate a service ticket every 5-15 seconds.
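As a rough sanity check, the request interval above can be turned into an expected event rate per locust. This is only a back-of-the-envelope sketch: it assumes the wait is uniformly distributed between 5 and 15 seconds and that each cycle produces exactly two events (one service ticket created, one validated), neither of which is stated explicitly in the test description.

```python
# Back-of-the-envelope estimate of events generated by one locust per minute.
# Assumptions (not stated in the post): waits are uniformly distributed over
# 5-15 seconds, and each cycle yields two events (one service ticket created,
# one service ticket validated).
MIN_WAIT_S = 5
MAX_WAIT_S = 15
EVENTS_PER_CYCLE = 2

mean_wait_s = (MIN_WAIT_S + MAX_WAIT_S) / 2    # 10 seconds between cycles
cycles_per_minute = 60 / mean_wait_s           # 6 cycles per minute
events_per_locust_per_minute = cycles_per_minute * EVENTS_PER_CYCLE

print(events_per_locust_per_minute)  # 12.0
```

Under those assumptions the estimate is in line with the figure of approximately 12 events per minute per locust reported later in this post.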

Each locust has only a 25% chance of logging out when it dies. The CAS service must continue to track the TGTs of locusts that have not logged out until those tickets expire, so this behavior can put pressure on the memory storage resources of the nodes.

Each locust uses credentials taken randomly from one of 9 test accounts. Each locust has a 1% chance of entering an erroneous password for an account. Locusts that fail to authenticate will die immediately.

When a locust dies, it is reborn immediately. Its lifetime category remains the same, but its SSO session and all other random parameters are reset.
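The swarm rules described in this section can be sketched in plain Python. This is not the actual locust.io test suite, only an illustration of the random choices involved. The account names, the function names, and the choice to treat a failed login as a zero-length life that never logs out are all assumptions made for the example.

```python
import random

# Lifetime categories from the post: name -> approximate lifetime in seconds.
LIFETIMES_S = {"short": 60, "medium": 5 * 60, "long": 2 * 60 * 60}
CATEGORY_WEIGHTS = {"short": 7, "medium": 2, "long": 1}  # 7:2:1 ratio
BAD_PASSWORD_CHANCE = 0.01  # 1% of logins use a wrong password
LOGOUT_CHANCE = 0.25        # 25% of locusts log out when they die
TEST_ACCOUNTS = [f"testuser{i}" for i in range(1, 10)]  # 9 hypothetical names


def pick_category(rng):
    """Assign a lifetime category once; it is kept across rebirths."""
    names = list(CATEGORY_WEIGHTS)
    weights = [CATEGORY_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def plan_life(category, rng):
    """Draw the random parameters for one life of a locust."""
    bad_password = rng.random() < BAD_PASSWORD_CHANCE
    return {
        "account": rng.choice(TEST_ACCOUNTS),
        "bad_password": bad_password,
        # A locust that fails to authenticate dies immediately.
        "lifetime_s": 0 if bad_password else LIFETIMES_S[category],
        # Assumption: a locust that never logged in has nothing to log out of.
        "logs_out": (not bad_password) and rng.random() < LOGOUT_CHANCE,
    }


rng = random.Random(2017)
categories = [pick_category(rng) for _ in range(10_000)]
share_short = categories.count("short") / len(categories)
print(f"short-lived share: {share_short:.2f}")
```

On rebirth, only `plan_life` would be called again, since the lifetime category stays fixed while the session and other random parameters are reset.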

SSO Session Tracking

SSO sessions are tracked by the TGTs they produce. Any event that creates or destroys a TGT is logged, and these observations are plotted after the fact. Because only 25% of locusts will explicitly end a session, many sessions will accumulate and consume storage in the CAS ticket registry until they time out. Using the proportions of long-, medium-, and short-lived locusts in the population, the actual number of active sessions at any time is estimated. The charts produced should provide a reasonable estimate of how many simultaneous sessions are being managed by the CAS service at any given time.
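The basic bookkeeping behind such a chart is a running sum over creation and destruction events. The sketch below uses a small hand-made event list with hypothetical labels ("created", "destroyed"); the real log format and the estimation step based on lifetime categories are not shown in this post.

```python
from itertools import accumulate

# Hypothetical, hand-made event log: (minute, event) pairs in log order.
events = [
    (0, "created"), (0, "created"), (1, "created"),
    (1, "destroyed"), (2, "created"), (3, "destroyed"),
]


def net_sessions(events):
    """Running count of live TGTs after each event, in log order."""
    deltas = [1 if kind == "created" else -1 for _, kind in events]
    return list(accumulate(deltas))


print(net_sessions(events))  # [1, 2, 3, 2, 3, 2]
```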

Trial 01
Date / duration: 2017-09-05, 09:30:00-04:00 - 16:44:00-04:00 (7h 14m)
Number of locusts: 150
Hatch rate: 10/s

image

The first trial produced authentication events at a rate of 1,800.11 events/minute. The majority of these were service ticket creation and validation events. The trial was concluded with no noticeable degradation in performance.

image

Net SSO sessions increased at a rate of 73.5 sessions per minute until the idle session timeout duration was reached.

Trial 02
Date / duration: 2017-09-20, 09:00:00-04:00 - 17:00:00-04:00 (8 hours)
Number of locusts: 50
Hatch rate: 10/s

image

An average of 600.46 events per minute were handled by the CAS service under load during this trial. There were no noticeable service disruptions.

image

Net SSO sessions increased at a rate of 27.4 sessions per minute, until the session idle timeout duration was reached.

Trial 03
Date / duration: 2017-09-22, 09:05:00-04:00 - 09:33:00-04:00 (28 minutes)
Number of locusts: 175
Hatch rate: 10/s

image

image

Net SSO sessions increased at a rate of 82.9 sessions per minute.

Trial 04
Date / duration: 2017-09-22, 11:49:00-04:00 - 12:30:00-04:00 (41 minutes)
Number of locusts: 200
Hatch rate: 10/s

image

image

Net SSO sessions increased at a rate of 93.0 sessions per minute.

Trial 05
Date / duration: 2017-09-22, 15:10:00-04:00 - 15:47:00-04:00 (37 minutes)
Number of locusts: 125
Hatch rate: 10/s

image

image

Net SSO sessions increased at a rate of 64.0 sessions per minute.

Trial 06
Date / duration: 2017-09-22, 16:35:00-04:00 - 16:50:00-04:00 (20 minutes)
Number of locusts: 250
Hatch rate: 10/s

image

image

Net SSO sessions increased at a rate of 124.6 sessions per minute.

Effect of Number of Locusts on Mean Rate of Events

Observations from the previous trial and the current trial were plotted to show how the number of locusts in the test swarm influences the mean rate of events processed by the service each minute. The data suggest that each additional locust generates approximately 12 more events per minute.

image

Observed and Predicted Mean Rates

All rates are in events/minute.

locusts    mean_rate (predicted)    mean_rate_observed
      0                     1.09    N/A
     25                   300.51    N/A
     50                   599.93    600.46
     75                   899.35    N/A
    100                 1,198.76    N/A
    125                 1,498.18    1,496.84
    150                 1,797.60    1,800.11
    175                 2,097.02    2,095.36
    200                 2,396.43    2,395.12
    225                 2,695.85    N/A
    250                 2,995.27    2,996.53
    275                 3,294.69    N/A
    300                 3,594.10    N/A
    325                 3,893.52    N/A
    350                 4,192.94    N/A
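The post does not say how the predicted column was produced. One plausible reading is an ordinary least-squares line through the six observed points, and the sketch below tries exactly that. It lands close to the predicted values in the table above (small differences are expected because the observed rates are rounded), but it should be read as a reconstruction under that assumption, not as the author's actual method.

```python
# Observed mean rates (events/minute) by swarm size, from the table above.
observed = {50: 600.46, 125: 1496.84, 150: 1800.11,
            175: 2095.36, 200: 2395.12, 250: 2996.53}


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    xs, ys = list(points), list(points.values())
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(observed)
print(f"{slope:.2f} events/minute per locust, intercept {intercept:.2f}")

# Compare a few fitted values with the predicted column.
for locusts in (100, 300):
    print(locusts, round(slope * locusts + intercept, 2))
```

The fitted slope comes out at roughly 12 events per minute per locust, which agrees with the estimate given in the text.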

Effect of Number of Locusts on Increase in SSO Sessions

The rate at which net new SSO sessions are created, measured from the beginning of a trial until the discarded TGTs begin to time out, is also useful. Since it appears to be a linear function of the number of locusts, this figure can be used to predict the number of SSO sessions that would be present if a trial were to reach the session timeout mark.

image
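To make the prediction concrete, the growth rates reported for Trials 01-06 can be fitted with a straight line and multiplied by the time until the first timeouts. This is a minimal sketch: the least-squares fit is an assumption about how the line was drawn, `sessions_at_timeout` is a hypothetical helper, and the 120-minute timeout is an illustrative value only, since the actual idle timeout of the stage environment is not given in this post.

```python
# Net SSO session growth (sessions/minute) by swarm size, Trials 01-06.
growth = {50: 27.4, 125: 64.0, 150: 73.5, 175: 82.9, 200: 93.0, 250: 124.6}

# Least-squares line through the six points (iterating a dict yields keys).
n = len(growth)
mean_x = sum(growth) / n
mean_y = sum(growth.values()) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in growth.items())
         / sum((x - mean_x) ** 2 for x in growth))
intercept = mean_y - slope * mean_x


def sessions_at_timeout(locusts, timeout_minutes):
    """Sessions accumulated if growth stays linear until the first timeouts."""
    return (slope * locusts + intercept) * timeout_minutes


# 120 minutes is an illustrative timeout, not Lafayette's actual setting.
print(round(sessions_at_timeout(150, timeout_minutes=120)))
```

The estimate only covers the ramp-up period. Once discarded TGTs start expiring, the session count should level off rather than keep growing.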

Conclusions

Measurements [1] taken from the production CAS service from September 1-22, 2017 during normal business hours (9am to 5pm) have the following characteristics:

mean: 78 events / minute
median: 75 events / minute
mode: 59 events / minute
max: 220 events / minute
min: 8 events / minute
standard deviation: 28 events / minute

The data suggest that the production CAS service is operating well under the maximum sustainable load, and should have plenty of capacity to spare for temporary spikes in utilization.
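A quick calculation puts numbers on that headroom. The figures come from the tables in this post. The highest tested rate is used as a stand-in for capacity, which is an assumption: it is the highest rate exercised in these trials, not a measured breaking point, and Trial 06 was a short run.

```python
# Production figures (events/minute) from the measurements above.
PRODUCTION_MEAN = 78
PRODUCTION_PEAK = 220

# Highest mean rate exercised in the trials (Trial 06, 250 locusts).
HIGHEST_TESTED = 2996.53

# Approximate events/minute contributed by each locust, from the fitted slope.
EVENTS_PER_LOCUST = 12

headroom_over_peak = HIGHEST_TESTED / PRODUCTION_PEAK  # about 13.6x
headroom_over_mean = HIGHEST_TESTED / PRODUCTION_MEAN  # about 38.4x
peak_in_locusts = PRODUCTION_PEAK / EVENTS_PER_LOCUST  # about 18.3 locusts

print(f"tested rate vs production peak: {headroom_over_peak:.1f}x")
print(f"tested rate vs production mean: {headroom_over_mean:.1f}x")
print(f"production peak is equivalent to about {peak_in_locusts:.1f} locusts")
```

In other words, the busiest production minute on record corresponds to fewer than 20 locusts, while even the smallest trial ran with 50.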

Carl Waldbieser

[1] Splunk query for Sep 1-21, 2017:

index=auth_cas (sourcetype=cas OR sourcetype=cas5) action=* date_hour >= 9 date_hour <= 16 date_wday!="saturday" date_wday!="sunday" | bin _time span=1m | stats count by _time | stats min(count) max(count)  mean(count) mode(count) median(count) stdev(count)
