CAS 5.1.x Load Tests by Lafayette College


Contributed Content
Carl Waldbieser, an active member of the CAS community, was kind enough to share this analysis.

Lafayette College has an active user base of XXX and regularly records 78 CAS authentication events/minute on average, with peaks of 220 events/minute. In preparation for deploying CAS 5.1.x, locust.io was used to put CAS through load, soak, and stress tests. Results indicate that CAS 5.1.x, deployed on reasonable hardware in a multi-node architecture using Nginx+ and Hazelcast, sustains event rates well above those seen in production. The deployment architecture, testing scenarios, and results are detailed in the rest of this blog post.

In preparation for a service upgrade from CAS server version 5.0.x to version 5.1.x, load testing trials were conducted on the CAS stage environment. All trials were carried out against the same deployment architecture, with all nodes configured identically. The deployment architecture and nodes have not changed since the last load test was conducted around April 25, 2017.

Overview

The deployment architecture itself consists of 3 virtual machine nodes:

  • cas3.stage.lafayette.edu
  • cas4.stage.lafayette.edu
  • cas5.stage.lafayette.edu

[Figure: deployment architecture]

Each node has 3.7 GiB real memory available to it and 2 CPUs. The characteristics of the CPUs are as follows:

  • Architecture: x86_64
  • CPU op-mode(s): 32-bit, 64-bit
  • Byte Order: Little Endian
  • CPU(s): 2
  • On-line CPU(s) list: 0,1
  • Thread(s) per core: 1
  • Core(s) per socket: 2
  • Socket(s): 1
  • NUMA node(s): 1
  • Vendor ID: GenuineIntel
  • CPU family: 6
  • Model: 42
  • Model name: Intel Xeon E312xx (Sandy Bridge)
  • Stepping: 1
  • CPU MHz: 1899.999
  • BogoMIPS: 3799.99
  • Hypervisor vendor: KVM
  • Virtualization type: full
  • L1d cache: 32K
  • L1i cache: 32K
  • L2 cache: 4096K
  • NUMA node0 CPU(s): 0,1

The nodes are deployed behind an Nginx+ proxy in an active-active-active configuration. The nodes share ticket information using encrypted Hazelcast messages, so application state is shared across all three nodes.

The Test Swarm

The testing framework used was locust.io, a Python-based load testing framework. The test suite deploys a fixed number of “locusts” against a web site. The initial population ramps up at a configurable “hatch rate”. In the tests, locusts were conceptually divided into 3 “lifetime” categories:

  • Short-lived locusts live approximately 60 seconds.
  • Medium-lived locusts last for approximately 5 minutes.
  • Long-lived locusts exist for approximately 2 hours.

The category to which a given locust is assigned is determined randomly, with a short : medium : long ratio of 7:2:1. Ideally, 70% of the population is short-lived, 20% is medium-lived, and 10% is long-lived.

The lifetime of a locust determines how long it will retain and make use of a single web SSO session. Short-lived locusts discard their sessions quickly. Long-lived locusts hold on to them for considerable time. All locusts continually request and validate service tickets throughout their lives every 5-15 seconds.

Each locust has only a 25% chance of logging out when it dies. The CAS service must continue to track the TGTs of locusts that have not logged out until those tickets expire, so this behavior can put pressure on the memory storage resources of the nodes.

Each locust uses credentials taken randomly from one of 9 test accounts. Each locust has a 1% chance of entering an erroneous password for an account. Locusts that fail to authenticate will die immediately.

When a locust dies, it is reborn immediately. Its lifetime category remains the same, but its SSO session and all other random parameters are reset.
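The behavior rules above can be sketched in plain Python. This is only an illustration of the stated probabilities, not the actual locust.io test suite used in the trials; the account names, the `LocustLife` class, and the helper functions are hypothetical.

```python
import random
from dataclasses import dataclass

# Approximate lifetimes in seconds, taken from the category descriptions.
LIFETIMES = {"short": 60, "medium": 5 * 60, "long": 2 * 60 * 60}
# Short : medium : long ratio of 7:2:1.
CATEGORY_WEIGHTS = {"short": 7, "medium": 2, "long": 1}
# Hypothetical account names; the post only says there are 9 test accounts.
TEST_ACCOUNTS = [f"testuser{i}" for i in range(1, 10)]


@dataclass
class LocustLife:
    """Random parameters that are reset each time a locust is reborn."""
    category: str
    lifetime_s: int
    account: str
    bad_password: bool  # 1% chance; such a locust dies right after login fails
    logs_out: bool      # 25% chance of an explicit logout at death


def pick_category(rng):
    """Assign a lifetime category once, using the 7:2:1 weighting."""
    names = list(CATEGORY_WEIGHTS)
    weights = list(CATEGORY_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]


def new_life(category, rng):
    """Draw fresh per-life parameters; the category itself is kept on rebirth."""
    return LocustLife(
        category=category,
        lifetime_s=LIFETIMES[category],
        account=rng.choice(TEST_ACCOUNTS),
        bad_password=rng.random() < 0.01,
        logs_out=rng.random() < 0.25,
    )


def next_request_delay(rng):
    """Seconds to wait before requesting and validating the next service ticket."""
    return rng.uniform(5, 15)


if __name__ == "__main__":
    rng = random.Random()
    category = pick_category(rng)
    print(new_life(category, rng))
```

Keeping the category outside `new_life` mirrors the rule that a reborn locust keeps its lifetime category while everything else is redrawn.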

SSO Session Tracking

SSO sessions are tracked by the TGTs they produce. Any event that creates or destroys a TGT is logged, and these observations are plotted after the fact. Because only 25% of locusts explicitly end a session, many sessions accumulate and consume storage in the CAS ticket registry until they time out. The actual number of active sessions at any time is estimated from the proportions of long-, medium-, and short-lived locusts in the population. The charts produced should provide a reasonable estimate of how many simultaneous sessions the CAS service is managing at any given time.
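As a rough cross-check on that kind of estimate, the expected growth in abandoned sessions can be computed from the category proportions and lifetimes alone. This is a back-of-the-envelope sketch and not necessarily the calculation used to produce the charts: it assumes every locust life starts exactly one SSO session, and it ignores the 1% of failed logins, the ramp-up period, and session timeouts.

```python
# (population share, approximate session lifetime in minutes) per category
CATEGORIES = {
    "short": (0.7, 1.0),
    "medium": (0.2, 5.0),
    "long": (0.1, 120.0),
}
LOGOUT_PROBABILITY = 0.25  # only these sessions are destroyed explicitly


def sessions_started_per_minute(locusts):
    """Expected new SSO sessions per minute: each locust starts one per lifetime."""
    per_locust = sum(share / lifetime for share, lifetime in CATEGORIES.values())
    return locusts * per_locust


def net_session_growth_per_minute(locusts):
    """Expected sessions per minute left behind without a logout."""
    return sessions_started_per_minute(locusts) * (1 - LOGOUT_PROBABILITY)


if __name__ == "__main__":
    for swarm_size in (50, 150, 250):
        rate = net_session_growth_per_minute(swarm_size)
        print(f"{swarm_size} locusts: about {rate:.1f} net sessions/minute")
```

For 150 locusts this simplified model gives roughly 83 net sessions per minute, the same order of magnitude as the 73.5 sessions per minute measured in Trial 01.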

Trial 01
Date / duration: 2017-09-05, 09:30:00-04:00 - 16:44:00-04:00 (7h 14m)
Number of locusts: 150
Hatch rate: 10/s

image

The first trial produced authentication events at a rate of 1,800.11 events/minute. The majority of these were service ticket creation and validation events. The trial was concluded with no noticeable degradation in performance.

image

Net SSO sessions increased at a rate of 73.5 sessions per minute until the idle session timeout duration was reached.

Trial 02
Date / duration: 2017-09-20, 09:00:00-04:00 - 17:00:00-04:00 (8 hours)
Number of locusts: 50
Hatch rate: 10/s

image

The CAS service handled an average of 600.46 events per minute under load during this trial. There were no noticeable service disruptions.

image

Net SSO sessions increased at a rate of 27.4 sessions per minute, until the session idle timeout duration was reached.

Trial 03
Date / duration: 2017-09-22, 09:05:00-04:00 - 09:33:00-04:00 (28 minutes)
Number of locusts: 175
Hatch rate: 10/s

image

image

Net SSO sessions increased at a rate of 82.9 sessions per minute.

Trial 04
Date / duration: 2017-09-22, 11:49:00-04:00 - 12:30:00-04:00 (41 minutes)
Number of locusts: 200
Hatch rate: 10/s

image

image

Net SSO sessions increased at a rate of 93.0 sessions per minute.

Trial 05
Date / duration: 2017-09-22, 15:10:00-04:00 - 15:47:00-04:00 (37 minutes)
Number of locusts: 125
Hatch rate: 10/s

image

image

Net SSO sessions increased at a rate of 64.0 sessions per minute.

Trial 06
Date / duration: 2017-09-22, 16:35:00-04:00 - 16:50:00-04:00 (20 minutes)
Number of locusts: 250
Hatch rate: 10/s

image

image

Net SSO sessions increased at a rate of 124.6 sessions per minute.

Effect of Number of Locusts on Mean Rate of Events

Observations from the trials were plotted to give some sense of how the number of locusts in the test swarm influences the mean rate of events processed by the service each minute. The data suggest that each additional locust generates approximately 12 more events per minute.

image

Observed and Predicted Mean Rates

locusts   mean_rate   mean_rate_observed
0         1.09        N/A
25        300.51      N/A
50        599.93      600.46
75        899.35      N/A
100       1,198.76    N/A
125       1,498.18    1,496.84
150       1,797.60    1,800.11
175       2,097.02    2,095.36
200       2,396.43    2,395.12
225       2,695.85    N/A
250       2,995.27    2,996.53
275       3,294.69    N/A
300       3,594.10    N/A
325       3,893.52    N/A
350       4,192.94    N/A
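The predicted column is consistent with a straight-line fit through the six observed points. The sketch below uses ordinary least squares, which is an assumption: the post does not say how the predictions were produced. The function names are illustrative.

```python
# (number of locusts, observed mean events/minute) from the six trials
OBSERVED = [
    (50, 600.46),
    (125, 1496.84),
    (150, 1800.11),
    (175, 2095.36),
    (200, 2395.12),
    (250, 2996.53),
]


def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(locusts, slope, intercept):
    """Predicted mean events/minute for a swarm of the given size."""
    return intercept + slope * locusts


if __name__ == "__main__":
    slope, intercept = fit_line(OBSERVED)
    print(f"slope: {slope:.3f} events/minute per locust")
    print(f"intercept: {intercept:.2f} events/minute")
    for locusts in range(0, 351, 25):
        print(f"{locusts:>3} {predict(locusts, slope, intercept):>10,.2f}")
```

The fitted slope comes out just under 12 events per minute per locust, with an intercept close to 1, which agrees with the predicted values in the table to within rounding.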

Effect of Number of Locusts on Increase in SSO Sessions

The rate at which net new SSO sessions are created, from the beginning of a trial until the discarded TGTs begin to time out, is also useful. Since it appears to be a linear function of the number of locusts, this figure can be used to predict the number of SSO sessions that would be present if a trial ran until the session timeout mark.

image

Conclusions

Measurements taken from the production CAS service from September 1-22, 2017 during normal business hours (9am to 5pm) have the following characteristics (see footnote 1):

mean:               78 events / minute
median:             75 events / minute
mode:               59 events / minute
max:                220 events / minute
min:                8 events / minute
standard deviation: 28

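The data suggest that the production CAS service, which peaks at 220 events/minute, is operating well under the load sustained in testing (about 1,800 events/minute for over seven hours in Trial 01 with no noticeable degradation), and should have plenty of capacity to spare for temporary spikes in utilization.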
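The spare capacity can be put as a simple ratio. The sketch below compares the production figures with the rate sustained in Trial 01, the longest trial and one reported as finishing with no noticeable degradation. Using Trial 01 as the reference point is a choice made for this illustration; the trials did not establish an actual maximum.

```python
PRODUCTION_MEAN = 78.0   # events/minute, production business hours
PRODUCTION_MAX = 220.0   # events/minute, production peak
TRIAL_01_RATE = 1800.11  # events/minute, sustained for 7h 14m with 150 locusts


def headroom(tested_rate, production_rate):
    """How many times the production rate the tested rate represents."""
    return tested_rate / production_rate


if __name__ == "__main__":
    print(f"vs. production mean: {headroom(TRIAL_01_RATE, PRODUCTION_MEAN):.1f}x")
    print(f"vs. production peak: {headroom(TRIAL_01_RATE, PRODUCTION_MAX):.1f}x")
```

By this measure the stage environment sustained roughly 8 times the production peak and about 23 times the production mean.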

Carl Waldbieser

1 Splunk query for Sep 1-21, 2017:

index=auth_cas (sourcetype=cas OR sourcetype=cas5) action=* date_hour >= 9 date_hour <= 16 date_wday!="saturday" date_wday!="sunday" | bin _time span=1m | stats count by _time | stats min(count) max(count) mean(count) mode(count) median(count) stdev(count)
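For readers without Splunk, the same per-minute summary can be approximated in Python. This is a sketch only: it assumes the event timestamps have already been extracted into `datetime` objects, and the function names are hypothetical. Like the query, it only counts minutes in which at least one event occurred.

```python
import statistics
from collections import Counter
from datetime import datetime


def in_business_hours(ts):
    """Weekdays, hours 9 through 16 inclusive (9am up to 5pm), as in the query."""
    return ts.weekday() < 5 and 9 <= ts.hour <= 16


def per_minute_stats(timestamps):
    """Bin events into 1-minute buckets and summarize the per-minute counts."""
    counts = Counter(
        ts.replace(second=0, microsecond=0)
        for ts in timestamps
        if in_business_hours(ts)
    )
    values = list(counts.values())
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "mode": statistics.mode(values),
        "median": statistics.median(values),
        # Sample standard deviation needs at least two buckets.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


if __name__ == "__main__":
    sample = [
        datetime(2017, 9, 5, 10, 0, 5),
        datetime(2017, 9, 5, 10, 0, 40),
        datetime(2017, 9, 5, 10, 1, 15),
    ]
    print(per_minute_stats(sample))
```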
