Contributed Content
Carl Waldbieser (<waldbiec [at] lafayette.edu>), an active member of the CAS community, was kind enough to share this analysis.

# Overview

Load testing trials were conducted on the CAS stage environment in order to provide insight into what kind of sustained load the production CAS service will be able to carry. All trials were carried out against the same deployment architecture, with all nodes configured identically.

# Deployment Architecture

The deployment architecture itself consists of 3 virtual machine nodes. Each node has 3.7 GiB real memory available to it and 2 CPUs.

The characteristics of the CPUs are as follows:

• Architecture: x86_64
• CPU op-mode(s): 32-bit, 64-bit
• Byte Order: Little Endian
• CPU(s): 2
• On-line CPU(s) list: 0,1
• Core(s) per socket: 2
• Socket(s): 1
• NUMA node(s): 1
• Vendor ID: GenuineIntel
• CPU family: 6
• Model: 42
• Model name: Intel Xeon E312xx (Sandy Bridge)
• Stepping: 1
• CPU MHz: 1899.999
• BogoMIPS: 3799.99
• Hypervisor vendor: KVM
• Virtualization type: full
• L1d cache: 32K
• L1i cache: 32K
• L2 cache: 4096K
• NUMA node0 CPU(s): 0,1

The nodes are deployed behind an Nginx+ proxy in an active-active-active configuration. The nodes share ticket information using encrypted Hazelcast messages, a feature built into the CAS software, so any application state is shared.

# The Test Swarm

The testing framework used was locust.io, a Python based load testing framework. The test suite deploys a fixed number of “locusts” against a web site. To lean more about locust, please see this guide.

The initial population ramps up with a configurable “hatch rate”. In the tests, locusts were conceptually divided into 3 “lifetime” categories:

• Short-lived locusts live approximately 60 seconds.
• Medium-lived locusts last for approximately 5 minutes.
• Long-lived locusts exist for approximately 2 hours.

The category to which a given locust is assigned is randomly determined with a ratio of short : medium : long being 7:2:1. Ideally, 70% of the population is short-lived, 20% is medium lived, and 10% is long-lived.

The lifetime of a locust determines how long it will retain and make use of a single web SSO session. Short-lived locusts discard their sessions quickly. Long-lived locusts hold on to them for considerable time.

All locusts are only 25% likely to log out upon their deaths. The CAS service must continue to track TGTs of locusts that have not logged out until the ticket expires, so this behavior can put pressure on the memory storage resources of the nodes.

Each locust uses credentials taken randomly from one of 9 test accounts. Each locust has a 1% chance of entering an erroneous password for an account. Locusts that fail to authenticate will die immediately.

When a locust dies, it is reborn immediately. Its lifetime category remains the same, but its SSO session and all other random parameters are reset.

# The Trials

## Trial 1

# of Locusts Hatch Rate
300 10/s

The first trial produced authentication events at a rate of over 3,500 event/minute. The majority of these were service ticket creation and validation events. By 13:40 (~5 hours into the trial), degraded performance became noticeable. By 15:50 (~7 hours in), the nodes were swamped. The trial was discontinued shortly thereafter.

## Trial 2

# of Locusts Hatch Rate
300 10/s

The 2nd trial was similar to the first, but the number of locusts was briefly increased to 500 for a 3 minute duration.

The characteristics of trial 2 are very similar to those of trial 1. After responding to ~3,500 / minute for close to 6 hours, the nodes were overwhelmed.

## Trial 3

# of Locusts Hatch Rate
150 10/s

The 3rd trial used half the number of locusts used in the first 2 trials. The sustained event rate was ~1,700 authentication events per minute. Unlike the previous 2 trials, this trial was concluded prior to the nodes becoming overwhelmed. It is unknown whether the nodes could continue to sustain responding to events at this rate indefinitely.

## Trial 4

# of Locusts Hatch Rate
50 10/s

The 4th trial showed the nodes were capable of sustaining a ~600 authentication events/minute rate for a full 24 hours. Because the maximum lifetime of a TGT is 8 hours, there is some reason to believe this rate could have been sustained indefinitely.

## Trial 5

# of Locusts Hatch Rate
25 10/s

This trial was useful in establishing the mean rate of 291 events per minute given a swarm of 25 locusts. This test and test 6 are notably shorter in duration than the other tests, as it is assumed at this point the nodes can sustain the loads indefinitely.

## Trial 6

# of Locusts Hatch Rate
10 10/s

The mean rate of authentication events for this trial is 118 events per minute.

# Conclusions

Measurements taken from the production CAS service from April 17-21, 2017 during normal business hours (9am to 5pm) have the following characteristics:

Mean Median Mode Max Min Std Dev
149 events/minute 139 events/minute 126 events/minute 494 events/minute 2 events/minute 68

While there appears to be some “burstiness” in the rate of authentication events, all 3 types of averages are well below the the expected threshold which the trials suggest are indefinitely sustainable. Even the maximum rate of 494 events per minute is well below the sustained rate of trial 4 (~ 600 events / minute).

The data suggests that the production CAS service is operating well under the maximum sustainable load, and should have plenty of capacity to spare for temporary spikes in utilization.