Performance, Scalability and Availability in IT

Modeling Case Study

Posted by J on March 30, 2009

One of our customers is an early-stage company that allows teachers and students to create collaborative projects on-line. When they approached us, they were concerned that they would soon run out of capacity. At 600 users, they were already seeing the system under load.

In the proposal stage, we constructed a simple queuing model (3 nodes: CPU, Disks, Users). For the modeling cognoscenti, it was a closed multi-class model implemented in our modeling spreadsheet (not available on-line). The model suggested that at 600 users, the system had already reached its peak throughput and that it was bottlenecked on the disks. Projections showed that response times would become unacceptable at 800 users.
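For readers who want to see how a closed model of this shape behaves, here is a minimal sketch using exact Mean Value Analysis. It is a single-class simplification, not the multi-class spreadsheet model described above, and the service demands and think time are made-up numbers chosen only so that the disk saturates in the neighborhood of 600 users. The function name `mva` and all parameter values are assumptions for illustration.

```python
def mva(demands, think_time, n_users):
    """Exact single-class Mean Value Analysis for a closed queuing network.

    demands:    service demand per queuing node, in seconds
                (visit ratio * service time for that node).
    think_time: average user think time in seconds (the "Users" delay node).
    Returns (throughput in requests/s, response time in s) at n_users.
    """
    queue = [0.0] * len(demands)          # mean queue length at each node
    throughput, response = 0.0, 0.0
    for n in range(1, n_users + 1):
        # An arriving job sees the queue left by a network with n - 1 jobs.
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        response = sum(residence)
        throughput = n / (think_time + response)
        queue = [throughput * r for r in residence]   # Little's law per node
    return throughput, response


# Hypothetical inputs, NOT measurements from the customer system.
DEMANDS = {"cpu": 0.010, "disk": 0.050}   # seconds per request
THINK_TIME = 30.0                          # seconds between requests

for users in (200, 600, 800, 1200):
    x, r = mva(list(DEMANDS.values()), THINK_TIME, users)
    print(f"{users:5d} users: {x:6.2f} req/s, {r:7.2f} s response")
```

With these assumed inputs, throughput can never exceed 1 / 0.050 = 20 requests per second, the limit set by the disk. Once the population passes that knee, adding users only adds response time, which is the pattern the case study describes.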

Our proposal was to reconfigure the software to use Amazon’s Simple Storage Service (S3) instead of local disks. This would have benefits to their business case aside from performance (this is not the place for Amazon evangelism). Indeed, the revised model showed acceptable performance until about 2,400 users. The before-and-after curves are shown in the adjoining graph.

This would still be far from their goal of sizing to 60,000 users. To achieve that objective, they would need to use multiple load-balanced web servers (or another higher-powered web server).

Epilog: the last time I spoke with the founder, the growth projections had been revised due to the economy and he was reassessing the business case for his web site. Alas.


Posted in Case Studies

Capabilities of Queuing Models

Posted by J on March 30, 2009

This post is intended to be a companion to our queuing software. Only M/M/m queues are available for free on-line. All other cases are available as Excel-based implementations to our clients.

M/M/m queues

Short for Markovian arrivals, Markovian (exponential) service times, and m resources. For each resource, we need to specify its service time and visit ratio:

  • Service time: the time a particular job spends at that resource.
  • Visit ratio: the number of times a job visits that resource. For a 1-cpu-2-disks system, a database query will have a visit ratio of 1.0 for the CPU. For each of the disks, it depends. If the data is distributed equally among the disks, the visit ratios will be 0.5 each (or less, if the data is sometimes found in a cache).
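To make the two inputs concrete, here is a small sketch of the textbook M/M/m calculation for a single queue. In standard Kendall notation, M/M/m means Poisson arrivals, exponential service times and m identical parallel servers, and that is the reading assumed here; it is not a description of how our own calculator is implemented. The function names and the sample rates are hypothetical. The arrival rate at a resource is taken to be the system request rate multiplied by that resource's visit ratio.

```python
from math import factorial


def erlang_c(arrival_rate, service_time, servers):
    """Probability that an arriving job has to wait in an M/M/m queue."""
    offered = arrival_rate * service_time       # offered load in Erlangs
    rho = offered / servers                     # utilization per server
    if rho >= 1.0:
        raise ValueError("queue is unstable: utilization must be below 1")
    top = offered ** servers / factorial(servers) / (1.0 - rho)
    bottom = sum(offered ** k / factorial(k) for k in range(servers)) + top
    return top / bottom


def mmm_response_time(arrival_rate, service_time, servers):
    """Mean time at the resource per visit: queuing delay plus service."""
    p_wait = erlang_c(arrival_rate, service_time, servers)
    wait = p_wait * service_time / (servers - arrival_rate * service_time)
    return wait + service_time


# Hypothetical numbers: 10 queries/s, each disk has a visit ratio of 0.5
# and a 40 ms service time, so each disk sees 5 visits/s.
query_rate, visit_ratio, disk_service = 10.0, 0.5, 0.040
per_visit = mmm_response_time(query_rate * visit_ratio, disk_service, servers=1)
print(f"disk time per visit: {per_visit * 1000:.1f} ms")
print(f"disk time per query: {visit_ratio * per_visit * 1000:.1f} ms")
```

Multiplying the per-visit time by the visit ratio gives the resource's contribution to a query's total response time, which is why both numbers have to be specified for every resource.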

Multi-class queues

An extension of the M/M/m model where not all jobs are the same; jobs are grouped into classes. The service times and visit ratios must be specified for each class. There are a few different cases:

  • Different scheduling disciplines to resolve which class gets a resource (first-come-first-served, processor-sharing, constant-delay <aka infinite-server>, priority).
  • Open vs. closed classes, i.e., whether the number of customers of each class is variable (open) or fixed (closed).

Finally, the algorithms to take load-dependent servers into account — situations where the amount of load on a server changes the service time — are also available in spreadsheet form.

Posted in Introduction to our Software

Trends in Load and Performance Testing

Posted by J on February 26, 2009

Everyone realizes the value of load and performance testing. What most people find difficult to understand is why it is so expensive, why it takes so long and why it is such a black art. Exciting new trends in this space promise a way out of the malaise; more on that below, but first a little elaboration on the limits of load and performance testing.

  1. We want the system under test to be as close to the production setup as possible.  The more different it is, the more factors you have to correct for, and the less confidence you will have in the results.  A system sized the same as production (and populated with a similar amount of data) is often a budget-buster, especially because it is used for a small part of the system life cycle and is idle for the remainder of the time.
  2. If a system sized similarly to production is not used, or not populated similar to production, interpreting the results by extrapolating them from measurements is itself subject to a degree of guesswork. If you are using the results to convey bad news, you have some convincing to do. If you are using the results to convey good news, you still have some convincing to do.
  3. The stress testing tools are expensive because they are designed to cover a range of technologies, from Windows applications to batch scripts to web applications.
  4. Since scripts need to be written in a specialized language, writing scripts for stress testing is a specialized skill.  Writing such scripts before the application is available is even more of a challenge.
  5. It is common practice to use scripts running within the company’s infrastructure to test web applications.  This makes testing more repeatable but it ignores the vicissitudes of connecting through the internet to get to the company’s servers.

There are a few encouraging trends on the horizon.   Read the rest of this entry »

Posted in Performance, Scalability and Availability

Availability Paper and Software

Posted by J on November 19, 2008

Managing hundreds of business applications in an enterprise with a quantitative view to their availability is a challenge that requires a combination of quantitative methods and capturing knowledge from across the organization. A unified methodology for modeling the availability of applications in an enterprise is proposed. The proposal provides technology management of the enterprise with tools to continuously manage the evolution of their technology infrastructure. See the full paper for details. Read the rest of this entry »

Posted in Introduction to our Software

Performance and Scalability Survey results

Posted by J on October 23, 2008

The results of the survey are available at the Quantitecture Web Site.

Comments on this blog are closed but you may contact us through the company web site.

Thank you.

Posted in Performance, Scalability and Availability

The Accuracy of Performance Data

Posted by J on September 21, 2008

To what extent must measurements or modeling results hew to the “real world”? We present a continuum with 4 different options, each with its own pros and cons, and end with a recommendation. Read the rest of this entry »

Posted in Performance, Scalability and Availability

Performance, Scalability and Availability

Posted by J on September 10, 2008

Performance is a metric of the system speed your clients experience. When the user makes a request, how quickly does the system come back?

What is an acceptable level of performance? It depends. It has been shown that any UI response that takes longer than 0.25 seconds interrupts the user’s thought process. Web sites don’t typically perform that fast (they can’t, because of all the network delays), so our users have to put up with a level of slowness anyway.

After that very stringent threshold, it’s an expectations game. How complex does the user think the task is? If they think it is simple, they will be less tolerant of delays. If they think it is complex, more tolerant.

Around the 30-second mark, for web applications, you start to run into timeout values of the various components between the browser and the servers. If the operation will take longer than that, the UI should release the user from having to wait for the browser to refresh.



Scalability is a metric of how many clients you can service. The economics of the business are strongly influenced by scalability.

Scalability of a system is constrained by system bottlenecks. Every component of the system has a maximum throughput rate that it cannot exceed. In a typical system, one component is operating at its maximum throughput rate and all others have a bit of slack.

If one increases the capacity of the bottleneck component, that component may no longer remain the bottleneck; the bottleneck will shift to another component.
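The shifting bottleneck is easy to see with a few lines of arithmetic. The sketch below assumes each component can be summarized by a single maximum throughput figure; the component names and rates are hypothetical, and `bottleneck` is just an illustrative helper.

```python
def bottleneck(capacities):
    """Return (name, rate) of the component with the lowest maximum throughput.

    The system as a whole cannot process requests faster than this component.
    """
    name = min(capacities, key=capacities.get)
    return name, capacities[name]


# Hypothetical maximum throughput per component, in requests per second.
capacities = {"web": 120.0, "app": 90.0, "database": 40.0}
print("before upgrade:", bottleneck(capacities))

# Triple the capacity of the current bottleneck.
capacities["database"] *= 3
print("after upgrade: ", bottleneck(capacities))
```

In this made-up example, tripling the database capacity does not triple system capacity. The limit moves from the database to the application tier, so the gain is capped by the next-slowest component.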

Availability is best defined by visualizing its absence. When your system isn’t available, your business is stalled.


There are two ways to increase the availability of a service: 1. reduce the down time of each component and 2. adopt a design with redundant components, so that individual components may fail while the overall system continues to function.

Dollar for dollar, systems are more robust with the second strategy.
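The arithmetic behind the two strategies can be sketched as follows. It assumes that component failures are independent, which real systems only approximate, and it says nothing about cost; the 99% and 99.5% figures are hypothetical.

```python
from math import prod


def series_availability(availabilities):
    """All components must be up: multiply the individual availabilities."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """At least one of `copies` identical, independent components must be up."""
    return 1.0 - (1.0 - availability) ** copies


single = 0.99  # hypothetical availability of one component
print(f"one component:          {single:.4%}")
print(f"strategy 1, less down time (assumed 99.5%): {0.995:.4%}")
print(f"strategy 2, two redundant copies:          "
      f"{redundant_availability(single, 2):.4%}")
print(f"two components in series: {series_availability([single, single]):.4%}")
```

Under the independence assumption, a second copy of a 99% component takes the pair to 99.99%, while chaining two such components in series drops the pair to about 98%.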

Copyright © 2008, J Singh

Posted in Terminology

Intersection of Programs and Data

Posted by J on August 22, 2008

What makes Performance Engineering so interesting and also fraught with danger? It’s the sheer complexity and the continued need to take all factors into account.

Professor Spielman of Yale was recently awarded the 2008 Gödel Prize. The award was for his 2004 paper (I read a 2008 update) on the performance of the Simplex algorithm. The algorithm has been around since 1947 and most people who went through an Engineering education in the last 50 years learned it. The theoretical analysis suggests that in the worst case, it’s pretty bad but in practice it seems to work pretty well. The reason it works so well, it turns out, is that the typical data it gets subjected to makes it perform reasonably well. Alternative algorithms which have better worst-case performance don’t perform as well in practice.

The lesson from this post, and the paper, is this: You can’t study the performance of systems in a vacuum — you have to make assumptions about the data. Many a performance analysis has been invalidated by the fact that it was done with the wrong data.  We have 60 years of history to prove it!

Posted in Performance, Scalability and Availability

Queue Analysis Software

Posted by J on August 21, 2008

To start to build an infrastructure around some of the analysis we hope to do for our clients, here are the beginnings of a demo.  It’s a System Performance Calculator based on Queuing Theory. Read the rest of this entry »

Posted in Introduction to our Software

System Performance Estimation during Requirement Analysis

Posted by J on July 12, 2008

In a previous blog entry, I argued for the urgency of getting an early view into system performance even if the conclusions are approximate.

In this entry, I want to describe a couple of successful techniques: Dimensional Analysis and Scaling. The two techniques are related but complementary. Scaling is the simpler of the techniques but Dimensional Analysis can be extremely powerful. They have been applied for analysis of physical systems in many branches of science. A particularly nice tutorial is available in this Applied Mathematics textbook. You don’t have to read the whole book; Chapter 1 is available here and that is all you need. The treatment of Dimensional Analysis is still a bit dense but scaling is sufficient in many instances for the types of estimations we typically require. Read the rest of this entry »

Posted in Performance, Scalability and Availability