Quantitecture

Performance, Scalability and Availability in IT

Availability Paper and Software

Posted by J on November 19, 2008

Managing hundreds of business applications in an enterprise with a quantitative view of their availability is a challenge. It requires combining quantitative methods with knowledge captured from across the organization. We propose a unified methodology for modeling the availability of applications in an enterprise. The proposal gives the enterprise's technology management tools to continuously manage the evolution of its technology infrastructure. See the full paper for details.

We have developed the mathematics for calculating the availability of systems. The mathematics is quite simple; the key to its usage is building a network representing the system.
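To make the idea of a network concrete, here is a minimal sketch that uses the standard series and parallel formulas from reliability theory. It is not necessarily the method in the paper: the `Node` class, the node kinds and the availability figures are all hypothetical, and failures are assumed to be independent.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class Node:
    """One node in a hypothetical availability network."""
    name: str
    kind: str = "base"            # "base", "series" or "parallel"
    availability: float = 1.0     # only used by base nodes
    children: list = field(default_factory=list)


def availability(node):
    """Recursively compute availability, assuming independent failures."""
    if node.kind == "base":
        return node.availability
    child_values = [availability(c) for c in node.children]
    if node.kind == "series":
        # Every child must be up for the node to be up.
        return prod(child_values)
    if node.kind == "parallel":
        # The node is down only when every child is down.
        return 1.0 - prod(1.0 - a for a in child_values)
    raise ValueError(f"unknown node kind: {node.kind}")


# Illustrative figures only: one CPU in series with a mirrored disk pair.
server = Node("Server", "series", children=[
    Node("CPU", availability=0.99),
    Node("DiskPair", "parallel", children=[
        Node("DiskA", availability=0.95),
        Node("DiskB", availability=0.95),
    ]),
])

result = availability(server)
print(f"{server.name}: {result:.6f}")
```

The recursion mirrors the structure of the network, so adding a component means adding a node rather than changing the formula.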

The collection of the data on which to base the calculations can be a challenge. A system for collection of this data has to allow entry by members of a large organization, and consolidate that information into a query and reporting tool. The requirements for such a system have been identified. A system demonstrating an implementation of the proposed methods has been developed. We are seeking suitable partners to test the proposed methodology.

Here is the Availability Demo.  Two motivating examples:

GoodEats Catering Service takes delivery orders over the internet.  They have two systems, each with 1 CPU and 2 disks, and an internet connection.  They have devised a fail-over mechanism that works like this: one of the servers is designated the primary server.  This server handles the web traffic.  The database on the primary server uses log shipping to continuously send logs to the secondary server.  When there is an outage, the owner tries to reboot the primary server and hopes that the problem goes away.  If it doesn’t, there is a manual procedure to apply the shipped logs to the secondary server and route internet traffic to it.  The switch-back process can take a long time to accomplish.  Once the primary server has been repaired, the switch-back is done on a Monday morning so that it causes minimal disruption.  The owner wants to know the availability of the system.

An Investment Advisor Company consolidates its clients’ account and position data and provides consolidated information about their portfolios via a set of reports delivered over a web site.  The information provided includes positions, trades, corporate actions and portfolio performance data.  The data presented is based on a series of feeds that arrive from the various custodians overnight.  The performance information is calculated from that data.  The calculation is a time-consuming process during which the web site is unavailable to the customers.  The challenge is to estimate the availability of the system for clients.

So log into the application.  You just need to provide your Google login ID.  Google will let you sign up right there on the page if you don’t already have one.

For the first example,

  1. Create some base nodes:
    • CPU1, Disk11 and Disk12, with A-Ratings of 2.4, 3 and 3 respectively.
    • CPU2, Disk21 and Disk22, with A-Ratings of 2.4, 3 and 3 respectively.
    • Internet with A-Rating of 1.2.
  2. Create 2 Aggregate nodes, Server1 and Server2, each with its CPU, disks and a (shared) internet.
  3. Create a Fail-over node.  Assume that the fail-over process itself has an availability rating of 1.5 and it takes 1.0 hours to fail over and, once failed-over, takes on average 10 days (240 hrs) to switch back.
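As a rough cross-check of what the Aggregate nodes might report, the sketch below converts the A-Ratings above into availabilities and combines them for Server1. It rests on two assumptions that the post does not state: that an A-Rating is a "number of nines" (a rating r means an availability of 1 - 10^-r), and that an Aggregate node needs all of its components to be up. The function names are hypothetical, and the Fail-over node is deliberately left out because its formula is not given here.

```python
from math import log10, prod


def rating_to_availability(rating):
    """Assumed mapping: a rating of 3 means 'three nines', i.e. 0.999."""
    return 1.0 - 10.0 ** (-rating)


def availability_to_rating(avail):
    """Inverse of the assumed mapping above."""
    return -log10(1.0 - avail)


# A-Ratings entered for Server1 in step 1 (the internet link is shared).
ratings = {"CPU1": 2.4, "Disk11": 3.0, "Disk12": 3.0, "Internet": 1.2}

# Assumption: the aggregate is up only if every component is up.
server1 = prod(rating_to_availability(r) for r in ratings.values())
server1_rating = availability_to_rating(server1)

for name, r in ratings.items():
    print(f"{name:9s} rating {r:.1f} -> {rating_to_availability(r):.4f}")
print(f"Server1   availability {server1:.4f}, rating {server1_rating:.2f}")
```

Under these assumptions the internet connection, with the lowest rating, dominates the result. The figures the demo reports may differ if it combines components differently.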

For the second example, which is based on a real-world case, we entered data representing the various failure modes of the application:

  1. A CPU and disks that make up the server.
  2. Two databases hosted on the server.  Each database is managed, but of course it can still fail due to logs filling up, time-out conditions, or errors in the application code that create an error condition in the database.
  3. Timely arrival of feeds from custodians.
  4. Processing of all data in the feeds.  This was a significant variable because the feeds sometimes contained back-dated events which caused the performance calculation engine to run a long time.
  5. Processing of all data entered by the users the previous day.  This was a significant variable because the users sometimes entered back-dated events which caused the performance calculation engine to run a long time.
  6. The web delivery infrastructure.

The results are shown in Availability Calculator Results (example 2).  SvcBatchDone is the node that represents the availability of data in time for use by the customers.  AppAvailability is the net availability of the system.  An A-Rating of 0.88 translates to an availability of about 87%, which is close to what they experienced.
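The step from 0.88 to 87% can be reproduced if the 0.88 figure is read as an A-Rating r in the "number of nines" sense. That reading is an assumption, since this post does not define the rating, but it is consistent with the numbers quoted:

```latex
A = 1 - 10^{-r}
  = 1 - 10^{-0.88}
  \approx 1 - 0.132
  = 0.868 \approx 87\%
```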

Feel free to utilize the system for your situation.  If you need assistance, or would like to see us enhance the application, please contact us.
