
Use Case: Capacity Planning

This page provides an overview of how the Evident ClearStone monitoring and reporting solution supports capacity planning in a compute grid environment.

Introduction

Agent-based system management solutions provide information about system-level metrics such as CPU usage, running processes, memory utilization, network I/O, and storage I/O. These metrics are typically derived from operating system tools or agents that monitor performance at the “system” level, not at the “grid” level. For most businesses, however, system utilization matters less than response time, application failures, and tasks in queue, which are critical indicators of end-user performance but are invisible to system-level tools. For grid-enabled applications or services, the limitations of system management tools compared with Evident ClearStone can be summarized as follows:
   
1. The atomic events of the grid workload itself (tasks, jobs, services) and the consumers of those services (Business Units, or BUs) are not visible to system management tools. Detailed grid workload activity and audit information is available only through the grid system’s own reporting database (e.g., the DataSynapse GridServer Reporting Database).
   
2. Without these atomic sources of information, system management tools cannot determine the performance of the grid workloads or the utilization of the grid infrastructure.
   
3. Furthermore, system management tools typically lack the analytics necessary to map workloads to applications, services, or consumers (users). Without this visibility into grid execution, service and application reporting, analytics, capacity planning, resource sharing, and grid chargeback are difficult, if not impossible. Evident ClearStone addresses these gaps with out-of-the-box features.

System Management Tools

In comparison to system management tools, Evident ClearStone provides automated, cross-grid, agentless reporting focused on grid workloads, service execution, and cost. Both historical and near-real-time information is collected and analyzed, and trends are typically stored for 3-6 months. Metrics from system management agents, such as system-level CPU, process, memory, storage, and network I/O utilization, can be integrated into standard Evident ClearStone reports. When integrated in this fashion, Evident’s reports cross-reference engine CPU and memory utilization with grid-level utilization (workload), so that operational teams can see the effects of grid service and task-level loads on the overall CPU and memory utilization of the engines (systems) that make up the grid. This information can be used to schedule workload processing to ensure “smooth” system performance across the entire grid.

System management tools can, for example, be extended to measure additional metrics or events, but such extensions are only a means of collecting and metering information. They do not automatically provide grid-oriented views of the workloads or of grid resource performance (brokers and engines). Providing those views would require significant custom development to support the correlation and analytics needed for consolidated grid views, workload optimization, or grid resource sharing.

Evident ClearStone

Evident ClearStone comes bundled with a reporting data warehouse, report scheduling, and a reporting portal. Software installation and first report generation typically take 2-3 hours. No external agents or custom development are required, and administration is nominal. The solution has little or no overhead impact on the grid engines (assuming task-level reporting is not enabled) and places little or no additional load on the grid brokers (assuming a grid reporting database is already in use).

Benefits of using Evident for grid service reporting include turnkey reporting, regular software updates that enhance out-of-the-box functionality (50+ reports added in 2007), and full vendor support.

Beyond simple collection of metrics, Evident provides summary rollups and analytics that consolidate and enrich current and historical information, reducing the analysis required from the operational teams. The following example illustrates Evident ClearStone’s value to capacity planning within a multi-application compute grid production environment:

Step 1: Compare Engines and Tasks
 

Compare available grid engines with busy grid engines and pending task counts across a single grid. Hourly granularity is supplemented by daily, weekly, and monthly trends that can be overlaid. In the following diagram, light blue represents unutilized capacity.
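
As a rough illustration of the comparison in this step, the following Python sketch computes hourly engine utilization, idle engines, and a simple starvation flag. The record layout (hour, available, busy, pending) and the sample values are assumptions made for illustration only; they do not reflect the Evident ClearStone schema or any grid reporting database.

```python
from dataclasses import dataclass


@dataclass
class HourlySample:
    # Hypothetical record layout, not an actual reporting schema.
    hour: int       # hour of day, 0-23
    available: int  # engines available to the grid during the hour
    busy: int       # engines executing tasks during the hour
    pending: int    # tasks waiting in the queue


def summarize(samples):
    """Return (hour, utilization, idle_engines, starved) for each sample."""
    rows = []
    for s in samples:
        utilization = s.busy / s.available if s.available else 0.0
        idle = s.available - s.busy
        # Assumption: an hour counts as "starved" when no engine is idle
        # and tasks are still waiting in the queue.
        starved = idle == 0 and s.pending > 0
        rows.append((s.hour, round(utilization, 2), idle, starved))
    return rows


# Made-up sample data for three hours on one grid.
samples = [
    HourlySample(hour=8, available=100, busy=40, pending=0),
    HourlySample(hour=9, available=100, busy=100, pending=250),
    HourlySample(hour=10, available=100, busy=75, pending=0),
]

summary = summarize(samples)
for row in summary:
    print(row)
```

In this made-up data, the idle engine count corresponds to the unutilized capacity in the diagram, and the 9:00 sample is the only hour flagged as starved.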

 

 

Step 2: Who is Driving Demand?

 
Pinpoint the grid services driving demand during peak periods where engine starvation is occurring. 
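
One way to picture this step is as a ranking of services by their share of submitted tasks during the peak hours. The sketch below assumes hypothetical (hour, service, tasks) records and invented service names; it is not how Evident ClearStone computes its reports.

```python
from collections import Counter

# Hypothetical records: (hour of day, service name, tasks submitted).
records = [
    (9, "risk-calc", 180),
    (9, "pricing", 60),
    (9, "reporting", 10),
    (10, "risk-calc", 20),
    (10, "pricing", 50),
]


def demand_share(records, peak_hours):
    """Rank services by their share of tasks submitted during peak_hours."""
    totals = Counter()
    for hour, service, tasks in records:
        if hour in peak_hours:
            totals[service] += tasks
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    # most_common() orders services from the largest task count down.
    return [
        (service, count, round(count / grand_total, 2))
        for service, count in totals.most_common()
    ]


ranking = demand_share(records, peak_hours={9})
for service, count, share in ranking:
    print(service, count, share)
```

The peak hours themselves would come from the engine and pending-task comparison in Step 1.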
 

 

Step 3: Is System Resource Capacity Affecting Workload Execution?

 
Correlate granular system metrics (both on grid and off grid) with service requests to determine whether systems are CPU bound (by machine or by process), memory bound, or I/O bound. The assumption is that some engines may not be dedicated to the grid.

This allows the operator to validate whether increases in task times (see the graphic in Step 5) are due to bottlenecks created by resource limitations outside the grid, assuming engine starvation has been ruled out for that specific time period and service.
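
A minimal sketch of this kind of correlation, assuming paired samples of engine CPU utilization and average task duration for the same intervals, is a Pearson correlation coefficient. The sample values are invented, and a high coefficient is only a prompt for further investigation, not proof that the system is CPU bound.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant series carries no trend to correlate against.
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical paired samples for the same five intervals.
cpu_percent = [35, 50, 70, 90, 95]   # engine host CPU utilization
task_seconds = [12, 13, 18, 30, 34]  # average task duration

r = pearson(cpu_percent, task_seconds)
print(f"CPU vs. task duration correlation: {r:.2f}")
```

The same function could be applied to memory or I/O metrics to compare which resource tracks task duration most closely.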
 

 

Step 4: Can Pending Workloads be Shifted to Available Capacity?

 
Identify available engines across grids and correlate them with pending workloads, so that the performance of pending jobs in the queue is not affected by any sharing. Then change engine sharing policies to increase capacity headroom during peak demand hours for saturated grids, and/or time-shift workload schedules to leverage non-reserved resources available during those windows.
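
The matching of idle engines to pending workloads can be sketched as a simple greedy plan. Everything here is an assumption for illustration: the grid names, the input layout, and in particular the simplification that each pending task requests one additional engine. Real sharing policies are configured in the grid product and are not modeled here.

```python
def plan_sharing(grids):
    """Propose (donor, recipient, engines) loans between grids.

    grids maps a grid name to {"idle": int, "pending": int}.
    """
    # Only grids with no pending work of their own act as donors,
    # so sharing does not slow down jobs already queued on them.
    donors = {
        name: g["idle"]
        for name, g in grids.items()
        if g["pending"] == 0 and g["idle"] > 0
    }
    # Assumption: a saturated grid needs one extra engine per pending task.
    needs = {
        name: g["pending"]
        for name, g in grids.items()
        if g["pending"] > 0 and g["idle"] == 0
    }
    plan = []
    # Serve the most backlogged grid first.
    for recipient, need in sorted(needs.items(), key=lambda kv: -kv[1]):
        for donor in sorted(donors):
            if need == 0:
                break
            lend = min(need, donors[donor])
            if lend > 0:
                plan.append((donor, recipient, lend))
                donors[donor] -= lend
                need -= lend
    return plan


# Made-up snapshot of three grids during a peak hour.
grids = {
    "grid-a": {"idle": 0, "pending": 30},
    "grid-b": {"idle": 20, "pending": 0},
    "grid-c": {"idle": 25, "pending": 0},
}

plan = plan_sharing(grids)
for donor, recipient, engines in plan:
    print(f"{donor} lends {engines} engines to {recipient}")
```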
 

 

Step 5: Has the Shift in Workload Generated the Desired Results?

 
Monitor task times (even across brokers) to confirm that the new engine sharing policies have the intended impact on service performance.
 

 
   
1. If the task count and the task duration are both flat, then grid capacity is acceptable (i.e., tasks are being processed as they arrive and task duration is as expected).
   
2. If task count and duration are increasing or decreasing in unison, then capacity is also acceptable (i.e., as more tasks are accepted into the grid, duration should increase, and the reverse is also true).
   
3. If task count and duration are moving in opposite directions, then there could be a capacity problem. For instance, if task count is decreasing but duration is increasing, then there is a bottleneck or an application issue, because less load on the grid should result in shorter execution times. If task count increases and duration decreases, this is probably acceptable but should be investigated, because it is not the expected behavior for a fixed-size grid. It may, however, be expected if grid resources are being added dynamically (as shown in the figure above).
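
The three rules above can be expressed as a small decision function. This is only a sketch: the way a trend is detected (comparing the first and last value of a series against a 5% tolerance) and the result labels are assumptions, and the case where one series is flat while the other moves is not covered by the rules above, so it is reported as unclassified.

```python
def trend(values, tolerance=0.05):
    """Classify a series as 'up', 'down', or 'flat'.

    Assumption: the trend is judged from the relative change between
    the first and last value, with a 5% tolerance band counted as flat.
    """
    first, last = values[0], values[-1]
    if first == 0:
        return "flat" if last == 0 else "up"
    change = (last - first) / first
    if change > tolerance:
        return "up"
    if change < -tolerance:
        return "down"
    return "flat"


def assess(task_counts, task_durations):
    """Apply the three capacity rules to two series from the same period."""
    count_trend = trend(task_counts)
    duration_trend = trend(task_durations)
    if count_trend == duration_trend:
        # Rule 1 (both flat) and rule 2 (moving in unison).
        return "ok"
    if count_trend == "down" and duration_trend == "up":
        # Rule 3: less load but longer tasks.
        return "bottleneck"
    if count_trend == "up" and duration_trend == "down":
        # Rule 3: unexpected for a fixed-size grid; check for added engines.
        return "check-capacity-change"
    # One series flat, the other moving: not covered by the rules above.
    return "unclassified"


print(assess([100, 100, 101], [10, 10, 10]))
print(assess([100, 60], [10, 15]))
```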

Summary

With insight into performance, usage, and allocated capacity across grids, workload management, capacity planning, and service quality can all be improved. System management information alone, however, limits how effectively capacity planning and workload scheduling processes can “right size” grid engine supply to demand, leaving resources idle that could otherwise serve high-priority services. Coupling system-level utilization with grid service visibility provides the correlation required to improve resource provisioning and allocation before performance problems occur. This visibility can, for instance, help avoid engine starvation for time-critical services in resource-limited environments. It can also help to avoid unnecessary over-provisioning, increase the return on assets from existing capital (engines and software licenses), free up excess capacity to accommodate new applications and/or volumes, and encourage sharing among lines of business.
