Oracle Coherence Monitoring – Challenges and Best Practices

Today’s mission-critical applications are heavy consumers and producers of data. Managing that data and delivering it reliably to users and other systems, with real-time scalability and high performance, is a challenging problem. Enterprises and SaaS providers have adopted distributed caching technologies such as Oracle Coherence to provide fast and reliable access to frequently used data. As a shared infrastructure, Oracle Coherence enables scalable in-memory distributed caching, real-time data analysis, parallel transactions, event processing, and application grid computing across a distributed server infrastructure.

Using this sophisticated, feature-rich application grid technology comes with engineering and management challenges throughout the application lifecycle (from pre-production to post-production). Users find a lack of application performance management tools that help them correlate the utilization of the business application with the health, capacity, and performance of the Oracle Coherence grids. More often than not, developers are tasked with creating home-grown scripts or monitoring tools to expose internal Coherence grid performance metrics. This takes valuable time and resources away from the main task at hand. Furthermore, once applications go into production, the operations teams are exposed and left running “blind.” Given the high-profile nature of these banking, gaming, social, hospitality, logistics, e-commerce, retail, travel, intelligence, and data mining applications, it becomes essential to leverage proper application performance management tools like Evident ClearStone® to support the lifecycle of these distributed applications and their supporting infrastructure.

Evident ClearStone for Oracle Coherence is a monitoring and management solution that collects, aggregates, analyzes, correlates, manages, and visualizes Coherence application grids and the supporting application infrastructure. It gives developers, operations engineers, and other application stakeholders the comprehensive real-time monitoring information they need in order to mitigate problems, identify bottlenecks, optimize performance, and scale business-critical applications. Enterprises can deploy ClearStone in under 30 minutes. ClearStone’s monitoring tools require no advanced programming skills. The ClearStone solution is already in use at leading financial and e-commerce sites.

Today’s Challenges with Monitoring the Coherence Data Grid

When developers and architects begin to work with Coherence, they quickly discover that collecting performance data is a time-consuming and arduous process. For example, every Coherence JVM produces a log file. The larger the grid, the more log files there are to read and to manage. If you have a 50-node grid, then there are 50 log files being written to disk concurrently. With that many log files, it is cumbersome to get a clear picture of what is healthy and what is not. Furthermore, while the log files provide useful event data, they lack the detailed performance statistics needed to measure the load, utilization, capacity, and performance of the data grid.

The main source of Coherence’s performance statistics is JMX instrumentation. JMX data collection might involve hundreds to thousands of MBeans depending on the number of Coherence nodes that have been deployed. In this highly distributed environment, a simplistic tool like JConsole is inadequate for performance monitoring, capacity planning, or troubleshooting a cluster with tens or hundreds of nodes. True, the JMX Reporter feature in Coherence does expose high-level statistics in text format about the grid. However, human operators still need to analyze this data to derive any practical insight from it.
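To give a sense of the scale involved, the following is a minimal sketch that counts the MBeans registered under a JMX domain, grouped by their "type" key. It uses only the standard javax.management API. The class name MBeanInventory and the default pattern "Coherence:*" are illustrative assumptions; the sketch queries the local platform MBean server, so it only sees Coherence MBeans if it runs inside a JVM where they are registered.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Hypothetical helper, not part of Coherence or ClearStone.
public class MBeanInventory {

    /** Counts the MBeans matching an ObjectName pattern, grouped by their "type" key property. */
    public static Map<String, Integer> countByType(MBeanServerConnection server, String pattern)
            throws Exception {
        // A null query expression means "every MBean whose name matches the pattern".
        Set<ObjectName> names = server.queryNames(new ObjectName(pattern), null);
        Map<String, Integer> counts = new TreeMap<>();
        for (ObjectName name : names) {
            String type = name.getKeyProperty("type");
            counts.merge(type == null ? "(no type)" : type, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the MBeans of interest are visible in this JVM's platform MBean server.
        MBeanServerConnection server = ManagementFactory.getPlatformMBeanServer();
        String pattern = args.length > 0 ? args[0] : "Coherence:*";
        countByType(server, pattern)
                .forEach((type, count) -> System.out.println(type + ": " + count));
    }
}
```

Run against a large cluster, a listing like this makes the point of the paragraph above concrete: the per-type counts grow with the number of nodes, which is more than a single-JVM browser such as JConsole is designed to summarize.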

Even with log files, Coherence MBeans, and JMX Reporter, Coherence lacks a built-in mechanism for notifying users about problems or failures in the data grid. Without notifications, managing Coherence in production is a challenge.

Monitoring throughout the Application Lifecycle

Let’s explore the challenges and benefits of monitoring at each stage of the application lifecycle. As the diagram below shows, the application lifecycle includes three phases: development, testing, and production.

The Development Phase

During this stage, developers and architects focus on applying the design patterns best suited to storing and accessing data in the Coherence data grid. These design patterns determine how data is structured, serialized, partitioned, persisted, processed, and queried. There are multiple techniques that developers can apply to achieve similar results. The goal is to find the right approach based on the application’s requirements and access patterns.

As developers go through the iterative process of configuring Coherence and developing the proper techniques for accessing, storing, and processing data in Coherence, it’s important to measure the performance, utilization, capacity, and impact within the data grid. But monitoring this data can be difficult.

The Testing Phase

Beyond validating the functional aspects of the application logic, test teams should create test cases for:

  • Stress testing
  • Capacity planning
  • Destructive testing

Stress testing answers questions about the limits and breaking point of the data grid. How does one know when the grid has reached its limit or is about to break? Short of waiting for an outage or poring through myriad log files, it can be difficult to tell when a limit is reached if testers can rely only on Coherence’s own reporting tools.

Capacity planning assesses the grid’s ability to support the application running with a full set of data or data that grows or changes. Measuring the impact of a load can be difficult without real-time monitoring tools. Testers need to be able to see how a heavy load affects all aspects of the grid.

Destructive testing helps engineers understand what happens to the cluster when a server or a node goes offline or when a client experiences a long garbage collection cycle. Even with the loss of a server, the cluster should still have sufficient resources to perform and to scale without losing data. Here, too, Coherence’s lack of monitoring and analysis tools makes the testing process difficult and time-consuming. If the cluster fares poorly, engineers will have difficulty identifying the weak spot.

The Production Phase

Most enterprise management or systems management tools today are built for monitoring system availability; their primary purpose is simply to indicate whether a system is up or down. These tools are incapable of analyzing operations or correlating events at the cluster level. As a result, these tools are blind to cache performance.

To maintain production-level performance, the focus of monitoring needs to extend beyond simple system or node status. It is essential to monitor for outages, performance, capacity, and events or anomalies in the environment, and to correlate these with the JVMs.

While some JMX monitoring tools can report performance data from arbitrary JMX data sources, the information is generally unstructured, uncorrelated “as-is” data, presented in interfaces that are not user-friendly and are often difficult to configure, navigate, or scale for larger environments. Users with complex or dynamic environments cannot settle for anything less than integrated cluster-wide views of caches, node performance, intelligent event analysis, historical playback, threshold-based alerts, and deep log analysis. Tier-2 support specialists or developers need efficient tools to analyze the performance, utilization, and events that led up to a problem. They require the ability to look back in time to see the grid’s state and workload to determine if it was operating within operational thresholds.

The Solution: Evident ClearStone for Oracle Coherence

Evident ClearStone is a real-time monitoring and management solution for scaling large-data business applications. Already in production at leading e-commerce sites such as J. Crew, Shopzilla, and a major financial services reporting firm, ClearStone enables enterprises to understand and optimize the performance of business-critical applications, including Java applications, NoSQL applications, and high-performance (HP) or extreme transaction processing (XTP) applications.

ClearStone for Coherence is a ClearStone Management Pack for monitoring, analyzing, and managing the real-time performance, capacity, and operational health of business-critical applications running on Oracle Coherence. ClearStone deploys within minutes and provides the instant insight and control you need for optimizing business-critical applications.

Enterprise IT organizations can use ClearStone to:

  • Collect: ClearStone collects data from all systems of record in Coherence environments.
  • Correlate: ClearStone aggregates and refines its data and transforms it into business-level metrics. IT organizations and business managers can act on these metrics to make better decisions about IT resources and business operations.
  • Manage: ClearStone enables enterprises to manage applications more nimbly and cost-effectively. Through a dynamic Flex-based UI, alerts, and reports, enterprises gain insight into the health of production applications and the effectiveness of IT resources. ClearStone enables IT organizations to avoid outages and to maximize the return on their infrastructure.
  • Report: ClearStone generates reports that give IT engineers and business managers real-time and historical views of application performance. Enterprises gain actionable insight into the performance of the production applications their business depends on.

ClearStone for Coherence is packed with features to visualize a data grid’s capacity, health, performance, events, logs, and underlying runtime environment. It is easily deployed into large grids with no impact on the grid. Out of the box, it comes with a built-in flight data recorder that stores all the real-time performance data and events. The Adobe Flex user interface is accessible with any Flash-enabled Web browser. The product is intuitive and easy to use for developers, testers, and operations personnel.

Why ClearStone?

ClearStone is an application performance management platform for monitoring and managing high-performance distributed applications. It provides valuable monitoring and management capabilities for all phases of the application lifecycle: development, testing, and production.

ClearStone enables development teams to design high-performance, highly scalable Coherence applications. For operators, the tool provides intuitive visualizations and automated monitoring and management features for supporting Coherence applications.

Benefits for the Development Phase

Using ClearStone’s real-time monitoring features, developers can view the performance and capacity of the cluster, as well as the JVM performance metrics. ClearStone provides insight into:

  • Performance metrics, including GET/PUT throughput and service throughput
  • Capacity metrics, including storage usage and unit sizing
  • JVM performance metrics, including heap, CPU, memory pools, and garbage collection

This information helps developers understand how certain Coherence APIs may impact the grid and how to tune configuration changes.
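For readers unfamiliar with where JVM metrics of this kind come from, here is a minimal sketch that reads heap usage, memory pools, garbage collection time, and system load from the standard java.lang.management MXBeans. It is an illustration of the raw data sources only, not ClearStone's implementation; the class name JvmSnapshot and its methods are hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper for illustration; reads metrics of the local JVM only.
public class JvmSnapshot {

    /** Heap used as a fraction of the maximum heap, or -1.0 if the maximum is undefined. */
    public static double heapUtilization() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the JVM does not define a maximum
        return max > 0 ? (double) heap.getUsed() / max : -1.0;
    }

    /** Approximate total milliseconds spent in garbage collection since JVM start. */
    public static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.printf("heap utilization: %.2f%n", heapUtilization());
        System.out.println("total GC time (ms): " + totalGcTimeMillis());
        // System load average is -1.0 on platforms that do not provide it.
        System.out.println("system load average: "
                + ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage());
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (usage != null) {
                System.out.println(pool.getName() + ": " + usage.getUsed() + " bytes used");
            }
        }
    }
}
```

Each Coherence node is its own JVM, so a snapshot like this would have to be taken per node and then aggregated to say anything about the cluster as a whole.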

If developers instrument their application with custom JMX MBeans, they can use ClearStone to consolidate the monitoring of these JVMs and Coherence clusters.
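As a rough illustration of what instrumenting an application with a custom MBean involves, the sketch below registers a standard MBean with the platform MBean server using only the javax.management API. The names OrderStatsExample, OrderStats, the attribute OrdersProcessed, and the domain com.example.app are all hypothetical and chosen for this example.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical example; all names here are illustrative.
public class OrderStatsExample {

    /** Management interface: JMX exposes each getter as a read-only attribute. */
    public interface OrderStatsMBean {
        long getOrdersProcessed();
    }

    /** Standard MBean: the class name must equal the interface name without "MBean". */
    public static class OrderStats implements OrderStatsMBean {
        private final AtomicLong processed = new AtomicLong();

        /** Called by application code each time an order completes. */
        public void recordOrder() {
            processed.incrementAndGet();
        }

        @Override
        public long getOrdersProcessed() {
            return processed.get();
        }
    }

    /** Registers the MBean under an assumed, application-chosen object name. */
    public static ObjectName register(MBeanServer server, OrderStats stats) throws Exception {
        ObjectName name = new ObjectName("com.example.app:type=OrderStats");
        server.registerMBean(stats, name);
        return name;
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        OrderStats stats = new OrderStats();
        ObjectName name = register(server, stats);
        stats.recordOrder();
        // Any JMX client can now read the attribute by name.
        System.out.println("OrdersProcessed = " + server.getAttribute(name, "OrdersProcessed"));
    }
}
```

Once registered, the attribute is visible to any JMX-based collector alongside the JVM and Coherence MBeans in the same process.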

Benefits for the Testing Phase

ClearStone enables users to quickly spot trouble areas and monitor performance in real time.

Using ClearStone for stress testing, a developer or tester can monitor the internal service backlog and throughput within the grid to identify bottlenecks that would cause performance degradation or grid overload.

Using ClearStone for capacity planning, a developer, architect, or tester can ensure the cluster is properly configured with a sufficient number of storage nodes and appropriate high-unit limits, eviction policies, JVM heap sizes, and so on.

Lastly, using ClearStone for destructive testing, a developer or tester can see the effect of a destructive test in real time. Developers and testers can more easily pinpoint the weak points in the grid and make appropriate adjustments to the application or to Coherence itself.

ClearStone's Chronographs Historical Analytics option offers rapid access to live and historical data without the need for SQL queries.

Benefits for the Production Phase

ClearStone provides the alerts and notifications missing from Coherence. It warns users about application health, performance, and capacity problems, so that users can take action right away.

Operators can set production monitoring policies to alert them when the grid or application exceeds or falls below normal operational levels. Operations teams can rely on ClearStone to publish real-time alerts to existing management systems for immediate problem notification. ClearStone for Coherence can also correlate system and application-level data with the Coherence metrics, providing a cohesive and integrated view of the entire application.
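The idea of a policy that fires when a metric leaves its normal operating range can be sketched in a few lines. The following is a generic illustration and not ClearStone's API or policy format; the class ThresholdPolicy, its fields, and the example metric name are hypothetical.

```java
import java.util.Optional;

// Hypothetical illustration of a range-based alert policy.
public class ThresholdPolicy {
    private final String metric;
    private final double low;
    private final double high;

    public ThresholdPolicy(String metric, double low, double high) {
        if (low > high) {
            throw new IllegalArgumentException("low must not exceed high");
        }
        this.metric = metric;
        this.low = low;
        this.high = high;
    }

    /** Returns an alert message when the value is outside [low, high], otherwise empty. */
    public Optional<String> evaluate(double value) {
        if (value > high) {
            return Optional.of(metric + " is above " + high + ": " + value);
        }
        if (value < low) {
            return Optional.of(metric + " is below " + low + ": " + value);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed example: alert when heap utilization leaves the 10% to 90% band.
        ThresholdPolicy heap = new ThresholdPolicy("heap utilization", 0.10, 0.90);
        heap.evaluate(0.95).ifPresent(System.out::println);
    }
}
```

A production policy would typically also suppress repeated alerts and require a condition to persist for some interval before firing; those refinements are left out here for brevity.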

Evident ClearStone provides real-time insight into the Coherence cache.

Conclusion

Developers, testers, and operations engineers all need immediate insight into the status of applications running on Coherence. That insight is almost impossible to achieve when working only with the myriad log files provided by Coherence.

ClearStone for Oracle Coherence provides deep, real-time analytics and fine-grained management controls that are missing from Coherence. ClearStone offers real-time dashboards that correlate and analyze data from all systems of record in the Coherence data grid. Through ClearStone, enterprises gain the real-time and historical insight they need in order to optimize their application data grid in all phases of the application lifecycle.

You can try ClearStone for Oracle Coherence free. Download today!