This page is for users who develop or support high-performance applications that use Oracle Coherence and require performance monitoring:
- Whether you’re in a pre-production or post-production role, this page highlights the operational risks and issues you may encounter when using and supporting an Oracle Coherence data grid.
- This page identifies both common and not-so-common issues that can be addressed using real-time monitoring tools with embedded analytics, such as Evident ClearStone Live.
- This page helps users justify investing in the right tools to address these challenges and mitigate risk, both before and after an application goes into production.
Is Monitoring a ‘Must Have’?
Are you embarking on a new Oracle Coherence project? Or do you already have applications in pre-production or production that use Oracle Coherence? Is your grid running “blind” without proper real-time monitoring tools? Can you “see” the entire infrastructure in a way that pinpoints performance issues without having to wade through mountains of uncorrelated data? Will you pass your internal Production Readiness Checklist or operational risk reviews? These are some of the questions to consider when managing any high-performance application that uses Oracle Coherence as a data grid.
If you are involved with mission-critical, high-performance, revenue-generating applications such as trading, risk, e-commerce, or travel reservations, then it is essential to have the right monitoring tools for the application and infrastructure: tools that support both simple up/down dashboards and the more sophisticated analysis that architects and application teams need. Let’s explore the challenges and benefits of monitoring at each stage of the development lifecycle.
If this is your first Coherence project, you will find that Oracle Coherence provides facilities to log the grid activity of each member in log files. However, the logs do not provide any statistics on cache performance, member performance, or the grid data. To obtain these statistics, the grid must be properly configured and instrumented to expose performance data across the grid via JMX. Even then, users bear the responsibility of providing their own monitoring solution for the grid, as Coherence does not provide any built-in notifications or tools to analyze grid performance or health, set threshold-based alerts, or correlate real-time performance with historical data.
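As a rough illustration of what reading grid statistics over JMX involves, the sketch below uses the standard `javax.management` API to read one numeric attribute from every MBean matching a name pattern. The class name `MBeanAttributeReader` is hypothetical, and the Coherence pattern `Coherence:type=Cache,*` with the attribute `TotalGets` is an assumption about how cache MBeans are typically named; check both against the MBean reference for your Coherence version. The example runs against the local platform MBean server, so in a JVM without Coherence the second query simply returns an empty map.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;
import javax.management.JMException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

/** Hypothetical helper: reads one numeric attribute from all MBeans matching a pattern. */
final class MBeanAttributeReader {

    static Map<String, Long> readLongAttribute(MBeanServerConnection server,
                                               String namePattern,
                                               String attribute) {
        Map<String, Long> values = new TreeMap<>();
        try {
            // queryNames accepts wildcard patterns such as "domain:type=X,*"
            for (ObjectName name : server.queryNames(new ObjectName(namePattern), null)) {
                Object value = server.getAttribute(name, attribute);
                if (value instanceof Number) {
                    values.put(name.getCanonicalName(), ((Number) value).longValue());
                }
            }
        } catch (JMException | IOException e) {
            throw new IllegalStateException("JMX query failed for " + namePattern, e);
        }
        return values;
    }

    public static void main(String[] args) {
        MBeanServerConnection local = ManagementFactory.getPlatformMBeanServer();
        // Standard JVM MBeans are present in every JVM.
        System.out.println(readLongAttribute(local, "java.lang:type=GarbageCollector,*", "CollectionCount"));
        // Assumed Coherence naming; empty unless this JVM hosts a managed Coherence node.
        System.out.println(readLongAttribute(local, "Coherence:type=Cache,*", "TotalGets"));
    }
}
```

Even this small helper shows the gap described above: it returns raw values per MBean, with no history, alerting, or correlation across members.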
During the development stage (including design and architecture), architects and developers will find that the Coherence logs are insufficient for tracking the performance of the grid. Most developers will use development tools like JConsole as a starting point to examine the hundreds or thousands of MBeans in the grid. This approach is unacceptably tedious and time consuming for more than a couple of caches or a dozen Coherence members, even for basic performance and utilization metrics. In addition, the majority of the metrics are cumulative counters, so it’s very difficult to analyze this data without proper analysis and visualization tools. Furthermore, as architects build different Coherence cache configurations, they need to see the impact of each configuration on the grid.
Some engineering teams build their own dashboards or reporting tools to fill the void in vendor tools and products. This additional effort requires engineering resources, time, and funds that may not have been factored into the development lifecycle. It can lead to delayed releases or obstacles in readying an application for production.
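To show why cumulative counters are hard to read directly, here is a minimal sketch that turns two successive samples of a counter into a per-second rate and an interval hit ratio. The class and method names are hypothetical, and the handling of a counter that goes backwards (treated here as a reset, for example after a member restart) is an assumption rather than documented Coherence behavior.

```java
/** Hypothetical helper: converts cumulative counter samples into interval metrics. */
final class CounterRate {

    /** Per-second rate between two samples of a cumulative counter. */
    static double perSecond(long previousValue, long currentValue,
                            long previousMillis, long currentMillis) {
        long elapsedMillis = currentMillis - previousMillis;
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("samples must be in increasing time order");
        }
        long delta = currentValue - previousValue;
        if (delta < 0) {
            // Assumption: a decreasing counter means it was reset, so count from zero.
            delta = currentValue;
        }
        return delta * 1000.0 / elapsedMillis;
    }

    /** Hit ratio for the interval only, or NaN if there were no gets or a reset occurred. */
    static double intervalHitRatio(long previousHits, long currentHits,
                                   long previousGets, long currentGets) {
        long hits = currentHits - previousHits;
        long gets = currentGets - previousGets;
        if (gets <= 0 || hits < 0) {
            return Double.NaN;
        }
        return (double) hits / gets;
    }
}
```

For example, a cache whose lifetime totals are 980 hits out of 1,100 gets looks healthy at about 89%, while the last interval (80 hits out of 100 gets) is already down to 80%. That trend is invisible when only the cumulative values are displayed.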
As the project progresses into the testing/UAT phases, the need for monitoring becomes more pressing. The test phase not only validates functionality but also includes performance and scale testing. The test/UAT grids are expected to be larger and more complex than development grids, so scaling and data collection latency become issues. Measuring the capacity and performance of the cluster members, caches, JVMs, and clients over time becomes critical. High-performance applications require frequent and granular measurements, within seconds as opposed to minutes of an event, across the entire grid and in real time. As applications are upgraded, testers need to verify how application and grid performance compares to earlier releases. This requires monitoring tools that can perform custom historical performance analysis.
When the application is ready to be delivered into production, architects must provide adequate sizing in order to secure proper hardware resources for the application. Capacity planning should not be guesswork; it should be based on measurements from a monitoring tool that provides a holistic view of the grid. This avoids the potential mistake of under-provisioning or over-provisioning.
Before bringing the application online, the operations staff will need to be equipped with monitoring tools to support the application and the underlying infrastructure. Today’s enterprise management or systems management products are typically tools for monitoring availability (up/down). They have no understanding of the cluster level and no ability to correlate events there, so cache performance is not visible to them. The focus of monitoring needs to extend beyond simple system or node status: it is essential to monitor for outages, performance, capacity, and events or anomalies in the environment, and to correlate these with the non-Coherence JMX metrics that are crucial to mission-critical applications. A “must have” is the ability to correlate and consolidate Coherence JVM log data with JMX-based cluster stats. It’s critical to understand when the grid or the application exceeds or falls below normal operational levels. Operations relies on the monitoring tool to publish real-time alerts to existing management systems for immediate problem notification. Alerts from sophisticated systems provide not just status changes, but also events derived from the mountain of available metrics, so that operators have a more concise and accurate view of cluster health and performance.
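The idea of detecting when a metric leaves its normal operational range, and alerting only on a change of status rather than on every sample, can be sketched as follows. The `ThresholdMonitor` class, its status names, and the example limits are all hypothetical; a production tool would add details such as hysteresis and alert routing that are omitted here.

```java
/** Hypothetical sketch: classifies samples against a normal range and flags status changes. */
final class ThresholdMonitor {

    enum Status { BELOW, NORMAL, ABOVE }

    private final double low;
    private final double high;
    private Status last = Status.NORMAL;

    ThresholdMonitor(double low, double high) {
        if (low > high) {
            throw new IllegalArgumentException("low must not exceed high");
        }
        this.low = low;
        this.high = high;
    }

    /** Classifies a single sample against the configured range. */
    Status classify(double value) {
        if (value < low) {
            return Status.BELOW;
        }
        if (value > high) {
            return Status.ABOVE;
        }
        return Status.NORMAL;
    }

    /** Records a sample and returns true only when the status changed, i.e. an alert is due. */
    boolean update(double value) {
        Status current = classify(value);
        boolean changed = current != last;
        last = current;
        return changed;
    }

    Status lastStatus() {
        return last;
    }
}
```

With a monitor created as `new ThresholdMonitor(20, 80)` for a metric such as heap utilization in percent, a run of samples at 91, 95, and 97 produces a single alert on the first breach instead of three.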
Once in production, the tier-2/3 production support staff require analysis tools to view and measure the grid from top to bottom, including logs and events. A single dashboard is essential for these highly skilled resources because it reduces troubleshooting and restoration times; switching from tool to tool is unproductive and inaccurate (due to differing measurement techniques and frequencies). While some JMX monitoring tools can report performance data from arbitrary JMX data sources, the information is generally unstructured, uncorrelated “as-is” data, presented in unfriendly user interfaces that are often difficult to configure, navigate, or scale for larger environments. Users with complex or dynamic environments cannot settle for anything less than integrated cluster-wide views of caches, node performance, intelligent event analysis, historical playback, threshold-based alerts, and deep log analysis. Tier-2 support specialists or developers need efficient tools to analyze the performance, utilization, and events that led up to a problem. They require the ability to look back in time to see the grid’s state and workload and determine whether it was operating within operational thresholds. Monitoring tools must have these integrated features to enable users to quickly spot trouble areas and monitor performance in real time.
Why Not Build Your Own?
You want your developers building business logic, not management tools. Real-time monitoring and visualization for large-scale, high-performance / XTP environments is challenging. It’s not just a matter of collecting and displaying JMX metrics, but rather deciding how best to correlate, analyze, visualize, and alert on conditions that may be intermittent or unpredictable. Here’s a common example: correlating JVM logs may be a simple task, but it is time consuming, to the point of being unworkable for more than a few dozen JVMs. A better alternative is to use a tool that identifies, correlates, and plays back the most relevant log entries for specific performance errors, such as long wait times.
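To give a sense of what even the simplest form of log correlation involves, the sketch below gathers the entries from several members’ logs that fall within a time window before an event and orders them into one timeline. Everything here is hypothetical: the `LogCorrelator` and `Entry` names, the already-parsed timestamps, and above all the assumption that member clocks are synchronized. Real logs would also need parsing, filtering by relevance, and handling of clock skew.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: builds one timeline from per-member logs around an event. */
final class LogCorrelator {

    /** A parsed log line; parsing itself is out of scope for this sketch. */
    static final class Entry {
        final long timestampMillis;
        final String member;
        final String message;

        Entry(long timestampMillis, String member, String message) {
            this.timestampMillis = timestampMillis;
            this.member = member;
            this.message = message;
        }

        @Override
        public String toString() {
            return timestampMillis + " [" + member + "] " + message;
        }
    }

    /** Entries from all members in [eventMillis - windowMillis, eventMillis], ordered by time. */
    static List<Entry> leadingUpTo(List<List<Entry>> perMemberLogs,
                                   long eventMillis, long windowMillis) {
        List<Entry> timeline = new ArrayList<>();
        for (List<Entry> memberLog : perMemberLogs) {
            for (Entry entry : memberLog) {
                if (entry.timestampMillis >= eventMillis - windowMillis
                        && entry.timestampMillis <= eventMillis) {
                    timeline.add(entry);
                }
            }
        }
        // Assumes member clocks are synchronized; otherwise the ordering is only approximate.
        timeline.sort(Comparator.comparingLong((Entry e) -> e.timestampMillis));
        return timeline;
    }
}
```

This version scans every entry of every log for each event, which is acceptable for a handful of members but illustrates why the approach becomes unworkable by hand as the number of JVMs grows.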
The Solution: Evident ClearStone Live
Evident ClearStone Live (ECSL) for Oracle Coherence is packed with features to visualize a data grid’s capacity, health, performance, events, logs, and its underlying runtime environment. It is easily deployed, within thirty minutes, into large grids with no impact to the grid. Out of the box, it comes with a built-in flight data recorder to store all the real-time performance data and events. The Adobe Flex user interface is accessible with any Flash-enabled internet browser. Regardless of the role of the user, the product is easy and intuitive to use.
Here are some key ways ECSL can help monitor your data grid:
- Real-time monitoring and alerting (10 seconds) enables granular measurements and immediate notification of performance problems, outages, or other events from the grid.
- Cluster-wide monitoring at a macro and micro level.
- Monitoring of all the Java Platform MBeans of any Coherence JVM in a single pane of glass provides users with JVM-level metrics such as heap utilization, garbage collection, thread pools, and CPU utilization.
- Correlation of Coherence node performance with JVM performance.
- Equipped with a flight data recorder to provide the latest 24 hours of performance and events of the grid.
- The real-time visualizations enable users to easily drill back in time to help answer what happened an hour ago to 24 hours ago.
- Real-time charts are annotated with correlated events.
- Threshold-based monitoring enables users to set and forget thresholds. Alerts are sent via SNMP or SMTP.
- Advanced visualizations such as heat maps to analyze cache partitioning topologies or hot spots in the grid.
- Real-time log monitoring addresses the tedious process of log consolidation and event notification.
- Shared profiles enable different users to customize the user interface.
- Support for event handler plug-ins. Customers can provide their own event handler to process critical events detected by ECSL.
You can try ClearStone for Oracle Coherence free. Download today!






