The folks over at RightScale wrote up a great post on Cluster Monitoring with some new visualizations they’ve been using. The visualizations were certainly interesting (especially to me), but there is a key message in the post. The big takeaway is the importance of having a consolidated time series view of multiple entities so that the effect of an anomaly can easily be seen across an environment. In a clustered environment, individual entities are inevitably related, and an entity in trouble can, over time, affect other entities. The RightScale folks clearly showed this in their use case.
The difficulty arises when the density of information becomes too great to discern a problem. Imagine 1000 servers represented in a heatmap with 600,000 data points. The amount of information to be shown could exceed the number of pixels you have available to show it! But an environment of that size is unlikely to be in the hands of a developer, for whom detail is critical. Production environments can conceivably reach hundreds or thousands of entities, and the monitoring needs change along with the change in roles between Developer and Operations (DevOps). Information critical to developers can become “noise” to Operations folks. The detailed information needs to be distilled down to the exceptions that are leading to, or clearly indicate, a problem in the environment.
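To make the idea of distilling detail down to exceptions concrete, here is a minimal sketch in Python. It is only an illustration, not how RightScale or ClearStone work: the server names, the fixed threshold of 90.0, and the `exceptions_only` helper are all assumptions made for the example. It builds 600,000 synthetic data points (1000 servers with 600 samples each) and keeps only the samples that breach the threshold.

```python
from typing import Dict, List


def exceptions_only(samples: Dict[str, List[float]], threshold: float) -> Dict[str, List[int]]:
    """Return, per entity, the indices of samples above the threshold.

    Entities with no breaches are left out entirely, so the result
    contains only what an operator would need to look at.
    """
    flagged: Dict[str, List[int]] = {}
    for entity, values in samples.items():
        breaches = [i for i, value in enumerate(values) if value > threshold]
        if breaches:
            flagged[entity] = breaches
    return flagged


# Synthetic data (hypothetical): 1000 servers x 600 samples = 600,000 points.
samples = {f"server-{n:04d}": [50.0] * 600 for n in range(1000)}
samples["server-0042"][300] = 97.0
samples["server-0042"][301] = 99.0
samples["server-0817"][599] = 95.0

total_points = sum(len(values) for values in samples.values())
flagged = exceptions_only(samples, threshold=90.0)

print(total_points)  # 600000
print(flagged)       # {'server-0042': [300, 301], 'server-0817': [599]}
```

Out of 600,000 points, only three on two servers survive the filter. A real system would likely use richer exception conditions than a single static threshold, but the reduction in what has to be drawn on screen is the point.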
In Evident ClearStone, we bridge the gap between developers and operators by providing exception-focused views backed by the detail used to evaluate the exception condition. Furthermore, there is value in showing dissimilar but related entities, such as the constituent components of a NoSQL or Data Caching Platform (DCP) fabric, to see the cause and effect of related cascading failures. Below is a summarized view of various Oracle Coherence components showing health indicators for every 30 seconds of the last 15 minutes. The color of each health indicator represents the severity of the health condition.
Using the exception-focused view above, specific details of a particular entity or entities can then be examined to attempt to determine a root cause.
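The health indicators described above (one per 30 seconds over the last 15 minutes, colored by severity) can be sketched roughly as follows. This is a hedged illustration, not ClearStone's implementation: the `Severity` levels, the `warn` and `crit` thresholds, the worst-value-wins rule, and the choice to treat an empty interval as healthy are all assumptions made for the example.

```python
from enum import IntEnum
from typing import List, Tuple

BUCKET_SECONDS = 30        # one indicator per 30 seconds
WINDOW_SECONDS = 15 * 60   # last 15 minutes, giving 30 indicators


class Severity(IntEnum):
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def classify(value: float, warn: float, crit: float) -> Severity:
    """Map a raw metric value to a severity (hypothetical thresholds)."""
    if value >= crit:
        return Severity.CRITICAL
    if value >= warn:
        return Severity.WARNING
    return Severity.HEALTHY


def health_strip(samples: List[Tuple[float, float]], now: float,
                 warn: float, crit: float) -> List[Severity]:
    """Roll (timestamp, value) samples up into per-bucket severities.

    Each bucket reports the worst severity seen in its 30-second
    interval. Buckets with no samples default to HEALTHY, which is an
    assumption; a real system might show "unknown" instead.
    """
    bucket_count = WINDOW_SECONDS // BUCKET_SECONDS
    strip = [Severity.HEALTHY] * bucket_count
    start = now - WINDOW_SECONDS
    for timestamp, value in samples:
        if not (start <= timestamp < now):
            continue  # outside the 15-minute window
        index = int((timestamp - start) // BUCKET_SECONDS)
        strip[index] = max(strip[index], classify(value, warn, crit))
    return strip


# Usage with synthetic samples for one entity (timestamps in seconds).
samples = [(0.0, 10.0), (45.0, 85.0), (50.0, 97.0), (899.0, 82.0), (900.0, 99.0)]
strip = health_strip(samples, now=900.0, warn=80.0, crit=95.0)
print([s.name for s in strip[:3]])
```

Each entity yields a fixed strip of 30 severities regardless of how many raw samples it produced, so dissimilar components can be lined up side by side while the underlying samples remain available for drill-down.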


