NoSQL DB logging and reporting are challenges, for reasons discussed in this post. I recently spoke with Evident Software veteran Don Jeffery (@drjmun on Twitter) about those challenges and how Evident ClearStone (ECS) addresses NoSQL DB logging and reporting.
Metrics Collection
ECS collection (using JMX and ODI, our RESTful API over HTTP) creates Neo4j graph nodes from harvested and derived data in the form of resource metrics, resources, relationships and events. A unique identifier is generated for each Neo4j graph node, regardless of type. This identifier can be used to retrieve information from Apache Cassandra 0.7, which we employ as a time-series database, thus supporting current and historical performance monitoring visualizations within customizable perspectives. This product design allows ECS to chart performance metrics and display events such as threshold violations. (Our CTO, Ivan Ho, wrote a nice blog post on our use of Neo4j and Cassandra, which got good play on DZone.)
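To make the identifier-as-join-key idea concrete, here is a minimal in-memory sketch in Python. It is not ECS code and does not talk to Neo4j or Cassandra; the Inventory and TimeSeriesStore classes, the metric name and the sample values are all hypothetical stand-ins. It only illustrates how a generated node identifier can serve as the lookup key for time-series samples.

```python
import uuid
from bisect import insort
from collections import defaultdict


class Inventory:
    """Hypothetical stand-in for the graph store: every node gets a generated id."""

    def __init__(self):
        self.nodes = {}

    def add_node(self, node_type, **attributes):
        node_id = str(uuid.uuid4())  # one unique id per node, regardless of type
        self.nodes[node_id] = {"type": node_type, **attributes}
        return node_id


class TimeSeriesStore:
    """Hypothetical stand-in for the time-series database: samples keyed by node id."""

    def __init__(self):
        self._rows = defaultdict(list)

    def write(self, node_id, timestamp, value):
        # Keep each row sorted by timestamp so range reads come back in order.
        insort(self._rows[node_id], (timestamp, value))

    def read(self, node_id, start, end):
        # Inclusive time-range query for one node's samples.
        return [(t, v) for t, v in self._rows.get(node_id, []) if start <= t <= end]


inventory = Inventory()
store = TimeSeriesStore()

# The id returned when the node is created is the only link between the two stores.
metric_id = inventory.add_node("resource_metric", name="heap_used_mb")
store.write(metric_id, 20, 530.0)
store.write(metric_id, 10, 512.0)
store.write(metric_id, 30, 498.0)

recent = store.read(metric_id, 15, 30)
```

Keeping the graph free of raw samples, and keying the samples by node id, is what lets a chart ask for "this node, this time range" without touching the rest of the inventory.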
In ClearStone, a Neo4j node represents an instance of any entity we choose to track. The Neo4j node may have attributes that relate to, say, a host, such as an IP address, or data that lets us estimate, through heuristics, how many processors a piece of hardware has. Other resources, possibly from other technologies, may also carry information that helps us confirm the new host entity, create an instance of it, and populate it opportunistically as more information comes in. The free-form nature of nodes stored in Neo4j makes this capability easy to support.
Any time there are events associated with a resource, we keep a timeline of such events married to a snapshot of the associated resource(s) in the inventory at the time of the event occurrence.
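One way to picture an event "married to a snapshot" is the sketch below. It is an illustration only: the Event and Timeline classes, the dictionary-based inventory and the example host are assumptions, not the ECS data model. The point it shows is that the snapshot is a copy taken at event time, so later inventory changes do not rewrite history.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Event:
    timestamp: int
    resource_id: str
    description: str
    snapshot: dict  # copy of the resource's attributes when the event occurred


@dataclass
class Timeline:
    events: list = field(default_factory=list)

    def record(self, inventory, timestamp, resource_id, description):
        # Deep-copy so later edits to the live inventory cannot alter the record.
        snapshot = copy.deepcopy(inventory[resource_id])
        self.events.append(Event(timestamp, resource_id, description, snapshot))


# Hypothetical inventory: resource id -> attributes.
inventory = {"host-1": {"type": "host", "ip": "10.0.0.5", "cpu_count": 4}}

timeline = Timeline()
timeline.record(inventory, 100, "host-1", "CPU threshold violation")

# The live inventory changes after the event; the stored snapshot does not.
inventory["host-1"]["cpu_count"] = 8
```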
Challenges
In the realm of NoSQL logging and reporting, consider the problems involved in monitoring a dynamic distributed environment with tools that are usually specific to a single technology. As Don put it, “what we need to understand is that the virtual and physical resources these products run on often overlap. At the very least there are server farms and networks that are shared”. NoSQL logging and reporting tools need to be able to identify patterns and relationships. They need an “elastic cross-technology solution that gets information on how [the technologies] impinge on one another in a common fabric”.
Another issue: in an environment where nodes are coming and going, a monitoring tool has to keep track of which nodes are current and which are not. As Don said, “If we don’t get a report from a node, does that mean it’s just offline, or it has a problem? We also have to know at any given time in the history of our collection what nodes are available. Sampling a number of times helps you get a picture”. Sufficient samples over time can help ascertain whether a node’s state fluctuates a lot, with the caveat that one may never be completely certain of even that. One rule might be that if a node is always ‘on’, and we get no reports on it for a [fill in the time frame], then we can conclude it has a problem; I think you see the challenge here.
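The rule of thumb above can be sketched as a small heuristic. Everything here is an assumption made for illustration: the function names, the idea of an expected reporting interval, and the max_missed threshold (the time frame is deliberately left as a parameter, since the right value depends on the environment). A silent node is labeled "suspect" rather than "failed", because silence alone cannot distinguish offline from broken.

```python
def classify_node(report_times, now, expected_interval, max_missed):
    """Label a node from its report history.

    report_times: timestamps at which the node reported (any order).
    expected_interval: assumed time between reports for an always-on node.
    max_missed: assumed number of missed intervals tolerated before suspicion.
    """
    if not report_times:
        return "never_seen"
    silence = now - max(report_times)
    if silence <= expected_interval * max_missed:
        return "current"
    return "suspect"


def availability(report_times, start, end, expected_interval):
    """Fraction of expected reporting slots in [start, end) with at least one report."""
    slots = range(start, end, expected_interval)
    if len(slots) == 0:
        return 0.0
    seen = sum(
        1
        for slot in slots
        if any(slot <= t < slot + expected_interval for t in report_times)
    )
    return seen / len(slots)


reports = [0, 10, 20, 30]
status_soon_after = classify_node(reports, now=45, expected_interval=10, max_missed=3)
status_much_later = classify_node(reports, now=100, expected_interval=10, max_missed=3)
uptime = availability(reports, start=0, end=60, expected_interval=10)
```

A low availability figure over a long window suggests a node that fluctuates by nature, which is a reason to tolerate more missed reports from it before raising an alarm.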
Opportunities
Don understands the challenge and opportunity of NoSQL logging and reporting well. He says that “keeping the best snapshot” of a monitored resource is what we are striving to do with ECS: we try to “identify principal players” that a customer installation consists of, be they caches, nodes in a cluster, or whatever they may be, in what we call our inventory. We can’t rely on current state alone, nor on history alone, but rather on a combination of the two; “so that’s some of the stuff we’ve been looking at. If we can usefully compare what is in inventory now to what was there in the past, we’ll discover things we hadn’t even thought about, such as usage patterns and virtual and physical resources in that environment. I’m not sure we’ll be initially able to assess causality, but I think we can establish a footprint and allow the user to be able to explore and draw conclusions; I think we can give them the basis for that information.”
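Comparing "what is in inventory now to what was there in the past" can be as simple as a diff between two snapshots. The sketch below assumes, purely for illustration, that a snapshot is a dictionary mapping resource ids to attribute dictionaries; the function name and the sample resources are hypothetical.

```python
def diff_inventory(past, present):
    """Compare two inventory snapshots (resource id -> attributes)."""
    past_ids, present_ids = set(past), set(present)
    return {
        "added": sorted(present_ids - past_ids),
        "removed": sorted(past_ids - present_ids),
        # Present in both snapshots, but with different attributes.
        "changed": sorted(
            rid for rid in past_ids & present_ids if past[rid] != present[rid]
        ),
    }


last_week = {
    "cache-1": {"host": "host-a"},
    "cache-2": {"host": "host-a"},
    "node-1": {"host": "host-b"},
}
today = {
    "cache-1": {"host": "host-a"},
    "cache-2": {"host": "host-c"},  # moved to a different host
    "node-2": {"host": "host-b"},   # replaced node-1
}

changes = diff_inventory(last_week, today)
```

A diff like this says nothing about causality, but it does establish the footprint a user can start exploring from.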
“It’s challenging to pinpoint cause and effect. For example, it’s difficult to determine that your publisher success rate is low because your CPU is maxed out; maybe your CPU is maxed out because you are attempting to do so much publishing that you’ve saturated that machine, leading to a low success rate; we can at least give them hints. We can also begin, with ECS 5.0, to give them projections, maybe presented graphically and in a number of visual perspectives; maybe an incident matrix.”
Regardless of what we deliver, Don says that “we want to give them something navigable, so they can begin to see where things are ‘lighting up’, then move distances away. So if a Cassandra cache was causing problems I could inspect the host. Oh, and now that I’m at that host, I can see there’s some other technology on there that’s beginning to have a lot of events.” Maybe this second technology is the actual root cause of the problem that first surfaced in the original monitored resource: a technology possibly in a different tier, a different NoSQL or caching technology, or maybe a servlet… you get the picture.
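The navigation Don describes, from a troubled resource to its host and then to whatever else on that host is "lighting up", amounts to a short graph walk. The sketch below is an assumption-laden toy: the runs_on mapping, the event list and the function name are invented for illustration and stand in for relationships and events that would live in the inventory.

```python
from collections import Counter

# Hypothetical "runs on" relationships: resource id -> host id.
runs_on = {
    "cassandra-cache-1": "host-a",
    "servlet-7": "host-a",
    "memcached-2": "host-b",
}

# Hypothetical events as (resource id, timestamp) pairs.
events = [
    ("cassandra-cache-1", 100),
    ("servlet-7", 101),
    ("servlet-7", 102),
    ("servlet-7", 103),
    ("memcached-2", 104),
]


def neighbors_by_event_count(resource_id, runs_on, events):
    """Step from a resource to its host, then rank co-located resources by event count."""
    host = runs_on[resource_id]
    co_located = [r for r, h in runs_on.items() if h == host and r != resource_id]
    counts = Counter(r for r, _ in events)
    ranked = sorted(((r, counts[r]) for r in co_located), key=lambda pair: -pair[1])
    return host, ranked


host, neighbors = neighbors_by_event_count("cassandra-cache-1", runs_on, events)
```

Ranking neighbors by event count only provides a hint about where to look next; it does not establish that the busiest neighbor is the root cause.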
Don says “Being able to provide those hints and that navigation becomes even more important as the size and scale of these systems becomes such an issue that it becomes really difficult to monitor and manage the new environment without some event information or other heuristics that we’ve applied, a view that limits the scope of what they’re seeing to a space that we believe is related to problems that can explore causality. So, that’s one of the things we want to introduce in 5.0, some interesting visualizations, to help them navigate around.”
Weaving powerful semantics among tiers and domains, based on an ever-growing and better-understood inventory of resources, will provide a platform for discovering interrelationships whose understanding can serve both root cause analysis and enforcement of SLAs. This is the future of Evident ClearStone, as well as its present.
