Articles tagged graph database

The views expressed in this blog are strictly personal, and do not necessarily represent the views of Evident Software.

By Bill Nigh

NoSQL DB logging and reporting are challenges, for reasons discussed in this post. I recently spoke with Evident Software veteran Don Jeffery (@drjmun on Twitter) about those challenges and how Evident ClearStone (ECS) addresses NoSQL DB logging and reporting.

Metrics Collection

ECS collection (using JMX and ODI, our RESTful API over HTTP) creates Neo4j graph nodes from harvested and derived data in the form of resource metrics, resources, relationships and events. A unique identifier is generated for each Neo4j graph node, regardless of type. This identifier can be used to retrieve information from Apache Cassandra 0.7, which we employ as a time-series database, thus supporting current and historical performance monitoring visualizations within customizable perspectives. This product design allows ECS to chart performance metrics and display events such as threshold violations. (A nice writeup of our use of Neo4j and Cassandra was done in a blog post by our CTO, Ivan Ho, and got good play on DZone).

In ClearStone, a Neo4j node represents an instance of any entity we choose to track. The Neo4j node may have attributes that relate to, say, a host, such as an IP address, or data that somehow through heuristics lets us calculate how many processors may be on a piece of hardware. Other resources, possibly from other technologies, may also have information that helps us confirm that new host entity, create an instance of it, and populate it opportunistically as more information comes in. The incredibly free form nature of nodes stored in Neo4j makes this an easy capability to support.

Any time there are events associated with a resource, we keep a timeline of such events married to a snapshot of the associated resource(s) in the inventory at the time of the event occurrence.

Challenges

In the realm of NoSQL logging and reporting, consider the problems involved in monitoring a dynamic distributed environment with tools that are usually specific to a single technology. As Don put it, “what we need to understand is that the virtual and physical resources these products run on often overlap. At the very least there are server farms and networks that are shared”. NoSQL logging and reporting tools need to be able to identify patterns and relationships. They need an “elastic cross-technology solution that gets information on how [the technologies] impinge on one another in a common fabric”.

Another issue: in an environment where nodes are coming and going, a monitoring tool has to keep track of which nodes are current and which are not. As Don said “If we don’t get a report from a node, does that mean it’s just offline, or it has a problem? We also have to know at any given time in the history of our collection what nodes are available. Sampling a number of times helps you get a picture”. Sufficient samples over time can help ascertain whether a node’s state fluctuates a lot, with the caveat that maybe one can never be completely certain of even that. Maybe one rule could be that if a node is always ‘on’, and we get no reports on it for a [fill in the time frame], then we can conclude it has a problem; I think you see the challenge here.

Opportunities

Don understands the challenge and opportunity of NoSQL logging and reporting well; he says that “keeping the best snapshot” of a monitored resource is what we are striving to do with ECS, trying to “identify principal players” that a customer installation consists of, be they caches, nodes in a cluster, whatever they may be, in what is called our inventory. We can’t simply rely on current state, nor rely on history, but rather a combination of the two; “so that’s some of the stuff we’ve been looking at. If we can usefully compare what is in inventory now to what was there in the past, we’ll discover things we hadn’t even thought about, such as usage patterns and virtual and physical resources in that environment. I’m not sure we’ll be initially able to assess causality, but I think we can establish a footprint and allow the user to be able to explore and draw conclusions; I think we can give them the basis for that information.”

“It’s challenging to pinpoint cause and effect. For example, it’s difficult to determine that your publisher success rate is low because your CPU is maxed out; maybe your CPU is maxed out because you are attempting to do so much publishing that you’ve saturated that machine, leading to a low success rate; we can at least give them hints. We can also begin, with ECS 5.0, to give them projections, maybe presented graphically and in a number of visual perspectives; maybe an incident matrix.”

Regardless of what we deliver, Don says that “we want to give them something navigable, so they can begin to see where things are ‘lighting up’, then move distances away. So if a Cassandra cache was causing problems I could inspect the host. Oh, and now that I’m at that host, I can see there’s some other technology on there that’s beginning to have a lot of events.” Maybe this second technology is the actual root cause of the problem that first surfaced in the original monitored resource, a technology that is possible in a different tier, of a different NoSQL or caching technology, or maybe a servlet… you get the picture.

Don says “Being able to provide those hints and that navigation becomes even more important as the size and scale of these systems becomes such an issue that it becomes really difficult to monitor and manage the new environment without some event information or other heuristics that we’ve applied, a view that limits the scope of what they’re seeing to a space that we believe is related to problems that can explore causality. So, that’s one of the things we want to introduce in 5.0, some interesting visualizations, to help them navigate around.”

Weaving powerful semantics among tiers and domains based on an ever growing a better understood inventory of resources will provide a platform for discovering interrelationships whose understanding can well serve both root cause analysis and enforcement of SLA’s. This is the future of Evident ClearStone, as well as its present.

Learn more about our performance monitoring solution for Java, NoSQL and web servers

By Ivan Ho

As an Application Performance Management tool, Evident ClearStone collects and receives real-time data from disparate applications and systems in a distributed application environment. 90% of the data that ClearStone manages is the performance data from the application processes and systems. The other 10% are related to the inventory and events of the managed environment.

ClearStone 5 required an elastic data store that can handle all the scalability, performance, persistence, and flexibility requirements demanded of this class of product. The demands placed on this elastic data store ruled out the use of a traditional RDBMS backend.   The solution must have options for load balancing and high availability – being able to distribute and replicate data was essential. It was obvious that a NoSQL DB would come into play.

For example, in a single application environment with multiple servers, ClearStone would monitor system level metrics (CPU, disk, network, I/O) , application container performance, Java platform measurements, a distributed Cassandra cluster or other NoSQL clusters, and a database like PostgresDB.  The schemas of the performance data vary significantly from one type of resource to another, therefore we also needed a solution that would provide  flexibility in dynamic schema creation and updates while keeping the system running 24×7. At time when this was being developed, there was no one data caching, database system, or NoSQL solution that met all of these requirements. Ultimately, the Evident architects decided on using two NoSQL solutions, Apache Cassandra and Neo4j.

Cassandra is implemented as a time-series data store for storing all the real-time data and historical data. Our implementation uses Apache Cassandra 0.7 with the Hector client APIs for Cassandra. With Cassandra 0.7, we can dynamically create and evolve column families for storing all the performance data. The performance data is normalized by metric. We have also partitioned our column families based on the granularity of the data sets.

Neo4j is implemented as an inventory database used for maintaining all the managed resources (i.e. processes, hosts, clusters, etc.) of the application environment. It is used to store current state of all the resources, relationships among the resources, and correlated events to these resources. Anytime there are events associated with a resource, we keep a timeline of such events married to a snapshot of the associated resource(s) in the inventory at the time of the event occurrence. We felt the use of a graph database like Neo4j was ideal for storing metadata for the resources and mapping relationships and correlated events.

Beyond the performance and scalability benefits, we found that there were also some fringe benefits to each of these products. Both of these NoSQL systems have the options of being embedded into ClearStone or run as isolated clusters managed by ECS across multiple servers. Lastly both of these two products are cloud-friendly, therefore enabling ClearStone to be deployed in both traditional enterprise environment and cloud environments.

If you would like to learn more about our experience with either one of these NoSQL solutions, please come visit our Support Section and post your questions.

If you would like to try ClearStone, please join our Beta Program here.