Apache Eagle keeps an eye on big data usage

Apache Eagle keeps an eye on big data usage

Apache Eagle, originally developed at eBay, then donated to the Apache Software Foundation, fills a big data security niche that remains thinly populated, if not bare: It sniffs out possible security and performance issues with big data frameworks.

To do so, Eagle uses other Apache open source components, such as Kafka, Spark, and Storm, to generate and analyze machine learning models from the behavioral data of big data clusters.

Looking in from the inside

Data for Eagle can come from activity logs for various data source (HDFS, Hive, MapR FS, Cassandra) or from performance metrics harvested directly from frameworks like Spark. The data can then be piped by the Kafka streaming framework into a real-time detection system that’s built with Apache Storm or into a model-training system built on Apache Spark. The former’s for generating alerts and reports based on existing policies; the latter is for creating machine learning models to drive new policies.

This emphasis on real-time behavior tops the list of “key qualities” in the documentation for Eagle. It’s followed by “scalability,” “metadata driven” (meaning changes to policies are deployed automatically when their metadata is changed), and “extensibility.” This last means the data sources, alerting systems, and policy engines used by Eagle are supplied by plugins and aren’t limited to what’s in the box.

Because Eagle was put together from existing parts of the Hadoop world, it has two theoretical advantages. One, there’s less reinvention of the wheel. Two, those who already have experience with the pieces in question will have a leg up.

What are my people up to?

Aside from the above-mentioned use cases like analyzing job performance and monitoring for anomalous behavior, Eagle can also analyze user behaviors. This isn’t about, say, analyzing data from a web application to learn about the public users of the app, but rather the users of the big data framework itself — the folks building and managing the Hadoop or Spark back end. An example of how to run such analysis is included, and it could be deployed as-is or modified.

Eagle also allows application data access to be classified according to levels of sensitivity. Only HDFS, Hive, and HBase applications can make use of this feature right now, but its interaction with them provides a model for how other data sources could also be classified.

Let’s keep this under control

Because big data frameworks are fast-moving creations, it’s been tough to build reliable security around them. Eagle’s premise is that it can provide policy-based analysis and alerting as a possible complement to other projects like Apache Ranger. Ranger provides authentication and access control across Hadoop and its related technologies; Eagle gives you some idea of what people are doing once they’re allowed inside.

The biggest question hovering over Eagle’s future — yes, even this early on — is to what degree Hadoop vendors will elegantly roll it into their existing distributions or use their own security offerings. Data security and governance have long been one of the missing pieces that commercial offerings could compete on.