
IT Event Analytics: The Complete Introduction

Event analytics is a computing process that addresses the triage and resolution of IT events and incidents. An event can describe any change in state or condition of a component on your network. Over the course of regular operation, all technology devices create events in the form of log entries and regular status updates, which are recorded as event data in various databases and other files. Event analytics aims to improve the management and understanding of these events.

Historically, these events — and subsequent event actions — had to be managed individually by human analysts, either as the events emerged or by manually searching through log files to look for anomalies and outliers. Event management systems eventually evolved, giving IT managers a way to sift through the various event alerts and streamline operations. However, as networks continued to grow, the number and complexity of alerts in many large-scale enterprises quickly became unwieldy. As such, it’s common to find multiple tools that manage various events — from the success of GTM and marketing campaigns to network latency — in different segments of the organization.

Event analytics is the next generation of event management, designed to consolidate multiple systems into a centralized platform, which simplifies discovery of the root cause of any given problem. As with cloud and universal analytics, machine learning algorithms have also automated much of this process, requiring less human interaction to successfully resolve an event and giving organizations across all industries a significant competitive edge in the fast-paced digital age.




Defining an "Event" in Analytics

In analytics, an event is a record that refers to a change in the state of a device on the network. Events are typically generated with extreme regularity. For example, a server may record an event action or entry every time a web page receives a certain number of pageviews or link clicks, or any other user interaction. In a busy environment, the result can be event log files that consume multiple terabytes of storage for every day of operation.

An event doesn’t have to be negative. In fact, depending on the type of interaction, the vast majority of events are innocuous and expected. Events fall into one of three categories: informational, warning or exception, corresponding to escalating levels of concern. Event management and event analytics tools are designed to help teams sift through event data to determine where real problems lie; they facilitate creating an event category composed of notable events, episodes, incidents and other actionable occurrences; and they generate regular reports, alerting management in a timely fashion if something is amiss.
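The category split above can be sketched in a few lines of Python. This is a minimal, hypothetical mapping — the numeric severity ranges and field names are illustrative, not any particular product's schema:

```python
# Hypothetical mapping of raw severity codes to the three broad event
# categories. Real tools define their own severity scales.

def categorize_event(severity: int) -> str:
    """Bucket an event into informational, warning or exception."""
    if severity >= 7:
        return "exception"      # actionable: needs triage
    if severity >= 4:
        return "warning"        # worth watching
    return "informational"      # routine status update

events = [
    {"msg": "heartbeat ok",    "severity": 1},
    {"msg": "disk 85% full",   "severity": 5},
    {"msg": "service crashed", "severity": 9},
]

# Filtering out the innocuous majority leaves the events worth a look.
noteworthy = [e for e in events
              if categorize_event(e["severity"]) != "informational"]
```

In practice the thresholds would come from policy, not hard-coded constants, but the shape of the filtering step is the same.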

Defining Event Data

Event data refers to the supplemental information attached to specific events in an analytics index. Event data commonly includes information such as character set encoding, time stamping, user-defined metadata, non-interaction events, and other standardized fields such as the name of the host or event source. Events can also be segmented within the event data and anonymized if needed.

Among other things, event data enables faster searches, easier analysis and the ability to categorize events based on user criteria. Also, as the event data index grows, anomalies become easier to spot with event tracking, illuminating long-term trends.
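A sketch of what an indexed event record might look like, and how its standardized fields enable faster field-based searches. The field names here are illustrative assumptions, not a specific index schema:

```python
import datetime

# Hypothetical shape of one indexed event: the raw log line plus
# standardized fields (host, source, timestamp) added at index time.
event = {
    "timestamp": datetime.datetime(2024, 1, 15, 3, 42, 7),
    "host": "web-01",
    "source": "/var/log/nginx/access.log",
    "sourcetype": "access_combined",
    "raw": '203.0.113.9 - - "GET /checkout HTTP/1.1" 500 312',
    "user_id": None,  # anonymized on ingest
}

def search(index, **criteria):
    """Return events whose standardized fields match all criteria."""
    return [e for e in index
            if all(e.get(k) == v for k, v in criteria.items())]

matches = search([event], host="web-01", sourcetype="access_combined")
```

Because the criteria hit pre-extracted fields rather than the raw text, this style of lookup stays fast even as the index grows.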

Defining an "Incident"

An incident is triggered when there is a security breach or an interruption of service. From an analytics standpoint, an incident begins when a pattern in the event index is discovered that may have relevance to an enterprise system’s security or operation. When a suspicious pattern is uncovered, an entry known as a notable event is generated, identified with an event tracking code. In the event analytics tool, an incident dashboard will display all notable events and episodes. The notable events receive a numerical value or other code assigning an event value (or event label) that categorizes them by severity so the top events can be quickly triaged, assigned, tracked and closed.
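The severity-driven triage queue described above can be sketched as a simple sort over notable events. Field names and severity values are illustrative:

```python
# Hypothetical incident-dashboard triage: each notable event carries a
# numeric severity (its "event value"); the queue surfaces the most
# severe open items first.
notable_events = [
    {"id": "NE-101", "title": "Failed logins spike", "severity": 6, "status": "new"},
    {"id": "NE-102", "title": "Cert expiring",       "severity": 3, "status": "new"},
    {"id": "NE-103", "title": "Outbound C2 traffic", "severity": 9, "status": "new"},
]

def triage_queue(events):
    """Order open notable events so the top (most severe) come first."""
    return sorted((e for e in events if e["status"] != "closed"),
                  key=lambda e: e["severity"], reverse=True)

top = triage_queue(notable_events)[0]
```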

A typical incident workflow involves an administrative analyst who monitors the incident review dashboard, assigning a numeric value to new notable events while performing relatively cursory, high-level triage as they are created and populated in the event analytics tool. When a notable event warrants investigation, the administrative analyst assigns the event to a reviewing analyst, who begins a formal investigation of the incident. When the investigation concludes, the reviewing analyst pushes the event to a final analyst, who verifies the changes and formally closes the case.
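The three-analyst handoff can be modeled as a small table of allowed status transitions. The role and status names are assumptions made for illustration, not a specific tool's workflow schema:

```python
# Minimal sketch of the incident workflow as a state machine: each
# (status, role) pair maps to the next status that role may set.
TRANSITIONS = {
    ("new", "admin_analyst"): "assigned",            # cursory triage, handoff
    ("assigned", "reviewing_analyst"): "resolved",   # formal investigation
    ("resolved", "final_analyst"): "closed",         # verification and close
}

def advance(status: str, role: str) -> str:
    """Move an incident to its next status, if this role may do so."""
    if (status, role) not in TRANSITIONS:
        raise PermissionError(f"{role} cannot act on status {status!r}")
    return TRANSITIONS[(status, role)]

status = "new"
for role in ("admin_analyst", "reviewing_analyst", "final_analyst"):
    status = advance(status, role)
```

Encoding the workflow this way makes it easy to audit who moved an incident, and prevents, say, a final analyst from closing an incident that was never investigated.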

Using Correlation Search to Identify Events and Incidents

A correlation search is a type of scheduled or recurring search of analytics event logs that monitors for suspicious events or patterns. Users can configure a correlation search to generate a notable event when certain conditions are met. These notable events are assigned a risk score, which then generates a corresponding alert action used for event tracking. Correlation searches can span multiple types of data, which allows IT managers to more accurately identify suspicious attack patterns.
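The core loop of a correlation search — scan the index, match a pattern, emit a notable event with a risk score and an alert action — can be sketched as follows. The threshold, the risk-score formula and the field names are all hypothetical:

```python
# Hedged sketch of a correlation search: a recurring query that flags
# hosts with repeated failed logins and emits notable events.

def correlation_search(events, threshold=5):
    """Emit a notable event for each host whose failed-login count
    meets the threshold, with a simple illustrative risk score."""
    failures = {}
    for e in events:
        if e["action"] == "login_failed":
            failures[e["host"]] = failures.get(e["host"], 0) + 1
    return [
        {"type": "notable_event",
         "host": host,
         "risk_score": min(100, count * 10),   # hypothetical scoring rule
         "alert_action": "notify_soc"}
        for host, count in failures.items() if count >= threshold
    ]

index = [{"host": "vpn-01", "action": "login_failed"}] * 6
notables = correlation_search(index)
```

A real correlation search would run on a schedule against the live index; the point here is only the pattern-match-then-emit structure.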

A correlation search is typically used to address a specific security need, such as user clicks leading to a malicious download, infected outbound links, malicious JavaScript embedded in a web page element, or a plugin that contains vulnerabilities. If you want to be alerted when security incidents happen, a correlation search automates this process through regular searches of the event index.


Identifying Notable Events

A notable event, or top event, is a specific event record that is generated by a correlation search. Typically this takes the form of an alert or the creation of an incident, delivered to the analyst after certain criteria have been met. Essentially, notable events are the ones you need to worry about.

A notable event aggregation policy is used to group together and organize notable events. These policies can be set by a human analyst or a machine learning algorithm that automates their implementation. During this process, duplicate entries are removed, and an episode is created once the events are appropriately grouped. The notable aggregation policy contains both the notable events and the rules that automate the actions in response to an episode. (In other words, the policy contains both the problem and the solution.)
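The aggregation step — deduplicate notable events, cluster the rest, and wrap each cluster in an episode together with its automated response rule — can be sketched like this. The grouping field, the dedupe key and the action name are illustrative assumptions:

```python
# Sketch of a notable event aggregation policy: remove duplicates,
# group notable events by a shared field, and attach the automated
# response rule to each resulting episode.

def aggregate(notables, group_by="service", action="restart_service"):
    """Dedupe notable events and cluster them into episodes."""
    seen, episodes = set(), {}
    for n in notables:
        key = (n["service"], n["title"])
        if key in seen:                 # drop duplicate entries
            continue
        seen.add(key)
        episodes.setdefault(n[group_by], {"events": [], "rule": action})
        episodes[n[group_by]]["events"].append(n)
    return episodes

notables = [
    {"service": "checkout", "title": "High latency"},
    {"service": "checkout", "title": "High latency"},   # duplicate
    {"service": "checkout", "title": "Error rate spike"},
]
episodes = aggregate(notables)
```

Note how the returned episode carries both the grouped events and the rule — the "problem and the solution" packaging described above.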

Groups of Notable Events: Becoming an Episode

An episode is a group of notable events that have been identified and clustered together in an event category by an event analytics system. Episodes typically describe potentially serious events that have an adverse effect on service – for example, a network outage, an application that is no longer running, malicious JavaScript or an infected plugin, a security breach, or user interactions resulting in data loss. An episode is generally a subset of an incident, which describes a longer and more sustained sequence of events.

An episode review system denotes the severity of the notable events and assigns an event value or event label to top events, which helps analysts triage the various episodes impacting the network. Episodes also include a status tag denoting whether an episode is new, in progress (and to which analyst it is assigned), resolved (when the problem is solved by the responding analyst), or closed (when a final analyst has reviewed the work and validated that the changes made are appropriate).

Using Alert Triggers

Analysts can configure alert trigger conditions, and the resulting alerts can fire in real time or on a schedule. When an analyst creates a set of alert trigger conditions, correlation search results (see above) are checked to see if they match the conditions. These conditions allow an analyst to track multiple events and event data fields. For example, an alert trigger condition may be set to monitor the bounce rate of a web page and fire when the bounce rate exceeds a prescribed threshold.

There are two kinds of alerts: real-time alerts and scheduled alerts. Real-time alerts monitor for qualifying events and are useful in cases where an emergency exception situation has been created, such as a crashed server or a network outage. Real-time alerts can be set with a per-result triggering condition for specific events, which notifies the analyst every time the condition is met. Or they can be configured with a rolling time window triggering condition, which monitors event conditions in real time, but only generates an alert if conditions are met several times during a specific window of time.
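The difference between the two real-time trigger styles can be sketched as follows. The window size and hit threshold are hypothetical parameters chosen for illustration:

```python
from collections import deque

# Per-result trigger: fires on every single matching event.
def per_result_trigger(event, condition):
    return condition(event)

# Rolling-window trigger: fires only when enough matches fall inside
# a sliding time window.
class RollingWindowTrigger:
    def __init__(self, window_seconds=60, min_hits=3):
        self.window, self.min_hits = window_seconds, min_hits
        self.hits = deque()

    def observe(self, timestamp, matched):
        """Record a match; return True when the window threshold is met."""
        if matched:
            self.hits.append(timestamp)
        while self.hits and timestamp - self.hits[0] > self.window:
            self.hits.popleft()          # expire hits outside the window
        return len(self.hits) >= self.min_hits

trigger = RollingWindowTrigger(window_seconds=60, min_hits=3)
# Three matches within 60 seconds fire the alert; a later lone match
# does not, because the earlier hits have aged out of the window.
fired = [trigger.observe(t, True) for t in (0, 20, 40, 120)]
```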

In contrast, scheduled alerts run periodically rather than in real time, typically during slower times of the day (such as late at night) or at the end of a given time period. An alert that checks whether more than ten failed credit card transactions were recorded during the previous day, or whether there were any link clicks on a malicious web page, would be a typical use case for a recurring scheduled alert.
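The failed-transaction example above reduces to a simple threshold check run once per period. The field names and threshold are assumptions mirroring the example, not a real product's query language:

```python
# Sketch of a scheduled alert: run once a day, fire if yesterday's
# index recorded more than ten failed credit card transactions.

def daily_alert(events, day, threshold=10):
    """Return True when failed transactions on `day` exceed the threshold."""
    failed = sum(1 for e in events
                 if e["day"] == day and e["type"] == "cc_transaction_failed")
    return failed > threshold

events = [{"day": "2024-01-14", "type": "cc_transaction_failed"}] * 12
alert_fired = daily_alert(events, "2024-01-14")
```

In a real deployment the periodic execution would come from the platform's scheduler (or something like cron), with only the condition left to the analyst.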

Event Analytics as a part of Event Management

Event analytics can be considered the next revolution of event management technology. Event management tools are used to monitor a specific component or system within an IT infrastructure, whereas event analytics often describes systems that have broad, cross-functional capabilities. That said, there is no broad consensus on standard usage of either term, and they are often used interchangeably.

Much like event analytics systems, event management tools provide detection, notification, and filtering services. They also allow for event response, generate event reports, and support closure via triage. Though event management tools may be narrower in scope, they can also be predictive and scalable. Event management is also a core component of SIEM (security information and event management), a progressive security management technology that lets security analysts respond to threats in real time.

Event management follows a very specific process, from event creation to incident resolution.

First, an event is created, likely with routine notifications regarding normal device changes in the enterprise. However, some will inevitably exceed the specifications and ranges defined in a correlation search by the event analyst. These are notable events, and they will be recorded as such in the event management system.

As notable events are filtered and aggregated, they are collected into an episode, which defines a cluster of notable events in an event category. It is the event analyst’s job to act upon these episodes, first determining event value and performing triage to decide how best to remedy the episode, then either resolving it directly or assigning it to another analyst. The status of the episode is updated to “in progress” or “pending” while it is being investigated and acted upon. Once a fix is in place, the episode is marked as resolved.

A final analyst reviews all of the above and, once satisfied with the results, closes the episode.




Getting Started with Event Analytics

It’s easy to get started with today’s event monitoring and event analytics tools. While installing on-premise software is always an option, cloud-based analytics tools, or cloud analytics, are an increasingly appealing choice. Cloud-based analytics tools are simpler to install, require much less in the way of infrastructure, and are just as configurable and customizable as on-premise tools.

Some key features to look for:

  • Support for a wide variety of types of machine data and technology devices (servers, switches, databases, web servers, etc.)
  • Live dashboards that update in real-time, with easy-to-understand visualizations that simplify episode data for analysts
  • Centralized event collection from both on-premise and cloud-based sources
  • Ability to perform triage on episodes and assign remedial actions accordingly
  • Ability to easily create and manage correlation searches
  • Proactive logic that forecasts problems before they arise
  • Long-term data storage
  • APIs and SDKs that allow for more refined customizations and connection to other tools


This posting does not necessarily represent Splunk's position, strategies or opinion.

Posted by Stephen Watts

Stephen Watts works in growth marketing at Splunk. Stephen holds a degree in Philosophy from Auburn University and is an MSIS candidate at UC Denver. He contributes to a variety of publications including CIO.com, Search Engine Journal, ITSM.Tools, IT Chronicles, DZone, and CompTIA.