Log data collected from operating systems, appliances, and applications is extremely unstructured, far more so than most people expect. Unless you are an analyst who works with logs, or you have spent time “grepping” through log files, you are probably not aware of the endless variety of formats and the lack of standardization. If so, you are likely missing out on the complete story logs can tell, whether that is security incidents, IT operations irregularities, or a myriad of other business insights.

A typical log message includes a line of text with a timestamp attached to it. Something like this:

virtualhost:443 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

This is an entry from an Apache web server access log, and is very often used as an example when demonstrating log-related tools. How this needs to be processed may not be immediately apparent to you, but this has become such a widespread format that most (if not all) tools are able to extract the important elements of this message, so that you can get meaningful insight into what your web server is doing.
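To make "extracting the important elements" concrete, here is a minimal sketch that pulls named fields out of the sample line with a regular expression. The field names (vhost, client, user and so on) are illustrative assumptions, not an official schema, and the pattern is written only against the one sample shown above.

```python
import re

# Pattern for the access log line shown above (virtual host followed by
# the "combined" fields). Group names are illustrative assumptions.
ACCESS_LOG = re.compile(
    r'(?P<vhost>\S+) (?P<client>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_access_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = ACCESS_LOG.match(line)
    return match.groupdict() if match else None


SAMPLE = (
    'virtualhost:443 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" '
    '"Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

if __name__ == "__main__":
    for key, value in parse_access_line(SAMPLE).items():
        print(f"{key}: {value}")
```

All values come back as strings; converting the status and size to integers, or the timestamp to a datetime, is left to the consumer.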

The ugly truth, however, is that even a simple event can take many shapes and forms, even when every variant originates from the same computer. For instance, let me log in to my laptop a few times:

Login from SSH:

Feb 10 08:59:40 bzorp sshd[3273356]: Accepted password for bazsi from 127.0.0.1 port 45782 ssh2
Feb 10 08:59:40 bzorp sshd[3273356]: pam_unix(sshd:session): session opened for user bazsi(uid=1000) by (uid=0)

Login from the console:

Feb 10 09:00:12 bzorp login[3273950]: pam_unix(login:session): session opened for user bazsi(uid=1000) by LOGIN(uid=0)
Feb 10 09:00:12 bzorp systemd-logind[2292]: New session 3249 of user bazsi.

Login from the graphical interface:

Feb 10 09:00:23 bzorp gdm-password][3277695]: pam_unix(gdm-password:session): session opened for user bazsi(uid=1000) by (uid=0)
Feb 10 09:00:24 bzorp systemd-logind[2292]: New session 3253 of user bazsi.

At the very least, you might think that the “session opened” message is the same across all software that handles authentication and login. Well, no. It depends on whether the system uses PAM, on the version of the PAM libraries, and on whether the specific application has chosen to use PAM for authentication. These are details that the final consumer of the logs may or may not care about, but they can still substantially alter the format of the log events, making it much harder to extract what is important. So how do we extract these important elements from unstructured data? Read on to find out how Dynamic Message Classification can significantly ease the burden of deploying Observability in your enterprise, and overcome a key limitation these solutions have been saddled with for years.
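As a rough illustration of what such extraction involves, the sketch below maps the three pam_unix "session opened" lines above onto one common event shape. The output field names and the function name are hypothetical, and the patterns are fitted only to the sample lines: they tolerate the stray "]" in the gdm-password line and treat the "(uid=...)" suffix after the user name as optional, on the assumption that not every PAM version prints it.

```python
import re

# Syslog header as it appears in the samples: timestamp, host, program[pid].
# The optional "]" before "[" covers the "gdm-password][3277695]" sample.
HEADER = re.compile(
    r'^(?P<timestamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) '
    r'(?P<program>[^\[:]+?)\]?\[(?P<pid>\d+)\]: (?P<message>.*)$'
)

# pam_unix session message; the uid suffix is optional (assumption).
PAM_SESSION_OPENED = re.compile(
    r'pam_unix\((?P<service>[^:]+):session\): session opened for user '
    r'(?P<user>[^\s(]+)(?:\(uid=(?P<uid>\d+)\))?'
)


def classify_login(line):
    """Return a normalized login event dict, or None if not recognized."""
    header = HEADER.match(line)
    if not header:
        return None
    pam = PAM_SESSION_OPENED.match(header["message"])
    if not pam:
        return None
    return {
        "event": "session_opened",  # hypothetical normalized event name
        "host": header["host"],
        "program": header["program"],
        "service": pam["service"],
        "user": pam["user"],
        "uid": pam["uid"],
    }


SAMPLES = [
    "Feb 10 08:59:40 bzorp sshd[3273356]: pam_unix(sshd:session): "
    "session opened for user bazsi(uid=1000) by (uid=0)",
    "Feb 10 09:00:12 bzorp login[3273950]: pam_unix(login:session): "
    "session opened for user bazsi(uid=1000) by LOGIN(uid=0)",
    "Feb 10 09:00:23 bzorp gdm-password][3277695]: "
    "pam_unix(gdm-password:session): session opened for user "
    "bazsi(uid=1000) by (uid=0)",
]

if __name__ == "__main__":
    for sample in SAMPLES:
        print(classify_login(sample))
```

Note that the sshd "Accepted password" and systemd-logind "New session" lines return None here. Each would need its own pattern, which is exactly how the per-source effort piles up.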

 

Message Classification is Required for Observability

Message classification, or curation, is the process of formatting and/or enriching logs so that the relevant elements are highlighted and easy to access for data consumers such as a SIEM. Proper processing of raw application, device, and observability data is an absolute requirement for building effective queries in a SIEM or other analytics tool. Those queries drive dashboards and alerts, and ultimately allow the data to tell its story. The query languages behind these tools need an effective schema for each data source to work optimally, and these schemas depend on accurate message classification. Coming up with these schemas for hundreds of data sources, to drive all those shiny dashboards in your SOC, is difficult to do even once. Consequently, it is a task that most administrators never fully complete. To make things worse, this is also a moving target: you may have finished your complete solution a week ago, only to be faced with new data sources added in the last few days. It’s not enough to simply deploy an agent on the edge and then expect nice dashboards in a SIEM. This may work for a few well-formatted and consistent data sources, but things quickly devolve into an intractable mess as more complex and varied sources are onboarded.

There is no easy solution here. Should anyone claim the contrary, they are probably not telling you the whole truth. I don’t see the situation changing for the better in the foreseeable future unless the industry adopts a fundamental change in its approach to message classification and curation. The log events emitted by applications and devices originate from code developed by diverse sets of software engineers, each with their own idea of what a “good” log entry looks like (or no opinion at all, which can be even worse). Industry attempts at creating a common log message format have only produced yet more standards, adding to the variety instead of reducing it.

So how, with logs being such a critical input to all security and IT operations efforts, do we bridge this gap?

 

Dynamic Message Classification

The direction to take from where we stand today is to realize that collecting, classifying, and curating logs and observability data is a dynamic process, just like the deployment and engineering of the applications and firmware that generate the logs in the first place. It has been a long-held belief (or wish, rather) that this initial processing of logs is a static “set and forget” task when deploying a SIEM or log analytics solution. We should instead treat the process with the same agile CI/CD approach that underpins modern software development. In short, the industry needs a means to continuously measure, observe, and improve the “Observability Supply Chain” as a whole. That frees each downstream analytics tool from this onerous (and resource-intensive) task, and allows far more flexibility in choosing (and potentially replacing) the individual analytics and storage components of an enterprise logging/observability architecture.
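One simple way to picture "continuously measure" is to track what share of incoming messages no classifier recognizes, and to watch that number as new sources are onboarded. The sketch below is only an illustration of that idea under stated assumptions: the classifier table, the labels, and the last sample line (from a made-up "myapp" program) are hypothetical and do not describe any particular product.

```python
import re
from collections import Counter

# Hypothetical classifier table: (label, pattern) pairs, checked in order.
CLASSIFIERS = [
    ("pam_session_opened",
     re.compile(r"pam_unix\([^)]*\): session opened for user \S+")),
    ("sshd_accepted",
     re.compile(r"sshd\[\d+\]: Accepted \S+ for \S+ from \S+")),
    ("logind_new_session",
     re.compile(r"systemd-logind\[\d+\]: New session \d+ of user \S+")),
]


def classify(line):
    """Return the label of the first matching pattern, else 'unclassified'."""
    for label, pattern in CLASSIFIERS:
        if pattern.search(line):
            return label
    return "unclassified"


def coverage_report(lines):
    """Count messages per label and return (counts, classified ratio)."""
    counts = Counter(classify(line) for line in lines)
    total = sum(counts.values())
    classified = total - counts["unclassified"]
    ratio = classified / total if total else 0.0
    return counts, ratio


SAMPLE_LINES = [
    "Feb 10 08:59:40 bzorp sshd[3273356]: Accepted password for bazsi "
    "from 127.0.0.1 port 45782 ssh2",
    "Feb 10 08:59:40 bzorp sshd[3273356]: pam_unix(sshd:session): "
    "session opened for user bazsi(uid=1000) by (uid=0)",
    "Feb 10 09:00:12 bzorp systemd-logind[2292]: New session 3249 of "
    "user bazsi.",
    # Hypothetical line from a newly onboarded, not yet classified source.
    "Feb 10 09:01:00 bzorp myapp[42]: something new happened",
]

if __name__ == "__main__":
    counts, ratio = coverage_report(SAMPLE_LINES)
    print(dict(counts))
    print(f"classified: {ratio:.0%}")
```

A falling ratio after a deployment or a new data source is a signal that the classification rules need attention, in the same way a failing test signals a regression in a CI/CD pipeline.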

With this, we come full circle to the “message classification” problem mentioned in the title. An industry solution to this problem will relieve your analytics toolset of the never-ending task of classifying and curating unstructured data, and let your logging deliver the operational and business insights bundled up in this critical data!

Follow Our Progress!

We are excited to be realizing our vision above with a full Axoflow product suite.
