hdfs: Store messages on the Hadoop Distributed File System (HDFS)

Starting with version 3.7, AxoSyslog can send plain-text log files to the Hadoop Distributed File System (HDFS), allowing you to store your log data on a distributed, scalable file system. This is especially useful if you have huge amounts of log messages that would be difficult to store otherwise, or if you want to process your messages using Hadoop tools (for example, Apache Pig).

Note the following limitations when using the AxoSyslog hdfs destination:

  • Since AxoSyslog uses the official Java HDFS client, the hdfs destination has significant memory usage (about 400MB).

  • You cannot set when log messages are flushed. Hadoop performs this action automatically, depending on its configured block size, and the amount of data received. There is no way for the AxoSyslog application to influence when the messages are actually written to disk. This means that AxoSyslog cannot guarantee that a message sent to HDFS is actually written to disk. When using flow-control, AxoSyslog acknowledges a message as written to disk when it passes the message to the HDFS client. This method is as reliable as your HDFS environment.
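
As a minimal sketch of the flow-control behavior described above, the following log path enables flow-control for an hdfs destination. The names s_local and d_hdfs are assumptions for illustration, and both the source and the destination are assumed to be defined elsewhere in the configuration.

    log {
        source(s_local);
        destination(d_hdfs);
        # The message counts as delivered once it is handed to the HDFS client,
        # not when HDFS has actually written it to disk.
        flags(flow-control);
    };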

Declaration:

    @include "scl.conf"
    
    hdfs(
        client-lib-dir("/opt/syslog-ng/lib/syslog-ng/java-modules/:<path-to-preinstalled-hadoop-libraries>")
        hdfs-uri("hdfs://NameNode:8020")
        hdfs-file("<path-to-logfile>")
    );

Example: Storing logfiles on HDFS

The following example defines an hdfs destination using only the required parameters.

    @include "scl.conf"
    
    destination d_hdfs {
        hdfs(
            client-lib-dir("/opt/syslog-ng/lib/syslog-ng/java-modules/:/opt/hadoop/libs")
            hdfs-uri("hdfs://10.140.32.80:8020")
            hdfs-file("/user/log/logfile.txt")
        );
    };

The hdfs() driver is actually a reusable configuration snippet configured to receive log messages using the Java language-binding of AxoSyslog. For details on using or writing such configuration snippets, see Reusing configuration blocks. You can find the source of the hdfs configuration snippet on GitHub. For details on extending AxoSyslog in Java, see the Getting started with syslog-ng development guide.

1 - Prerequisites

To send messages from AxoSyslog to HDFS, complete the following steps.

Steps:

  1. If you want to use the Java-based modules of AxoSyslog (for example, the Elasticsearch, HDFS, or Kafka destinations), you must compile AxoSyslog with Java support.

    • Download and install the Java Runtime Environment (JRE), 1.7 (or newer). You can use OpenJDK or Oracle JDK; other implementations are not tested.

    • Install gradle version 2.2.1 or newer.

    • Set LD_LIBRARY_PATH to include the directory that contains the libjvm.so file, for example: LD_LIBRARY_PATH=/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server:$LD_LIBRARY_PATH

      Note that many platforms have simplified links for Java libraries. Use the simplified path if available. If you use a startup script to start AxoSyslog, set LD_LIBRARY_PATH in the script as well.

    • If you are behind an HTTP proxy, create a gradle.properties under the modules/java-modules/ directory. Set the proxy parameters in the file. For details, see The Gradle User Guide.

  2. Download the Hadoop Distributed File System (HDFS) libraries (version 2.x) from http://hadoop.apache.org/releases.html.

  3. Extract the HDFS libraries into a temporary directory, then collect the various .jar files into a single directory (for example, /opt/hadoop/libs/) where AxoSyslog can access them. You must specify this directory in the AxoSyslog configuration file. The files are located in the various lib directories under the share/ directory of the Hadoop release package. For example, in Hadoop 2.7, the required files are common/hadoop-common-2.7.0.jar, common/lib/*.jar, hdfs/hadoop-hdfs-2.7.0.jar, and hdfs/lib/*. This may change between Hadoop releases, so it is easier to copy every .jar file into a single directory.

2 - How AxoSyslog interacts with HDFS

The AxoSyslog application sends the log messages to the official HDFS client library, which forwards the data to the HDFS nodes. The way AxoSyslog interacts with HDFS is described in the following steps.

  1. After AxoSyslog is started and the first message arrives at the hdfs destination, the hdfs destination tries to connect to the HDFS NameNode. If the connection fails, AxoSyslog tries to connect again each time the period set in time-reopen() expires.

  2. AxoSyslog checks if the path to the logfile exists. If a directory does not exist, AxoSyslog automatically creates it. AxoSyslog creates the destination file (using the filename set in the AxoSyslog configuration file, with a UUID suffix to make it unique, for example, /usr/hadoop/logfile.txt.3dc1c59e-ab3b-4b71-9e81-93db477ed9d9) and writes the message into the file. After the file is created, AxoSyslog will write all incoming messages into the hdfs destination.

  3. If the HDFS client returns an error, AxoSyslog attempts to close the file, then opens a new file and repeats sending the message (trying to connect to HDFS and send the message), as set in the retries() parameter. If sending the message fails for retries() times, AxoSyslog drops the message.

  4. The AxoSyslog application closes the destination file in the following cases:

    • AxoSyslog is reloaded.

    • AxoSyslog is restarted.

    • The HDFS client returns an error.

  5. If the file is closed and you have set an archive directory, AxoSyslog moves the file to this directory. If AxoSyslog cannot move the file for some reason (for example, AxoSyslog cannot connect to the HDFS NameNode), the file remains at its original location, and AxoSyslog will not try to move it again.
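
The steps above can be tied to configuration options as in the following sketch. It reuses the paths and the NameNode address from the earlier example; the destination name d_hdfs_archive and the numeric values are assumptions for illustration, and time-reopen() is set here as a global option.

    @include "scl.conf"

    options {
        # Wait 10 seconds before trying to reconnect to the NameNode (illustrative value).
        time-reopen(10);
    };

    destination d_hdfs_archive {
        hdfs(
            client-lib-dir("/opt/syslog-ng/lib/syslog-ng/java-modules/:/opt/hadoop/libs")
            hdfs-uri("hdfs://10.140.32.80:8020")
            hdfs-file("/user/log/logfile.txt")
            # Number of attempts to send a message if the HDFS client returns an error.
            retries(3)
            # Closed files are moved to this directory.
            hdfs-archive-dir("/usr/hdfs/archive/")
        );
    };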

3 - Storing messages with MapR-FS

The AxoSyslog application is also compatible with MapR File System (MapR-FS). MapR-FS provides better performance, reliability, efficiency, maintainability, and ease of use compared to the default Hadoop Distributed File System (HDFS). To use MapR-FS with AxoSyslog, complete the following steps:

  1. Install MapR libraries. Instead of the official Apache HDFS libraries, MapR uses different libraries. The supported version is MapR 4.x.

    1. Download the libraries from the Maven Repository and Artifacts for MapR, or get them from an already existing MapR installation.

    2. Install MapR. If you do not know how to install MapR, follow the instructions on the MapR website.

  2. In a default MapR installation, the required libraries are installed in the following path: /opt/mapr/lib.

    Enter the path where MapR was installed in the client-lib-dir() option of the hdfs destination, for example:

        client-lib-dir("/opt/syslog-ng/lib/syslog-ng/java-modules/:/opt/mapr/lib/")
    

    If the libraries were downloaded from the Maven Repository, the following additional libraries are required. Note that the version numbers in the filenames can differ between Hadoop releases: commons-collections-3.2.1.jar, commons-logging-1.1.3.jar, hadoop-auth-2.5.1.jar, log4j-1.2.15.jar, slf4j-api-1.7.5.jar, commons-configuration-1.6.jar, guava-13.0.1.jar, hadoop-common-2.5.1.jar, maprfs-4.0.2-mapr.jar, slf4j-log4j12-1.7.5.jar, commons-lang-2.5.jar, hadoop-0.20.2-dev-core.jar, json-20080701.jar, protobuf-java-2.5.0.jar, zookeeper-3.4.5-mapr-1406.jar.

  3. Configure the hdfs destination in AxoSyslog.

    Example: Storing logfiles with MapR-FS

    The following example defines an hdfs destination for MapR-FS using only the required parameters.

        @include "scl.conf"
    
        destination d_mapr {
            hdfs(
                client-lib-dir("/opt/syslog-ng/lib/syslog-ng/java-modules/:/opt/mapr/lib/")
                hdfs-uri("maprfs://10.140.32.80")
                hdfs-file("/user/log/logfile.txt")
            );
        };
    

4 - Kerberos authentication with the hdfs() destination

Version 3.10 and later supports Kerberos authentication to authenticate the connection to your Hadoop cluster. AxoSyslog assumes that you already have a Hadoop and Kerberos infrastructure.

Prerequisites:

  • You have configured your Hadoop infrastructure to use Kerberos authentication.

  • You have a keytab file and a principal for the host running AxoSyslog. For details, see the Kerberos documentation.

  • You have installed and configured the Kerberos client packages on the host running AxoSyslog. (That is, Kerberos authentication works for the host, for example, from the command line using the kinit user@REALM -k -t <keytab_file> command.)

    destination d_hdfs {
        hdfs(
            client-lib-dir("/hdfs-libs/lib")
            hdfs-uri("hdfs://hdp-kerberos.syslog-ng.example:8020")
            kerberos-keytab-file("/opt/syslog-ng/etc/hdfs.headless.keytab")
            kerberos-principal("hdfs-hdpkerberos@MYREALM")
            hdfs-file("/var/hdfs/test.log")
        );
    };

5 - HDFS destination options

The hdfs destination stores the log messages in files on the Hadoop Distributed File System (HDFS). The hdfs destination has the following options.

The following options are required: hdfs-file(), hdfs-uri(). Note that to use hdfs, you must add the following line to the beginning of your AxoSyslog configuration:

   @include "scl.conf"

client-lib-dir()

Type: string
Default: The AxoSyslog module directory: /opt/syslog-ng/lib/syslog-ng/java-modules/

Description: The list of paths where the required Java classes are located. For example, client-lib-dir("/opt/syslog-ng/lib/syslog-ng/java-modules/:/opt/my-java-libraries/libs/"). If you set this option multiple times in your AxoSyslog configuration (for example, because you have multiple Java-based destinations), AxoSyslog will merge all available paths into a single list.

For the hdfs destination, include the path to the directory where you copied the required libraries (see Prerequisites), for example, client-lib-dir("/opt/syslog-ng/lib/syslog-ng/java-modules/:/opt/hadoop/libs/").

disk-buffer()

Description: This option enables putting outgoing messages into the disk buffer of the destination to avoid message loss in case of a system failure on the destination side. It has the following options:

capacity-bytes()

Type: number (bytes)
Default: 1MiB

Description: This is a required option. The maximum size of the disk-buffer in bytes. The minimum value is 1048576 bytes. If you set a smaller value, the minimum value will be used automatically. It replaces the old log-disk-fifo-size() option.

In AxoSyslog version 4.2 and earlier, this option was called disk-buf-size().

compaction()

Type: yes/no
Default: no

Description: If set to yes, AxoSyslog prunes the unused space in the LogMessage representation, making the disk queue size smaller at the cost of some CPU time. Setting the compaction() argument to yes is recommended when numerous name-value pairs are unset during processing, or when the same names are set multiple times.

dir()

Type: string
Default: N/A

Description: Defines the folder where the disk-buffer files are stored.

flow-control-window-bytes()

Type: number (bytes)
Default: 163840000

Description: Use this option if the option reliable() is set to yes. This option contains the size of the messages in bytes that is used in the memory part of the disk buffer. It replaces the old log-fifo-size() option. It does not inherit the value of the global log-fifo-size() option, even if it is provided. Note that this option will be ignored if the option reliable() is set to no.

In AxoSyslog version 4.2 and earlier, this option was called mem-buf-size().

flow-control-window-size()

Type: number (messages)
Default: 10000

Description: Use this option if the option reliable() is set to no. This option contains the number of messages stored in the overflow queue. It replaces the old log-fifo-size() option. It inherits the value of the global log-fifo-size() option if provided. If it is not provided, the default value is 10000 messages. Note that this option will be ignored if the option reliable() is set to yes.

In AxoSyslog version 4.2 and earlier, this option was called mem-buf-length().

front-cache-size()

Type: number (messages)
Default: 1000

Description: The number of messages stored in the output buffer of the destination. Note that if you change the value of this option and the disk-buffer already exists, the change will take effect when the disk-buffer becomes empty.

Options reliable() and capacity-bytes() are required options.

In AxoSyslog version 4.2 and earlier, this option was called qout-size().

prealloc()

Type: yes/no
Default: no

Description:

By default, AxoSyslog doesn’t reserve the disk space for the disk-buffer file, since in a properly configured and sized environment the disk-buffer is practically empty, so a large preallocated disk-buffer file is just a waste of disk space. But a preallocated buffer can prevent other data from using the intended buffer space (and elicit a warning from the OS if disk space is low), preventing message loss if the buffer is actually needed. To avoid this problem, when using AxoSyslog 4.0 or later, you can preallocate the space for your disk-buffer files by setting prealloc(yes).

In addition to making sure that the required disk space is available when needed, preallocated disk-buffer files provide radically better (3-4x) performance as well: in case of an outage, the number of messages stored in the disk-buffer is continuously growing, and using large continuous files is faster than constantly waiting on a file to change its size.

If you are running AxoSyslog on a dedicated host (always recommended for any high-volume settings), use prealloc(yes).

Available in AxoSyslog 4.0 and later.

reliable()

Type: yes/no
Default: no

Description: If set to yes, AxoSyslog cannot lose logs in case of reload/restart, unreachable destination or AxoSyslog crash. This solution provides a slower, but reliable disk-buffer option. It is created and initialized at startup and gradually grows as new messages arrive. If set to no, the normal disk-buffer will be used. This provides a faster, but less reliable disk-buffer option.

truncate-size-ratio()

Type: number (between 0 and 1)
Default: 1 (do not truncate)

Description: Limits the truncation of the disk-buffer file. Truncating the disk-buffer file can slow down the disk IO operations, but it saves disk space. AxoSyslog version 4.0 and later doesn’t truncate disk-buffer files by default (truncate-size-ratio(1)). Earlier versions freed the disk space when at least 10% of the disk-buffer file could be freed (truncate-size-ratio(0.1)).

AxoSyslog only truncates the file if the possible disk gain is more than truncate-size-ratio() times capacity-bytes().

  • Smaller values free disk space quicker.
  • Larger ratios result in better performance.

If you want to avoid performance fluctuations:

Example: Examples for using disk-buffer()

In the following case reliable disk-buffer() is used.

destination d_demo {
    network(
        "127.0.0.1"
        port(3333)
        disk-buffer(
            flow-control-window-bytes(10000)
            capacity-bytes(2000000)
            reliable(yes)
            dir("/tmp/disk-buffer")
        )
    );
};

In the following case normal disk-buffer() is used.

destination d_demo {
    network(
        "127.0.0.1"
        port(3333)
        disk-buffer(
            flow-control-window-size(10000)
            capacity-bytes(2000000)
            reliable(no)
            dir("/tmp/disk-buffer")
        )
    );
};
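
The examples above use the network() destination. As a sketch of how the same options might look on an hdfs destination, the following configuration enables a reliable disk-buffer and sets truncate-size-ratio() explicitly. The destination name d_hdfs_buffered and all numeric values are assumptions for illustration; the paths and the NameNode address are taken from the earlier hdfs example.

    destination d_hdfs_buffered {
        hdfs(
            client-lib-dir("/opt/syslog-ng/lib/syslog-ng/java-modules/:/opt/hadoop/libs")
            hdfs-uri("hdfs://10.140.32.80:8020")
            hdfs-file("/user/log/logfile.txt")
            disk-buffer(
                reliable(yes)
                capacity-bytes(2000000)
                # Used only because reliable(yes) is set.
                flow-control-window-bytes(10000)
                dir("/tmp/disk-buffer")
                # 0.1 x 2000000 bytes = 200000 bytes: the file is truncated
                # only if more than 200000 bytes can be freed.
                truncate-size-ratio(0.1)
            )
        );
    };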

frac-digits()

Type: number
Default: 0

Description: The AxoSyslog application can store fractions of a second in the timestamps according to the ISO8601 format. The frac-digits() parameter specifies the number of digits stored. The digits storing the fractions are padded by zeros if the original timestamp of the message specifies only seconds. Fractions can always be stored for the time the message was received.

hdfs-append-enabled()

Type: true | false
Default: false

Description: When hdfs-append-enabled is set to true, AxoSyslog will append new data to the end of an already existing HDFS file. Note that in this case, archiving is automatically disabled, and AxoSyslog will ignore the hdfs-archive-dir option.

When hdfs-append-enabled is set to false, the AxoSyslog application always creates a new file if the previous one has been closed. In that case, appending data to existing files is not supported.

When you choose to write data into an existing file, AxoSyslog does not extend the filename with a UUID suffix because there is no need to open a new file (a new unique ID would mean opening a new file and writing data into that).
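
A minimal sketch of an hdfs destination with appending enabled follows. The destination name d_hdfs_append is an assumption for illustration; the paths and the NameNode address are taken from the earlier example.

    destination d_hdfs_append {
        hdfs(
            client-lib-dir("/opt/syslog-ng/lib/syslog-ng/java-modules/:/opt/hadoop/libs")
            hdfs-uri("hdfs://10.140.32.80:8020")
            # No UUID suffix is added to this filename when appending is enabled.
            hdfs-file("/user/log/logfile.txt")
            # Append to the existing file. Archiving is disabled in this mode,
            # so hdfs-archive-dir() would be ignored.
            hdfs-append-enabled(true)
        );
    };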

hdfs-archive-dir()

Type: string
Default: N/A

Description: The path where AxoSyslog will move the closed log files. If AxoSyslog cannot move the file for some reason (for example, AxoSyslog cannot connect to the HDFS NameNode), the file remains at its original location. For example, hdfs-archive-dir("/usr/hdfs/archive/").

hdfs-file()

Type: string
Default: N/A

Description: The path and name of the log file. For example, hdfs-file("/usr/hdfs/mylogfile.txt"). AxoSyslog checks if the path to the logfile exists. If a directory does not exist, AxoSyslog automatically creates it.

hdfs-file() supports the usage of macros. This means that AxoSyslog can create files on HDFS dynamically, using macros in the file (or directory) name.

Example: Using macros in filenames

In the following example, a /var/testdb_working_dir/$DAY-$HOUR.txt file will be created (with a UUID suffix):

    destination d_hdfs_9bf3ff45341643c69bf46bfff940372a {
        hdfs(
            client-lib-dir("/hdfs-libs")
            hdfs-uri("hdfs://hdp2.syslog-ng.example:8020")
            hdfs-file("/var/testdb_working_dir/$DAY-$HOUR.txt")
        );
    };

As an example, if it is the 31st day of the month and it is 12 o’clock, then the name of the file will be 31-12.txt, followed by the UUID suffix.

hdfs-max-filename-length()

Type: number
Default: 255

Description: The maximum length of the filename. This filename (including the UUID that AxoSyslog appends to it) cannot be longer than what the file system permits. If the filename is longer than the value of hdfs-max-filename-length, AxoSyslog will automatically truncate the filename. For example, hdfs-max-filename-length("255").

hdfs-resources()

Type: string
Default: N/A

Description: The list of Hadoop resources to load, separated by semicolons. For example, hdfs-resources("/home/user/hadoop/core-site.xml;/home/user/hadoop/hdfs-site.xml").
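
As a sketch, the option might be used in a destination as follows. The destination name d_hdfs_resources is an assumption for illustration; the resource paths are the ones from the description above, and the other values come from the earlier hdfs example.

    destination d_hdfs_resources {
        hdfs(
            client-lib-dir("/opt/syslog-ng/lib/syslog-ng/java-modules/:/opt/hadoop/libs")
            hdfs-uri("hdfs://10.140.32.80:8020")
            hdfs-file("/user/log/logfile.txt")
            # Hadoop resource files to load, separated by semicolons.
            hdfs-resources("/home/user/hadoop/core-site.xml;/home/user/hadoop/hdfs-site.xml")
        );
    };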

hdfs-uri()

Type: string
Default: N/A

Description: The URI of the HDFS NameNode is in hdfs://IPaddress:port or hdfs://hostname:port format. When using MapR-FS, the URI of the MapR-FS NameNode is in maprfs://IPaddress or maprfs://hostname format, for example: maprfs://10.140.32.80. The IP address of the node can be IPv4 or IPv6. For example, hdfs-uri("hdfs://10.140.32.80:8020"). The IPv6 address must be enclosed in square brackets ([]) as specified by RFC 2732, for example, hdfs-uri("hdfs://[FEDC:BA98:7654:3210:FEDC:BA98:7654:3210]:8020").

hook-commands()

Description: This option makes it possible to execute external programs when the relevant driver is initialized or torn down. The hook-commands() can be used with all source and destination drivers with the exception of the usertty() and internal() drivers.

Using hook-commands() when AxoSyslog starts or stops

To execute an external program when AxoSyslog starts or stops, use the following options:

startup()

Type: string
Default: N/A

Description: Defines the external program that is executed as AxoSyslog starts.

shutdown()

Type: string
Default: N/A

Description: Defines the external program that is executed as AxoSyslog stops.

Using the hook-commands() when AxoSyslog reloads

To execute an external program when the AxoSyslog configuration is initiated or torn down, for example, on startup/shutdown or during an AxoSyslog reload, use the following options:

setup()

Type: string
Default: N/A

Description: Defines an external program that is executed when the AxoSyslog configuration is initiated, for example, on startup or during an AxoSyslog reload.

teardown()

Type: string
Default: N/A

Description: Defines an external program that is executed when the AxoSyslog configuration is stopped or torn down, for example, on shutdown or during an AxoSyslog reload.

Example: Using hook-commands() with a network source

In the following example, the hook-commands() is used with the network() driver and it opens an iptables port automatically as AxoSyslog is started/stopped.

The assumption in this example is that the LOGCHAIN chain is part of a larger ruleset that routes traffic to it. While the rule created by AxoSyslog is in place, packets can flow; otherwise the port is closed.

source {
    network(
        transport(udp)
        hook-commands(
            startup("iptables -I LOGCHAIN 1 -p udp --dport 514 -j ACCEPT")
            shutdown("iptables -D LOGCHAIN 1")
        )
    );
};
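
For comparison, the following sketch uses setup() and teardown() instead, so the commands also run when AxoSyslog is reloaded. The source name s_net_hooks is an assumption for illustration, and the iptables commands are the same as in the example above.

    source s_net_hooks {
        network(
            transport(udp)
            hook-commands(
                # Runs when the configuration is initiated: on startup and on reload.
                setup("iptables -I LOGCHAIN 1 -p udp --dport 514 -j ACCEPT")
                # Runs when the configuration is torn down: on shutdown and on reload.
                teardown("iptables -D LOGCHAIN 1")
            )
        );
    };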

jvm-options()

Type: list
Default: N/A

Description: Specify the Java Virtual Machine (JVM) settings of your Java destination from the AxoSyslog configuration file.

For example:

   jvm-options("-Xss1M -XX:+TraceClassLoading")

You can set this option only as a global option, by adding it to the options statement of the syslog-ng.conf configuration file.
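
A minimal sketch of the global form, using the same JVM settings as the example above:

    options {
        # Applies to the JVM shared by the Java-based destinations.
        jvm-options("-Xss1M -XX:+TraceClassLoading");
    };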

kerberos-keytab-file()

Type: string
Default: N/A

Description: The path to the Kerberos keytab file that you received from your Kerberos administrator. For example, kerberos-keytab-file("/opt/syslog-ng/etc/hdfs.headless.keytab"). This option is needed only if you want to authenticate using Kerberos in Hadoop. You also have to set the kerberos-principal() option. For details on using Kerberos authentication with the hdfs() destination, see Kerberos authentication with the hdfs() destination.

    destination d_hdfs {
        hdfs(
            client-lib-dir("/hdfs-libs/lib")
            hdfs-uri("hdfs://hdp-kerberos.syslog-ng.example:8020")
            kerberos-keytab-file("/opt/syslog-ng/etc/hdfs.headless.keytab")
            kerberos-principal("hdfs-hdpkerberos@MYREALM")
            hdfs-file("/var/hdfs/test.log")
        );
    };

Available in AxoSyslog version 3.10 and later.

kerberos-principal()

Type: string
Default: N/A

Description: The Kerberos principal you want to authenticate with. For example, kerberos-principal("hdfs-user@MYREALM"). This option is needed only if you want to authenticate using Kerberos in Hadoop. You also have to set the kerberos-keytab-file() option. For details on using Kerberos authentication with the hdfs() destination, see Kerberos authentication with the hdfs() destination.

    destination d_hdfs {
        hdfs(
            client-lib-dir("/hdfs-libs/lib")
            hdfs-uri("hdfs://hdp-kerberos.syslog-ng.example:8020")
            kerberos-keytab-file("/opt/syslog-ng/etc/hdfs.headless.keytab")
            kerberos-principal("hdfs-hdpkerberos@MYREALM")
            hdfs-file("/var/hdfs/test.log")
        );
    };

Available in AxoSyslog version 3.10 and later.

log-fifo-size()

Type: number
Default: Use global setting.

Description: The number of messages that the output queue can store.

on-error()

Type: One of: drop-message, drop-property, fallback-to-string, silently-drop-message, silently-drop-property, silently-fallback-to-string
Default: Use the global setting (which defaults to drop-message)

Description: Controls what happens when type-casting fails and AxoSyslog cannot convert some data to the specified type. By default, AxoSyslog drops the entire message and logs the error. Currently the value-pairs() option uses the settings of on-error().

  • drop-message: Drop the entire message and log an error message to the internal() source. This is the default behavior of AxoSyslog.
  • drop-property: Omit the affected property (macro, template, or message-field) from the log message and log an error message to the internal() source.
  • fallback-to-string: Convert the property to string and log an error message to the internal() source.
  • silently-drop-message: Drop the entire message silently, without logging the error.
  • silently-drop-property: Omit the affected property (macro, template, or message-field) silently, without logging the error.
  • silently-fallback-to-string: Convert the property to string silently, without logging the error.

retries()

Type: number (of attempts)
Default: 3

Description: If AxoSyslog cannot send a message, it will try again until the number of attempts reaches retries().

If the number of attempts reaches retries(), AxoSyslog waits for the period set in time-reopen(), then tries sending the message again.

template()

Type: string
Default: A format conforming to the default logfile format.

Description: Specifies a template defining the logformat to be used in the destination. Macros are described in Macros of AxoSyslog. Please note that for network destinations it might not be appropriate to change the template as it changes the on-wire format of the syslog protocol which might not be tolerated by stock syslog receivers (like syslogd or syslog-ng itself). For network destinations make sure the receiver can cope with the custom format defined.

throttle()

Type: number
Default: 0

Description: Sets the maximum number of messages sent to the destination per second. Use this output-rate-limiting functionality only when using disk-buffer as well to avoid the risk of losing messages. Specifying 0 or a lower value sets the output limit to unlimited.
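
The following sketch combines throttle() with a disk-buffer, as recommended above. The destination name d_hdfs_throttled and the numeric values are assumptions for illustration; the paths and the NameNode address are taken from the earlier hdfs example.

    destination d_hdfs_throttled {
        hdfs(
            client-lib-dir("/opt/syslog-ng/lib/syslog-ng/java-modules/:/opt/hadoop/libs")
            hdfs-uri("hdfs://10.140.32.80:8020")
            hdfs-file("/user/log/logfile.txt")
            # Send at most 1000 messages per second (illustrative value).
            throttle(1000)
            # Queue the messages that cannot be sent immediately.
            disk-buffer(
                reliable(no)
                capacity-bytes(2000000)
                flow-control-window-size(10000)
                dir("/tmp/disk-buffer")
            )
        );
    };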

time-reap()

Accepted values: number (seconds)
Default: 0 (disabled)

Description: The time to wait in seconds before an idle destination file is closed. Note that if the hdfs-archive-dir() option is set and time-reap() expires, archiving is triggered for the affected file.
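
As a sketch, the following destination closes a file after it has been idle for 60 seconds, which also triggers archiving. The destination name d_hdfs_reap and the 60-second value are assumptions for illustration; the other values are taken from examples elsewhere in this section.

    destination d_hdfs_reap {
        hdfs(
            client-lib-dir("/opt/syslog-ng/lib/syslog-ng/java-modules/:/opt/hadoop/libs")
            hdfs-uri("hdfs://10.140.32.80:8020")
            hdfs-file("/user/log/logfile.txt")
            # Close the file if no message was written to it for 60 seconds.
            time-reap(60)
            # The closed file is then moved to this directory.
            hdfs-archive-dir("/usr/hdfs/archive/")
        );
    };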

time-zone()

Type: name of the timezone, or the timezone offset
Default: unspecified

Description: Convert timestamps to the timezone specified by this option. If this option is not set, then the original timezone information in the message is used. Converting the timezone changes the values of all date-related macros derived from the timestamp, for example, HOUR. For the complete list of such macros, see Date-related macros.

The timezone can be specified by using the name (for example, time-zone("Europe/Budapest")), or as the timezone offset in +/-HH:MM format (for example, +01:00). On Linux and UNIX platforms, the valid timezone names are listed under the /usr/share/zoneinfo directory.

ts-format()

Type: rfc3164, bsd, rfc3339, iso
Default: rfc3164

Description: Override the global timestamp format (set in the global ts-format() parameter) for the specific destination. For details, see ts-format().
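
The timestamp-related options can be combined in a single destination, as in the following sketch. The destination name d_hdfs_ts, the template string, and the chosen values are assumptions for illustration; the template uses the standard DATE, HOST, MSGHDR, and MSG macros.

    destination d_hdfs_ts {
        hdfs(
            client-lib-dir("/opt/syslog-ng/lib/syslog-ng/java-modules/:/opt/hadoop/libs")
            hdfs-uri("hdfs://10.140.32.80:8020")
            hdfs-file("/user/log/logfile.txt")
            # One message per line, starting with the timestamp.
            template("$DATE $HOST $MSGHDR$MSG\n")
            # Format the DATE macro as an ISO8601 timestamp for this destination.
            ts-format(iso)
            # Store three digits of fractions of a second.
            frac-digits(3)
            # Convert timestamps to this timezone.
            time-zone("Europe/Budapest")
        );
    };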