On Splunk crashing every 18 minutes

Every 18 minutes, give or take ten seconds the Splunk heavy forwarder we run was spontaneously crashing. There had been no change in traffic or other conditions that we knew about on the ESX host the server runs on.

The operating system is Debian 7.2 x86_64, installed from the netinst ISO with a very minimally modified install, and the logs weren’t showing the OOM errors I’d been fighting with previously - 2Gb of memory for Splunk is never enough.

After doing some Googling on random things, I found this thread on Splunk Answers “Splunk Crashing once in 10 min“. It came down to the ulimit setting on the Operating System, which set the amount of files which could be open by any given user/process at a time.

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 15998
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 15998
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I edited /etc/security/limits.conf and added the following:

root soft nofile 8192
root hard nofile 16384

This pushed the hard limit from 1024 (the default) to 16384 for the root user, and 8192 as a “soft” limit. From the man page on ulimit:

A hard limit cannot be increased once it is set; a soft limit may be increased up to the value of the hard limit.

So far all’s been quiet, hopefully it’s all sorted 😀

Here’s an example of a crash log:

[build 182037] 2014-12-15 08:09:02
Received fatal signal 6 (Aborted).
Cause:
Signal sent by PID 22511 running under UID 0.
Crashing thread: FwdDataReceiverThread
Registers:
RIP: [0x00007FC9E797C475] gsignal + 53 (/lib/x86_64-linux-gnu/libc.so.6)
RDI: [0x00000000000057EF]
RSI: [0x000000000000580A]
RBP: [0x00007FC9E7A94C71]
RSP: [0x00007FC9DABFE9A8]
RAX: [0x0000000000000000]
RBX: [0x00007FFF6450BD39]
RCX: [0xFFFFFFFFFFFFFFFF]
RDX: [0x0000000000000006]
R8: [0x00007FC9DABFF700]
R9: [0x2070676F6C5F6020]
R10: [0x0000000000000008]
R11: [0x0000000000000202]
R12: [0x00000000012767B9]
R13: [0x0000000001392F20]
R14: [0x00007FC9E7A94C71]
R15: [0x000000000000090F]
EFL: [0x0000000000000202]
TRAPNO: [0x0000000000000000]
ERR: [0x0000000000000000]
CSGSFS: [0x0000000000000033]
OLDMASK: [0x0000000000000000]

OS: Linux
Arch: x86-64

Backtrace:
Linux / splunk_forwarder / 3.2.0-4-amd64 / #1 SMP Debian 3.2.51-1 / x86_64
Last few lines of stderr (may contain info on assertion failure, but also could be old):
2014-12-15 05:59:31.918 +1000 splunkd started (build 182037)
2014-12-15 06:17:39.367 +1000 splunkd started (build 182037)
2014-12-15 06:35:46.446 +1000 splunkd started (build 182037)
2014-12-15 06:52:53.657 +1000 splunkd started (build 182037)
2014-12-15 07:12:02.217 +1000 splunkd started (build 182037)
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
2014-12-15 07:32:09.288 +1000 splunkd started (build 182037)
2014-12-15 07:51:20.832 +1000 splunkd started (build 182037)
splunkd: /opt/splunk/p4/splunk/branches/6.0.0/src/util/Logger.cpp:2319: LoggerStream::~LoggerStream(): Assertion `_logp != __null' failed.

/etc/debian_version: 7.2
Last errno: 12
Threads running: 45
argv: [splunkd -p 8089 start]
Thread: "FwdDataReceiverThread", did_join=0, ready_to_run=Y, main_thread=N
First 8 bytes of Thread token @0x7fc9dac2b450:
00000000 00 f7 bf da c9 7f 00 00 |........|
00000008
First 512 bytes of PolledFd object @0x7fc942c59000:
00000000 d0 24 2a 01 00 00 00 00 00 00 00 00 00 00 00 00 |.$*.............|
00000010 e0 dc 7c 01 00 00 00 00 00 00 00 00 00 00 00 00 |..|.............|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000030 00 00 00 00 00 00 00 00 38 00 00 00 1e 00 00 00 |........8.......|
00000040 80 1f 13 aa c9 7f 00 00 1b 00 00 00 00 00 00 00 |................|
00000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000060 c0 f6 d7 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000070 30 fa d7 00 00 00 00 00 00 00 00 00 00 00 00 00 |0...............|
00000080 c0 d2 d7 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000090 00 00 00 00 00 00 00 00 05 06 00 00 00 00 00 00 |................|
[sniped for privacy]
00000200
In TcpChannel 0x7fc942c59000, _tcloop=0x7fc9dac521c0, no async write data, _data._shouldKill=N, r/w_timeouts=600000/240000, timeout_count=0 
SSL: version="SSLv3", state="SSL negotiation finished successfully", cipher="AES256-SHA", compression="zlib compression"
rbuf: ptr=0x7fc942c590a0, size=0x2000, rptr=0x0, wptr=0x605
TcpChannelAcceptor: tcloop=0x7fc9dac521c0, activeList=0x7fc97ddaf000, activeCount=50, idleThreads=(nil), idleThreadCount=0, activeThreadCount=0, maxActiveThreadsInTimePeriod=0, idleThreadTrimmerActive=N, disabledReasons=0


x86 CPUID registers:
0: 0000000A 756E6547 6C65746E 49656E69
1: 00010676 02020800 80082201 1FABFBFF
2: 55035A01 00F0B2EB 00000000 09CA212C
3: 00000000 00000000 00000000 00000000
4: 00000000 00000000 00000000 00000000
5: 00000000 00000000 00000000 00000000
6: 00000005 00000001 00000001 00000000
7: 00000000 00000000 00000000 00000000
8: 00000000 00000000 00000000 00000000
9: 00000001 00000000 00000000 00000000
A: 07300401 0000007F 00000000 00000000
80000000: 80000008 00000000 00000000 00000000
80000001: 00000000 00000000 00000001 20100800
80000002: 20202020 49202020 6C65746E 20295228
80000003: 6E6F6558 20295228 20555043 202D3745
80000004: 37303834 20402020 37382E31 007A4847
80000005: 00000000 00000000 00000000 00000000
80000006: 00000000 00000000 01006040 00000000
80000007: 00000000 00000000 00000000 00000100
80000008: 00003028 00000000 00000000 00000000
terminating...