So⊠this started out as me trying to start a discussion about logging on kanidm/kanidm and once I started, I couldnât stop. Awkward.
Is this about lots of logs, or great quality logs, or just âšbigâš logs?
Yes. đȘ”
Ok, so everyone should like their logs, and making logs likeable requires them to be usable. What makes a usable log?
Note: Iâm not necessarily talking about platform metrics here; theyâre related but not the focus.
Monitoring and Observability are distinct elements informing this discussion.
Charity Majors @mipsytipsy Sep 14, 2020
Updated definition: đ Monitoring is for running and understanding other peopleâs code (aka âyour infrastructureâ)
đ Observability is for running and understanding your code â the code you write, change and ship every day; the code that solves your core business problems.
yale-pinions ahead
- Please just prepend âI thinkâ on everything that isnât quoting someone else.
- I started in ops and now âworkâ in âsecurityâ, so I have a deep need to trace things end to end.
- I may work for the 11th largest t-shirt company in the world but I donât get paid if you log more. I just see a lot of good and bad logs; if you donât log properly, I think you canât do anything well.
Iâm going to try to quote or link to smarter people than me because thatâs way easier and youâll trust me more. In fact, Iâve already done it once.
Log Consumers
There are two main consumers of logs:
- Robots keeping things running. đŠŸ
- Humans debugging a specific problem đ”ïžââïž
This means logs should be parseable by our friendly robots so they can then display them for the humans.
The Basics đ°
Logs must have a few things.
- A timestamp. ISO8601 is good, but RFC3339 puts it into practice. Epoch-(pico|nano|micro)seconds are another option. The former is much more friendly to humans.
- An importance/level. Is this a debug message or an error, or something in between?
- Context - who, what, where and why?
- The information - the detail - otherwise why are you logging?
Iâm going to stop us here for a second.
A logged entry must be one âthingâ.
One thing per line/blob/emit/whatever you want to call it.
You might not have had to deal with Java stack traces. I have. Just⊠donât. Multi-line-barf-noise-down-the-file logs are horrible. Debug logs are inherently noisy, but theyâre still (when done right) one-emit-per-thing.
This is a pretty common looking log:
1937-01-01T12:00:27.87+00:20 INFO kanidmd[1234]: 0977b582-28cf-40f4-a427-342e5aa9cbb8 william@example.com logged in from 1.2.3.4
- Timestamp, check.
- Importance, check.
- Context, check.
- Information. check.
Itâs not bad. Iâm still not a fan.
Log Storage đŸ
Typically theyâre stored on a disk. This might be just in a plain line-by-line file in /var/log/kanidm.log
or in whatever binary format journald or Windows are using. Theyâre also typically collected and transferred into a time-series storage platform, like Splunk (đ), for later correlation and querying.
The way they get there is important to think aboutâŠ
- Writing them to disk in a blocking manner is bad, just like any other blocking action in a performance-sensitive system. We canât wait for things to write.
- Whatâs collecting the logs? Is it
systemd
capturingstdout
, is it your eyeballs (terminals are slow) or is a log agent picking them up from the aforementioned log file? - If we are writing to disk, do we flush that write regularly so the logs are collected promptly? If we hold the file handle open in a way other things canât read, thatâs not helpful to anyone. (Windows DNS Debug logs do this).
Bad Ideas â
- Blocking actions.
- Custom formats.
- Only emitting into a format for a particular consumer (eg. evtx/binary logs).
- Syslog. Just in general.
- XML. Just⊠no.
- Logs in databases.
- Writing a whole log system for our system.
Good Ideas đ©âđł
- Structured data⊠JSONâs awesome for logs.
- Donât write the logs in the active thread.
- Just use std(err|out).
- UTF-8.
- Did I mention it needs to be fast?
Context
If youâre writing a system, itâs table stakes to be able to trace an event from start to end. In both super-full-on-debug-mode and normal-just-enough-logs-to-keep-running modes, an admin should be able to tie together the full lifecycle of an event.
Clients could totally pass a âthis is my eventIDâ field in their request. Since weâre talking about a system that is fundamentally running over REST, that could be a header or a data field. Either way, it should be validated and logged in context with the server-side events. You donât have to log it with every related entry - just a server_session="x", client_request_id="y"
would do.
Do not just log whateverâs in that field - you never want to see âhello server admin Iâm a potato đ„â in your logs. Ok, maybe that would be fun. But you donât want command injection or javascript injection in your SIEM or something like that.
Log Formats
Given that robots love things they can parse trivially and humans need to be able to read them, a format that is both human and machine-readable should be mandatory.
- Terminal escape codes are lovely in the command line and literally the worst to end up in logs. Colours shouldnât happen unless you know youâre outputting to a human-facing terminal.
- Key-value output is readable by humans and robots.
- Delimited (tab|comma|emoji) is error-prone for robots to parse.
- CSV and XML are never the answers.
- Structured data is slightly hard for humans to read.
- Structured data can be fed through a parser to format things for humans.
- What else is there?
- No one should never need to regex the logs to parse them. We like our robots to be happy. đ€đȘ
Examples
What would JSON logging look like? I included linebreaks in my examples to make them easier to read here, but they shouldnât be output.
My previous âpretty common looking logâ:
1937-01-01T12:00:27.87+00:20 ERROR kanidmd[1234]: 0977b582-28cf-40f4-a427-342e5aa9cbb8 William@example.com logged in from 1.2.3.4
This would be better :
{
"ts": "1937-01-01T12:00:27.87200+10:00",
"level": "info",
"process": "kanidmd",
"pid": 1234,
"session_id": "0977b582-28cf-40f4-a427-342e5aa9cbb8",
"action": "login",
"user": "William@example.com",
"src": "1.2.3.4",
"result": "success",
"relevant_emoji" : "â
đ§Źđ"
}
Non-actions
But what about complex system logs, you say?
[6c84464e-35f2-430d-813d-d19b421af7f7 filter::info] filter executed plan -> AndPartialThreshold([EqIndexed("class", "group"), EqIndexed("member", "00000000-0000-0000-0000-ffffff000033")])
Becomes this:
{
"ts": "1937-01-01T02:00:27.87313+00:0x0",
"level": "info",
"process": "kanidmd",
"pid": 1234,
"session_id": "0977b582-28cf-40f4-a427-342e5aa9cbb8",
"operation_id": "6c84464e-35f2-430d-813d-d19b421af7f7",
"message": "AndPartialThreshold([EqIndexed('class', 'group'), EqIndexed('member', '00000000-0000-0000-0000-ffffff000033')])",
"event_name": "filter_executed_plan",
"relevant_emoji": "â
đșđŒ"
}
Note the session_id field, it ties the filter execution back to the âWilliam logged inâ event above it.
Admin warnings
This:
[4091405c-6d43-4381-9e9f-dc53752f1f70 admin::warning đ§] Arc Cache size too low - setting to 2048 ...
Becomes:
{
"ts": "1937-01-01T12:00:27.87456+00:20",
"level": "warning",
"process": "kanidmd",
"pid": 1234,
"session_id": "4091405c-6d43-4381-9e9f-dc53752f1f70",
"message": "Arc Cache size too low - setting to 2048",
"event_name": "arc_cache_autotuning",
"relevant_emoji": "đ§đ"
}
tl;dr
Logs should be:
- Fast.
- Timestamped.
- JSON.
- Correlatable.
json_env_logger seems sensible from an import-a-crate-make-go perspective, env_logger is already being used throughout and replacing it with json_env_logger is largely a find-and-replace matter - then the long tail of %s/eprintln/error/g
.
Footnote
Welp, that was pretty long. Hit me up on twitter to discuss logging or just random things. I like đȘ”.
- âTerminals are slowâ links to âThe TTY Demystifiedâ http://www.linusakesson.net/programming/tty/index.php
- RFC3339 is my new go-to for time formats. ISO8601âs OK, but itâs also behind the ISO-paywall, so they can jump in a lake.
- âCommand injectionâ links to âNginx 0.7.64 - Terminal Escape Sequence in Logs Command Injectionâ because who doesnât love unintended consequences?
- 11th Largest T-Shirt Company in the World - I like Shelly. Heâs fun.
- If you like logs and using them for good, or understanding systems better and observability in general, Charityâs a spectacularly great communicator on the topic.