πŸͺ΅ The πŸͺ΅ Great πŸͺ΅ Log πŸͺ΅ Post πŸͺ΅

So… this started out as me trying to start a discussion about logging on kanidm/kanidm and once I started, I couldn’t stop. Awkward.

Is this about lots of logs, or great quality logs, or just ✨big✨ logs?

Yes. πŸͺ΅

Ok, so everyone should like their logs, and making logs likeable requires them to be usable. What makes a usable log?

Note: I’m not necessarily talking about platform metrics here; they’re related but not the focus.

Monitoring and Observability are distinct elements informing this discussion.

Charity Majors @mipsytipsy Sep 14, 2020

Updated definition: πŸ“‰ Monitoring is for running and understanding other people’s code (aka “your infrastructure”)

πŸ“ˆ Observability is for running and understanding your code – the code you write, change and ship every day; the code that solves your core business problems.

yale-pinions ahead

  1. Please just prepend “I think” on everything that isn’t quoting someone else.
  2. I started in ops and now “work” in “security”, so I have a deep need to trace things end to end.
  3. I may work for the 11th largest t-shirt company in the world but I don’t get paid if you log more. I just see a lot of good and bad logs; if you don’t log properly, I think you can’t do anything well.

I’m going to try to quote or link to smarter people than me because that’s way easier and you’ll trust me more. In fact, I’ve already done it once.

Log Consumers

There are two main consumers of logs:

  1. Robots keeping things running. 🦾
  2. Humans debugging a specific problem πŸ•΅οΈβ€β™€οΈ

This means logs should be parseable by our friendly robots so they can then display them for the humans.

The Basics πŸ”°

Logs must have a few things.

  • A timestamp. ISO 8601 defines the formats; RFC 3339 is the stricter profile to actually implement. Epoch-(pico|nano|micro)seconds are another option, but the ISO-style timestamps are far friendlier to humans.
  • An importance/level. Is this a debug message or an error, or something in between?
  • Context - who, what, where and why?
  • The information - the detail - otherwise why are you logging?

I’m going to stop us here for a second.


A logged entry must be one “thing”.

One thing per line/blob/emit/whatever you want to call it.


You might not have had to deal with Java stack traces. I have. Just… don’t. Multi-line-barf-noise-down-the-file logs are horrible. Debug logs are inherently noisy, but they’re still (when done right) one-emit-per-thing.
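Even a stack trace can be one emit: serialise the whole trace into a single field of a single blob. A minimal Python sketch of the idea (`log_exception` is a hypothetical helper, not anything from kanidm):

```python
import json
import traceback

def log_exception(level, message):
    # The entire traceback travels inside ONE field of ONE JSON blob,
    # so the robots see one record, not forty orphaned lines.
    return json.dumps({
        "level": level,
        "message": message,
        "traceback": traceback.format_exc(),
    })

try:
    1 / 0
except ZeroDivisionError:
    line = log_exception("error", "division failed")
```

`json.dumps` escapes the embedded newlines, so the record stays a single line on the wire - one emit, one thing.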

This is a pretty common looking log:

1937-01-01T12:00:27.87+00:20 INFO kanidmd[1234]: 0977b582-28cf-40f4-a427-342e5aa9cbb8 william@example.com logged in from 1.2.3.4
  • Timestamp, check.
  • Importance, check.
  • Context, check.
  • Information, check.

It’s not bad. I’m still not a fan.

Log Storage πŸ’Ύ

Typically logs are stored on disk. That might be a plain line-by-line file in /var/log/kanidm.log, or whatever binary format journald or the Windows Event Log uses. They’re also typically collected and transferred into a time-series storage platform, like Splunk (🐚), for later correlation and querying.

The way they get there is important to think about…

  • Writing them to disk in a blocking manner is bad, just like any other blocking action in a performance-sensitive system. We can’t wait for things to write.
  • What’s collecting the logs? Is it systemd capturing stdout, is it your eyeballs (terminals are slow) or is a log agent picking them up from the aforementioned log file?
  • If we are writing to disk, do we flush that write regularly so the logs are collected promptly? If we hold the file handle open in a way other things can’t read, that’s not helpful to anyone. (Windows DNS Debug logs do this).
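The don't-block-the-hot-path point can be sketched with Python's stdlib queue-based handlers - the caller only enqueues, and a background thread does the (potentially blocking) write. The in-memory sink here stands in for stderr or a file:

```python
import io
import logging
import logging.handlers
import queue

# The hot path only enqueues the record; a background listener thread
# performs the actual (potentially blocking) write.
log_queue = queue.Queue(-1)
sink = io.StringIO()  # stand-in for stderr / a log file
listener = logging.handlers.QueueListener(
    log_queue, logging.StreamHandler(sink)
)

logger = logging.getLogger("kanidmd-demo")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.QueueHandler(log_queue))

listener.start()
logger.info("william@example.com logged in")  # returns immediately, no I/O here
listener.stop()  # drains the queue on shutdown so nothing is lost
```

Same shape as a channel-plus-writer-thread in any other language: the thread serving the request never waits on the disk.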

Bad Ideas ❌

  • Blocking actions.
  • Custom formats.
  • Only emitting into a format for a particular consumer (eg. evtx/binary logs).
  • Syslog. Just in general.
  • XML. Just… no.
  • Logs in databases.
  • Writing a whole log system for our system.

Good Ideas πŸ‘©β€πŸ³

  • Structured data… JSON’s awesome for logs.
  • Don’t write the logs in the active thread.
  • Just use std(err|out).
  • UTF-8.
  • Did I mention it needs to be fast?
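Putting those together - a minimal sketch, not kanidm's actual logger (`emit` is a hypothetical helper): structured JSON, UTF-8, one emit per event, straight to stderr so the collector owns the storage problem.

```python
import json
import sys
import time

def emit(level, event_name, **fields):
    # One UTF-8 JSON blob per event, written to stderr so whatever is
    # collecting (systemd, a log agent) decides where it ends up.
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
        "level": level,
        "event_name": event_name,
        **fields,
    }
    line = json.dumps(record, ensure_ascii=False)  # keep the emoji intact
    print(line, file=sys.stderr)
    return line

line = emit("info", "login", user="william@example.com", src="1.2.3.4")
```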

Context

If you’re writing a system, it’s table stakes to be able to trace an event from start to end. In both super-full-on-debug-mode and normal-just-enough-logs-to-keep-running modes, an admin should be able to tie together the full lifecycle of an event.

Clients could totally pass a “this is my eventID” field in their request. Since we’re talking about a system that is fundamentally running over REST, that could be a header or a data field. Either way, it should be validated and logged in context with the server-side events. You don’t have to log it with every related entry - just a server_session="x", client_request_id="y" would do.

Do not just log whatever’s in that field - you never want to see “hello server admin I’m a potato πŸ₯”” in your logs. Ok, maybe that would be fun. But you don’t want command injection or JavaScript injection landing in your SIEM or something like that.
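One way to handle that, sketched in Python (`sanitise_client_request_id` is a hypothetical helper, and this assumes you’ve decided client request IDs must be UUIDs):

```python
import uuid

def sanitise_client_request_id(raw):
    # Only a well-formed UUID makes it into the logs; potatoes, escape
    # codes and injection attempts are dropped, never copied verbatim.
    try:
        return str(uuid.UUID(raw))
    except (TypeError, ValueError, AttributeError):
        return None

good = sanitise_client_request_id("6c84464e-35f2-430d-813d-d19b421af7f7")
bad = sanitise_client_request_id("hello server admin I’m a potato πŸ₯”")
```

Validate first, log second: the client gets to correlate, but never gets to write your logs for you.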

Log Formats

Given that robots love things they can parse trivially and humans need to be able to read them, a format that is both human and machine-readable should be mandatory.

  • Terminal escape codes are lovely in the command line and literally the worst to end up in logs. Colours shouldn’t happen unless you know you’re outputting to a human-facing terminal.
  • Key-value output is readable by humans and robots.
  • Delimited (tab|comma|emoji) is error-prone for robots to parse.
  • CSV and XML are never the answers.
  • Structured data is slightly hard for humans to read.
  • Structured data can be fed through a parser to format things for humans.
  • What else is there?
  • No one should ever need to regex the logs to parse them. We like our robots to be happy. πŸ€–πŸ”ͺ
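That structured-for-robots, parsed-for-humans split can be sketched like this (`humanise` is a hypothetical helper - the robots keep the JSON, the human gets a readable rendering):

```python
import json

def humanise(json_line):
    # Render one structured log line for human eyeballs; the
    # stored/shipped record stays machine-parseable JSON.
    rec = json.loads(json_line)
    extras = {k: v for k, v in rec.items()
              if k not in ("ts", "level", "message")}
    return "{ts} {level:>7} {message} {extras}".format(
        ts=rec.get("ts", "-"),
        level=rec.get("level", "-").upper(),
        message=rec.get("message", ""),
        extras=" ".join(f"{k}={v}" for k, v in extras.items()),
    ).rstrip()

pretty = humanise(
    '{"ts": "1937-01-01T12:00:27.87+00:20", "level": "info", '
    '"message": "logged in", "user": "william@example.com"}'
)
```

No regex in sight, and the human-facing formatting can change without breaking a single robot.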

Examples

What would JSON logging look like? I included linebreaks in my examples to make them easier to read here, but they shouldn’t be output.

My previous “pretty common looking log”:

1937-01-01T12:00:27.87+00:20 INFO kanidmd[1234]: 0977b582-28cf-40f4-a427-342e5aa9cbb8 william@example.com logged in from 1.2.3.4

This would be better :

 {
 "ts": "1937-01-01T12:00:27.87200+10:00",
 "level": "info",
 "process": "kanidmd",
 "pid": 1234,
 "session_id": "0977b582-28cf-40f4-a427-342e5aa9cbb8",
 "action": "login",
 "user": "William@example.com",
 "src": "1.2.3.4",
 "result": "success",
 "relevant_emoji" : "βœ…πŸ§¬πŸ”‘"
 }

Non-actions

But what about complex system logs, you say?

 [6c84464e-35f2-430d-813d-d19b421af7f7 filter::info] filter executed plan -> AndPartialThreshold([EqIndexed("class", "group"), EqIndexed("member", "00000000-0000-0000-0000-ffffff000033")])

Becomes this:

 {
 "ts": "1937-01-01T02:00:27.87313+00:0x0",
 "level": "info",
 "process": "kanidmd",
 "pid": 1234,
 "session_id": "0977b582-28cf-40f4-a427-342e5aa9cbb8",
 "operation_id": "6c84464e-35f2-430d-813d-d19b421af7f7",
 "message": "AndPartialThreshold([EqIndexed('class', 'group'), EqIndexed('member', '00000000-0000-0000-0000-ffffff000033')])",
 "event_name": "filter_executed_plan",
 "relevant_emoji": "βœ…πŸ‘ΊπŸ’Ό"
 }

Note the session_id field: it ties the filter execution back to the “William logged in” event above it.

Admin warnings

This:

 [4091405c-6d43-4381-9e9f-dc53752f1f70 admin::warning 🚧] Arc Cache size too low - setting to 2048 ...

Becomes:

 {
 "ts": "1937-01-01T12:00:27.87456+00:20",
 "level": "warning",
 "process": "kanidmd",
 "pid": 1234,
 "session_id": "4091405c-6d43-4381-9e9f-dc53752f1f70",
 "message": "Arc Cache size too low - setting to 2048",
 "event_name": "arc_cache_autotuning",
 "relevant_emoji": "🚧🐟"
 }

tl;dr

Logs should be:

  1. Fast.
  2. Timestamped.
  3. JSON.
  4. Correlatable.

json_env_logger seems sensible from an import-a-crate-and-go perspective: env_logger is already used throughout, so swapping it for json_env_logger is largely a find-and-replace job - followed by the long tail of %s/eprintln/error/g.

Footnote

Welp, that was pretty long. Hit me up on twitter to discuss logging or just random things. I like πŸͺ΅.



#logs #splunk #json