BIPI

SIEM Log Onboarding: Schema, Parsing, and Why Your Joins Are Slow

Cybersecurity

Log onboarding is where most SOC programs quietly fail. Bad parsers create bad fields, bad fields kill detections, and bad joins make every hunt take three minutes when it should take three seconds. A practical guide to onboarding that scales.

By Arjun Raghavan, Security & Systems Lead, BIPI · September 11, 2023 · 11 min read

#siem #log-onboarding #parsing #splunk #sentinel #elastic

Every detection engineer has had this moment. A new SIEM source is onboarded. The data appears. The first query against it returns nothing because the parser stripped the field the rule depended on. The rule was tested against vendor sample data, not the data the pipeline actually delivers. Three months later, an incident reveals that EDR telemetry has been arriving with a malformed timestamp and that an entire month of detections fired late.

Onboarding Is a Pipeline Problem

Treat log onboarding as a data pipeline with the same rigor a data engineer brings to a production warehouse. There is a source, a transport, a parser, a schema, and a consumer. Every stage can break, and every break manifests as a silent detection failure rather than a loud error.

  • Source: is the log generator actually emitting the events you assume, at the verbosity you assume?
  • Transport: are events arriving in order, on time, and without truncation?
  • Parser: does the parser produce the fields detection rules query against?
  • Schema: do field names and types match the SIEM's common information model?
  • Consumer: do existing rules and dashboards continue to work after onboarding?

Pick a Common Information Model and Hold to It

Splunk has CIM, Microsoft Sentinel has ASIM, Elastic has ECS, Chronicle has UDM. They differ in detail but agree on the principle: detections should query normalized field names like src_ip, user.name, or process.command_line, not vendor-specific names like SrcAddr or CmdLine. Without normalization, every detection has to be written in three versions for three different log sources, and migrating to a new tool means rewriting everything.
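In code, normalization is just a rename pass over each event. A minimal Python sketch, assuming hypothetical vendor field names and an ECS-style target; real mappings come from the model's own reference documentation:

```python
# Hypothetical vendor-to-ECS mapping; a real one is driven by the CIM/ASIM/ECS docs.
VENDOR_TO_ECS = {
    "SrcAddr": "source.ip",
    "DstAddr": "destination.ip",
    "CmdLine": "process.command_line",
    "UserName": "user.name",
}

def normalize(event: dict) -> dict:
    """Rename vendor-specific keys to normalized names; pass unknowns through."""
    return {VENDOR_TO_ECS.get(k, k): v for k, v in event.items()}

raw = {"SrcAddr": "10.0.0.5", "CmdLine": "powershell -enc ...", "Extra": 1}
print(normalize(raw))
# {'source.ip': '10.0.0.5', 'process.command_line': 'powershell -enc ...', 'Extra': 1}
```

Note that unknown keys survive untouched, which matters for the schema-drift discussion below: a rename pass that drops anything it does not recognize is itself a silent-failure mechanism.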

Parser Tests Are Detection Tests

Before a parser goes to production, run a corpus of representative raw events through it and assert that every required field is extracted with the correct type. This is not optional. The corpus should include at minimum one normal event, one malicious event from an atomic test, one event with unusual characters in the payload like quotes and backslashes, and one truncated event. If the parser drops a field or misparses a type in any of these, fix it before onboarding the source.

Schema Drift Is Inevitable

Vendors change log formats without telling you. A Windows feature update adds three new fields and renames one. An EDR agent upgrade replaces process_path with image_path. A cloud provider switches a field from string to nested object. If your parser is brittle, detections silently break the day after the upgrade. Build the parser to tolerate unknown fields and to alert on missing required fields, not the other way around.
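The tolerant-but-alerting posture can be as simple as a set difference. A Python sketch, assuming a hypothetical required-field set; the renamed field in the example mirrors the process_path to image_path upgrade above:

```python
REQUIRED_FIELDS = {"timestamp", "host.name", "process.command_line"}

def check_schema(event: dict) -> list[str]:
    """Tolerate unknown fields; alert only on missing required ones."""
    missing = REQUIRED_FIELDS - event.keys()
    return [f"ALERT: required field missing: {f}" for f in sorted(missing)]

# A vendor upgrade renamed a field: the unknown key passes through quietly,
# while the missing required field raises a loud alert.
event = {"timestamp": "2023-09-11T10:00:00Z", "host.name": "ws01",
         "image_path": "C:\\Windows\\notepad.exe"}
print(check_schema(event))  # ['ALERT: required field missing: process.command_line']
```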

Why Your Joins Are Slow

  1. The join key is not indexed: in Splunk, for example, most fields are extracted at search time rather than stored in the index, so a join on them re-scans raw events and is effectively O(N)
  2. The join key has high cardinality on both sides: joining DeviceEvents to DeviceProcessEvents on DeviceId across 30 days is a billion-row operation
  3. The time window on the join is too wide: bound joins to a five-minute window when looking for parent-child process relationships
  4. The join is happening at query time when it should happen at ingest time, via lookup or enrichment
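Point 3 is easy to see in miniature. A pure-Python sketch of a hash join on a hypothetical device_id key that keeps only child events inside the window; widening the window grows the candidate set on every probe:

```python
from datetime import datetime, timedelta

def bounded_join(parents, children, window=timedelta(minutes=5)):
    """Hash join on device_id, keeping only pairs within `window` of each
    other -- the bounded parent-child join the list above recommends."""
    by_device = {}
    for p in parents:
        by_device.setdefault(p["device_id"], []).append(p)
    out = []
    for c in children:
        for p in by_device.get(c["device_id"], []):
            if abs(c["ts"] - p["ts"]) <= window:
                out.append((p, c))
    return out

t0 = datetime(2023, 9, 11, 10, 0)
parents = [{"device_id": "d1", "ts": t0}]
children = [
    {"device_id": "d1", "ts": t0 + timedelta(minutes=2)},  # inside the window
    {"device_id": "d1", "ts": t0 + timedelta(hours=3)},    # outside: dropped
]
print(len(bounded_join(parents, children)))  # 1
```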

Enrichment at Ingest, Not at Query

Asset metadata, user role, geolocation, and threat intelligence tags should be attached to events at ingest, not joined at query time. Splunk does this with automatic lookups configured in props.conf and transforms.conf. Sentinel does it with Logic App enrichment or watchlists. Elastic does it with enrich processors in ingest pipelines. The detection engineer's query then becomes a simple filter on a pre-joined field rather than a multi-index lookup that scans terabytes.
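Conceptually, ingest-time enrichment is a dictionary merge per event. A Python sketch with a hypothetical in-memory asset table standing in for a CMDB export, watchlist, or enrich index:

```python
# Hypothetical asset lookup; in production this is a CMDB export, watchlist,
# or enrich index refreshed on a schedule, not a hard-coded dict.
ASSET_LOOKUP = {
    "10.0.0.5": {"asset.owner": "finance", "asset.criticality": "high"},
}

def enrich(event: dict) -> dict:
    """Attach asset metadata at ingest so queries filter on a flat field."""
    extra = ASSET_LOOKUP.get(event.get("source.ip"), {})
    return {**event, **extra}

print(enrich({"source.ip": "10.0.0.5", "event.action": "logon"}))
# {'source.ip': '10.0.0.5', 'event.action': 'logon',
#  'asset.owner': 'finance', 'asset.criticality': 'high'}
```

The downstream query then filters on asset.criticality directly instead of joining to an asset index at search time.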

Volume Budgeting

  • Calculate the per source daily ingest volume before onboarding, not after the bill arrives
  • Tier sources by detection value: full fidelity for endpoint and identity, summary for high volume sources like DNS
  • Use cheap storage tiers for compliance retention, hot storage only for the windows queries actually run against
  • Sample only when you have proven that the sample preserves detection signal, never sample silently
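The first bullet is simple arithmetic worth writing down before signing a contract. A Python sketch with illustrative numbers rather than real measurements:

```python
def daily_ingest_gb(events_per_second: float, avg_event_bytes: int) -> float:
    """Back-of-envelope daily ingest volume for one source."""
    return events_per_second * avg_event_bytes * 86_400 / 1e9

# e.g. a hypothetical DNS source at 2,000 events/s averaging 300 bytes/event
print(round(daily_ingest_gb(2_000, 300), 1))  # 51.8 (GB/day)
```

Multiply by per-GB licensing and retention period and the tiering decisions in the list above stop being abstract.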

A SIEM with bad parsing is more dangerous than no SIEM at all, because it gives the SOC confidence that detections are running when half of them are silently broken.

Onboarding Checklist

Before declaring a source onboarded, every one of these must be true:

  • Parser tests pass on the corpus
  • CIM mapping is complete for required fields
  • A sample atomic test for a relevant technique generates an event that the parser handles correctly
  • At least one detection rule has been validated end to end against the new source
  • The source health dashboard includes this source
  • Monitoring is configured to alert on a five-minute gap in ingestion

Anything less is technical debt being accepted into the SOC's most critical pipeline.

Read more field notes, explore our services, or get in touch at info@bipi.in. Privacy Policy · Terms.