SIEM vs. Security Data Lake: Architecture and Cost
SIEM and security data lakes serve different purposes, and the real question is what sits underneath both. A comparison of the two architectures.
The security data architecture conversation has evolved beyond "which SIEM should we use?" to a more fundamental question: where should enterprise telemetry live, and what should consume it?
Two architectures dominate the current discussion. The SIEM is the traditional platform that ingests security telemetry, applies detection rules, and generates alerts. The security data lake is a large-scale storage layer that retains telemetry at lower cost, typically built on cloud-native object storage with a query engine on top.
Most enterprises now operate some combination of both. The SIEM handles detection and alerting. The data lake handles long-term retention and compliance. And the space between them (the integration, the data movement, the schema reconciliation) creates operational complexity that neither was designed to address.
The real question is not SIEM vs. data lake. It is whether there is a better architecture underneath both.
What is a security data lake, and how did we get here?
Security data lakes emerged as a response to SIEM cost constraints. As telemetry volumes grew and SIEM ingestion pricing made full retention unaffordable, organizations needed a place to put the data their SIEM could not economically hold.
The data lake offered a compelling proposition: store everything, cheaply, in cloud object storage (S3, Azure Blob, GCS). Use a query engine (Athena, BigQuery, or a dedicated security analytics platform) to search when needed. Keep the SIEM for real-time detection. Use the lake for everything else.
Amazon Security Lake, Snowflake Security Data Lake, and several vendor-specific implementations formalized this pattern. The OCSF (Open Cybersecurity Schema Framework) provided a common data model. The architecture made retention affordable.
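OCSF's contribution is easiest to see in a concrete mapping. The sketch below normalizes a hypothetical vendor-specific login record into a simplified OCSF-style Authentication event; the vendor field names are invented, and the OCSF field set is heavily abbreviated from the real schema, so treat this as illustrative rather than a conformant event.

```python
def to_ocsf_auth_event(vendor_event: dict) -> dict:
    # Map a hypothetical vendor login record into a simplified
    # OCSF-style Authentication event (field set abbreviated).
    return {
        "class_uid": 3002,  # OCSF Authentication class
        "activity_id": 1,   # 1 = Logon in the OCSF activity enum
        "time": vendor_event["timestamp"],
        "user": {"name": vendor_event["username"]},
        "src_endpoint": {"ip": vendor_event["source_ip"]},
        "status": "Success" if vendor_event["ok"] else "Failure",
    }

raw = {"timestamp": 1714560000, "username": "alice",
       "source_ip": "10.0.0.5", "ok": True}
ocsf_event = to_ocsf_auth_event(raw)
```

The value of the common model is that every downstream consumer (SIEM rule, lake query, compliance report) reads the same shape regardless of which vendor produced the original record.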
But it also introduced a set of trade-offs that are becoming increasingly apparent.
SIEM vs. security data lake: core architectural differences
The two architectures are optimized for different jobs, and those optimization choices have consequences.
SIEM is optimized for real-time detection. It ingests events, applies correlation rules, generates alerts, and supports analyst workflows. Its strengths are detection speed, alert management, and integration with security orchestration tools. Its weaknesses are cost at scale (ingestion-based pricing), limited retention (weeks to months), and a data model optimized for event-centric alerting rather than longitudinal analysis.
Security data lake is optimized for low-cost storage. It retains large volumes of telemetry affordably, supports ad-hoc queries, and satisfies compliance retention requirements. Its strengths are cost efficiency for storage and the ability to retain months or years of data. Its weaknesses are query latency (minutes to hours for large scans), lack of real-time detection capability, minimal data enrichment, and a data model that is schema-on-read, meaning structure is applied at query time rather than at ingest.
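The schema-on-read distinction is concrete enough to sketch. In the toy example below (field names invented), the lake stores raw bytes and pays the parsing cost on every query, while the SIEM-style path parses and normalizes once at write time:

```python
import json

RAW_LOG = b'{"ts": "2024-05-01T12:00:00Z", "user": "alice", "action": "login"}'

def schema_on_read(raw: bytes) -> dict:
    # Data lake style: bytes are stored as-is; structure is applied
    # only when a query runs, so every scan re-pays the parsing cost.
    return json.loads(raw)

def schema_at_ingest(raw: bytes) -> dict:
    # SIEM style: parse and normalize once, at write time,
    # so queries hit already-structured records.
    event = json.loads(raw)
    return {"timestamp": event["ts"], "actor": event["user"],
            "verb": event["action"]}

# The lake stores RAW_LOG untouched; the SIEM stores the normalized record.
stored_in_lake = RAW_LOG
stored_in_siem = schema_at_ingest(RAW_LOG)
```

At petabyte scale, the difference between parsing once and parsing on every scan is precisely the query-latency gap described above.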
The critical difference is in how data is consumed. SIEM delivers alerts. Data lakes deliver query results. Neither maintains continuous understanding of enterprise activity.
Why organizations end up with both, and the integration tax
In practice, most security architectures include both a SIEM and a data lake because neither alone satisfies all requirements.
The SIEM provides the real-time detection and alerting that security operations depend on. But its retention window is too short and its cost too high for full telemetry retention. The data lake provides the retention, but it cannot perform real-time detection, its query performance is too slow for interactive investigation, and the data lacks the enrichment that analysts need.
The result is a two-system architecture with significant integration overhead. Telemetry must be routed to both destinations, often through separate pipelines. Schemas must be maintained in both systems, or reconciled when analysts need to correlate data across them. Queries that span the SIEM's hot data and the lake's cold data require manual workflow changes or custom tooling.
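The dual-pipeline overhead can be reduced to a few lines. This sketch (sink and field names hypothetical) shows the same event being shaped twice, once per destination, which is exactly the surface area that every schema change must be propagated across:

```python
def route_event(event: dict, siem_sink: list, lake_sink: list) -> None:
    # Integration tax in miniature: the same event is shaped twice,
    # once per destination, and both mappings must track every
    # upstream schema change independently.
    siem_sink.append({"ts": event["time"], "msg": event["message"]})  # SIEM schema
    lake_sink.append(dict(event))                                     # raw copy for the lake

siem, lake = [], []
route_event({"time": 1714560000, "message": "login ok", "host": "web-1"},
            siem, lake)
```

Multiply this by hundreds of data sources and two production systems, and the "ongoing operational burden" stops being abstract.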
This integration tax is not a one-time cost. It is an ongoing operational burden that grows with every new data source, every schema change, and every expansion of the telemetry footprint.
The missing layer: what neither SIEM nor data lake does well
Both SIEM and security data lake architectures share a common gap: neither was designed to create and maintain structured, machine-consumable knowledge from enterprise telemetry.
SIEM stores events and generates alerts. The data model is event-centric. There is no persistent entity history, no continuous metadata enrichment, and no optimization for machine consumers (autonomous agents).
Data lakes store raw or semi-structured data. The schema is applied at query time. There is no continuous enrichment, no entity resolution at ingest, and no maintained understanding, just stored bytes that can be queried if you know the right syntax and have patience for the response time.
What is missing is a layer that does the following: captures all telemetry from all sources; applies metadata extraction, entity resolution, and enrichment at ingest, continuously; retains the structured result in hot, searchable storage for months to years; and exposes it through interfaces optimized for both human query and machine consumption.
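Those four steps can be sketched as a minimal pipeline. All function names, field names, and the in-memory store below are hypothetical, standing in for capture, ingest-time enrichment, hot retention, and structured access:

```python
HOT_STORE: dict[str, list] = {}  # entity id -> ordered event history

def ingest(raw_event: dict) -> None:
    # 1. Capture: accept telemetry from any source.
    # 2. Extract metadata and resolve the entity at ingest time,
    #    not at query time.
    entity_id = raw_event.get("user") or raw_event.get("host") or "unknown"
    enriched = {**raw_event, "entity_id": entity_id, "enriched": True}
    # 3. Retain the structured result in hot storage, keyed by entity.
    HOT_STORE.setdefault(entity_id, []).append(enriched)

def entity_history(entity_id: str) -> list:
    # 4. Structured access for both human analysts and machine consumers.
    return HOT_STORE.get(entity_id, [])

ingest({"user": "alice", "action": "login"})
ingest({"user": "alice", "action": "mfa_challenge"})
```

The design choice that matters is step 2: because enrichment and entity resolution happen continuously at ingest, `entity_history` returns a maintained longitudinal record rather than a scan over raw bytes.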
This is not what SIEM does. This is not what a data lake does. This is what a telemetry substrate does.
The telemetry substrate model: a system of record underneath both
A telemetry substrate sits underneath both SIEM and data lake, replacing the data layer that each has historically provided while solving the problems that each creates.
In this architecture, the substrate handles collection, enrichment, retention, and structured access. The SIEM receives a curated feed of detection-relevant events from the substrate and performs its detection and alerting function. The data lake is either eliminated entirely (because the substrate provides hot retention at equivalent or lower cost) or reduced to a cold archival tier for data that no system needs to query.
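The curated-feed relationship can be sketched as a filter the substrate applies before forwarding to the SIEM. The event types and predicate below are hypothetical; the point is that the substrate retains everything while the SIEM ingests only what its detection rules consume:

```python
DETECTION_RELEVANT = {"auth_failure", "privilege_change", "malware_alert"}

def curated_feed(substrate_events: list) -> list:
    # The substrate retains every event; the SIEM receives only the
    # detection-relevant subset, which is what decouples SIEM cost
    # from total telemetry volume.
    return [e for e in substrate_events if e["type"] in DETECTION_RELEVANT]

events = [
    {"type": "auth_failure", "user": "alice"},
    {"type": "dns_query", "host": "web-1"},       # retained in substrate only
    {"type": "privilege_change", "user": "bob"},
]
to_siem = curated_feed(events)
```

Under this split, SIEM ingestion pricing applies only to the curated subset, while the full record remains queryable in the substrate.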
The substrate model changes the economics and the capabilities simultaneously. Cost is decoupled from volume. Retention is hot by default. Data is structured at ingest, not at query time. Entity histories are maintained continuously. And the data is accessible to both human analysts and autonomous agents through machine-consumable interfaces.
Bloo implements this substrate model. It captures all enterprise telemetry, structures it with metadata extraction and entity resolution at ingest, retains it in hot searchable storage at predictable cost, and serves as the canonical system of record that SIEM, compliance tools, and AI agents all consume.
Decision framework: SIEM + lake vs. substrate-first architecture
The choice between maintaining a SIEM + data lake architecture and adopting a substrate-first approach depends on where the organization is today and where it needs to be.
SIEM + lake makes sense when the organization has invested heavily in SIEM detection engineering, the detection content is mature and differentiated, and the data lake satisfies compliance retention requirements with acceptable query performance. In this case, the integration tax is a known cost, and the architecture is stable.
Substrate-first makes sense when SIEM costs are the primary constraint, the data lake is underutilized or operationally burdensome, retention gaps exist between current capabilities and compliance requirements, and the organization anticipates AI-driven security operations that will require persistent, structured, machine-consumable telemetry.
In either case, the substrate can be adopted incrementally. Bloo can operate alongside existing SIEM and data lake deployments, initially handling the data sources that the SIEM cannot economically ingest and providing structured retention that the data lake cannot deliver. Over time, the substrate absorbs more of the data layer responsibility, and the SIEM narrows to its core function: detection and alerting.
The end state is an architecture where the system of record exists independently of any single application layer: durable, structured, and ready for whatever consumes it next.