Building a Bounded-Memory Pipeline for 944 Million Zeek Events in Rust
Table of Contents
Introduction
Recently, I set out to solve what initially appeared to be a simple problem:
Convert a very large Zeek
conn.logdataset stored as NDJSON into CSV format.
The challenge was scale. The input dataset contained approximately 944 million records, making it impossible to rely on approaches that load data into memory. The solution needed to be:
- Fast
- Memory efficient
- Fault tolerant
- Capable of processing billions of events
- Suitable as a foundation for larger Security Operations Center (SOC) pipelines
The resulting system processed the entire dataset with the following performance characteristics:
Processed: 944,100,000 lines
Workers: 16
Queue: 64
Errors: 0
Duration: 5656 seconds (~1.57 hours)
Rate: ~193,600 records/sec
More importantly, it maintained bounded memory usage throughout execution.
The Problem
The source data consisted of newline-delimited JSON (NDJSON) records generated by Zeek. A typical record looks like:
The goal was to transform these records into a normalized CSV format suitable for downstream analytics. The naive solution would be:
- Read entire file
- Parse all JSON
- Create a DataFrame
- Export to CSV
This approach quickly becomes impractical when dealing with hundreds of millions of records.
First Attempt: Polars
My initial approach leveraged Polars and its lazy APIs. The idea was straightforward:
new
.finish?
.select
.sink
The expectation was that Polars would stream records directly from JSON to CSV. However, after experimenting with various versions and APIs, it became clear that achieving true streaming behavior while maintaining predictable memory consumption was more complicated than expected. For analytical workloads, Polars is exceptional. For this specific use case a massive streaming transformation pipeline a custom solution offered more control over memory, concurrency, and backpressure.
Designing a Streaming Pipeline
The core design principle became:
Never hold more data in memory than necessary.
Instead of loading data, records are processed as a stream. The resulting architecture looked like this:
Reader
β
βΌ
Acquire Semaphore
β
βΌ
Thread Pool
(16 workers max)
β
βΌ
Result Channel
β
βΌ
CSV Writer
Each stage performs one responsibility and passes work downstream. The reader continuously streams lines from disk. Workers parse and transform records. A dedicated writer thread serializes CSV output.
Implementing Backpressure
One of the most important lessons learned involved backpressure. Initially, I assumed that limiting the thread pool to 16 workers would automatically limit memory consumption. That assumption was wrong. Most thread pool implementations maintain an internal queue of pending tasks. If work is submitted faster than workers can process it, the queue can grow indefinitely:
Reader
β
Thread Pool Queue
β
Workers
Memory usage grows along with the queue. To solve this problem, I introduced a custom semaphore. Before submitting work to the thread pool, the reader acquires a permit:
sem.acquire;
processors.execute;
If all permits are currently in use, the reader blocks until a worker completes and releases one. This creates explicit backpressure. The system can never have more active parsing jobs than the configured semaphore limit.
Why Bounded Systems Matter
Many data pipelines fail because they optimize for throughput but ignore memory growth. A bounded system guarantees predictable resource consumption regardless of input size.
In this implementation:
Maximum In-Flight Parse Jobs = Worker Count
With:
Workers = 16
there can never be more than 16 active parsing jobs at any point in time. Whether the input file contains:
- 10,000 records
- 10 million records
- 944 million records
memory usage remains bounded and predictable. This property is essential for long-running SOC infrastructure.
Measuring Throughput
To monitor performance, progress updates were emitted periodically:
let rate =
progress_frequency as f64 /
elapsed.as_secs_f64;
Output looked like:
Processed: 67800000 lines | Rate: 512.19k/sec
The implementation also tracked:
- Total runtime
- Error count
- Processing rate
using Rust's Instant API.
let start = now;
This provided visibility into bottlenecks and optimization opportunities.
Lessons Learned
1. Streaming Beats Loading
For ETL-style workloads, streaming often outperforms DataFrame approaches because memory remains constant.
2. Thread Pools Do Not Imply Backpressure
A common misconception is that limiting a thread pool to N workers automatically limits memory usage. Most thread pools maintain an internal queue of pending tasks. Without explicit flow control, the queue can continue growing until memory is exhausted.
3. Semaphores Are Powerful
A semaphore provides a simple mechanism for controlling in-flight work. By acquiring a permit before scheduling work and releasing it afterward, the producer naturally slows down whenever processing becomes saturated.
4. Simplicity Scales
The final architecture is conceptually simple:
Read
β Parse
β Transform
β Write
The simplicity made it easier to reason about failures, throughput, and resource consumption.
5. Bounded Systems Are Stable Systems
The most valuable property of the final design was not speed. It was predictability. Regardless of input size, memory consumption remained under control and throughput remained stable.
Potential Applications
Although the original task was converting Zeek logs to CSV, the architecture has broader uses.
SOC Data Pipelines
Zeek
β Normalize
β Enrich
β Store
Threat Hunting
Transform logs into formats suitable for:
- ClickHouse
- Elasticsearch
- OpenSearch
- Splunk
Data Lake Ingestion
Convert raw network telemetry into structured formats before loading into:
- Parquet
- Arrow
- Iceberg
Stream Processing
Replace file output with:
- Kafka
- Redpanda
- NATS
- Pulsar
to create real-time processing pipelines.
Toward a Rust-Native SOC Pipeline
What began as a CSV conversion utility ultimately evolved into a prototype ingestion engine. The architecture already contains many characteristics required for production SOC infrastructure:
- Bounded memory
- Parallel processing
- Explicit backpressure
- Error isolation
- High throughput
- Long-running stability
Future enhancements could include:
Reader
β
Parser
β
Enrichment
β
Detection
β
Storage
Supporting:
- GeoIP enrichment
- ASN enrichment
- Threat intelligence
- Session reconstruction
- Detection rules
Performance Results
Final benchmark:
Processed: 944,100,000 lines
Workers: 16
Queue: 64
Max In-Flight Parse Jobs: 16
Errors: 0
Duration: 5656 seconds
Rate: ~193,600 records/sec
The most surprising result was not the throughput. It was the stability. Despite processing nearly one billion records, the pipeline completed without memory exhaustion, queue explosion, or data loss.
Conclusion
The most important lesson from this project is that large-scale log processing is often less about raw speed and more about system design. A billion-record dataset is manageable when:
- Memory is bounded
- Work is streamed
- Backpressure is enforced
- Components remain simple
The resulting pipeline processed nearly one billion Zeek records with zero errors while maintaining stable performance throughout execution. What started as a file conversion task ended up becoming a practical exploration of how modern SOC ingestion systems can be built using Rust.