Building a Bounded-Memory Pipeline for 944 Million Zeek Events in Rust

Introduction

Recently, I set out to solve what initially appeared to be a simple problem:

Convert a very large Zeek conn.log dataset stored as NDJSON into CSV format.

The challenge was scale. The input dataset contained approximately 944 million records, making it impractical to rely on approaches that load the whole dataset into memory. The solution needed to be:

Fast
Memory efficient
Fault tolerant
Capable of processing billions of events
Suitable as a foundation for larger Security Operations Center (SOC) pipelines

The resulting system processed the entire dataset with the following performance characteristics:

Processed: 944,100,000 lines
Workers:   16
Queue:     64
Errors:    0
Duration:  5656 seconds (~1.57 hours)
Rate:      ~166,900 records/sec

More importantly, it maintained bounded memory usage throughout execution.

The Problem

The source data consisted of newline-delimited JSON (NDJSON) records generated by Zeek. A typical record looks like:

{
  "ts": 1700000000.123,
  "uid": "C4fXnA4l7",
  "id.orig_h": "192.168.1.10",
  "id.resp_h": "8.8.8.8",
  "id.resp_p": 53,
  "proto": "udp",
  "orig_ip_bytes": 78,
  "resp_ip_bytes": 110
}

The goal was to transform these records into a normalized CSV format suitable for downstream analytics. The naive solution would be:

Read entire file
Parse all JSON
Create a DataFrame
Export to CSV

This approach quickly becomes impractical when dealing with hundreds of millions of records.

First Attempt: Polars

My initial approach leveraged Polars and its lazy APIs. The idea was straightforward:

LazyJsonLineReader::new(...)
    .finish()?
    .select(...)
    .sink(...)

The expectation was that Polars would stream records directly from JSON to CSV. However, after experimenting with various versions and APIs, it became clear that achieving true streaming behavior while maintaining predictable memory consumption was more complicated than expected. For analytical workloads, Polars is exceptional. For this specific use case, a massive streaming transformation pipeline, a custom solution offered more control over memory, concurrency, and backpressure.

Designing a Streaming Pipeline

The core design principle became:

Never hold more data in memory than necessary.

Instead of loading data, records are processed as a stream. The resulting architecture looked like this:

                  Reader
                     │
                     ▼
            Acquire Semaphore
                     │
                     ▼
              Thread Pool
            (16 workers max)
                     │
                     ▼
            Result Channel
                     │
                     ▼
               CSV Writer

Each stage performs one responsibility and passes work downstream. The reader continuously streams lines from disk. Workers parse and transform records. A dedicated writer thread serializes CSV output.
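The stage layout can be sketched with nothing but the standard library. This is a rough illustration, not the code behind the numbers above: it uses plain threads and bounded std channels rather than a thread pool, and process_line, run_pipeline, and their parameters are hypothetical names. The transform is a stand-in that does not parse JSON.

```rust
use std::io::{BufRead, Write};
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the real parse/transform step. It does NOT parse
/// JSON: it trims the line, rejects empty input, and upper-cases the rest.
fn process_line(line: &str) -> Result<String, String> {
    let trimmed = line.trim();
    if trimmed.is_empty() {
        Err("empty line".to_string())
    } else {
        Ok(trimmed.to_uppercase())
    }
}

/// Streams `input` through `workers` threads into `output`.
/// Returns the output sink plus (rows written, lines rejected).
/// Row order is not preserved across workers.
fn run_pipeline<R, W>(input: R, mut output: W, workers: usize, queue: usize) -> (W, u64, u64)
where
    R: BufRead,
    W: Write + Send + 'static,
{
    assert!(workers > 0, "at least one worker is required");

    // Bounded channels: `send` blocks whenever the next stage falls behind.
    let (line_tx, line_rx) = sync_channel::<String>(queue);
    let (row_tx, row_rx) = sync_channel::<Result<String, String>>(queue);
    let line_rx = Arc::new(Mutex::new(line_rx));

    // Worker stage: each thread pulls a line, transforms it, forwards the result.
    let mut handles = Vec::new();
    for _ in 0..workers {
        let line_rx = Arc::clone(&line_rx);
        let row_tx = row_tx.clone();
        handles.push(thread::spawn(move || loop {
            // The lock guard is dropped at the end of this statement.
            let next = line_rx.lock().unwrap().recv();
            match next {
                Ok(line) => {
                    if row_tx.send(process_line(&line)).is_err() {
                        break; // writer is gone
                    }
                }
                Err(_) => break, // reader finished and the queue is drained
            }
        }));
    }
    drop(row_tx); // the writer stops once every worker has dropped its sender

    // Writer stage: a single thread owns the output.
    let writer = thread::spawn(move || {
        let (mut written, mut rejected) = (0u64, 0u64);
        for row in row_rx {
            match row {
                Ok(r) => {
                    writeln!(output, "{}", r).expect("write failed");
                    written += 1;
                }
                Err(_) => rejected += 1,
            }
        }
        output.flush().expect("flush failed");
        (output, written, rejected)
    });

    // Reader stage: runs on the calling thread, one line in hand at a time.
    for line in input.lines() {
        line_tx.send(line.expect("read failed")).expect("workers stopped early");
    }
    drop(line_tx); // signal end of input

    for h in handles {
        h.join().expect("worker panicked");
    }
    writer.join().expect("writer panicked")
}
```

Passing a BufReader over the input file and a BufWriter over the output file keeps disk I/O batched. Because several workers run at once, rows can arrive at the writer in a different order than they were read.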

Implementing Backpressure

One of the most important lessons learned involved backpressure. Initially, I assumed that limiting the thread pool to 16 workers would automatically limit memory consumption. That assumption was wrong. Most thread pool implementations maintain an internal queue of pending tasks. If work is submitted faster than workers can process it, the queue can grow indefinitely:

Reader
    ↓
Thread Pool Queue
    ↓
Workers

Memory usage grows along with the queue. To solve this problem, I introduced a custom semaphore. Before submitting work to the thread pool, the reader acquires a permit:

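// Block the reader until a permit is free.
sem.acquire();

// Each job needs its own handles (assuming `sem` is an Arc around the semaphore).
let job_sem = Arc::clone(&sem);
let job_tx = p_tx.clone();

processors.execute(move || {
    let result = process_line(line);

    // Ignore a send error: it only means the receiving side has shut down.
    let _ = job_tx.send(result);

    // Return the permit so the reader can submit the next line.
    job_sem.release();
});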

If all permits are currently in use, the reader blocks until a worker completes and releases one. This creates explicit backpressure. The system can never have more active parsing jobs than the configured semaphore limit.
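For readers who want to see what such a semaphore might look like, here is a minimal sketch built on Mutex and Condvar. It is one possible shape, not necessarily the one used in this project. The Permit guard and the peak_in_flight demo are additions for illustration; the demo spawns one thread per job as a stand-in for a thread pool.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// A minimal blocking counting semaphore.
pub struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    pub fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Blocks the caller until a permit is free, then takes it.
    pub fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Returns a permit and wakes one blocked caller, if any.
    pub fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Hypothetical guard: releases its permit on drop, so a job that panics
/// still gives the permit back.
pub struct Permit(Arc<Semaphore>);

impl Drop for Permit {
    fn drop(&mut self) {
        self.0.release();
    }
}

/// Demo: runs `jobs` short tasks with at most `limit` in flight and
/// returns the highest number of simultaneous jobs that was observed.
pub fn peak_in_flight(jobs: usize, limit: usize) -> usize {
    let sem = Arc::new(Semaphore::new(limit));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let mut handles = Vec::new();

    for _ in 0..jobs {
        sem.acquire(); // the submitting thread blocks here when saturated
        let permit = Permit(Arc::clone(&sem));
        let (in_flight, peak) = (Arc::clone(&in_flight), Arc::clone(&peak));
        handles.push(thread::spawn(move || {
            let _permit = permit; // dropped (and released) when the job ends
            let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
            peak.fetch_max(now, Ordering::SeqCst);
            thread::sleep(Duration::from_millis(1)); // simulated work
            in_flight.fetch_sub(1, Ordering::SeqCst);
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

Tying the release to a guard's Drop is a defensive choice: with a bare release call at the end of the job, a panic inside the job would leak the permit and eventually stall the reader.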

Why Bounded Systems Matter

Many data pipelines fail because they optimize for throughput but ignore memory growth. A bounded system guarantees predictable resource consumption regardless of input size.

In this implementation, the semaphore holds one permit per worker, so:

Maximum In-Flight Parse Jobs = Worker Count

With:

Workers = 16

there can never be more than 16 active parsing jobs at any point in time. Whether the input file contains:

10,000 records
10 million records
944 million records

memory usage remains bounded and predictable. This property is essential for long-running SOC infrastructure.
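The bound can be made concrete with a little arithmetic. The sketch below rests on assumptions the article does not spell out: that results travel through a bounded channel, that each in-flight job holds one input line and at most one output row, and that allocator and I/O buffer overhead is ignored. The function name and the byte sizes in the usage note are hypothetical.

```rust
/// Rough worst-case bytes held by the pipeline under the assumptions above.
/// Note that the number of records in the input does not appear anywhere.
fn memory_upper_bound(
    permits: usize,      // semaphore limit (max in-flight parse jobs)
    result_queue: usize, // capacity of the bounded result channel (assumed)
    max_line: usize,     // largest input line in bytes (assumed)
    max_row: usize,      // largest output row in bytes (assumed)
) -> usize {
    // Each in-flight job holds one input line and at most one output row.
    let in_flight = permits * (max_line + max_row);
    // Rows parked in the result channel, waiting for the writer.
    let queued = result_queue * max_row;
    // Plus the single line the reader holds while it waits for a permit.
    in_flight + queued + max_line
}
```

With 16 permits, a result queue of 64, and hypothetical maximums of 1024 bytes per line and 512 bytes per row, the function returns 58,368 bytes, whether the file has ten thousand records or 944 million.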

Measuring Throughput

To monitor performance, progress updates were emitted periodically:

let rate =
    progress_frequency as f64 /
    elapsed.as_secs_f64();

Output looked like:

Processed: 67800000 lines | Rate: 512.19k/sec

The implementation also tracked:

Total runtime
Error count
Processing rate

Timing relied on Rust's std::time::Instant:

let start = Instant::now();

This provided visibility into bottlenecks and optimization opportunities.
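A small tracker along these lines might look like the following. It is a sketch under assumptions: the Progress struct, its field names, and the summary format are hypothetical, and only progress_frequency and the "k/sec" output style come from the snippets above.

```rust
use std::time::{Duration, Instant};

/// Formats a rate such as "512.19k/sec" from a record count and the time it took.
fn format_rate(records: u64, elapsed: Duration) -> String {
    let secs = elapsed.as_secs_f64();
    if secs <= 0.0 {
        return "n/a".to_string(); // interval too short to measure
    }
    format!("{:.2}k/sec", records as f64 / secs / 1_000.0)
}

/// Tracks totals and produces a progress line every `progress_frequency` records.
struct Progress {
    start: Instant,
    last_report: Instant,
    processed: u64,
    errors: u64,
    progress_frequency: u64,
}

impl Progress {
    fn new(progress_frequency: u64) -> Self {
        let now = Instant::now();
        Progress {
            start: now,
            last_report: now,
            processed: 0,
            errors: 0,
            progress_frequency: progress_frequency.max(1), // avoid modulo by zero
        }
    }

    /// Records one processed line; returns a message at each reporting interval.
    fn record(&mut self, ok: bool) -> Option<String> {
        self.processed += 1;
        if !ok {
            self.errors += 1;
        }
        if self.processed % self.progress_frequency != 0 {
            return None;
        }
        let now = Instant::now();
        // Rate over the most recent interval, not over the whole run.
        let rate = format_rate(self.progress_frequency, now.duration_since(self.last_report));
        self.last_report = now;
        Some(format!("Processed: {} lines | Rate: {}", self.processed, rate))
    }

    /// Final totals, with the rate averaged over the whole run.
    fn summary(&self) -> String {
        let elapsed = self.start.elapsed();
        format!(
            "Processed: {} lines | Errors: {} | Duration: {:.0} seconds | Rate: {}",
            self.processed,
            self.errors,
            elapsed.as_secs_f64(),
            format_rate(self.processed, elapsed)
        )
    }
}
```

The per-interval rate and the whole-run average answer different questions, so a burst reading in a progress line can sit well above the final average.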

Lessons Learned

1. Streaming Beats Loading

For ETL-style workloads at this scale, streaming is often more practical than a DataFrame approach because memory stays bounded instead of growing with the input.

2. Thread Pools Do Not Imply Backpressure

A common misconception is that limiting a thread pool to N workers automatically limits memory usage. Most thread pools maintain an internal queue of pending tasks. Without explicit flow control, the queue can continue growing until memory is exhausted.
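The difference is easy to reproduce with the standard library's channels, used here as a stand-in for a thread pool's internal queue (an analogy, not the pool's actual implementation). With no consumer running, an unbounded channel accepts every item, while a bounded one refuses once its capacity is reached. The function names are hypothetical.

```rust
use std::sync::mpsc::{channel, sync_channel};

/// Counts how many of `n` items an unbounded queue accepts while nothing consumes them.
fn accepted_unbounded(n: usize) -> usize {
    // `_rx` stays alive so sends succeed, but nothing is ever received.
    let (tx, _rx) = channel::<Vec<u8>>();
    (0..n).filter(|_| tx.send(vec![0u8; 64]).is_ok()).count()
}

/// Same experiment against a queue bounded at `capacity`.
fn accepted_bounded(n: usize, capacity: usize) -> usize {
    let (tx, _rx) = sync_channel::<Vec<u8>>(capacity);
    // `try_send` fails instead of blocking once the buffer is full.
    (0..n).filter(|_| tx.try_send(vec![0u8; 64]).is_ok()).count()
}
```

Calling accepted_unbounded(10_000) accepts all 10,000 buffers, whereas accepted_bounded(10_000, 64) stops at 64. In a real pipeline the producer would use a blocking send rather than try_send, which is what turns a full queue into backpressure.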

3. Semaphores Are Powerful

A semaphore provides a simple mechanism for controlling in-flight work. By acquiring a permit before scheduling work and releasing it afterward, the producer naturally slows down whenever processing becomes saturated.

4. Simplicity Scales

The final architecture is conceptually simple:

Read
→ Parse
→ Transform
→ Write

The simplicity made it easier to reason about failures, throughput, and resource consumption.

5. Bounded Systems Are Stable Systems

The most valuable property of the final design was not speed. It was predictability. Regardless of input size, memory consumption remained under control and throughput remained stable.

Potential Applications

Although the original task was converting Zeek logs to CSV, the architecture has broader uses.

SOC Data Pipelines

Zeek
→ Normalize
→ Enrich
→ Store

Threat Hunting

Transform logs into formats suitable for:

ClickHouse
Elasticsearch
OpenSearch
Splunk

Data Lake Ingestion

Convert raw network telemetry into structured formats such as:

Parquet
Arrow
Iceberg

Stream Processing

Replace file output with:

Kafka
Redpanda
NATS
Pulsar

to create real-time processing pipelines.

Toward a Rust-Native SOC Pipeline

What began as a CSV conversion utility ultimately evolved into a prototype ingestion engine. The architecture already contains many characteristics required for production SOC infrastructure:

Bounded memory
Parallel processing
Explicit backpressure
Error isolation
High throughput
Long-running stability

Future enhancements could extend the pipeline with additional stages:

Reader
  ↓
Parser
  ↓
Enrichment
  ↓
Detection
  ↓
Storage

These stages could support:

GeoIP enrichment
ASN enrichment
Threat intelligence
Session reconstruction
Detection rules

Performance Results

Final benchmark:

Processed: 944,100,000 lines
Workers:   16
Queue:     64
Max In-Flight Parse Jobs: 16
Errors:    0
Duration:  5656 seconds
Rate:      ~166,900 records/sec

The most surprising result was not the throughput. It was the stability. Despite processing nearly one billion records, the pipeline completed without memory exhaustion, queue explosion, or data loss.

Conclusion

The most important lesson from this project is that large-scale log processing is often less about raw speed and more about system design. A billion-record dataset is manageable when:

Memory is bounded
Work is streamed
Backpressure is enforced
Components remain simple

The resulting pipeline processed nearly one billion Zeek records with zero errors while maintaining stable performance throughout execution. What started as a file conversion task ended up becoming a practical exploration of how modern SOC ingestion systems can be built using Rust.