
Building Zero-Latency PII Detection

How we built a PII detection system that adds microseconds, not seconds, to LLM request times. A look at the engineering tradeoffs and architecture decisions.

When we started building proxy0, we had a simple requirement: detect and redact PII from LLM requests without adding noticeable latency. "Noticeable" to us meant staying under 10 milliseconds for typical payloads—ideally much less.

This turned out to be a harder problem than we expected. Here's how we solved it.

The Latency Problem

Most PII detection systems are built for batch processing. They're designed to scan documents, databases, or data lakes—environments where a few seconds of processing time is acceptable. Throw an extra second or two of latency into a real-time LLM interaction and you've destroyed the user experience.

Consider the baseline: a typical LLM API call to GPT-4 or Claude takes 500ms to 3 seconds depending on the response length. Adding 100ms of PII detection overhead means a 3-20% increase in total latency. Adding 1 second means you've potentially doubled your response time.

Our Target

We set an aggressive target: p99 latency under 5ms for payloads up to 10,000 tokens. That means even in the worst case, PII detection adds less than 1% overhead to typical LLM calls.

Why Existing Solutions Are Slow

Traditional PII detection approaches have fundamental performance limitations:

ML-Based NER Models

Named Entity Recognition (NER) models, such as those shipped with spaCy or Hugging Face Transformers, provide excellent accuracy but come with significant overhead. Even optimized inference adds 50-200ms for typical text lengths. Running inference on every LLM request would crush our latency budget.

Regex Pattern Matching

Pure regex is fast but misses context-dependent PII. The string "John" might be a name or part of "John Deere" (a company). "May 4th" might be a date or just text. Without context, regex either over-matches (false positives) or under-matches (false negatives).

Cloud API Services

Services like AWS Comprehend or Google DLP are accurate but add network round-trip latency. Even with optimized networking, you're looking at 30-100ms minimum per call. Plus, there's an inherent irony in sending data to a cloud service to check if it contains sensitive data.

Our Architecture

We took a hybrid approach that combines the speed of pattern matching with the accuracy of ML models, processed in a pipeline that minimizes latency.

Layer 1: Fast Pattern Matching

The first layer uses highly optimized pattern matching for structured PII—data with predictable formats such as email addresses, phone numbers, credit card numbers, and Social Security numbers.

We use a multi-pattern matching algorithm (Aho-Corasick) that scans text in a single pass regardless of how many patterns we're looking for. This layer typically completes in under 1ms.
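
For a concrete sense of what that single pass looks like, here is a minimal sketch using the aho-corasick crate (assumed as a dependency); the trigger strings below are illustrative placeholders, not our production pattern set.

```rust
use aho_corasick::{AhoCorasick, MatchKind};

fn main() {
    // Illustrative literal triggers that anchor structured-PII candidates;
    // a real pattern set is far larger and tuned per detector.
    let patterns = &["ssn:", "social security", "card number", "@"];

    // Build the automaton once; scanning is then a single pass over the text,
    // no matter how many patterns it contains.
    let ac = AhoCorasick::builder()
        .ascii_case_insensitive(true)
        .match_kind(MatchKind::LeftmostLongest)
        .build(patterns)
        .expect("patterns compile");

    let text = "Contact jane@example.com, card number 4111 1111 1111 1111.";
    for m in ac.find_iter(text) {
        println!(
            "pattern {} matched at bytes {}..{}",
            m.pattern().as_usize(),
            m.start(),
            m.end()
        );
    }
}
```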

Layer 2: Dictionary Lookup

The second layer handles names using optimized dictionary lookups. We maintain compressed tries of common first names, last names, and location names. The data structures are designed for cache-efficient access patterns.

Key optimization: instead of checking every word against every dictionary, we use bloom filters to quickly reject non-matches. A word that doesn't appear in the bloom filter definitely isn't in our dictionary, and we can skip the full lookup.
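
As a rough illustration of that fast-rejection path, here is a hand-rolled sketch—a toy Bloom filter backed by a plain HashSet standing in for the compressed tries:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// A toy Bloom filter: k seeded hashes over a fixed bit array.
/// "false" is definitive (skip the dictionary); "true" still needs the real lookup.
struct Bloom {
    bits: Vec<bool>,
    k: u64,
}

impl Bloom {
    fn new(num_bits: usize, k: u64) -> Self {
        Bloom { bits: vec![false; num_bits], k }
    }

    fn index(&self, word: &str, seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        word.hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }

    fn insert(&mut self, word: &str) {
        for seed in 0..self.k {
            let i = self.index(word, seed);
            self.bits[i] = true;
        }
    }

    fn maybe_contains(&self, word: &str) -> bool {
        (0..self.k).all(|seed| self.bits[self.index(word, seed)])
    }
}

fn main() {
    // Stand-in for the compressed tries of ~50,000 names.
    let names: HashSet<&str> = ["alice", "bob", "carol"].into_iter().collect();

    let mut bloom = Bloom::new(1 << 16, 3);
    for name in &names {
        bloom.insert(name);
    }

    for word in ["bob", "latency", "carol", "tokenizer"] {
        // Fast rejection: most non-name tokens never reach the full lookup.
        let is_name = bloom.maybe_contains(word) && names.contains(word);
        println!("{word}: {}", if is_name { "dictionary hit" } else { "rejected" });
    }
}
```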

Layer 3: Contextual Validation

This is where we avoid the false positive problem. After identifying candidates from layers 1 and 2, we apply lightweight contextual rules that examine the tokens surrounding each candidate—for example, to tell "John" the person apart from "John Deere" the company.

These rules are implemented as finite state machines that can be evaluated in constant time per candidate.
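
A simplified sketch of the idea, with a hypothetical cue list and a plain window scan standing in for the compiled state machines:

```rust
/// A simplified contextual rule: a name candidate is suppressed when a
/// disqualifying token (here, a company cue) appears within a small window
/// around it. Each candidate costs O(window) work, i.e. constant time.
fn is_probable_person_name(tokens: &[&str], candidate_idx: usize, window: usize) -> bool {
    // Hypothetical cue list; production rules are compiled into state machines.
    const COMPANY_CUES: [&str; 4] = ["deere", "inc", "llc", "corp"];

    let start = candidate_idx.saturating_sub(window);
    let end = (candidate_idx + window + 1).min(tokens.len());

    !tokens[start..end]
        .iter()
        .any(|t| COMPANY_CUES.contains(&t.to_lowercase().as_str()))
}

fn main() {
    let a: Vec<&str> = "please email john about the rollout".split_whitespace().collect();
    let b: Vec<&str> = "we ordered a john deere tractor".split_whitespace().collect();

    // "john" is token 2 in the first sentence and token 3 in the second.
    println!("{}", is_probable_person_name(&a, 2, 5)); // true  -> treat as PII
    println!("{}", is_probable_person_name(&b, 3, 5)); // false -> company name
}
```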

Layer 4: ML Fallback (Optional)

For customers requiring maximum accuracy, we offer an optional ML layer that runs a distilled NER model on ambiguous cases. This model is tiny (under 10MB) and runs locally—no network calls. It's only invoked when the first three layers flag something as uncertain.

In practice, the ML layer is triggered on less than 5% of requests, keeping average latency low while maintaining high accuracy.

Performance Results

Here's how our system performs on real-world payloads:

| Payload Size  | p50 Latency | p99 Latency | Memory |
|---------------|-------------|-------------|--------|
| 100 tokens    | 0.3ms       | 0.8ms       | 2MB    |
| 1,000 tokens  | 0.9ms       | 2.1ms       | 2MB    |
| 5,000 tokens  | 2.4ms       | 4.2ms       | 3MB    |
| 10,000 tokens | 4.1ms       | 6.8ms       | 4MB    |

These benchmarks were run on a standard 2-core VM. Performance scales linearly with CPU cores for concurrent requests.

Accuracy vs. Speed Tradeoffs

Every engineering decision involves tradeoffs. Here are some we made:

Tradeoff 1: Dictionary Coverage

Our name dictionaries cover the 50,000 most common names in each category. This catches ~98% of names in typical business communications but may miss unusual names. We chose coverage that balances memory usage and lookup speed against comprehensiveness.

Tradeoff 2: Pattern Specificity

Our patterns are tuned to minimize false positives rather than maximize recall. We'd rather miss an ambiguous case than flag innocent text as PII. Users can adjust this threshold based on their risk tolerance.

Tradeoff 3: Context Window

Our contextual validation looks at a fixed window (±5 tokens) around each candidate. This is usually sufficient for disambiguation but may miss very long-range dependencies. A larger window would improve accuracy but increase latency.

Implementation Details

A few technical choices that significantly impacted performance:

Language Choice

The core detection engine is written in Rust. The combination of zero-cost abstractions, no garbage collection pauses, and excellent SIMD support makes it ideal for this workload. We expose bindings for Python, Node.js, and Go.
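
As an aside on how a Rust core can back those bindings, here is a hedged sketch of one common approach—exposing a C ABI that higher-level Python, Node.js, and Go wrappers call into. The function names and the toy redaction logic are placeholders, not the actual proxy0 interface.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Hypothetical C-ABI entry point: takes a NUL-terminated string,
/// returns a newly allocated redacted copy.
#[no_mangle]
pub extern "C" fn pii_redact(input: *const c_char) -> *mut c_char {
    let text = unsafe {
        assert!(!input.is_null());
        CStr::from_ptr(input).to_string_lossy().into_owned()
    };

    // Placeholder for the real detection pipeline.
    let redacted = text.replace("4111 1111 1111 1111", "[REDACTED_CARD]");

    CString::new(redacted).unwrap().into_raw()
}

/// Callers hand strings allocated above back to Rust so it can free them.
#[no_mangle]
pub extern "C" fn pii_free(s: *mut c_char) {
    if !s.is_null() {
        unsafe { drop(CString::from_raw(s)) };
    }
}
```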

Memory Layout

We carefully designed data structures for cache efficiency. The pattern matcher and dictionaries are laid out to minimize cache misses during scans. On modern CPUs, memory access patterns often matter more than algorithmic complexity.

SIMD Optimization

Where possible, we use SIMD instructions to process multiple characters in parallel. The initial scan that identifies candidate regions is fully vectorized, processing 32 bytes per instruction on AVX2-capable processors.
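
To illustrate the general technique (not our production kernel), here is a sketch of an AVX2 scan that looks for a single anchor byte—'@' as an email cue—32 bytes per iteration, with a scalar tail for the remainder:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Return the positions of `needle` in `haystack`, comparing 32 bytes at a time.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn find_byte_avx2(haystack: &[u8], needle: u8) -> Vec<usize> {
    let mut hits = Vec::new();
    let target = _mm256_set1_epi8(needle as i8);
    let mut i = 0;

    while i + 32 <= haystack.len() {
        // Compare a 32-byte chunk against the needle in one shot.
        let chunk = _mm256_loadu_si256(haystack.as_ptr().add(i) as *const __m256i);
        let eq = _mm256_cmpeq_epi8(chunk, target);
        let mut mask = _mm256_movemask_epi8(eq) as u32;

        // Each set bit in the mask is a matching byte position.
        while mask != 0 {
            hits.push(i + mask.trailing_zeros() as usize);
            mask &= mask - 1; // clear the lowest set bit
        }
        i += 32;
    }

    // Scalar tail for the final partial block.
    for (j, &b) in haystack[i..].iter().enumerate() {
        if b == needle {
            hits.push(i + j);
        }
    }
    hits
}

fn main() {
    let text = b"reach me at jane@example.com or ops@example.org for details";

    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safety: we just confirmed AVX2 is available at runtime.
            let hits = unsafe { find_byte_avx2(text, b'@') };
            println!("'@' found at byte offsets {:?}", hits);
            return;
        }
    }
    println!("AVX2 not available; a scalar path would run instead.");
}
```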

Lazy Evaluation

The pipeline is lazy—later stages only process candidates identified by earlier stages. If the fast pattern matcher finds nothing, the contextual validator never runs. This keeps latency low for the common case where text contains little or no PII.
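
In outline, the pipeline has roughly the shape sketched below, where every function is a hypothetical stand-in for one of the layers described above:

```rust
/// A flagged span: byte offsets into the scanned text plus an uncertainty flag.
struct Candidate {
    span: (usize, usize),
    uncertain: bool,
}

// Hypothetical stand-ins for the four layers.
fn pattern_scan(_text: &str) -> Vec<Candidate> { Vec::new() }
fn dictionary_scan(_text: &str) -> Vec<Candidate> { Vec::new() }
fn contextual_filter(candidates: Vec<Candidate>, _text: &str) -> Vec<Candidate> { candidates }
fn ml_verify(candidate: Candidate, _text: &str) -> Option<Candidate> { Some(candidate) }

fn detect(text: &str) -> Vec<Candidate> {
    // Layers 1 and 2: cheap scans that produce candidate spans.
    let mut candidates = pattern_scan(text);
    candidates.extend(dictionary_scan(text));

    // Common case: nothing matched, so no later stage ever runs.
    if candidates.is_empty() {
        return candidates;
    }

    // Layer 3: contextual validation only sees surviving candidates.
    let validated = contextual_filter(candidates, text);

    // Layer 4: the local ML model only sees the rare uncertain candidates.
    validated
        .into_iter()
        .filter_map(|c| if c.uncertain { ml_verify(c, text) } else { Some(c) })
        .collect()
}

fn main() {
    let findings = detect("nothing sensitive in this text");
    println!("{} findings: {:?}", findings.len(),
             findings.iter().map(|c| c.span).collect::<Vec<_>>());
}
```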

Lessons Learned

Building this system taught us several lessons:

  1. Benchmark early and often. We found that intuition about what's "fast enough" was often wrong. Measure everything.
  2. The 99th percentile matters. A system that's fast on average but occasionally slow destroys user experience. We optimized for consistent performance.
  3. Simple beats clever. Our first attempt used sophisticated ML everywhere. The current hybrid approach is simpler and faster.
  4. Local processing wins. Eliminating network calls wasn't just about latency—it simplified deployment and improved reliability.

What's Next

We're continuing to improve the system.

The goal remains the same: make PII detection so fast that there's no reason not to use it on every LLM request. Security shouldn't come at the cost of user experience.


proxy0 Team

Building guardrails for AI agents. Two lines of code.
