Accuracy: precision, recall & false positives

The first question a security team asks about any scanner is “how noisy is it?” A gate that cries wolf gets turned off. So we build precision-first, we measure it on a held-out corpus, and we publish the numbers.

How we measure it

We maintain a held-out evaluation corpus of 264 packages: 198 confirmed-malicious releases (sourced from public OSV malware advisories) and 66 benign packages chosen specifically because they trip naive heuristics — MCP servers, AI-agent tooling, security utilities, and CLIs that legitimately read credentials, spawn processes, or fetch remote data. Every change to a detection rule is re-run against this corpus before it ships.

Precision: 100% — zero false positives

On the current corpus, every package our gate would block was genuinely malicious — no false positives. This is a deliberate engineering constraint, not a happy accident: we do not ship a detector change that introduces a false positive on the corpus. Precision is the gate we hold the product to, because in a CI pipeline a single false block erodes trust in every verdict that follows.

The 100% figure is measured on this evaluation corpus; it is not a universal guarantee for every package in existence. We grow the corpus as we encounter new benign edge cases (the kinds that previously caused false alarms), so the bar gets harder over time.
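The ship rule described in this section can be pictured as a regression check over the labelled corpus. The following is a minimal sketch, not PkgRadar's actual tooling: `CorpusEntry`, `false_positives`, `may_ship`, the package names, and both toy rules are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class CorpusEntry:
    name: str        # hypothetical identifier, e.g. "some-pkg@1.0.0"
    malicious: bool  # ground-truth label recorded in the corpus


def false_positives(
    corpus: Iterable[CorpusEntry],
    would_block: Callable[[CorpusEntry], bool],
) -> List[str]:
    """Names of benign entries that the candidate detector would block."""
    return [e.name for e in corpus if not e.malicious and would_block(e)]


def may_ship(corpus, would_block) -> bool:
    """A detector change may ship only if it blocks no benign entry."""
    return len(false_positives(corpus, would_block)) == 0


# Toy corpus, invented for illustration only.
toy_corpus = [
    CorpusEntry("stealer-pkg@0.0.1", malicious=True),
    CorpusEntry("mcp-server-demo@2.1.0", malicious=False),
    CorpusEntry("cred-reading-cli@1.4.2", malicious=False),
]

# A strict rule blocks only the labelled-malicious sample.
strict_rule = lambda e: e.name.startswith("stealer")
# A loosened rule also fires on anything that mentions credentials.
loose_rule = lambda e: e.name.startswith("stealer") or "cred" in e.name

print(may_ship(toy_corpus, strict_rule))
print(false_positives(toy_corpus, loose_rule))
```

In this toy setup the loosened rule catches nothing extra but flags a benign credential-reading CLI, so under a zero-false-positive constraint it would be rejected.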

Recall: 34% from static signals alone — and why

On the same corpus, PkgRadar’s static rules independently flag about a third of known-malicious releases from the code and manifest alone. We are honest about that number rather than inflating it, because the gap is the direct consequence of two deliberate choices:

Static-only analysis. We never execute package code, so malware whose only tell is runtime behavior (no suspicious strings, no install hooks, no obfuscation) is invisible to static rules by design. See our detection methodology for why we make that trade.
Precision over recall. We would rather miss a subtle sample than flag a clean one. Loosening rules to chase recall is easy; keeping the gate quiet enough that teams leave it on is the hard, valuable part.

The static recall figure is a floor, not the production catch rate. In production we layer OSV confirmation and cross-release campaign correlation on top of the static signals, so more malicious releases are surfaced, and with better lead time, than static rules alone would achieve. See the lead-time benchmark and coverage stats for the live picture. Improving static recall without sacrificing precision is where most of our detection work goes.
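For readers who want the arithmetic behind the two headline figures, here is a small worked sketch using the standard definitions of precision and recall. The corpus sizes (198 malicious, 66 benign) and the zero false positives come from this page; the exact true-positive count is not stated here, so `TRUE_POSITIVES = 67` is an assumption chosen only because it is consistent with the rounded 34%.

```python
def precision(tp: int, fp: int) -> float:
    """Share of blocked packages that were genuinely malicious."""
    if tp + fp == 0:
        raise ValueError("precision is undefined when nothing is blocked")
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of malicious packages that were blocked."""
    if tp + fn == 0:
        raise ValueError("recall is undefined without malicious samples")
    return tp / (tp + fn)


MALICIOUS = 198       # from the corpus description
BENIGN = 66           # from the corpus description
FALSE_POSITIVES = 0   # benign packages blocked, as reported
TRUE_POSITIVES = 67   # ASSUMPTION: one count consistent with "34%"

false_negatives = MALICIOUS - TRUE_POSITIVES

print(f"precision = {precision(TRUE_POSITIVES, FALSE_POSITIVES):.0%}")
print(f"recall    = {recall(TRUE_POSITIVES, false_negatives):.0%}")
```

Note that precision depends only on what the gate blocks, which is why it can sit at 100% while recall from static signals stays near a third.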

What a verdict means

High — stacked, high-severity indicators. The CI gate blocks these by default.
Review — a weaker or single signal worth a human look, not an automatic block.
Low — no meaningful static signal; the gate passes it.

The gate is configurable: fail_on lets you choose whether to block on high only or also on review, so you tune the precision/recall trade-off to your own risk tolerance.
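The effect of fail_on can be sketched as a threshold over the three verdict levels. This is an illustrative model only: the page names the fail_on setting and the High, Review, and Low verdicts, but the lowercase string values, the `should_block` function, and the `SEVERITY` ordering table are assumptions, not PkgRadar's real configuration syntax or API.

```python
# ASSUMPTION: verdicts are modelled as lowercase strings ordered by severity.
SEVERITY = {"low": 0, "review": 1, "high": 2}

# The page describes two choices: block on high only, or also on review.
ALLOWED_FAIL_ON = ("high", "review")


def should_block(verdict: str, fail_on: str = "high") -> bool:
    """Return True if a package with this verdict would fail the gate."""
    if fail_on not in ALLOWED_FAIL_ON:
        raise ValueError(f"fail_on must be one of {ALLOWED_FAIL_ON}")
    if verdict not in SEVERITY:
        raise ValueError(f"unknown verdict: {verdict!r}")
    # Block when the verdict is at least as severe as the threshold.
    return SEVERITY[verdict] >= SEVERITY[fail_on]


for v in ("high", "review", "low"):
    print(v, should_block(v), should_block(v, fail_on="review"))
```

With the default threshold only High verdicts fail the build; lowering it to review trades some quiet for extra coverage, and Low always passes.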

When we get it wrong

A false positive is a bug, and we treat it like one. If PkgRadar flags a package you believe is clean, send it to [email protected]. Valid cases become new entries in the evaluation corpus above, so the same mistake can’t ship twice. In the meantime you can allowlist a specific package and version from your dashboard so your pipeline keeps moving.

For the detectors behind these numbers, see the detection methodology; for the real-world attacks they map to, see attacks we catch.
