PkgRadar

Benchmark · reproducible

PkgRadar vs GuardDog vs OSV — run on the same packages, in the open.

Most tools market unaudited accuracy claims. We publish a reproducible head-to-head: all three tools run over the identical set of malicious and benign package artifacts. The defensible finding isn’t a recall race; it’s the operating point.

Run in June 2026, against a fixed set of 71 popular packages and PkgRadar’s held-out malicious corpus, at a CI-block threshold. Every number on this page is measured on that specific corpus and operating point — not a universal claim about any tool. See the methodology section below to reproduce it.

False positives on 71 popular, legitimate packages

Tool | False-positive rate
PkgRadar | 0.0% (0 / 71)
GuardDog (≥1 threat-* rule) | 50.7% (36 / 71)

At a threshold you could safely wire into a CI gate or an install-time firewall, GuardDog flagged 36 of the 71 popular packages in this run (its threat-* rules fired on normal process hooks, minified bundle code, and system-info reads); PkgRadar flagged none. That is the difference between “a signal for a human to triage” and “safe to auto-block.”

Measured on this specific corpus at a CI-block threshold — not a universal claim about GuardDog’s behavior on other packages or at other thresholds.
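
For readers who want to check the arithmetic, here is a minimal sketch of how the rates in the false-positive table follow from the raw counts. The class and field names are ours for illustration only; they are not part of any tool's API, and the only real inputs are the counts reported above (0 of 71 and 36 of 71).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FalsePositiveResult:
    """Illustrative container (hypothetical name, not a PkgRadar or GuardDog type)."""
    tool: str
    flagged: int  # benign packages the tool flagged at the CI-block threshold
    total: int    # benign packages scanned

    @property
    def rate(self) -> float:
        # False-positive rate = flagged benign packages / all benign packages.
        if self.total <= 0:
            raise ValueError("total must be positive")
        return self.flagged / self.total

    def summary(self) -> str:
        return f"{self.tool}: {self.rate * 100:.1f}% ({self.flagged} / {self.total})"


# Counts copied from the false-positive table on this page.
results = [
    FalsePositiveResult("PkgRadar", 0, 71),
    FalsePositiveResult("GuardDog", 36, 71),
]

for r in results:
    print(r.summary())
```

Reporting the raw counts next to each percentage makes it easy to see how small the benign set is: with 71 packages, one extra flag moves the rate by about 1.4 percentage points.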

Recall — and why we caveat it

Tool | Recall on the malicious set
PkgRadar (static) | 35%
GuardDog (static) | 84%
OSV.dev (database lookup) | 51%

PkgRadar’s 35% is the static-only recall on this held-out corpus, measured at this run’s threshold. It may differ from the live figure on /accuracy, which is recomputed against the current corpus.

Read this honestly: the malicious corpus is largely sourced from the public Datadog dataset, whose samples were themselves mostly found by GuardDog, so GuardDog’s recall here is inflated by construction. We don’t claim it means GuardDog “beats” PkgRadar at detection; it means a tool tends to re-find its own catches. OSV is a malware database, so its number reflects coverage and lead time, not static detection. The bias-robust comparison is the false-positive table above, where the operating points are unmistakably different.
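
To make the "inflated by construction" point concrete, here is a small sketch of one way to see the effect: compute recall over the whole corpus, then again over only the samples that the tool under test did not originally surface. This is an illustration, not part of the published methodology. The `found_by` field, the function names, and the toy samples are all hypothetical, and the numbers are synthetic, not benchmark results.

```python
from typing import Iterable, List, NamedTuple


class Sample(NamedTuple):
    name: str
    found_by: str   # hypothetical provenance label: which tool first surfaced the sample
    detected: bool  # whether the tool under test flagged it in this run


def recall(samples: Iterable[Sample]) -> float:
    """Fraction of malicious samples that the tool under test detected."""
    items: List[Sample] = list(samples)
    if not items:
        raise ValueError("no samples to score")
    return sum(1 for s in items if s.detected) / len(items)


def recall_excluding_source(samples: Iterable[Sample], tool: str) -> float:
    """Recall restricted to samples the given tool did not originally find."""
    return recall(s for s in samples if s.found_by != tool)


# Synthetic toy corpus (NOT measured data): three samples first found by
# "tool_x" itself, two found by other means.
toy_corpus = [
    Sample("pkg-a", "tool_x", True),
    Sample("pkg-b", "tool_x", True),
    Sample("pkg-c", "tool_x", True),
    Sample("pkg-d", "other", True),
    Sample("pkg-e", "other", False),
]

overall = recall(toy_corpus)
independent = recall_excluding_source(toy_corpus, "tool_x")
print(f"overall recall: {overall:.2f}, independent-source recall: {independent:.2f}")
```

In this toy corpus the overall figure is higher than the independent-source figure purely because most samples came from the tool being scored. That is the shape of the bias described above. The size of the effect in the real corpus is not something this sketch measures.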

Methodology & reproduction

This is the wedge we stand on: audited accuracy, a dated flagged-first-vs-OSV record, and a benchmark anyone can re-run — not a private “99%” slide. We’d rather show the caveats than hide them.