k-Anonymity audit

Pseudonymization replaces direct identifiers (name, email, ID). But a row can still be re-identifiable through a combination of indirect attributes — birth date + ZIP code + gender, for instance. The brume audit --anonymity subcommand measures that residual risk.

What is k-anonymity?

A dataset is k-anonymous with respect to a set of quasi-identifiers if every combination of those values appears in at least k rows.

Introduced by Sweeney (2002), the metric captures one specific risk: if your row is unique on (birth_date, zip_code, gender), anyone who knows those three values about you can pick you out of the dataset — even if your name has been replaced.

Quasi-identifiers

A quasi-identifier is a column whose value isn’t a direct identifier on its own, but contributes to identifying you when combined with others.

Common examples:

Demographic: birth_date, gender, nationality, zip_code, city
Behavioral: subscription_plan + created_at + last_login_at
Geographic: lat/lng rounded coordinates

You declare them per table when you run the audit.

Singleton classes

A singleton class is a combination of quasi-identifier values that appears in exactly one row. Singletons are the most re-identifiable rows in your dataset — the ones the audit highlights first.

k = 1 means at least one singleton exists. Common targets: k ≥ 5 (light protection) up to k ≥ 50 (high protection for sensitive datasets).

Running the audit

brume audit --anonymity \
  --quasi-id "users:birth_date,zip_code,gender" \
  --report-format markdown

Multiple tables can be audited in one run by repeating --quasi-id:

brume audit --anonymity \
  --quasi-id "users:birth_date,zip_code,gender" \
  --quasi-id "patients:age_bucket,city,blood_type" \
  --report-out audit-2026-q2.md

Reading the report

The report has three sections:

Summary — k_min, number of singletons, percentage of rows in equivalence classes below threshold.
Top exposed combinations — the smallest classes, sorted by k ascending.
Recommendations — concrete suggestions: generalize this column (birth_date → birth_year), suppress that one, or change its strategy in brume.yml.

The report is designed to be handed to your DPO as-is for decision support.

What the audit is not

The audit measures k-anonymity only. It does not measure:

l-diversity — whether sensitive attributes are diverse within each equivalence class.
t-closeness — whether the distribution of sensitive attributes inside a class matches the global distribution.
Cross-table risk — re-identification through joins between two of your tables.

These are roadmap items. For now, k-anonymity gives you the single most useful number to discuss with your DPO.

In CI/CD

The audit exits with code 3 if a threshold is breached, so you can gate runs:

brume audit --anonymity \
  --quasi-id "users:birth_date,zip_code,gender" \
  --min-k 5 \
  --report-out artifacts/k-anonymity.md

Use this to keep staging refreshes within an acceptable risk envelope.

GDPR & compliance — overall sign-off checklist.
Strategies — adjust per-column transformations to raise k.