
Anonymization vs pseudonymization: what actually matters for the GDPR

"Anonymized" is one of the most overused words in data marketing. Under the GDPR, the difference between pseudonymization and anonymization decides whether your test environments stay within the scope of the regulation or not. Here are the three criteria the EDPB uses to settle the question, and how Anonyx gives you measurable evidence instead of a promise.

  • Pseudonymization: data remains personal, the GDPR still applies (art. 4(5))
  • Anonymization: re-identification is no longer reasonably possible, so the data falls outside GDPR scope (recital 26)
  • Three EDPB criteria: singling out, linkability, inference (Opinion 05/2014)
  • Anonyx measures the singling-out criterion with k-anonymity, on every run
  • HMAC-SHA256 hashing with an ephemeral key: mapping cannot be rebuilt after the run
  • Re-identification risk report per run: your evidence for the DPO

What the GDPR says

Pseudonymization is defined by GDPR article 4(5): processing data so that it can no longer be attributed to a person without additional information. Replacing a name with an identifier, hashing an email, masking a phone number: all of that is pseudonymization. The data remains personal data, and the GDPR keeps applying - including to your dev and test environments.

Anonymization is described by recital 26: data rendered anonymous in such a manner that the data subject is not or no longer identifiable, taking into account all the means reasonably likely to be used. Truly anonymous data falls outside the scope of the GDPR.

Pseudonymization is not a failure, though: it is a minimization and security measure explicitly recognized by articles 25 (data protection by design) and 32 (security of processing). For production copies in test environments, it already drastically reduces the risk in case of a leak.

The three EDPB criteria (Opinion 05/2014)

The Article 29 Working Party (WP29), the predecessor of the European Data Protection Board (EDPB), set out three attack criteria in Opinion 05/2014 for assessing how robust an anonymization is. All three are evaluated on the whole dataset:

  1. Singling out. Can you isolate one person's record? Replacing the name is not enough: the combination zip code + birth date + gender is unique for a large share of the population. These harmless-looking columns are quasi-identifiers.
  2. Linkability. Can you link two records about the same person, within the same dataset or by crossing it with another one?
  3. Inference. Can you deduce, with significant probability, the value of a sensitive attribute from the others?

A tool that transforms values column by column - whichever it is - neutralizes none of these criteria on its own: you need guarantees computed across the entire dataset.
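To make the point concrete, here is a minimal sketch of a linkage attack on data whose names were replaced but whose quasi-identifiers were left untouched. The records, the `public_register` auxiliary source and the `link` helper are all made up for illustration; none of this is Anonyx code.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("zip", "birth_date", "gender")

# Pseudonymized test data: the name column is gone, the other columns are intact.
pseudonymized = [
    {"id": "u_01", "zip": "75011", "birth_date": "1984-03-12", "gender": "F", "diagnosis": "asthma"},
    {"id": "u_02", "zip": "75011", "birth_date": "1990-07-01", "gender": "M", "diagnosis": "diabetes"},
    {"id": "u_03", "zip": "69003", "birth_date": "1984-03-12", "gender": "F", "diagnosis": "none"},
]

# Hypothetical auxiliary data an attacker might hold (assumption for this example).
public_register = [
    {"name": "Alice Martin", "zip": "75011", "birth_date": "1984-03-12", "gender": "F"},
    {"name": "Bruno Petit", "zip": "75011", "birth_date": "1990-07-01", "gender": "M"},
]


def qi_key(row):
    """Tuple of quasi-identifier values for one row."""
    return tuple(row[column] for column in QUASI_IDENTIFIERS)


def link(pseudo_rows, aux_rows):
    """Attach a name to every pseudonymized row whose quasi-identifier combination is unique."""
    counts = Counter(qi_key(row) for row in pseudo_rows)
    by_key = {qi_key(row): row for row in pseudo_rows}
    matches = []
    for aux in aux_rows:
        key = qi_key(aux)
        if counts.get(key) == 1:  # singled out: exactly one candidate row
            matches.append((aux["name"], by_key[key]["diagnosis"]))
    return matches


reidentified = link(pseudonymized, public_register)
print(reidentified)
```

Both people in the auxiliary source are re-identified together with their sensitive attribute, even though no name survives in the pseudonymized rows. No per-column transformation of the name would have changed that outcome.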

What Anonyx covers, concretely

Hash irreversibility. The hash strategy uses HMAC-SHA256 with an ephemeral key generated for each run and destroyed when it completes. Without the key, the mapping cannot be reconstructed, even with a dictionary attack on low-cardinality values.
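The idea can be sketched with Python's standard library. This is an illustration of keyed hashing with a per-run key, not Anonyx's implementation: `make_run_hasher` is a hypothetical helper, and a real tool would also have to make sure the key is never persisted or logged.

```python
import hashlib
import hmac
import secrets


def make_run_hasher():
    """Return a hashing function bound to a fresh random key for one run."""
    key = secrets.token_bytes(32)  # ephemeral: never written anywhere in this sketch

    def hash_value(value: str) -> str:
        return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

    return hash_value


run_1 = make_run_hasher()
run_2 = make_run_hasher()

# Within a run the output is deterministic; across runs the key changes, and so does the output.
same_run_is_stable = run_1("alice@example.com") == run_1("alice@example.com")
across_runs_differ = run_1("alice@example.com") != run_2("alice@example.com")

# Dictionary attack on a low-cardinality column: trivial against an unkeyed hash.
candidates = ("F", "M")
precomputed = {hashlib.sha256(v.encode("utf-8")).hexdigest(): v for v in candidates}
unkeyed_recovered = precomputed.get(hashlib.sha256(b"F").hexdigest())
keyed_recovered = precomputed.get(run_1("F"))

print(same_run_is_stable, across_runs_differ, unkeyed_recovered, keyed_recovered)
```

The unkeyed SHA-256 of "F" is found in the attacker's precomputed table, while the keyed value is not: without the key, enumerating candidate values gives nothing to compare against.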

Quasi-identifiers and k-anonymity. PII detection flags quasi-identifiers (zip code, birth date, gender, occupation…). You set a k threshold: a dataset is k-anonymous when every combination of quasi-identifier values that appears in it is shared by at least k rows. The rows sharing one combination form an equivalence class. On every run, Anonyx computes the equivalence classes and applies your policy: report classes below the threshold, generalize them (age brackets, truncated zip codes, date ranges - without deleting rows), suppress them, or fail the run in strict mode.
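A minimal sketch of that check, assuming a tiny in-memory dataset and three quasi-identifier columns. The function names, the generalization rules (two-digit zip prefix, birth decade) and the sample rows are illustrative choices, not Anonyx's actual policy engine.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")

rows = [
    {"zip": "75011", "birth_year": 1984, "gender": "F"},
    {"zip": "75012", "birth_year": 1986, "gender": "F"},
    {"zip": "75011", "birth_year": 1990, "gender": "M"},
    {"zip": "75019", "birth_year": 1993, "gender": "M"},
    {"zip": "69003", "birth_year": 1984, "gender": "F"},
]


def qi_key(row):
    return tuple(row[column] for column in QUASI_IDENTIFIERS)


def equivalence_classes(rows):
    """Size of each group of rows sharing the same quasi-identifier values."""
    return Counter(qi_key(row) for row in rows)


def achieved_k(rows):
    """Smallest equivalence class size, i.e. the k the dataset actually reaches."""
    return min(equivalence_classes(rows).values(), default=0)


def classes_below(rows, k):
    """'Report' policy: classes smaller than the threshold."""
    return {key: n for key, n in equivalence_classes(rows).items() if n < k}


def generalize(row):
    """'Generalize' policy: truncate the zip code and bucket the birth year by decade."""
    decade = row["birth_year"] // 10 * 10
    return {**row, "zip": row["zip"][:2] + "***", "birth_year": f"{decade}-{decade + 9}"}


def suppress(rows, k):
    """'Suppress' policy: drop rows that still sit in a class below the threshold."""
    small = classes_below(rows, k)
    return [row for row in rows if qi_key(row) not in small]


generalized = [generalize(row) for row in rows]
suppressed = suppress(generalized, k=2)

print(achieved_k(rows), achieved_k(generalized), achieved_k(suppressed))
print(classes_below(generalized, k=2))
```

In this sample every raw row is unique. Generalization merges four rows into two classes of two but leaves the single row from the 69 area alone in its class, so k = 2 is only reached once that row is suppressed. A strict mode would raise an error instead of suppressing.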

A deliberate trade-off. Anonyx preserves referential integrity: the same source value receives the same anonymized value everywhere it appears, so your joins and foreign keys survive. That consistent mapping maintains, by construction, a form of internal linkability within the dataset - it is the price of usable test data, and it is documented rather than hidden.
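The trade-off can be shown on two small tables. This sketch assumes a keyed hash with one key for the whole run; the table layout and the `pseudonymize` and `totals_by_plan` helpers are hypothetical.

```python
import hashlib
import hmac
import secrets

run_key = secrets.token_bytes(32)  # one key for the whole run (assumption of this sketch)


def pseudonymize(value: str) -> str:
    """Same input, same output, for as long as the run key is the same."""
    return hmac.new(run_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


customers = [
    {"email": "alice@example.com", "plan": "pro"},
    {"email": "bruno@example.com", "plan": "free"},
]
orders = [
    {"customer_email": "alice@example.com", "amount": 40},
    {"customer_email": "alice@example.com", "amount": 15},
    {"customer_email": "bruno@example.com", "amount": 9},
]

anon_customers = [{**c, "email": pseudonymize(c["email"])} for c in customers]
anon_orders = [{**o, "customer_email": pseudonymize(o["customer_email"])} for o in orders]


def totals_by_plan(customers, orders):
    """Join orders to customers on the email key and sum amounts per plan."""
    plan_of = {c["email"]: c["plan"] for c in customers}
    totals = {}
    for order in orders:
        plan = plan_of[order["customer_email"]]
        totals[plan] = totals.get(plan, 0) + order["amount"]
    return totals


before = totals_by_plan(customers, orders)
after = totals_by_plan(anon_customers, anon_orders)
print(before, after)
```

The join gives the same totals before and after, which is what keeps the test data usable. The flip side is visible in `anon_orders`: the first two orders carry the same pseudonym, so anyone holding the dataset can still tell they belong to one person.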

How to qualify your dataset

Whether a dataset qualifies as anonymized or pseudonymized is not something you declare in a marketing brochure: it is assessed dataset by dataset, ruleset by ruleset. That is why every Anonyx run produces a re-identification risk report: the k threshold targeted and the k actually reached, the quasi-identifiers handled, the rows generalized or suppressed, and the strategy applied column by column.

That report is the evidence you bring to your DPO or data controller so they can decide. If k-anonymity is satisfied, you can document how the singling-out criterion has been neutralized; linkability and inference still have to be assessed separately. If it is not, you know exactly which columns to adjust - or you settle for robust pseudonymization, which remains a measure recognized by articles 25 and 32.

The details of strategies and controls are described on the features page.

Measure the re-identification risk of your test data

Free plan for individual developers. Quasi-identifier detection, k-anonymity and risk report included.