Record linkage and data matching

DataSupplier · 13 min read

Linking records across datasets that share no common key is one of the most useful and most error-prone tasks in data work. This guide covers the main record linkage methods and how to apply them well, including when privacy is at stake.

What record linkage is

Record linkage joins records that refer to the same entity across datasets that lack a common identifier. It underpins enrichment, deduplication and combining external sources.

Deterministic vs probabilistic

Deterministic linkage matches on exact agreement of chosen fields. It is simple and precise but brittle: a typo or a formatting difference breaks the match. Probabilistic linkage scores the likelihood that two records match across multiple fields, which handles messy data at the cost of tuning and some uncertainty.
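As a minimal sketch of the difference, the Python below compares two hypothetical records that differ by one letter. The field names, the sample records and the use of an averaged string similarity are all assumptions for illustration; the similarity score is a crude stand-in for a probabilistic score, not a full probabilistic model.

```python
from difflib import SequenceMatcher


def normalise(value: str) -> str:
    """Lower-case and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(value.lower().split())


def deterministic_match(a: dict, b: dict, fields: tuple) -> bool:
    """Match only if every chosen field agrees exactly after normalisation."""
    return all(normalise(a[f]) == normalise(b[f]) for f in fields)


def similarity_score(a: dict, b: dict, fields: tuple) -> float:
    """Average string similarity (0 to 1) across the chosen fields."""
    ratios = [
        SequenceMatcher(None, normalise(a[f]), normalise(b[f])).ratio()
        for f in fields
    ]
    return sum(ratios) / len(ratios)


# Hypothetical records: the same person, with a spelling variation in the name.
rec_a = {"name": "Jon Smith", "city": "Leeds"}
rec_b = {"name": "John Smith", "city": "Leeds"}
FIELDS = ("name", "city")

exact = deterministic_match(rec_a, rec_b, FIELDS)  # the exact rule rejects this pair
score = similarity_score(rec_a, rec_b, FIELDS)     # the score stays close to 1
print(exact, round(score, 3))
```

The exact rule rejects the pair because of a single character, while the score remains high enough for a sensibly chosen threshold to accept it.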

Blocking and scale

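Comparing every record against every other is infeasible at scale, because the number of pairs grows quadratically with the number of records. Blocking groups likely candidates first so that only records within the same block are compared. Good blocking makes linkage tractable while losing few true matches.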
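To make the saving concrete, here is a small sketch of blocking for deduplication within one dataset. The records and the choice of postcode as the blocking key are hypothetical; a real key would be chosen after profiling the data.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records.
records = [
    {"id": 1, "name": "Jon Smith", "postcode": "LS1 4AP"},
    {"id": 2, "name": "John Smith", "postcode": "LS1 4AP"},
    {"id": 3, "name": "Ann Lee", "postcode": "M1 2AB"},
    {"id": 4, "name": "Anne Lee", "postcode": "M1 2AB"},
    {"id": 5, "name": "Raj Patel", "postcode": "B2 5XY"},
]


def blocking_key(record: dict) -> str:
    """Assumed blocking key: the postcode, upper-cased with spaces removed."""
    return record["postcode"].replace(" ", "").upper()


def candidate_pairs(recs: list) -> list:
    """Group record ids by blocking key and pair up ids only within each block."""
    blocks = defaultdict(list)
    for rec in recs:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return pairs


all_pairs = len(records) * (len(records) - 1) // 2  # every record against every other
blocked = candidate_pairs(records)
print(all_pairs, len(blocked), blocked)
```

The trade-off is visible in the key itself: a mistyped postcode would place a true match in a different block, so in practice several blocking passes with different keys are often combined.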

Evaluation

Linkage quality is measured by precision (the share of declared matches that are correct) and recall (the share of true matches that are found). The right threshold depends on the cost of false matches versus missed ones for the use case.
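Both measures can be computed from a labelled sample of pairs. The sketch below assumes matches are represented as pairs of record ids, and the sample values are invented for illustration.

```python
def precision_recall(predicted: set, truth: set) -> tuple:
    """Return (precision, recall) for a set of predicted match pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical labelled sample: four true matches, three predicted.
truth = {(1, 2), (3, 4), (5, 6), (7, 8)}
predicted = {(1, 2), (3, 4), (5, 9)}

p, r = precision_recall(predicted, truth)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the three predicted pairs are correct, and two of the four true pairs were found, so a stricter threshold would likely raise the first figure and lower the second.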

Privacy-preserving linkage

When linking personal data across parties, privacy-preserving record linkage techniques such as hashing allow matching without exposing identities, which is important under the GDPR.

In a managed model

A managed partner can perform linkage across sourced datasets, tuned to the use case and with privacy-preserving methods where personal data is involved, and document the approach.

Blocking, scoring and thresholds

At scale, comparing every record to every other is infeasible, so blocking groups plausible candidates first (by postcode, name prefix or similar) to make the problem tractable without discarding true matches. Within blocks, fields are compared and scored, and a threshold decides which pairs count as matches. Setting that threshold is a business decision: a higher bar reduces false matches but misses some true ones; a lower bar does the reverse. The right point depends on whether a false match or a missed match is more costly for the use case.
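One common way to score field comparisons is the Fellegi-Sunter approach listed in the sources, where each field contributes a weight based on how often it agrees among true matches (m) and among non-matches (u). The sketch below is illustrative only: the m and u values, field names and thresholds are assumptions, and in practice the probabilities would be estimated from the data.

```python
import math

# Assumed per-field probabilities (hypothetical values):
#   m = P(field agrees | the records are a true match)
#   u = P(field agrees | the records are not a match)
PARAMS = {
    "surname": (0.95, 0.01),
    "postcode": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}


def match_weight(a: dict, b: dict) -> float:
    """Sum log2 likelihood ratios: positive for agreement, negative for disagreement."""
    total = 0.0
    for field, (m, u) in PARAMS.items():
        if a[field] == b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def is_match(weight: float, threshold: float) -> bool:
    """Declare a match when the total weight reaches the chosen threshold."""
    return weight >= threshold


# Hypothetical records within one block.
a = {"surname": "smith", "postcode": "LS14AP", "birth_year": 1980}
b = {"surname": "smith", "postcode": "LS14AP", "birth_year": 1981}
c = {"surname": "smyth", "postcode": "M12AB", "birth_year": 1980}

w_ab = match_weight(a, b)  # two fields agree, one disagrees
w_ac = match_weight(a, c)  # one field agrees, two disagree
for threshold in (4.0, 6.0):
    print(threshold, is_match(w_ab, threshold), is_match(w_ac, threshold))
```

With these assumed values, the pair that disagrees only on birth year is accepted at the lower threshold and rejected at the higher one, which is the precision versus recall trade-off expressed as a single number.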

Privacy-preserving linkage

When records must be linked across organisations without exposing identities, privacy-preserving record linkage, using hashing, Bloom filters or secure protocols, lets parties find common entities without revealing the underlying personal data. This is increasingly important under the GDPR when enriching or matching data held by different controllers, and it is a core technique behind data clean rooms.
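The Bloom filter idea can be sketched as follows: each party splits a value into character bigrams, hashes every bigram with a shared secret key into positions in a fixed-length bit array, and only the bit patterns are compared. Everything here is an assumption for illustration (filter length, number of hashes, the keyed-hash construction and the placeholder secret), and it is a teaching sketch rather than a security-reviewed protocol.

```python
import hashlib
import hmac

BITS = 256       # filter length (illustrative value)
NUM_HASHES = 4   # hash functions applied to each bigram (illustrative value)


def bigrams(value: str) -> set:
    """Split a padded, lower-cased string into overlapping two-character tokens."""
    padded = f"_{value.lower().strip()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(value: str, secret: bytes) -> set:
    """Return the set of bit positions that would be set in the Bloom filter."""
    positions = set()
    for gram in bigrams(value):
        for k in range(NUM_HASHES):
            digest = hmac.new(secret, f"{k}:{gram}".encode(), hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % BITS)
    return positions


def dice(x: set, y: set) -> float:
    """Dice coefficient of two bit-position sets: 1.0 means identical filters."""
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


# Placeholder key; in practice the parties would agree a secret out of band.
SECRET = b"shared-secret-agreed-out-of-band"

close = dice(encode("Jon Smith", SECRET), encode("John Smith", SECRET))
far = dice(encode("Jon Smith", SECRET), encode("Maria Garcia", SECRET))
print(round(close, 2), round(far, 2))
```

Because similar strings share most of their bigrams, their filters overlap heavily, so approximate matching still works on the encoded values. Plain hashing of a whole field, by contrast, only supports exact matching.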

Key takeaways

Record linkage joins records without a shared key.
Deterministic is precise but brittle; probabilistic handles messy data.
Blocking makes linkage scale; evaluate with precision and recall.
Privacy-preserving linkage matches personal data without exposing identities.

Sources & further reading

Academic literature on record linkage (Fellegi-Sunter model).
ENISA and EDPB: privacy-preserving techniques.
DAMA-DMBOK: data matching.
EUR-Lex: Regulation (EU) 2016/679 (GDPR).
