Record linkage
Record linkage is the task of finding entries that refer to the same entity in two or more files. The idea goes back to Halbert L. Dunn ("Record Linkage", American Journal of Public Health, Vol. 36 (1946), pp. 1412-1416). In the 1950s, Howard Borden Newcombe laid the probabilistic foundations of modern record linkage theory.
In 1969, Fellegi and Sunter formalized these ideas. Their pioneering paper "A Theory for Record Linkage" remains the standard mathematical foundation for record linkage applications.
Mathematical Model
In an application with two files, A and B, denote the rows (records) by \alpha(a) in file A and \beta(b) in file B. Each record carries K characteristics. The set of record pairs that represent identical entities is defined by
M = \left\{ (a,b);\ a = b;\ a \in A;\ b \in B \right\}
and the complement of M, the set U of record pairs that represent different entities, is defined as
U = \left\{ (a,b);\ a \neq b;\ a \in A;\ b \in B \right\}.
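The partition of all record pairs into matched and unmatched sets can be illustrated with a minimal Python sketch. The two toy files, the field names, and the `entity_id` key are assumptions made for illustration only; in a real linkage problem the true entity behind a record is unknown, which is exactly why the two sets have to be inferred.

```python
from itertools import product

# Hypothetical toy files. "entity_id" is an assumed ground-truth label
# that real linkage problems do not have.
file_a = [
    {"entity_id": 1, "sex": "f", "age": 34},
    {"entity_id": 2, "sex": "m", "age": 51},
]
file_b = [
    {"entity_id": 2, "sex": "m", "age": 51},
    {"entity_id": 3, "sex": "f", "age": 34},
    {"entity_id": 1, "sex": "f", "age": 35},
]


def partition_pairs(file_a, file_b):
    """Split the cross product A x B into matched and unmatched index pairs."""
    matched, unmatched = set(), set()
    for (i, rec_a), (j, rec_b) in product(enumerate(file_a), enumerate(file_b)):
        if rec_a["entity_id"] == rec_b["entity_id"]:
            matched.add((i, j))    # same underlying entity
        else:
            unmatched.add((i, j))  # different entities
    return matched, unmatched


M, U = partition_pairs(file_a, file_b)
print(sorted(M))
print(len(U))
```

Every pair of the cross product lands in exactly one of the two sets, so their sizes always add up to the product of the two file sizes.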
A vector \gamma, containing the coded agreements and disagreements on each characteristic, is defined as
\gamma \left[ \alpha(a), \beta(b) \right] = \left\{ \gamma^{1} \left[ \alpha(a), \beta(b) \right], \ldots, \gamma^{K} \left[ \alpha(a), \beta(b) \right] \right\}
where the superscripts 1, ..., K index the characteristics (sex, age, marital status, etc.) in the files. The conditional probabilities of observing a specific vector \gamma given (a, b) \in M and given (a, b) \in U are defined as
m(\gamma) = P \left\{ \gamma \left[ \alpha(a), \beta(b) \right] \mid (a,b) \in M \right\} = \sum_{(a,b) \in M} P \left\{ \gamma \left[ \alpha(a), \beta(b) \right] \right\} \cdot P \left[ (a,b) \mid M \right]
and
u(\gamma) = P \left\{ \gamma \left[ \alpha(a), \beta(b) \right] \mid (a,b) \in U \right\} = \sum_{(a,b) \in U} P \left\{ \gamma \left[ \alpha(a), \beta(b) \right] \right\} \cdot P \left[ (a,b) \mid U \right],
respectively.
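As a rough illustration of the comparison vector and the two conditional probabilities, the following Python sketch codes each characteristic as 1 (agreement) or 0 (disagreement) and estimates m and u as relative frequencies. It rests on several assumptions that are not part of the model above: the toy records, the field names, the hypothetical `entity_id` ground-truth key, exact-equality comparison, and equal weight for every pair within the matched and the unmatched set.

```python
from collections import Counter
from itertools import product

FIELDS = ("sex", "age")  # the K = 2 characteristics of this toy example

# Hypothetical toy files. "entity_id" is an assumed ground-truth label.
file_a = [
    {"entity_id": 1, "sex": "f", "age": 34},
    {"entity_id": 2, "sex": "m", "age": 51},
]
file_b = [
    {"entity_id": 2, "sex": "m", "age": 51},
    {"entity_id": 3, "sex": "f", "age": 34},
    {"entity_id": 1, "sex": "f", "age": 35},
]


def compare(rec_a, rec_b, fields=FIELDS):
    """Comparison vector: 1 where a field agrees, 0 where it disagrees."""
    return tuple(int(rec_a[f] == rec_b[f]) for f in fields)


def conditional_frequencies(file_a, file_b):
    """Relative frequency of each vector among matched and unmatched pairs.

    Assumes both sets are non-empty and that every pair within a set
    is equally likely.
    """
    m_counts, u_counts = Counter(), Counter()
    for rec_a, rec_b in product(file_a, file_b):
        gamma = compare(rec_a, rec_b)
        if rec_a["entity_id"] == rec_b["entity_id"]:
            m_counts[gamma] += 1
        else:
            u_counts[gamma] += 1
    m_total = sum(m_counts.values())
    u_total = sum(u_counts.values())
    m = {g: c / m_total for g, c in m_counts.items()}
    u = {g: c / u_total for g, c in u_counts.items()}
    return m, u


m, u = conditional_frequencies(file_a, file_b)
print(m)
print(u)
```

In practice the true match status is not observed, so these probabilities have to be estimated by other means; the sketch only shows what the two quantities measure.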
