Deduplication
The GraphLab Create deduplication tool ingests data from one or more SFrames and assigns an entity label to each row. Records with the same label likely correspond to the same real-world entity.
To illustrate usage of the deduplication tool, we use data about musical albums, originally downloaded from http://hpi.de/de/naumann/projects/data-quality-and-cleansing/annealing-standard.html#c123255. For this example, we have extracted a random sample of about 20% of the original data and split it into four SFrames based on genre. The preprocessed data can be downloaded (and saved to your machine) from the Dato public datasets bucket with the following code. The download is about 7MB.
import os
import graphlab as gl
import graphlab.aggregate as agg
genres = ['rock', 'americana', 'classical', 'misc']
data = {}
for g in genres:
    if os.path.exists('{}_albums'.format(g)):
        print "Loading genre '{}' from local SFrame....".format(g)
        data[g] = gl.load_sframe('{}_albums'.format(g))
    else:
        print "Downloading genre '{}' from S3 bucket....".format(g)
        data[g] = gl.load_sframe(
            'http://s3.amazonaws.com/dato-datasets/dedupe_albums/{}_albums'.format(g))
        data[g].save('{}_albums'.format(g))
As usual, our first step is to look at the data:
data['rock'].print_rows(5)
+---------+--------------+-----------------+-------------------------------+
| disc_id | freedbdiscid | artist_name | disc_title |
+---------+--------------+-----------------+-------------------------------+
| 166 | 1075217 | Various | Mega Hits'80 S-03 |
| 719 | 33670401 | Dead Can Dance | Sambatiki |
| 829 | 34061313 | Alice Cooper | Anselmo Valencia Amphithea... |
| 1810 | 51495699 | Kasabian | Kasabian-Ulimate Version- |
| 2013 | 68222994 | Various Artists | Fear Candy14 |
+---------+--------------+-----------------+-------------------------------+
+-------------+---------------+-------------+--------------+---------------+
| genre_title | disc_released | disc_tracks | disc_seconds | disc_language |
+-------------+---------------+-------------+--------------+---------------+
| Rock | 1994 | 17 | 4200 | eng |
| Alternative | 1999 | 1 | 453 | |
| Hard Rock | 2003 | 1 | 1980 | |
| Rock | 2005 | 19 | 4547 | |
| Metal | 2005 | 18 | 4352 | eng |
| ... | ... | ... | ... | ... |
+-------------+---------------+-------------+--------------+---------------+
[23202 rows x 9 columns]
For this example, we define a distance that is a weighted sum of Euclidean distance on the album length (in seconds), weighted Jaccard distance on the artist name and album title, and Levenshtein distance on the genre. Note that if you pass a composite distance to the deduplication toolkit, there is no need to specify the 'features' parameter; the composite distance already defines the relevant features.
album_dist = [[('disc_seconds',), 'euclidean', 1],
[('artist_name', 'disc_title'), 'weighted_jaccard', 4],
[('genre_title',), 'levenshtein', 1]]
Grouping the data
In any dataset larger than about 10,000 records, grouping the records into smaller blocks (also known as "blocking") is critical to avoid computing the distance between all pairs of records.
In this example, we group on the number of tracks on each album ("disc_tracks"), which means the toolkit first splits the data into groups whose members all have the same number of tracks, then looks for approximate matches only within each group. As of GraphLab Create v1.4, grouping features are specified with the grouping_features parameter; previous versions used the standard distance "exact" for this purpose, but that option is no longer supported.
Feature engineering
In the nearest_neighbors toolkit, the weighted_jaccard distance applies only to dictionary-type features, but in our album_dist we indicated that we want to apply it to two string-type features ('artist_name' and 'disc_title'). The deduplication toolkit does several feature engineering steps automatically, so you have less work to do manipulating the data and defining the composite distance. In particular:
- String features are cleaned by removing punctuation and extra white space, and converting all characters to lower case.
- Strings are converted to dictionaries with 3-character shingling when used with dictionary-based distances (jaccard, weighted_jaccard, cosine, and dot_product); see the sketch below.
- String features specified for a single distance component are concatenated, separated by a space.
- Missing values are imputed. Missing strings are imputed to be "", while missing numeric values are imputed to be the mean value for the appropriate feature within the exact match group (see previous section). Note that records with missing values in the "exact" match features are ignored in model training and assigned entity label "None".
The feature engineering that occurs within the deduplication toolkit does not alter the input data in any way.
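To make these string transformations concrete, here is a minimal, illustrative sketch in plain Python of the cleaning, concatenation, and 3-character shingling steps. This is not the toolkit's internal code; the toolkit applies equivalent transformations for you, and the example record is made up.
import string

def clean(s):
    # lower-case, drop punctuation, collapse extra white space
    s = ''.join(c for c in s.lower() if c not in string.punctuation)
    return ' '.join(s.split())

def shingle(s, n=3):
    # counts of overlapping n-character substrings
    counts = {}
    for i in range(len(s) - n + 1):
        counts[s[i:i + n]] = counts.get(s[i:i + n], 0) + 1
    return counts

# the two string features in one distance component are concatenated with a
# space, then shingled into a dictionary suitable for weighted_jaccard
record = {'artist_name': 'Cheap Trick', 'disc_title': 'Greatest Hits!'}
combined = clean(record['artist_name']) + ' ' + clean(record['disc_title'])
print shingle(combined)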
Choosing a model
Currently the deduplication toolkit has only one model: "nearest_neighbor_deduplication", which labels a pair of records as duplicates if one record is a close neighbor of the other. To resolve the question of transitive closure (A and B are duplicates, B and C are duplicates, but A and C are not), this model constructs a similarity graph and finds its connected components. Each connected component corresponds to an entity in the final output.
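To see how connected components resolve this, here is a small, hypothetical sketch using GraphLab Create's SGraph and connected_components toolkit. The record IDs and matched pairs are made up; the deduplication model constructs and solves this graph internally, so you do not need to do this yourself.
# hypothetical near-duplicate pairs from the nearest neighbors search:
# A-B and B-C are close matches, A-C is not
matches = gl.SFrame({'src': ['A', 'B'], 'dst': ['B', 'C']})
g = gl.SGraph().add_edges(matches, src_field='src', dst_field='dst')

# each connected component becomes one entity, so A, B, and C share a label
cc = gl.connected_components.create(g)
print cc['component_id']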
In addition to the data and the distance function, the nearest neighbor deduplication model takes two parameters. If k is specified, only the closest k neighbors of a record are considered duplicates, while the radius parameter indicates the maximum distance a neighbor can be from a record and still be considered a duplicate. The most typical usage leaves k unspecified and uses a radius that makes sense for the problem and the distance function.
m = gl.nearest_neighbor_deduplication.create(data, row_label='disc_id',
    grouping_features=['disc_tracks'],
    distance=album_dist,
    k=None, radius=3)
If two datasets are known to have records that match one-to-one, then setting k=2 can be very useful. The k parameter is also useful to get preliminary results when we have no prior intuition about the problem or the distance function. In addition, the top-level deduplication create method hides these parameters, in the expectation that future versions of the toolkit will automatically choose the best modeling solution.
m2 = gl.deduplication.create(data, row_label='disc_id',
    grouping_features=['disc_tracks'],
    distance=album_dist)
Returning to our original model, the usual GraphLab Create toolkit functions give us information about model training and access to the output entity labels.
m.summary()
Class : NearestNeighborDeduplication
Schema
------
Number of input datasets : 4
Number of feature columns : 4
Number of neighbors per point (k) : None
Max distance to a neighbor (radius) : 3
Number of entities : 129632
Total training time (seconds) : 268.9886
Training
--------
Total training time (seconds) : 268.9886
Accessible fields :
entities : Consolidated input records plus entity labels.
The model's entities attribute contains the deduplication results. All input data rows are appended into a single SFrame, and the column __entity indicates which records correspond to the same entity. Because we specified the datasets in a dictionary, the keys of that dictionary are used to identify which SFrame each record comes from; if the data were passed as a list, this would be the index in the list.
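A quick way to eyeball the results before any aggregation is to sort the consolidated output by the entity label, so that records predicted to refer to the same album appear next to each other:
entities = m['entities']
entities.sort('__entity').print_rows(10)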
Aggregating records
The entity labels are not very interesting by themselves; the deduplication problem typically involves a final step of aggregating records to produce a final clean dataset with one record per entity. This can be done straightforwardly with the SFrame groupby-aggregate tool.
In this example we define an aggregator that returns the number of records belonging to each entity, the mean album length in seconds, the shortest album title, and lists of all constituent genres and artist names.
# add disc title length in number of characters
entities = m['entities']
entities['title_length'] = entities['disc_title'].apply(lambda x: len(x))
# define the aggregation scheme and do the aggregation
album_aggregator = {
    'num_records': agg.COUNT,
    'disc_seconds': agg.MEAN('disc_seconds'),
    'genres': agg.CONCAT('genre_title'),
    'artist_name': agg.CONCAT('artist_name'),
    'title': agg.ARGMIN('title_length', 'disc_title')
}
sf_clean = entities.groupby('__entity', album_aggregator)
# find the dupe IDs
entity_counts = m['entities'].groupby('__entity', agg.COUNT)
dupe_entities = entity_counts[entity_counts['Count'] > 1]['__entity']
# print the results for the dupes
dupes = sf_clean.filter_by(dupe_entities, '__entity')
dupes.print_rows(10, max_row_width=100, max_column_width=50)
+----------+------------------------------------+-------------+--------------+
| __entity | genres | num_records | disc_seconds |
+----------+------------------------------------+-------------+--------------+
| 76316 | [Rock, Rock] | 2 | 3154.5 |
| 16727 | [Speech, Speech] | 2 | 4770.0 |
| 73607 | [Rock, Rock] | 2 | 3498.0 |
| 17856 | [Jazz, Jazz] | 2 | 2817.0 |
| 25833 | [Synthpop, Synthpop] | 2 | 4202.0 |
| 16473 | [Country, Country] | 2 | 2055.0 |
| 76742 | [Native American, Native American] | 2 | 4217.0 |
| 80763 | [Country, Country] | 2 | 4520.5 |
| 26116 | [Rock, Rock] | 2 | 4640.0 |
| 43964 | [Jazz, Jazz] | 2 | 3357.0 |
+----------+------------------------------------+-------------+--------------+
+------------------------------------------+-------------------------------------------+
| artist_name | title |
+------------------------------------------+-------------------------------------------+
| [Cheap Trick, Cheap Trick] | Greatest Hits |
| [Neville Jason, Neville Jason] | The Lives Of The Great Artists |
| [Monster Magnet, Monster Magnet] | 4-Way Diablo |
| [Patricia Barber, Patricia Barber] | Split |
| [Various, Various] | Atmospheric Synthesizer Spectacular Vol.2 |
| [Various, Various] | The Power Of Country |
| [Paul Ortega, A. Paul Ortega-Two Worlds] | Three Worlds |
| [Marvin Rainwater, Marvin Rainwater] | Classic Recordings Disc1 |
| [Various, Various] | Rules Of Rock |
| [Woong San, Woomg San] | Close Your Eyes |
+------------------------------------------+-------------------------------------------+
[321 rows x 6 columns]