As a data scientist, you extract data from many sources: public APIs, scraped web pages, or direct requests. Data in the real world is often messy, and dealing with messy datasets can be painfully frustrating and time-consuming.
We can use fuzzy string matching to deal with the challenges of such datasets. First, let’s define fuzzy matching, also known as approximate string matching. It is the process of calculating the similarity between strings, based on metrics like Levenshtein distance, Jaccard distance, and cosine similarity.
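To make the first of those metrics concrete, here is a minimal sketch of Levenshtein (edit) distance using only the standard library; the function name is our own choice, not part of any particular package.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insertions,
    deletions, substitutions) needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner loop over the shorter string

    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("yogurt", "yoghurt"))   # 1 (insert 'h')
print(levenshtein("kitten", "sitting"))   # 3
```

In practice you would reach for an optimized library rather than a hand-rolled loop, but the dynamic-programming core is the same.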
Why Fuzzy Matching is Important
Most businesses, over 90% by some estimates, have duplicate data. Most of these duplicates are non-exact matches, which means they are not detected automatically. Using fuzzy matching and related techniques, it’s possible to uncover typos, incomplete information, and other kinds of non-exact matches. The technical terms are deduplication (merging similar entries, such as ‘yogurt’ and ‘yoghurt’, into one entity) and record linkage (joining a record for ‘Tesla’ with the URL of its Wikipedia page).
While the applications of both data deduplication and record linkage vary widely, they are two sides of the same coin. This is because the underlying technology used to identify matching records for both is the same. Where might we use fuzzy matching? Here are just a few areas:
- Identifying duplicate customer records
- Fraud analytics
- Price comparisons
- Cleaning messy data
- Integrating data sources
- Finding cross-sell and upsell opportunities
- Data enrichment and reference matching
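As a sketch of the deduplication use case, the standard library’s difflib can flag near-duplicate entries; the product names and the 0.8 similarity cutoff below are illustrative choices, not recommendations.

```python
import difflib

# Toy list of entries containing one near-duplicate pair.
names = ["yogurt", "yoghurt", "milk", "whole milk", "cheese"]

canonical = []  # one representative per group of duplicates
for name in names:
    # Compare against representatives seen so far; a match above
    # the cutoff means this entry is a near-duplicate.
    match = difflib.get_close_matches(name, canonical, n=1, cutoff=0.8)
    if match:
        print(f"{name!r} looks like a duplicate of {match[0]!r}")
    else:
        canonical.append(name)

print(canonical)  # ['yogurt', 'milk', 'whole milk', 'cheese']
```

Note that this loop compares each record against the representatives found so far, which already hints at the scaling problem discussed next.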
There are many ways to perform fuzzy matching, but their performance degrades sharply as the size of the dataset increases. The reason is that each record is compared against every other record in the dataset, so the number of comparisons, and hence the running time, grows quadratically with the number of records; the naive approach runs in quadratic time.
The performance of fuzzy matching is also dependent on the length of the strings being compared. Therefore, we must think about scalability when using fuzzy matching. We can use TF-IDF (Term Frequency, Inverse Document Frequency) to deal with such problems. It filters useful words from the noise present in the text. This method generates features from the text by multiplying a term’s frequency in a document by a measure of how rare that term is across the entire corpus.
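Here is a minimal TF-IDF sketch using only the standard library; the exact weighting chosen (raw term frequency times log of N over document frequency) is one common variant among several, and the example documents are made up.

```python
import math
from collections import Counter

docs = [
    "cheap flights to london",
    "cheap hotels in london",
    "machine learning with python",
]

tokenized = [d.split() for d in docs]
N = len(tokenized)
# Document frequency: in how many documents each term appears.
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf(doc):
    """Map each term to (term frequency) * log(N / document frequency)."""
    tf = Counter(doc)
    return {t: (count / len(doc)) * math.log(N / df[t])
            for t, count in tf.items()}

vec = tfidf(tokenized[0])
# 'cheap' and 'london' appear in two of the three documents, so they
# score lower than 'flights', which appears in only one.
print(vec)
```

Terms that appear everywhere get weights near zero, which is exactly the "filtering noise" behavior described above.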
TF-IDF is a very useful algorithm in text classification and text clustering, as it can be used to transform documents into numeric vectors that can easily be compared. You can also use n-grams to deal with the scalability issue. For further reading on fuzzy matching, TF-IDF, or n-grams, please see the links in the references section below.
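To sketch how n-grams and vector comparison fit together: breaking strings into character n-grams and comparing the resulting count vectors with cosine similarity makes near-duplicate strings score close to 1. The helper names and the choice of 3-grams below are our own illustrative assumptions.

```python
import math
from collections import Counter

def ngrams(text, n=3):
    """Split a string into overlapping character n-grams."""
    text = f" {text.lower()} "  # pad so edges form n-grams too
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

v1 = Counter(ngrams("Tesla Motors Inc"))
v2 = Counter(ngrams("Tesla Motors, Inc."))
print(cosine(v1, v2))  # close to 1: the strings share most 3-grams
```

Because n-gram vectors can be indexed and compared with sparse matrix multiplication, this representation is what lets TF-IDF-style matching scale well beyond pairwise edit-distance comparisons.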
Technical articles are published from the Absolutdata Labs group, and hail from The Absolutdata Data Science Center of Excellence. These articles also appear in BrainWave, Absolutdata’s quarterly data science digest.