We’ve all taken well to having an online presence, but that has in no way diminished the importance of our physical addresses. In fact, when we interact with a company, we are often authenticated through our name and address. We give this information to an agent on a call, who types it in and stores it, or we enter it ourselves through the company’s chat platform. Just as there are numerous ways to capture an address, there are many ways to represent the same address.
The challenge companies face is matching the details a customer provides with the details already in their database, especially when the two differ. Companies want to validate these details as accurately as possible with minimal manual intervention. When details are matched incorrectly, the company’s reputation can suffer badly. In fact, 94% of businesses reported that address errors impacted customer loyalty and performance. Over the next few sections, we’ll delve deeper into the problems mismatched addresses can create and ways of addressing them (excuse the pun).
Importance of Address Matching
Imagine you own a logistics company and hold the name and address details of your customers. A slight variation in those details can leave you with multiple records for a single customer. Missing address details can delay deliveries, and incorrect ones can send a delivery to the wrong address. All of this can be very detrimental to your business.
This does not apply only to logistics. Practically every business needs to maintain addresses: they are required for generating invoices, contacting customers in an emergency, improving the overall customer experience, and customizing marketing or communication strategy.
There are multiple advantages when you have accurate address details:
- Cost Reduction: As in the case of a logistics company, a failed or incorrect delivery can cost the company a significant amount of money.
- Efficient and timely delivery: Accurate addresses support better supply-chain management, route optimization and so on.
- Customer Satisfaction: Customers love companies that deliver efficiently and within deadlines. And the better you know your customers, the better the experience you can offer them.
- Avoiding Customer Churn: Customers develop a preference for, and loyalty to, companies that do not make errors.
- Additional Analysis for company strategy: Many business processes run on the historical data of customers, and improvements and next steps are decided based on this history. The prediction algorithms these companies use can only be as accurate as the data they are supplied with.
Challenges in Name/Address Matching
Name and address matching is not a new problem. The data is generally noisy and contains a lot of information, and the quality of labeled data is a further challenge when training models. The first step is to find out which kinds of variation are present in the data and then proceed accordingly. Some of the variations we found in the datasets we analyzed are:
- Abbreviations – LTD vs. LIMITED
- Varying prefixes and suffixes – AGILENT TECHNOLOGIES vs. AGILENT TECHNOLOGIES INC
- Misspelled words – MICROSOFT vs. MICROSFT
- Truncated, extra or missing whitespaces – JOHNSON&JOHNSON vs. JOHNSON & JOHNSON
- Semantic similarity – COMPANY vs. ORGANIZATION
- Order of words – BLUE SHIELD BLUE CROSS vs. BLUE CROSS BLUE SHIELD
- Phonetic variations – KOHL’S vs. COLES
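Some of these variations (abbreviations, whitespace and word order) can be reduced with plain normalization before any fuzzy matching is applied. The following is a minimal sketch of that idea, not a description of the pipeline used on the datasets above. The `normalize` function and the small `ABBREVIATIONS` table are hypothetical names introduced here for illustration; a real table would be built from the data.

```python
import re

# Hypothetical abbreviation table (an assumption for illustration only).
ABBREVIATIONS = {"LTD": "LIMITED", "INC": "INCORPORATED", "CORP": "CORPORATION"}

def normalize(name: str, sort_tokens: bool = False) -> str:
    """Uppercase, isolate '&', drop punctuation, expand known abbreviations."""
    text = name.upper().replace("&", " & ")      # JOHNSON&JOHNSON -> JOHNSON & JOHNSON
    text = re.sub(r"[^A-Z0-9&\s]", "", text)     # remove punctuation such as . , '
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    if sort_tokens:                              # optional: ignore word order
        tokens = sorted(tokens)
    return " ".join(tokens)

print(normalize("JOHNSON&JOHNSON"))   # JOHNSON & JOHNSON
print(normalize("Acme Ltd."))         # ACME LIMITED
```

Sorting tokens makes word-order variants compare equal, at the cost of also equating names whose word order genuinely matters, so it is left as an opt-in flag. Misspellings, semantic variants and phonetic variants are untouched by this step.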
Many string-matching algorithms are available to address these issues. The most common approaches are:
- Edit-distance based: The most common approach to fuzzy matching is based on edit distance. The idea is to count the character changes (insertions, deletions, substitutions, or transpositions) required to convert one string into another. Metrics such as Levenshtein distance and Jaro-Winkler distance belong to this category. The Jaccard similarity index is often used alongside them, although strictly speaking it measures the overlap between two sets of tokens or character n-grams rather than an edit distance.
- Phonetic Similarity: Phonetic algorithms turn similar-sounding names into the same key, which makes such names easier to identify. The most popular is Soundex; related approaches include Metaphone and Double Metaphone.
- Statistical Method: A statistical approach takes numerous matching pairs and trains a model to recognize two similar strings based on a similarity score. It can give good results, depending on the training examples, but can be slow to execute. Collecting a sufficient number of matching pairs is another challenge.
- Word embedding methods: Organization names are often semantically similar but lexically (or phonetically) dissimilar, so edit-distance and phonetic methods will not work in these cases. Word embeddings are vector representations of a word’s semantic meaning: if two words are semantically similar, they lie close to each other in vector space.
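To make the first two approaches concrete, here is a small pure-Python sketch of Levenshtein distance and a simplified American Soundex. Both functions are illustrative implementations written for this post, not taken from any particular library. The Levenshtein version counts insertions, deletions and substitutions only, so a transposition costs two edits.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute (or match)
        previous = current
    return previous[-1]

# Soundex digit for each consonant group; vowels, H, W and Y have no digit.
SOUNDEX_CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in group:
        SOUNDEX_CODES[letter] = digit

def soundex(word: str) -> str:
    """First letter plus three digits, e.g. ROBERT -> R163."""
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    last = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != last:              # skip repeats of the same code
            code += digit
        if ch not in "HW":                       # H and W do not separate repeats
            last = digit
    return (code + "000")[:4]                    # pad or truncate to 4 characters

print(levenshtein("MICROSOFT", "MICROSFT"))      # 1
print(soundex("ROBERT"), soundex("RUPERT"))      # R163 R163
print(soundex("KOHL'S"), soundex("COLES"))       # K420 C420
```

The last line shows a limitation worth knowing: Soundex always keeps the first letter, so KOHL’S and COLES share the digits 420 but still receive different keys.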
When working on a real-world problem, we found that no single algorithm resolves all of these issues. An ensemble approach looks at the key challenges and then identifies the combination of algorithms that will help us achieve our goal.
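As a rough illustration of what combining algorithms can look like, the sketch below averages a character-level score with a word-level score. It is an assumption-laden toy, not the ensemble used in our work: the function names and the equal weights are hypothetical, and the character score uses the standard library's `difflib.SequenceMatcher` as a stand-in for an edit-distance metric.

```python
from difflib import SequenceMatcher

def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; tolerant of small misspellings."""
    return SequenceMatcher(None, a, b).ratio()

def token_jaccard(a: str, b: str) -> float:
    """Word-set overlap in [0, 1]; ignores word order."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def ensemble_score(a: str, b: str, weights=(0.5, 0.5)) -> float:
    """Weighted average of the two scores. The weights are placeholders."""
    a, b = a.upper(), b.upper()
    scores = (char_similarity(a, b), token_jaccard(a, b))
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

pairs = [
    ("AGILENT TECHNOLOGIES", "AGILENT TECHNOLOGIES INC"),
    ("BLUE SHIELD BLUE CROSS", "BLUE CROSS BLUE SHIELD"),
    ("AGILENT TECHNOLOGIES", "MICROSOFT"),
]
for left, right in pairs:
    print(f"{left!r} vs {right!r}: {ensemble_score(left, right):.2f}")
```

In practice the weights, and the choice of which scorers to include, would be tuned against labeled matching pairs rather than fixed by hand.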
This concludes part 1 of the “Correcting faulty customer addresses with Machine Learning” blog series. Take a look at part 2, where we talk about the solution Machine Learning offers in the sphere of address matching and the various Machine Learning approaches that can be used, such as supervised, unsupervised and active learning.