In part 1, we learned about the importance of address matching, its challenges, and the common algorithmic approaches to it. In part 2, we will talk about the importance of machine learning in address matching and the machine learning approaches that can be used.
Importance of Machine Learning in Address Matching
To err is human, especially when it comes to routine and mundane tasks. In a traditional setup, a massive amount of data requires a proportionately large workforce to process it manually. This is not efficient, scalable, or accurate.
That is why we try to teach machines to perform these tasks for us. Teaching machines comes with its own challenges, but it is still more advantageous than doing the work manually: a trained machine learning model processes records far faster than a person can, and it scales as the data grows. Over time, we enter the data flywheel cycle. The idea is that the more data we have, the better our model becomes, which in turn gets more users to use our model and brings in more data.
Machine Learning Approaches
We start by analyzing the data and the task that we want to perform, in this case address matching. Depending on whether labeled data is available and whether the task is framed as classification or clustering, the approaches can broadly be grouped into three categories.
- Supervised Learning:
Supervised learning, as the name suggests, is learning performed under supervision. In the case of machine learning, a model learns to predict the output from input-output pairs. The known output for a particular input is called a label, and labels are a necessary part of supervised learning. When organizations have a good amount of labeled data (for example, addresses from two or more sources that are also labeled with the extent to which they match), this address data can be used to train machine learning models to perform the labeling task automatically. To convert the address text into vectors, the approaches mentioned above can be used.
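To make the idea concrete, here is a minimal sketch of learning from labeled address pairs. It is only an illustration under stated assumptions: the sample addresses and the helper names `similarity`, `fit_threshold`, and `predict` are hypothetical, the standard library's `difflib.SequenceMatcher` stands in for whichever text-to-vector approach you prefer, and the "model" is just a single similarity cut-off learned from the labels.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two lightly normalized addresses."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def fit_threshold(pairs, labels):
    """Learn the similarity cut-off that classifies the most labeled pairs correctly."""
    scores = [similarity(a, b) for a, b in pairs]
    unique = sorted(set(scores))
    # Candidate cut-offs are the midpoints between neighbouring observed scores.
    candidates = [(lo + hi) / 2 for lo, hi in zip(unique, unique[1:])] or [0.5]

    def n_correct(threshold):
        return sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))

    return max(candidates, key=n_correct)


def predict(threshold: float, a: str, b: str) -> bool:
    """Predict whether two addresses refer to the same place."""
    return similarity(a, b) >= threshold


# Hypothetical labeled pairs: 1 = same address, 0 = different address.
train_pairs = [
    ("12 Main Street, Springfield", "12 Main St, Springfield"),
    ("45 Oak Avenue, Dayton", "45 Oak Ave, Dayton"),
    ("12 Main Street, Springfield", "98 Pine Road, Austin"),
    ("45 Oak Avenue, Dayton", "7 Lake View Drive, Reno"),
]
train_labels = [1, 1, 0, 0]

threshold = fit_threshold(train_pairs, train_labels)
print(predict(threshold, "12 Main Street, Springfield", "12 Main St., Springfield"))
```

A real system would likely replace the single cut-off with a classifier trained on several features (street, city, postal code similarity), but the shape of the problem stays the same: labeled pairs go in, a matching rule comes out.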
- Unsupervised Learning:
In this scenario, labeled data is unavailable and we rely on the algorithm to uncover the underlying distribution and patterns in the data. Here, there is no correct answer and no output that teaches the model. This approach is widely used for record linkage when the training data has no labels. Record linkage is the process of identifying records that point to the same real-world entity. In the case of address matching, we want to map all the variations of an address to a single address to avoid duplicate records in the database. This can be achieved through clustering, where each cluster represents one address and all the instances in that cluster are variations of it. Once these addresses are linked, they can be used to fill in missing information or correct incorrect information. The lack of labels makes this a challenging problem.
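The clustering idea can be sketched as follows. This is a minimal illustration, not a recommended production method: it assumes a simple single-linkage rule (two addresses are linked when their string similarity reaches a cut-off), and the `normalize` and `cluster_addresses` helpers, the 0.85 cut-off, and the sample addresses are all hypothetical choices made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(address: str) -> str:
    """Lowercase, drop commas and periods, and collapse whitespace."""
    return " ".join(address.lower().replace(",", " ").replace(".", " ").split())


def cluster_addresses(addresses, threshold=0.85):
    """Group addresses so that similar variants end up in the same cluster."""
    parent = list(range(len(addresses)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    cleaned = [normalize(a) for a in addresses]
    for i, j in combinations(range(len(addresses)), 2):
        if SequenceMatcher(None, cleaned[i], cleaned[j]).ratio() >= threshold:
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, address in enumerate(addresses):
        clusters.setdefault(find(i), []).append(address)
    return list(clusters.values())


addresses = [
    "12 Main Street, Springfield",
    "12 Main St. Springfield",
    "12 main street springfield",
    "98 Pine Road, Austin",
    "98 Pine Rd, Austin",
]
clusters = cluster_addresses(addresses)
for cluster in clusters:
    print(cluster)
```

No labels are used anywhere: the only signal is how similar the strings are to one another, which is why the choice of similarity measure and cut-off matters so much in this setting.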
- Active Learning:
Active learning, as described in one of my previous blogs, is a special case where the algorithm queries an oracle (a human) to get the correct output. In real-world scenarios, high-quality labeled datasets are hard to come by, and creating a sufficiently large one through manual labeling is an arduous task. The challenge is to find the few data points in the entire dataset that, when labeled, will boost the learning process the most.
Consider a dataset of addresses in which we want to find matches. Comparing every row with every other row in the dataset is one way to do this, but it is not efficient: the percentage of duplicates in a dataset is usually quite low, and string operations are costly. For example, consider a dataset with only 100,000 rows, where computing the similarity score for any two rows takes about 0.001 seconds. That is roughly 5 billion pairs, so comparing all of them would take around 58 days. Even with a small dataset and a very fast algorithm, the brute-force approach takes a significant amount of time.
The idea here is to divide the data into blocks. Each block is a subset of the data whose records share the value of some common field. It is fair to assume that duplicates will almost always have something in common. The pairs in which we are interested are:
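The 58-day figure can be checked with a few lines of arithmetic. The variable names below are only for illustration; the inputs are the ones from the example above.

```python
rows = 100_000
seconds_per_comparison = 0.001

# Every unordered pair of distinct rows is compared once: n * (n - 1) / 2.
pairs = rows * (rows - 1) // 2
total_seconds = pairs * seconds_per_comparison
total_days = total_seconds / 86_400  # seconds in a day

print(f"{pairs:,} comparisons, about {total_days:.0f} days")
```

Because the number of pairs grows with the square of the number of rows, doubling the dataset roughly quadruples the running time.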
- A pair that is in the same block but that our classifier predicts as non-duplicate
- A pair that is not in the same block but that our classifier predicts as duplicate
In the first case, a pair that our algorithm predicts as non-duplicate must be reviewed by a human. If the pair is in fact a duplicate, the algorithm needs to learn from the mistake; if it is not a duplicate, the two records should exist in different blocks. In the second case, a pair that is predicted as duplicate should belong to the same block. These pairs are presented to users to label, and based on their input, the blocks are updated. In both cases, the human supervision helps the algorithm learn better.
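The selection of pairs for human review can be sketched like this. It is a toy illustration under several assumptions: records are blocked on a hypothetical `zip` field, the classifier is a stand-in similarity cut-off on a hypothetical `street` field, and the names `block_key`, `is_duplicate`, and `pairs_to_review` are invented for the example. For clarity it loops over every pair, which is fine for four records; on real data the cross-block pairs would have to be sampled rather than enumerated.

```python
from difflib import SequenceMatcher
from itertools import combinations


def block_key(record):
    """Hypothetical blocking rule: records with the same postal code share a block."""
    return record["zip"]


def is_duplicate(a, b, threshold=0.8):
    """Stand-in classifier: a similarity cut-off on the street line."""
    score = SequenceMatcher(None, a["street"].lower(), b["street"].lower()).ratio()
    return score >= threshold


def pairs_to_review(records):
    """Return the pairs on which block membership and the classifier disagree."""
    same_block_not_dup, diff_block_dup = [], []
    for i, j in combinations(range(len(records)), 2):
        same_block = block_key(records[i]) == block_key(records[j])
        predicted_dup = is_duplicate(records[i], records[j])
        if same_block and not predicted_dup:
            same_block_not_dup.append((i, j))  # first case
        elif not same_block and predicted_dup:
            diff_block_dup.append((i, j))      # second case
    return same_block_not_dup, diff_block_dup


records = [
    {"street": "12 Main Street", "zip": "62704"},
    {"street": "12 Main St", "zip": "62704"},
    {"street": "98 Pine Road", "zip": "62704"},
    {"street": "12 Main Street", "zip": "62074"},  # postal code typed incorrectly
]

same_block_not_dup, diff_block_dup = pairs_to_review(records)
print("Same block, predicted non-duplicate:", same_block_not_dup)
print("Different blocks, predicted duplicate:", diff_block_dup)
```

In this toy data, the record with the mistyped postal code ends up in the wrong block, and it is exactly the kind of pair that gets surfaced for a human to label.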
We live in a world of enormous data, and every organization has a lot of information to deal with. Processing this data manually is incredibly inefficient. As data volumes keep rising, we need to adopt AI to save time and cost.
By adopting AI for a repetitive and error-prone task such as address matching, we can help organizations serve their customers better. The beauty of machine learning is that even when you don’t know your addresses well, machines can match them for you more accurately. This frees up people to invest their time in strategizing and creating better experiences.