Natural Language Processing (NLP) is the state-of-the-art AI technology that enables machines to understand human language. NLP is what allows virtual assistants to understand user inputs and respond accordingly. The global Intelligent Virtual Assistant (IVA) market is expected to reach $25.63 billion by 2025, rising at a compound annual growth rate of 40.4% during the forecast period (2019-2025) across several domains, including BFSI and retail (4).
Whenever a user asks a virtual assistant (say, a chatbot) a question in the form of text, the words in the text undergo several pre-processing techniques so the chatbot can capture the relevant information and come up with a likely response. One such pre-processing technique in NLP is entity extraction.
What is entity extraction?
According to Wikipedia, “Entity extraction is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, etc.”
Open-source libraries like Stanford NLP and spaCy help us identify the words in a text and classify them into generic entity classes like Person, Organization, Location, Time, etc.
E.g. “Zensar was founded in the year 1991. Its headquarters is in Pune”.
The entities for the above text are:
Zensar – Organization
Pune – Location
In this way, we preprocess the unstructured text into a structured, machine-understandable format and carry out further processing, depending on whether we want to retrieve information/facts or answer questions.
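As a toy illustration of this structured output, here is a minimal rule-based tagger in Python. Real systems such as spaCy or Stanford NLP use statistical models; the gazetteer below is invented purely for this example:

```python
# Minimal rule-based entity tagger -- a toy stand-in for statistical NER
# libraries. The gazetteer entries are illustrative only.
GAZETTEER = {
    "Zensar": "Organization",
    "Pune": "Location",
}

def extract_entities(text):
    """Return (token, entity_class) pairs for tokens found in the gazetteer."""
    entities = []
    for token in text.replace(".", " ").split():
        if token in GAZETTEER:
            entities.append((token, GAZETTEER[token]))
    return entities

print(extract_entities("Zensar was founded in the year 1991. Its headquarters is in Pune"))
# -> [('Zensar', 'Organization'), ('Pune', 'Location')]
```

The output pairs each surface word with its entity class, which is exactly the structured representation that downstream retrieval or question answering consumes.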
For a business use case, however, we want to identify not only these generic entities but also domain-specific entities, and categorize them into custom-defined classes for a better understanding of domain-specific text. For example, consider the text,
“How can I apply for home insurance.”
Here, the word “home” is highly important, and we want it identified as a special class called the “Insurance type” entity. Thus, there is a need to identify domain-specific entities in a text, in addition to generic entities, for better retrieval of information from unstructured text.
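To make the idea of a custom class concrete, here is a small context-sensitive rule. This is only a hedged sketch: the type list, the rule, and the class name are invented for illustration, not taken from any production system:

```python
# Hedged sketch: tag a word with the custom "Insurance type" class only when
# it directly precedes the word "insurance". The set of types is illustrative.
INSURANCE_TYPES = {"home", "auto", "life", "health"}

def tag_insurance_types(text):
    """Return (token, class) pairs for insurance-type mentions in the text."""
    tokens = text.lower().rstrip(".?").split()
    tags = []
    for i, token in enumerate(tokens):
        if token in INSURANCE_TYPES and i + 1 < len(tokens) and tokens[i + 1] == "insurance":
            tags.append((token, "Insurance type"))
    return tags

print(tag_insurance_types("How can I apply for home insurance."))
# -> [('home', 'Insurance type')]
```

Note that the rule only fires in context ("home insurance"), so a sentence like "I am going home" would not be tagged, which is the behavior we want from a domain-specific extractor.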
To date, numerous Named Entity Recognition (NER) systems have been built using openly available annotated datasets (like CoNLL, OntoNotes 5.0), and they have a very good performance record. But these models fail to identify domain-specific words in sectors like banking and insurance, where words can have a completely different meaning. In addition, many domains do not have large amounts of annotated data available for training robust domain-specific NER models.
A key approach to this challenge is to adapt models trained on generic domains (such as the news domain), where large amounts of annotated training data are available, to target domains with scarce annotated data using transfer learning.
We call the domain with large amounts of annotated data the source domain, and the one with scarce annotated data the target domain.
How does domain adaptation work?
Domain adaptation is a transfer learning approach wherein a model is trained on a source domain (the news domain in this case) with sufficiently large amounts of labeled data, and the weights obtained from training the source model are used to initialize the weights for training the target domain model. The transferred weights are then fine-tuned with the limited labeled target domain data available. In this way, we can train a model robustly despite having scarce data.
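The initialize-then-fine-tune recipe can be sketched with a toy linear model standing in for the neural tagger. Everything here is invented (the data, the domain shift, the learning rate), so this only illustrates the mechanics of weight transfer, not the actual NER model:

```python
import numpy as np

# Toy sketch of weight transfer: train on plentiful "source" data, then
# initialize the "target" model from those weights and fine-tune on a handful
# of target examples. All data and hyperparameters are invented.
rng = np.random.default_rng(0)

def train(X, y, w, lr=0.1, steps=200):
    """Plain gradient descent on mean squared error, starting from w."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

# Source domain: plenty of labeled data, trained from scratch (zeros).
X_src = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y_src = X_src @ w_true
w_src = train(X_src, y_src, np.zeros(3))

# Target domain: only 10 labeled examples with a slight domain shift.
# Initialize from the source weights instead of zeros, then fine-tune.
X_tgt = rng.normal(size=(10, 3))
y_tgt = X_tgt @ (w_true + 0.1)
w_tgt = train(X_tgt, y_tgt, w_src.copy())
```

Because the fine-tuning starts close to a good solution, even 10 target examples are enough for the toy model to track the shifted target weights.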
What are the features to be extracted?
The common features used for training the model are Global Vectors for word representation (GloVe), which are domain-general pre-trained word embeddings. Assuming there is little domain shift between the input feature spaces, the same word embeddings are used for both the source and target domains.
This assumption becomes weak when there is a large domain shift. For example, consider the word “orange”: it can refer to a color or a fruit, and its class is determined by the context in which the word occurs. Such words need separate word representations. In these cases, we need to use domain-specific word embeddings, as mentioned in (1).
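A toy numeric illustration of this domain shift, using hand-made two-dimensional vectors rather than real GloVe embeddings (the numbers are invented to make the point visible):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented 2-d "embeddings": axis 0 ~ fruit-ness, axis 1 ~ color-ness.
# General (news) space: "orange" sits near fruit terms.
general = {
    "orange": np.array([0.9, 0.1]),
    "apple":  np.array([1.0, 0.0]),
    "red":    np.array([0.0, 1.0]),
}
# A fashion-domain space: the same word drifts toward color terms.
fashion = {
    "orange": np.array([0.2, 0.95]),
    "apple":  np.array([1.0, 0.0]),
    "red":    np.array([0.0, 1.0]),
}

print(cosine(general["orange"], general["apple"]))  # high: fruit sense
print(cosine(fashion["orange"], fashion["red"]))    # high: color sense
```

A single shared embedding for “orange” cannot be close to both “apple” and “red” at once, which is exactly why domain-specific embeddings become necessary under a large shift.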
We use a Long Short-Term Memory (LSTM) based model to solve this sequence tagging problem efficiently. This model combines bidirectional LSTM (BLSTM) networks with a CRF layer (BLSTM-CRF). The BLSTM-CRF model can efficiently use both past and future input features thanks to its bidirectional LSTM component, while the conditional random field (CRF) component uses a probabilistic approach to predict output sequences given the input sequences. This BLSTM-CRF model can produce state-of-the-art (or close to it) accuracy on NER datasets.
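The CRF decoding step can be sketched with a small Viterbi implementation. In the real model, the BLSTM produces per-token emission scores; the scores below are invented stand-ins for the sentence “Zensar is in Pune”:

```python
import numpy as np

# Hedged sketch of CRF decoding in a BLSTM-CRF tagger. Emission scores are
# invented; in practice they come from the BLSTM's hidden states.
LABELS = ["O", "B-ORG", "B-LOC"]

def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence.

    emissions: (n_tokens, n_labels) scores; transitions[i, j] scores label i -> j.
    """
    n, k = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [LABELS[i] for i in reversed(path)]

emissions = np.array([
    [0.1, 2.0, 0.2],   # "Zensar" -- strong B-ORG evidence
    [2.0, 0.1, 0.1],   # "is"
    [2.0, 0.1, 0.1],   # "in"
    [0.1, 0.2, 2.0],   # "Pune"   -- strong B-LOC evidence
])
transitions = np.zeros((3, 3))  # neutral transitions for this toy example
print(viterbi(emissions, transitions))
# -> ['B-ORG', 'O', 'O', 'B-LOC']
```

With non-zero transition scores, the CRF can forbid or favor particular label-to-label moves (e.g., penalizing an entity tag immediately after another entity tag), which is what makes it stronger than picking each token's label independently.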
Let us build a robust cross-domain NER model architecture, as described in (6).
Source domain model: The pre-trained word embeddings of the source domain are used to train a BLSTM model using the available labeled source data.
The hidden states from this BLSTM are fed to the conditional random field (CRF) layer, which classifies each word into the source domain’s entity classes. The weights obtained from training this model are saved for later use in training the target model.
Target domain model: This model has a series of adaptation layers. They are as follows:
Word adaptation layer: If we use domain-specific word embeddings, they need to be projected into the source domain’s embedding space to keep the transfer learning homogeneous. This projection is done by the word adaptation layer. It bridges the gap between heterogeneous input spaces, but only at the word level, and it is context-independent.
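One plausible realization of such a projection is a linear map fitted by least squares on a small anchor vocabulary shared by both domains. The embeddings here are random stand-ins, so this is only a sketch of the idea, not the layer used in (6):

```python
import numpy as np

# Hedged sketch of a word adaptation layer: learn a linear map W that projects
# target-domain embeddings into the source embedding space, fitted on anchor
# words present in both vocabularies. All vectors are random stand-ins.
rng = np.random.default_rng(1)

d_tgt, d_src, n_anchor = 4, 3, 50
E_tgt = rng.normal(size=(n_anchor, d_tgt))   # target embeddings of anchor words
W_true = rng.normal(size=(d_tgt, d_src))
E_src = E_tgt @ W_true                       # their source-space counterparts

# Fit the projection: minimize ||E_tgt @ W - E_src||^2 over W.
W, *_ = np.linalg.lstsq(E_tgt, E_src, rcond=None)

def adapt(v):
    """Project one target-domain word vector into the source space."""
    return v @ W
```

Because the map is learned once over the vocabulary and applied word by word, it is context-independent, matching the limitation noted above and motivating the sentence adaptation layer that follows.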
Sentence adaptation layer: To capture contextual information at the sentence level, we add a sentence adaptation layer. This is primarily a BLSTM layer placed right after the word adaptation layer, encoding the sequence of word embeddings into sentence-level representations. The output states from this layer are fed to a BLSTM layer whose weights are initialized from the weights saved in the source model. These weights are fine-tuned while training the model on the available labeled target data.
Output adaptation layer: The output labels in the source domain can differ from those in the target domain. Also, a word present in both domains may belong to different entity classes depending on context (e.g., “orange”). Thus, re-classifying the words using contextual information is necessary. To solve this problem, we add a BLSTM layer right before the final CRF layer and call it the output adaptation layer.
Finally, the hidden states from the output adaptation layer are fed into the CRF layer, which predicts the output labels in the target domain.
Benefits of domain adaptable entity extraction:
1) This domain-adaptable approach is highly flexible across sectors like banking, financial services, and insurance: having trained a model on one generalized domain (typically the news domain), we can adapt it to these sectors and use NER functionality specific to each.
2) Greater adaptability also translates into better NER performance. Our model can match the baseline model’s performance (training from scratch without transfer learning) with a limited number of training examples, and performance improves further if all available training examples are used.
3) Many companies face the issue of manually annotating data, as it is labor-intensive. With this approach, we can drastically reduce model training time because our model requires less annotated data.
The approach described above is for a supervised setting where there is labeled data for both the source and target domains. In many cases, adequately labeled target data is not available, and the accuracy of entity identification falls as a result. Hence, there is a need to develop solutions for domain-specific entity extraction using unsupervised approaches.
In the unsupervised approach, we collect labeled data and unlabeled text from the source domain, while collecting only unlabeled text from the target domain. We then generate a common feature representation using the unlabeled text from both domains. A detailed description of how features are extracted in this scenario is given in (2) and (3). A model is trained on these common features using the available labeled source data, and the weights obtained are used to predict output labels for the target domain.
4) https://www.businesswire.com/news/home/20190822005478/en/Global-Intelligent-Virtual-Assistant-IVA-Market-2019-2025