Information extraction from attending physician statements (APS) for underwriting
When applicants for life insurance are evaluated, underwriters review medical records, among other sources, because they provide many clues for risk assessment. However, the reports of the treating physicians arrive in different formats and use different terminologies. They contain a variety of medical documentation, such as laboratory results and pathology reports. This makes it difficult for the responsible underwriter to find the most relevant information quickly and thus to provide a sound risk assessment efficiently.
An efficient information extraction system that draws on a wide variety of documents makes the most relevant parts easy to find. It can also estimate the complexity of a case, so that an underwriter needs less time for simple cases. The more complex cases can be assigned to experienced staff, which both increases customer satisfaction and makes risk assessment more efficient.
A crucial step in natural language processing (NLP) projects that start from scanned documents is the choice of the optical character recognition (OCR) tool. It extracts the text from scanned or photographed documents so that it can be analyzed by a computer. However, medical reports and documents come in different forms: they may be handwritten or machine-generated, and the information may be summarized in tables or written as continuous text. Handwritten documents in particular remain a major challenge because of the variety of handwriting styles.
The next step is a machine learning (ML) model that analyzes the digitized documents and classifies the most relevant information. This allows the documents to be sorted and searched by their medical content. However, this requires a sufficiently large training dataset in which the relevant information is labeled. The dataset should cover the different input formats so that the model can handle all of them and important information is not overlooked.
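As a minimal sketch of such a classifier, the following pure-Python multinomial naive Bayes model learns categories from labeled text snippets and assigns a new snippet to the most likely one. The category names, the training sentences and the `NaiveBayes` class are illustrative assumptions and do not come from a real APS dataset; a production system would rely on an established library and far more labeled data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter(labels)      # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled snippets; real training data would come from
# annotated APS documents.
TRAIN_TEXTS = [
    "hemoglobin a1c elevated fasting glucose high",
    "glucose and cholesterol panel results",
    "biopsy shows malignant cells in tissue sample",
    "tissue specimen biopsy benign",
]
TRAIN_LABELS = ["laboratory", "laboratory", "pathology", "pathology"]

model = NaiveBayes().fit(TRAIN_TEXTS, TRAIN_LABELS)
print(model.predict("fasting glucose result"))
print(model.predict("biopsy of tissue"))
```

With only four training snippets the model is a toy, but it shows why labeled data matters: every prediction is derived from word frequencies observed per label.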
Potential solution approaches
For text extraction from handwritten documents, the OCR tool Google Cloud Vision currently appears to be the only viable option. If its output is of satisfactory quality, the text data can be analyzed further.
To classify the text, NLP techniques can be applied, such as naive Bayes classifiers on TF-IDF features or LSTM networks. They can capture the relationships between words and their contexts, with a focus on medical terms. For simpler extraction tasks, rule-based approaches might also be sufficient. This makes it possible to identify medical terms and their positions in the documents and to determine the medical specialty of a section.
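A rule-based variant can be sketched in a few lines. The example below scans a text for terms from a small dictionary, reports their character positions, and assigns a section to the specialty with the most matches. The term list, the specialty names and the function names are hypothetical placeholders; a real system would draw on a curated medical vocabulary.

```python
import re
from collections import Counter

# Hypothetical term dictionary: medical term -> specialty.
TERM_TO_SPECIALTY = {
    "myocardial infarction": "cardiology",
    "hypertension": "cardiology",
    "diabetes": "endocrinology",
    "hba1c": "endocrinology",
    "carcinoma": "oncology",
}

# One case-insensitive pattern; longer terms come first so that
# multi-word terms take precedence over shorter ones.
_PATTERN = re.compile(
    r"\b("
    + "|".join(
        re.escape(term)
        for term in sorted(TERM_TO_SPECIALTY, key=len, reverse=True)
    )
    + r")\b",
    re.IGNORECASE,
)


def find_terms(text):
    """Return (term, start, end, specialty) for every dictionary term in text."""
    hits = []
    for match in _PATTERN.finditer(text):
        term = match.group(1).lower()
        hits.append((term, match.start(), match.end(), TERM_TO_SPECIALTY[term]))
    return hits


def section_specialty(text):
    """Return the specialty with the most term matches, or None without matches."""
    counts = Counter(specialty for _, _, _, specialty in find_terms(text))
    return counts.most_common(1)[0][0] if counts else None


section = "History of hypertension. Myocardial infarction in 2015. HbA1c within range."
for hit in find_terms(section):
    print(hit)
print(section_specialty(section))
```

Counting matches per specialty is a deliberately simple decision rule; it breaks ties arbitrarily and ignores negations such as "no history of hypertension", which a real rule set would need to handle.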
One can also implement a semantic search algorithm that captures the meaning of a search term instead of matching only its exact wording. This makes navigation through the documents much easier. The BERT model developed and pre-trained by Google can be used for this purpose; it is then specialized for the specific use case by fine-tuning it on the labeled data. In addition, the domain-specific BioBERT model already exists for biomedical language, which further simplifies the implementation of a semantic search algorithm.
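The ranking step of such a semantic search can be sketched independently of the embedding model. The example below assumes that the query and every passage have already been mapped to vectors by a BERT-style encoder. The three-dimensional vectors are invented stand-ins for real embeddings, and `rank_passages` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passages, top_k=2):
    """Return the top_k (score, text) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Invented toy embeddings; in practice these would be produced by an
# encoder such as BERT or BioBERT and have hundreds of dimensions.
PASSAGES = [
    ("Patient suffered a heart attack in 2015.", [0.9, 0.1, 0.0]),
    ("Fasting glucose was elevated.", [0.1, 0.9, 0.1]),
    ("No known allergies.", [0.0, 0.1, 0.9]),
]
# Stand-in for the embedding of the query "myocardial infarction".
QUERY_VEC = [0.8, 0.2, 0.1]

for score, text in rank_passages(QUERY_VEC, PASSAGES):
    print(f"{score:.2f}  {text}")
```

The point of the toy vectors is that the query and the best passage share no words; they are matched only because their vectors point in a similar direction, which is the behavior a semantic search is meant to provide.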