A company specializing in healthcare services and technology set out to improve how medical records are reviewed and used for reimbursement processes. One of their major AI-driven tools, CAVO, provides an interactive platform that simplifies the review of complex medical records for insurers and other payers. To make sense of the vast amount of data locked inside scanned documents, the company's data science team uses optical character recognition (OCR) and natural language processing (NLP), particularly entity extraction, to unlock valuable insights.
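As a rough illustration of that first stage, the sketch below runs OCR on a scanned page and then extracts entities from the recognized text. The choice of pytesseract and spaCy, the model name, and the file path are assumptions made for demonstration, not a description of the company's actual stack.

```python
# Minimal sketch: OCR a scanned page, then run entity extraction on the text.
# pytesseract, spaCy, and the file path below are illustrative choices only.
import pytesseract
from PIL import Image
import spacy

# A general-purpose English model; a clinical NER model would be swapped in here.
nlp = spacy.load("en_core_web_sm")

def extract_entities(page_image_path: str) -> list[tuple[str, str]]:
    # OCR: turn the scanned page into plain text
    text = pytesseract.image_to_string(Image.open(page_image_path))
    # NLP: pull out entities (dates, organizations, quantities, etc.)
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

if __name__ == "__main__":
    for entity, label in extract_entities("scanned_page_001.png"):
        print(f"{label}: {entity}")
```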
Medical records often arrive in bulky, inconsistent formats, some stretching past a thousand pages, and they come from a wide mix of hospitals, clinics, and insurance providers. These documents include everything from scanned faxes to PDFs, each with its own structure and layout. The team's challenge was to turn this unstructured data into something usable for their clinical machine learning models, which assess whether the documentation supports the treatment paths taken by patients.
To streamline the preparation of this data for machine learning, the company turned to Databrewery. Using Databrewery's platform, they were able to quickly validate the accuracy of entity extraction tasks through a mix of external labeling partners and their own internal experts. This allowed them to build a model that automatically tags important sections of medical records, such as emergency department reports or discharge summaries, which reviewers could then verify for correctness inside the Databrewery interface.
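A minimal sketch of that pre-labeling step might look like the following: a text classifier assigns each page a section type and a confidence score, and those pre-labels are what reviewers confirm or correct in the labeling interface. The section names, training snippets, and model choice here are illustrative assumptions, not the production system.

```python
# Illustrative sketch of pre-labeling document sections for reviewer verification.
# The section names, training data, and model choice are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny stand-in training set: (page text, section type)
pages = [
    "Patient presented to the emergency department with chest pain...",
    "Discharge summary: patient discharged in stable condition...",
]
sections = ["emergency_department_report", "discharge_summary"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(pages, sections)

def prelabel(page_text: str) -> dict:
    # Produce a pre-label plus a confidence score for a reviewer to verify
    probs = model.predict_proba([page_text])[0]
    best = probs.argmax()
    return {"section": model.classes_[best], "confidence": float(probs[best])}

print(prelabel("ED triage note: patient arrived by ambulance..."))
```

In the workflow described above, pre-labels like these would be pushed into the labeling platform rather than printed, so reviewers only need to confirm or correct them.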
They also set up active learning pipelines to identify where their models were underperforming. Their internal experts reviewed confident predictions to catch potential issues, but focused most of their attention on uncertain outputs where the model needed more training. As they built out this system, they continuously added new types of medical documents to teach their models how to handle claims, appeals, and billing more effectively.
To guide their sampling strategy, they analyzed model predictions across unlabeled data by calculating entropy, a measure of uncertainty. They created two groups: one where the model was confident (low entropy) and one where it wasn’t (high entropy). They then pulled most of their training samples from the uncertain group, ensuring their model learned from the most ambiguous and challenging examples.
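In code, that sampling strategy might look roughly like the sketch below: compute the entropy of each prediction's class distribution, split the unlabeled pool at a threshold, and draw most of the batch from the high-entropy group. The specific threshold and the 80/20 split are illustrative assumptions, not the team's actual settings.

```python
# Sketch of entropy-based sampling over unlabeled predictions.
# The threshold and the 80/20 sampling split are illustrative assumptions.
import numpy as np

def entropy(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy of each row of class probabilities (higher = more uncertain)."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def sample_for_labeling(pred_probs: np.ndarray, n_samples: int, threshold: float = 0.5,
                        uncertain_fraction: float = 0.8, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    h = entropy(pred_probs)
    uncertain = np.where(h >= threshold)[0]   # high entropy: model is unsure
    confident = np.where(h < threshold)[0]    # low entropy: model is confident
    n_unc = min(int(n_samples * uncertain_fraction), len(uncertain))
    n_conf = min(n_samples - n_unc, len(confident))
    picked = []
    if n_unc:
        picked.append(rng.choice(uncertain, size=n_unc, replace=False))
    if n_conf:
        picked.append(rng.choice(confident, size=n_conf, replace=False))
    # Indices of unlabeled documents to send for labeling
    return np.concatenate(picked) if picked else np.array([], dtype=int)

# Example: three documents' predicted class probabilities
probs = np.array([[0.95, 0.05], [0.55, 0.45], [0.50, 0.50]])
print(sample_for_labeling(probs, n_samples=2))
```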
During this process, they found a class imbalance in the dataset: some document types made up 25% of the data, while others were barely represented. To fix this, they used an earlier version of their model to help rebalance the dataset in a semi-supervised way, leading to stronger overall performance.
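One way to picture that rebalancing step is pseudo-labeling, sketched below under stated assumptions: the earlier model scores unlabeled pages, and only high-confidence predictions for under-represented document types are added back into the training set. The confidence cutoff and the target-count logic are illustrative choices, not the company's documented method.

```python
# Sketch of semi-supervised rebalancing via pseudo-labeling.
# The confidence cutoff and target counts are illustrative assumptions.
from collections import Counter
import numpy as np

def rebalance_with_pseudo_labels(model, X_labeled, y_labeled, X_unlabeled, confidence=0.9):
    """Add confident pseudo-labels for under-represented classes only."""
    counts = Counter(y_labeled)
    target = max(counts.values())  # bring every class toward the largest class size

    probs = model.predict_proba(X_unlabeled)
    preds = model.classes_[probs.argmax(axis=1)]  # predicted document type per page
    conf = probs.max(axis=1)                      # confidence of each prediction

    new_X, new_y = list(X_labeled), list(y_labeled)
    for i, (label, c) in enumerate(zip(preds, conf)):
        if c >= confidence and counts[label] < target:
            new_X.append(X_unlabeled[i])
            new_y.append(label)
            counts[label] += 1
    return new_X, new_y
```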
Since adopting Databrewery, the company has seen major gains in efficiency. Labeling each page of a medical record used to take around 13 seconds; with model-assisted workflows and automation, they have cut that to roughly 8 seconds. This improvement has saved them more than 25 hours per labeling task and, thanks to a software-first approach, made the entire process smoother for their team.