Accelerating Medical Record Analysis with Smarter Training Data Workflows

The Challenge

Processing medical records for machine learning is notoriously complex. Each document can span hundreds or even thousands of pages and arrive in inconsistent formats from various sources like hospitals, payers, and clinics. These documents often include a mix of scanned PDFs, faxes, and handwritten notes, making the extraction of structured data both time-intensive and costly.

The Approach

To overcome this, the team turned to Databrewery's platform to build a smarter pipeline using active learning and model-assisted labeling. With built-in QA tools and human review loops, they were able to combine automation with oversight, ensuring each annotation improved model accuracy. Internal domain experts and external annotators worked within a shared environment, speeding up page classification and boosting collaboration across teams.

The Outcome

By shifting from a fully manual review process to one powered by automation and smart prioritization, the team reduced labeling time per page from 13 seconds to just 8 seconds. This efficiency enabled them to scale faster and maintain higher quality across millions of records.

Medical Record Analysis

A company specializing in healthcare services and technology set out to improve how medical records are reviewed and used for reimbursement processes. One of their major AI-driven tools, CAVO, provides an interactive platform that simplifies reviewing complex medical records for both providers and payers. To make sense of the vast data hidden within scanned documents, the company's data science team uses optical character recognition (OCR) and natural language processing (NLP), particularly entity extraction, to unlock valuable insights.

Medical records often arrive in bulky, inconsistent formats, some stretching over a thousand pages, and come from a wide mix of hospitals, clinics, and insurance providers. These documents include everything from scanned faxes to PDFs, all with different structures and layouts. The team's challenge was to turn this unstructured data into something usable for their clinical machine learning models, which assess whether the documentation supports the treatment paths taken by patients.

To streamline the process of preparing this data for machine learning, the company turned to Databrewery. Using Databrewery's platform, they were able to quickly validate the accuracy of entity extraction tasks through a mix of external labeling partners and their own internal experts. This allowed them to build a model that could automatically tag important sections of medical records, such as emergency department reports or discharge summaries, which reviewers could then verify for correctness inside the Databrewery interface.
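A page-tagging model of this kind can be sketched as a simple text classifier over OCR output. The labels, training snippets, and pipeline below are illustrative assumptions, not the company's actual taxonomy or architecture:

```python
# Minimal sketch of page-type classification over OCR'd page text.
# Labels and training snippets are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pages = [
    "patient arrived at emergency department with chest pain",
    "discharge summary: patient stable, follow up in two weeks",
    "itemized billing statement for inpatient stay",
    "ed triage notes, vitals recorded on arrival",
]
labels = ["ed_report", "discharge_summary", "billing", "ed_report"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(pages, labels)

# Predicted tags can be pre-filled so reviewers only verify, not label from scratch.
print(clf.predict(["summary of discharge medications and follow up"]))
```

In a real workflow the predicted tag would be attached to each page as a pre-label, so the human review step becomes a quick confirm-or-correct action rather than manual annotation.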


They also set up active learning pipelines to identify parts of the data where their models were underperforming. Their internal experts not only reviewed confident predictions to catch potential issues but also focused on uncertain outputs where the model needed more training. As they built out this system, they continuously added new types of medical documents to teach their models how to handle claims, appeals, and billing more effectively.

To guide their sampling strategy, they analyzed model predictions across unlabeled data by calculating entropy, a measure of uncertainty. They created two groups: one where the model was confident (low entropy) and one where it wasn’t (high entropy). They then pulled most of their training samples from the uncertain group, ensuring their model learned from the most ambiguous and challenging examples.

During this process, they found a class imbalance in the dataset. Some document types made up 25% of the data, while others were barely represented. To fix this, they used an earlier version of their model to help rebalance the dataset in a semi-supervised way, leading to stronger overall performance.
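One way to realize this semi-supervised rebalancing is to run the earlier model over the unlabeled pool and select candidate pages by pseudo-label so that rare classes are surfaced for review. The class names, pool size, and skew below are illustrative assumptions:

```python
# Hedged sketch of pseudo-label rebalancing. `pseudo` stands in for the
# earlier model's predictions over an unlabeled pool; values are illustrative.
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
classes = np.array(["ed_report", "discharge_summary", "billing", "appeal"])

# Imbalanced pseudo-labels over the unlabeled pool (illustrative skew).
pseudo = rng.choice(classes, size=5000, p=[0.55, 0.25, 0.15, 0.05])

target_per_class = 200
selected = []
for c in classes:
    idx = np.where(pseudo == c)[0]
    take = min(target_per_class, len(idx))
    selected.extend(rng.choice(idx, size=take, replace=False))

# The selected pages form a roughly class-balanced candidate set,
# which human reviewers then verify before it enters the training set.
print(Counter(pseudo[selected]))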

Since adopting Databrewery, the company has seen major gains in efficiency. Labeling each page of a medical record used to take around 13 seconds. Now, with model-assisted workflows and automation, they have cut that down to 8 seconds per label. This improvement has saved more than 25 hours per labeling task and, thanks to a software-first approach, made the entire process smoother for their team.
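A quick back-of-the-envelope check ties the reported figures together. The source does not state how many pages a labeling task covers; the batch size below is a hypothetical value consistent with the numbers given:

```python
# Sanity check on the reported savings: 13 s -> 8 s per page.
seconds_saved_per_page = 13 - 8  # 5 seconds saved per page
pages_per_task = 18_000          # hypothetical batch size, not stated in the source
hours_saved = seconds_saved_per_page * pages_per_task / 3600
print(hours_saved)  # 25.0
```

So the cited "more than 25 hours per labeling task" is consistent with tasks on the order of tens of thousands of pages, which fits the document's description of records running into the millions.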