The company, a leader in applying computer vision and machine learning to agriculture, forestry, and construction, has developed advanced tools that are reshaping how machines interact with the natural world. One of their flagship technologies, See & Spray, uses real-time vision models to identify and selectively treat weeds, helping farmers dramatically reduce chemical usage, cut costs, and minimize environmental impact. As the solution rolled out across more machines and more terrain, the volume of captured data began to grow at an exponential pace.
With the scale of operations increasing, a core challenge emerged: model training cycles were far too slow. Cycles that once took weeks to complete became an unsustainable bottleneck, especially as the team accumulated petabytes of new data across diverse conditions. Legacy systems couldn't keep up. Finding the right data, labeling it accurately, and managing pipelines had grown into a sprawling, manual burden: curating training data took too long, annotation was labor-intensive and expensive, and collaboration between ML engineers and data scientists was bogged down by infrastructure overhead.
To fix this, the team rebuilt their entire approach. They transitioned from a traditional data lake-based setup to a unified, modern machine learning platform that gave ML teams more autonomy and significantly cut down on friction in daily workflows. The new stack was built on top of a robust data foundation, integrated with Kubeflow and Databricks for flexibility, and backed by embedded tools that handle curation, annotation, and quality control automatically.
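To make the shape of such a setup concrete, here is a minimal sketch of a Kubeflow pipeline that chains a data-curation step into a training step. It assumes the Kubeflow Pipelines v2 SDK (kfp); the component names, base images, and parameters are hypothetical placeholders rather than details of the company's actual pipelines.

```python
from kfp import dsl


@dsl.component(base_image="python:3.11")
def curate_dataset(query: str, output_manifest: dsl.Output[dsl.Dataset]):
    # Hypothetical curation step: materialize a dataset slice matching
    # `query` and write an image/label manifest to the output artifact.
    with open(output_manifest.path, "w") as f:
        f.write(f"manifest for query: {query}\n")


@dsl.component(base_image="python:3.11")
def train_model(manifest: dsl.Input[dsl.Dataset], epochs: int) -> str:
    # Hypothetical training step; in practice this would hand off to a
    # Databricks job or GPU cluster rather than run inline.
    return f"trained on {manifest.path} for {epochs} epochs"


@dsl.pipeline(name="weed-detection-training")
def training_pipeline(query: str = "occluded weeds, low light", epochs: int = 10):
    # Curate a targeted slice, then train against the resulting manifest.
    curated = curate_dataset(query=query)
    train_model(manifest=curated.outputs["output_manifest"], epochs=epochs)
```

The value of wiring curation in as a first-class pipeline step is that every training run records exactly which data slice it consumed, which keeps iteration reproducible as the dataset grows.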
At the heart of this transformation was Databrewery, which provided the data curation, labeling automation, and human-in-the-loop tools necessary to keep up with the pace of model iteration. By adopting Databrewery’s model-assisted labeling system, they were able to cut both annotation time and cost by half. They also layered on an intelligent QA system that cross-checks human and machine-generated labels, flagging inconsistencies in a centralized audit dashboard so teams can catch errors quickly and improve dataset integrity.
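The cross-checking idea can be pictured as a simple agreement pass over paired annotations. The sketch below, in plain Python, flags bounding boxes whose best human-to-model overlap falls below a threshold; the Box type, record layout, and 0.5 IoU cutoff are illustrative assumptions, not details of Databrewery's QA system.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def flag_disagreements(human: dict, model: dict, min_iou: float = 0.5) -> list:
    """Compare human and model boxes per image and flag weak agreement
    so it can be routed to an audit queue for review."""
    flags = []
    for image_id, h_boxes in human.items():
        m_boxes = model.get(image_id, [])
        for h in h_boxes:
            best = max((iou(h, m) for m in m_boxes), default=0.0)
            if best < min_iou:
                flags.append({"image": image_id, "box": h, "best_iou": best})
    return flags
```

Routing only the low-agreement items to reviewers is what keeps the human-in-the-loop workload bounded while still catching the errors that matter most for dataset integrity.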
For data selection, Databrewery Catalog allowed them to shift away from broad, manual searches toward intelligent dataset assembly. Features like natural language search, metadata filtering, and visual similarity search made it possible to quickly isolate the most relevant examples, no matter how massive the dataset. With rules and filters applied dynamically, new images that meet dataset criteria are automatically added, keeping training sets current without manual review.
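Conceptually, this style of dataset assembly pairs a metadata filter with a nearest-neighbor search over image embeddings. The following sketch uses NumPy cosine similarity against an in-memory index; the field names, the 0.75 similarity threshold, and the rule itself are hypothetical, and a production catalog would use a dedicated vector index rather than a flat array.

```python
import numpy as np


def cosine_sim(query_vec: np.ndarray, index: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query embedding and an (N, d) index."""
    q = query_vec / np.linalg.norm(query_vec)
    idx = index / np.linalg.norm(index, axis=1, keepdims=True)
    return idx @ q


def assemble_slice(records, embeddings, query_vec, min_sim=0.75):
    """Select records that pass the metadata rule AND resemble the query image.

    `records` is a list of dicts with illustrative fields such as 'crop' and
    'lighting'; `embeddings` is an (N, d) array aligned row-for-row with it.
    """
    sims = cosine_sim(query_vec, embeddings)
    selected = []
    for rec, sim in zip(records, sims):
        if rec.get("crop") == "cotton" and rec.get("lighting") == "low" and sim >= min_sim:
            selected.append({**rec, "similarity": float(sim)})
    # Re-running the same rule over newly ingested records is what keeps
    # the training slice current without manual review.
    return selected
```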
Thanks to this platform, teams that once waited days for curated data can now access targeted, ready-to-train datasets in minutes. Even with over a billion images in play, machine learning engineers and scientists now spend more time fine-tuning models and less time wrangling infrastructure. Looking ahead, the company plans to expand the platform further, scaling new computer vision initiatives while continuing to streamline the way data and models are built, evaluated, and improved.