
From hallucinations to hardware: Lessons learned from a real computer vision project that went sideways


Image credit: VentureBeat/Midjourney



Computer vision projects rarely go as planned, and this one was no different. The idea was simple: build a model that analyzes a photo of a laptop and identifies physical damage, such as cracked screens, missing keys or broken hinges. It seemed like a straightforward use case for image-capable large language models, but it quickly became more complex.

We ran into problems with hallucinations, unreliable outputs and images that weren't even laptops. To solve them, we applied an agentic framework in an atypical way: not for task automation, but to improve the model's performance. In this post we walk through what we tried, where each approach fell short, and how a combination of methods ultimately gave us a reliable system.

Where we began: Monolithic prompting

Our initial approach was standard for a multimodal model: we sent an image to an image-capable LLM with a single, large prompt and asked it to identify visible damage. This monolithic prompting is easy to implement and works well for clearly defined tasks. Real-world data, however, is rarely that predictable.
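As a rough sketch, the monolithic approach amounts to packaging the whole inspection task into a single prompt sent alongside the image. The prompt wording, payload shape and endpoint here are illustrative assumptions, not the exact API we used:

```python
import base64

# One large prompt carrying the entire inspection task (illustrative wording).
DAMAGE_PROMPT = (
    "You are a laptop inspection assistant. Examine the attached photo and "
    "list every visible physical defect (cracked screen, missing keys, "
    "broken hinges, dents). Answer as JSON: "
    '{"is_laptop": true/false, "damages": ["..."]}'
)

def build_request(image_bytes: bytes) -> dict:
    """Bundle the single prompt and the image into one request payload
    for a hypothetical multimodal chat endpoint."""
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": DAMAGE_PROMPT},
                {"type": "image", "data": base64.b64encode(image_bytes).decode()},
            ],
        }]
    }
```

Everything rides on that one prompt, which is exactly why the failure modes below were so hard to contain.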

We encountered three major problems early on:

  • Hallucinations: The model would sometimes invent damage that didn't exist or mislabel what it was seeing.
  • Junk image detection: The model had no reliable way of flagging images that weren't laptops; pictures of walls, desks or people would sometimes slip through and receive nonsensical damage reports.
  • Inconsistent accuracy: Together, these problems made the model too unreliable for operational use.

At this point, it became obvious that we would have to iterate.

First fix: Mixing image resolutions

We noticed that the model's output was affected by image quality. Users uploaded everything from sharp, high-resolution photos to blurry ones, which led us to research highlighting how image resolution affects deep learning models.

We trained and tested the deep learning model using a mixture of high- and low-resolution images, the idea being to make it more resilient to the range of image quality it would encounter in practice. This made the model more consistent, but the core problems of hallucination and junk image handling remained.
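A minimal illustration of the idea, using plain pixel grids instead of a real image library. The downscale factors and the nearest-neighbour scheme are assumptions for the sketch, not our exact augmentation pipeline:

```python
def downscale(pixels, factor):
    """Nearest-neighbour downscale of an image stored as a list of pixel rows."""
    return [row[::factor] for row in pixels[::factor]]

def resolution_augment(pixels, factors=(1, 2, 4)):
    """Return the original image plus progressively lower-resolution copies,
    so the training set mixes sharp and degraded versions of each photo."""
    return [downscale(pixels, f) for f in factors]
```

In practice the same idea applies with real resize operations; the principle is simply that every training image contributes several quality levels.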

The multimodal detour: Text-only LLMs go multimodal

Encouraged by recent experiments combining text-only LLMs with image captioning, such as the technique covered in The Batch, we decided to give the approach (in which captions are generated from images and then interpreted by a language model) a try.

This is how it works:

  • First, the LLM generates multiple captions for a given image.
  • A multimodal embedding model checks whether each caption fits the image; in this case, SigLIP was used to score image-text similarity. The system retains the top few captions according to these scores.
  • Using the top captions, the LLM writes new ones, trying to get closer to what the image actually shows.
  • The LLM repeats this until the captions no longer improve, or it reaches a certain limit.
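The steps above can be sketched as a loop with the LLM and the SigLIP scorer stubbed out as callables. The function names, the top-k parameter and the exact stopping rule are our assumptions, not the published algorithm:

```python
from typing import Callable, List

def refine_captions(
    generate: Callable[[List[str]], List[str]],  # LLM: best captions so far -> new candidates
    score: Callable[[str], float],               # embedding model: image-text similarity
    max_rounds: int = 5,
    keep: int = 3,
) -> List[str]:
    """Iteratively generate captions, keep the top-scoring few, and ask the
    LLM for improved rewrites, stopping when scores no longer improve
    or a round limit is reached."""
    best: List[str] = []
    best_score = float("-inf")
    for _ in range(max_rounds):
        candidates = generate(best)
        ranked = sorted(candidates, key=score, reverse=True)[:keep]
        top_score = score(ranked[0])
        if top_score <= best_score:   # captions no longer improve: stop early
            break
        best, best_score = ranked, top_score
    return best
```

In the real system, `score` would wrap SigLIP's image-text similarity and `generate` would prompt the text-only LLM with the current best captions.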

Although clever in theory, it introduced new problems to our use case.

  • Persistent hallucinations: The captions included imaginary damage, which the LLM confidently reported.
  • Incomplete coverage: Even with multiple captions, some issues went unreported.
  • Increased complexity, little benefit: The extra steps made the system more complex without delivering reliable gains over the previous setup.

This was an interesting experiment but not a solution.

A creative use of agentic frameworks

This was the turning point. We wondered whether breaking the image interpretation task down into smaller, specialized agents might help.

Our agentic framework looked like this:

  • Orchestrator agent: Inspected the image and identified which laptop components were visible (screen, keyboard, chassis, ports).
  • Component agents: Dedicated agents inspected each component for specific types of damage; for example, one agent checked for cracked screens and another for missing keys.
  • Junk detector agent: A separate agent flagged whether the image showed a laptop at all.
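Structurally, the framework reduces to a junk check followed by orchestrated dispatch to per-component agents. In the stub below, each agent is a callable returning a list of findings; the names and signatures are illustrative, not our production code:

```python
from typing import Callable, Dict, List

Agent = Callable[[bytes], List[str]]

def inspect_laptop(
    image: bytes,
    is_laptop: Callable[[bytes], bool],             # junk detector agent
    find_components: Callable[[bytes], List[str]],  # orchestrator agent
    component_agents: Dict[str, Agent],             # e.g. {"screen": ..., "keyboard": ...}
) -> Dict[str, List[str]]:
    """Run the junk detector first, then let the orchestrator route the
    image to one specialised agent per visible component."""
    if not is_laptop(image):
        return {"junk": ["image does not show a laptop"]}
    report: Dict[str, List[str]] = {}
    for component in find_components(image):
        agent = component_agents.get(component)
        if agent is not None:          # components without an agent are skipped
            report[component] = agent(image)
    return report
```

Note how a component with no assigned agent is silently skipped: that is exactly the coverage gap described below.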

This modular, task-driven method produced more precise and comprehensible results: hallucinations decreased dramatically, junk images were flagged reliably, and each agent had a simple, focused task.

It was not perfect. There were two main limitations:

  • Increased latency: Running multiple sequential agents increased total inference time.
  • Gaps in coverage: Agents could only detect issues they were explicitly programmed to find; if an image showed something unexpected that no agent was assigned to identify, it went unnoticed.

It was important to find a balance between precision and coverage.

The hybrid solution: Combining monolithic and agentic approaches

We created a hybrid system to bridge the gaps:

  1. First, we ran the agentic framework, which handled precise detection of known damage types and junk images. We reduced the number of agents to improve latency.
  2. Next, a monolithic multimodal LLM prompt scanned the image for anything the agents might have missed.
  3. Lastly, we fine-tuned the model on a curated collection of images for high-priority use cases, such as frequently reported damage scenarios, to further improve accuracy.
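Put together, the hybrid pipeline is a trimmed agentic pass followed by a monolithic sweep, with the two result sets merged. This is a sketch with the model calls stubbed out; deduplicating findings by string match is a simplification:

```python
from typing import Callable, List

def hybrid_inspect(
    image: bytes,
    is_laptop: Callable[[bytes], bool],
    agentic_pass: Callable[[bytes], List[str]],     # precise, known damage types
    monolithic_pass: Callable[[bytes], List[str]],  # broad catch-all sweep
) -> dict:
    """Agents first for precision, then one monolithic prompt for coverage;
    sweep findings are kept only if the agents did not already report them."""
    if not is_laptop(image):
        return {"is_laptop": False, "damages": []}
    known = agentic_pass(image)
    extras = [d for d in monolithic_pass(image) if d not in known]
    return {"is_laptop": True, "damages": known + extras}
```

The ordering matters: the cheap junk check runs first, the precise agents next, and the expensive catch-all prompt only adds what the agents missed.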

This gave us the precision of the agentic setup, the broad coverage of monolithic prompting and the accuracy boost of targeted fine-tuning, while keeping the results explainable.

What we learned

  • Agentic frameworks are more versatile than we thought: Although they are typically associated with workflow management, we found they could boost model performance when used in a structured, modular way.

  • Blending approaches is better than relying on one: The combination of precise agent-based detection with the broad coverage provided by LLMs and a little fine-tuning in the areas that mattered gave us more reliable results than any single approach.
  • Visual models are prone to hallucinations: Even more advanced setups can see things that aren't there; keeping those mistakes in check requires thoughtful system design.
  • Image variety is important: Training and testing with both clear, high-resolution images and everyday, low-quality photos helped the model stay resilient when faced with unpredictable, real-world photographs.
  • You must have a way to detect junk images: One of the easiest changes we made had a huge impact on system reliability.
Final thoughts

A simple idea, using an LLM prompt to detect physical damage in laptop photos, quickly evolved into a deeper experiment in combining different AI methods to tackle unpredictable real-world problems. We discovered that some of the best tools weren't originally designed for this kind of work.

Agentic frameworks, often viewed as workflow utilities, proved surprisingly effective for tasks such as structured damage detection and image filtering. They helped us create a system that was not only more accurate but also easier to understand and maintain in practice.

Shruti Tiwari works as an AI product manager at Dell Technologies. Vadiraj is a data scientist at Dell Technologies.
