Quick Summary
- A local ~3B-parameter LLM completed a full Amazon shopping flow using only structural page data and deterministic assertions, achieving a 7/7 success rate.
- The local model stack, while slower, ran at zero incremental cost and required no vision capabilities, in contrast to expensive cloud API calls.
- The key innovation was splitting reasoning from acting and gating every step with a verification loop of per-step assertions.
- The study concludes that constraining the state space and making success explicit through verification does more for browser-agent reliability than simply scaling model size.
The Reliability Paradox
The pursuit of more powerful AI often leads to larger, more expensive cloud models. However, a recent experiment challenges this conventional wisdom by demonstrating that smaller, local models can achieve superior reliability in complex web automation tasks.
Researchers tested a common automation scenario: completing a full shopping flow on Amazon. The goal was to navigate from search to checkout, a sequence involving multiple steps and dynamic page elements. The results ran counter to the industry's default of reaching for the largest available model.
The study compared a high-capacity cloud model against a compact local model, measuring success rates, token usage, and cost. The findings suggest that architectural innovation may outweigh raw computational power when building dependable AI agents.
The Amazon Challenge
The experiment focused on a standardized task: search → first product → add to cart → checkout. This flow tests an AI's ability to interpret dynamic web pages, make decisions, and execute precise actions without visual input.
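Expressed as data, that flow might look like the sketch below; the `Step` shape and the expected-outcome strings are illustrative assumptions, not taken from the study's code.

```ts
// Hypothetical declarative plan for the benchmark flow. Each step pairs an
// intent with the observable outcome that must hold before moving on.
type Step = { intent: string; expect: string };

const amazonFlow: Step[] = [
  { intent: "search for the product",   expect: "results page has loaded" },
  { intent: "open the first result",    expect: "product title is visible" },
  { intent: "add the item to the cart", expect: "cart confirmation appears" },
  { intent: "proceed to checkout",      expect: "checkout page is reached" },
];
```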
Two primary systems were compared. The cloud baseline used a large, vision-capable model (GLM‑4.6). The local autonomy stack relied on a combination of a reasoning planner (DeepSeek R1) and a smaller executor model (Qwen ~3B), both running on local hardware.
The performance metrics revealed stark differences:
- Cloud Model: Achieved 1 success in 1 run, using 19,956 tokens at an unspecified API cost.
- Local Model: Achieved 7 successes in 7 runs, using 11,114 tokens with zero incremental cost.
While the local stack was significantly slower (405,740 ms vs. 60,000 ms, i.e. about 406 s vs. 60 s), its perfect success rate and cost efficiency highlighted a critical trade-off between speed and reliability.
"Reliability in agents comes from verification (assertions on structured snapshots), not just scaling model size."— Study Findings
Architectural Innovation
The local model's success was not accidental; it resulted from a redesigned control plane. The system employed three key strategies to constrain the problem and ensure deterministic outcomes.
First, it pruned the DOM to reduce complexity. Instead of feeding the entire page or screenshots, the system generated a compact "semantic snapshot" containing only roles, text, and geometry, pruning approximately 95% of nodes.
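A minimal sketch of what that pruning could look like follows; the node shape and the keep-rules here are assumptions, since the study reports only that roles, text, and geometry survive while roughly 95% of nodes are dropped.

```ts
// Assumed input shape: a DOM-like tree annotated with role, text, and layout.
interface RawNode {
  role: string;                                   // e.g. "button", "link", "generic"
  text: string;                                   // visible text content
  rect: { x: number; y: number; w: number; h: number };
  children: RawNode[];
}

type SnapshotNode = Omit<RawNode, "children">;

const INTERACTIVE = new Set(["button", "link", "textbox", "combobox", "checkbox"]);

// Keep a node only if it is visible and either interactive or text-bearing;
// wrappers, layout divs, and zero-size nodes are pruned away.
function semanticSnapshot(node: RawNode, out: SnapshotNode[] = []): SnapshotNode[] {
  const visible = node.rect.w > 0 && node.rect.h > 0;
  if (visible && (INTERACTIVE.has(node.role) || node.text.trim() !== "")) {
    out.push({ role: node.role, text: node.text.trim(), rect: node.rect });
  }
  for (const child of node.children) semanticSnapshot(child, out);
  return out;
}
```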
Second, it split reasoning from acting. A planner model determined the intent and expected outcomes, while a separate executor model selected concrete DOM actions like CLICK or TYPE. This separation of concerns improved precision.
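One way to express that split is sketched below; the interfaces, prompts, and the injected `callPlanner`/`callExecutor` hooks are hypothetical, since the study does not publish its schemas.

```ts
// The planner (the reasoning model) decides WHAT should happen and what
// outcome would prove it; the executor (the small model) decides HOW.
interface PlanStep {
  intent: string;   // high-level goal for this step
  expect: string;   // verifiable outcome the step must produce
}

type Action =
  | { kind: "CLICK"; targetText: string }
  | { kind: "TYPE"; targetText: string; value: string };

type LlmCall = (prompt: string) => Promise<string>;

async function plan(task: string, callPlanner: LlmCall): Promise<PlanStep> {
  const raw = await callPlanner(
    `Task: ${task}\nReturn JSON with fields "intent" and "expect".`);
  return JSON.parse(raw) as PlanStep;
}

async function act(step: PlanStep, snapshotJson: string,
                   callExecutor: LlmCall): Promise<Action> {
  const raw = await callExecutor(
    `Intent: ${step.intent}\nPage elements: ${snapshotJson}\n` +
    `Return a JSON action: {"kind":"CLICK"|"TYPE", ...}.`);
  return JSON.parse(raw) as Action;
}
```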
Third, every step was gated by Jest-style verification. After each action, the system asserted state changes—such as URL updates or element visibility. If an assertion failed, the step would fail and trigger bounded retries, ensuring the agent never proceeded on a false assumption.
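A minimal sketch of such a gate with bounded retries; `runStep`, `verify`, and the log format are assumed hooks in the spirit of the study's description, not its actual code.

```ts
// Run an action, assert its expected effect, and retry a bounded number of
// times. The agent never advances until the assertion holds.
async function gatedStep(
  name: string,
  runStep: () => Promise<void>,
  verify: () => Promise<boolean>,   // e.g. URL changed, element became visible
  maxRetries = 2,
): Promise<void> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    await runStep();
    if (await verify()) {
      console.log(`PASS | ${name}`);
      return;                       // proven progress: continue the flow
    }
    console.warn(`RETRY | ${name} (attempt ${attempt + 1})`);
  }
  // Fail loudly rather than proceed on a false assumption.
  throw new Error(`FAIL | ${name}: assertion did not hold after retries`);
}
```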
From Smart to Working
The logs revealed how this verification layer transformed the agent's behavior. In one instance, the system used a deterministic override to enforce the "first result" intent, ensuring the correct product link was clicked.
Another example involved handling a dynamic drawer. The system verified the drawer's appearance and forced the correct branch, logging a clear "PASS | add_to_cart_verified_after_drawer" result.
These were not post-hoc analytics; they were inline gates. The system either proved it made progress or stopped to recover. This approach moves beyond probabilistic guessing to provable execution.
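Reusing the `gatedStep` helper sketched above, the drawer check might look like the following; the Playwright usage and the selectors are illustrative assumptions, not the study's implementation.

```ts
import { chromium } from "playwright";

async function addToCartWithGate(productUrl: string): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(productUrl);

  await gatedStep(
    "add_to_cart_verified_after_drawer",
    async () => { await page.click("#add-to-cart-button"); },
    // The gate passes only if the confirmation drawer actually appeared.
    async () => page.locator("#cart-confirmation-drawer").isVisible(),
  );

  await browser.close();
}
```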
The takeaway is clear: the highest-leverage move for reliable browser agents isn't a bigger model. It's constraining the state space and making success explicit with per-step assertions.
The Verification Imperative
This case study demonstrates that verification is the cornerstone of reliable AI automation. With a rigorous assertion layer, a modest local model achieved a perfect 7/7 success rate while using about 44% fewer tokens than the far more expensive cloud model.
The implications extend beyond e-commerce. Any domain requiring precise, repeatable actions—such as data entry, form processing, or system administration—can benefit from this architectural shift. The focus moves from model size to system design.
As AI agents become more integrated into daily workflows, the demand for dependability over raw power will only grow. This experiment provides a blueprint for building agents that work, not just those that look smart.
Frequently Asked Questions
**What did the study find?**
A smaller, local language model (~3B parameters) achieved a perfect 7/7 success rate on a complex Amazon shopping flow, while the larger cloud model was run only once, succeeding at nearly twice the token count and an unspecified API cost. The local stack incurred zero incremental cost, demonstrating that architectural design can trump raw computational power.

**How did the local model achieve such high reliability?**
The system used a three-part architecture: it pruned the DOM to a compact semantic snapshot, split reasoning from acting between two specialized models, and gated every step with a verification loop of per-step assertions. The agent could proceed only after proving each action succeeded, eliminating guesswork.

**What does this mean for developers building agents?**
The results suggest that, for reliable automation, developers should focus on constraining the problem space and implementing rigorous verification checks rather than simply reaching for larger models. This approach reduces costs, improves success rates, and makes agent behavior more predictable and trustworthy.

**Were there any trade-offs with the local approach?**
Yes, the local model stack was significantly slower, taking about 406 seconds compared to the cloud model's 60 seconds. However, its perfect success rate and zero incremental cost make it the more practical choice where reliability matters more than speed.