Key Facts
- ✓ A local ~3B parameter LLM successfully completed a full Amazon shopping flow with a 7/7 success rate using only structural page data.
- ✓ The local model stack operated with zero incremental cost and required no vision capabilities, contrasting with expensive cloud API calls.
- ✓ The system reduced input complexity by pruning approximately 95% of DOM nodes, creating a compact semantic snapshot for the model.
- ✓ The local model used 11,114 tokens compared to the cloud model's 19,956 tokens, demonstrating greater efficiency in token usage.
- ✓ The verification layer implemented Jest-style assertions after every action, ensuring the agent could only proceed after proving state changes.
- ✓ The experiment concluded that constraining the state space and making success explicit through verification is more effective than scaling model size.
The Reliability Paradox
The pursuit of more powerful AI often leads to larger, more expensive cloud models. However, a recent experiment challenges this conventional wisdom by demonstrating that smaller, local models can achieve superior reliability in complex web automation tasks.
Researchers tested a common automation scenario: completing a full shopping flow on Amazon. The goal was to navigate from search to checkout, a sequence involving multiple steps and dynamic page elements. The results offered a surprising counterpoint to the industry's prevailing approach.
The study compared a high-capacity cloud model against a compact local model, measuring success rates, token usage, and cost. The findings suggest that architectural innovation may outweigh raw computational power when building dependable AI agents.
The Amazon Challenge
The experiment focused on a standardized task: search → first product → add to cart → checkout. This flow tests an AI's ability to interpret dynamic web pages, make decisions, and execute precise actions without visual input.
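As a rough illustration (the schema and field names are mine, not the study's), the flow can be written as an ordered list of intents, each paired with the outcome the agent must later prove:

```typescript
// Hypothetical encoding of the benchmark task as intent/expected-outcome pairs.
// Field names and wording are illustrative; the study does not publish its schema.
interface Step {
  intent: string;          // what the planner wants to happen
  expectedOutcome: string; // what the verifier must observe afterwards
}

const amazonFlow: Step[] = [
  { intent: "search for the product",       expectedOutcome: "results page URL contains 's?k='" },
  { intent: "open the first search result", expectedOutcome: "product detail page is visible" },
  { intent: "add the product to the cart",  expectedOutcome: "cart count increments or drawer appears" },
  { intent: "proceed to checkout",          expectedOutcome: "checkout page URL is reached" },
];
```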
Two primary systems were compared. The cloud baseline used a large, vision-capable model (GLM‑4.6). The local autonomy stack relied on a combination of a reasoning planner (DeepSeek R1) and a smaller executor model (Qwen ~3B), both running on local hardware.
The performance metrics revealed stark differences:
- Cloud Model: Achieved 1 success in 1 run, using 19,956 tokens at an unspecified API cost.
- Local Model: Achieved 7 successes in 7 runs, using 11,114 tokens with zero incremental cost.
While the local stack was significantly slower (405,740 ms vs. 60,000 ms), its perfect success rate and cost efficiency highlighted a critical trade-off between speed and reliability.
"Reliability in agents comes from verification (assertions on structured snapshots), not just scaling model size."
— Study Findings
Architectural Innovation
The local model's success was not accidental; it resulted from a redesigned control plane. The system employed three key strategies to constrain the problem and ensure deterministic outcomes.
First, it pruned the DOM to reduce complexity. Instead of feeding the model the full page HTML or screenshots, the system generated a compact "semantic snapshot" containing only roles, text, and geometry, pruning approximately 95% of nodes.
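A minimal sketch of what that pruning might look like, assuming a browser-side walk over the live DOM; the field names, role mapping, and keep/drop heuristics below are illustrative rather than the study's published code:

```typescript
// Sketch of a semantic snapshot builder: keep only nodes that are interactive
// or carry leaf-level text, recording role, text, and geometry. All field names,
// the role mapping, and the pruning thresholds here are assumptions.
interface SnapshotNode {
  role: string;                                  // e.g. "button", "link", "textbox"
  text: string;                                  // trimmed visible text, truncated
  rect: { x: number; y: number; w: number; h: number };
}

const KEEP_ROLES = new Set(["button", "link", "textbox", "checkbox", "combobox"]);

function buildSemanticSnapshot(root: Document = document): SnapshotNode[] {
  const nodes: SnapshotNode[] = [];

  for (const el of Array.from(root.querySelectorAll<HTMLElement>("*"))) {
    const role = el.getAttribute("role") ?? implicitRole(el);
    const text = (el.innerText ?? "").trim().slice(0, 80);
    const r = el.getBoundingClientRect();

    // Prune: drop invisible nodes, bare containers, and anything without a role or label.
    if (r.width === 0 || r.height === 0) continue;
    const isLeafText = text.length > 0 && el.children.length === 0;
    if (!KEEP_ROLES.has(role) && !isLeafText) continue;

    nodes.push({ role, text, rect: { x: r.x, y: r.y, w: r.width, h: r.height } });
  }
  return nodes;
}

// Very rough implicit-role mapping; a real implementation would follow the ARIA spec.
function implicitRole(el: HTMLElement): string {
  switch (el.tagName) {
    case "A": return "link";
    case "BUTTON": return "button";
    case "INPUT":
    case "TEXTAREA": return "textbox";
    default: return "generic";
  }
}
```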
Second, it split reasoning from acting. A planner model determined the intent and expected outcomes, while a separate executor model selected concrete DOM actions like CLICK or TYPE. This separation of concerns improved precision.
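In spirit, the split looks something like the sketch below: one call to a reasoning model yields an intent plus the outcome that would prove it, and a second call to a small executor model converts that intent into a single DOM action. The model identifiers, prompts, and `chatCompletion` helper are placeholders, not the study's actual interface.

```typescript
// Hypothetical two-model control loop: the planner decides what should happen and
// what would prove it; the executor maps that intent onto one concrete DOM action.
// chatCompletion() is a stand-in for whatever local inference endpoint is used.
declare function chatCompletion(
  model: string,
  messages: { role: "system" | "user"; content: string }[],
): Promise<string>;

interface SnapshotNode { role: string; text: string; } // simplified from the snapshot sketch

type Action =
  | { kind: "CLICK"; nodeIndex: number }
  | { kind: "TYPE"; nodeIndex: number; text: string };

async function planStep(goal: string, history: string[]): Promise<{ intent: string; expected: string }> {
  const reply = await chatCompletion("deepseek-r1", [
    { role: "system", content: "You are a planner. Reply with JSON: {\"intent\": ..., \"expected\": ...}." },
    { role: "user", content: `Goal: ${goal}\nHistory so far:\n${history.join("\n")}` },
  ]);
  return JSON.parse(reply);
}

async function executeStep(intent: string, snapshot: SnapshotNode[]): Promise<Action> {
  const listing = snapshot.map((n, i) => `${i}: ${n.role} "${n.text}"`).join("\n");
  const reply = await chatCompletion("qwen-3b-instruct", [
    { role: "system", content: "Pick exactly one action as JSON: {\"kind\": \"CLICK\"|\"TYPE\", \"nodeIndex\": n, \"text\"?: ...}." },
    { role: "user", content: `Intent: ${intent}\nVisible nodes:\n${listing}` },
  ]);
  return JSON.parse(reply);
}
```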
Third, every step was gated by Jest-style verification. After each action, the system asserted state changes—such as URL updates or element visibility. If an assertion failed, the step would fail and trigger bounded retries, ensuring the agent never proceeded on a false assumption.
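A minimal sketch of such a gate, assuming Playwright-style assertions and an invented retry budget (the study's actual assertion library and retry policy aren't specified):

```typescript
import { expect, Page } from "@playwright/test";

// Sketch of a per-step verification gate: run the action, then assert the expected
// state change before the agent is allowed to continue. MAX_RETRIES is an assumption.
const MAX_RETRIES = 2;

async function gatedStep(
  name: string,
  act: () => Promise<void>,
  verify: () => Promise<void>,   // throws, Jest-style, if the expected state change didn't happen
): Promise<void> {
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    await act();
    try {
      await verify();
      console.log(`PASS | ${name}`);   // progress proven, the agent may continue
      return;
    } catch (err) {
      console.log(`FAIL | ${name} | attempt ${attempt + 1}`);
      if (attempt === MAX_RETRIES) throw err;   // stop and recover upstream instead of guessing
    }
  }
}

// Example gate: the search step only passes once the results URL is observed.
async function searchGate(page: Page, query: string): Promise<void> {
  await gatedStep(
    "search_results_loaded",
    async () => {
      await page.getByRole("searchbox").fill(query);
      await page.keyboard.press("Enter");
    },
    async () => { await expect(page).toHaveURL(/s\?k=/); },
  );
}
```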
From Smart to Working
The logs revealed how this verification layer transformed the agent's behavior. In one instance, the system used a deterministic override to enforce the "first result" intent, ensuring the correct product link was clicked.
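As a sketch of what such an override could look like (the intent matching and product-link heuristic are mine, not the study's): when the planner asks for the first result, the executor's probabilistic pick is simply replaced by the first matching link in the snapshot.

```typescript
// Hypothetical deterministic override: when the intent is "open the first result",
// ignore the executor's pick and select the first product link in document order.
// The isProductLink() heuristic (Amazon product URLs containing /dp/) is an assumption.
interface SnapshotNode { role: string; text: string; href?: string; } // again simplified

function overrideFirstResult(
  intent: string,
  modelChoice: number,
  snapshot: SnapshotNode[],
): number {
  if (!/first (search )?result/i.test(intent)) return modelChoice;

  const isProductLink = (n: SnapshotNode) =>
    n.role === "link" && !!n.href && n.href.includes("/dp/");

  const idx = snapshot.findIndex(isProductLink);
  return idx >= 0 ? idx : modelChoice;   // fall back to the model's choice if nothing matches
}
```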
Another example involved handling a dynamic drawer. The system verified the drawer's appearance and forced the correct branch, logging a clear "PASS | add_to_cart_verified_after_drawer" result.
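A branch-forcing gate of that kind might look roughly like the sketch below; the PASS log string mirrors the report's wording, while the selectors, timeout, and fallback cart URL are guesses about Amazon's markup.

```typescript
import { expect, Page } from "@playwright/test";

// Sketch of branch forcing after add-to-cart: detect which UI variant appeared,
// assert it, log the gate result, then take the matching path.
async function verifyAddToCart(page: Page): Promise<void> {
  const drawer = page.getByText(/added to cart/i).first();
  const drawerAppeared = await drawer
    .waitFor({ state: "visible", timeout: 3000 })
    .then(() => true)
    .catch(() => false);

  if (drawerAppeared) {
    // Drawer branch: prove the drawer is visible, then continue from it.
    await expect(drawer).toBeVisible();
    console.log("PASS | add_to_cart_verified_after_drawer");
    await page.getByRole("button", { name: /proceed to checkout/i }).click();
  } else {
    // Fallback branch: no drawer, so check out from the cart page instead.
    await page.goto("https://www.amazon.com/gp/cart/view.html");
    await page.getByRole("button", { name: /proceed to checkout/i }).click();
  }
}
```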
These were not post-hoc analytics; they were inline gates. The system either proved it made progress or stopped to recover. This approach moves beyond probabilistic guessing to provable execution.
The takeaway is clear: the highest-leverage move for reliable browser agents isn't a bigger model. It's constraining the state space and making success explicit with per-step assertions.
The Verification Imperative
This case study demonstrates that verification is the cornerstone of reliable AI automation. By implementing a rigorous assertion layer, a modest local model achieved a perfect 7/7 success rate while using roughly 44% fewer tokens than a far more powerful cloud model, at zero incremental cost.
The implications extend beyond e-commerce. Any domain requiring precise, repeatable actions—such as data entry, form processing, or system administration—can benefit from this architectural shift. The focus moves from model size to system design.
As AI agents become more integrated into daily workflows, the demand for dependability over raw power will only grow. This experiment provides a blueprint for building agents that work, not just those that look smart.










