Tools like Manus / GPT Agent Mode / BrowserUse / Claude’s Chrome control typically make an LLM call per action/decision. That piles up latency, cost, and fragility as the DOM shifts, sessions expire, and sites rate-limit. Eventually you hit prompt-injection landmines or lose context and the run stalls.
I'm approaching browser agents differently: record once, replay fast. We capture HTML snapshots + click targets + short voice notes to build a deterministic plan, then only use an LLM for rare ambiguities or recovery (rough sketch after the examples below). That makes multi-hour jobs feasible. Concretely, users run things like:
Recruiter sourcing for hours at a stretch
SEO crawls: gather metadata → update internal dashboard → email a report
Bulk LinkedIn connection flows with lightweight personalization
Even long web-testing runs
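To make "deterministic plan + LLM only on failure" concrete, here's a minimal sketch assuming Playwright for replay; the Step schema and the recover_with_llm() fallback are illustrative placeholders, not 100x.bot's actual plan format:

```python
# Minimal sketch of record-once / replay-fast, assuming Playwright.
# Step schema and recover_with_llm() are illustrative, not a real product API.
from dataclasses import dataclass
from playwright.sync_api import sync_playwright, TimeoutError as PWTimeout

@dataclass
class Step:
    action: str          # "goto" | "click" | "fill"
    selector: str = ""   # captured at record time
    value: str = ""      # URL or input text

def recover_with_llm(page, step):
    """Only reached when deterministic replay fails (the rare path)."""
    raise NotImplementedError("LLM fallback: re-locate the target from a fresh snapshot")

def replay(steps):
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        for step in steps:
            try:
                if step.action == "goto":
                    page.goto(step.value)
                elif step.action == "click":
                    page.click(step.selector, timeout=5_000)
                elif step.action == "fill":
                    page.fill(step.selector, step.value, timeout=5_000)
            except PWTimeout:
                recover_with_llm(page, step)  # LLM touches only the broken step

replay([
    Step("goto", value="https://example.com/login"),
    Step("fill", selector="#email", value="user@example.com"),
    Step("click", selector="button[type=submit]"),
])
```

The point is that the LLM sits behind the exception handler, so cost and latency scale with the number of failures rather than the number of actions.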
A stress test I like (can share code/method): “Find 100+ GitHub profiles in Bangalore strong in Python + Java, extract links + metadata, and de-dupe.” Most per-step-LLM agents drift or stall after a few minutes due to DOM churn, pagination loops, or rate limits. A record→replay plan with checkpoints + idempotent steps tends to survive.
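The checkpoint + idempotent-step part is the load-bearing bit for long runs. A rough sketch, assuming results append to a JSONL file and the canonical profile URL is the dedupe key (file name and field names are made up):

```python
# Sketch of checkpointed, idempotent collection with content-hash dedupe.
# File layout and field names are assumptions for illustration.
import hashlib, json, os

CHECKPOINT = "profiles.jsonl"

def load_seen():
    """Rebuild the dedupe set from the checkpoint so a restart never double-counts."""
    seen = set()
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            for line in f:
                seen.add(json.loads(line)["hash"])
    return seen

def save_profile(profile, seen):
    """Idempotent append: hash the canonical URL, skip if already recorded."""
    h = hashlib.sha256(profile["url"].encode()).hexdigest()
    if h in seen:
        return False
    seen.add(h)
    with open(CHECKPOINT, "a") as f:
        f.write(json.dumps({**profile, "hash": h}) + "\n")
    return True
```

On a crash, load_seen() rebuilds the dedupe set from whatever was already written, so re-scraping a page you've already covered is a no-op and the run picks up where it left off.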
I’d benchmark on:
Throughput over time (actions/min sustained for 30–60+ mins)
End-to-end success rate on multi-page flows with infinite scroll/pagination
Resume semantics (crash → restart without duplicates)
Selector robustness (resilient to minor DOM changes; see the sketch after this list)
Cost per 1,000 actions
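For the selector-robustness item, one approach is to have the recorder capture several locator candidates per target and fall back down the chain at replay time. A sketch assuming Playwright; the candidate list and helper name are illustrative:

```python
# Sketch of a fallback locator chain, assuming the recorder stores multiple
# candidate selectors per click target (CSS, text selector, etc.).
from playwright.sync_api import Page, TimeoutError as PWTimeout

def click_robust(page: Page, candidates: list[str], timeout_ms: int = 3_000) -> str:
    """Try each recorded selector in order; return the one that worked."""
    for sel in candidates:
        try:
            page.click(sel, timeout=timeout_ms)
            return sel
        except PWTimeout:
            continue
    raise RuntimeError(f"All selectors failed: {candidates}")

# e.g. click_robust(page, ["#submit-btn", "text=Submit"])
```

A useful metric on top of this is how often the first recorded selector fails but a later candidate succeeds, i.e. silent self-healing vs. hard failures.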
Disclosure: I am the founder of 100x.bot (record-to-agent, long-run reliability focus). I’m putting together a public benchmark with the scenario above + a few gnarlier ones (auth walls, rate-limit backoff, content hashing for dedupe). If there’s interest, I can post the methodology and harness here so results are apples-to-apples.