Apps Were Never Smart
2026-03-06 · 9 min read · ai · agents · mobile


In May 2024, the best AI agent completed 30% of tasks on an Android phone. Eighteen months later, 91.4%. The benchmark is dead. Your apps are next.


I built HECCO. An AI health platform with voice agents, wearable integration, the works. React Native frontend, NestJS backend, Python AI service. Seven agents running in production. Play Store live. Real users.

I am not a neutral observer. I have skin in the game. I built the kind of app that is about to become unnecessary. And I think that's a good thing. Here's why.


The scoreboard nobody is watching

In May 2024, Google DeepMind published AndroidWorld. 116 real tasks across 20 real Android apps. Not toy tasks. Create a calendar event. Navigate to an address. Manage contacts across applications. Average task length: 14.3 steps. And critically, the benchmark doesn't ask an LLM judge if the task "looks done." It checks the actual SQLite database. Did the row get created? Is the contact there?
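State-based verification is easy to picture in code. Here is a minimal sketch of the idea, not AndroidWorld's actual harness; the table and column names are invented for illustration:

```python
import sqlite3

def task_succeeded(db_path: str, event_title: str) -> bool:
    """Check the app's actual database for the expected row,
    instead of asking an LLM judge whether the screen 'looks done'.
    Hypothetical schema: an `events` table with a `title` column."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT 1 FROM events WHERE title = ?", (event_title,)
        ).fetchone()
        return row is not None
    finally:
        conn.close()
```

Either the row exists or it doesn't. No partial credit, no vibes.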

This matters because earlier benchmarks were participation trophies. B-MoCA covered six tasks. Six. My morning skincare routine has more steps. Mobile-Env gave you 13 task templates for a single app. Calling these "benchmarks" was generous.

AndroidWorld was the first benchmark that felt like using an actual phone. Google's best agent scored 30.6%.

The human baseline was 80%.

So in May 2024, the best AI agent on the planet could complete roughly one in three tasks that a bored teenager could do without looking up from their other phone.

Here is what happened in the next eighteen months.

[chart] The Eighteen-Month Climb

AndroidWorld task completion rate by model, May 2024 to Oct 2025. Human baseline: 80%. Every bar represents progress from "can barely use a phone" to "better than you."

- M3A: 30.6%
- GPT-4o: 34.5%
- UI-TARS: 46.6%
- Claude 3.7 Sonnet: 65.0%
- D-Artemis: 75.0%
- Surfer 2: 87.1%
- DroidRun: 91.4%

Read that again. Thirty percent to ninety-one percent. The final entry, DroidRun, completed 106 of 116 tasks. It didn't just pass the human baseline. It buried it.

I watched this chart go up while maintaining a production app, and I had a very specific thought: I am building the last generation of this thing.


How they got good

Three breakthroughs. Not one genius model. Three compounding advantages that turned "can barely use a phone" into "better than you at using a phone."

They stopped treating phones like random images

General vision-language models train on the entire internet. Phone screenshots are a rounding error in that data. The first models looked at an Android home screen the way I look at abstract art. Interesting. Confusing. No idea what to tap.

ByteDance, Alibaba, and others built massive mobile-specific training sets. Millions of screenshots with parsed metadata. Bounding boxes around every UI element. Spatial relationships mapped. ByteDance's UI-TARS went fully end-to-end. Screenshots in, actions out. No hand-crafted prompts. The 72B variant hit state-of-the-art on more than ten GUI benchmarks simultaneously. That's not a phone agent. That's a phone native.

They learned to fail

This is the one I find most interesting as a builder.

Previous models trained only on correct actions. UI-TARS also trained on mistakes and the recovery steps that followed, on screw-ups and how to come back from them, plus simulated scenarios where things go wrong and the agent has to figure out what happened.

The model doesn't just know what to do. It knows what to do when it messes up. Which, if we're being honest, is the skill that separates decent software users from the ones who call IT because their monitor is off.

I have spent seven years building mobile apps. I can tell you with absolute certainty: the apps that survive are the ones that handle failure gracefully. Turns out, the same is true for the agents replacing them.

They built a management layer

Early agents ran a flat loop. Look at screen. Pick action. Execute. Repeat. No self-monitoring. No awareness of whether the task was going well or catastrophically sideways. I have worked with junior developers who operated the same way.

MobileUse introduced hierarchical reflection. The agent audits itself at three levels: individual actions, overall trajectory, and global task state. It's constantly asking: "Was that tap right?" "Am I still headed toward the goal?" "Does the world state make sense?"

Alibaba took it further with Mobile-Agent-v3. Four dedicated roles: Manager, Worker, Reflector, Notetaker. A bureaucracy, yes. But a bureaucracy that scores 75% on AndroidWorld. I have seen engineering teams with four roles that score lower.

[interactive] Agent Architecture: Then vs. Now

2023: flat loop
- look at screen
- pick action
- repeat

2025: hierarchical agent
- action reflection
- trajectory audit
- global state check
- execute
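The 2025 loop can be sketched in a few lines. This is my own toy rendering of the three-level audit, not MobileUse's or Alibaba's actual code; every callback name here is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str
    history: list = field(default_factory=list)

def run_agent(state, observe, pick_action, execute,
              reflect_action, audit_trajectory, check_global_state,
              max_steps=25):
    """Hierarchical loop: after each action the agent audits itself
    at three levels before deciding whether it's done."""
    for _ in range(max_steps):
        screen = observe()
        action = pick_action(state, screen)
        result = execute(action)
        state.history.append((action, result))
        if not reflect_action(action, result):   # "Was that tap right?"
            continue                             # redo this step
        if not audit_trajectory(state):          # "Still headed toward the goal?"
            continue                             # re-plan before proceeding
        if check_global_state(state):            # "Does the world state say done?"
            return True
    return False
```

The flat 2023 loop is what you get if you delete the three `if` checks. That's the whole difference, and it's worth forty-plus points on the benchmark.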

So now the agent can use your phone better than you can

Which raises a question nobody in the App Store wants to hear.

Why does the agent need your app at all?

Strip away the colors, the animations, the onboarding flows. What is a mobile app? A CRUD database with business logic and a coat of paint. Your banking app is a form that writes to a ledger. Your food delivery app is a search query, a cart, and a payment token fired at an API. Your calendar is rows in a SQLite table. The intelligence has always lived on the server. The app was always the middleman.

I know this because I built one. HECCO has 47 backend modules. The mobile app exists because a human needs buttons to tap and screens to read. The backend does the actual work. If something could talk directly to those 47 modules, the app becomes a display case for a business that moved online.

This was a reasonable arrangement when humans were the only ones capable of navigating a visual interface. We needed the buttons because we needed something to tap. We needed the onboarding flows and tooltips because translating "what I want" into "what the computer needs" required a friendly layer in between.

Think about what an onboarding funnel actually is. It exists because the software cannot infer your intent. SaaS onboarding dropout rates run 30 to 50%. Half your users bail before they experience any value. That's dumb software trying to understand smart humans through a twenty-screen questionnaire.

An agent doesn't need the funnel. It already knows your context. There is no cold start problem when the AI has full insight into why you showed up.

The logical endpoint isn't "AI that uses apps better than you." It's "AI that calls the APIs directly and the app never loads." The translator becomes unemployed when both sides speak the same language.
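What "calling the APIs directly" looks like is almost embarrassingly small. A hedged sketch; these endpoints and the intent shape are invented for illustration, not any real backend:

```python
def intent_to_request(intent: dict) -> dict:
    """Translate a parsed user intent straight into a backend API call,
    skipping the UI layer entirely. Routes are hypothetical."""
    routes = {
        "pay_parking": ("POST", "/api/parking/sessions"),
        "book_table":  ("POST", "/api/reservations"),
        "log_meal":    ("POST", "/api/nutrition/entries"),
    }
    method, path = routes[intent["action"]]
    return {"method": method, "path": path, "body": intent.get("params", {})}
```

No screens, no onboarding, no funnel. The agent parses "pay for an hour's parking" into `{"action": "pay_parking", "params": {"minutes": 60}}` and the request fires. Everything the app's UI used to mediate collapses into a lookup table.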

The market already knows. Google Play lost 47% of its apps between early 2024 and April 2025. From 3.4 million down to 1.8 million. The average American has 80+ apps installed. 62% don't get opened in any given month. Users spend 80% of their app time in just three apps. The rest is digital furniture you scroll past.


But the phone is not the app

This is where it gets interesting. And personal.

If apps are just buttons, you could run agents on a laptop and call it a day. But laptops don't have what phones have.

The camera is the agent's eyes. Not a webcam. A camera in your pocket, pointed at whatever you're looking at. Scan a receipt and file the expense. Read a medicine label and check interactions. I built health features in HECCO that use the camera for food logging. An agent with camera access doesn't need my UI. It needs my backend.

GPS is spatial awareness. The agent knows you're at the airport, the office, a restaurant. Context that changes what "remind me later" means. Context no laptop has unless you type it.

Biometrics are the trust layer. Face ID and fingerprint aren't just security. They're identity primitives. An agent that authenticates as you, on your device, without sending credentials over a network. That's infrastructure for autonomous transactions.

NFC bridges the physical world. Tap to pay. Tap to open a door. An agent with NFC access can transact in meatspace without you pulling the phone out.

The microphone is always-on intent capture. "Book me a table at that Italian place for Friday." The agent parsed the intent. The reservation is made. The app never loaded.

A laptop is a tool you sit down at. A phone is a body extension with cameras, ears, a compass, and a fingerprint. When AI lives on that hardware, it's not a chatbot. It's an autonomous agent with senses.


300,000 people didn't care about the risks

While the benchmarks were climbing, Peter Steinberger shipped OpenClaw. A local-first autonomous agent that runs on your hardware. You message it from WhatsApp or Telegram. It does the thing.

68,000 GitHub stars. 300,000+ users. 13,700 skills in the community registry. People running it on $50 Android phones via Termux.

I run it on a dedicated phone. Separate Google account. Nothing connected to anything I'd miss losing.

Here's why I take precautions. In January 2026, Kaspersky audited OpenClaw. 512 vulnerabilities. Eight classified as critical. The attack surface: email, calendar, messaging, file system, and shell access. When they sent Steinberger the report, his response was: "This is a tech preview. A hobby. If you wanna help, send a PR."

300,000 people looked at a tool with 512 known vulnerabilities and said "yes, I want this." Some of them, I guarantee, did not set up a separate account.


The phone wakes up

Command line. Graphical interface. Touch screen. Each transition made computing accessible to more people by hiding the previous layer's complexity. Each time, the old interface didn't vanish. It moved down the stack, became the plumbing that the new thing sat on.

We are in the fourth transition. You don't open an app to pay for parking. You say "pay for an hour's parking here" and the agent uses your location, queries the API, completes payment via biometrics, and moves on. The screen never lit up. The intent was enough.

The benchmarks proved the capability. 30% to 91% in eighteen months.

OpenClaw proved the demand. 300,000 users despite 512 known vulnerabilities.

The phone has what no other device has. Camera, GPS, accelerometer, biometrics, NFC, microphone. Always on, always with you.

Samsung, Google, and Apple are building agents into the OS. 912 million AI-capable phones shipping annually by 2028. The standalone app is living on borrowed time. Not because apps disappear overnight. They won't. But because the app was always a translator between human intent and machine action. When the machine understands the intent directly, the translator becomes legacy.

I built a production mobile app with everything in it. I'd do it differently now. Not because the code was wrong. Because the container was. The intelligence should have lived closer to the user's intent and further from the screen. That's the lesson I'm taking into whatever I build next.

30.6% in May 2024. 91.4% in October 2025. The benchmark is dead. The successor is already harder. The agents will solve that one too.

Eighteen months. That's all it took. That's all it ever takes.


Ashutosh Makwana

10+ years engineering. AI-native since 2022. Building things that think.