give it a goal in plain english. it reads the screen, thinks about what to do, taps and types via adb, and repeats until the job is done.
    $ bun run src/kernel.ts
    enter your goal: open youtube and search for "lofi hip hop"

    --- step 1/30 ---
    think: i'm on the home screen. launching youtube.
    action: launch (842ms)

    --- step 2/30 ---
    think: youtube is open. tapping search icon.
    action: tap (623ms)

    --- step 3/30 ---
    think: search field focused.
    action: type "lofi hip hop" (501ms)

    --- step 4/30 ---
    action: enter (389ms)

    --- step 5/30 ---
    think: search results showing. done.
    action: done (412ms)
every goal runs as a loop: dump the accessibility tree, filter to interactive elements, send them to an llm, execute the chosen action, repeat.
captures the screen via uiautomator dump and parses the accessibility xml into tappable elements with coordinates and state.
sends screen state + goal to an llm. the model returns think, plan, action - it explains its reasoning before acting.
executes the chosen action via adb - tap, type, swipe, launch, press back. 22 actions available.
if the screen doesn't change for 3 steps, stuck recovery kicks in. if the accessibility tree comes back empty, the agent falls back to screenshots.
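stripped of retries and logging, the loop fits in a screen of code. a hedged sketch of its shape - every helper name in `Deps` is an illustrative stand-in, not a real export of kernel.ts, and the real loop also covers the 22 actions, retries, and session logging:

```ts
// a minimal sketch of the perceive -> think -> act loop described above.
// every name in Deps is an illustrative stand-in, not a real export of kernel.ts.
interface UiElement { text: string; x: number; y: number; clickable: boolean }
interface Decision { think?: string; plan?: string; action: string }

interface Deps {
  dumpAccessibilityTree(): Promise<UiElement[]>;    // uiautomator dump -> parsed, filtered elements
  captureScreenshot(): Promise<Uint8Array>;         // vision fallback when the tree is empty
  askLLM(input: { goal: string; elements: UiElement[]; screenshot?: Uint8Array; stuck: boolean }): Promise<Decision>;
  executeAction(decision: Decision): Promise<void>; // tap / type / swipe / launch / back ... via adb
}

export async function runGoal(goal: string, deps: Deps, maxSteps = 30, stuckThreshold = 3): Promise<void> {
  let lastScreen = "";
  let stuckCount = 0;

  for (let step = 1; step <= maxSteps; step++) {
    const elements = await deps.dumpAccessibilityTree();
    const screen = JSON.stringify(elements);

    // stuck recovery: the screen hasn't changed for stuckThreshold steps in a row
    stuckCount = screen === lastScreen ? stuckCount + 1 : 0;
    lastScreen = screen;

    // empty accessibility tree -> fall back to a screenshot
    const screenshot = elements.length === 0 ? await deps.captureScreenshot() : undefined;

    const decision = await deps.askLLM({ goal, elements, screenshot, stuck: stuckCount >= stuckThreshold });
    console.log(`--- step ${step}/${maxSteps} ---`);
    if (decision.think) console.log(`think: ${decision.think}`);
    console.log(`action: ${decision.action}`);

    if (decision.action === "done") return;
    await deps.executeAction(decision);
  }
}
```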
type a goal, chain goals across apps with ai, or run deterministic steps with no llm calls.
run it and describe what you want. the agent figures out the rest.
    $ bun run src/kernel.ts
    enter your goal: send "running late, 10 mins" to Mom on whatsapp
chain goals across multiple apps. natural language steps, the llm navigates.
    {
      "name": "weather to whatsapp",
      "steps": [
        { "app": "com.google...", "goal": "search chennai weather" },
        { "goal": "share to Sanju" }
      ]
    }
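save it as a json file and pass it to the kernel with the --workflow flag, e.g. `bun run src/kernel.ts --workflow weather-to-whatsapp.json` (the filename here is just an example).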
fixed taps and types. no llm, instant execution. for repeatable tasks.
    appId: com.whatsapp
    name: Send WhatsApp Message
    ---
    - launchApp
    - tap: "Contact Name"
    - type: "hello from pocketagent"
    - tap: "Send"
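each step in a flow like this maps to one or two adb commands. a rough sketch of the idea - this is an illustration, not the actual flow.ts, and `findElementCenter` stands in for whatever parses the uiautomator dump and locates an element by its visible text:

```ts
// illustrative sketch of a deterministic step runner - not the real flow.ts.
import { $ } from "bun";

type FlowStep =
  | "launchApp"
  | { tap: string }
  | { type: string };

async function runStep(
  appId: string,
  step: FlowStep,
  // hypothetical helper: finds an element by text in the uiautomator dump
  findElementCenter: (text: string) => Promise<{ x: number; y: number }>,
) {
  if (step === "launchApp") {
    // launch the app by package id
    await $`adb shell monkey -p ${appId} -c android.intent.category.LAUNCHER 1`;
  } else if ("tap" in step) {
    // look up the element by its visible text, then tap its center
    const { x, y } = await findElementCenter(step.tap);
    await $`adb shell input tap ${x} ${y}`;
  } else if ("type" in step) {
    // type into the currently focused field (spaces must be escaped as %s for input text)
    await $`adb shell input text ${step.type.replace(/ /g, "%s")}`;
  }
}
```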
delegate to on-device ai apps, control phones remotely, turn old devices into always-on agents.
open google's ai mode, ask a question, grab the answer, forward it to whatsapp. or ask chatgpt something and share the response to slack. the agent uses apps on your phone as tools - no api keys needed for those services.
install tailscale on phone + laptop. connect adb over the tailnet. your phone is now a remote agent - control it from anywhere. run workflows from a cron job at 8am every morning.
    # from anywhere:
    adb connect <phone-tailscale-ip>:5555
    bun run src/kernel.ts --workflow morning.json
that android in a drawer can now send standups to slack, check flight prices, digest telegram channels, forward weather to whatsapp. it runs apps that don't have apis.
unlike predefined button flows, the agent actually thinks. if a button moves, a popup appears, or the layout changes - it adapts. it reads the screen, understands context, and makes decisions.
across any app installed on the device.
22 actions + 6 multi-step skills. here's the reality.
one command. installs bun and adb if missing, clones the repo, sets up .env.
curl -fsSL https://pa.rpaby.pw/install.sh | sh
or do it manually:
    # install adb
    brew install android-platform-tools

    # install bun (required - npm/node won't work)
    curl -fsSL https://bun.sh/install | bash

    # clone and setup
    git clone https://github.com/unitedbyai/pocketagent.git
    cd pocketagent && bun install
    cp .env.example .env
edit .env - fastest way to start is groq (free tier):
    LLM_PROVIDER=groq
    GROQ_API_KEY=gsk_your_key_here

    # or run fully local with ollama (no api key)
    # ollama pull llama3.2
    # LLM_PROVIDER=ollama
| provider | cost | vision | notes |
|---|---|---|---|
| groq | free | no | fastest to start |
| ollama | free (local) | yes* | no api key, runs on your machine |
| openrouter | per token | yes | 200+ models |
| openai | per token | yes | gpt-4o |
| bedrock | per token | yes | claude on aws |
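groq, openrouter, ollama, and openai all expose openai-compatible chat completions endpoints, so switching providers mostly means switching base urls. a sketch of that idea - not the actual llm-providers.ts, and bedrock is left out because it goes through the aws sdk instead:

```ts
// illustrative provider selection - not the real llm-providers.ts.
// these four providers all expose openai-compatible /chat/completions endpoints.
const BASE_URLS: Record<string, string> = {
  groq: "https://api.groq.com/openai/v1",
  openrouter: "https://openrouter.ai/api/v1",
  ollama: "http://localhost:11434/v1", // local, no api key
  openai: "https://api.openai.com/v1",
};

async function chat(provider: string, apiKey: string, model: string, messages: { role: string; content: string }[]) {
  const res = await fetch(`${BASE_URLS[provider]}/chat/completions`, {
    method: "POST",
    headers: { "content-type": "application/json", authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({ model, messages }),
  });
  const data = await res.json();
  return data.choices[0].message.content as string;
}
```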
download and install the companion app on your android device.
enable usb debugging in developer options, plug in via usb.
    adb devices   # should show your device
    cd pocketagent && bun run src/kernel.ts
| key | default | what |
|---|---|---|
| MAX_STEPS | 30 | steps before giving up |
| STEP_DELAY | 2 | seconds between actions |
| STUCK_THRESHOLD | 3 | steps before stuck recovery |
| VISION_MODE | fallback | off / fallback / always |
| MAX_ELEMENTS | 40 | ui elements sent to llm |
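for reference, this is roughly how those keys map onto env reads - key names and defaults are taken from the table above, but the code is a sketch, not the real config.ts:

```ts
// illustrative config loading - key names and defaults match the table above,
// but this is a sketch, not the actual config.ts.
const env = process.env;

export const config = {
  maxSteps: Number(env.MAX_STEPS ?? 30),             // steps before giving up
  stepDelay: Number(env.STEP_DELAY ?? 2),            // seconds between actions
  stuckThreshold: Number(env.STUCK_THRESHOLD ?? 3),  // steps before stuck recovery
  visionMode: (env.VISION_MODE ?? "fallback") as "off" | "fallback" | "always",
  maxElements: Number(env.MAX_ELEMENTS ?? 40),       // ui elements sent to the llm
};
```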
ready to use. workflows are ai-powered (json), flows are deterministic (yaml).
| file | purpose |
|---|---|
| kernel.ts | main loop |
| actions.ts | 22 actions + adb retry |
| skills.ts | 6 multi-step skills |
| workflow.ts | workflow orchestration |
| flow.ts | yaml flow runner |
| llm-providers.ts | 5 providers + system prompt |
| sanitizer.ts | accessibility xml parser |
| config.ts | env config |
| constants.ts | keycodes, coordinates |
| logger.ts | session logging |