sankalp phadnis


how zarie works under the hood

i wrote earlier about building and pausing zarie. this post is different. it is a technical deep dive into how the system actually works, the architecture decisions we made, and three lessons i learned the hard way about building AI agents.

zarie is an AI assistant that runs on telegram and slack. you tell it to remind you about things, manage tasks, sync calendars, or search the web. we built it between october and december 2025. this post explains how the system works: a dual‑agent architecture, a custom ReAct loop built on LiteLLM (no agent frameworks), a scheduler for recurring events, and postgres for all state.

at a high level, zarie has three layers: the messaging layer (telegram and slack bots), the agent layer (a main agent and worker agents), and the persistence layer (postgres for everything).

high‑level architecture
- telegram bot: message buffering, whitelist
- slack bot: MPIM, thread support
- main agent (donna): user‑facing ReAct loop, tool calling, conversation streaming
- worker agents: background tasks, reminders
- scheduler: 60s cron, RRULE engine
- event manager: time events, status locking
- postgresql: users, conversations, worker state, time events, google creds

when a user sends a message on telegram or slack, the bot buffers it for 5 seconds (to batch rapid‑fire messages), then hands it to the main agent. the main agent runs a ReAct loop: it thinks, optionally calls tools, gets results, and loops until it has a final response. one of those tools is invoke_worker_agent, which delegates background work to a separate agent with its own conversation context.
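the buffering step is essentially a debounce. here is a minimal sketch assuming an asyncio‑based bot; all names are hypothetical, not zarie's actual code, and the window is shortened so the sketch runs instantly (the real system waits 5 seconds):

```python
# hypothetical sketch of the 5-second message buffer, assuming asyncio
import asyncio

BUFFER_SECONDS = 0.05  # 5.0 in the real system; shortened for the sketch
buffers: dict[int, list[str]] = {}

async def on_message(user_id: int, text: str, handle) -> None:
    """collect rapid-fire messages; flush one batch after a quiet period."""
    pending = buffers.setdefault(user_id, [])
    pending.append(text)
    snapshot = len(pending)
    await asyncio.sleep(BUFFER_SECONDS)
    # only the task that saw the last message flushes; earlier tasks
    # notice new arrivals and do nothing
    if user_id in buffers and len(buffers[user_id]) == snapshot:
        batch = buffers.pop(user_id)
        await handle(user_id, "\n".join(batch))

async def demo() -> list[str]:
    received = []
    async def handle(uid: int, text: str) -> None:
        received.append(text)
    # two messages arrive in quick succession -> handed off as one batch
    await asyncio.gather(
        on_message(1, "hey", handle),
        on_message(1, "remind me to stretch at 5", handle),
    )
    return received

batches = asyncio.run(demo())
```

batching this way means rapid‑fire messages reach the agent as one turn instead of triggering several overlapping LLM calls.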

before going deeper into the agents, it helps to see what the database looks like. everything lives in postgres. no redis, no vector database, no external stores.

data model
- users: telegram_id (PK), name, platform, timezone
- chats_context (users 1 : many): user_id (FK), message_sequence, role, content, tool_calls, thread_ts, author_id, is_summarised
- time_events: user_id (FK), next_trigger_timestamp, recurrence_rule (RRULE), status: ACTIVE | PROCESSING | DISABLED
- worker_agent_directory_v2 (users 1 : many): agent_name, user_id, purpose, running_summary
- worker_agent_context: same schema as chats_context, scoped to worker agents

the chats_context table is the most important one. every message, tool call, and tool result is stored as a row with a monotonically increasing message_sequence per user. when we rebuild context for an LLM call, we query by user_id, order by sequence, and have the full conversation. for slack MPIMs (multi‑party DMs), thread_ts and author_id track who said what within a group conversation.
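rebuilding context is a single ordered query. here is a sketch with sqlite standing in for postgres and the column set trimmed for brevity:

```python
# sketch of rebuilding LLM context from chats_context rows; sqlite stands
# in for postgres, and only a subset of the real columns is shown
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE chats_context (
    user_id INTEGER, message_sequence INTEGER, role TEXT, content TEXT)""")
conn.executemany("INSERT INTO chats_context VALUES (?, ?, ?, ?)", [
    (1, 1, "user", "remind me to stretch at 5pm"),
    (1, 2, "assistant", "done, reminder set for 5pm"),
    (2, 1, "user", "a message from a different user"),
])

def build_context(user_id: int) -> list[dict]:
    """fetch one user's full history, ordered by the monotonic sequence."""
    cur = conn.execute(
        "SELECT role, content FROM chats_context "
        "WHERE user_id = ? ORDER BY message_sequence",
        (user_id,))
    return [{"role": role, "content": content} for role, content in cur]

messages = build_context(1)
```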

worker agents get their own tables. the directory stores metadata (name, purpose, a running summary of state), and worker_agent_context mirrors the main conversation table but is scoped per worker. this separation means the main agent and workers never pollute each other's context.

both agents (main and worker) run the same pattern. the loop is simple; this is the entire thing, no framework required:

ReAct loop
1. build context: fetch history from postgres, summarize if >100 messages, prepend system prompt + running summary.
2. call LLM via LiteLLM: messages + tool definitions, tool_choice="auto".
3. if the response contains tool calls: run each tool, store the call and its result in the DB, and loop back to step 2.
4. if not: stream the response, chunked by paragraphs, to the user, store it in the DB. done.

each iteration of the loop is one LLM call. the model sees the full conversation including prior tool calls and their results. this means if the model calls set_time_event and it fails, it sees the error in the next iteration and can retry or adjust. there is no retry logic built into the application. the LLM handles it naturally.
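the loop really does fit in a few lines. this sketch stubs the LLM call (litellm.completion in the real system) and uses a hypothetical tool registry, so it shows the shape of the loop rather than the actual implementation:

```python
# minimal sketch of the ReAct loop; call_llm stubs litellm.completion,
# and the tool registry is hypothetical
def set_time_event(when: str) -> str:
    """hypothetical tool: pretend to schedule a reminder."""
    return f"event set for {when}"

TOOLS = {"set_time_event": set_time_event}

def call_llm(messages: list[dict]) -> dict:
    """stand-in for litellm.completion(model=..., messages=..., tools=...).
    first turn asks for a tool; once a tool result is present, it answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_calls": [{"name": "set_time_event",
                                "arguments": {"when": "5pm"}}]}
    return {"content": "reminder set for 5pm"}

def react_loop(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    while True:
        response = call_llm(messages)        # one LLM call per iteration
        calls = response.get("tool_calls")
        if not calls:                        # no tool calls -> final answer
            return response["content"]
        for call in calls:                   # execute tools, feed results back
            result = TOOLS[call["name"]](**call["arguments"])
            messages.append({"role": "tool", "content": result})

final = react_loop("remind me at 5pm")
```

note that tool results go straight back into `messages`, which is why a failed tool call needs no application‑level retry logic: the error is just another message the model sees on the next iteration.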

the main agent handles direct user conversations. but some tasks need to happen in the background: recurring reminders, periodic check‑ins, event‑triggered actions. those get delegated to worker agents.

when the main agent calls invoke_worker_agent, it either creates a new worker or resumes an existing one by name. each worker has its own conversation history stored in a separate table (worker_agent_context). workers carry a running_summary that captures their purpose and state, so they maintain continuity across invocations.

workers have a different tool set than the main agent. they can set and delete time events, access google calendar, and search the web, but they cannot invoke other workers. the separation keeps things clean: the main agent talks to the user, workers handle background jobs.

the scheduler

a separate process (run_scheduler.py) runs every 60 seconds and checks for due events. the flow looks like this:

scheduler flow (runs every 60s)
1. query due events: SELECT * FROM time_events WHERE next_trigger_timestamp <= NOW() AND status = 'ACTIVE'
2. lock to PROCESSING: prevents duplicate triggers from concurrent runs
3. invoke worker agent: the worker decides what to tell the user (or returns No_Response_Needed to stay silent)
4. stream response to user: via the main agent, on telegram or slack
5. update next trigger: parse the RRULE, calculate the next occurrence, set status back to ACTIVE (or DISABLED if one‑time)

race conditions are a real concern with time events. if the scheduler fires twice before the first run finishes, you get duplicate reminders. we handle this by immediately locking every due event to PROCESSING status before doing anything else, and processing events for the same user sequentially. events have three states: ACTIVE, PROCESSING, and DISABLED.
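the locking step can be sketched as a single conditional update that claims due events and returns their ids in one statement. sqlite stands in for postgres here, and the table is trimmed to the relevant columns:

```python
# sketch of the lock-to-PROCESSING step; sqlite stands in for postgres,
# where the same UPDATE ... RETURNING claims each row atomically
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE time_events (id INTEGER PRIMARY KEY, "
             "next_trigger_timestamp TEXT, status TEXT)")
conn.executemany("INSERT INTO time_events VALUES (?, ?, ?)", [
    (1, "2025-12-01 09:00", "ACTIVE"),
    (2, "2025-12-01 09:00", "PROCESSING"),  # already claimed by a prior run
])

def claim_due_events(now: str) -> list[int]:
    """flip due ACTIVE events to PROCESSING and return their ids;
    events already in PROCESSING are skipped, so no duplicate triggers."""
    cur = conn.execute(
        "UPDATE time_events SET status = 'PROCESSING' "
        "WHERE next_trigger_timestamp <= ? AND status = 'ACTIVE' "
        "RETURNING id",
        (now,))
    ids = [row[0] for row in cur.fetchall()]
    conn.commit()
    return ids

claimed = claim_due_events("2025-12-01 09:30")
```

because the status check and the status change happen in one statement, a second scheduler run that fires before the first finishes sees no ACTIVE rows and claims nothing.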

for recurring events, we use RRULE strings (the same format used in iCalendar). the event manager parses the rule, calculates the next trigger time using dateutil.rrule, and updates the record. all timestamps are stored in UTC and converted to the user's timezone only at display time.

workers also have a special token: No_Response_Needed. when a worker returns this, the entire response is suppressed. the user sees nothing. this is important for background operations like silently checking whether a recurring task is still relevant before bothering the user about it.

1. ditch the framework, build the loop yourself

we started with LangChain and LangGraph. the pitch was appealing: pre‑built agent loops, tracing via LangSmith, a graph‑based execution model. in practice, it was a mess.

the abstractions were leaky. debugging required understanding not just your code but the framework's internal state machine. LangSmith's tracing was useful when it worked, but the setup overhead was significant. every time we wanted to change something about the agent loop (how tools were retried, how streaming worked, how context was managed), we were fighting the framework instead of writing code.

framework vs. custom
- langchain / langgraph: pre‑built agent loops, graph‑based execution, langsmith tracing. large dependency tree, abstractions that are hard to customize or debug. abandoned after 3 weeks.
- litellm + custom ReAct loop: one dependency for LLM calls, ~100 lines of code for the full agent loop, full control over streaming, retries, context management, and tool execution. what we shipped.

we ripped it all out and wrote our own ReAct loop on top of LiteLLM. LiteLLM is a thin wrapper that normalizes the API across model providers (OpenAI, Anthropic, Google, DeepSeek, etc.) without trying to be an agent framework. our entire agent loop is about 100 lines of Python. we can read every line, debug every edge case, and change anything we want without fighting abstractions.

the tracing we lost from LangSmith? we replaced it with Poirot, a simple internal tool that logs every LLM call, tool invocation, and response for a given user session. it was not fancy, but it showed us exactly what we needed.

my takeaway: agent frameworks add value when your use case maps cleanly onto their assumptions. if it does not (and for most real products, it will not), you are better off with a thin LLM wrapper and your own loop.

2. vector search lost to "just keep everything in context"

early on, we built a notes system where the AI could create, read, update, and delete notes using vector search. the idea was that users would accumulate information over time, and the AI would retrieve relevant notes via embeddings when needed.

it did not work well. retrieval was unpredictable. the AI would sometimes miss relevant notes or pull in irrelevant ones. edits were a nightmare: when a user said "update my grocery list," the AI had to figure out which note to edit, merge the old and new content, and re‑embed it. failure modes were subtle and hard to debug.

memory strategy: what we tried vs. what worked
- vector search (CRUD notes): the AI creates and retrieves notes via embeddings. requires explicit CRUD operations, retrieval is similarity‑based and unpredictable, edits are error‑prone. dropped after testing.
- full message history + summarization: keep all messages in the conversation thread; when context exceeds ~100 messages, summarize older ones into a structured running summary. the LLM has full context. worked much better.

what actually worked was dumb and simple: keep every message in the conversation thread and let the LLM see all of it. when context got too long (over 100 messages), we ran a summarization pass that extracted key facts into a structured running summary (user preferences, contacts, recurring patterns) and marked the old messages as summarized. the last 6 user interactions stayed in full, so the model always had recent context.

this worked because the LLM does not need vector search to find a note. it already has the context. if a user said "add eggs to my grocery list" three messages ago, the model remembers. even after summarization, the structured summary captures "grocery list: milk, bread, eggs" in plain text. no embedding, no similarity threshold, no retrieval failures. and if the user updates the list, the new message is simply in context. the model merges it naturally.
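the windowing logic is easy to sketch. the thresholds (100 messages, last 6 interactions) are the ones described above; summarize() stands in for the LLM summarization pass:

```python
# sketch of the keep-everything-plus-summarize strategy; summarize()
# stands in for an LLM call that extracts key facts
SUMMARIZE_THRESHOLD = 100   # messages before older history gets compressed
KEEP_RECENT = 6             # recent interactions always kept verbatim

def summarize(messages: list[dict]) -> str:
    """stand-in for an LLM pass that builds a structured running summary."""
    return f"summary of {len(messages)} earlier messages"

def build_window(history: list[dict],
                 running_summary: str = "") -> tuple[str, list[dict]]:
    """return (running summary, messages to send in full)."""
    if len(history) <= SUMMARIZE_THRESHOLD:
        return running_summary, history
    older, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    return summarize(older), recent

history = [{"role": "user", "content": f"msg {i}"} for i in range(120)]
summary, window = build_window(history)
```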

the lesson: vector search is great for searching over large external corpora where the content was never in the conversation. but for user‑generated context within an ongoing conversation, just keeping the messages and summarizing when needed is simpler and more reliable.

3. prompt tuning creates model lock‑in

this one surprised us. we built zarie between october and december 2025, and started with DeepSeek V3 hosted in China. we spent weeks tuning our system prompt for it. the prompt is detailed: it defines an instruction hierarchy, specifies how to handle timezones, how to calculate relative times, how to format responses, and dozens of behavioral guidelines. DeepSeek V3 followed it well. the product felt good.

then we wanted to move off China‑hosted servers for latency and reliability reasons. we tried the same model through OpenRouter and Together AI. same model name, same prompt. the results were noticeably worse. responses were less consistent, tool calling was flakier, and the overall quality dropped. our best guess is that these providers use some form of quantization that subtly changes the model's behavior. the prompt was tuned for the full‑precision model, and the quantized version did not follow it the same way.

switching to a different model entirely (Claude, GPT‑4, Gemini) was harder still. each model has different strengths in instruction following, different ways of interpreting ambiguous prompts, different tendencies around tool calling. a prompt tuned for DeepSeek that says "never convert to IST unless the user's timezone IS IST" might be followed exactly by one model and loosely by another.

we eventually settled on Gemini 2.5 Flash, which gave us the best balance of cost, speed, and instruction following. it handled our existing prompt well enough that we did not have to rewrite it from scratch, but we still needed meaningful tuning to get it to the same quality level.

we discovered this the hard way, but it turns out to be a known problem. Cursor sends meaningfully different system prompts to GPT‑5 and Claude Sonnet. Aider, the open‑source coding assistant, maintains separate prompt formats per model. none of this was well documented anywhere we knew to look while we were building zarie in late 2025; we just learned it by running into it.

the takeaway: a well‑tuned prompt is not portable. it is tuned for a specific model, and sometimes for a specific hosting provider. if you plan to switch models (and you should plan for it), invest in evaluation infrastructure early. we built a simple eval suite that runs test conversations and checks for specific behaviors. without it, switching models would have been guesswork.
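to make that concrete, here is the kind of check such a suite can run. everything below is a hypothetical sketch; run_agent stands in for a full conversation through the live model:

```python
# hypothetical sketch of a behavioral eval case; run_agent stands in for
# running a real test conversation through the model under evaluation
def run_agent(prompt: str) -> str:
    """stubbed model response for illustration."""
    return "done, your reminder is set for 17:00 UTC"

EVAL_CASES = [
    # (prompt, predicate the response must satisfy)
    ("remind me at 5pm", lambda r: "17:00" in r or "5pm" in r),
    ("remind me at 5pm", lambda r: "IST" not in r),  # timezone rule check
]

results = [check(run_agent(prompt)) for prompt, check in EVAL_CASES]
passed = all(results)
```

the predicates encode specific behaviors from the system prompt, so when a model or provider swap changes instruction following, the failures point at the exact behavior that drifted.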

zarie's architecture is not novel. dual‑agent systems, ReAct loops, and postgres‑backed state are well‑trodden ground. what i value most from the experience is learning where the conventional wisdom is wrong. agent frameworks add complexity that most products do not need. vector search is not the default answer for AI memory. and your prompt is more coupled to your model than you think.

the code is open source if you want to look at the implementation.
