Issue #14

Open-Source Agent Frameworks: What's Worth Your Time

Use an escalation ladder, not a hype ladder: stay in plain code longer than the market wants you to, move to a workflow framework when state and recovery become real, and reach for multi-agent coordination only when the job genuinely needs it.

Most teams compare agent frameworks too early. The better decision rule is to climb from plain code to workflow orchestration to multi-agent coordination only when runtime burden, recovery needs, and operator responsibility become real.

60-second summary

Most teams evaluating agent frameworks are asking the question in the wrong order.

They start by comparing frameworks to each other before answering the more important question: do they need an agent framework at all?

If your application is basically a model call plus a few tool invocations, plain code is often the right answer. If the real problem is state, retries, branching, approvals, and long-running work, you may need a workflow framework. If the job truly requires multiple specialized agents coordinating as peers, then a multi-agent framework starts to make sense.

That is the escalation ladder that matters: plain code, then workflow framework, then multi-agent framework.

Most teams should climb it slowly, because each step adds runtime and operator burden. The useful filter is not which framework looks most capable in a demo. It is which abstraction matches the failure modes, recovery needs, and operating burden your team is actually about to own.

The primary sources, meaning each project's own docs and README, support a simpler rule than the usual ranking list: pick the lightest abstraction that still fits your runtime reality, then climb only when the operational burden becomes concrete.

Step 1: stay in plain code longer than the market wants you to

A surprising number of "agent" applications are still just structured software with a model in the loop.

If you have a small number of tools, a short-lived request, limited state, and no real need for resumability or human interruption, plain code is often enough. You may need typed inputs and outputs, a little provider portability, and good tracing. You do not necessarily need an agent runtime.
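As a rough illustration of how far plain code can go, here is a minimal sketch of a model-plus-tools loop with typed inputs and outputs. Everything in it is hypothetical: `fake_model` stands in for a real provider call, and `ToolCall`, `FinalAnswer`, `TOOLS`, and `run` are names invented for this example, not part of any framework.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class ToolCall:
    """A request from the model to run one tool."""
    name: str
    argument: str


@dataclass(frozen=True)
class FinalAnswer:
    """The model's final reply to the user."""
    text: str


ModelReply = Union[ToolCall, FinalAnswer]


def lookup_order(order_id: str) -> str:
    # Hypothetical tool: a real one would query a database or an API.
    return f"order {order_id}: shipped"


TOOLS: dict[str, Callable[[str], str]] = {"lookup_order": lookup_order}


def fake_model(history: list[str]) -> ModelReply:
    # Stand-in for a provider call (an assumption, not a real SDK):
    # request one tool call, then answer using the tool result.
    if not any(line.startswith("tool:") for line in history):
        return ToolCall(name="lookup_order", argument="A17")
    return FinalAnswer(text=f"Status -> {history[-1].removeprefix('tool: ')}")


def run(question: str, model=fake_model, max_steps: int = 5) -> str:
    history = [f"user: {question}"]
    for _ in range(max_steps):
        reply = model(history)
        if isinstance(reply, FinalAnswer):
            return reply.text
        tool = TOOLS[reply.name]  # KeyError here means the model named an unknown tool
        history.append(f"tool: {tool(reply.argument)}")
    raise RuntimeError("model did not finish within max_steps")


print(run("Where is order A17?"))
```

The whole "runtime" is one bounded loop and a dictionary of functions. Nothing here persists state or survives a crash, which is exactly the point at which the next step of the ladder becomes relevant.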

This is where the framework conversation often gets distorted. Teams adopt orchestration before they have orchestration problems. Then they inherit abstractions for state, workflows, and multi-agent coordination that the application did not actually need.

That is not sophistication. It is premature operational debt: more runtime to understand, more failure paths to inspect, and more framework behavior to govern before the application has earned it.

Step 2: move to a workflow framework when the hard part becomes state and recovery

Once the system needs to preserve progress, recover after interruption, pause for humans, or manage long-running work, plain code starts to get brittle. This is where a workflow-oriented framework earns its keep.
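To make "preserve progress, recover after interruption, pause for humans" concrete, here is a hand-rolled sketch of checkpoint-and-resume using only the standard library. It is not how any particular framework implements durability; the step names, the JSON file layout, and `run_workflow` are assumptions made for illustration. It mainly shows how quickly this kind of bookkeeping accumulates in plain code.

```python
import json
import tempfile
from pathlib import Path

STEPS = ["fetch", "draft", "approve", "send"]  # hypothetical step names


def run_workflow(checkpoint: Path, approved: bool = False) -> dict:
    """Run steps in order, saving progress after each one so a rerun resumes."""
    state = json.loads(checkpoint.read_text()) if checkpoint.exists() else {"done": []}
    for step in STEPS:
        if step in state["done"]:
            continue  # finished in an earlier run; do not repeat it
        if step == "approve" and not approved:
            state["status"] = "waiting_for_human"
            checkpoint.write_text(json.dumps(state))
            return state  # pause here; a later call resumes from this step
        state["done"].append(step)
        state["status"] = "running"
        checkpoint.write_text(json.dumps(state))
    state["status"] = "finished"
    checkpoint.write_text(json.dumps(state))
    return state


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "run.json"
    first = run_workflow(path)                  # stops at the approval gate
    second = run_workflow(path, approved=True)  # resumes, skipping fetch and draft
    print(first["status"], first["done"])
    print(second["status"], second["done"])
```

Even this toy version has to decide where state lives, when to write it, and how a paused run is told apart from a crashed one. Those decisions are what a workflow framework takes on for you.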

LangGraph is the clearest example in the current open-source stack. Its own overview describes it as a low-level orchestration framework and runtime for building, managing, and deploying long-running, stateful agents. The official docs explicitly center durable execution, human-in-the-loop, memory, and debugging and visibility.

That is an important signal. LangGraph is not trying to hide the fact that production agent systems become workflow systems. It is designed for teams that already know the hard part is not getting a demo to work once. It is governing a stateful system after the happy path breaks, resumes, or needs intervention.

If your real problem is resumability, recovery, human approvals, or operator intervention, LangGraph is worth the extra complexity. If your real problem is still "how do I call two tools and summarize the result," it probably is not.

CrewAI also belongs in this part of the ladder, but at a different point on it. Its Flows docs position the framework around structured, event-driven workflows that connect tasks, manage state, and control execution flow. The same docs now also describe built-in persistence, state recovery, and restart behavior. That makes CrewAI appealing for teams that want to package a workflow quickly and are comfortable with more opinionated ergonomics.

The advantage is speed. You can move from agent idea to operational workflow more quickly. The tradeoff is that ergonomic abstractions can postpone the moment when you confront replay, side effects, and failure handling explicitly. For some teams that is the right trade. For others it becomes the reason the system is harder to reason about later, especially once the workflow matters more than the demo.
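The replay and side-effect concern can be shown in a few lines. The sketch below assumes a workflow that crashes after sending an email and is then replayed from the top; recording each effect under a key keeps the replay from sending the email twice. `EffectLog`, `run_once`, and the key format are hypothetical, and the log is in memory only, where a real system would need durable storage.

```python
from typing import Callable


class EffectLog:
    """Records completed side effects by key so a replay does not repeat them."""

    def __init__(self) -> None:
        self._results: dict[str, str] = {}

    def run_once(self, key: str, effect: Callable[[], str]) -> str:
        if key in self._results:
            return self._results[key]  # replay: reuse the recorded result
        result = effect()              # first run: perform the side effect
        self._results[key] = result
        return result


sent: list[str] = []  # stands in for the outside world


def send_email() -> str:
    sent.append("welcome email")
    return "sent"


log = EffectLog()


def workflow(fail_after_email: bool) -> str:
    log.run_once("run-42:send_email", send_email)  # hypothetical key format
    if fail_after_email:
        raise RuntimeError("crash after the email went out")
    return "done"


try:
    workflow(fail_after_email=True)
except RuntimeError:
    pass  # the run failed midway; the email was already sent

result = workflow(fail_after_email=False)  # replay from the top
print(result, len(sent))
```

Whichever framework you pick, it is worth knowing whether something like this key-and-record step happens for you, or whether a restart simply runs your side effects again.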

So if LangGraph fits teams that want explicit runtime control, CrewAI fits teams that want workflow structure with a more packaged developer experience.

Step 3: only reach for a multi-agent framework when the job really needs multiple agents

This is where the market gets especially noisy.

A lot of builders are being pushed toward multi-agent architecture before they have evidence they need it. But multiple agents only help when the job genuinely benefits from decomposition into separate roles, tools, or control boundaries. Otherwise, you are mostly adding coordination cost.

That is why the question is not "which multi-agent framework is coolest?" It is "what coordination problem do I have that one well-structured workflow cannot solve cleanly?"

CrewAI often gets evaluated here because its crews story is visible and easy to demo. That is fine, as long as teams understand what they are buying: more coordination abstraction, not just more capability.

AutoGen also still shows up in these conversations because it helped define the modern agent-framework category. But the most important fact about AutoGen right now is not architectural history. It is maintainer direction. The official GitHub README says AutoGen is in maintenance mode, community managed going forward, and recommends Microsoft Agent Framework for new projects.

That does not make AutoGen useless. It does make it much harder to recommend for greenfield adoption. A framework can be influential and technically serious and still stop being the right place to invest net-new team time if the maintainers are signaling that the future is elsewhere.

For inherited codebases, AutoGen may still be worth understanding. For new builds, maintenance mode changes the answer.

Where PydanticAI fits

PydanticAI is interesting because it gives teams a disciplined middle path.

Its value is not theatrical orchestration. It is software-shaped agent construction: typed agents, structured output, dependencies, provider portability, MCP support, and an explicit durable execution story. The current durable-execution docs describe integrations with Temporal, DBOS, Prefect, and Restate, which is a clearer and more defensible posture than pretending every team needs a heavyweight workflow runtime on day one.

That makes it a strong fit for Python teams that want better agent primitives before they commit to an orchestration runtime. In the escalation ladder, PydanticAI fits closest to the bottom and middle: stronger than plain code alone, lighter than a full orchestration worldview, and compatible with moving upward only when the workload proves it deserves that complexity.

It is a useful reminder that not every serious agent stack has to begin with a full workflow runtime. Sometimes the right first move is to make plain code more disciplined before making it more distributed.
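One way to picture "more disciplined" plain code is validating model output into a typed object and retrying with the error message when validation fails. The sketch below uses only the standard library and is not PydanticAI's API; `Triage`, `parse_triage`, `triage_with_retry`, and the canned replies are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Triage:
    category: str
    urgent: bool


ALLOWED = {"billing", "bug", "other"}  # hypothetical label set


def parse_triage(raw: str) -> Triage:
    """Validate raw model text into a typed object, or raise ValueError."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    category, urgent = data.get("category"), data.get("urgent")
    if not isinstance(category, str) or category not in ALLOWED:
        raise ValueError(f"unknown category: {category!r}")
    if not isinstance(urgent, bool):
        raise ValueError("urgent must be true or false")
    return Triage(category=category, urgent=urgent)


def triage_with_retry(model, prompt: str, attempts: int = 3) -> Triage:
    """Call the model, feeding validation errors back until a reply parses."""
    error = ""
    for _ in range(attempts):
        raw = model(prompt + error)
        try:
            return parse_triage(raw)
        except ValueError as exc:
            error = f"\nPrevious reply was invalid: {exc}"
    raise RuntimeError("model never produced a valid reply")


# Canned replies stand in for a real model: one bad answer, then a good one.
replies = iter(["not json", '{"category": "bug", "urgent": true}'])
result = triage_with_retry(lambda prompt: next(replies), "Classify: app crashes on login")
print(result)
```

A typed-agent library packages this pattern, along with schema generation and provider handling, so you do not maintain it by hand.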

That restraint matters. A good framework should not force you to buy tomorrow's operational burden before today's application has earned it.

Quick comparison

Framework	Best fit	Operational upside	Main caution
LangGraph	Teams that already need durable state, resumability, intervention, and recovery	Explicit orchestration and clearer control over the state machine	More runtime and operator surface than lighter use cases need
CrewAI	Teams that want faster workflow packaging with more opinionated ergonomics	Quicker path from agent idea to structured workflow, with built-in state and persistence primitives	Framework opinion can hide replay, side effects, and failure handling until later
PydanticAI	Python teams that want typed agent construction without committing to heavyweight orchestration yet	Disciplined primitives, provider portability, MCP support, and native integrations for durable execution when needed	It does not remove the need to own real state and recovery design once long-running work becomes central
AutoGen	Inherited systems or teams studying the category's architecture history	Useful legacy context and still technically instructive	Hard to recommend for greenfield adoption because the project is in maintenance mode and points new users elsewhere

A good shortcut is to map each tool to the burden it is helping you buy on purpose: PydanticAI buys discipline, LangGraph buys explicit orchestration, CrewAI buys packaged workflow velocity, and AutoGen mostly buys legacy context right now.

That framing matters because it turns a fuzzy framework debate into a cleaner operator question: are you buying type discipline, workflow control, packaging speed, or just historical familiarity?

A practical selection rule

If you are choosing an open-source agent framework in 2026, use this order:

1. Start with plain code if the application is short-lived, tool-light, and not meaningfully stateful.
2. Move to a workflow framework when state, retries, resumability, approvals, or recovery become first-class concerns.
3. Reach for multi-agent coordination only when one agent or one workflow is no longer a clean fit for the job.
4. Check maintainer direction before you commit, because architecture quality matters less if the project itself is no longer where new investment is going.
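The first three rules can be written down as a small decision function, which is a useful exercise mostly because it forces you to answer the questions explicitly. The field names and the `rung` function below are hypothetical, and the maintainer-direction check is left out because it is a judgment about the project, not about your workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical yes/no answers about the application, not a real API."""
    stateful: bool = False
    needs_retries: bool = False
    needs_resumability: bool = False
    needs_approvals: bool = False
    single_workflow_fits: bool = True


def rung(w: Workload) -> str:
    """Return the lowest rung of the ladder that matches the workload."""
    if not w.single_workflow_fits:
        return "multi-agent framework"
    if w.stateful or w.needs_retries or w.needs_resumability or w.needs_approvals:
        return "workflow framework"
    return "plain code"


print(rung(Workload()))
print(rung(Workload(needs_approvals=True)))
print(rung(Workload(stateful=True, single_workflow_fits=False)))
```

If every field stays at its default for your application, the function's answer is the article's answer: plain code.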

That rule produces clearer choices.

If you need explicit durability, state control, recovery, and human intervention, start with LangGraph.

If you need workflow packaging and stateful automation faster, and you are comfortable with more framework opinion, CrewAI deserves a hard look.

If you want typed Python ergonomics and the option to add durability only when the workload justifies it, PydanticAI is one of the most sensible starting points.

If you cannot name the concrete runtime burden you are solving yet, that is usually a sign to stay lower on the ladder.

If you are evaluating AutoGen for a new build, the default answer should be caution, and usually no.

A simple test helps: if the pain is mostly code quality, reach for discipline; if it is mostly state and recovery, reach for orchestration; if it is mostly delivery speed on a structured workflow, reach for packaging; if it is mostly curiosity about agent patterns, do not mistake that for a production requirement.

The real filter is operator burden

The wrong framework choice is rarely wrong because it lacked agent features.

It is wrong because it either hid the operating burden that was coming, or imposed an operating burden the team never needed.

That is the throughline across the current primary sources. LangGraph emphasizes durable state and intervention. CrewAI emphasizes structured workflow execution plus persistence. PydanticAI emphasizes disciplined agent construction plus native durable-execution integrations when the workload earns them. AutoGen's own maintainers are signaling caution for new projects.

Those are different product shapes, but they point to the same default recommendation for most teams: stay lower on the ladder until your runtime burden proves you need more. In practice, that usually means plain code first, PydanticAI when you want stronger typed discipline, LangGraph when recovery and intervention are truly first-class, CrewAI when faster packaged workflow assembly matters enough to accept more framework opinion, and AutoGen mainly for legacy understanding rather than new adoption.

That is why the best evaluation question is not "which framework is best?" It is "which abstraction level matches the failure modes, state model, and operator responsibility this application is actually about to create?"

That question is stronger than a feature checklist because it forces the team to name the burden first and the framework second.

Framework rankings age fast. Runtime burden, recovery needs, and maintainer direction age more slowly. Those are the things that usually decide whether the adoption still looks smart six months later.

So the practical takeaway is straightforward: do not ask which framework won the week on social media. Ask which one matches the state model, recovery needs, and operator burden you will still own after the demo.

The category is maturing. That means the useful question is no longer which framework looks most agentic. It is where on the escalation ladder your application really belongs.

Right now, that is the filter worth your time.