Finding the right tools for AI applications through painful trial and error

Building RAG (Retrieval-Augmented Generation) applications in 2024 feels like navigating a minefield of framework choices. After months of experimentation and some production incidents I'd rather forget, we've finally landed on a stack that works. Here's our journey through the LLM framework landscape and why we ultimately abandoned Django for FastAPI.

The LlamaIndex Maze

We started with LlamaIndex. It seemed like the obvious choice—tons of integrations, active community, comprehensive documentation. Our first prototype was promising: load documents, create an index, query it. Simple.
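For context, that first prototype was only a handful of lines, roughly like the sketch below (module paths assume a recent `llama_index.core` layout; older releases import from `llama_index` directly, and the defaults expect an OpenAI API key in the environment):

```python
# Minimal RAG prototype: load files, build a vector index, ask a question.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data/").load_data()   # parse files in ./data
index = VectorStoreIndex.from_documents(documents)       # embed and index them
query_engine = index.as_query_engine()

print(query_engine.query("What does our refund policy say?"))
```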

Then we tried to customize it for our use case.

We needed custom document parsing, specific chunking strategies, and integration with our existing PostgreSQL database. Suddenly we were drowning in abstractions. Every simple change meant diving through layers of inheritance. Want to change how documents are chunked? That's buried three levels deep. Need custom embedding logic? Hope you enjoy reading framework source code.

The breaking point came when we spent a week debugging why certain documents weren't being indexed properly, only to discover the cause was an undocumented interaction between two internal components.

LangChain's Dependency Hell

Next, we tried LangChain. The promise of "chains" and "agents" seemed perfect for our workflow orchestration needs. The initial setup was cleaner than LlamaIndex, and the abstractions made more sense.
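For reference, the kind of composition that drew us in looked roughly like this, assuming the post-0.1 split packages and the LCEL pipe style; exact imports depend on which LangChain release you have pinned:

```python
# A simple LCEL chain: prompt -> model -> string output.
# Package names assume the split distribution (langchain-core, langchain-openai).
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

answer = chain.invoke({
    "context": "Refunds are issued within 14 days.",
    "question": "How long do refunds take?",
})
print(answer)
```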

But then came deployment day. Our Docker builds started failing randomly. The culprit? LangChain's dependency tree: thirty-plus dependencies, each with its own version requirements. Version conflicts everywhere. Update one package, break three others. We even had a critical security patch for a core library that we couldn't apply because LangChain pinned an older version.

The final straw came when a minor LangChain update completely changed the document loader API, breaking our production pipelines. No deprecation warnings, just broken code.

Enter Agno

After these experiences, we wanted something minimal and predictable. Agno fit the bill. No magic. No deep inheritance hierarchies. Just functions that do what they say.

The trade-off? Agno doesn't have batch embeddings out of the box. For our scale, this was a problem. But unlike the other frameworks, extending Agno was straightforward. We wrote our own batch embedding logic in about 50 lines. Not elegant, but it works and we understand every line of it.
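The patch itself is nothing clever. A trimmed-down sketch of the idea, calling the OpenAI embeddings endpoint directly; the model name and batch size here are illustrative, not our production values:

```python
# Batch embedding helper: the embeddings endpoint accepts a list of inputs,
# so we chunk the texts and send one request per batch instead of per string.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_in_batches(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",  # illustrative model choice
            input=batch,
        )
        vectors.extend(item.embedding for item in response.data)
    return vectors
```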

Why Django Had to Go

Here's the uncomfortable truth: Django and Django REST Framework are fantastic for traditional web applications. For LLM applications? They're actively harmful.

The Sync/Async Nightmare

LLM operations are inherently async. API calls to OpenAI, embedding generation, vector searches: all of them benefit from concurrent execution. Django has grown async views and ASGI support, but the historical baggage is still painful; the ORM and much of the ecosystem remain sync-first, so you end up with awkward adapter functions everywhere, blocked worker threads, and performance bottlenecks.
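A hedged illustration of the adapter pattern we kept writing (the `generate_answer` helper is hypothetical; `async_to_sync` comes from `asgiref`, which Django depends on):

```python
# A synchronous Django view wrapping an async LLM call. Every call like this
# blocks a worker thread for the full duration of the upstream request.
from asgiref.sync import async_to_sync
from django.http import JsonResponse

from myapp.llm import generate_answer  # hypothetical async helper around the LLM API

def ask(request):
    question = request.GET.get("q", "")
    # async_to_sync runs a single coroutine to completion just to get a value
    # back into sync land -- the adapter boilerplate we kept repeating.
    answer = async_to_sync(generate_answer)(question)
    return JsonResponse({"answer": answer})
```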

The Streaming Problem

Modern LLM applications need streaming responses. Users expect to see tokens as they're generated, not wait 30 seconds for a complete response. Django's response model wasn't built for this. You can hack it in, but it feels like forcing a square peg into a round hole.
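The usual workaround is `StreamingHttpResponse`, roughly as sketched below; Django 4.2+ under ASGI will accept an async iterator, older setups need further contortions, and `stream_completion` is a hypothetical helper wrapping the model call:

```python
# Token streaming via StreamingHttpResponse. It can be made to work, but the
# response model was designed for file downloads, not server-sent tokens.
from django.http import StreamingHttpResponse

from myapp.llm import stream_completion  # hypothetical async generator of tokens

async def chat(request):
    async def token_stream():
        async for token in stream_completion(request.GET.get("q", "")):
            yield token.encode("utf-8")

    return StreamingHttpResponse(token_stream(), content_type="text/plain")
```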

The Weight of the Framework

For LLM apps, we needed:

  • Fast async request handling
  • WebSocket support for real-time interactions
  • Efficient streaming responses
  • Minimal overhead for high-throughput embedding operations

What we didn't need:

  • Django's ORM (we use vector databases)
  • Django's admin (useless for LLM data)
  • Django's forms and templates
  • Session handling, CSRF, middleware stack

We were using 10% of Django and fighting the other 90%.

FastAPI: Built for the AI Era

Moving to FastAPI was like taking off ankle weights. Everything that had been painful in Django simply worked. Async operations are first-class citizens. Streaming responses are built in. WebSockets work out of the box. No fighting the framework. No sync/async adapters. No ceremonial boilerplate.
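For comparison, a minimal token-streaming endpoint in FastAPI, with the same hypothetical `stream_completion` generator standing in for the model call:

```python
# Streaming tokens to the client as they arrive from the model.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

from myapp.llm import stream_completion  # hypothetical async generator of tokens

app = FastAPI()

@app.get("/chat")
async def chat(q: str):
    async def token_stream():
        async for token in stream_completion(q):
            yield token

    return StreamingResponse(token_stream(), media_type="text/plain")
```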

The performance difference was immediate. Request handling became faster. Memory usage dropped. Our embedding pipeline could finally run concurrently without hacks.
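Concretely, concurrency in the embedding pipeline is now plain asyncio rather than thread-pool workarounds. A sketch using the async OpenAI client, with an illustrative concurrency cap:

```python
# Embedding many batches concurrently, with a cap on in-flight requests.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def embed_batch(batch: list[str], limit: asyncio.Semaphore) -> list[list[float]]:
    async with limit:  # avoid hammering the API with unbounded parallelism
        response = await client.embeddings.create(
            model="text-embedding-3-small", input=batch
        )
        return [item.embedding for item in response.data]

async def embed_all(batches: list[list[str]]) -> list[list[float]]:
    limit = asyncio.Semaphore(8)  # illustrative concurrency cap
    results = await asyncio.gather(*(embed_batch(b, limit) for b in batches))
    return [vector for batch_vectors in results for vector in batch_vectors]

# Usage: vectors = asyncio.run(embed_all(batches))
```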

The Current Stack

Our LLM stack now:

  • FastAPI for all API endpoints
  • Agno for LLM operations (with our batch embedding patch)
  • PostgreSQL + pgvector for vector storage (retrieval sketch after this list)
  • Redis for caching embeddings
  • Modal for GPU-intensive operations

It's not perfect, but it's predictable. When something breaks, we can fix it. When we need to extend functionality, we can do it without archaeology.

Lessons Learned

  1. Framework complexity compounds - In LLM apps, you're already dealing with prompt engineering, context windows, and model quirks. Don't add framework complexity on top
  2. Async-first is non-negotiable - LLM operations are I/O bound. If your framework fights async, find a new framework
  3. Minimal > Magical - We'd rather write 100 lines of clear code than configure 1000 lines of framework magic
  4. Django's era is ending for certain workloads - It's still great for traditional web apps, but the future is async

The LLM tooling landscape is evolving rapidly. What works today might be obsolete tomorrow. The key is choosing tools that you can understand and modify when (not if) you need to.