Which LLM? And how do YOU decide?

Thinking out loud about how to assess which LLMs to use, and starting to build a framework based on real-life applications.


Which LLM? Some thinking out loud

At the risk of getting the AI tech bro comment overload 🤫

Lately I’ve been thinking about how we choose which LLMs to use: not just which vendor, but which models, where, and why.

Most of the conversation I see focuses on benchmarks and “this model scored X on Y leaderboard”. Useful, but pretty shallow if you’re actually trying to build something that has to work in the real world, with real constraints.

To help myself, and maybe others, I’m sketching out a simple framework to think this through, and I’d love input and critique.

Here are some of the dimensions I’m playing with:

Accuracy & task fit - Less “who’s best on a benchmark?” and more “does this model behave reliably on our jobs?” - e.g. summarising reports, extracting structured data, reasoning about policy, writing code, supporting casework, etc. Also: does it stay grounded when we use retrieval, or does it still hallucinate?

Speed & experience - Not just raw tokens-per-second, but first-token latency, streaming, and whether a “slow but deep thinking” mode is actually worth it in the user interface. For some tasks, fast + “good enough” is the right call; for others, you want the slower, careful model.
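To make that latency point concrete, here’s a toy sketch of measuring first-token latency separately from total generation time. `stream_reply` is a stand-in for a real streaming client, not an actual API; in a chat UI, the first number is usually the one users feel.

```python
import time

def stream_reply():
    """Stand-in for a streaming LLM client: pauses, then yields tokens."""
    time.sleep(0.05)  # pretend the model is "thinking" before the first token
    for token in ["Hello", " ", "world"]:
        yield token

start = time.monotonic()
first_token_at = None
tokens = []
for tok in stream_reply():
    if first_token_at is None:
        # Time-to-first-token: the delay the user actually perceives
        first_token_at = time.monotonic() - start
    tokens.append(tok)
total = time.monotonic() - start
```

A “slow but deep thinking” mode mostly moves `first_token_at`, which is why it can feel worse even when `total` barely changes.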

Context & retrieval - How well does it actually use long contexts, and how nicely does it play with a RAG setup? I care less about theoretical 1M-token windows and more about “can it handle a handful of well-chosen chunks reliably?”
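As an illustration of the “handful of well-chosen chunks” idea, here’s a minimal, hypothetical sketch: rank candidate chunks and keep only the top few within a size budget, rather than stuffing the window. The keyword-overlap `score` is a toy stand-in for real embedding-based retrieval.

```python
def score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words that appear in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

def build_prompt(query: str, chunks: list[str], k: int = 3,
                 budget_chars: int = 2000) -> str:
    """Pick the top-k most relevant chunks that fit a rough size budget."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    selected, used = [], 0
    for chunk in ranked[:k]:
        if used + len(chunk) > budget_chars:
            break
        selected.append(chunk)
        used += len(chunk)
    context = "\n---\n".join(selected)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")
```

The point isn’t the scoring function; it’s that a small, curated context is easier to ground against than a huge one.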

Model weight & deployment - Let's just use the biggest, newest model, right? Big models are impressive, but they’re also expensive and energy-hungry. Small models are more viable for self-hosting and edge use. I’m interested in “tiered” approaches: small model by default, larger model only when the question really demands it.
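The tiered idea can be sketched in a few lines. Everything here is a stand-in: `small_model` and `large_model` are fake, and the confidence heuristic is invented for illustration; in practice the escalation signal might be a classifier, self-reported confidence, or task type.

```python
def small_model(question: str) -> tuple[str, float]:
    """Stand-in for a small, cheap model: returns (answer, confidence)."""
    if len(question.split()) <= 8:   # pretend short questions are easy
        return ("quick answer", 0.9)
    return ("uncertain answer", 0.4)

def large_model(question: str) -> str:
    """Stand-in for a large, expensive model."""
    return "careful answer"

def answer(question: str, threshold: float = 0.7) -> tuple[str, str]:
    """Route to the small model first; escalate only below the threshold."""
    reply, confidence = small_model(question)
    if confidence >= threshold:
        return reply, "small"
    return large_model(question), "large"
```

The appeal is that most traffic stays on the cheap tier, so cost and energy scale with the hard questions rather than with all of them.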

Openness & licensing - Fully open, semi-open, or closed API? What does that mean for lock-in, governance and the ability to move if pricing or policies change?

Environmental and cost trade-offs - These two rarely get discussed together, yet a lot of the time the same choices that reduce cost also reduce energy use (smaller models, caching, better retrieval, not over-generating text).

Access & data governance - Self-host or public API? Regional hosting? How does that intersect with internal policies, principles, funder requirements, or things like GDPR?

Output structuring & tools - This is becoming central for me. Can the model reliably produce strict JSON that matches a schema? Can it call tools cleanly? Can I trust it to stick to a format that other systems depend on?

Training & adaptation - How easy is it to adapt the model to our language, our tone, our taxonomies? That might be full fine-tuning, or just clever use of retrieval and system prompts - but it affects what models are actually viable.

In practice

I’ve been using these categories while building Open Recommendations, which extracts insights and recommendations from reports, stores them in strict JSON schemas, and makes them searchable and chat-able. There, I’ve already had to explore questions like:

  • Do we use one model for extraction and tagging, and another for interactive chat and sense-making?
  • How strict can we be about JSON schemas without making the system fragile?
  • When is it worth paying (in money and energy) for a larger model vs chaining together smaller ones?
  • What's the trade-off between speed (for the user) and accuracy?

So, I’m curious:

  • If you’ve chosen models for real systems, what else do you consider beyond benchmarks?
  • Have you found any underrated dimensions (e.g. observability, ecosystem, support, safety profile)?
  • Any good examples or horror stories of picking a model for the “wrong” reason?

Would love to hear how others are thinking about this - especially from people working in social impact, public sector, research and other contexts where accuracy, accountability and constraints all matter.