AI agents & chatbots

Custom AI that actually answers — not vague chat that escalates everything.

Most chatbots fail because they're stitched together from a script and a vague prompt. We build agents grounded in your real documents, evaluated against your real questions, with the guardrails that stop them embarrassing you in front of a customer.

The problem

The chatbot you tried last year was a glorified contact form.

Most off-the-shelf chatbots do one of two things badly. They either pattern-match on keywords and dump the user into a phone-tree of pre-written replies, or they're a thin wrapper around a generic LLM that confidently invents shipping policies you don't have.

The shape of a useful AI agent is different. It's grounded in your actual content — help centre, policy docs, past tickets, product spec, internal wiki. It cites where its answers come from. It refuses when it doesn't know. It escalates to a human, with the full transcript, the moment a query crosses a threshold you've defined. And it improves measurably over time because someone is running an eval suite against it every week.

Building that takes more than wiring an API key into Intercom. It takes retrieval engineering, prompt engineering, eval discipline, and a conversation about what the agent is — and isn't — allowed to do.
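In code, the cite-refuse-escalate behaviour described above reduces to a small decision layer sitting in front of the model. A minimal sketch — function names, the score threshold, and the escalation topic list are all illustrative, not a production implementation:

```python
from dataclasses import dataclass

@dataclass
class Retrieval:
    passage: str   # text fetched from the help centre, wiki, etc.
    source: str    # where it came from, so the answer can cite it
    score: float   # retriever relevance score, 0..1

def decide(query: str, hits: list[Retrieval],
           min_score: float = 0.55,
           escalate_topics: tuple[str, ...] = ("pricing", "contract", "refund")) -> str:
    """Route a query: answer with citations, refuse, or hand off to a human."""
    # Contractual and pricing questions always go to a human, with the transcript.
    if any(topic in query.lower() for topic in escalate_topics):
        return "escalate"
    # Nothing sufficiently relevant was retrieved: refuse rather than invent.
    if not hits or max(h.score for h in hits) < min_score:
        return "refuse"
    # Otherwise the model answers using only `hits`, citing each source.
    return "answer"
```

The point of keeping this logic outside the model is that it's testable: you can assert the agent escalates pricing questions and refuses ungrounded ones without calling an API at all.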

Approach

How we build agents that ship

The standard 4-week proof-of-value: define quality, ground it in your data, wire guardrails, deploy with measurement. Extend or stop based on the numbers — not on a demo that wowed someone in a meeting.

  1. Define what "good" looks like

    Before we touch a model, we agree on the eval set: 30–100 real questions the agent must answer well, and the criteria for what "well" means. Without this, you can't tell if an upgrade actually improves anything.

  2. Ground it in your real data

    Most agent failures are retrieval failures, not model failures. We build a RAG pipeline from your actual sources — Notion, Confluence, Help Scout, Intercom, Zendesk, PDFs, transcripts — with chunking, re-ranking, and citation that lets users verify what the agent says.

  3. Wire in guardrails and escalation

    Topic filters so it stays in lane. Refusal patterns for things it shouldn't answer. PII redaction in logs. Hand-off rules to a human (with full transcript) when confidence is low or the user asks. Pricing and contractual questions get escalated by default.

  4. Ship, evaluate, iterate

    We deploy behind a feature flag, A/B against your existing flow if there is one, and run the eval set weekly. You see exactly which questions the agent improved on, which it regressed on, and where the next investment goes.
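The weekly eval step above is, at its core, a diff between two scored runs. A minimal sketch, assuming each question in the eval set gets a 0–1 score from a judge model or a tool like Promptfoo (the threshold is illustrative):

```python
def eval_diff(last_week: dict[str, float], this_week: dict[str, float],
              threshold: float = 0.05) -> dict[str, list[str]]:
    """Compare two eval runs keyed by question ID; scores are 0..1."""
    report: dict[str, list[str]] = {"improved": [], "regressed": [], "unchanged": []}
    for qid, new_score in this_week.items():
        old_score = last_week.get(qid, 0.0)  # new questions count as improvements
        if new_score - old_score > threshold:
            report["improved"].append(qid)
        elif old_score - new_score > threshold:
            report["regressed"].append(qid)
        else:
            report["unchanged"].append(qid)
    return report
```

The "regressed" bucket is the one that matters: it's what stops a prompt tweak or model upgrade that demos well from silently breaking questions you'd already won.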

Stack

We pick the model and the surrounding tooling per use case. Streaming web chat needs different infrastructure from overnight document processing, and voice agents need different infrastructure from both.

  • OpenAI
    GPT-4 / GPT-5 / o-series
  • Anthropic
    Claude 4.x family
  • LangChain
    orchestration
  • LlamaIndex
    RAG pipelines
  • Pinecone
    vector store
  • pgvector
    Postgres-native
  • Vercel AI SDK
    streaming UIs
  • Voiceflow
    no-code agents
  • Twilio / Vapi
    voice agents
  • Helicone / Langfuse
    observability
  • Ragas / Promptfoo
    evals
  • Custom code
    Python / TypeScript
Real-world

What this looks like in practice

Figures are typical ranges from comparable engagements.

  • 67%
    Tier-1 ticket deflection

    Customer support agent for a UK fintech

    A 90-person fintech was answering 1,400 support tickets a week, two-thirds of which were variations on the same 30 questions about KYC, statements, and limits.

    We built a Claude-powered support agent grounded in the help centre and policy docs, with hard escalation rules for anything involving money movement. Two months in, it resolves 67% of incoming tickets with no human in the loop and a 4.6/5 satisfaction score.

  • 12 → 1
    Hours per RFP response

    Internal knowledge agent for a US consultancy

    Senior consultants were spending 12 hours per week digging through past projects, case studies, and proposals to assemble RFP responses.

    An internal Claude agent indexed 8 years of past proposals, case studies, and SOWs. It now drafts the first version of every RFP response in under a minute, with citations to the source documents. Consultants spend 1–2 hours editing instead of 12 hours building from scratch.

  • 24/7
    Voice agent coverage

    Inbound voice agent for a UK trades business

    An emergency-plumbing business was missing 30% of after-hours calls — and losing the jobs to competitors who answered.

    A voice agent (Vapi + GPT-4) now handles after-hours calls: triages urgency, captures address and issue, books a slot in the scheduling system, and pages the on-call engineer if it's a true emergency. After-hours conversion rate went from 22% to 71%.

Frequently asked questions

Will it hallucinate?
Not if it's built right. The standard pattern that hallucinates is "give the model a question and hope." Our pattern is retrieval-augmented: the model can only answer using passages we've fetched from your real documents, and every answer cites the source. It will refuse questions it doesn't have grounded data for — which is the correct behaviour, even if it feels unfamiliar.
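A minimal sketch of that retrieval-augmented pattern, assuming the passages have already been fetched from the vector store (the prompt wording is illustrative):

```python
def build_grounded_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Build a prompt that restricts the model to retrieved passages.

    `passages` is a list of (source_id, text) pairs from the vector store.
    """
    context = "\n\n".join(f"[{src}] {text}" for src, text in passages)
    return (
        "Answer ONLY from the sources below. Cite the [source id] for every claim. "
        "If the sources don't contain the answer, say you don't know.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )
```

Because every claim carries a source ID, the UI can render each citation as a link back to the original document, and the eval suite can check that cited sources actually support the answer.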
OpenAI or Claude — which should we use?
Both are excellent and the gap is smaller than vendor marketing suggests. Claude tends to do better on long-document reasoning, careful instruction-following, and cases where you want it to refuse or hand off. OpenAI often wins on agentic tool use, multimodal (vision), and the o-series for complex reasoning. We frequently ship with both behind a single API and route per query type. We have no reseller relationship with either.
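Once a classifier has labelled the incoming query, the per-type routing can be as small as a dispatch table. The mapping below is illustrative — in practice it gets tuned against the eval set, not vendor marketing:

```python
def pick_model(query_type: str) -> str:
    """Route a classified query to the provider that tends to handle it best."""
    routes = {
        "long_document": "claude",  # long-context reasoning, careful refusals
        "tool_use": "gpt",          # agentic tool calling
        "vision": "gpt",            # multimodal input
    }
    # Default to the model that refuses and hands off most reliably.
    return routes.get(query_type, "claude")
```

Keeping the routing in one place also means a model upgrade is a one-line change you can A/B, rather than a rewrite.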
What happens to our private data?
By default, both OpenAI's API and Anthropic's API contractually do not train on your data. Your documents stay in your vector store (we usually self-host pgvector inside your infrastructure for sensitive cases). For regulated industries we can run open models — Llama, Mistral — on your own infrastructure with no third-party API calls at all.
Can we run it on-prem or in our own cloud?
Yes. Frontier models (GPT, Claude) call out to API endpoints — but everything around them (the vector store, the orchestration, the logging, the UI) runs in your infrastructure. For air-gapped deployments we use open-weight models (Llama 3.x, Mistral, Qwen) deployed on your own GPU instances.
How do you handle GDPR and SOC 2 requirements?
Standard practice on every engagement: PII redaction before any data goes to a model, regional API endpoints (EU for UK/EU clients), a Data Processing Agreement on your terms, and full audit logging of every query and response. We can produce the documentation auditors need — we've been through SOC 2 reviews on the customer side.
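The redaction step can be sketched with regular expressions for the common cases; a real deployment layers an NER-based redactor on top, and the patterns below are illustrative:

```python
import re

# Placeholder patterns for the two most common PII types in support traffic.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{8,}\d"),
}

def redact(text: str) -> str:
    """Replace emails and phone numbers with labelled placeholders
    before the text is logged or sent to a model API."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Redacting before the API call — not just before logging — is what keeps PII out of third-party systems entirely, which is usually the question auditors actually ask.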
What's the smallest agent project you'll take?
A 2–3 week proof-of-value: one well-scoped use case, one source corpus, one channel (web widget, Slack, or internal tool). Usually £8k–£15k / $10k–$18k. If it works we extend; if it doesn't we tell you why and stop.

See what an agent grounded in your data could do.

Send us 10–20 questions a customer or colleague typically asks, and we'll build a working prototype against your real documents in a week — flat fee, refunded if you don't see a path to value.