2026-04-03·6 min read·rag.art team

How to train an AI chatbot on your website in 2 minutes

An honest, step-by-step walkthrough of ingesting a public website into a RAG chatbot — what works instantly, what needs a second pass, and when to stop.

onboarding · ingestion · guide

'Train an AI chatbot on your website' is a phrase most platforms will happily agree with. The polite version of the answer is that nothing is actually being trained — your website becomes a corpus, the corpus becomes embeddings, and the embeddings sit in a vector index the chatbot queries at response time. No model fine-tuning, no weeks of GPU work. Just a well-run ingest.

Done right, the whole loop takes about two minutes for a small site. Here's what's actually happening under the hood and where the hidden costs appear.

What happens when you paste a URL

  1. A crawler fetches the root URL and follows internal links up to a configured depth (typically 2 or 3).
  2. Each page's text content is extracted — stripping nav, footer, scripts.
  3. The content is split into chunks of ~500 tokens, with overlap so mid-sentence splits don't drop context.
  4. Each chunk is embedded by an embedding model (OpenAI's text-embedding-3-small, Cohere, or a local model).
  5. The vectors land in a vector database (Postgres + pgvector, Pinecone, or similar) with metadata: URL, page title, chunk index.
  6. Your bot is wired up to query this index, take the top-k results, and pass them as context to a language model.
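Steps 3 through 5 above can be sketched in a few lines. A minimal illustration, using whitespace-separated words as a stand-in for real tokenizer tokens; the 500/50 sizes match the figures above but are otherwise arbitrary:

```python
# Sketch of the chunking step: fixed-size windows with overlap, so a
# sentence that straddles a split survives intact in at least one chunk.
# Word counts approximate tokens here; a real pipeline uses a tokenizer.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[dict]:
    """Split text into overlapping chunks with per-chunk metadata."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for i, start in enumerate(range(0, len(words), step)):
        piece = words[start:start + chunk_size]
        if not piece:
            break
        chunks.append({"chunk_index": i, "text": " ".join(piece)})
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk would then be embedded and stored alongside its URL, page title, and `chunk_index`, exactly as in step 5.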

Why it looks instant (and when it isn't)

For a 30-page marketing site, the ingest runs in seconds. For a 5,000-page documentation site, it's minutes — and if you want it done right, you'll run it twice: once for the first pass, and again after you look at what got chunked badly. That second pass is the one most vendors don't show you.

The four things that ruin ingest quality

1. JavaScript-only content

If your pricing is hidden behind a React component that fetches from an API, a basic crawler sees an empty div. You need a headless-browser-powered crawler (we use Firecrawl) that renders the page before extracting.
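One cheap way to catch these pages before they poison the index is to compare extracted text length against raw HTML length. A heuristic sketch using only Python's stdlib parser; the 2% threshold is an assumption, not a standard:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style contents."""
    def __init__(self):
        super().__init__()
        self._skip = 0
        self.parts = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def looks_js_only(html: str, min_ratio: float = 0.02) -> bool:
    """Flag pages whose visible text is a tiny fraction of the raw HTML,
    a cheap signal that the content is rendered client-side."""
    p = _TextExtractor()
    p.feed(html)
    text = " ".join(p.parts)
    return len(text) / max(len(html), 1) < min_ratio
```

Pages this flags are the ones to route through the headless-browser crawler instead of the basic fetcher.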

2. Tables and structured data

Chunking naively through a pricing table turns '$49 / mo — 5,000 messages' into three disconnected fragments. Good chunkers detect tables and keep rows together. Bad ones don't, and you'll discover this when the bot confidently tells a customer the wrong price.
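A table-aware splitter doesn't need to be clever; it just has to refuse to cut through rows. A sketch for Markdown-style tables (lines starting with `|`); real chunkers handle HTML tables and other layouts too:

```python
def split_blocks(text: str) -> list[str]:
    """Split text on blank lines, but keep consecutive table rows
    (lines starting with '|') together as one atomic block."""
    blocks, para, table = [], [], []
    def flush(buf):
        if buf:
            blocks.append("\n".join(buf))
            buf.clear()
    for line in text.splitlines():
        if line.lstrip().startswith("|"):   # table row: never split these apart
            flush(para)
            table.append(line)
        elif line.strip() == "":            # blank line ends the current paragraph
            flush(table)
            flush(para)
        else:                               # ordinary prose line
            flush(table)
            para.append(line)
    flush(table)
    flush(para)
    return blocks
```

Feeding these blocks to the chunker (instead of raw text) is what keeps '$49 / mo — 5,000 messages' in one piece.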

3. Boilerplate pollution

Site footers, cookie banners, 'follow us on LinkedIn' — if these end up in chunks, the bot learns to quote them. You want extraction rules that drop nav, aside, footer, and anything under 20 words that repeats across pages.
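The repeated-short-block rule is straightforward once pages are split into blocks. A sketch, assuming each page is already a list of text blocks; the 20-word and two-page thresholds match the rule above but are tunable:

```python
from collections import Counter

def drop_repeated_boilerplate(pages: dict[str, list[str]],
                              max_words: int = 20,
                              min_pages: int = 2) -> dict[str, list[str]]:
    """Drop short blocks (< max_words words) that recur on >= min_pages
    pages: footers, cookie banners, 'follow us' links."""
    counts = Counter()
    for blocks in pages.values():
        for b in set(blocks):   # count each block once per page
            counts[b] += 1
    def keep(b: str) -> bool:
        return len(b.split()) >= max_words or counts[b] < min_pages
    return {url: [b for b in blocks if keep(b)] for url, blocks in pages.items()}
```

Long blocks survive even when repeated (legal disclaimers, for instance, may be genuinely useful context); only short, site-wide chrome gets dropped.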

4. Stale re-crawls

If your site updates weekly and your corpus doesn't, the bot will confidently quote last quarter's pricing. The fix is a scheduled re-ingest (daily for most sites, hourly for inventory-sensitive content).
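A scheduled re-ingest doesn't have to re-embed everything. Hashing each page's content and comparing against the previous crawl lets the nightly job touch only what changed. A minimal sketch; the hash store here is an in-memory dict, where a real pipeline would persist it alongside the vector index:

```python
import hashlib

def pages_to_reingest(fetched: dict[str, str],
                      seen_hashes: dict[str, str]) -> list[str]:
    """Compare a content hash per URL against the last crawl; return only
    the URLs whose content changed, and update the hash store in place."""
    changed = []
    for url, html in fetched.items():
        digest = hashlib.sha256(html.encode()).hexdigest()
        if seen_hashes.get(url) != digest:
            changed.append(url)
            seen_hashes[url] = digest
    return changed
```

On an unchanged site this returns an empty list, so the scheduled job costs almost nothing until your pricing page actually moves.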

A practical two-minute workflow

  1. Pick your highest-traffic 20 pages. Not the sitemap. The ones that matter.
  2. Paste your root URL into your platform of choice, with crawl depth 2.
  3. Ask the bot three real questions your team answers every week.
  4. If an answer is wrong, check the citation. Usually the problem is in the chunk, not the prompt.
  5. Schedule a daily re-crawl. Move on.
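Step 4's "check the citation" is just looking at what top-k retrieval returned. A dependency-free sketch of that retrieval step, with toy 2-dimensional vectors standing in for real embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float], index: list[tuple[dict, list[float]]],
          k: int = 3) -> list[dict]:
    """index: list of (metadata, vector) pairs. Returns the k chunks the
    bot would cite, best first -- the place to look when an answer is wrong."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [meta for meta, _ in scored[:k]]
```

If the right page never appears in the top-k metadata, the fix is upstream in chunking or extraction, not in the prompt.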

That's the honest two-minute version. Any platform that tells you this takes days is bolting RAG onto a pipeline that wasn't designed for it. Any platform that claims zero work is overselling.
