RAG Architecture for Salesforce Data: Connecting Your Org to LLMs

You want an AI agent that actually knows your Salesforce data. Not in a "trained on the internet" way. In a "here's the latest activity on that account and what the last meeting notes said" way.

That's RAG. Retrieval-Augmented Generation. Instead of fine-tuning a model on your data (expensive, slow, stale the moment it's done), you retrieve relevant records at query time and inject them into the prompt. The LLM responds grounded in your actual data.

I've built a few of these for clients now. Here's the architecture, the decisions you'll make, and the stuff that bit me.

Four Stages

The pipeline: extract data from Salesforce, chunk and embed it into vectors, store the vectors, then retrieve relevant ones at query time and pass them to the LLM alongside the user's question.

Each stage has trade-offs. Let me go through them.

What to Extract (and What to Skip)

Not everything in your org belongs in a RAG pipeline. You want records that carry the kind of context a person would reference when answering a question.

Good candidates: Account and Contact records, Opportunity data (stage, amount, next steps), activity history (Task subjects, descriptions, meeting notes), Case records, Knowledge articles, and any custom objects that hold narrative data like project descriptions or client notes. File attachments and file content (the ContentVersion data behind each ContentDocument) belong here too, if you need document search.

Skip high-volume transactional data (individual line items, log records) unless you're aggregating first. Skip records that are mostly picklists with no description fields. Skip binary or encoded fields.

For extraction: the REST API works for structured record data. Bulk API for your initial full load if you're dealing with millions of records. For ongoing sync, Change Data Capture or Platform Events let you stream changes in near-real-time so your embeddings stay current.
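A minimal sketch of the incremental REST pull, assuming you already have an access token and an HTTP helper. `get_json` is a hypothetical callable (wrap `requests` or `urllib` yourself), the API version is a placeholder, and filtering on `SystemModstamp` rather than `LastModifiedDate` is my choice here because it also moves on system-driven updates. The `/services/data/vXX.X/query` endpoint and `nextRecordsUrl` pagination are the standard REST query pattern.

```python
from datetime import datetime, timezone
from typing import Callable, Iterator
from urllib.parse import urlencode


def build_incremental_soql(sobject: str, fields: list[str], since: datetime) -> str:
    """SOQL for records changed after `since`, oldest first."""
    # SOQL datetime literals are unquoted ISO 8601 values, e.g. 2026-03-28T00:00:00Z
    stamp = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # dict.fromkeys de-duplicates while keeping order (in case `fields` includes Id)
    cols = ", ".join(dict.fromkeys(["Id", "SystemModstamp", *fields]))
    return (
        f"SELECT {cols} FROM {sobject} "
        f"WHERE SystemModstamp > {stamp} ORDER BY SystemModstamp"
    )


def query_url(instance_url: str, api_version: str, soql: str) -> str:
    """URL for the REST query resource, with the SOQL URL-encoded."""
    return f"{instance_url}/services/data/v{api_version}/query?{urlencode({'q': soql})}"


def fetch_all(
    first_url: str, get_json: Callable[[str], dict], instance_url: str
) -> Iterator[dict]:
    """Yield every record, following nextRecordsUrl until the last page."""
    url = first_url
    while url:
        page = get_json(url)  # hypothetical helper: GET with auth header, parse JSON
        yield from page["records"]
        next_path = page.get("nextRecordsUrl")  # relative path; absent on the last page
        url = f"{instance_url}{next_path}" if next_path else None
```

Persist the highest `SystemModstamp` you processed and pass it as `since` on the next run.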

If you're running Data Cloud, you can use it to unify Salesforce and external data before feeding the pipeline. Useful when you need CRM data alongside data warehouse or third-party sources.

Chunking: This Is Where Retrieval Quality Lives or Dies

Raw records need to become text chunks before you can generate embeddings. The chunking strategy directly affects how good your results are, so don't just wing this part.

Record-level chunking is the simplest. Each Salesforce record becomes one chunk. Serialize the relevant fields into readable text:

Account: Acme Corp
Industry: Manufacturing
Annual Revenue: $50M
Description: Midwest manufacturing company specializing in precision metal components.
Last Activity: 2026-03-28 - Quarterly business review with VP of Operations.
Open Opportunities: 2 ($125K total pipeline)

Works well when the full context fits within your embedding model's token window (512 to 8,192 tokens depending on the model).
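Here's a minimal sketch of record-level serialization in Python. The field list and the `$50M` revenue formatting are illustrative choices, and the ~4 characters per token check is only a rough heuristic for English text; use your embedding model's own tokenizer when you need the real count.

```python
from typing import Any, Callable, Optional


def _money(value: float) -> str:
    # 50_000_000 -> "$50M"
    return f"${value / 1_000_000:g}M"


# (label, API field name, optional formatter); the field list is illustrative
ACCOUNT_FIELDS: list[tuple[str, str, Optional[Callable[[Any], str]]]] = [
    ("Industry", "Industry", None),
    ("Annual Revenue", "AnnualRevenue", _money),
    ("Description", "Description", None),
]


def serialize_account(record: dict) -> str:
    """Turn one Account record into a readable text chunk."""
    lines = [f"Account: {record['Name']}"]
    for label, field_name, fmt in ACCOUNT_FIELDS:
        value = record.get(field_name)
        if value in (None, ""):
            continue  # skip empty fields so they don't dilute the embedding
        lines.append(f"{label}: {fmt(value) if fmt else value}")
    return "\n".join(lines)


def roughly_fits(text: str, max_tokens: int = 512) -> bool:
    # ~4 characters per token is a rough heuristic, not a tokenizer
    return len(text) / 4 <= max_tokens
```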

Field-level chunking breaks individual fields into separate chunks. It makes sense for long-text fields (Case descriptions, Knowledge article bodies) that exceed the token window. Each chunk gets metadata tags (record ID, object type, field name) so you can trace it back to the source.
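A sketch of field-level chunking with overlapping windows. Word count stands in for tokens here (an approximation), and the 200-word window with a 40-word overlap are placeholder values to tune, not recommendations for any specific model. Each chunk carries the record ID, object type, and field name so a hit can be traced back to its source.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    record_id: str
    object_type: str
    field_name: str
    chunk_index: int  # position within the field, useful for ordering and stable IDs


def chunk_field(
    text: str,
    record_id: str,
    object_type: str,
    field_name: str,
    max_words: int = 200,
    overlap: int = 40,
) -> list[Chunk]:
    """Split one long-text field into overlapping word windows."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    chunks: list[Chunk] = []
    start = 0
    while start < len(words):
        window = words[start : start + max_words]
        chunks.append(Chunk(" ".join(window), record_id, object_type, field_name, len(chunks)))
        if start + max_words >= len(words):
            break  # this window reached the end of the field
        # Step forward, re-including the last `overlap` words so context spans chunk edges
        start += max_words - overlap
    return chunks
```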

Parent-child chunking groups related records. Instead of embedding an Account and its Contacts separately, you build a composite chunk: Account details plus a summary of related Contacts, recent Opportunities, last activities. Richer context per hit, but more extraction logic to build.
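A sketch of a parent-child composite chunk for an Account. It assumes you've already queried the related Contacts, Opportunities, and activities (for example with relationship subqueries) into plain dicts keyed by standard API field names; the `max_items` cap of 3 is an arbitrary choice to keep the chunk inside the embedding window.

```python
def build_account_composite(
    account: dict,
    contacts: list[dict],
    opportunities: list[dict],
    activities: list[dict],
    max_items: int = 3,
) -> str:
    """One chunk: the Account plus a short summary of its related records."""
    lines = [f"Account: {account['Name']}"]
    if account.get("Description"):
        lines.append(f"Description: {account['Description']}")

    if contacts:
        people = ", ".join(
            f"{c['Name']} ({c.get('Title') or 'no title'})" for c in contacts[:max_items]
        )
        lines.append(f"Key Contacts: {people}")

    open_opps = [o for o in opportunities if not o.get("IsClosed")]
    if open_opps:
        total = sum(o.get("Amount") or 0 for o in open_opps)
        lines.append(f"Open Opportunities: {len(open_opps)} (${total:,.0f} total pipeline)")
        for o in open_opps[:max_items]:
            next_step = o.get("NextStep") or "none"
            lines.append(f"- {o['Name']}: {o['StageName']}, next step: {next_step}")

    # Newest activities first; ISO date strings sort chronologically
    recent = sorted(activities, key=lambda a: a["ActivityDate"], reverse=True)
    for a in recent[:max_items]:
        lines.append(f"Activity {a['ActivityDate']}: {a['Subject']}")
    return "\n".join(lines)
```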

For most Salesforce implementations, I start with record-level chunking and parent-child enrichment on key objects like Accounts and Opportunities. Good balance of quality and complexity.

Embeddings and Storage

Text chunks become vectors via an embedding model. I've used OpenAI's text-embedding-3-small and text-embedding-3-large most. They're solid general-purpose embeddings and the API is straightforward. Cohere's Embed v3 models are worth a look if you need multilingual support. Open-source models via Hugging Face can cut costs at scale, but you're running your own infra.

For storage, the options I've actually used:

Pinecone is managed, fast to set up, and good for getting started. I use this when the client wants minimal ops overhead.

pgvector (PostgreSQL extension) is my pick when the client already has Postgres infrastructure. Keeps the stack simple. One less service to manage.

Weaviate supports hybrid search (vector + keyword) out of the box, which is a real advantage I'll get to in a second.

Whatever you choose, store metadata alongside each vector: the Salesforce record ID, object type, last modified date, and any fields you want to filter on at retrieval time. This metadata is critical. "Only return vectors from Account records" or "only return records modified in the last 90 days" requires it.
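To make the metadata point concrete, here is a minimal in-memory sketch of an entry's shape and the two example filters. Names like `chunk_id` and `extra` are hypothetical; a real vector database takes the same conditions as a filter on the query, in whatever syntax that product uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class VectorEntry:
    chunk_id: str            # e.g. "<recordId>:<chunkIndex>", a convention used only in this sketch
    vector: list[float]
    record_id: str           # Salesforce record ID, for tracing back and access checks
    object_type: str         # e.g. "Account", "Case"
    last_modified: datetime  # timezone-aware
    extra: dict = field(default_factory=dict)  # any other fields you want to filter on


def matches(
    entry: VectorEntry,
    object_type: Optional[str] = None,
    modified_within: Optional[timedelta] = None,
    now: Optional[datetime] = None,
) -> bool:
    """True if the entry passes the metadata filters (None means 'no filter')."""
    if object_type is not None and entry.object_type != object_type:
        return False
    if modified_within is not None:
        now = now or datetime.now(timezone.utc)
        if entry.last_modified < now - modified_within:
            return False
    return True
```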

Retrieval: Making It Actually Work

User asks a question. Their query gets embedded with the same model. Vector DB returns the top-k most similar chunks (I usually start with 5). Those chunks get injected into the LLM prompt. LLM responds.
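The retrieval loop above, sketched in pure Python with brute-force cosine similarity. `embed` is a hypothetical stand-in for your embedding client; the important constraint is that it's the same model you used at indexing time. A vector database replaces the linear scan with an approximate nearest-neighbor index.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(
    query: str,
    embed: Callable[[str], list[float]],
    corpus: list[tuple[str, list[float]]],
    k: int = 5,
) -> list[tuple[float, str]]:
    """Return the k chunks most similar to the query as (score, text), best first."""
    query_vec = embed(query)  # must be the same model used to embed the corpus
    scored = [(cosine(query_vec, vec), text) for text, vec in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```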

The prompt pattern I use:

You are an AI assistant with access to CRM data. Use the following context
to answer the user's question. If the context doesn't contain enough
information to answer, say so.
 
Context:
{retrieved_chunks}
 
User question: {query}
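A sketch of filling that template. Labeling each chunk with its source record ID is my addition, not part of the pattern above; it lets you ask the model to cite sources. The empty-context marker string is also an assumption.

```python
PROMPT_TEMPLATE = """You are an AI assistant with access to CRM data. Use the following context
to answer the user's question. If the context doesn't contain enough
information to answer, say so.

Context:
{retrieved_chunks}

User question: {query}"""


def build_prompt(query: str, chunks: list[tuple[str, str]]) -> str:
    """chunks is a list of (record_id, chunk_text) pairs, best match first."""
    if chunks:
        context = "\n\n".join(
            f"[{i}] (source: {record_id})\n{text}"
            for i, (record_id, text) in enumerate(chunks, start=1)
        )
    else:
        # Explicit marker so the model falls back to "not enough information"
        context = "(no relevant records found)"
    # str.format only parses braces in the template, so braces inside chunk text are safe
    return PROMPT_TEMPLATE.format(retrieved_chunks=context, query=query)
```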

Three things that made a real difference in retrieval quality for me:

Similarity thresholds. Not every result from vector search is useful. I set a cosine similarity floor and drop anything below it. 0.75 is my usual starting point, but score distributions differ between embedding models, so calibrate the floor against real queries. Stuffing irrelevant context into the prompt actively hurts response quality. I learned this one the hard way after an agent confidently answered a question about "Acme Corp" using context from "Acme Medical" because the names were close enough in vector space.
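A sketch of the similarity floor, with brute-force cosine scoring included so it runs on its own. The 0.75 default mirrors the starting value above and is a parameter to calibrate, not a constant. An empty result is a legitimate outcome: pass it through so the prompt's "say so" instruction kicks in instead of padding the context with weak matches.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_context(
    query_vec: Sequence[float],
    corpus: list[tuple[str, list[float]]],
    k: int = 5,
    floor: float = 0.75,
) -> list[tuple[float, str]]:
    """Top-k chunks whose cosine similarity clears the floor; may return fewer than k, or none."""
    scored = [(cosine(query_vec, vec), text) for text, vec in corpus]
    kept = [pair for pair in scored if pair[0] >= floor]  # drop weak matches entirely
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]
```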

Metadata filtering. If the user asks about a specific account, filter vectors to that Account ID first, then run similarity search within that subset. Faster, more accurate, fewer irrelevant results.
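Pre-filter then rank, sketched with plain dicts. The `account_id` metadata key is a naming assumption (for child records like Contacts or Opportunities you'd store the parent AccountId on each chunk). In pgvector the equivalent is a `WHERE` clause on the metadata column alongside `ORDER BY embedding <=> $1`; managed vector DBs take a filter argument on the query.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_within_account(
    query_vec: Sequence[float], entries: list[dict], account_id: str, k: int = 5
) -> list[dict]:
    """Filter to one Account's chunks first, then rank only that subset by similarity."""
    subset = [e for e in entries if e["account_id"] == account_id]
    subset.sort(key=lambda e: cosine(query_vec, e["vector"]), reverse=True)
    return subset[:k]
```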

Hybrid search. Combine vector similarity with keyword search. Vectors are great for semantic queries ("what's the status of the Acme deal?") but can miss exact matches on specific terms or IDs. Keyword search catches those. Weaviate does this natively. For other vector DBs, implement it at the application layer.
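One way to do hybrid search at the application layer, assuming you run the vector query and a keyword query separately and get back ranked lists of chunk IDs. The merge step here is reciprocal rank fusion, a common technique swapped in for illustration (Weaviate's own fusion options may differ), and the naive term-overlap ranker stands in for a real keyword engine such as Postgres full-text search. k=60 is the conventional RRF constant.

```python
import re
from collections import defaultdict


def keyword_rank(query: str, docs: dict[str, str]) -> list[str]:
    """Rank doc IDs by how many query terms appear verbatim (case-insensitive); drop docs with no hits."""
    terms = set(re.findall(r"\w+", query.lower()))
    hits = {
        doc_id: len(terms & set(re.findall(r"\w+", text.lower())))
        for doc_id, text in docs.items()
    }
    return sorted((d for d, n in hits.items() if n > 0), key=lambda d: hits[d], reverse=True)


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)
```

Documents found by both retrievers rise to the top, while an exact ID match that vector search missed still makes the merged list.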

Keeping It Fresh

Stale embeddings are the silent killer. If your vectors reflect last month's data and someone asks about a deal that closed yesterday, the agent gives a wrong answer with full confidence. No hedging, no caveat. Just wrong.

Batch sync (daily or hourly): query records modified since last sync, re-chunk, re-embed, upsert. Simple. Fine when near-real-time isn't critical.
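A sketch of one batch pass. Every callable is a hypothetical hook you'd wire to your own extractor, chunker, embedding client, and vector store, and `SystemModstamp` is assumed to already be parsed into a `datetime`. Persist the returned watermark only after the pass succeeds, so a failed run simply repeats; keying chunks by a stable ID keeps the repeat idempotent.

```python
from datetime import datetime
from typing import Callable, Iterable


def run_batch_sync(
    last_sync: datetime,
    fetch_modified: Callable[[datetime], Iterable[dict]],
    to_chunks: Callable[[dict], list[tuple[str, str]]],
    embed: Callable[[str], list[float]],
    delete_by_record: Callable[[str], None],
    upsert: Callable[[str, list[float], dict], None],
) -> tuple[datetime, int]:
    """One incremental pass. Returns (new watermark, number of chunks written)."""
    watermark = last_sync
    written = 0
    for record in fetch_modified(last_sync):  # records with SystemModstamp > last_sync
        # Drop the record's old chunks first: if it now yields fewer chunks, stale ones would linger
        delete_by_record(record["Id"])
        for chunk_id, text in to_chunks(record):
            upsert(chunk_id, embed(text), {"record_id": record["Id"]})
            written += 1
        # Advance the watermark using Salesforce's timestamps, not the local clock
        watermark = max(watermark, record["SystemModstamp"])
    return watermark, written
```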

Event-driven sync: Change Data Capture streams record changes to a middleware layer (I usually use AWS Lambda) that re-embeds and upserts affected vectors in near-real-time. More moving parts, but the data is typically only minutes stale.
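A sketch of the routing logic inside that middleware function. It assumes an upstream subscriber (for example a Pub/Sub API client or an event relay) has already decoded each change event into a dict; how events actually reach Lambda, and the `reembed`, `delete`, and `resync_object` hooks, are assumptions. The header fields used here (`entityName`, `changeType`, `recordIds`) come from the standard CDC `ChangeEventHeader`.

```python
from typing import Callable


def handle_change_events(
    events: list[dict],
    reembed: Callable[[str, str], None],
    delete: Callable[[str], None],
    resync_object: Callable[[str], None],
) -> list[str]:
    """Route decoded Change Data Capture events; returns the record IDs acted on."""
    touched: list[str] = []
    for event in events:
        header = event["ChangeEventHeader"]
        entity, change = header["entityName"], header["changeType"]
        if change == "GAP_OVERFLOW":
            # Too many changes to list individually: treat as a signal to resync that object
            resync_object(entity)
            continue
        for record_id in header["recordIds"]:
            if change in ("DELETE", "GAP_DELETE"):
                delete(record_id)  # remove every chunk stored for this record
            else:
                # CREATE/UPDATE/UNDELETE and other GAP_* events: UPDATE payloads carry only
                # changed fields, so re-fetch the full record before re-chunking
                reembed(entity, record_id)
            touched.append(record_id)
    return touched
```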

Hybrid (what I usually recommend): batch for the full corpus, event-driven for high-priority objects like Accounts, Opportunities, and Cases.

The Security Problem

Here's the thing Salesforce architects need to internalize: your RAG pipeline runs outside the Salesforce security model. The vector database has no concept of profiles, permission sets, or sharing rules. If your agent serves users with different data access levels, you need to build authorization into the retrieval layer.

User-scoped queries. Before hitting the vector store, check what records the current user can access in Salesforce (SOQL with user context), then filter results to those record IDs. Most accurate. Adds a Salesforce API call to every retrieval.
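A sketch of the user-scoped variant. It assumes the visibility query runs with the end user's own credentials (so Salesforce applies their sharing) and that each vector entry stores a `record_id`; `run_soql_as_user` is a hypothetical helper. An unbounded `SELECT Id` is only reasonable for small objects, so in practice you'd narrow it, for example to the Accounts mentioned in the conversation.

```python
import math
from typing import Callable, Iterable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def accessible_ids(run_soql_as_user: Callable[[str], Iterable[dict]], sobject: str) -> set[str]:
    """IDs of `sobject` records the end user can see.

    run_soql_as_user (hypothetical) calls the REST query endpoint with the user's own
    OAuth token, handling pagination, so their sharing rules shape the result.
    """
    return {row["Id"] for row in run_soql_as_user(f"SELECT Id FROM {sobject}")}


def user_scoped_search(
    query_vec: Sequence[float], entries: list[dict], allowed: set[str], k: int = 5
) -> list[dict]:
    """Restrict candidates to records the user can access, then rank by similarity."""
    candidates = [e for e in entries if e["record_id"] in allowed]
    candidates.sort(key=lambda e: cosine(query_vec, e["vector"]), reverse=True)
    return candidates[:k]
```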

Pre-filtered indexes. Separate vector indexes per role or profile. Each index only contains records that role can see. Simpler retrieval, but more storage and sync complexity.

Post-retrieval filtering. Retrieve first, then validate each result against Salesforce sharing rules before injecting into the prompt. Adds latency, but works when you can't predict query scope in advance.
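A sketch of post-retrieval validation using the `UserRecordAccess` object, which reports whether a given user can read given records. This assumes an integration user runs the SOQL on the end user's behalf through a hypothetical `run_soql` helper, and that each result dict carries its source `record_id`. Because the check happens after ranking, you may want to over-fetch (say, top-10) so enough results survive filtering.

```python
import re
from typing import Callable, Iterable

# UserRecordAccess caps how many RecordIds one query may include (200 per the docs I've used; verify)
MAX_IDS_PER_QUERY = 200
_SF_ID = re.compile(r"[A-Za-z0-9]{15}(?:[A-Za-z0-9]{3})?")


def access_soql(user_id: str, record_ids: list[str]) -> str:
    # IDs are interpolated into SOQL, so validate their shape first
    for value in (user_id, *record_ids):
        if not _SF_ID.fullmatch(value):
            raise ValueError(f"not a Salesforce ID: {value!r}")
    ids = ", ".join(f"'{rid}'" for rid in record_ids)
    return (
        "SELECT RecordId, HasReadAccess FROM UserRecordAccess "
        f"WHERE UserId = '{user_id}' AND RecordId IN ({ids})"
    )


def filter_readable(
    results: list[dict],
    user_id: str,
    run_soql: Callable[[str], Iterable[dict]],
) -> list[dict]:
    """Keep only retrieved chunks whose source record the user can read."""
    # Unique IDs in retrieval order; assumes 18-character IDs everywhere so comparisons match
    ids = list(dict.fromkeys(r["record_id"] for r in results))
    readable: set[str] = set()
    for i in range(0, len(ids), MAX_IDS_PER_QUERY):
        for row in run_soql(access_soql(user_id, ids[i : i + MAX_IDS_PER_QUERY])):
            if row["HasReadAccess"]:
                readable.add(row["RecordId"])
    return [r for r in results if r["record_id"] in readable]
```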

None of these are as clean as Agentforce's native security model. That's a real trade-off. But for cross-system use cases where Agentforce isn't the right tool, this is the architecture.

Where to Start

Pick one object with rich text data. Accounts with descriptions and activity history is a good first target. Extract a few hundred records. Chunk at the record level. Generate embeddings with OpenAI and store them in Pinecone's free tier. Build a simple retrieval script that takes a question, fetches chunks, and passes them to Claude or GPT.

Test with questions your team would actually ask. You'll learn more from 500 records and real queries than from planning the perfect architecture on a whiteboard.

If you want help designing a RAG pipeline for your Salesforce data, let's talk.