The views in this post are my own and do not represent my employer. Technical details are generalized and do not disclose customer, proprietary, or confidential information.
Starting with the right problem
In the first post in this series, I wrote about why I am building a Private AI Foundation instead of only calling public AI APIs. This post is the next step: what should the first reference architecture actually look like?
I do not think the first useful version of PAIF is an agent platform. I also do not think it should begin by trying to run every possible AI workload locally. That is too broad, too expensive, and too easy to turn into a science project.
The first practical use case is an internal knowledge system: a way for teams to ask questions over private documents, runbooks, architecture notes, troubleshooting writeups, tickets, source repositories, and operational knowledge.
That is a familiar problem for infrastructure teams. Most organizations already have the knowledge. It is just scattered across SharePoint, internal wikis, GitLab, ServiceNow, design documents, diagrams, and incident notes.
The design questions
I am using this post to answer a specific set of design questions, not to pretend the whole platform is finished.
- What is the first useful workload?
- What needs to stay private?
- What should run locally, and what can use external AI?
- Where does Kubernetes fit in a VCF-based design?
- Where should the knowledge live?
- How should policy decide local versus external routing?
- How do users and applications consume the platform?
- What visibility do operators need?
Those questions matter more than the first tool choice. They define the operating model PAIF has to support.
The private AI maturity model
One of the mistakes I see in private AI conversations is treating privacy as a yes-or-no property: either everything runs locally, or the platform is not really private.
A more useful way to look at it is to ask what stays private, with each tier adding to the one before it.
| Tier | What stays private |
|---|---|
| Tier 1 | Knowledge base |
| Tier 2 | Knowledge processing |
| Tier 3 | Question answering |
| Tier 4 | Full reasoning |
I am intentionally using "knowledge base" here because that maps better to infrastructure language. In AI and NLP language, a collection of documents is often called a corpus. That term is accurate, but it is not how most infrastructure teams talk.
PAIF v1 should target Tier 3. The knowledge base stays private. The processing stays private. Sensitive question answering can stay private. Full frontier reasoning remains hybrid.
That is an important distinction. Private AI does not have to mean every token is generated on-premises. It means the platform can keep sensitive knowledge and sensitive answers inside the environment when the data classification requires it.
The three-plane architecture
The cleanest way I have found to think about PAIF is as three planes: retrieval, model serving, and policy routing.
- Retrieval: ingestion, chunking, metadata, embeddings, vector search, citations
- Policy: sensitivity, allowed model targets, audit, routing decisions
- Model serving: local inference, private Q&A, approved external model calls
This is where PAIF becomes more than a RAG demo. Retrieval finds the right internal context. Policy decides where that context is allowed to go. Model serving generates the answer using either a local model or an approved external model.
The point is not to pick local or cloud AI based on ideology. The point is architecture control: deciding where data lives, where inference happens, and when external AI services are appropriate.
A request moves through five steps:
- Ask: the user submits a question through the internal UI or API.
- Classify: policy checks the user, request, retrieved context, and sensitivity class.
- Retrieve: the retrieval plane finds cited internal context from the knowledge base.
- Route: restricted context stays local; approved lower-sensitivity work may use an external model.
- Answer: the response returns with citations, model path, and audit trail.
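To make the flow concrete, here is a minimal sketch of the five steps as plain functions. Everything in it is an assumption for illustration: the names `Chunk`, `Answer`, `classify`, `retrieve`, `route`, and `ask` are hypothetical, the classifier and retrieval are placeholders (a keyword check and keyword overlap instead of real policy and vector search), and the routing rule is simplified so that only fully public work may go external.

```python
from dataclasses import dataclass

# Sensitivity classes ordered from least to most restrictive (assumed names).
LEVELS = {"public": 0, "internal": 1, "restricted": 2}


@dataclass
class Chunk:
    source: str       # reference back to the authoritative source system
    text: str
    sensitivity: str  # one of the keys in LEVELS


@dataclass
class Answer:
    text: str
    citations: list[str]
    model_path: str   # "local" or "external"
    audit: dict


def classify(user: str, question: str) -> str:
    """Placeholder: a real classifier would consult identity and policy."""
    return "restricted" if "password" in question.lower() else "public"


def retrieve(question: str, knowledge_base: list[Chunk]) -> list[Chunk]:
    """Placeholder: keyword overlap stands in for vector search."""
    words = set(question.lower().split())
    return [c for c in knowledge_base if words & set(c.text.lower().split())]


def route(request_class: str, chunks: list[Chunk]) -> str:
    """Go external only if the request and every retrieved chunk are public."""
    classes = [request_class] + [c.sensitivity for c in chunks]
    worst = max(classes, key=LEVELS.__getitem__)
    return "external" if worst == "public" else "local"


def ask(user: str, question: str, knowledge_base: list[Chunk]) -> Answer:
    request_class = classify(user, question)       # Classify
    chunks = retrieve(question, knowledge_base)    # Retrieve
    model_path = route(request_class, chunks)      # Route
    text = f"[{model_path} model] answer from {len(chunks)} chunk(s)"
    audit = {"user": user, "question": question,
             "request_class": request_class, "model_path": model_path}
    return Answer(text, [c.source for c in chunks], model_path, audit)


kb = [
    Chunk("wiki/nsx-tep", "NSX TEP networking notes", "restricted"),
    Chunk("sharepoint/holidays", "office holiday schedule", "public"),
]
print(ask("alice", "How is NSX TEP configured", kb).model_path)
print(ask("alice", "When is the next office holiday", kb).model_path)
```

The useful property is that the routing decision looks at the retrieved context, not only at the question. A harmless-looking question that pulls restricted chunks still stays local.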
The retrieval plane
The retrieval plane is where the private knowledge system starts. This is the part that ingests content, extracts text, breaks it into useful chunks, attaches metadata, creates embeddings, and stores a searchable representation.
For infrastructure readers, embeddings are easiest to think of as semantic search fingerprints. They let the platform find text that is related in meaning, not only text that matches the same keywords.
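A small sketch may help make the "fingerprint" idea tangible. The vectors below are made-up three-dimensional stand-ins; real embedding models produce hundreds or thousands of dimensions, and the titles are hypothetical. The comparison itself, cosine similarity, is a common way to score how close two embeddings are.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings", invented for illustration only.
docs = {
    "Resetting a host TEP address": [0.9, 0.1, 0.0],
    "Tunnel endpoint IP troubleshooting": [0.8, 0.2, 0.1],
    "Cafeteria lunch menu": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # stands in for an embedded question about TEPs

ranked = sorted(docs, key=lambda title: cosine_similarity(query, docs[title]),
                reverse=True)
for title in ranked:
    print(f"{cosine_similarity(query, docs[title]):.3f}  {title}")
```

The two networking titles share almost no keywords, yet both score close to the query, while the unrelated document scores near zero.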
But embeddings alone are not enough. Without metadata, vector search becomes a kind of vibes search. It may find something similar, but not necessarily something current, allowed, or applicable.
A useful PAIF retrieval plane needs enough metadata to judge whether a chunk of knowledge is actually relevant to the question, not just similar to it:
- source system
- document type
- platform or component
- version
- environment
- last updated date
- sensitivity
For example, a note about NSX TEP networking in VCF 9 should not be treated the same as an old lab note from a different platform version. Both might be semantically similar. Only one may be the right source for the question being asked.
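As a rough sketch of how that metadata might be used, the example below attaches the fields listed above to each chunk and filters candidates before they reach the model. The class name `ChunkMetadata`, the function `is_applicable`, and all field values (including the dates and the "older release" label) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChunkMetadata:
    source_system: str
    document_type: str
    component: str
    version: str
    environment: str
    last_updated: date
    sensitivity: str


def is_applicable(meta: ChunkMetadata, *, component: str, version: str,
                  environment: str) -> bool:
    """Keep a chunk only if it matches the platform context of the question."""
    return (meta.component == component
            and meta.version == version
            and meta.environment == environment)


# Two chunks that vector search might rank as equally similar.
candidates = [
    ChunkMetadata("internal wiki", "design note", "NSX TEP", "VCF 9",
                  "production", date(2025, 1, 15), "internal"),
    ChunkMetadata("internal wiki", "lab note", "NSX TEP", "older release",
                  "lab", date(2022, 6, 1), "internal"),
]

matches = [m for m in candidates
           if is_applicable(m, component="NSX TEP", version="VCF 9",
                            environment="production")]
print([m.document_type for m in matches])
```

Filtering on exact equality is the simplest possible choice. A real retrieval plane would more likely treat some fields as hard filters (sensitivity, version) and others as ranking signals (last updated date).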
The first connector targets for PAIF are the systems where enterprise knowledge already lives: SharePoint, an internal wiki, GitLab, and ServiceNow. I do not want PAIF to become another content repository. Source systems should remain authoritative.
PAIF should store what it needs to operate: extracted text, chunks, metadata, vectors, source references, retrieval logs, and audit records. It should not duplicate every document unless there is a specific reason to do so.
The model-serving plane
The local GPU layer is not only there to replace external AI. It is there to process private knowledge close to the data, answer locally when the task fits, and reduce whatever needs to leave the environment.
In PAIF v1, local model serving should support:
- embeddings
- chunk summaries
- metadata extraction
- sensitivity classification
- local answers for restricted content
- prompt and context reduction
- first-pass summaries before a policy-approved external handoff
The last point needs a hard boundary. Sensitive source content should not leave the private environment. If sensitive data is simply summarized and sent to an external model without a clear policy, the private AI story collapses.
External models still matter. They may be better for frontier reasoning, long-form synthesis, complex planning, coding, or low-sensitivity requests where policy allows external use. But they should not be the default destination for sensitive source material.
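One way to express that boundary in code is to treat any derived content, such as a first-pass summary, as being as sensitive as its most sensitive source. That inheritance rule is my assumption for this sketch, not a settled PAIF policy, and the names `effective_class` and `may_leave` are hypothetical.

```python
# Sensitivity classes ordered from least to most restrictive (assumed names).
LEVELS = {"public": 0, "internal": 1, "restricted": 2}


def effective_class(source_classes: list[str]) -> str:
    """A summary inherits the most restrictive class among its sources."""
    if not source_classes:
        return "restricted"  # fail closed when provenance is unknown
    return max(source_classes, key=LEVELS.__getitem__)


def may_leave(source_classes: list[str], external_ceiling: str = "public") -> bool:
    """Allow an external handoff only at or below the approved ceiling."""
    return LEVELS[effective_class(source_classes)] <= LEVELS[external_ceiling]


print(may_leave(["public", "public"]))      # summary of public sources
print(may_leave(["public", "restricted"]))  # one restricted source taints it
```

The point of the sketch is that summarization alone never lowers a classification. Only an explicit policy change, here the `external_ceiling` parameter, widens what can leave.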
The policy and routing plane
The routing decision should happen before model selection.
The platform should not start by asking, "Which model is best?" It should start by asking, "Which model targets are allowed?"
| Class | Meaning | Routing |
|---|---|---|
| Public | Approved to leave | Local or external |
| Internal | Company or customer internal, not public | Usually local; external only by explicit policy |
| Restricted | Sensitive operational, customer, or security data | Local only |
That simple model is enough for the first reference architecture. It keeps the hybrid design understandable without turning the first implementation into a full governance product.
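The table translates fairly directly into code. The sketch below is one possible encoding, with hypothetical names (`Sensitivity`, `allowed_targets`, `select_model`) and a boolean `external_approved` flag standing in for "explicit policy". It computes the allowed targets first and only then honors a model preference.

```python
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    RESTRICTED = "restricted"


def allowed_targets(sensitivity: Sensitivity,
                    external_approved: bool = False) -> set[str]:
    """Return the model targets that policy permits for a sensitivity class."""
    if sensitivity is Sensitivity.PUBLIC:
        return {"local", "external"}
    if sensitivity is Sensitivity.INTERNAL and external_approved:
        return {"local", "external"}
    return {"local"}  # Restricted, or Internal without explicit approval


def select_model(sensitivity: Sensitivity, preferred: str,
                 external_approved: bool = False) -> str:
    """Honor the preferred target only if policy allows it; else stay local."""
    targets = allowed_targets(sensitivity, external_approved)
    return preferred if preferred in targets else "local"


print(select_model(Sensitivity.PUBLIC, "external"))
print(select_model(Sensitivity.INTERNAL, "external"))
print(select_model(Sensitivity.INTERNAL, "external", external_approved=True))
print(select_model(Sensitivity.RESTRICTED, "external", external_approved=True))
```

Note that `external_approved` has no effect on Restricted data. In this sketch there is no flag that can send it out.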
Why Kubernetes is the runtime
PAIF is not a traditional VM stack with an AI application installed on top. It is a platform made of services: ingestion workers, sync jobs, embedding services, vector search, APIs, model endpoints, user interfaces, policy services, and observability.
That points toward Kubernetes as the primary runtime.
Kubernetes gives PAIF service discovery, job scheduling, horizontal scaling, secrets management, ingress, persistent volumes, rolling updates, and a natural place to run model endpoints and ingestion workers.
In a VCF-focused PAIF design, supportability matters. Deep Learning VMs or AI workstation patterns still have a place for model testing, experimentation, and persistent data science environments. But a shared AI platform pushes the architecture toward Kubernetes-native services and VKS-style workload clusters.
- VCF substrate: compute, GPU access, storage, NSX networking, lifecycle, operational control
- VKS / workload cluster: Kubernetes runtime for PAIF application services and model endpoints
- PAIF services: ingestion workers, vector search, APIs, UI, policy service, audit logging
- Inference paths: local GPU inference for sensitive work; controlled egress for approved external AI calls
VMs are not obsolete in private AI. They are still useful for workstations, appliance-style deployments, and simple persistent services. Kubernetes becomes more compelling when private AI becomes a platform consumed through APIs.
Networking, identity, and exposure
PAIF v1 should be internal only.
Users and internal applications should consume it through private network paths. The platform needs controlled access to source systems such as SharePoint, the internal wiki, GitLab, and ServiceNow. It also needs tightly controlled outbound access to approved external AI services when policy allows.
The platform should expose both a web UI and an API. The UI is how people experience the system first. The API is how it becomes infrastructure.
For identity, PAIF should not invent its own user model. It should consume enterprise identity, with Entra ID as the natural default in this design.
Full source-system permission trimming is a future maturity step. For v1, the important thing is that user identity is present in every query, every policy decision, and every audit record.
Observability and audit
AI observability cannot stop at container logs.
PAIF needs to show what happened in the AI path: who asked the question, what sources were retrieved, which model answered, what policy decision was made, whether an external call was allowed or blocked, and what answer came back.
In the first iteration, visibility matters more than elegance. PAIF should log the full prompt, retrieved context, answer, model path, and policy decision so the platform can be debugged and improved.
Those logs become sensitive data too. As the platform matures, they need access control, retention policy, and eventually redaction.
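As an illustration of what one audit entry could carry, here is a minimal sketch of a structured record serialized as a JSON log line, with an optional redaction switch for later maturity. The `AuditRecord` class, its field names, and the `to_log_line` helper are assumptions for this example, not a defined PAIF schema.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class AuditRecord:
    timestamp: str
    user: str                     # enterprise identity of the caller
    question: str
    retrieved_sources: list[str]  # references to source-system documents
    model_path: str               # "local" or "external"
    policy_decision: str
    external_call: str            # "allowed", "blocked", or "not_requested"
    prompt: str
    answer: str


def to_log_line(record: AuditRecord, redact: bool = False) -> str:
    """Serialize one record as JSON, optionally masking free-text fields."""
    data = asdict(record)
    if redact:
        for field_name in ("question", "prompt", "answer"):
            data[field_name] = "[REDACTED]"
    return json.dumps(data, sort_keys=True)


record = AuditRecord(
    timestamp=datetime.now(timezone.utc).isoformat(),
    user="alice@example.com",
    question="How is NSX TEP configured?",
    retrieved_sources=["wiki/nsx-tep"],
    model_path="local",
    policy_decision="restricted context: local only",
    external_call="blocked",
    prompt="(full prompt text)",
    answer="(full answer text)",
)
print(to_log_line(record))
print(to_log_line(record, redact=True))
```

Keeping identity, sources, model path, and policy decision outside the redacted fields means the record stays useful for audit even after the free text is masked.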
What this reference architecture is not
This first design is intentionally scoped.
It is not full enterprise multi-tenancy. It is not a runtime agent platform. It is not trying to replace frontier models. It is not trying to become the system of record for every document. It is not pretending local inference is automatically cheaper than cloud AI.
Those are important topics, but they are later maturity steps.
The first design target is simpler and more useful:
A VCF-based private AI platform that can retrieve internal knowledge, process it locally, answer sensitive questions locally, and route non-sensitive work to external AI services when policy allows.
That is the foundation I want to build from.