Designing PAIF: A Reference Architecture for Tier 3 Private AI

The views in this post are my own and do not represent my employer. Technical details are generalized and do not disclose customer, proprietary, or confidential information.

Starting with the right problem

In the first post in this series, I wrote about why I am building a Private AI Foundation instead of only calling public AI APIs. This post is the next step: what should the first reference architecture actually look like?

I do not think the first useful version of PAIF is an agent platform. I also do not think it should begin by trying to run every possible AI workload locally. That is too broad, too expensive, and too easy to turn into a science project.

The first practical use case is an internal knowledge system: a way for teams to ask questions over private documents, runbooks, architecture notes, troubleshooting writeups, tickets, source repositories, and operational knowledge.

That is a familiar problem for infrastructure teams. Most organizations already have the knowledge. It is just scattered across SharePoint, internal wikis, GitLab, ServiceNow, design documents, diagrams, and incident notes.

The design questions

I am using this post to answer a specific set of design questions, not to pretend the whole platform is finished.

What is the first useful workload?
What needs to stay private?
What should run locally, and what can use external AI?
Where does Kubernetes fit in a VCF-based design?
Where should the knowledge live?
How should policy decide local versus external routing?
How do users and applications consume the platform?
What visibility do operators need?

Those questions matter more than the first tool choice. They define the operating model PAIF has to support.

The private AI maturity model

One of the mistakes I see in private AI conversations is treating privacy like a yes-or-no property. Either everything runs locally, or the platform is not really private.

I think a more useful question is what stays private.

Tier	What stays private
Tier 1	Knowledge base
Tier 2	Knowledge processing
Tier 3	Question answering
Tier 4	Full reasoning

I am intentionally using "knowledge base" here because that maps better to infrastructure language. In AI and NLP language, a collection of documents is often called a corpus. That term is accurate, but it is not how most infrastructure teams talk.

Private AI maturity by what stays local

PAIF v1 should target Tier 3. The knowledge base stays private. The processing stays private. Sensitive question answering can stay private. Full frontier reasoning remains hybrid.

That is an important distinction. Private AI does not have to mean every token is generated on-premises. It means the platform can keep sensitive knowledge and sensitive answers inside the environment when the data classification requires it.

The three-plane architecture

The cleanest way I have found to think about PAIF is as three planes: retrieval, model serving, and policy routing.

PAIF reference architecture model

Retrieval: Ingestion, chunking, metadata, embeddings, vector search, citations
Policy: Sensitivity, allowed model targets, audit, routing decisions
Model serving: Local inference, private Q&A, approved external model calls

This is where PAIF becomes more than a RAG demo. Retrieval finds the right internal context. Policy decides where that context is allowed to go. Model serving generates the answer using either a local model or an approved external model.

The point is not to pick local or cloud AI based on ideology. The point is architecture control: deciding where data lives, where inference happens, and when external AI services are appropriate.

Tier 3 private question-answering path

Ask: User submits a question through the internal UI or API.
Classify: Policy checks the user, request, retrieved context, and sensitivity class.
Retrieve: The retrieval plane finds cited internal context from the knowledge base.
Route: Restricted context stays local; approved lower-sensitivity work may use an external model.
Answer: The response returns with citations, model path, and audit trail.
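The path above can be sketched as a single function. This is a minimal illustration, not a finished design: the names `Sensitivity`, `Chunk`, `Answer`, and `answer_question` are hypothetical, the models and retriever are plain callables standing in for real services, and the sketch assumes the effective sensitivity is the highest class found across the request and the retrieved chunks.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Sensitivity(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    RESTRICTED = 2


@dataclass
class Chunk:
    text: str
    source: str               # reference back to the authoritative source system
    sensitivity: Sensitivity


@dataclass
class Answer:
    text: str
    citations: list
    model_path: str           # "local" or "external"
    audit: dict = field(default_factory=dict)


def answer_question(user, question, request_class, retrieve,
                    local_model, external_model,
                    external_ok=frozenset({Sensitivity.PUBLIC})):
    """Ask -> Classify -> Retrieve -> Route -> Answer (hypothetical sketch)."""
    # Retrieve: find cited internal context for the question.
    chunks = retrieve(question)

    # Classify: assume the strictest class among request and context wins.
    effective = max([request_class] + [c.sensitivity for c in chunks])

    # Route: restricted work never leaves; other classes need explicit approval.
    use_external = effective != Sensitivity.RESTRICTED and effective in external_ok
    model_path = "external" if use_external else "local"
    model = external_model if use_external else local_model

    # Answer: return the text with citations, model path, and an audit trail.
    context = "\n".join(c.text for c in chunks)
    return Answer(
        text=model(question, context),
        citations=[c.source for c in chunks],
        model_path=model_path,
        audit={"user": user, "question": question,
               "sensitivity": effective.name, "model_path": model_path},
    )
```

Passing the retriever and both models in as arguments keeps the routing logic testable without a GPU or a network connection.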

The retrieval plane

The retrieval plane is where the private knowledge system starts. This is the part that ingests content, extracts text, breaks it into useful chunks, attaches metadata, creates embeddings, and stores a searchable representation.

For infrastructure readers, embeddings are easiest to think of as semantic search fingerprints. They let the platform find text that is related in meaning, not only text that matches the same keywords.
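To make the fingerprint idea concrete, here is a small sketch of ranking by cosine similarity, one common way to compare embeddings. The three-dimensional vectors and document titles are invented for illustration; a real embedding model produces vectors with hundreds or thousands of dimensions, and `cosine_similarity` and `rank` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank(query_vec, indexed):
    """Sort (title, vector) pairs from most to least similar to the query."""
    return sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )


# Toy vectors, made up for this example only.
index = [
    ("Cafeteria menu", [0.0, 0.1, 0.9]),
    ("NSX TEP troubleshooting", [0.8, 0.2, 0.1]),
]
query = [0.9, 0.1, 0.0]  # pretend embedding of a networking question
best_title = rank(query, index)[0][0]
```

The ranking depends only on direction in the vector space, which is why a question can match a document that shares none of its keywords.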

But embeddings alone are not enough. Without metadata, vector search becomes a kind of vibes search. It may find something similar, but not necessarily something current, allowed, or applicable.

A useful PAIF retrieval plane needs enough metadata to judge whether a chunk of knowledge is current, allowed, and applicable to the question, not just semantically similar:

source system
document type
platform or component
version
environment
last updated date
sensitivity

For example, a note about NSX TEP networking in VCF 9 should not be treated the same as an old lab note from a different platform version. Both might be semantically similar. Only one may be the right source for the question being asked.
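A minimal sketch of that idea: attach the metadata fields listed above to each chunk and filter on them before, or alongside, vector search. The `ChunkMetadata` class, the `matches` helper, and the sample values (including the "VCF 5" lab note) are hypothetical and only illustrate the shape of the check.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChunkMetadata:
    source_system: str
    document_type: str
    component: str
    version: str
    environment: str
    last_updated: date
    sensitivity: str


def matches(meta, *, component=None, version=None,
            environment=None, updated_since=None):
    """Return True if the chunk satisfies every filter that was supplied."""
    if component is not None and meta.component != component:
        return False
    if version is not None and meta.version != version:
        return False
    if environment is not None and meta.environment != environment:
        return False
    if updated_since is not None and meta.last_updated < updated_since:
        return False
    return True


# Two semantically similar notes that differ in version and environment.
current_note = ChunkMetadata("wiki", "design note", "NSX", "VCF 9",
                             "production", date(2025, 6, 1), "internal")
old_lab_note = ChunkMetadata("wiki", "lab note", "NSX", "VCF 5",
                             "lab", date(2022, 3, 15), "internal")

candidates = [current_note, old_lab_note]
applicable = [m for m in candidates
              if matches(m, component="NSX", version="VCF 9")]
```

Both notes would likely score well on similarity alone; the version filter is what keeps the old lab note out of the answer.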

The first connector targets for PAIF are the systems where enterprise knowledge already lives: SharePoint, an internal wiki, GitLab, and ServiceNow. I do not want PAIF to become another content repository. Source systems should remain authoritative.

Retrieval plane source flow

Source systems: SharePoint, Wiki, GitLab, ServiceNow
Connector layer: Sync, extract, normalize
Knowledge index: Chunks, metadata, embeddings, vectors
Retrieval API: Cited context for the model path

PAIF should store what it needs to operate: extracted text, chunks, metadata, vectors, source references, retrieval logs, and audit records. It should not duplicate every document unless there is a specific reason to do so.

The model-serving plane

The local GPU layer is not only there to replace external AI. It is there to process private knowledge close to the data, answer locally when the task fits, and reduce whatever needs to leave the environment.

In PAIF v1, local model serving should support:

embeddings
chunk summaries
metadata extraction
sensitivity classification
local answers for restricted content
prompt and context reduction
first-pass summaries before a policy-approved external handoff

The last point needs a hard boundary. Sensitive source content should not leave the private environment. If sensitive data is simply summarized and sent to an external model without a clear policy, the private AI story collapses.
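One way to picture that boundary is a reduction step that runs locally before anything is handed off. The sketch below is an assumption about how such a step could look, not a prescribed mechanism: `reduce_for_external` is a hypothetical name, sensitivity classes are plain strings, and `summarize` stands in for a local model call.

```python
def reduce_for_external(chunks, summarize,
                        approved_classes=frozenset({"public"})):
    """Build the outbound context from (text, sensitivity) pairs.

    Only chunks in an approved class are summarized and included.
    Everything else is withheld and counted so the decision can be audited.
    """
    if "restricted" in approved_classes:
        # Hard boundary: restricted content is local only, even as a summary.
        raise ValueError("restricted content can never be approved for handoff")

    outbound = []
    withheld = 0
    for text, sensitivity in chunks:
        if sensitivity in approved_classes:
            outbound.append(summarize(text))
        else:
            withheld += 1
    return outbound, withheld


def first_sentence(text):
    """Toy stand-in for a local summarization model."""
    return text.split(". ")[0].rstrip(".") + "."


chunks = [
    ("Public release notes. More detail follows here.", "public"),
    ("Customer incident detail. Internal names follow.", "restricted"),
]
outbound, withheld = reduce_for_external(chunks, first_sentence)
```

Raising an error when restricted content is marked as approved makes a misconfigured policy fail loudly instead of leaking quietly.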

External models still matter. They may be better for frontier reasoning, long-form synthesis, complex planning, coding, or low-sensitivity requests where policy allows external use. But they should not be the default destination for sensitive source material.

The policy and routing plane

The routing decision should happen before model selection.

The platform should not start by asking, "Which model is best?" It should start by asking, "Which model targets are allowed?"

Class	Meaning	Routing
Public	Approved to leave	Local or external
Internal	Company or customer internal, not public	Usually local; external only by explicit policy
Restricted	Sensitive operational, customer, or security data	Local only

That simple model is enough for the first reference architecture. It keeps the hybrid design understandable without turning the first implementation into a full governance product.
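The table translates almost directly into code. In this sketch the enum and function names are hypothetical, and "explicit policy" for Internal data is reduced to a single boolean flag; a real policy service would carry more context than that.

```python
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    RESTRICTED = "restricted"


class Target(Enum):
    LOCAL = "local"
    EXTERNAL = "external"


def allowed_targets(sensitivity, explicit_external_approval=False):
    """Answer "which model targets are allowed?" for a sensitivity class."""
    if sensitivity is Sensitivity.PUBLIC:
        return {Target.LOCAL, Target.EXTERNAL}
    if sensitivity is Sensitivity.INTERNAL and explicit_external_approval:
        return {Target.LOCAL, Target.EXTERNAL}
    # Restricted, and Internal without explicit approval, stay local.
    return {Target.LOCAL}


def choose_target(sensitivity, preferred, explicit_external_approval=False):
    """Pick the preferred target only if policy allows it, else fall back to local."""
    targets = allowed_targets(sensitivity, explicit_external_approval)
    return preferred if preferred in targets else Target.LOCAL
```

Model preference is an input to `choose_target`, but it is only honored after the allowed set has been computed, which keeps the "allowed first, best second" ordering explicit.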

Why Kubernetes is the runtime

PAIF is not a traditional VM stack with an AI application installed on top. It is a platform made of services: ingestion workers, sync jobs, embedding services, vector search, APIs, model endpoints, user interfaces, policy services, and observability.

That points toward Kubernetes as the primary runtime.

Kubernetes gives PAIF service discovery, job scheduling, horizontal scaling, secrets management, ingress, persistent volumes, rolling updates, and a natural place to run model endpoints and ingestion workers.

In a VCF-focused PAIF design, supportability matters. Deep Learning VMs or AI workstation patterns still have a place for model testing, experimentation, and persistent data science environments. But a shared AI platform pushes the architecture toward Kubernetes-native services and VKS-style workload clusters.

VCF and Kubernetes runtime view

VCF substrate: Compute, GPU access, storage, NSX networking, lifecycle, operational control
VKS / workload cluster: Kubernetes runtime for PAIF application services and model endpoints
PAIF services: Ingestion workers, vector search, APIs, UI, policy service, audit logging
Inference paths: Local GPU inference for sensitive work; controlled egress for approved external AI calls

VMs are not obsolete in private AI. They are still useful for workstations, appliance-style deployments, and simple persistent services. Kubernetes becomes more compelling when private AI becomes a platform consumed through APIs.

Networking, identity, and exposure

PAIF v1 should be internal only.

Users and internal applications should consume it through private network paths. The platform needs controlled access to source systems such as SharePoint, the internal wiki, GitLab, and ServiceNow. It also needs tightly controlled outbound access to approved external AI services when policy allows.
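At the application layer, tightly controlled outbound access can start with an explicit allowlist that is checked before any external call is made. This is only a sketch: the hostname below is a placeholder, `egress_allowed` is a hypothetical helper, and a check like this would complement network-level egress controls rather than replace them.

```python
from urllib.parse import urlparse

# Placeholder hostname; a real list would come from reviewed configuration.
APPROVED_EXTERNAL_HOSTS = frozenset({"llm.approved-vendor.example"})


def egress_allowed(url, approved_hosts=APPROVED_EXTERNAL_HOSTS):
    """Allow only HTTPS calls to an exactly matching approved hostname."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    host = parsed.hostname  # lowercased by urlparse; None if missing
    return host is not None and host in approved_hosts
```

Matching the full hostname exactly, rather than by substring or suffix, avoids approving look-alike domains that merely contain an approved name.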

The platform should expose both a web UI and an API. The UI is how people experience the system first. The API is how it becomes infrastructure.

For identity, PAIF should not invent its own user model. It should consume enterprise identity, with Entra ID as the natural default in this design.

Full source-system permission trimming is a future maturity step. For v1, the important thing is that user identity is present in every query, every policy decision, and every audit record.

Observability and audit

AI observability cannot stop at container logs.

PAIF needs to show what happened in the AI path: who asked the question, what sources were retrieved, which model answered, what policy decision was made, whether an external call was allowed or blocked, and what answer came back.

In the first iteration, visibility matters more than elegance. PAIF should log the full prompt, retrieved context, answer, model path, and policy decision so the platform can be debugged and improved.

Those logs become sensitive data too. As the platform matures, they need access control, retention policy, and eventually redaction.
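As a rough sketch, one audit record per question could carry the fields described above and be written as a single JSON line. The `AuditRecord` class and its field names are assumptions made for illustration, not a defined PAIF schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class AuditRecord:
    user_id: str                 # enterprise identity of the requester
    question: str
    retrieved_sources: list      # references to the source documents used
    sensitivity: str
    policy_decision: str
    model_path: str              # "local" or "external"
    external_call_allowed: bool
    prompt: str                  # full prompt, including retrieved context
    answer: str
    timestamp: str = ""

    def __post_init__(self):
        # Stamp the record in UTC if the caller did not supply a time.
        if not self.timestamp:
            self.timestamp = datetime.now(timezone.utc).isoformat()


def to_log_line(record):
    """Serialize one record as a single JSON line for the audit log."""
    return json.dumps(asdict(record), sort_keys=True)


record = AuditRecord(
    user_id="alice@example.com",
    question="placeholder question",
    retrieved_sources=["wiki/example-page"],
    sensitivity="restricted",
    policy_decision="local only",
    model_path="local",
    external_call_allowed=False,
    prompt="placeholder prompt",
    answer="placeholder answer",
)
line = to_log_line(record)
```

Because each line contains the full prompt and answer, the log store itself needs the access control and retention policy described above.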

What this reference architecture is not

This first design is intentionally scoped.

It is not full enterprise multi-tenancy. It is not a runtime agent platform. It is not trying to replace frontier models. It is not trying to become the system of record for every document. It is not pretending local inference is automatically cheaper than cloud AI.

Those are important topics, but they are later maturity steps.

The first design target is simpler and more useful:

A VCF-based private AI platform that can retrieve internal knowledge, process it locally, answer sensitive questions locally, and route non-sensitive work to external AI services when policy allows.

That is the foundation I want to build from.