Private LLM Inference

OpenGradient provides private LLM inference infrastructure that combines Oblivious HTTP (OHTTP) with hardware-attested Trusted Execution Environments (TEEs). Prompts and completions are end-to-end encrypted to an attested enclave, and the network path is split by a relay so that no single party sees both who sent a request and what it contains.

This is offered as infrastructure on top of the standard Verifiable LLM Execution stack (the same TEE registry, the same on-chain attestation, and the same payments on Base) with an added private inference layer that decouples client identity from request content. It is exposed through the Python SDK and TypeScript SDK, so applications can opt in to private inference without managing HPKE, attestation parsing, or relay routing themselves.

TIP

If you only need verifiable inference (provable prompt usage, signed responses) without identity unlinkability and additional privacy guarantees, see Verifiable LLM Execution. Private inference layers an additional privacy guarantee on top of that.

Key Features

End-to-end encryption to the enclave - Prompts are sealed under HPKE (RFC 9180) on the client side (in the user's browser or device) using a public key bound to an attested enclave build, and completions are sealed inside the enclave on the way back. Only the enclave holds the matching private key, so only it can decrypt a request before forwarding it to the model provider.
Identity / content unlinkability - A two-hop architecture (relay + gateway) splits the request so the relay sees the client's IP but not the plaintext, and the enclave sees the plaintext but not the client's IP.
Attested key distribution - The HPKE public key is published alongside an AWS Nitro attestation document. Clients verify the key was generated inside the approved enclave before encrypting anything.
Streaming support - Token-by-token responses use Chunked OHTTP, so each SSE event is individually sealed and streamed back through the relay without revealing content.
Signed responses - Every response is signed inside the enclave (RSA-PSS-SHA256 over keccak256(requestHash || outputHash || timestamp)) with a key bound to the same attestation, so clients can prove an output came from the attested enclave.
Same model coverage - All models supported by the verifiable LLM stack (OpenAI, Anthropic, Google, xAI, etc.) are reachable through the private endpoint.

Trust Model

Private inference splits trust between two independently operated parties, the relay and the gateway, with the gateway's access to plaintext confined to an attested enclave:

Party	Sees	Does Not See
Client	Everything (plaintext, response signature, attestation)	-
Relay	Client IP, OHTTP ciphertext, cost (non-streaming requests only), client payment credits	Prompt, completion, model name, token counts
TEE Gateway (enclave)	Request and response, relay IP	Client IP, client identity
Upstream model provider	Request from the enclave's egress	Client IP, client identity, OpenGradient routing

The privacy guarantee rests on non-collusion between the relay and the gateway: as long as those two operators do not share data, no party can link a given client to a given prompt. The enclave attestation ensures the gateway operator cannot read plaintext outside the approved code path even if they wanted to. The HPKE private key never leaves the enclave's memory, and because the server code is open source and its build is verifiable through the attestation, anyone can check that it does not log or record request content.

Architecture Overview

The diagram shows the two-hop split: the relay sees who you are (IP) but not what you're saying (OHTTP ciphertext); the enclave sees what you're saying but not who you are. The on-chain TEE Registry anchors trust at both ends — the client verifies the enclave's attestation before encrypting, and verifies the response signature against the registered key after decrypting.

How It Works

1. Enclave Startup & Key Generation

When a gateway enclave boots, it generates two keypairs inside the TEE:

An RSA-2048 signing keypair, used to sign inference responses.
An X25519 HPKE keypair, used as the OHTTP key configuration for encrypting client requests.

Both public keys are bound to a single AWS Nitro attestation document via the nitriding daemon's transcript. Specifically, the attestation's user_data field commits to a transcript of the form:

og-tee-keys|v2|rsa-spki=<DER>|hpke-x25519=<32 bytes>

This means a client that verifies the attestation gets both keys at once - they cannot be substituted independently.
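To make the binding concrete, here is a minimal sketch of how a client could rebuild the transcript from the two public keys and check it against user_data. It rests on two assumptions not stated above: the raw DER and X25519 bytes are embedded verbatim between the ASCII labels, and user_data holds the SHA-256 digest of the transcript. The function names are hypothetical, not the SDK's API.

```python
import hashlib
import hmac

TRANSCRIPT_PREFIX = b"og-tee-keys|v2|"


def build_key_transcript(rsa_spki_der: bytes, hpke_x25519_pub: bytes) -> bytes:
    """Rebuild og-tee-keys|v2|rsa-spki=<DER>|hpke-x25519=<32 bytes>.

    Assumption: raw key bytes are embedded verbatim (no hex/base64 encoding).
    """
    if len(hpke_x25519_pub) != 32:
        raise ValueError("X25519 public keys are exactly 32 bytes")
    return (
        TRANSCRIPT_PREFIX
        + b"rsa-spki=" + rsa_spki_der
        + b"|hpke-x25519=" + hpke_x25519_pub
    )


def keys_match_attestation(user_data: bytes, rsa_spki_der: bytes, hpke_x25519_pub: bytes) -> bool:
    """True if user_data commits to both keys.

    Assumption: user_data == SHA-256(transcript). Substitute the real
    commitment scheme if it differs.
    """
    transcript = build_key_transcript(rsa_spki_der, hpke_x25519_pub)
    expected = hashlib.sha256(transcript).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(expected, user_data)
```

Because one digest covers both keys, replacing either key changes the commitment, which is what prevents them from being substituted independently.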

The enclave is registered on the on-chain TEE Registry on the OpenGradient network, which checks the attestation against the AWS Nitro root CA and confirms the enclave's PCR measurements match an approved build.

2. Key Configuration Distribution

Clients fetch the OHTTP key configuration from the gateway before sending their first request:

http

GET /v1/ohttp/config HTTP/1.1
Host: <gateway>

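The response describes a fixed HPKE ciphersuite plus the enclave's current X25519 public key. The numeric identifiers are RFC 9180 registry values: kem_id 32 (0x0020) is DHKEM(X25519, HKDF-SHA256), kdf_id 1 is HKDF-SHA256, and aead_id 3 is ChaCha20-Poly1305: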

json

{
  "key_id": 1,
  "kem_id": 32,
  "kdf_id": 1,
  "aead_id": 3,
  "public_key": "<hex X25519 public key>",
  "key_config": "<base64 RFC 9458 key_config>"
}
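The key_config field carries the same information in the binary Key Config layout from RFC 9458: key_id, kem_id, the public key, then a length-prefixed list of kdf_id/aead_id pairs. A client can parse it and cross-check it against the JSON fields. This sketch assumes key_config holds a single Key Config without the 2-byte length prefix used in application/ohttp-keys, and that public_key is unprefixed hex. The helper names are hypothetical.

```python
import base64
import struct
from dataclasses import dataclass
from typing import List, Tuple

# RFC 9180 registry values for the fixed suite advertised above.
KEM_X25519_HKDF_SHA256 = 0x0020  # Npk = 32 bytes
KDF_HKDF_SHA256 = 0x0001
AEAD_CHACHA20_POLY1305 = 0x0003
NPK = 32


@dataclass
class KeyConfig:
    key_id: int
    kem_id: int
    public_key: bytes
    suites: List[Tuple[int, int]]  # (kdf_id, aead_id) pairs


def parse_key_config(raw: bytes) -> KeyConfig:
    """Parse a single RFC 9458 Key Config (no outer 2-byte length prefix)."""
    if len(raw) < 3 + NPK + 2:
        raise ValueError("key_config too short")
    key_id, kem_id = struct.unpack_from("!BH", raw, 0)
    if kem_id != KEM_X25519_HKDF_SHA256:
        raise ValueError(f"unsupported KEM 0x{kem_id:04x}")
    public_key = raw[3:3 + NPK]
    (suites_len,) = struct.unpack_from("!H", raw, 3 + NPK)
    start = 3 + NPK + 2
    if suites_len == 0 or suites_len % 4 or start + suites_len != len(raw):
        raise ValueError("malformed symmetric algorithms list")
    suites = [struct.unpack_from("!HH", raw, start + i) for i in range(0, suites_len, 4)]
    return KeyConfig(key_id, kem_id, public_key, suites)


def check_config_response(resp: dict) -> KeyConfig:
    """Cross-check the JSON fields from /v1/ohttp/config against key_config."""
    cfg = parse_key_config(base64.b64decode(resp["key_config"]))
    suite = (resp["kdf_id"], resp["aead_id"])
    if suite != (KDF_HKDF_SHA256, AEAD_CHACHA20_POLY1305) or suite not in cfg.suites:
        raise ValueError("unexpected or unlisted ciphersuite")
    if cfg.key_id != resp["key_id"] or cfg.kem_id != resp["kem_id"]:
        raise ValueError("key_id/kem_id disagree with key_config")
    if cfg.public_key != bytes.fromhex(resp["public_key"]):
        raise ValueError("public_key disagrees with key_config")
    return cfg
```

This check only confirms the response is internally consistent. Whether the public key is trustworthy is decided by the attestation checks described next.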

A separate attestation endpoint returns the Nitro attestation document so the client can confirm the published public key was actually generated inside the approved enclave:

http

GET /enclave/attestation?nonce=<client-nonce> HTTP/1.1

The client must:

Verify the attestation signature chains to the AWS Nitro root certificate.
Confirm the PCR values match the on-chain approved hashes in the TEE Registry.
Recompute the transcript and confirm it matches the attestation's user_data.
Only then trust the X25519 public key as the OHTTP encapsulation key.

The Python and TypeScript SDKs perform these steps automatically.
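For readers who want to see the decision logic, the sketch below applies these checks to an attestation document that has already been decoded and whose signature chain has already been validated against the AWS Nitro root. That step needs a COSE/CBOR and X.509 library and is not shown. The sketch also compares the echoed nonce with the one the client sent, a freshness check implied by the nonce parameter. The DecodedAttestation type, the function name, and the SHA-256 commitment for user_data are assumptions for illustration. The approved PCR values would come from the TEE Registry.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Dict


@dataclass
class DecodedAttestation:
    """Nitro attestation fields after decoding and chain validation (assumed done)."""
    pcrs: Dict[int, bytes]  # PCR index -> measurement bytes
    user_data: bytes
    nonce: bytes


def trusted_hpke_key(att, client_nonce, approved_pcrs, rsa_spki_der, hpke_pub):
    """Return hpke_pub only if every check passes; raise ValueError otherwise."""
    # Freshness: the enclave must echo the nonce this client sent.
    if not hmac.compare_digest(att.nonce, client_nonce):
        raise ValueError("nonce mismatch")
    # Approved build: PCR0-2 must equal the TEE Registry's approved hashes.
    for idx in (0, 1, 2):
        expected = approved_pcrs.get(idx)
        if expected is None or att.pcrs.get(idx) != expected:
            raise ValueError(f"PCR{idx} does not match an approved build")
    # Key binding: recompute the transcript (assumed SHA-256 commitment).
    transcript = (b"og-tee-keys|v2|rsa-spki=" + rsa_spki_der
                  + b"|hpke-x25519=" + hpke_pub)
    if not hmac.compare_digest(hashlib.sha256(transcript).digest(), att.user_data):
        raise ValueError("user_data does not commit to these keys")
    # Only now is the X25519 key safe to use as the OHTTP encapsulation key.
    return hpke_pub
```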

3. Encapsulating a Request

The client constructs the inner LLM request - model, messages, parameters - exactly as it would for an OpenAI-compatible API:

json

{
  "model": "openai/gpt-5",
  "messages": [{"role": "user", "content": "Summarize this contract..."}],
  "temperature": 0.2,
  "stream": false
}

It then encapsulates this as an OHTTP request: the inner HTTP request is binary-encoded (BHTTP, RFC 9292), HPKE-sealed under the enclave's X25519 public key, and prefixed with a key-config header.

wire        = header || enc || ciphertext
header      = key_config_id(1B) || kem_id(2B) || kdf_id(2B) || aead_id(2B)
enc         = ephemeral X25519 public key (32B)
ciphertext  = ChaCha20-Poly1305( inner BHTTP request )
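The header is seven bytes, and RFC 9458 also feeds it into the HPKE info string, so the ciphertext is bound to the exact key configuration it names. The sketch below builds the header and info for the suite in the key configuration shown earlier and splits an incoming request on the gateway side. The HPKE Seal/Open step itself (SetupBaseS/SetupBaseR with an empty AAD) requires an HPKE implementation and is left out. The constants and helper names are illustrative assumptions.

```python
import struct

KEY_ID, KEM_ID, KDF_ID, AEAD_ID = 1, 0x0020, 0x0001, 0x0003
NENC = 32      # X25519 encapsulated key (enc) length
TAG_LEN = 16   # ChaCha20-Poly1305 authentication tag


def ohttp_header(key_id=KEY_ID, kem_id=KEM_ID, kdf_id=KDF_ID, aead_id=AEAD_ID) -> bytes:
    """key_config_id(1B) || kem_id(2B) || kdf_id(2B) || aead_id(2B), big-endian."""
    return struct.pack("!BHHH", key_id, kem_id, kdf_id, aead_id)


def hpke_info(header: bytes) -> bytes:
    """HPKE info for an OHTTP request (RFC 9458, Section 4.3)."""
    return b"message/bhttp request" + b"\x00" + header


def split_wire(wire: bytes):
    """Gateway side: split header || enc || ciphertext and check the key config."""
    if len(wire) < 7 + NENC + TAG_LEN:
        raise ValueError("request too short")
    if struct.unpack_from("!BHHH", wire, 0) != (KEY_ID, KEM_ID, KDF_ID, AEAD_ID):
        raise ValueError("unknown key configuration")
    return wire[:7], wire[7:7 + NENC], wire[7 + NENC:]
```

Rejecting an unknown key_id before attempting decryption is also how a gateway can signal that a client is using a stale configuration after key rotation.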

The client POSTs this opaque blob to the relay:

http

POST /api/v1/chat/ohttp HTTP/1.1
Host: <relay>
Content-Type: message/ohttp-req

<binary OHTTP request>

The relay cannot read any field of the inner request. It only sees the sealed bytes and the source IP.
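On the client side, sending the request is an ordinary POST whose only meaningful header is the media type. A standard-library sketch follows, where the relay URL is a hypothetical placeholder:

```python
import urllib.request

RELAY_URL = "https://relay.example/api/v1/chat/ohttp"  # hypothetical relay host


def relay_request(wire: bytes, relay_url: str = RELAY_URL) -> urllib.request.Request:
    """Wrap sealed OHTTP bytes in a POST to the relay (request is built, not sent)."""
    return urllib.request.Request(
        relay_url,
        data=wire,
        method="POST",
        headers={"Content-Type": "message/ohttp-req"},
    )


# Sending it would be a network call, for example:
#   with urllib.request.urlopen(relay_request(wire)) as resp:
#       sealed_response = resp.read()  # message/ohttp-res body
```

The prompt, model, and parameters appear nowhere in the URL or headers. They exist only inside the sealed body.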

4. Relay Forwarding & Payment

The relay attaches an X-Payment header containing a signed x402 payment authorization for $OPG on Base, then forwards the sealed payload to the gateway's /v1/ohttp endpoint. The relay pays the gateway; the relay separately bills its own users on a subscription or per-call basis.

This indirection (the relay, not the client, pays at the enclave boundary) is what allows the gateway to charge for inference without ever learning who the end user is: client billing stays entirely between the client and the relay.
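On the relay's side, forwarding amounts to copying the sealed body unchanged and adding the relay's own payment header. In this sketch the gateway URL is hypothetical, and the x402 authorization is passed in as an opaque string produced by the relay's payment tooling (not shown). The point it illustrates is that nothing identifying the client, such as an X-Forwarded-For header, is added.

```python
import urllib.request

GATEWAY_URL = "https://gateway.example/v1/ohttp"  # hypothetical gateway host


def forward_to_gateway(sealed: bytes, payment_header: str,
                       gateway_url: str = GATEWAY_URL) -> urllib.request.Request:
    """Relay side: forward the client's sealed bytes with the relay's payment.

    payment_header is the relay's own signed x402 authorization, treated here
    as an opaque string. No client IP or client identifier is attached.
    """
    return urllib.request.Request(
        gateway_url,
        data=sealed,  # byte-for-byte copy; the relay holds no key to decrypt it
        method="POST",
        headers={
            "Content-Type": "message/ohttp-req",
            "X-Payment": payment_header,
        },
    )
```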

5. Gateway Decryption and Inference

Inside the enclave, the gateway:

Verifies the relay's X-Payment header against the x402 facilitator on Base.
HPKE-decrypts the OHTTP payload using the private key that never leaves enclave memory.
Decodes the inner BHTTP request and dispatches it to the appropriate upstream provider (OpenAI, Anthropic, Google, xAI, ByteDance ModelArk, etc.) using the enclave's own egress identity.
Collects the response, signs it inside the enclave, and seals the signed response back to the client.

The signature uses the RSA-2048 signing key that was bound to the attestation:

msg_hash = keccak256( abi.encodePacked(requestHash, outputHash, timestamp) )
sig      = RSA-PSS-SHA256(signingKey, msg_hash, salt_len=32)

The sealed response includes tee_signature, tee_request_hash, tee_output_hash, tee_timestamp, and tee_id (keccak256 of the signing public key DER) so the client can independently verify the response against the on-chain TEE registry entry.
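Note that keccak256 here is Ethereum's Keccak-256, which uses different padding from the standardized SHA3-256, so Python's hashlib.sha3_256 gives a different result. The sketch below is a small pure-Python Keccak-256 plus the message-hash and tee_id derivations. In production, a maintained implementation such as pycryptodome's Crypto.Hash.keccak is preferable. The sketch assumes requestHash and outputHash are bytes32 and that timestamp is packed as a uint256 (32 bytes, big-endian), since the Solidity types are not stated above. Verifying the RSA-PSS signature (salt length 32) would use a library such as cryptography and is not shown.

```python
# Keccak-256 as used by Ethereum (Keccak padding 0x01 ... 0x80, not SHA3's 0x06).
MASK64 = (1 << 64) - 1
RATE = 136  # bytes absorbed per block for a 256-bit output

ROUND_CONSTANTS = [
    0x0000000000000001, 0x0000000000008082, 0x800000000000808A, 0x8000000080008000,
    0x000000000000808B, 0x0000000080000001, 0x8000000080008081, 0x8000000000008009,
    0x000000000000008A, 0x0000000000000088, 0x0000000080008009, 0x000000008000000A,
    0x000000008000808B, 0x800000000000008B, 0x8000000000008089, 0x8000000000008003,
    0x8000000000008002, 0x8000000000000080, 0x000000000000800A, 0x800000008000000A,
    0x8000000080008081, 0x8000000000008080, 0x0000000080000001, 0x8000000080008008,
]
# ROTATIONS[x][y] is the rho rotation offset for lane (x, y).
ROTATIONS = [
    [0, 36, 3, 41, 18],
    [1, 44, 10, 45, 2],
    [62, 6, 43, 15, 61],
    [28, 55, 25, 21, 56],
    [27, 20, 39, 8, 14],
]


def _rol(v, n):
    return ((v << n) | (v >> (64 - n))) & MASK64


def _keccak_f(a):
    """Keccak-f[1600] permutation on a 5x5 state of 64-bit lanes a[x][y]."""
    for rc in ROUND_CONSTANTS:
        c = [a[x][0] ^ a[x][1] ^ a[x][2] ^ a[x][3] ^ a[x][4] for x in range(5)]
        d = [c[(x - 1) % 5] ^ _rol(c[(x + 1) % 5], 1) for x in range(5)]
        a = [[a[x][y] ^ d[x] for y in range(5)] for x in range(5)]  # theta
        b = [[0] * 5 for _ in range(5)]
        for x in range(5):
            for y in range(5):
                b[y][(2 * x + 3 * y) % 5] = _rol(a[x][y], ROTATIONS[x][y])  # rho + pi
        a = [[b[x][y] ^ (~b[(x + 1) % 5][y] & b[(x + 2) % 5][y])
              for y in range(5)] for x in range(5)]  # chi
        a[0][0] ^= rc  # iota
    return a


def keccak256(data: bytes) -> bytes:
    padded = bytearray(data) + b"\x01"
    padded += b"\x00" * (-len(padded) % RATE)
    padded[-1] |= 0x80
    state = [[0] * 5 for _ in range(5)]
    for off in range(0, len(padded), RATE):
        for i in range(RATE // 8):  # absorb: lane i maps to (x, y) = (i % 5, i // 5)
            state[i % 5][i // 5] ^= int.from_bytes(padded[off + 8 * i:off + 8 * i + 8], "little")
        state = _keccak_f(state)
    return b"".join(state[i % 5][i // 5].to_bytes(8, "little") for i in range(4))


def tee_message_hash(request_hash: bytes, output_hash: bytes, timestamp: int) -> bytes:
    """keccak256(abi.encodePacked(requestHash, outputHash, timestamp)).

    Assumes bytes32 hashes and a uint256 timestamp (32 bytes, big-endian).
    """
    if len(request_hash) != 32 or len(output_hash) != 32:
        raise ValueError("requestHash and outputHash must be 32 bytes")
    return keccak256(request_hash + output_hash + timestamp.to_bytes(32, "big"))


def tee_id_from_der(signing_pub_der: bytes) -> bytes:
    """tee_id = keccak256(DER-encoded signing public key)."""
    return keccak256(signing_pub_der)
```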

6. Streaming Responses

For stream: true requests, the gateway returns a Chunked OHTTP response using the message/ohttp-chunked-res MIME type. The wire format is:

response_nonce
( varint(sealed_len) || sealed_chunk_ct )+
varint(0)
sealed_final_ct        // AAD = "final"

Each SSE event is sealed individually, so the relay forwards opaque frames without ever seeing token content. The final chunk uses the AAD "final" so a truncation by the relay is detectable by the client. Per-call cost metadata is sealed inside the final SSE event for streaming requests, so the relay also cannot read token usage in the streaming path.
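The framing can be parsed without any keys, which is exactly what lets the relay forward frames blindly. The sketch below splits a fully buffered chunked response into its nonce, sealed chunks, and final sealed chunk. It assumes a 32-byte response_nonce (max(Nn, Nk) for ChaCha20-Poly1305, following RFC 9458's response encapsulation) and QUIC varints for the lengths. A real client would parse incrementally as frames arrive and then open each chunk with the AEAD.

```python
NONCE_LEN = 32  # assumed: max(Nn=12, Nk=32) for ChaCha20-Poly1305


def read_varint(buf: bytes, pos: int):
    """Decode a QUIC varint at buf[pos]; return (value, next_pos)."""
    if pos >= len(buf):
        raise ValueError("truncated stream")
    first = buf[pos]
    length = 1 << (first >> 6)  # top two bits select 1, 2, 4 or 8 bytes
    if pos + length > len(buf):
        raise ValueError("truncated varint")
    value = first & 0x3F
    for b in buf[pos + 1:pos + length]:
        value = (value << 8) | b
    return value, pos + length


def split_chunked_response(buf: bytes):
    """Split a buffered message/ohttp-chunked-res body into its frames."""
    if len(buf) < NONCE_LEN:
        raise ValueError("missing response nonce")
    nonce, pos = buf[:NONCE_LEN], NONCE_LEN
    chunks = []
    while True:
        n, pos = read_varint(buf, pos)
        if n == 0:
            break  # a zero length introduces the final chunk
        if pos + n > len(buf):
            raise ValueError("truncated chunk")
        chunks.append(buf[pos:pos + n])
        pos += n
    final = buf[pos:]  # sealed with AAD "final"; runs to the end of the body
    if not final:
        raise ValueError("missing final chunk")
    return nonce, chunks, final
```

This structural check only catches a stream that ends before any final chunk. The cryptographic guarantee comes from the "final" AAD: a relay that drops the tail and relabels an ordinary chunk as final produces a frame that fails decryption.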

7. Client Verification

After decrypting the OHTTP response, the SDK verifies:

The tee_signature validates against the signing public key from the attestation.
tee_id matches the public key registered in the on-chain TEE Registry.
tee_request_hash matches the hash of the request the client actually sent (proof the enclave saw the same prompt the client intended).
The timestamp is within an acceptable skew.

If all checks pass, the client has cryptographic proof that:

The response came from an enclave running approved code (TEE Registry + attestation).
The exact prompt the client encrypted is the prompt that was answered (request hash).
The exact response the client received is the response the enclave produced (output hash + signature).
The relay could not have read the prompt or completion (HPKE confidentiality).
The gateway could not have learned the client's IP (relay indirection).
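The four checks listed earlier can be sketched as a function that returns the names of any checks that failed. The field encodings (hex strings with an optional 0x prefix, tee_timestamp in Unix seconds), the 300-second default skew, and the verify_signature callback are all assumptions for illustration rather than the SDK's actual interface. The callback would wrap the RSA-PSS-SHA256 check against the attested signing key.

```python
import time
from typing import Callable, List, Optional


def _unhex(s: str) -> bytes:
    return bytes.fromhex(s[2:] if s.startswith("0x") else s)


def verify_tee_response(
    resp: dict,
    expected_request_hash: bytes,
    registered_tee_id: bytes,
    verify_signature: Callable[[bytes, dict], bool],
    max_skew_s: int = 300,
    now: Optional[float] = None,
) -> List[str]:
    """Return the names of failed checks; an empty list means all passed."""
    failures = []
    # 1. Signature over the message hash, delegated to a caller-supplied
    #    RSA-PSS-SHA256 verifier bound to the attested signing key.
    if not verify_signature(_unhex(resp["tee_signature"]), resp):
        failures.append("signature")
    # 2. tee_id must match the key registered in the on-chain TEE Registry.
    if _unhex(resp["tee_id"]) != registered_tee_id:
        failures.append("tee_id not registered")
    # 3. The enclave must have hashed the same request this client sent.
    if _unhex(resp["tee_request_hash"]) != expected_request_hash:
        failures.append("request hash mismatch")
    # 4. Timestamp must fall within the allowed clock skew.
    current = time.time() if now is None else now
    if abs(current - int(resp["tee_timestamp"])) > max_skew_s:
        failures.append("timestamp skew")
    return failures
```

Returning every failure rather than stopping at the first makes it easier to tell a stale response (skew only) from a forged one (signature or tee_id).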

Attestation Lifecycle

Phase	What Happens
Build	Enclave image is built reproducibly. PCR0/PCR1/PCR2 measurements are recorded.
Approval	PCR hashes are added to the on-chain approved list via the TEE Registry's admin process.
Boot	Enclave generates RSA + X25519 keypairs inside TEE memory. Keys are committed to nitriding's attestation transcript.
Registration	Enclave registers itself with the on-chain TEE Registry, providing the attestation, signing key, TLS cert, payment address, and endpoint. The contract verifies the attestation, PCRs, and key bindings.
Serving	Clients fetch the OHTTP key config + attestation, verify them, then begin sending sealed requests.
Key rotation	A new keypair triggers a new attestation, a new registration, and a new key config. Clients re-verify before continuing.

See Verifiable LLM Execution → TEE Registry for the full set of on-chain checks performed during registration.

What Is and Isn't Hidden

Visible to relay	Visible to gateway	Visible to upstream model provider
Client IP	Relay IP	Enclave egress IP
Sealed OHTTP request bytes	Plaintext prompt	Plaintext prompt
OHTTP response bytes	Plaintext completion	Plaintext completion
x402 cost settlement (non-stream)	Token usage, cost	Token usage
Request timing	Model selected	Model selected

IMPORTANT

Private inference protects content confidentiality and identity unlinkability, not metadata about traffic timing or volume. A network observer that can see both the client's connection to the relay and the relay's connection to the gateway can still correlate the two through traffic analysis. Sensitive deployments should consider running the relay and the gateway under independent operators and on independent networks.

Comparison with Verifiable LLM Execution

Property	Verifiable LLM Execution	Private LLM Inference
Hardware-attested execution	✅	✅
Signed responses, on-chain proof	✅	✅
Prompt confidential from TEE operator	✅ (TLS to enclave)	✅ (HPKE to enclave)
Prompt confidential from network observer	✅ (TLS)	✅ (HPKE)
Client IP hidden from inference enclave	❌	✅
Client identity decoupled from payment	❌	✅
Transport	HTTPS + x402	OHTTP + HPKE + x402
Wire MIME type	`application/json`	`message/ohttp-req` / `message/ohttp-res` / `message/ohttp-chunked-res`

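Standards & References

RFC 9180 - Hybrid Public Key Encryption (HPKE)
RFC 9292 - Binary Representation of HTTP Messages (BHTTP)
RFC 9458 - Oblivious HTTP (OHTTP)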

Next Steps

Read Verifiable LLM Execution for the underlying TEE registration, attestation, and settlement mechanics that private inference inherits.
See Proof Settlement for how inference proofs are recorded on the OpenGradient network.
Check the Python SDK for the developer-facing interface to private inference (rolling out alongside this infrastructure).
