NeuroML LLM Inference
This guide describes how to use OpenGradient's native LLM inference capabilities from smart contracts. For the list of supported LLMs, go to Supported Models.
OpenGradient supports several techniques for inference verification and security. Developers can choose the methods best suited to their use cases and requirements. To learn more about the available security options, go to Inference Verification.
LLM Inference Precompile
OpenGradient inference is provided through a standard interface that any smart contract can use. The inference is implemented by a custom precompile called NeuroML.
TIP
The precompile is accessible at address 0x00000000000000000000000000000000000000F4.
The two high-level functions exposed by the precompile are runModel and runLlm. This page focuses only on runLlm, which is optimized for LLMs. Check Model Inference for more details on our generic model inference method.
The method is defined as follows:
interface NeuroML {
    // security modes offered for LLMs
    enum LlmInferenceMode { VANILLA, TEE }

    // executes an LLM inference request
    function runLlm(
        LlmInferenceMode mode,
        LlmInferenceRequest memory request
    ) external returns (LlmResponse memory);
}
Calling runLlm from your smart contract will atomically execute the requested model with the given input and return the result synchronously.
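If you do not want to rely on the OGInference.sol helper import used in the example at the end of this page (which is where the NEURO_ML constant referenced there comes from), a contract can bind the interface to the precompile address itself. The sketch below is illustrative only; the contract and constant names are made up for this example.

contract PrecompileBindingSketch {
    // Bind the NeuroML interface to the precompile address from the tip above.
    NeuroML internal constant LLM_PRECOMPILE = NeuroML(address(uint160(0xF4)));

    function askLlm(LlmInferenceRequest memory request)
        internal
        returns (LlmResponse memory)
    {
        // Pick a security mode (VANILLA or TEE, see Inference Verification)
        // and execute the request synchronously.
        return LLM_PRECOMPILE.runLlm(NeuroML.LlmInferenceMode.VANILLA, request);
    }
}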
LLM Input and Output
The request and response for LLMs are similar to the OpenAI API format, with some simplifications. The main input is a single prompt, and the final answer is returned as a single string; multiple answer choices per request are not currently supported. In addition, common LLM parameters such as the stop sequence, max tokens, and temperature can be tuned for each request, giving you full control over the generation process.
The list of LLMs you can use can be found here. These models can all be used through the runLlm method, making it easy to swap and upgrade models.
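For example, because the model is addressed by its string ID, a contract could keep the ID in storage and let an administrator swap models without redeploying. The sketch below uses assumed names and omits access control for brevity.

contract SwappableModel {
    // Current model ID; can be updated later to upgrade the model.
    string public modelId = "meta-llama/Meta-Llama-3-8B-Instruct";

    // NOTE: add access control (e.g. an onlyOwner check) before using this in practice.
    function setModel(string calldata newModelId) external {
        modelId = newModelId;
    }
}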
These two types define the request and response:
struct LlmInferenceRequest {
    string model;           // ID of the LLM to use
    string prompt;          // LLM prompt
    uint32 max_tokens;      // max tokens to generate
    string[] stop_sequence; // stop sequences for the model response
    uint32 temperature;     // model temperature (between 0 and 100)
}

struct LlmResponse {
    string answer;             // answer generated by the LLM
    bool is_simulation_result; // true if the result was produced during transaction simulation rather than execution
}
Running LLMs can naturally take longer, so tuning these parameters, as well as the length of the input and output, can have a significant impact on transaction speed and cost.
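For instance, a tight max_tokens budget combined with a stop sequence bounds how much text the model can generate per call. The helper below is a sketch under assumed names; the specific values are illustrative, not recommendations.

contract BoundedRequestExample {
    // Builds a request that caps output at 64 tokens and stops early at a
    // custom marker, keeping per-call latency and cost bounded.
    function buildBoundedRequest(string memory prompt)
        internal
        pure
        returns (LlmInferenceRequest memory)
    {
        string[] memory stopSequence = new string[](1);
        stopSequence[0] = "<end>";

        return LlmInferenceRequest(
            "meta-llama/Meta-Llama-3-8B-Instruct", // model ID
            prompt,                                // LLM prompt
            64,                                    // max_tokens: small output budget
            stopSequence,                          // stop generation at the marker
            0                                      // temperature on the 0-100 scale
        );
    }
}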
To read more about is_simulation_result, please see Simulation Results.
Example Smart Contract
The following smart contract uses Llama 3 8B to run a short prompt and stores the model's answer in contract storage.
import "opengradient-neuroml/src/OGInference.sol";
contract LlmExample {
    string answer;

    // Execute a Large Language Model directly in your smart contract
    function runLlm() public {
        // define the stop sequence
        string[] memory stopSequence = new string[](1);
        stopSequence[0] = "<end>";

        // run the Llama 3 8B model through the NeuroML precompile
        LlmResponse memory llmResult = NEURO_ML.runLlm(
            NeuroML.LlmInferenceMode.VANILLA,
            LlmInferenceRequest(
                "meta-llama/Meta-Llama-3-8B-Instruct",
                "Hello sir, who are you?\n<start>",
                1000,
                stopSequence,
                0
            )
        );

        // handle the result: fall back to a placeholder during simulation
        if (llmResult.is_simulation_result) {
            answer = "empty";
        } else {
            answer = llmResult.answer;
        }
    }
}