
NeuroML LLM Inference

This guide describes how to use OpenGradient's native LLM inference capabilities from smart contracts. For the list of supported LLMs, see Supported Models.

OpenGradient supports various techniques for inference verification and security. Developers can choose the most suitable methods for their use cases and requirements. To learn more about the security options we offer, go to Inference Verification.

LLM Inference Precompile

OpenGradient inference is provided through a standard interface that any smart contract can use. The interface is implemented by a custom precompile called NeuroML.

TIP

The precompile is accessible at address 0x00000000000000000000000000000000000000F4.

The two high-level functions exposed by this precompile are runModel and runLlm. This page focuses on runLlm, which is optimized for LLMs. See Model Inference for more details on our generic model inference method.

The method is defined as follows:

solidity
interface NeuroML {

    // security modes offered for LLMs
    enum LlmInferenceMode { VANILLA, TEE }

    // executes an LLM with the given request
    function runLlm(
        LlmInferenceMode mode,
        LlmInferenceRequest memory request
    ) external returns (LlmResponse memory);
}

Calling runLlm from your smart contract will atomically execute the requested model with the given input and return the result synchronously.
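
Any contract that wants to call the precompile needs a handle to the NeuroML interface at the address noted above. The snippet below is a minimal sketch with an illustrative constant name; in practice, the OGInference.sol import used in the full example further down already provides an equivalent NEURO_ML handle.

solidity
// Minimal sketch: bind the NeuroML interface to the precompile address
// (0x00000000000000000000000000000000000000F4) so it can be called like
// a regular contract. The constant name is illustrative.
contract UsesNeuroMl {
    NeuroML internal constant NEURO_ML = NeuroML(address(uint160(0xF4)));
}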

LLM Input and Output

The request and response formats for LLMs are similar to the OpenAI API format, but with some simplifications. The main input is a single prompt, and the final answer is returned as a single string. We currently do not support returning multiple answer choices. In addition, common LLM parameters such as the stop sequence, maximum tokens, and temperature can be tuned for each request, giving you full control over the generation process.

The list of LLMs you can use can be found here. These models can be used through the runLlm method, making swapping and upgrading models easy.

These two types define the request and response formats:

solidity
struct LlmInferenceRequest {
    string model; // ID of the LLM to use
    string prompt; // LLM prompt
    uint32 max_tokens; // max tokens to generate
    string[] stop_sequence; // stop sequences for model response
    uint32 temperature; // model temperature (between 0 and 100)
}

struct LlmResponse {
    string answer; // answer generated by the LLM

    bool is_simulation_result; // true if the result comes from a simulated call; see Simulation Results
}
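
As an illustration, a request can also be populated with named fields, which makes the mapping to the struct above explicit. The values in this sketch are arbitrary placeholders:

solidity
// Minimal sketch: build a request with named fields (values are arbitrary).
function buildRequest() pure returns (LlmInferenceRequest memory) {
    string[] memory stops = new string[](1);
    stops[0] = "<end>";

    return LlmInferenceRequest({
        model: "meta-llama/Meta-Llama-3-8B-Instruct",
        prompt: "Summarize this position in one sentence.",
        max_tokens: 200,
        stop_sequence: stops,
        temperature: 50 // on the 0-100 scale described above
    });
}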

Running LLMs can naturally take a long time, so tuning these parameters, as well as the length of the input and output, can have a significant impact on transaction speed and cost.

To read more about is_simulation_result, please see Simulation Results.

Example Smart Contract

The following smart contract uses Llama 3 8B to run a short prompt and stores the answer in contract storage.

solidity
import "opengradient-neuroml/src/OGInference.sol";

contract LlmExample {

    string answer;

    // Execute a Large Language Model directly in your smart contract
    function runLlm() public {
        // define stop-sequence
        string[] memory stopSequence = new string[](1);
        stopSequence[0] = "<end>";

        // run Llama 3 model
        LlmResponse memory llmResult = NEURO_ML.runLlm(
            NeuroML.LlmInferenceMode.VANILLA,
            LlmInferenceRequest(
                "meta-llama/Meta-Llama-3-8B-Instruct",
                "Hello sir, who are you?\n<start>",
                1000,
                stopSequence,
                0
            )
        );

        // handle result
        if (llmResult.is_simulation_result) {
            answer = "empty";
        } else {
            answer = llmResult.answer;
        }
    }
}
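
Swapping the model or the verification mode only requires changing the corresponding argument. For example, a variant of the call above that requests TEE-verified inference could look like the snippet below; whether a given model is available under TEE depends on the deployment, so check Inference Verification and Supported Models first.

solidity
// Minimal sketch: the same request, but with the TEE verification mode.
// Model availability under TEE depends on the deployment.
LlmResponse memory teeResult = NEURO_ML.runLlm(
    NeuroML.LlmInferenceMode.TEE,
    LlmInferenceRequest(
        "meta-llama/Meta-Llama-3-8B-Instruct",
        "Hello sir, who are you?\n<start>",
        1000,
        stopSequence,
        0
    )
);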
