LLM Inference in Python SDK
This guide describes how to use OpenGradient's native LLM inference capabilities from the Python SDK. To see a list of supported LLMs, go to Supported Models.
The OpenGradient network supports various security techniques for inference verification and security. You can choose the methods most suitable for your use cases and requirements. To learn more about our security options, visit Inference Verification.
OpenGradient currently supports two types of LLM inference:
llm_completion for completion
llm_chat for chat
LLM Completion
def llm_completion(model_cid, prompt, max_tokens=100, temperature=0.0, stop_sequence=None)
Arguments
model_cid: the CID of the LLM model you want to use.
prompt: the input text prompt for the LLM.
max_tokens: (optional) the maximum number of tokens to generate. Default is 100.
temperature: (optional) controls randomness in generation. Higher values make output more random. Default is 0.0.
stop_sequence: (optional) a string that, if encountered, will stop the generation process.
Returns
- A tuple containing:
- The transaction hash of the inference request.
- A string containing the generated text response from the LLM.
Note: The actual output length may be limited by the model's capabilities or other constraints.
Example
import opengradient as og
# initialize SDK
og.init(private_key="<private_key>", email="<email>", password="<password>")
# run LLM inference
tx_hash, response = og.llm_completion(
    model_cid='meta-llama/Meta-Llama-3-8B-Instruct',
    prompt="Translate the following English text to French: 'Hello, how are you?'",
    max_tokens=50,
    temperature=0.0
)
# print output
print("Transaction Hash:", tx_hash)
print("LLM Output:", response)
LLM Chat
def llm_chat(model_cid, messages, max_tokens=100, temperature=0.0, stop_sequence=None, tools=[], tool_choice=None)
Arguments
model_cid: the CID of the LLM model you want to use.
messages: a list of inputs for the chat history used by the LLM for context. Formatting is based on OpenAI's API docs.
max_tokens: (optional) the maximum number of tokens to generate. Default is 100.
temperature: (optional) controls randomness in generation. Higher values make output more random. Default is 0.0.
stop_sequence: (optional) a string that, if encountered, will stop the generation process.
tools: (optional) a list of tools that the LLM can use. Formatting is based on OpenAI's API docs.
tool_choice: (optional) allows the user to specify which tool the LLM must use. Default is "auto", which lets the LLM choose.
Note: tools and tool_choice are only available on certain models based on vLLM's support (e.g. mistralai/Mistral-7B-Instruct-v0.3).
Returns
- A tuple containing:
- The transaction hash of the inference request.
- The finish reason (stop, tool_calls, etc.)
- A dict containing the LLM chat return, as well as applicable data such as which tool was called or the tool_call_id.
Note: The actual output length may be limited by the model's capabilities or other constraints.
Example
import opengradient as og
# initialize SDK
og.init(private_key="<private_key>", email="<email>", password="<password>")
# create messages history
messages = [
    {
        "role": "system",
        "content": "You are a helpful AI assistant.",
        "name": "HAL"
    },
    {
        "role": "user",
        "content": "Hello! How are you doing? Can you repeat my name?",
    }
]
# run LLM inference
tx_hash, finish_reason, message = og.llm_chat(
    model_cid=og.LLM.MISTRAL_7B_INSTRUCT_V3,
    messages=messages
)
# print output
print("Transaction Hash:", tx_hash)
print("Finish Reason:", finish_reason)
print("LLM Output:", message)
Tools Example
# Define your tools
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The city to find the weather for, e.g. 'San Francisco'"
                },
                "state": {
                    "type": "string",
                    "description": "The two-letter abbreviation for the state that the city is in, e.g. 'CA' which would mean 'California'"
                },
                "unit": {
                    "type": "string",
                    "description": "The unit to fetch the temperature in",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["city", "state", "unit"]
        },
    }
}]
# Message conversation
messages = [
    {
        "role": "system",
        "content": "You are an AI assistant that helps the user with tasks. Use tools if necessary.",
    },
    {
        "role": "user",
        "content": "Hi! How are you doing today?"
    },
    {
        "role": "assistant",
        "content": "I'm doing well! How can I help you?",
    },
    {
        "role": "user",
        "content": "Can you tell me what the temperature will be in Dallas, in fahrenheit?"
    }
]
tx_hash, finish_reason, message = og.llm_chat(
    model_cid=og.LLM.MISTRAL_7B_INSTRUCT_V3,
    messages=messages,
    tools=tools
)
# print output
print("Transaction Hash:", tx_hash)
print("Finish Reason:", finish_reason)
print("LLM Output:", message)
CLI LLM Inference
We also have explicit support for using LLMs through the completion
and chat
commands in the CLI.
For example, you can run a completion inference with Llama-3 using the following command:
opengradient completion --model "meta-llama/Meta-Llama-3-8B-Instruct" --prompt "hello who are you?" --max-tokens 50
Or you can use files instead of text input in order to simplify your command:
opengradient chat --model "mistralai/Mistral-7B-Instruct-v0.3" --messages-file messages.json --tools-file tools.json --max-tokens 200
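Both --messages-file and --tools-file expect JSON files containing the same message and tool structures shown in the Python examples above. As a rough sketch (the filenames and contents here are just placeholders), a messages.json might contain:
[
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What will the weather be in Dallas, in fahrenheit?"}
]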
The list of models we support can be found in the Model Hub.
To get more information on how to run LLMs using the CLI, you can run:
opengradient completion --help
opengradient chat --help