LLM Inference in Python SDK
This guide describes how to use OpenGradient's native LLM inference capabilities from the Python SDK. To see a list of supported LLMs, go to Supported Models.
The OpenGradient network supports various security techniques for inference verification and security. You can choose the methods most suitable for your use cases and requirements. To learn more about our security options, visit Inference Verification.
OpenGradient currently supports two types of LLM inference:
llm_completion for completion
llm_chat for chat
LLM Completion
def llm_completion(model_cid, prompt, max_tokens=100, temperature=0.0, stop_sequence=None)
Arguments
model_cid: the CID of the LLM model you want to use.
prompt: the input text prompt for the LLM.
max_tokens: (optional) the maximum number of tokens to generate. Default is 100.
temperature: (optional) controls randomness in generation. Higher values make output more random. Default is 0.0.
stop_sequence: (optional) a string that, if encountered, will stop the generation process.
Returns
- A tuple containing:
- The transaction hash of the inference request.
- A string containing the generated text response from the LLM.
Note: The actual output length may be limited by the model's capabilities or other constraints.
Example
import opengradient as og
# initialize SDK
og.init(private_key="<private_key>", email="<email>", password="<password>")
# run LLM inference
tx_hash, response = og.llm_completion(
    model_cid='meta-llama/Meta-Llama-3-8B-Instruct',
    prompt="Translate the following English text to French: 'Hello, how are you?'",
    max_tokens=50,
    temperature=0.0
)
# print output
print("Transaction Hash:", tx_hash)
print("LLM Output:", response)
LLM Chat
def llm_chat(model_cid, messages, max_tokens=100, temperature=0.0, stop_sequence=None, tools=[], tool_choice=None)
Arguments
model_cid: the CID of the LLM model you want to use.
messages: a list of inputs for the chat history used by the LLM for context. Formatting is based on OpenAI's API docs.
max_tokens: (optional) the maximum number of tokens to generate. Default is 100.
temperature: (optional) controls randomness in generation. Higher values make output more random. Default is 0.0.
stop_sequence: (optional) a string that, if encountered, will stop the generation process.
tools: (optional) a list of tools that the LLM can use. Formatting is based on OpenAI's API docs.
tool_choice: (optional) allows the user to specify which tool the LLM must use. Default is "auto", which lets the LLM choose.
Note: tools and tool_choice are only available on certain models based on vLLM's support (e.g. mistralai/Mistral-7B-Instruct-v0.3).
Returns
- A tuple containing:
- The transaction hash of the inference request.
- The finish reason (stop, tool_calls, etc.)
- A dict containing the LLM chat return, as well as applicable data such as which tool was called or the tool_call_id.
Note: The actual output length may be limited by the model's capabilities or other constraints.
Example
import opengradient as og
# initialize SDK
og.init(private_key="<private_key>", email="<email>", password="<password>")
# create messages history
messages = [
    {
        "role": "system",
        "content": "You are a helpful AI assistant.",
        "name": "HAL"
    },
    {
        "role": "user",
        "content": "Hello! How are you doing? Can you repeat my name?",
    }
]
# run LLM inference
tx_hash, finish_reason, message = og.llm_chat(
    model_cid=og.LLM.MISTRAL_7B_INSTRUCT_V3,
    messages=messages
)
# print output
print("Transaction Hash:", tx_hash)
print("Finish Reason:", finish_reason)
print("LLM Output:", message)
Tools Example
# Define your tools
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The city to find the weather for, e.g. 'San Francisco'"
                },
                "state": {
                    "type": "string",
                    "description": "The two-letter abbreviation for the state that the city is in, e.g. 'CA' which would mean 'California'"
                },
                "unit": {
                    "type": "string",
                    "description": "The unit to fetch the temperature in",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["city", "state", "unit"]
        },
    }
}]
# Message conversation
messages = [
    {
        "role": "system",
        "content": "You are an AI assistant that helps the user with tasks. Use tools if necessary.",
    },
    {
        "role": "user",
        "content": "Hi! How are you doing today?"
    },
    {
        "role": "assistant",
        "content": "I'm doing well! How can I help you?",
    },
    {
        "role": "user",
        "content": "Can you tell me what the temperature will be in Dallas, in fahrenheit?"
    }
]
tx_hash, finish_reason, message = og.llm_chat(
    model_cid=og.LLM.MISTRAL_7B_INSTRUCT_V3,
    messages=messages,
    tools=tools
)
# print output
print("Transaction Hash:", tx_hash)
print("Finish Reason:", finish_reason)
print("LLM Output:", message)
CLI LLM Inference
We also have explicit support for using LLMs through the completion
and chat
commands in the CLI.
For example, you can run a completion inference with Llama-3 using the following command:
opengradient completion --model "meta-llama/Meta-Llama-3-8B-Instruct" --prompt "hello who are you?" --max-tokens 50
Or you can use files instead of text input in order to simplify your command:
opengradient chat --model "mistralai/Mistral-7B-Instruct-v0.3" --messages-file messages.json --tools-file tools.json --max-tokens 200
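Both --messages-file and --tools-file expect JSON files containing the same message and tool structures shown in the Python examples above. As a rough sketch (the filenames and contents here are just placeholders), a messages.json might contain:
[
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What will the weather be in Dallas, in fahrenheit?"}
]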
The list of models we support can be found in the Model Hub.
To get more information on how to run LLMs using the CLI, you can run:
opengradient completion --help
opengradient chat --help