| title | Lesson 2 |
|---|---|
| weight | 10 |
| draft | true |
Welcome to the second lesson of the OpenLLM course with Apache Open Serverless! In this lesson, we'll learn how to actually communicate with Large Language Models by creating an LLM chat with streaming capabilities.
- Go to GitHub → Mastro GPT: https://github.com/mastrogpt
- Fork the repository to your account (create a copy)
- Launch the Code Space from your forked repository
Why fork? This gives you write permissions to save changes using Git version control without depending on the original repository.
- Select Lesson 2 from the lessons menu
- This downloads all lesson files (both markdown and PDF formats)
- Each lesson is independent - you don't need previous lesson files
Before starting the exercises, configure a keyboard shortcut to copy text from the editor directly to the terminal. This is extremely useful for this course:
- Open Settings: Click the gear icon (⚙️) in the bottom left corner
- Select Keyboard Shortcuts: Choose "Keyboard Shortcuts" from the menu
- Search for the Command: Type "run selected text in active terminal"
- Set Your Shortcut: Click on the command and set your preferred key combination
- Recommended: Ctrl+Enter (as used in the course) - this allows you to copy text from files directly to the terminal
Why this shortcut? It's extremely convenient and saves time when you need to run commands from the lesson files. This feature is only available in VS Code and is not enabled by default.
LLMs are protected with the same credentials as Open Serverless:
- ollama_host: the URL to access the LLM service
- AUTH: authentication credentials from your Open Serverless login
Always use this pattern in your code:

```python
import os

def get_secret(args, env_var_name):
    # Prefer the value passed in args (production), fall back to the environment (tests)
    return args.get(env_var_name) or os.environ.get(env_var_name)
```

This allows your code to work in both test and production environments.
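For example, a hypothetical action entry point could read both credentials with this helper (a minimal sketch; the secret names are the ones used later in this lesson):

```python
def main(args):
    # Read both credentials with the fallback pattern above:
    # args in production, environment variables in the test environment
    ollama_host = get_secret(args, "ollama_host")
    auth = get_secret(args, "auth")
    return {"body": f"credentials configured: {bool(ollama_host and auth)}"}
```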
- Open a terminal and run:

```bash
obs ai cli
```

- Import the required modules and set the credentials:

```python
import os

ollama_host = args.get('ollama_host')
auth = args.get('auth')
base_url = f"https://{auth}@{ollama_host}"
```

- Test the connection:

```bash
curl "$base_url"
# Should return: "Ollama is running"
```

Create a message with this structure:
```python
message = {
    "model": "llama-3.1-8b",   # Meta's Llama 3.1 model (8B parameters)
    "prompt": "Who are you?",
    "stream": False,           # Start without streaming
    "input": "Who are you?"
}
```

Then send it to the generate endpoint and print the answer:

```python
import requests

url = f"{base_url}/api/generate"
response = requests.post(url, json=message)
result = response.json()
print(result['response'])
```

Streaming allows you to see LLM responses in real time instead of waiting for the complete generation. This makes interactions feel much more immediate.
message["stream"] = True # Enable streaming
response = requests.post(url, json=message, stream=True)
# Response is now an iterator
for chunk in response.iter_lines():
if chunk:
data = json.loads(chunk.decode('utf-8'))
print(data['response'], end='', flush=True)Each streaming response contains:
model: Model nameresponse: Text chunkdone: Boolean flag (true when complete)- Additional metadata (context, duration, etc.)
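As a small illustration of how these fields fit together, here is a variant of the loop above (a sketch, not part of the lesson code) that accumulates the chunks until done becomes true:

```python
full_text = ""
for chunk in response.iter_lines():
    if not chunk:
        continue
    data = json.loads(chunk.decode('utf-8'))
    full_text += data.get('response', '')  # append the text chunk
    if data.get('done'):                   # the final message sets done to true
        break

print(full_text)
```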
Actions in serverless environments run asynchronously and lose contact with the web server, which makes streaming complex.
The streamer component provides:
- stream_host: the host for the streaming connection
- stream_port: the port for the streaming connection
- A socket to receive intermediate results
```python
import socket

def stream_to_socket(iterator, args):
    stream_host = args.get('stream_host')
    stream_port = args.get('stream_port')

    # Connect to the streamer socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((stream_host, int(stream_port)))

    # Send each intermediate result as a line
    for item in iterator:
        sock.send(f"{item}\n".encode())

    sock.close()
    return "Streaming completed"
```

Use stream_mock for local testing:
```python
from test.stream_mock import StreamMock

args = {}
mock = StreamMock()
mock.start(args)  # start the local streaming mock

# Your streaming function here; countdown(10) is an example iterator
result = stream_to_socket(countdown(10), args)

mock.stop()
print(result)
```

Location: Look for TODO E2.1
Task: Add the required parameters to access and authorize Ollama

- Add @perm ollama_host
- Add @perm auth
- Use args.get('ollama_host') and args.get('auth'), as in the sketch below
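Putting those steps together, the relevant part of the action could look roughly like this (a sketch only; where exactly the @perm annotations go follows the convention already used in the lesson file):

```python
# @perm ollama_host
# @perm auth

def main(args):
    # Secrets made available through the @perm annotations
    ollama_host = args.get('ollama_host')
    auth = args.get('auth')
    base_url = f"https://{auth}@{ollama_host}"
    return {"body": "Ollama configured" if ollama_host and auth else "missing credentials"}
```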
Location: Look for TODO E2.2
Task: Fix the streaming to handle Ollama's response format
```python
# Decode the JSON response
data = json.loads(chunk.decode('utf-8'))
# Extract only the response part
response_text = data.get('response', 'error')
```

Location: Look for TODO E2.3
Task: Add model switching functionality

```python
if input_text == "llama":
    model = "llama-3.1-8b"
elif input_text == "deepseek":
    model = "deepseek-coder:6.7b"
```

```bash
# Deploy all functions
obs deploy

# Deploy a specific function
obs ide deploy function-name

# Incremental deployment (recommended)
obs ide dev
```

- Always return streaming: true for streaming functions (see the sketch after this list)
- Use @perm annotations to pass secrets
- The CLI sees the test environment variables
- Production uses the configuration + packages environment
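As a rough sketch of how the pieces of this lesson can be combined in a streaming action (it reuses get_secret and stream_to_socket from above; the function name and the exact return contract are illustrative, not the lesson's reference solution):

```python
import json
import requests

def ollama_chunks(response):
    # Yield only the text part of each streamed JSON line (see exercise E2.2)
    for chunk in response.iter_lines():
        if chunk:
            data = json.loads(chunk.decode('utf-8'))
            yield data.get('response', '')

def stream_chat(args, prompt):
    # Build a streaming request against the generate endpoint
    ollama_host = get_secret(args, "ollama_host")
    auth = get_secret(args, "auth")
    url = f"https://{auth}@{ollama_host}/api/generate"
    message = {"model": "llama-3.1-8b", "prompt": prompt, "stream": True}
    response = requests.post(url, json=message, stream=True)

    # Forward the text chunks to the streamer socket
    stream_to_socket(ollama_chunks(response), args)

    # Streaming functions report streaming: true in their result
    return {"streaming": True}
```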
- LLM Access: How to connect to and communicate with Large Language Models
- Secret Management: Proper handling of credentials in both test and production
- Streaming Implementation: Real-time response streaming for better user experience
- Model Switching: Dynamic model selection in your applications
- Testing: Using mocks to test streaming functionality locally
- Practice with the exercises
- Experiment with different models
- Build upon this foundation for more complex LLM applications
- Environment Variables: Use obs config dump to see all available configuration (example below)
- Incremental Deploy: Use obs ide dev for automatic deployment on save
- Timeout Handling: Be aware of execution time limits (default: 3 minutes)
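For example, assuming the Ollama values appear in the dumped configuration, you could filter them with:

```bash
obs config dump | grep -i ollama
```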
Remember: Always test your streaming functions with the mock before deploying to production!