sar/nebius

Nebius

Soperator GPU Cluster Finetuning + Inference + Eval | Demo Day

Let's develop a scalable, client-facing architecture based on Soperator (a Kubernetes operator for Slurm) deployed onto an mk8s cluster within Nebius AI Cloud. Hardware resources, Terraform configuration, and model training code are included in this workflow.

This project was created for the Nebius Demo Day presentation.

Platform Overview

Kubernetes is a solid way of orchestrating containerized workloads, but it's commonly used with preemptible compute, pods, and ephemeral storage. That's where Slurm comes into play, enabling long-running batch queueing, scheduling, and resource partition allocation between workloads.

Nebius Soperator manages lifecycle state using the Kubernetes operator convention and CRDs, giving a higher degree of reliability and replicability while reducing deployment complexity. Running Slurm on Kubernetes clusters combines the benefits of both approaches and is a proven architecture in the field, so end users can work with HPC paradigms within a cloud-native Kubernetes environment.

By following this solutions architecture, you'll be able to:

  • Deploy a Soperator mk8s cluster
  • Manage E2E (end-to-end) state
  • Perform NVIDIA GPU readiness checks
  • Fine-tune Large Language Models (LLMs) using PyTorch
  • Scale to distributed training with torchrun, DDP, and NCCL
  • Run inference
  • Evaluate and compare model performance

This approach scales to multiple worker nodes alongside a shared filesystem, discussed below.

Architecture Diagram

This solution implements several key components:

  1. Distributed fine-tuning using transformers, PyTorch, NCCL, ... launched as a Slurm batch job
  2. Inference server meant to be containerized or run standalone on a dedicated worker node
  3. Evaluation metrics run after training as a Slurm batch job
  4. JSON config files for creating reproducible training results (MLOps)
  5. Terraform deployment of Soperator on mk8s

Nebius Soperator Architecture Diagram


Scenario

A financial services institution called "InnovationBank" wants to develop agents for customer interactions and fine-tune an LLM on its domain-specific dataset. This will enable better customer interactions, reduce costs, and optimize for compliance in a regulated industry.

By applying fine-tuning we can reduce model token generation time, improve accuracy in this domain-specific context, and enable integration within business applications. Because the selected Qwen3 model has reasoning capability, fine-tuning also trains it to bypass reasoning and respond directly, allowing faster time to a response. Model serving can be extended through nested containerization using an API server or web-based interface.

Note: More details will be discussed during the Demo Day presentation.


Process Overview

Base LLM Model

First, a model is selected from HuggingFace to meet both customer requirements and capacity limits. Qwen3-8B through Qwen3-32B are common models with reasoning capability and can be trained on L40s GPUs (48GB) by adjusting batch_size parameters. The 14B model was selected because it balances training speed and accuracy and fits on the available GPUs. Other key factors considered were:

  • Safeguards built into the pre-trained LLM
  • Trainability, which indicates whether fine-tuning can be applied
  • License that permits use in a commercial application
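
As a rough back-of-the-envelope check on the capacity limit mentioned above, the sketch below estimates the memory taken by model weights alone at 16-bit and 4-bit precision. The parameter counts are nominal values read off the model names, 1 GB is taken as 1e9 bytes, and activations, optimizer state, LoRA adapters, KV cache, and quantization overhead are all ignored, so treat the numbers as a lower bound rather than a sizing guarantee.

```python
GPU_MEMORY_GB = 48  # per L40s GPU, as stated above

# Nominal parameter counts inferred from the model names (assumption).
NOMINAL_PARAMS = {"Qwen3-8B": 8e9, "Qwen3-14B": 14e9, "Qwen3-32B": 32e9}


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for name, params in NOMINAL_PARAMS.items():
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    fits = fp16 < GPU_MEMORY_GB
    print(f"{name}: fp16 ~{fp16:.0f} GB, 4-bit ~{int4:.0f} GB, "
          f"fp16 weights fit on one GPU: {fits}")
```

Under these assumptions the 32B weights do not fit on a single 48GB GPU at 16-bit precision, which is one reason quantization and batch size tuning matter on this hardware.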

Dataset Selection

A synthetic dataset could be generated using the same model; however, that would consume resources, delay training, and require careful human-in-the-loop curation. Open-source datasets are limited, so Financial-Instruct-500k was selected because it contains non-reasoning conversations between users and AI agents that meet the client's industry criteria.


Pre-Processing

Since training uses the JSONL format, the dataset must first be converted and then split into train and test files. The scripts are located inside the modules folder:

  • dataset_convert.py transforms parquet to JSONL.
  • dataset_split.py uses ratio to split an input JSONL file into _train and _test datasets in the same folder. Setting rand_sort randomly shuffles the order of the dataset.
  • dataset_tokenize.py converts JSONL strings of user and assistant conversations into the HuggingFace tokenized format. It loads the tokenizer component from the pre-trained LLM and then processes records in parallel on CPU. Outputs contain input_ids, labels, and attention_mask.
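
To make the split step concrete, here is a minimal sketch of how a ratio-based JSONL split with optional shuffling could look. It is not the code in dataset_split.py: the function name split_jsonl, the seed parameter, and the exact output file naming are assumptions chosen to mirror the description above.

```python
import json
import random
from pathlib import Path
from typing import List, Tuple


def _write_lines(path: Path, lines: List[str]) -> None:
    """Write one JSON record per line, each terminated by a newline."""
    path.write_text("".join(line + "\n" for line in lines), encoding="utf-8")


def split_jsonl(path: str, ratio: float = 0.9, rand_sort: bool = False,
                seed: int = 0) -> Tuple[Path, Path]:
    """Split a JSONL file into <stem>_train and <stem>_test in the same folder."""
    src = Path(path)
    with src.open(encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]

    for line in lines:
        json.loads(line)  # fail early on malformed records

    if rand_sort:
        # A fixed seed keeps the shuffle reproducible between runs.
        random.Random(seed).shuffle(lines)

    cut = int(len(lines) * ratio)
    train_path = src.with_name(f"{src.stem}_train{src.suffix}")
    test_path = src.with_name(f"{src.stem}_test{src.suffix}")
    _write_lines(train_path, lines[:cut])
    _write_lines(test_path, lines[cut:])
    return train_path, test_path
```

Seeding the shuffle is a design choice in this sketch: it keeps the train and test files identical across reruns, which fits the goal of reproducible training results.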

Fine-tuning Process

The main finetune.py script implements the LoRA technique to efficiently adapt the pre-trained LLM to the financial domain. It handles distributed training across multiple GPUs and applies 4-bit quantization with float16 mixed precision to reduce memory usage while maintaining model accuracy. The module loads the raw LLM, prepares it for k-bit training, applies LoRA adapters, and trains on the pre-tokenized dataset.

  • Training arguments are read from a configuration file in JSON format (see section below). Parameters are applied with the create_training_arguments() function.
  • finetune_distributed.sh uses the torchrun command with DDP to perform gradient syncs over shared Ethernet, as the current setup does not support InfiniBand. It assumes the model, dataset, and checkpoints are accessible from all nodes.
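
The idea of driving training from a JSON config can be sketched as below. This is an illustration only, not the repository's create_training_arguments(): the TrainingConfig class, its fields, and its defaults are hypothetical (the field names follow common Hugging Face TrainingArguments parameters), and the real config files may use different keys.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class TrainingConfig:
    # Hypothetical fields and defaults, for illustration only.
    output_dir: str = "/mnt/chkpt"
    per_device_train_batch_size: int = 1
    gradient_accumulation_steps: int = 8
    learning_rate: float = 2e-4
    num_train_epochs: float = 1.0
    fp16: bool = True


def load_training_config(text: str) -> TrainingConfig:
    """Parse JSON text, rejecting unknown keys so typos surface early."""
    raw = json.loads(text)
    known = {f.name for f in fields(TrainingConfig)}
    unknown = sorted(set(raw) - known)
    if unknown:
        raise ValueError(f"Unknown config keys: {unknown}")
    return TrainingConfig(**raw)


example = '{"per_device_train_batch_size": 2, "learning_rate": 1e-4}'
cfg = load_training_config(example)
print(cfg)
```

Rejecting unknown keys is a deliberate choice here: a misspelled parameter fails loudly instead of silently falling back to a default, which helps keep runs reproducible.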

Slurm job for finetuning


Model Evaluation

The performance of the base model is compared to the fine-tuned version using perplexity and token generation speed. Other metrics like BLEU and ROUGE can also be implemented (todo). Evaluation first runs each model on the test dataset and then calculates the deltas.

  • Outputs are written to evaluator_[job_number].out (STDOUT) in JSON format and can be piped into jq for filtering.
  • Samples are drawn from the test split as inputs, and the generated responses are compared with the references.

Example output:

{
  "base_model": {
    "perplexity": 7.033111572265625,
    "message_length_avg": 671.0,
    "sample_generations": [...]
  },
  "finetuned_model": {
    "perplexity": 4.704128742218018,
    "message_length_avg": 125.1,
    "sample_generations": [
      {
        "input": "Question:\nWhat is the average cost of a portfolio on a trading site?",
        "reference": "It sounds for the most part you are a 'buy and hold' type investor and continue to contribute monthly (truncated)",
        "generated": "The average cost of a portfolio on a trading site like Questrade.com can vary depending on the number of trades and shares. However, Questrade's pricing is $0.01 per share with a minimum of $4.95 and a maximum of $9.95 per trade.\n\nFor example, if you make three trades a month, your annual cost would be:\n\n* 3 trades/month x $4.95/trade = $14.85/month (assuming < 495 shares/trade)\n* $14.85/month x 12 months = $178.20/year\n\nQuestrade has no management fees or other charges, allowing you to manage your accounts directly. This pricing structure can be cost-effective, especially for long-term investors who adopt a 'buy and hold' strategy."
      }
    ]
  },
  "comparison": {
    "perplexity": 33.11454405517637,
    "speed": -83.46897662230256,
    "message_length": -81.35618479880775
  }
}
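
The comparison block in the example output is consistent with simple percentage changes relative to the base model. The sketch below reproduces two of the values; the function names and sign conventions are inferred from the numbers above, not taken from the evaluator source, and the speed delta is left out because the underlying tokens/sec figures are not shown in the output.

```python
def perplexity_improvement_pct(base: float, finetuned: float) -> float:
    """Positive when the fine-tuned perplexity is lower (better) than the base."""
    return (base - finetuned) / base * 100.0


def relative_change_pct(base: float, finetuned: float) -> float:
    """Signed change of the fine-tuned value relative to the base value."""
    return (finetuned - base) / base * 100.0


base = {"perplexity": 7.033111572265625, "message_length_avg": 671.0}
finetuned = {"perplexity": 4.704128742218018, "message_length_avg": 125.1}

ppl_delta = perplexity_improvement_pct(base["perplexity"], finetuned["perplexity"])
len_delta = relative_change_pct(base["message_length_avg"],
                                finetuned["message_length_avg"])
print(f"perplexity improvement: {ppl_delta:.2f}%")
print(f"message length change: {len_delta:.2f}%")
```

Read this way, the fine-tuned model's perplexity is about a third lower than the base model's, and its average message is about 81% shorter.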

Inference UI

The chat interface is set up using Gradio components and deployed to a shareable link for interacting with the fine-tuned model. Specifying --port runs the server on localhost, and the port can be forwarded using ssh -L to access the web interface.

  • Model metrics including TTFT (time to first token) and total time to response were chosen as indicators of model performance.
  • Since LoRA adapters are applied, GPU utilization is considerably lower for generation.
  • Currently --finetune is a required flag since the UI is only compatible with chat interactions using the LoRA-applied fine-tuned checkpoints or final model. To generate samples on the base model, use the evaluator module instead.
  • The inference batch job is currently deployed on a dedicated worker-4 instance, allowing dual L40s GPUs to serve the LoRA fine-tuned model from a checkpoint or the final bin.
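
The two latency metrics named above can be captured by timing a token stream, as in this minimal sketch. It does not reflect how the inference module is implemented: measure_stream and fake_stream are hypothetical names, and the fake generator stands in for a real model's streaming output.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[str, float, float]:
    """Consume a token stream and return (text, ttft_seconds, total_seconds)."""
    start = time.perf_counter()
    ttft = None
    pieces = []
    for token in tokens:
        if ttft is None:
            # Time to first token: delay until the first piece arrives.
            ttft = time.perf_counter() - start
        pieces.append(token)
    total = time.perf_counter() - start
    if ttft is None:  # empty stream: no first token was produced
        ttft = total
    return "".join(pieces), ttft, total


def fake_stream() -> Iterator[str]:
    """Stand-in for a model's streaming generator (hypothetical)."""
    for token in ["Hello", ",", " world"]:
        time.sleep(0.01)  # simulated per-token latency
        yield token


text, ttft, total = measure_stream(fake_stream())
print(f"{text!r}: TTFT {ttft * 1000:.1f} ms, total {total * 1000:.1f} ms")
```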

Inference using Worker 4 running Gradio server


Running on Slurm

Modules should be run using their accompanying Slurm batch shell scripts, which submit them into the queue for efficient resource allocation:

$ ssh root@login-0

# Path where codesharing is performed
$ cd /home/shared/finetune_code   

# Submit for running on worker-[*] nodes
$ sbatch jobs/<script_name>
$ squeue

# Jobs can also be queued using a nested CUDA container.
# The command is quoted so that both steps run inside the container.
$ srun --container-image="nvidia/cuda:12.4.1-base-ubuntu22.04" \
    bash -c "python3 ./modules/<script.py> && nvidia-smi"

View more details about running jobs in the Documentation folder.
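
When job submission needs to be scripted, for example to locate the matching evaluator_[job_number].out file afterwards, a thin Python wrapper around sbatch is one option. The sketch below assumes it runs from the shared code directory shown above and that sbatch prints its usual "Submitted batch job <id>" line; the helper names are hypothetical and not part of this repository.

```python
import re
import subprocess
from typing import List


def build_sbatch_command(script_name: str) -> List[str]:
    """Build the sbatch invocation for a script in the jobs/ folder."""
    return ["sbatch", f"jobs/{script_name}"]


def parse_job_id(sbatch_stdout: str) -> int:
    """Extract the numeric job ID from sbatch's confirmation line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"Unexpected sbatch output: {sbatch_stdout!r}")
    return int(match.group(1))


def evaluator_log_name(job_id: int) -> str:
    """Name of the evaluator STDOUT file for a given job number."""
    return f"evaluator_{job_id}.out"


def submit(script_name: str) -> int:
    """Submit a job and return its ID (requires sbatch on PATH)."""
    result = subprocess.run(build_sbatch_command(script_name),
                            capture_output=True, text=True, check=True)
    return parse_job_id(result.stdout)
```

On the login node, submit("<script_name>") returns the job ID, which evaluator_log_name() turns into the log file name to inspect or pipe into jq.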


Deployment

To use this scenario you will need to set up the Slurm operator (Soperator), which manages lifecycle state on a Kubernetes cluster. Begin the deployment with the following commands.

# git submodule version control not applied so changes are not updated automatically

$ cd nebius-solutions-library/soperator/installations/default

# Required to auth and export environment variables 
$ source .envrc

# Performing a new deployment will create tfstate bucket and setup dependencies
$ terraform init 
$ terraform plan
$ terraform apply

The configuration included here has known-good defaults for NVIDIA L40s GPUs and is deployable within capacity constraints on up to 5 worker nodes. Each GPU has 48GB of host-addressable memory, but InfiniBand is not supported between nodes.

  • Jails are created to bind-mount high-throughput network SSD filesystems for storing the dataset, fine-tuning checkpoints, and raw model safetensors.

  • The home directory, typically used for Linux userspace, contains a shared subfolder that lets multiple users share code.

  • Ephemeral ramdisk: localizing data on each node can speed up CPU-GPU copy operations, at the cost of resources left for workloads. It is mounted by default on /mnt/memory but is not currently utilized. Model safetensors can be copied into this path for faster cold start times on both inference and training modules.

  • Object Storage: Version-controlled copies of the CONFIG files used for training, inference, and eval (setting different parameters and directory paths) are handy, and placing them in a Nebius Storage Bucket expands our ability to implement a complete MLOps pipeline.

  File system usage:
    Size Use% FSType   Directory
    256G   8% virtiofs /
  246.9G   4% ext4     |-/tmp
    256G   3% virtiofs |-/home
     12G   0% tmpfs    |-/mnt/memory
      1T   1% virtiofs |-/mnt/data
      1T   0% virtiofs |-/mnt/chkpt
      1T  10% virtiofs |-/mnt/model

Total Cost of Ownership (TCO)

Right-sizing was applied to each instance type; for details refer to terraform.tfvars. Adopting this solution will incur costs on Nebius AI Cloud, but resources are optimized for the workload itself: running distributed GPU training or inference allocates only the compute types, storage, and cluster infrastructure needed.

Here is a cost breakdown analysis. Note that pricing was obtained from the Nebius website when this presentation was created.

  • Pricing for 1x NVIDIA L40s GPU is $1.35/hr on EPYC Genoa base
  • AMD platform costs $0.01 per vCPU hourly on GPU instances
  • Non-GPU EPYC platform is $0.012 per vCPU hourly
  • RAM price is a consistent $0.0032 per GiB hourly
  • Network SSD costs $0.071 per GiB monthly
  • Non-replicated SSD is $0.053 per GiB monthly
  • Shared Filesystem used on jail mounts is $0.08 per GiB monthly
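
The per-node hourly prices used in the breakdown can be derived from these unit rates. The sketch below assumes the RAM rate is charged per GiB and uses 730 hours per month; both are inferred from the totals in the breakdown rather than stated by the pricing source, and the function name is hypothetical.

```python
GPU_HOURLY_USD = 1.35             # per L40s GPU
VCPU_HOURLY_GPU_PLATFORM = 0.01   # per vCPU on GPU instances
VCPU_HOURLY_CPU_PLATFORM = 0.012  # per vCPU on non-GPU instances
RAM_GIB_HOURLY = 0.0032           # per GiB (assumed unit)
HOURS_PER_MONTH = 730             # assumption inferred from the monthly totals


def node_hourly_usd(gpus: int, vcpus: int, ram_gib: int) -> float:
    """Hourly price of one node from its GPU, vCPU, and RAM counts."""
    vcpu_rate = VCPU_HOURLY_GPU_PLATFORM if gpus > 0 else VCPU_HOURLY_CPU_PLATFORM
    return gpus * GPU_HOURLY_USD + vcpus * vcpu_rate + ram_gib * RAM_GIB_HOURLY


worker = node_hourly_usd(gpus=2, vcpus=64, ram_gib=384)  # 2gpu-64vcpu-384gb
system = node_hourly_usd(gpus=0, vcpus=8, ram_gib=32)    # 8vcpu-32gb
small = node_hourly_usd(gpus=0, vcpus=4, ram_gib=16)     # 4vcpu-16gb

print(f"worker node: ${worker:.2f}/hour")
print(f"8 vCPU node: ${system:.2f}/hour")
print(f"4 vCPU node: ${small:.2f}/hour")
print(f"5 worker nodes: ~${5 * round(worker, 2) * HOURS_PER_MONTH:,.0f}/month")
```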

(Converted from Spreadsheet)

| Category | Resource | Size | Price | Total Monthly Cost |
|---|---|---|---|---|
| Compute | Worker Nodes (gpu-l40s-d, 2gpu-64vcpu-384gb) | 5 nodes | $4.57/node/hour | $16,680 |
| Compute | System Nodes (cpu-d3, 8vcpu-32gb) | 3 to 9 nodes | $0.20/node/hour | $438 to $1314 |
| Compute | Controller Nodes (cpu-d3, 4vcpu-16gb) | 2 nodes | $0.10/node/hour | $146 |
| Compute | Login Node (cpu-d3, 4vcpu-16gb) | 1 node | $0.10/node/hour | $73 |
| Compute | Accounting Node (cpu-d3, 8vcpu-32gb) | 1 node | $0.20/node/hour | $146 |
| Boot Disks | Worker Nodes (558 GiB NETWORK_SSD_NON_REPLICATED) | 5 nodes x 558 GiB | $0.053/GiB/month | $148 |
| Boot Disks | System Nodes (128 GiB NETWORK_SSD) | 3 nodes to 9 nodes x 128 GiB | $0.071/GiB/month | $27 to $81 |
| Boot Disks | Controller Nodes (128 GiB NETWORK_SSD) | 2 nodes x 128 GiB | $0.071/GiB/month | $18 |
| Boot Disks | Login Node (256 GiB NETWORK_SSD) | 1 node x 256 GiB | $0.071/GiB/month | $18 |
| Boot Disks | Accounting Node (128 GiB NETWORK_SSD) | 1 node x 128 GiB | $0.071/GiB/month | $9 |
| Shared FS | slurm-jail filesystem | 256 GiB | $0.08/GiB/month | $20 |
| Shared FS | data submount | 1024 GiB | $0.08/GiB/month | $82 |
| Shared FS | chkpt submount | 1024 GiB | $0.08/GiB/month | $82 |
| Shared FS | model submount | 1024 GiB | $0.08/GiB/month | $82 |
| Shared FS | home submount | 256 GiB | $0.08/GiB/month | $20 |
| Shared FS | accounting filesystem | 256 GiB | $0.08/GiB/month | $20 |
| Shared FS | controller_spool filesystem | 128 GiB | $0.08/GiB/month | $10 |
| Object Storage | Daily jail backups | 7 x 1024 GiB | $0.0147/GiB/month | $105 |
| Object Storage | JSON Configs Version Controlled | < 1 GiB | $0.0147/GiB/month | $0 |
| Networking | Public IP Address (Login Node) | 1 | Free | $0.00 |
| Total Monthly Cost | | | | $18,124 to $19,054 |

Now that we know the project will cost around $19K monthly during the development and piloting phase with selected end users, estimating how much inference we'll need to perform is crucial to realizing ROI.

  • On the current solution with the LoRA fine-tuned model, output speed is around 3.8 tokens/sec on single-batch inference and 60 tokens/sec with the batch size increased to 16. The base model performs faster since it's packaged as safetensors, peaking closer to 200 tokens/sec on the current L40s setup.
  • The assumption is that each chat relies on an 8K context window but 2K of output generation, typical of a customer service application even in the financial industry domain.
  • This setup also assumes continuous post-training (fine-tuning) is applied, keeping 4 worker nodes running to improve the model based on fresh datasets.

| Scenario | Tokens/Sec | Tokens/Day | Conversations/Day | Est Cost/Chat |
|---|---|---|---|---|
| Single Batch Finetuned | 3.8 | 328,320 | 164 | $3.86 |
| Batch Size 16 Finetuned | 60 | 5,184,000 | 2,592 | $0.24 |
| Optimized Target | 200 | 17,280,000 | 8,640 | $0.07 |
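
The figures in the table can be reproduced with a few lines of arithmetic. This sketch assumes 2,000 output tokens per conversation (the 2K figure above), a flat $19,000 monthly cost, and a 30-day month; the last two are inferred from the table values rather than stated in the original spreadsheet, and the function name is hypothetical.

```python
from typing import Tuple

SECONDS_PER_DAY = 86_400
OUTPUT_TOKENS_PER_CHAT = 2_000  # 2K of output generation per chat
MONTHLY_COST_USD = 19_000       # assumption: rounded total monthly cost
DAYS_PER_MONTH = 30             # assumption


def chat_economics(tokens_per_sec: float) -> Tuple[int, int, float]:
    """Return (tokens/day, conversations/day, estimated cost per chat in USD)."""
    tokens_per_day = tokens_per_sec * SECONDS_PER_DAY
    chats_per_day = int(tokens_per_day // OUTPUT_TOKENS_PER_CHAT)
    cost_per_chat = MONTHLY_COST_USD / DAYS_PER_MONTH / chats_per_day
    return round(tokens_per_day), chats_per_day, round(cost_per_chat, 2)


for label, rate in [("Single Batch Finetuned", 3.8),
                    ("Batch Size 16 Finetuned", 60),
                    ("Optimized Target", 200)]:
    tokens, chats, cost = chat_economics(rate)
    print(f"{label}: {tokens:,} tokens/day, {chats:,} chats/day, ${cost:.2f}/chat")
```

Because the monthly cost is treated as fixed, cost per chat falls in direct proportion to throughput, which is why batching has such a large effect on the estimate.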

Further discussions with business stakeholders would highlight the ROI potential of performing inference using a fine-tuned model, including better alignment to industry-specific terminology, reduced hallucinations, and long-term potential.


License

This project was created for Nebius Demo Day presentation only.

About

Slurm on Kubernetes Architecture Solution for Fine-tuning LLMs, Inference, and Eval across distributed NVIDIA L40s X8 GPU Cluster on Nebius AI Cloud
