Let's develop a scalable client-facing architecture based around Soperator (a Kubernetes operator for Slurm) deployed onto an mk8s cluster within Nebius AI Cloud. Hardware resources, Terraform configuration, and model training code are included in this workflow.
This project was created for the Nebius Demo Day presentation.
- Platform Overview
- Scenario
- Process Overview
- Running on Slurm
- Deployment
- Total Cost of Ownership (TCO)
Kubernetes is a solid way of orchestrating containerized workloads, but it is commonly used with preemptible compute, short-lived pods, and ephemeral storage. That's where Slurm comes into play, enabling long-running batch queueing, scheduling, and resource partition allocation between workloads.
Nebius Soperator manages lifecycle state using the Kubernetes operator convention and CRDs, which improves the reliability and replicability of deployments and reduces their complexity. Running Slurm on Kubernetes clusters combines the benefits of both approaches and is a proven architecture in the field, so end users can work with HPC paradigms within a cloud-native Kubernetes environment.
By following this solutions architecture, you'll be able to:
- Deploy a Soperator mk8s cluster
- Manage E2E (end-to-end) state
- Perform NVIDIA GPU readiness checks
- Fine-tune Large Language Models (LLMs) using PyTorch
- Scale to distributed training with torchrun, DDP, and NCCL
- Apply inference
- Evaluate and compare model performance
This approach scales out to multiple worker nodes that share a filesystem, discussed below.
This solution implements several key components:
- Distributed fine-tuning using transformers, PyTorch, NCCL, ... launched with a Slurm batch job
- Inference server meant to be containerized or run standalone on a dedicated worker node
- Evaluation metrics run after training as a Slurm batch job
- JSON config files for creating reproducible training results (MLOps)
- Terraform Soperator mk8s deployment
A financial services institution called "InnovationBank" wants to develop agents for customer interactions and fine-tune an LLM on its domain-specific dataset. This will enable better customer interactions, reduce costs, and optimize for compliance in a regulated industry.
By applying fine-tuning we can reduce model token generation time, improve accuracy in this domain-specific context, and enable integration within business applications. Since Qwen3 was selected, fine-tuning also bypasses its reasoning step in favor of direct responses, allowing for faster inference. Model serving can be extended through nested containerization using an API server or web-based interface.
Note: More details will be discussed during Demo Day Presentation.
First, a model is selected from HuggingFace to meet both customer requirements and capacity limits. Qwen3-8B to Qwen3-32B are common models with reasoning capability and can be trained on L40s (48GB) by adjusting batch_size parameters. The 14B variant was selected because it balances training speed and accuracy and fits on the available GPUs. Other key factors that were considered are:
- Safeguards built into the pre-trained LLM
- Trainability, which indicates whether fine-tuning can be applied
- License that permits use in a commercial application
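
Whether a given model size fits on a 48GB L40s can be roughed out from the weight footprint alone. The sketch below is a back-of-the-envelope estimate, not the sizing method used for this project: `weight_memory_gb` is a hypothetical helper, the parameter counts are the nominal ones implied by the model names, and the estimate ignores activations, LoRA adapter weights, optimizer state, and the KV cache, all of which add to the real requirement.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    # Nominal parameter counts taken from the model names (approximate).
    for name, params in [("Qwen3-8B", 8e9), ("Qwen3-14B", 14e9), ("Qwen3-32B", 32e9)]:
        fp16 = weight_memory_gb(params, 16)
        int4 = weight_memory_gb(params, 4)
        print(f"{name}: ~{fp16:.0f} GB at 16-bit, ~{int4:.0f} GB at 4-bit")
```

Under these assumptions the 14B weights need roughly 28 GB at 16-bit and 7 GB at 4-bit, which leaves headroom on a 48GB card for everything the estimate leaves out.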
A synthetic dataset can be generated using the same model; however, it would consume resources, delay training, and require careful human-in-the-loop curation. Open source datasets are limited, so Financial-Instruct-500k was selected as it contains non-reasoning conversations between users and AI agents that meet the client's industry criteria.
Since the JSONL format is used, the dataset must first be converted and then split into train and test files. The scripts for this are located inside the modules folder:
- `dataset_convert.py` transforms parquet to JSONL.
- `dataset_split.py` uses a ratio to split an input JSONL file into _train and _test datasets in the same folder. Setting rand_sort will randomly shuffle the order of the dataset.
- `dataset_tokenize.py` converts JSONL strings of user and assistant conversations into HuggingFace tokenized format. It runs in parallel on the CPU after loading the tokenizer from the pre-trained LLM. Outputs contain input_ids, labels, and attention_mask.
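
The split step can be pictured with a small stand-alone sketch. This is not the code in `dataset_split.py`; `split_jsonl`, its `seed` parameter, and the default ratio are assumptions made for illustration, and only the `_train`/`_test` naming and the `rand_sort` idea come from the description above.

```python
import json
import random
import tempfile
from pathlib import Path


def split_jsonl(path: str, ratio: float = 0.9, rand_sort: bool = False, seed: int = 42):
    """Split a JSONL file into <name>_train and <name>_test files in the same folder."""
    src = Path(path)
    lines = [ln for ln in src.read_text(encoding="utf-8").splitlines() if ln.strip()]
    if rand_sort:
        # Seeded shuffle so the same split can be reproduced later (seed is hypothetical).
        random.Random(seed).shuffle(lines)
    cut = int(len(lines) * ratio)
    train_path = src.with_name(f"{src.stem}_train{src.suffix}")
    test_path = src.with_name(f"{src.stem}_test{src.suffix}")
    train_path.write_text("".join(ln + "\n" for ln in lines[:cut]), encoding="utf-8")
    test_path.write_text("".join(ln + "\n" for ln in lines[cut:]), encoding="utf-8")
    return train_path, test_path


if __name__ == "__main__":
    # Demo on a throwaway file with ten toy records.
    demo = Path(tempfile.mkdtemp()) / "demo.jsonl"
    demo.write_text("".join(json.dumps({"id": i}) + "\n" for i in range(10)), encoding="utf-8")
    train, test = split_jsonl(str(demo), ratio=0.8, rand_sort=True)
    print(train.name, test.name)
```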
The main finetune.py script implements the LoRA technique to efficiently adapt the pre-trained LLM to the financial domain. It handles distributed training across multiple GPUs and applies 4-bit quantization to reduce memory usage, balanced against maintaining model accuracy with float16 mixed precision. The module loads the raw LLM, prepares it for k-bit training, applies LoRA adapters, and trains on the pre-tokenized dataset.
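
To see why LoRA is cheap to train, it helps to count parameters. For a weight matrix of shape `d_out x d_in`, LoRA learns two small matrices, `A` of shape `(rank, d_in)` and `B` of shape `(d_out, rank)`, instead of updating the full matrix. The sketch below only does that arithmetic; the layer size and rank are illustrative values, not the settings used in finetune.py.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix (A plus B)."""
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the original dense weight matrix."""
    return d_in * d_out


if __name__ == "__main__":
    d_in = d_out = 4096  # illustrative layer size, not taken from Qwen3
    rank = 16            # illustrative LoRA rank
    added = lora_param_count(d_in, d_out, rank)
    share = added / full_param_count(d_in, d_out)
    print(f"LoRA adds {added} trainable parameters ({share:.2%} of the dense layer)")
```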
- Training arguments are read from a configuration file in JSON format (see section below). Parameters are applied with the `create_training_arguments()` function.
- `finetune_distributed.sh` uses the torchrun command with DDP to perform gradient syncs over shared Ethernet, as the current setup does not support InfiniBand. It assumes the model, dataset, and checkpoints are accessible from all nodes.
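
Reading training arguments from JSON can be sketched as follows. This is a minimal illustration and not the project's `create_training_arguments()`: the `TrainingConfig` fields, their defaults, and `load_training_config` are all hypothetical. The point it shows is rejecting unknown keys, so a typo in a config file fails loudly instead of silently falling back to a default.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class TrainingConfig:
    # Hypothetical fields and defaults, for illustration only.
    model_path: str = "/mnt/model"
    dataset_path: str = "/mnt/data"
    output_dir: str = "/mnt/chkpt"
    batch_size: int = 1
    learning_rate: float = 2e-4
    num_epochs: int = 1


def load_training_config(text: str) -> TrainingConfig:
    """Parse a JSON string into a TrainingConfig, rejecting unknown keys."""
    raw = json.loads(text)
    known = {f.name for f in fields(TrainingConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    return TrainingConfig(**raw)


if __name__ == "__main__":
    cfg = load_training_config('{"batch_size": 4, "num_epochs": 2}')
    print(cfg)
```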
The performance of the base model is compared to the fine-tuned version using perplexity and token generation speed. Other metrics like BLEU and ROUGE can also be implemented (todo). Evaluation first runs each model on the test dataset and then calculates the deltas.
- Outputs are written to `evaluator_[job_number].out` STDOUT in JSON format and can be piped into `jq` for filtering.
- Samples are generated from the `test` split as inputs to compare with generated responses.
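
For reference, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows that calculation and a percentage delta between two models. It is not the evaluator module's code: both function names are hypothetical, and the sign convention (positive means the fine-tuned value is lower) is inferred from the `comparison.perplexity` figure in the evaluator's example output.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def percent_reduction(base: float, finetuned: float) -> float:
    """Relative reduction of the fine-tuned value versus the base value, in percent."""
    return (base - finetuned) / base * 100.0


if __name__ == "__main__":
    # Toy losses: every token has probability 1/4, so perplexity is 4.
    print(perplexity([math.log(4.0)] * 3))
    # Perplexity values copied from the evaluator's example output.
    print(percent_reduction(7.033111572265625, 4.704128742218018))
```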
Example evaluator output:
{
"base_model": {
"perplexity": 7.033111572265625,
"message_length_avg": 671.0,
"sample_generations": [...]
},
"finetuned_model": {
"perplexity": 4.704128742218018,
"message_length_avg": 125.1,
"sample_generations": [
{
"input": "Question:\nWhat is the average cost of a portfolio on a trading site?",
"reference": "It sounds for the most part you are a 'buy and hold' type investor and continue to contribute monthly (truncated)",
"generated": "The average cost of a portfolio on a trading site like Questrade.com can vary depending on the number of trades and shares. However, Questrade's pricing is $0.01 per share with a minimum of $4.95 and a maximum of $9.95 per trade.\n\nFor example, if you make three trades a month, your annual cost would be:\n\n* 3 trades/month x $4.95/trade = $14.85/month (assuming < 495 shares/trade)\n* $14.85/month x 12 months = $178.20/year\n\nQuestrade has no management fees or other charges, allowing you to manage your accounts directly. This pricing structure can be cost-effective, especially for long-term investors who adopt a 'buy and hold' strategy."
}
]
},
"comparison": {
"perplexity": 33.11454405517637,
"speed": -83.46897662230256,
"message_length": -81.35618479880775
}
}
The chat interface is set up using Gradio components and deployed to a shareable link for interacting with the fine-tuned model. Specifying --port runs the server on localhost, which can be forwarded using ssh -L to access the web interface.
- Model metrics including TTFT (time to first token) and total response time were chosen as indicators of model performance.
- Since LoRA adapters are applied, GPU utilization is considerably less for generation.
- Currently `--finetune` is a required flag, since the UI is only compatible with chat interactions against the LoRA-applied fine-tune checkpoints or final model. To generate samples on the base model, use the evaluator module instead.
- The inference batch job is currently deployed on the dedicated `worker-4` instance, allowing dual L40s to serve the LoRA fine-tuned model from a checkpoint or the final bin.
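
TTFT and total response time can both be measured by timing a token stream. The sketch below assumes the model's output is available as a Python iterator of text pieces; `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` merely stands in for a real streaming generator.

```python
import time
from typing import Iterable, Iterator


def measure_stream(tokens: Iterable[str]):
    """Consume a token stream and return (text, ttft_seconds, total_seconds)."""
    start = time.perf_counter()
    ttft = None
    pieces = []
    for tok in tokens:
        if ttft is None:
            # Time from the request until the first token arrives.
            ttft = time.perf_counter() - start
        pieces.append(tok)
    total = time.perf_counter() - start
    return "".join(pieces), ttft, total


def fake_stream(n: int = 5, delay: float = 0.01) -> Iterator[str]:
    """Stand-in for a real streaming generator: yields n tokens with a fixed delay."""
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i} "


if __name__ == "__main__":
    text, ttft, total = measure_stream(fake_stream())
    print(f"TTFT {ttft:.3f}s, total {total:.3f}s, {len(text.split())} tokens")
```

If the stream yields nothing, `ttft` stays `None`, which callers should treat as a failed generation rather than a zero latency.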
Modules should be run using their accompanying Slurm batch shell scripts, which submit them to the queue for efficient resource allocation:
$ ssh root@login-0
# Path where codesharing is performed
$ cd /home/shared/finetune_code
# Submit for running on worker-[*] nodes
$ sbatch jobs/<script_name>
$ squeue
# Jobs can also be queued using nested CUDA container
$ srun --container-image="nvidia/cuda:12.4.1-base-ubuntu22.04" \
    bash -c "python3 ./modules/<script.py> && nvidia-smi"
View more details about running jobs in the Documentation folder.
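
When jobs are submitted from a script rather than by hand, the job ID returned by `sbatch` is usually what you want to capture for later `squeue` calls or log file names. The sketch below is one way to do that from Python. `submit` and `parse_job_id` are hypothetical helpers, not part of this repository, and `submit` only works on a host where `sbatch` is on the PATH (for example the login node).

```python
import re
import subprocess


def parse_job_id(sbatch_output: str) -> int:
    """Extract the job ID from sbatch output.

    Handles the default "Submitted batch job 123" message as well as the
    "123" and "123;cluster" forms printed with --parsable.
    """
    text = sbatch_output.strip()
    match = re.search(r"Submitted batch job (\d+)", text) or re.match(r"(\d+)", text)
    if not match:
        raise ValueError(f"Could not find a job ID in: {text!r}")
    return int(match.group(1))


def submit(script_path: str) -> int:
    """Submit a batch script with sbatch and return its job ID."""
    result = subprocess.run(
        ["sbatch", "--parsable", script_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_job_id(result.stdout)


if __name__ == "__main__":
    print(parse_job_id("Submitted batch job 4242"))
```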
To use this scenario you will need to set up Soperator, the Slurm operator that manages lifecycle state on a Kubernetes cluster. Begin the deployment with the following commands.
# git submodule version control not applied so changes are not updated automatically
$ cd nebius-solutions-library/soperator/installations/default
# Required to auth and export environment variables
$ source .envrc
# Performing a new deployment will create tfstate bucket and setup dependencies
$ terraform init
$ terraform plan
$ terraform apply
The configuration included here has known-good defaults for NVIDIA L40s GPUs, deployable within capacity constraints to 5x worker nodes. Each GPU has 48GB of host-addressable memory, but InfiniBand is not supported between nodes.
- Jails are created to bind mount high-throughput network SSD filesystems for storing the dataset, fine-tuning checkpoints, and raw model safetensors.
- The home directory, typically used for Linux userspace, contains a `shared` subfolder enabling multiple users to perform code sharing.
- An ephemeral ramdisk can speed up CPU-GPU copy operations by keeping data local to each node, at the cost of memory left for workloads. It is mounted by default on `/mnt/memory` but is not currently utilized. Model safetensors can be copied into this path for faster cold start times in both the inference and training modules.
- Object Storage: Version-controlled copies of the config files used for training, inference, and evaluation, which set different parameters and directory paths, are handy on their own, but placing them in a Nebius Storage Bucket expands our ability to implement a complete MLOps pipeline.
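
Copying model safetensors onto the ramdisk is only safe if they fit, since the ramdisk shares memory with the workloads. The sketch below checks free space before copying. `stage_to_fast_storage` and `dir_size_bytes` are hypothetical helpers rather than part of this project, and the paths in the usage note are only examples based on the mount points listed here.

```python
import shutil
import tempfile
from pathlib import Path


def dir_size_bytes(path: Path) -> int:
    """Total size of all regular files under path."""
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def stage_to_fast_storage(src: str, dst_root: str) -> Path:
    """Copy a model directory under dst_root if it fits; return the new path."""
    src_path, root = Path(src), Path(dst_root)
    needed = dir_size_bytes(src_path)
    free = shutil.disk_usage(root).free
    if needed > free:
        raise OSError(f"Need {needed} bytes but only {free} free under {root}")
    target = root / src_path.name
    shutil.copytree(src_path, target, dirs_exist_ok=True)
    return target


if __name__ == "__main__":
    # Demo with throwaway directories instead of real mounts.
    src = Path(tempfile.mkdtemp()) / "model"
    src.mkdir()
    (src / "weights.safetensors").write_bytes(b"\x00" * 1024)
    print(stage_to_fast_storage(str(src), tempfile.mkdtemp()))
```

On a worker node the call might look like `stage_to_fast_storage("/mnt/model/<model_dir>", "/mnt/memory")`, with `<model_dir>` replaced by the actual model folder.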
File system usage:
Size Use% FSType Directory
256G 8% virtiofs /
246.9G 4% ext4 |-/tmp
256G 3% virtiofs |-/home
12G 0% tmpfs |-/mnt/memory
1T 1% virtiofs |-/mnt/data
1T 0% virtiofs |-/mnt/chkpt
1T 10% virtiofs |-/mnt/model
Right-sizing was applied to each instance type; for details refer to terraform.tfvars. Adopting this solution will incur costs on Nebius AI Cloud, but resources are optimized to the workload itself, which means that running distributed GPU training or inference allocates only the compute types, storage, and cluster infrastructure needed.
Here is a cost breakdown analysis. Note that pricing was obtained from Nebius website when this presentation was created.
- Pricing for 1x NVIDIA L40s GPU is $1.35/hr on the EPYC Genoa base
- The AMD platform costs $0.01 per vCPU hourly on GPU instances
- The non-GPU EPYC platform is $0.012 per vCPU hourly
- RAM price is a consistent $0.0032 per GiB hourly
- Network SSD costs $0.071 per GiB monthly
- Non-replicated SSD is $0.053 per GiB monthly
- Shared Filesystem used on jail mounts is $0.08 per GiB monthly
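
The per-node hourly prices used in the cost table can be reproduced from these unit prices. The sketch below is a worked calculation, not an official pricing tool. It assumes the RAM rate applies per GiB and that a month has 730 hours, both of which match the figures in the table; `node_hourly` is a hypothetical helper.

```python
GPU_HOURLY = 1.35            # per L40s GPU
GPU_NODE_VCPU_HOURLY = 0.01  # per vCPU on GPU instances
CPU_NODE_VCPU_HOURLY = 0.012 # per vCPU on non-GPU instances
RAM_GIB_HOURLY = 0.0032      # assumed to be per GiB of RAM
HOURS_PER_MONTH = 730        # assumption: 8760 hours per year / 12


def node_hourly(gpus: int, vcpus: int, ram_gib: int) -> float:
    """Hourly price of one node built from the unit prices above."""
    vcpu_rate = GPU_NODE_VCPU_HOURLY if gpus else CPU_NODE_VCPU_HOURLY
    return gpus * GPU_HOURLY + vcpus * vcpu_rate + ram_gib * RAM_GIB_HOURLY


if __name__ == "__main__":
    worker = node_hourly(gpus=2, vcpus=64, ram_gib=384)
    system = node_hourly(gpus=0, vcpus=8, ram_gib=32)
    print(f"worker node: ${worker:.2f}/hour")
    print(f"system node: ${system:.2f}/hour")
    # Five worker nodes for a month, using the rounded hourly price.
    print(f"5 workers: ${5 * round(worker, 2) * HOURS_PER_MONTH:,.0f}/month")
```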
(Converted from Spreadsheet)
| Category | Resource | Size | Price | Total Monthly Cost |
|---|---|---|---|---|
| Compute | Worker Nodes (gpu-l40s-d, 2gpu-64vcpu-384gb) | 5 nodes | $4.57/node/hour | $16,680 |
| | System Nodes (cpu-d3, 8vcpu-32gb) | 3 to 9 nodes | $0.20/node/hour | $438 to $1,314 |
| | Controller Nodes (cpu-d3, 4vcpu-16gb) | 2 nodes | $0.10/node/hour | $146 |
| | Login Node (cpu-d3, 4vcpu-16gb) | 1 node | $0.10/node/hour | $73 |
| | Accounting Node (cpu-d3, 8vcpu-32gb) | 1 node | $0.20/node/hour | $146 |
| Boot Disks | Worker Nodes (558 GiB NETWORK_SSD_NON_REPLICATED) | 5 nodes x 558 GiB | $0.053/GiB/month | $148 |
| | System Nodes (128 GiB NETWORK_SSD) | 3 to 9 nodes x 128 GiB | $0.071/GiB/month | $27 to $81 |
| | Controller Nodes (128 GiB NETWORK_SSD) | 2 nodes x 128 GiB | $0.071/GiB/month | $18 |
| | Login Node (256 GiB NETWORK_SSD) | 1 node x 256 GiB | $0.071/GiB/month | $18 |
| | Accounting Node (128 GiB NETWORK_SSD) | 1 node x 128 GiB | $0.071/GiB/month | $9 |
| Shared FS | slurm-jail filesystem | 256 GiB | $0.08/GiB/month | $20 |
| | data submount | 1024 GiB | $0.08/GiB/month | $82 |
| | chkpt submount | 1024 GiB | $0.08/GiB/month | $82 |
| | model submount | 1024 GiB | $0.08/GiB/month | $82 |
| | home submount | 256 GiB | $0.08/GiB/month | $20 |
| | accounting filesystem | 256 GiB | $0.08/GiB/month | $20 |
| | controller_spool filesystem | 128 GiB | $0.08/GiB/month | $10 |
| Object Storage | Daily jail backups | 7 x 1024 GiB | $0.0147/GiB/month | $105 |
| | JSON Configs Version Controlled | < 1 GiB | $0.0147/GiB/month | $0 |
| Networking | Public IP Address (Login Node) | 1 | Free | $0.00 |
| Total Monthly Cost | | | | $18,124 to $19,054 |
Now that we know the project will cost around $19K monthly during the development and piloting phase with selected end users, estimating how much inference we'll need to perform is crucial to realizing ROI.
- On the current solution with the LoRA fine-tuned model, output speed is around `3.8 tokens/sec` on single batch inference and `60 tok/sec` with the batch size increased to 16. The base model does perform faster since it's packaged as safetensors, peaking closer to `200 tokens/sec` on the current L40s setup.
- The assumption is that each chat relies on an 8K context window but only 2K of output generation, typical of a customer service application even in the financial industry domain.
- This setup also assumes continuous post-training (fine-tuning) is applied, keeping 4 worker nodes running to improve the model based on fresh datasets.
| Scenario | Tokens/Sec | Tokens/Day | Conversations/Day | Est Cost/Chat |
|---|---|---|---|---|
| Single Batch Finetuned | 3.8 | 328,320 | 164 | $3.86 |
| Batch Size 16 Finetuned | 60 | 5,184,000 | 2,592 | $0.24 |
| Optimized Target | 200 | 17,280,000 | 8,640 | $0.07 |
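
The figures in this table follow from simple arithmetic, sketched below as a worked calculation. It assumes a monthly cost rounded to $19,000, a 30-day month, and 2,000 output tokens per conversation; these assumptions reproduce the table's values but are inferred from it rather than stated in the original spreadsheet. `cost_per_chat` is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400
OUTPUT_TOKENS_PER_CHAT = 2_000  # the 2K output assumption above
MONTHLY_COST = 19_000           # assumption: project cost rounded to $19K
DAYS_PER_MONTH = 30             # assumption


def cost_per_chat(tokens_per_sec: float):
    """Return (tokens_per_day, chats_per_day, cost_per_chat_usd)."""
    tokens_per_day = tokens_per_sec * SECONDS_PER_DAY
    chats_per_day = tokens_per_day / OUTPUT_TOKENS_PER_CHAT
    daily_cost = MONTHLY_COST / DAYS_PER_MONTH
    return tokens_per_day, chats_per_day, daily_cost / chats_per_day


if __name__ == "__main__":
    for label, tps in [("Single batch", 3.8), ("Batch size 16", 60), ("Optimized target", 200)]:
        tokens, chats, cost = cost_per_chat(tps)
        print(f"{label}: {tokens:,.0f} tokens/day, {int(chats):,} chats/day, ${cost:.2f}/chat")
```

The calculation treats the node as generating tokens around the clock, so the per-chat cost is a lower bound; real traffic with idle periods would cost more per conversation.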
Further discussions with business stakeholders would highlight the ROI potential of performing inference using a finetuned model including better alignment to industry specific terminology, reduced hallucinations, and long-term potential.
This project was created for Nebius Demo Day presentation only.



