Move to a pixi-based installation #1244
Conversation
@dxqb As promised, the +/- are in the description of the PR.
prefix-dev/pixi#1259 was resolved via prefix-dev/pixi#5241 and included in https://github.com/prefix-dev/pixi/releases/tag/v0.63.0
O-J1 left a comment:
This PR adds 440 LoC. Ignore the lock file; it's autogenerated and is 7.2k lines, and we don't manage or touch it.
Install, update, starting the UI, and training (UI & CLI) all work on Linux and Windows (and Mac where appropriate).
I didn't have to worry about which Python I had installed, about the install location or permissions, or about pixi already being installed, and we can automatically install signed binaries as of version v0.65.0.
Overall LGTM, and it worked. We still need some other users to test just to be absolutely sure, given its scope.
dxqb left a comment:
thanks for your patience
I've left many comments on the code. Here is only what didn't fit anywhere else.
I've tested this PR on a RunPod host. When I run install.sh, pixi isn't found:
This script will automatically download and install Pixi (latest) for you.
Getting it from this url: https://github.com/prefix-dev/pixi/releases/latest/download/pixi-x86_64-unknown-linux-musl.tar.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 28.6M 100 28.6M 0 0 25.4M 0 0:00:01 0:00:01 --:--:-- 25.4M
The 'pixi' binary is installed into '/root/.pixi/bin'
Updating '/root/.bashrc'
Please restart or source your shell.
[OneTrainer] + pixi install --locked -e cuda
./lib.include.sh: line 132: pixi: command not found
adding /root/.pixi/bin to PATH helps as a workaround
Install is fast then.
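A minimal sketch of that workaround, assuming pixi's default install location of `$HOME/.pixi/bin` (which matches the installer output above):

```shell
# Workaround sketch: make the freshly installed pixi binary resolvable
# in the current shell, without restarting or re-sourcing ~/.bashrc.
# "$HOME/.pixi/bin" is the install directory reported by the installer.
export PATH="$HOME/.pixi/bin:$PATH"

# Confirm the directory is now on PATH before continuing the install.
case ":$PATH:" in
    *":$HOME/.pixi/bin:"*) echo "pixi bin dir is on PATH" ;;
    *) echo "pixi bin dir missing from PATH" >&2 ;;
esac
```

The install scripts could do this themselves right after running the pixi installer, instead of relying on a shell restart.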
When I install OneTrainer through the cloud tab, it seems all of pixi's output is suppressed. This is the only output:
[OneTrainer] + pixi self-update
✔ pixi is already up-to-date (version 0.66.0)
[OneTrainer] + pixi install --locked -e cuda
✔ The cuda environment has been installed.
This could be a problem with a slow downstream remote host: you wait and nothing happens or shows progress.
Overall, this speeds up the installation process by a lot, so it is useful. However, it completely throws away a tried-and-proven installation process. Therefore I still think this should be an option.
It can be an option that we make the default very soon (or even from the beginning?). But at the very least, when a user complains on Discord that the new installation process fails for them, I want us to be able to answer: "run it with OT_PREFER_PIP" while we fix it.
It also removes the possibility to use conda. Are we sure this isn't needed anymore?
I haven't been using it so I can't answer, but I remember people asking if conda is supported etc.
)
def _start_tensorboard(self):
    tensorboard_executable = os.path.join(os.path.dirname(sys.executable), "tensorboard")
    tensorboard_executable = shutil.which("tensorboard")
why is this necessary?
shutil.which() returns the path to the executable found on PATH. But if tensorboard is on PATH anyway, why use a full path at all instead of just executing "tensorboard"?
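For context, a small sketch of the two lookup styles being discussed (the `logdir` argument and function name are illustrative, not taken from the PR):

```python
import shutil
import subprocess

def start_tensorboard(logdir):
    # shutil.which() searches PATH and returns an absolute path to the
    # executable, or None if it cannot be found. This makes the failure
    # mode explicit instead of surfacing deep inside Popen.
    tensorboard_executable = shutil.which("tensorboard")
    if tensorboard_executable is None:
        raise FileNotFoundError("tensorboard not found on PATH")
    # Passing ["tensorboard", ...] directly would trigger the same PATH
    # lookup inside Popen; resolving first only adds the early check.
    return subprocess.Popen([tensorboard_executable, "--logdir", logdir])
```

The practical difference is only the explicit `None` check; `subprocess.Popen(["tensorboard", ...])` resolves through PATH the same way.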
# docker tag <image-name> <dockerhub-username>/<repository-name>:<tag>
# docker push <dockerhub-username>/<repository-name>:<tag>
FROM runpod/pytorch:2.8.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04
RunPod and vast.ai both use base images that are in wide use on these platforms.
Because of that, these images are already in the local Docker image caches, and only the added layers (containing OneTrainer) have to be downloaded.
By switching to a generic Dockerfile based on Ubuntu, you lose that benefit.
Not using their image files probably breaks many other things, e.g. access to Jupyter, updating SSH keys through their web API, etc.
#region Platform Detection

function Get-Platform {
is this platform detection new or different, or was it done the same way before, just rewritten in PowerShell?
@@ -0,0 +1,29 @@
#To build, run
please see my comment about the NVIDIA Dockerfile first.
does this Dockerfile have a purpose beyond RunPod and Vast? Is it also meant to be run locally?
Note, for any modifications you make to the Dockerfiles:
the purpose is that OneTrainer can already be installed, in case you use a persistent disk volume on RunPod. It uses /workspace/OneTrainer if already installed; if not, it links to the OneTrainer in the image, which is at /OneTrainer.
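A hedged sketch of that startup logic as just described (the paths are from the comment above; the variable name `OT_DIR` is made up for illustration):

```shell
# Prefer a persistent install on the RunPod volume; otherwise fall back
# to the copy baked into the image and expose it at the same path.
if [ -d /workspace/OneTrainer ]; then
    OT_DIR=/workspace/OneTrainer        # persistent disk volume
else
    # Link the in-image copy so later tooling finds one canonical path.
    ln -sfn /OneTrainer /workspace/OneTrainer 2>/dev/null || true
    OT_DIR=/OneTrainer                  # copy shipped inside the image
fi
echo "Using OneTrainer at: $OT_DIR"
```

Whatever the Dockerfile changes end up being, this volume-or-image fallback is the behavior that needs to keep working.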
- `OT_PREFER_VENV`: If set to `true`, Conda will be ignored even if it exists on the system, and Python Venv will be used instead. This ensures that people who use `pyenv` (to choose which Python version to run on the host) can easily set up their desired Python Venv environments. Defaults to `false`.

- `OT_LAZY_UPDATES`: If set to `true`, OneTrainer's self-update process will only update the Python environment's dependencies if the OneTrainer source code has been modified since the previous dependency update. This speeds up executions of `update.sh`, and is generally safe, but may miss some updates and important bugfixes for external third-party dependencies. If you use this option, you must set it permanently for *every* script (not just `update.sh`). Defaults to `false`.
if LAZY_UPDATES is not supported and not necessary anymore, it should be removed from the cloud training code, where it is used.
However, please see my comments regarding keeping pip for now first.
    curl -fsSL https://pixi.sh/install.sh | sh
elif can_exec wget; then
    print_debug '`wget` found, installing `pixi` with `wget`.'
    wget -qO- https://pixi.sh/install.sh | sh
while there are many attack vectors in a Python app with many dependencies, I don't think we have so far trusted a host completely without any checks.
What prevents pixi.sh from serving malicious code randomly, say every 10000 requests?
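One way to address this concern, sketched under assumptions (the version pin and digest below are placeholders, not real release data), is to download a pinned release archive and verify its checksum instead of piping the installer script straight into `sh`:

```shell
# verify_sha256 FILE EXPECTED_DIGEST -> exit status 0 only on a match.
verify_sha256() {
    actual=$(sha256sum "$1" | awk '{print $1}')
    [ "$actual" = "$2" ]
}

# Illustrative flow (values are placeholders, taken from no real release):
#   PIXI_VERSION="vX.Y.Z"
#   EXPECTED_SHA256="<digest copied from the GitHub release page>"
#   curl -fsSL -o /tmp/pixi.tar.gz \
#     "https://github.com/prefix-dev/pixi/releases/download/${PIXI_VERSION}/pixi-x86_64-unknown-linux-musl.tar.gz"
#   verify_sha256 /tmp/pixi.tar.gz "$EXPECTED_SHA256" \
#     || { echo "checksum mismatch, aborting" >&2; exit 1; }
```

Pinning the version also means installs are reproducible instead of tracking whatever `latest` happens to be.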
"pytorch-lightning==2.6.1",

# diffusion models
"diffusers @ git+https://github.com/huggingface/diffusers.git@99daaa802da01ef4cff5141f4f3c0329a57fb591",
could you please check that this doesn't cause this issue again:
#1133
3. Navigate into the cloned directory `cd OneTrainer`
4. Perform the installation: `pixi install --locked -e cuda` (Replace `cuda` by `rocm` or `cpu` if needed).

**Note:** We don't support ROCm on Windows currently.
is that new?
Weren't there several users on Discord who successfully used ROCm? Were these all on Linux?
1. Install `pixi`: [Guide](https://pixi.prefix.dev/latest/installation/)
2. Clone the repository `git clone https://github.com/Nerogar/OneTrainer.git`
3. Navigate into the cloned directory `cd OneTrainer`
4. Perform the installation: `pixi install --locked -e cuda` (Replace `cuda` by `rocm` or `cpu` if needed).
manual installation with pip should remain possible.
Currently it fails:
INFO: pip is looking at multiple versions of onetrainer to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install -r requirements-global.txt (line 1) and diffusers 0.37.0.dev0 (from git+https://github.com/huggingface/diffusers.git@99daaa8#egg=diffusers) because these package versions have conflicting dependencies.
The conflict is caused by:
The user requested diffusers 0.37.0.dev0 (from git+https://github.com/huggingface/diffusers.git@99daaa8#egg=diffusers)
onetrainer 0.1.0 depends on diffusers 0.37.0.dev0 (from git+https://github.com/huggingface/diffusers.git@99daaa802da01ef4cff5141f4f3c0329a57fb591)
To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
Disadvantages:
- `source venv/bin/activate` is replaced by `eval "$(pixi -e [cpu|cuda|rocm] shell-hook)"` to activate inside an existing shell, or `pixi -e [cpu|cuda|rocm] shell` for a new, pre-activated shell. Or by prefixing each command with `pixi run -e [cuda|rocm|cpu]`.
- `pip list` is replaced by `pixi list`.

Advantages:
- With `pixi`, you can run `pixi run -e [cuda|rocm|cpu] <command>`; it will take care of installation, running the command from beginning to end.
- `pixi.lock`: works wonders for reproducibility. To update, just use `pixi update`.
- `libgl` and `tk` for Linux: no more need for `sudo`; the only hard dependency is either `wget` or `curl`, plus a GPU driver (no CUDA toolkit necessary either).

Decisions to make:
- `requirements*.txt`

ToDos for this PR:
- Install `pixi` and then do `pixi install -e ...` or `pixi run -e ...` (or we can get rid of the install script entirely).

After merging: