Add Crusoe Cloud backend #3602

Open
peterschmidt85 wants to merge 2 commits into master from add-crusoe-backend

@peterschmidt85
Contributor

Summary

  • Add a VM-based Crusoe Cloud backend supporting single-node and multi-node (cluster) provisioning with InfiniBand
  • Uses gpuhunt PR gpuhunt#211 (Add Crusoe Cloud provider) as the online provider for offers, with project quota filtering
  • Storage: persistent data disk for types without ephemeral NVMe (L40S, A40, c1a); auto-detects and RAID-0s NVMe for types with ephemeral storage (A100, H100, MI300X); moves containerd storage so containers get full disk space
  • Cluster support via IB partitions (tested with A100-SXM-IB)
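
The storage behavior in the summary can be sketched roughly as below. This is an illustrative sketch, not the code in this PR: the helper name `build_storage_commands`, the `/dev/md0` array path, the `/data` mount point, the ext4 filesystem, and the choice to mount a single NVMe device directly are all assumptions. Only the overall idea (stripe ephemeral NVMe devices with RAID-0, otherwise rely on the persistent data disk) comes from the description.

```python
from typing import List


def build_storage_commands(nvme_devices: List[str], mount_point: str = "/data") -> List[str]:
    """Return shell commands that prepare ephemeral NVMe storage (hypothetical helper).

    nvme_devices: ephemeral NVMe device paths detected on the VM. An empty list
    means the instance type has no ephemeral NVMe and uses a persistent data disk.
    """
    if not nvme_devices:
        # Nothing to assemble; the persistent data disk is handled elsewhere.
        return []

    commands: List[str] = []
    if len(nvme_devices) == 1:
        # Assumption: a single device is formatted and mounted directly.
        device = nvme_devices[0]
    else:
        # Stripe all devices into one RAID-0 array.
        device = "/dev/md0"
        commands.append(
            "mdadm --create {} --level=0 --raid-devices={} {}".format(
                device, len(nvme_devices), " ".join(nvme_devices)
            )
        )

    commands += [
        f"mkfs.ext4 -F {device}",
        f"mkdir -p {mount_point}",
        f"mount {device} {mount_point}",
    ]
    return commands
```

Under these assumptions, an 8x NVMe node would get one `mdadm` command followed by format and mount steps, while a type such as L40S would get no commands at all.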

Tested end-to-end

  • L40S: fleet, dev env, GPU, configurable disk (200GB), clean termination
  • A100-PCIe: fleet, dev env, GPU, NVMe auto-mount (880GB), clean termination
  • A100-SXM-IB cluster: IB partition created, 1 node provisioned with IB and 8x NVMe RAID-0 (7TB); 2nd node failed on capacity (out_of_stock)
  • Offers: quota enforcement, disk sizes correct per instance type
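
Quota enforcement on offers amounts to dropping offers the project cannot currently launch. A minimal sketch of that idea follows; the `Offer` dataclass, the `family` grouping key, and the shape of the quota mapping are hypothetical and do not reflect the gpuhunt or Crusoe data models.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Offer:
    instance_type: str  # placeholder name, not a real Crusoe type
    family: str         # hypothetical key used to look up quota
    gpu_count: int


def filter_by_quota(offers: List[Offer], available_gpus: Dict[str, int]) -> List[Offer]:
    """Keep offers whose GPU count fits the remaining quota of their family.

    Families missing from the mapping are treated as having zero quota.
    """
    return [o for o in offers if o.gpu_count <= available_gpus.get(o.family, 0)]
```

For example, with a remaining quota of 4 GPUs in one family, a 1-GPU offer in that family is kept while an 8-GPU offer and every offer from other families are dropped.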

Not tested (no capacity/quota)

  • H100-SXM-IB, MI300X-IB, MI355X-RoCE (no hardware available)
  • CPU-only instances c1a/s1a (no quota)
  • Spot provisioning (disabled in gpuhunt)
  • Full 2-node cluster with IB connectivity test

TODOs

  • Spot: disabled until Crusoe confirms how to request spot billing via the VM create API
  • gpuhunt: currently installed from the PR branch; switch to a pinned version after gpuhunt PR #211 is merged and released

Test plan

  • Lint and format pass
  • Unit tests pass (config parsing, backend type listing)
  • E2E: L40S single-node fleet + dev env + GPU + disk
  • E2E: A100-PCIe single-node fleet + dev env + GPU + NVMe
  • E2E: A100-SXM-IB cluster (IB partition + 1 node provisioned)
  • Clean termination verified (no leaked VMs/disks/IB partitions)
  • Full 2-node cluster IB connectivity (blocked on capacity)
  • Spot provisioning (blocked on API clarification)
  • CPU-only instances (blocked on quota)

AI Assistance: This PR was developed with AI assistance.

Made with Cursor

peterschmidt85 and others added 2 commits February 24, 2026 01:57
Add a VM-based Crusoe Cloud backend supporting single-node and
multi-node (cluster) provisioning with InfiniBand.

Key features:
- gpuhunt online provider for offers with project quota filtering
- HMAC-SHA256 authenticated REST API client
- Image selection based on GPU type (SXM/PCIe/ROCm/CPU)
- Storage: persistent data disk for types without ephemeral NVMe;
  auto-detects and RAID-0s NVMe for types with ephemeral storage;
  moves containerd storage so containers get the full disk space
- Cluster support via IB partitions
- Two-phase termination with data disk cleanup
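
For readers unfamiliar with HMAC-SHA256 request signing, the general technique looks roughly like this. The canonical string layout (path, query, method, timestamp joined by newlines), the URL-safe base64 encoding, and the function name `sign_request` are assumptions for illustration; the actual Crusoe API defines its own format, and this is not the client code from this PR.

```python
import base64
import hashlib
import hmac


def sign_request(secret_key: bytes, method: str, path: str, query: str, timestamp: str) -> str:
    """Sign a request with HMAC-SHA256 (illustrative layout, not the real API's)."""
    # Hypothetical canonical string; a real API specifies the exact fields and order.
    payload = "\n".join([path, query, method.upper(), timestamp])
    digest = hmac.new(secret_key, payload.encode("utf-8"), hashlib.sha256).digest()
    # URL-safe base64 without padding is one common encoding choice.
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
```

The resulting signature would typically be sent in a request header together with the access key ID and timestamp, so the server can recompute it and compare.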

Tested end-to-end:
- L40S: fleet, dev env, GPU, configurable disk (200GB), clean termination
- A100-PCIe: fleet, dev env, GPU, NVMe auto-mount (880GB), clean termination
- A100-SXM-IB cluster: IB partition created, 1 node provisioned with IB
  and 8x NVMe RAID-0 (7TB); 2nd node failed on capacity (out_of_stock)
- Offers: quota enforcement, disk sizes correct per instance type

Not tested (no capacity/quota):
- H100-SXM-IB, MI300X-IB, MI355X-RoCE (no hardware available)
- CPU-only instances c1a/s1a (no quota)
- Spot provisioning (disabled in gpuhunt, see TODO)
- Full 2-node cluster with IB connectivity test

TODOs:
- Spot: disabled until Crusoe confirms how to request spot billing
  via the VM create API endpoint
- gpuhunt dependency: currently installed from PR branch; switch to
  pinned version after gpuhunt PR #211 is merged and released

AI Assistance: This implementation was developed with AI assistance.

Co-authored-by: Cursor <cursoragent@cursor.com>