LIP6 GPU Cluster User Guide
This guide aims to provide new members of the LIP6 laboratory with a concise overview of how to connect to and utilize the lab’s computing resources (CPU cluster and GPU cluster). Whether you are accessing from the internal network or an external network, this document covers the necessary configuration steps.
References
- LIP6 Official Documentation (Intranet access only)
- LIP6 Cluster Documentation
- Guide by Dr. Giasemis
1. Credentials Preparation
Before starting, ensure you have the following credentials:
- LIP6 Username
- Secure Password: This is the password for logging into the lab’s Linux systems and computing resources, which is different from your email password. If you need a reset, please contact the system administrator (e.g., Manuel Bouyer).
2. Connecting to the Cluster
The connection method varies slightly depending on your network environment.
2.1 From Internal Network
If you are already connected to the LIP6 wired network or Wi-Fi, you can connect via SSH directly.
Connect to CPU Cluster
Used for data preprocessing, scripting, and lightweight calculations.
```bash
ssh your_username@cluster.lip6.fr
```
Connect to GPU Cluster Front-end
Used for submitting GPU jobs and managing environments. Note: The front-end node is NOT for running computational tasks.
```bash
ssh your_username@front.convergence.lip6.fr
```
2.2 From External Network
Outside the lab, you need to connect via a Gateway (Jump Host). Different research groups might use different gateways:
- SYEL / CIAN Groups: Typically use `barder.lip6.fr` or `ducas.lip6.fr`.
- Other Groups: Please consult your supervisor or admin. The general gateway is usually `ssh.lip6.fr`.
The following examples assume barder.lip6.fr.
Method A: SSH Jump (Recommended)
Use the `-J` flag to specify the gateway directly:
```bash
# Connect to CPU Cluster
ssh -J your_username@barder.lip6.fr your_username@cluster.lip6.fr

# Connect to GPU Cluster Front-end
ssh -J your_username@barder.lip6.fr your_username@front.convergence.lip6.fr
```
You will need to enter your password twice (once for the gateway, once for the target).
Method B: SSH Config (Efficient)
To avoid typing long commands every time, it is recommended to configure your ~/.ssh/config file. Edit (or create) this file on your local machine:
```
# Define Gateway
Host lip6-gateway
    HostName barder.lip6.fr
    User your_username

# CPU Cluster (reached through the gateway)
Host lip6-cluster
    HostName cluster.lip6.fr
    User your_username
    ProxyJump lip6-gateway

# GPU Cluster Front-end (reached through the gateway)
Host lip6-gpu
    HostName front.convergence.lip6.fr
    User your_username
    ProxyJump lip6-gateway
```
Once configured, simply run ssh lip6-cluster or ssh lip6-gpu to connect.
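These aliases also work with `scp` and `rsync`, which is handy for moving data through the gateway. For instance, copying a hypothetical archive `data.tar` to your home directory on the GPU front-end:
```bash
scp data.tar lip6-gpu:~/
```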
2.3 Passwordless Login (Optional)
To skip typing passwords frequently, set up SSH key-based authentication.
- Generate Key Pair (skip if you already have one):

  ```bash
  ssh-keygen -t ed25519 -C "your_email@example.com"
  ```
- Copy Public Key to Gateway and Targets:

  ```bash
  ssh-copy-id -i ~/.ssh/id_ed25519.pub lip6-gateway
  ssh-copy-id -i ~/.ssh/id_ed25519.pub lip6-cluster
  ssh-copy-id -i ~/.ssh/id_ed25519.pub lip6-gpu
  ```
After these steps, you should be able to log in without a password.
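To verify that key-based login works, you can force SSH to fail rather than fall back to a password prompt (using the `lip6-gpu` alias from Section 2.2):
```bash
# Succeeds silently if the key is accepted; errors out instead of asking for a password
ssh -o PasswordAuthentication=no lip6-gpu hostname
```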
3. Using the GPU Cluster (Convergence)
After logging into front.convergence.lip6.fr, you are on the front-end node of the cluster. Never run heavy computational tasks on the front-end node. You must use the Slurm Workload Manager to request compute nodes.
3.1 Common Slurm Commands
- `sinfo`: View partition and node states.
- `squeue`: View currently running and queued jobs.
- `squeue -u your_username`: View only your jobs.
- `scancel [job_id]`: Cancel a job.
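For example (the job ID `12345` below is just a placeholder):
```bash
sinfo                     # partitions and node states
squeue -u your_username   # only your jobs
scancel 12345             # cancel job 12345
```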
3.2 Requesting GPU Resources
Interactive Mode
Suitable for code debugging and environment exploration.
Option 1: Request a Shell (srun)
Request 1 node, 1 GPU, for 30 minutes:
```bash
srun --partition=convergence --nodes=1 --gpus=1 --time=00:30:00 --pty bash -i
```
Option 2: Two-step Allocation (salloc)
Reserve resources first, then connect.
```bash
# 1. Allocate resources
salloc --partition=convergence --nodes=1 --gpus=1 --time=00:30:00

# 2. Connect to the allocated node (replace nodeXX with the node you were assigned)
ssh nodeXX
```
After successful entry, the command prompt will change to your_username@nodeXX, indicating you now have access to that node’s GPU.
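Once on the compute node, you can quickly confirm that the allocated GPU is visible:
```bash
nvidia-smi   # shows the GPU(s) assigned to your job, driver and CUDA versions
```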
Batch Mode
Suitable for long-running training tasks. Write a submission script (e.g., `job.sh`); a minimal sketch, to be adapted to your task:
```bash
#!/bin/bash
#SBATCH --job-name=my_job          # Job name shown in squeue
#SBATCH --partition=convergence    # Partition to submit to
#SBATCH --nodes=1                  # Number of nodes
#SBATCH --gpus=1                   # Number of GPUs
#SBATCH --time=12:00:00            # Time limit (HH:MM:SS)
#SBATCH --output=job_%j.log        # Log file (%j expands to the job ID)

# Your actual workload, for example:
python train.py
```
Submit the job:
```bash
sbatch job.sh
```
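After submission, you can track the job with the commands from Section 3.1 and follow its log once it starts (`12345` stands in for your job ID; the log name matches the `--output` pattern in the sketch above):
```bash
squeue -u your_username   # check the job's state
tail -f job_12345.log     # follow the output as it is written
```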
3.3 Hardware Overview
The LIP6 Convergence cluster consists of 1 frontend and 10 compute nodes:
| Node Name | CPU | Memory | GPU Configuration |
|---|---|---|---|
| node01 | 2× AMD EPYC 7543 | 2 TB | 4× NVIDIA A100 80GB (SXM) |
| node02-06 | 2× Intel Xeon Gold 6330 | 2 TB | 4× NVIDIA A100 80GB (PCIe) |
| node07-10 | 2× Intel Xeon Gold 6330 | 1 TB | MIG Mode (For smaller tasks) |
Tip: `node01` is the most powerful and suitable for large-scale training; `node07-10` have MIG (Multi-Instance GPU) enabled, splitting each A100 into two 40GB instances. Specify the type when requesting resources (see the example below).
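The exact GPU type strings that Slurm advertises are site-specific, so the request below is an assumption to adapt: first list what each partition exposes, then request by type.
```bash
# List the GRES (GPU) types advertised per partition
sinfo -o "%P %G"

# Hypothetical example: request one full (non-MIG) A100 by type
srun --partition=convergence --gpus=a100:1 --time=00:30:00 --pty bash -i
```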
Hope this guide helps you get started quickly. Happy coding!