The ICMP/LQM BeoWulf Cluster
- Compute nodes
- Batch job queue
on the EPFL VPN or the EPFL local network).
1. Compute nodes
- The nodes are loosely connected through a Gigabit LAN.
- The nodes have no hard drives and therefore run entirely from RAM.
- The nodes have access to permanent storage through an NFS server (see file-systems below).
- Each node has a 4-core Intel Core i5-3350P CPU, for 384 cores in total across the 96 nodes.
- The nodes qwf001 and qwf0[05-14] have 4GB of memory.
- The nodes qwf0[02-04] and qwf0[15-96] have 8GB of memory.
- The network must be used as little as possible.
- I/O operations should not be performed concurrently by a large number of processes over the network.
- Files written to the node-local memory (in RAM) must be kept small so as not to saturate the compute node's memory.
- Large files may be written through the network on the NFS shares.
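One way to follow these rules is to keep small intermediate files in node-local memory during the run and move the results to the NFS share in a single transfer at the end. The sketch below illustrates that idea; the function name `archive_results` and the example paths are hypothetical, and packing the results with tar is a suggestion rather than a procedure prescribed for this cluster.

```shell
# Hypothetical helper: pack everything under a node-local directory into a
# single tar archive on a shared directory, so that the NFS server receives
# one sequential write instead of many small ones.
# Usage: archive_results LOCAL_DIR SHARED_DIR ARCHIVE_NAME
archive_results() {
    ar_local_dir=$1
    ar_shared_dir=$2
    ar_name=$3
    mkdir -p "$ar_shared_dir" || return 1
    tar -cf "$ar_shared_dir/$ar_name.tar" -C "$ar_local_dir" . || return 1
    # Print the path of the archive that was written.
    echo "$ar_shared_dir/$ar_name.tar"
}

# Example call at the end of a job (both paths are assumptions):
# archive_results "/scratch/$USER/run42" "/data/$USER/run42" "$(hostname)"
```

Naming the archive after the host keeps the nodes from overwriting each other's results. If many nodes finish at the same moment, staggering the calls would spread the load on the NFS server further.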
2. File-systems
- /home: This is a permanent storage file-system hosted on lqmmaster. The nodes have access to it as an NFS share. It is meant for small files and binaries (typically your scripts, source code and programs). There is no backup.
- /data: This is a large permanent storage file-system hosted on lqmmaster. The nodes have access to it as an NFS share. It is meant to store the data files resulting from the calculations. There is no backup.
- /scratch: This is a local file-system present both on the nodes and on lqmmaster. On the nodes, this storage resides in RAM and is therefore erased when a node halts or reboots. The content of lqmmaster:/scratch may be pushed to the nodes' qwf#:/scratch using « qwf-sync », while the content of the nodes' qwf#:/scratch may be retrieved to permanent storage on lqmmaster using « qwf-sync-back » (see Utilities below). This is a very fast file-system, where you should put your input files and/or write the (small) output files.
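The push and pull steps described for /scratch can be sketched as two small shell functions meant to be run on lqmmaster. The function names `stage_inputs` and `collect_outputs` and the `SCRATCH_DIR` override are hypothetical. Only the option-free forms of « qwf-sync » and « qwf-sync-back » are used, and the sketch relies on the documented behaviour that « qwf-sync-back » writes into the current directory.

```shell
# Hypothetical helper: copy input files into your scratch folder on
# lqmmaster, then push that folder to the compute nodes.
# SCRATCH_DIR is an assumed override; it defaults to /scratch/$USER.
stage_inputs() {
    si_input_dir=$1
    si_scratch_dir=${SCRATCH_DIR:-/scratch/$USER}
    mkdir -p "$si_scratch_dir" || return 1
    # "dir/." copies the contents of the directory, hidden files included.
    cp -R "$si_input_dir"/. "$si_scratch_dir"/ || return 1
    qwf-sync
}

# Hypothetical helper: pull the nodes' scratch contents into a directory on
# permanent storage.
collect_outputs() {
    co_results_dir=$1
    mkdir -p "$co_results_dir" || return 1
    # qwf-sync-back writes into the current directory, so change into the
    # destination first; the subshell leaves the caller's directory unchanged.
    ( cd "$co_results_dir" && qwf-sync-back )
}

# Example (paths are assumptions):
# stage_inputs "$HOME/myjob/inputs"
# ... run the job ...
# collect_outputs "/data/$USER/myjob/results"
```

Because the nodes' /scratch lives in RAM, collecting the outputs before a node halts or reboots is the only way to keep them.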
3. Batch job queues
- qwfall: Contains all the nodes and has a maximum runtime of 24 hours.
- qwfall-long: Contains all the nodes and has a maximum runtime of 7 days.
- qwfhm: Contains the 85 nodes which have 8GB of memory, qwf0[02-04] and qwf0[15-96]. Maximum runtime of 24 hours.
- qwfhm-long: Contains the 85 nodes which have 8GB of memory, qwf0[02-04] and qwf0[15-96]. Maximum runtime of 7 days.
- qwflm: Contains the 11 nodes which have 4GB of memory, qwf001 and qwf0[05-14]. Maximum runtime of 24 hours.
- qwflm-long: Contains the 11 nodes which have 4GB of memory, qwf001 and qwf0[05-14]. Maximum runtime of 7 days.
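The queue list above amounts to a small decision table on two quantities: the memory a job needs per node and its expected runtime. The sketch below encodes that table; the function name `pick_queue` is hypothetical, and the 4 GB and 8 GB thresholds are the physical memory sizes, so the usable amount is lower in practice because the system and /scratch also live in RAM. No submission command is shown, since the batch system is not named here.

```shell
# Hypothetical helper: choose a queue name from the memory needed per node
# (whole GB) and the expected runtime (whole hours).
# Usage: pick_queue MEM_GB HOURS
pick_queue() {
    pq_mem_gb=$1
    pq_hours=$2
    # No node has more than 8 GB and no queue runs longer than 7 days (168 h).
    if [ "$pq_mem_gb" -gt 8 ] || [ "$pq_hours" -gt 168 ]; then
        echo "no queue fits ${pq_mem_gb} GB for ${pq_hours} h" >&2
        return 1
    fi
    # Jobs that need more than 4 GB must run on the 8 GB nodes (qwfhm);
    # smaller jobs can run anywhere (qwfall).
    if [ "$pq_mem_gb" -gt 4 ]; then
        pq_queue=qwfhm
    else
        pq_queue=qwfall
    fi
    # Runtimes above 24 hours need the corresponding "-long" queue.
    if [ "$pq_hours" -gt 24 ]; then
        pq_queue="$pq_queue-long"
    fi
    echo "$pq_queue"
}

# Example: a job needing 6 GB for 30 hours.
# pick_queue 6 30
```

The sketch never returns qwflm or qwflm-long: choosing the 4 GB nodes explicitly is a matter of leaving the 8 GB nodes free for others, which a memory requirement alone cannot decide.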
4. Utilities
- qwf-clear: Clears the local scratch on the compute nodes.
- qwf-count: Returns the number of available compute nodes.
- qwf-foreach: Executes a command sequentially on each node.
- qwf-hosts: Returns a list of the available compute nodes.
- qwf-list: Constructs a list of compute node names such as « qwf001.qwf,qwf002.qwf,…,qwf016.qwf ».
- qwf-passwd: Updates your password on all the nodes. Please do not change your password using the standard unix « passwd » as it would only change it on lqmmaster.
- qwf-sync: Pushes the contents of your lqmmaster:/scratch/$USER folder to the compute nodes' qwf#:/scratch/$USER folders.
- qwf-sync-back: Pulls back the contents of the compute nodes qwf#:/scratch/$USER into your current directory.
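To illustrate the comma-separated node-name format that « qwf-list » produces, the sketch below builds such a list for a contiguous range of node numbers in plain shell. The function name `host_range` is hypothetical and is not one of the cluster utilities. Whether the -H option of « qwf-sync » expects names with the « .qwf » suffix is an assumption based on the « qwf-list » example.

```shell
# Hypothetical helper: print a comma-separated list of node names for the
# node numbers FIRST..LAST, in the form qwf001.qwf,qwf002.qwf,...
# Pass plain integers (5, not 05): printf would read a leading zero as octal.
# Usage: host_range FIRST LAST
host_range() {
    hr_i=$1
    hr_last=$2
    hr_list=""
    while [ "$hr_i" -le "$hr_last" ]; do
        # Zero-pad the node number to three digits.
        hr_name=$(printf 'qwf%03d.qwf' "$hr_i")
        # Add a comma only when the list already has an entry.
        hr_list="${hr_list:+$hr_list,}$hr_name"
        hr_i=$((hr_i + 1))
    done
    echo "$hr_list"
}

# Example: the three 8 GB nodes qwf0[02-04].
# host_range 2 4
```

A list built this way could then be passed as HOSTSLIST, for instance `qwf-sync -H "$(host_range 2 4)"`, provided the assumption about the name format holds.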
folder with the content. This is unidirectional. Files present on lqmmaster:/scratch/$USER
that are absent from or different on the nodes are copied to the nodes. Files on the nodes
that are absent from or different on lqmmaster are not pulled back (see qwf-sync-back).
This is equivalent to running:
rsync -a /scratch/$USER $USER@qwfhost:/scratch$USER
for each qwf host « qwfhost ».
Usage: qwf-sync [OPTIONS]
-h Print this help message.
-H HOSTSLIST Push only on the nodes from HOSTSLIST, a comma-separated list of nodes