RDFMG Guide

Cluster Information

The RDFMG cluster is a computing cluster comprising 66 compute nodes managed from a single head node. It serves as a powerful computational resource for the Nuclear Engineering group at NCSU. The head node is powered by an AMD Opteron 6320 processor with 16 cores. Of the 66 compute nodes:

  • 14 nodes are equipped with AMD Opteron 6376 processors (64 cores per node).
  • 40 nodes are equipped with AMD EPYC 7452 processors (64 cores per node, 2 threads per core).
  • 12 nodes are equipped with AMD EPYC 7513 processors (64 cores per node, 2 threads per core).

This configuration underscores the cluster’s capability to handle extensive computations, essential for advanced nuclear engineering simulations and analyses.

 

Server administrators:

Pascal Rouxelin:     pnrouxel@ncsu.edu
Mario Milev:           mlmilev@ncsu.edu

 

1. How to establish a connection to the cluster?

First-time users (Windows):

1- You need an SSH client such as PuTTY or MobaXterm (both available free of charge). The latter is recommended.

https://mobaxterm.mobatek.net/download.html

2- You also need an SFTP client for file transfers, typically WinSCP (open source); MobaXterm can be used for this as well.

https://winscp.net/eng/download.php

3- In MobaXterm, click on the top-left tab “Session”. In the window that pops up, click on “SSH”, then enter the host name (rdfmg.ne.ncsu.edu) and your username (<NCSU_ID>). For example, if your name is Christopher Waddle and your NCSU ID is “cwaddl”:

Host name: rdfmg.ne.ncsu.edu
Username: cwaddl

The host name depends on the server you would like to connect to.

Your temporary password is your NCSU ID (in the example, cwaddl).

4- Change your password as soon as you sign in: type the command passwd in the terminal.
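
The prompt sequence looks roughly like the following; the exact wording depends on the Linux distribution running on the head node:

passwd
Changing password for user cwaddl.
Current password:
New password:
Retype new password:
passwd: all authentication tokens updated successfully.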

5- (optional) If you are off campus or use the ncsu_guest Wi-Fi, you will need a VPN to access the cluster. Connect to the VPN before entering your credentials in MobaXterm (or PuTTY) and WinSCP.

https://oit.ncsu.edu/campus-it/campus-data-network/vpn/

First-time users (Mac):

1- Open a terminal and type:

ssh <NCSU_ID>@rdfmg.ne.ncsu.edu

For example, if your name is Chris Waddle and your user ID is cwaddl:

ssh cwaddl@rdfmg.ne.ncsu.edu
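
On the very first connection, ssh asks you to confirm the server’s host key and then prompts for your password (the temporary password is your NCSU ID, as for Windows users). The exchange looks roughly like the following; the key type and fingerprint shown here are only placeholders:

The authenticity of host 'rdfmg.ne.ncsu.edu' can't be established.
ED25519 key fingerprint is SHA256:<fingerprint>.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
cwaddl@rdfmg.ne.ncsu.edu's password: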

2- For file transfers, you can use software such as FileZilla, or transfer files directly from your terminal:

sftp <NCSU_ID>@rdfmg.ne.ncsu.edu

Here are useful commands to perform the transfers (an example session is shown after the list):

List items that you have on the server: ls

List items on your local machine: lls

Find your location on the server: pwd

Find your location on your local machine: lpwd

Download a file from server to your local machine: get <filename_on_server>

Upload a file from local machine to server: put <filename_on_local_machine>

Change directory locally: lcd

Change directory remotely: cd
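
For example, a hypothetical session that moves into a local working folder, uploads an input file and then downloads a results file could look like this (the paths and file names are placeholders):

sftp cwaddl@rdfmg.ne.ncsu.edu
sftp> lcd /Users/cwaddl/work
sftp> cd simulations
sftp> put myInput.inp
sftp> get results.out
sftp> exit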

3- Change your password as soon as you sign in: type the command passwd in the terminal.

4- (optional) If you are off campus or use the ncsu_guest Wi-Fi, you will need a VPN to access the cluster.

https://oit.ncsu.edu/campus-it/campus-data-network/vpn/

Current users:

Contact the server administrators to retrieve your credentials.

 

2. What is the structure of the RDFMG cluster?

The RDFMG cluster consists of one head node, called rdfmg, and 66 compute nodes, called node001 to node070. After logging in, you land in your home directory (under /home) on the rdfmg node. Simulations are dispatched from the rdfmg node to the compute nodes via Slurm scripts (see section 5). Slurm, the workload manager currently used on the cluster, is a job scheduling system widely used on Linux servers; it fulfills three essential functions (a quick sinfo example follows the list):

  • Allocates resources on the 66 compute nodes based on users’ requests; resources include the number of nodes, the number of CPUs, the simulation time, etc.
  • Handles the starting and execution of a “job” on the assigned nodes.
  • Distributes available resources among users based on a queue system.
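
A quick way to see what Slurm manages is the sinfo command, run from the head node: it lists the partitions (queues) and the state of their nodes. A minimal sketch (the exact output layout depends on the Slurm configuration):

sinfo          # summary of partitions and node states (idle, alloc, down, ...)
sinfo -N -l    # one line per node, with CPU counts and memory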

Basic Linux commands and lightweight scripts or programs (C, Python, Java, etc.) can be executed on the head node. Do not run computationally intensive simulations on the head node.

Two of the 66 compute nodes (node014 and node026) are dedicated to “interactive sessions” (see section 5). Interactive sessions are used for code debugging or quick on-the-fly simulations: they let the user log in to node014 or node026 and run simulation codes without Slurm scripts, which facilitates debugging and bypasses the queue system applied by the workload manager on the other nodes.

 

3. What are environment modules? How to load modules?

Users initialize their environment when they log in by setting environment information for every application they will use during the session. The Environment Modules package is a tool that simplifies shell initialization and lets users easily modify their environment during the session with modulefiles. Modules can be loaded and unloaded dynamically.

Here is an example of loading a module on the RDFMG cluster:

module load GCC/10.3.0

To unload a module:

module unload GCC/10.3.0

List the modules currently loaded:

module list 

Here are some module commands you may find useful (a short example session follows the table):

module avail            List the available modules. If there are multiple versions of a package, one of them is denoted as (default); loading the module without a version number gives you that default version.
module whatis           List all the available modules along with a short description.
module load MODULE      Load the named module.
module unload MODULE    Unload the named module, reverting back to the OS defaults.
module list             List all the currently loaded modules.
module help             Get general help information about modules.
module help MODULE      Get help information about the named module.
module show MODULE      Show details about the named module, including the changes that loading it will make to your environment.
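
Putting these commands together, a typical session might look like the sketch below; the module list output is only indicative, and the package names and versions available on the cluster may differ:

module avail
module load GCC/10.3.0
module list
Currently Loaded Modulefiles:
  1) GCC/10.3.0
module unload GCC/10.3.0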

 

4. Where are the available codes installed?

All the computer codes are installed in two directories:

/cm/shared/apps/ncsu

/cm/shared/codes

Restricted access is enforced. Some codes, such as RAVEN, OpenFOAM or DAKOTA, are open source: contact the server administrators to be granted access. For most of the codes, however, export control regulations restrict accessibility. License agreements are required to obtain executable or source privileges, and the license agreement process is handled on a code-by-code basis.

 

5. How to submit, monitor and cancel my jobs?

The RDFMG cluster uses the SLURM Workload Manager to manage and schedule jobs (see section 2). Jobs are submitted, monitored and cancelled with three terminal commands (a sample squeue output is shown after these commands):

sbatch <SlurmScript> Submits a job.

Example: sbatch  run.sh

squeue Displays the status of all jobs of all users. To display the jobs of one user:

squeue  -u cwaddl

scancel <JobID> Terminates a job prior to its completion. <JobID> is the job ID number displayed by squeue.

Example: scancel 215896
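
For reference, squeue prints one line per job. The output below is only a mock-up showing the default columns; a state (ST) of R means the job is running, PD means it is pending in the queue:

squeue -u cwaddl
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 215896      defq  jobname   cwaddl  R    1:23:45      1 node030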

SLURM includes numerous directives, which are used to specify resource requirements. SLURM directives appear as header lines (lines that start with #SBATCH) in a script (i.e. a text file).

5.1 Job scripts

To run a job in batch mode on a high-performance computing system using SLURM, the resource requirements are written in a SLURM script.

Example of a SLURM job script (saved for example in a text file called run.sh):

#!/bin/bash
#SBATCH -J jobname                 # Name of your job, optional
#SBATCH -N 1                       # Number of nodes requested
#SBATCH -n 32                      # Number of CPUs. Max: N × 64
#SBATCH -t 24:00:00                # Maximum simulation time
#SBATCH -p defq                    # Queue (partition) name, e.g. defq, gradq or all

/cm/shared/codes/dummyCode/exec myInput.inp
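
If the code you run is MPI-parallel and spans several nodes, the executable is typically launched through srun (or the mpirun wrapper shipped with the code) so that it inherits the Slurm allocation. The sketch below reuses the placeholder code from above; check the documentation of the specific code for the correct launch command:

#!/bin/bash
#SBATCH -J mpi_jobname
#SBATCH -N 2                       # Two nodes
#SBATCH -n 128                     # 64 CPUs per node
#SBATCH -t 24:00:00
#SBATCH -p defq

srun /cm/shared/codes/dummyCode/exec myInput.inp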


5.2 Submitting jobs

To submit your job, use the SLURM sbatch command. If the command runs successfully, it will return a job ID to standard output, for example:

sbatch run.sh
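
The returned message has the form below; the number is the job ID that squeue displays and that you pass to scancel:

Submitted batch job 215896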

5.3 Interactive sessions

To request an interactive session, load the workbench module and call the command isqp:

module load codes/workbench

isqp -w node026

OR to log on node014:

isqp -w node014

The user will automatically be logged onto node014 or node026 with 4 CPUs. Executing a code then requires neither the sbatch command nor a Slurm script. Running the code presented in section 5.1 would simply be:

/cm/shared/codes/dummyCode/exec  myInput.inp
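
As a recap, a hypothetical interactive session could look like the following; type exit when you are done so that the node’s CPUs are released back to the scheduler:

module load codes/workbench
isqp -w node026
/cm/shared/codes/dummyCode/exec myInput.inp
exit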