Quantum ESPRESSO (QE) is "an integrated suite of Open-Source computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves, and pseudopotentials.". In general, it runs well on LUMI-C.
Installing Quantum ESPRESSO¶
We provide automatic installation scripts for several versions of QE. In general, the installation procedure is described on the EasyBuild page. The step by step procedure to install QE 7.1 is:
- Load the LUMI software environment:
module load LUMI/22.06.
- Select the LUMI-C partition:
module load partition/C.
- Load the EasyBuild module:
module load EasyBuild-user.
Then, you can run the install command
The installation takes about 3 minutes. Afterwards, you will have a module called "QuantumESPRESSO/7.1.0-cpeGNU-22.06" installed in your home directory. Load the module to use it
The usual QE binaries,
ph.x etc. will now be in your
QE via the Slurm scheduler, e.g.
srun pw.x. Please note
that you must do
module load LUMI/22.06 partition/C to see your Quantum
Espresso module in the module system. The same applies to the Slurm batch
scripts which you send to the compute nodes.
You can see other versions of QE that can be automatically installed by running the EasyBuild command
or checking the LUMI-EasyBuild-contrib repository on GitHub directly. It is generally recommended to use the latest version of Quantum Espresso compiled with the most current Cray programming environment. Older version often work, but may not run optimally as the system configuration (software, libraries, drivers, underlying hardware etc.) may have changed since they were originally built.
Example batch scripts¶
A typical batch job using 2 compute nodes and MPI only:
#!/bin/bash #SBATCH -J GaAs128 #SBATCH -N 2 #SBATCH --partition=small #SBATCH -t 00:30:00 #SBATCH --mem=200G #SBATCH --exclusive --no-requeue #SBATCH -A project_XYZ #SBATCH --ntasks-per-node=128 #SBATCH -c 1 export OMP_NUM_THREADS=1 module load LUMI/22.06 partition/C QuantumESPRESSO/7.1.0-cpeGNU-22.06 srun pw.x -nk 4 -i gab128.in > gab128.out
A typical batch job with MPI and 4 OpenMP threads per rank:
#!/bin/bash #SBATCH -J GaAs128 #SBATCH -N 2 #SBATCH --partition=small #SBATCH -t 00:30:00 #SBATCH --mem=200G #SBATCH --exclusive --no-requeue #SBATCH -A project_XYZ #SBATCH --ntasks-per-node=32 #SBATCH -c 4 export OMP_NUM_THREADS=4 export OMP_PLACES=cores export OMP_PROC_BIND=close module load LUMI/22.06 partition/C QuantumESPRESSO/7.1.0-cpeGNU-22.06 srun pw.x -nk 4 -i gab128.in > gab128.out
Making use of k-point parallelization (the flag
-nk) is very important in
Quantum Espresso. In the following test case, a GaAs supercell with
128-atoms, an SCF cycle completes in a about 1 hour on a LUMI-C
compute node using the default settings without k-point parallelization. By
-nk 2, the runtime is cut to 38 minutes (1.3x faster). K-point
parallelization is so important that it can be advantageous to reduce to the
number of cores used on the compute nodes, and/or increase the number of
k-points just to get even multiples which maximizes the value of
Consider the case with 14 k-points: 128 cores are not evenly divisible by 14,
which prevents the full use of k-point parallelization. But what about 14*8 =
112 cores? The graph below shows that you can actually gain speed by reducing
the number of active cores and carefully placing the MPI ranks on the right
The key to achieving the best performance is to explicitly bind the MPI ranks so that 7 cores are used on each core complex die (CCD). This means that the 8th core needs to be skipped on each CCD in the EPYC CPUs in LUMI-C. There is no easy way to do this in Slurm, other than hard-coding it like this:
$ srun --cpu-bind=map_cpu:0,1,2,3,4,5,6,8,9,10,11,12,13,14,16,17,18,19,20,21,22,24,25,26,27,28,29,30,32,33,34,35,36,37,38,40,41,42,43,44,45,46,48,49,50,51,52,53,54,56,57,58,59,60,61,62,64,65,66,67,68,69,70,72,73,74,75,76,77,78,80,81,82,83,84,85,86,88,89,90,91,92,93,94,96,97,98,99,100,101,102,104,105,106,107,108,109,110,112,113,114,115,116,117,118,120,121,122,123,124,125,126 pw.x ....
It is especially important to do this if you run with e.g. half the number of cores on compute node to free up memory.
OpenMP parallelization can be used primarily to save memory. It is
automatically used when
OMP_NUM_THREADS is set to something else than 1 in
the job script. OpenMP seems to work with decent efficiency on LUMI-C. Try
testing lower numbers of
OMP_NUM_THREADS first, about 2-4. Typically you will
get about the same speed as with MPI only for regular DFT calculations, maybe
somewhat faster (5-10%). The main benefit, however, is considerably less memory
usage, about 20% less for DFT. We have not tested the effect on exact exchange
or other higher order methods yet.
FFT task parallelization and pencil decomposition are both necessary when
running a large number of cores per pool of k-points. The
-ntg flag alone
seems to not work in recent versions of Quantum Espresso. You need to use it
together with pencil decomposition (
-pd true), which is an undocumented
feature. Typically, small values of
-ntg are sufficient to see a speed-up,
-ntg 2 -pd true, but you need to increase
-ntg as the number of
For sub-space diagonalization with SCALAPACK (the flag
-nd), keeping the
default (maximum) value is typically best. Reducing it will typically just
increase the runtime.
The RMM-DIIS diagonalization algorithm is significantly faster than the
Davidson algorithm on LUMI. This is the option
in the input file. For regular DFT, it can be 50% faster per SCF step, so it is
worth trying out. As usual, the electronic convergence may not be as a stable
as with the Davidson algorithm, so it depends on the use case whether there is
a net gain.