GGCE-PETSc interface: advanced features¶
In this tutorial we go over two advanced features enabled by the GGCE-PETSc interface: the double-parallel scheme and the basis-matrix separation, which together unlock large-scale phonon configuration calculations.
Double-parallel scheme¶
In Sec. Running GGCE in parallel, we alluded to the double-parallel scheme of running GGCE.
To recap: in the double-parallel scheme, the calculation is MPI-parallelized BOTH across different
\((k,\omega)\) points, AND within the solution of the matrix equation at each individual
point. To enable and control this feature when running GGCE-PETSc, we use the brigade_size
keyword when instantiating the Solver.
from mpi4py import MPI

COMM = MPI.COMM_WORLD
solver = MassSolverMUMPS(system=system, mpi_comm=COMM, brigade_size=1)
This is the only syntactic change required to enable this scheme.
By a “brigade”, we mean a group of MPI processes that collectively work on a single \((k,\omega)\) point. The brigade_size keyword is used in the code to split all MPI processes into brigades of equal size brigade_size. Each brigade works collectively on the matrix for a single \((k,\omega)\) point, and together the brigades cover the entire \((k,\omega)\) space.
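One way to picture this grouping is as a splitting of the world communicator into sub-communicators of size brigade_size. The sketch below is purely illustrative (it is not how GGCE is implemented internally) and only serves to show how ranks could map onto brigades:

from mpi4py import MPI

COMM = MPI.COMM_WORLD
brigade_size = 4  # hypothetical value

# consecutive ranks are grouped together: brigade index = rank // brigade_size
brigade_index = COMM.Get_rank() // brigade_size
brigade_comm = COMM.Split(color=brigade_index, key=COMM.Get_rank())

print(
    f"world rank {COMM.Get_rank()} -> brigade {brigade_index}, "
    f"local rank {brigade_comm.Get_rank()} of {brigade_comm.Get_size()}"
)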
By default, brigade_size=WORLD_SIZE, meaning that there is only one brigade, and the Solver proceeds sequentially through all \((k,\omega)\) points while maximally distributing the matrix of equations across all available MPI processes. The main limitation of this keyword is that the number of MPI processes must be evenly divisible by brigade_size; otherwise an error is raised and the calculation is terminated.
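Because of this constraint, it can be useful to verify the brigade layout up front. A minimal sanity check along these lines (not part of the GGCE API, just a sketch) might be:

world_size = COMM.Get_size()
if world_size % brigade_size != 0:
    raise ValueError(
        f"{world_size} ranks cannot be split into brigades of size {brigade_size}"
    )
n_brigades = world_size // brigade_size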
Note
Due to the specifics of the PETSc API, the number of “jobs” fed into Solver.greens_function(k, w, eta) – that is, len(k)*len(w) – is required to be evenly divisible by the number of brigades (otherwise the PETSc solvers hang indefinitely). The Solver automatically extends (pads) the \(k, \omega\) arrays with the minimum number of extra points needed to prevent this. The padding is applied to the \(\omega\) array by default, unless it has length 1, in which case the \(k\) array is padded instead. The padding is always done outside the range specified by the user, so the input \(k, \omega\) arrays are NOT redefined but merely extended; the user therefore retains full control over where the calculations are carried out.
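As a back-of-the-envelope illustration of this rule (the numbers below are hypothetical, and the snippet only mirrors the padding logic described above, not the library's internal code):

import numpy as np

n_brigades = 3                      # e.g. 12 MPI processes with brigade_size=4
k = np.linspace(0, np.pi, 2)        # 2 k-points
w = np.linspace(-3.0, -1.0, 5)      # 5 omega-points -> 10 jobs, not divisible by 3

# minimum number of extra omega points so that len(k)*len(w) divides evenly
extra = 0
while (len(k) * (len(w) + extra)) % n_brigades != 0:
    extra += 1
print(extra)  # 1 -> one omega point is appended beyond the user-specified range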
All calculations, including those over the padded points, are returned by greens_function. This is why the output array of greens_function() may sometimes have slightly larger dimensions than the original input arrays. This is of course easily fixed by indexing the output array up to the lengths of the original \(k, \omega\) arrays:
result = solver.greens_function(k, w, eta)
result_notpadded = result[:len(k),:len(w)]
The optimal choice of brigade_size is ultimately up to the user. For small systems, where the matrix can easily be stored and solved by a single CPU, the optimal speed-up is achieved with maximal splitting, i.e. brigade_size=1. When dealing with larger matrices (say of dimension greater than 100,000), the optimal speed-up has to be determined by experimentation. In practice, brigade_size should be just large enough for the matrix solution to be feasible within MUMPS in terms of memory and time, so that the largest possible number of independent brigades can be formed. For the calculation in the previous section, the matrices being solved are quite small (merely hundreds of equations). This allows us to set brigade_size=1 and enjoy linear speed-up in the calculation.
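To make the trade-off concrete, the following settings (the value 16 is purely hypothetical) illustrate the two regimes:

# small matrices (hundreds of equations): maximal splitting for maximal throughput
solver_small = MassSolverMUMPS(system=system, mpi_comm=COMM, brigade_size=1)

# very large matrices (hundreds of thousands of equations): give each brigade just
# enough ranks for MUMPS to hold and factorize the matrix, and no more, so that
# as many independent brigades as possible remain
solver_large = MassSolverMUMPS(system=system, mpi_comm=COMM, brigade_size=16)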
Basis-matrix separation¶
Without the double-parallel scheme, GGCE is limited to phonon configurations whose matrices can be solved reasonably quickly by SciPy’s sparse direct solver (dimension below roughly 100,000). Using the double-parallel scheme removes this restriction: if a suitable number of MPI processes (CPUs) is available, the matrix can be appropriately segmented, stored and solved in parallel, allowing much larger matrices to be handled.
However, one memory bottleneck still remains in GGCE: the basis. The basis, as computed and stored inside the System object, is at its core a Python dict that encodes the structure of equations induced by the chosen phonon cloud variational configuration. It does this agnostically of the particular values of \(k, \omega, \eta\), which is a great advantage: the basis only needs to be computed once and can then be used to generate all matrices for a range of \(k, \omega\) values.
And yet the basis itself poses a problem for calculations with large configurations. Being a dict, the basis is inherently not parallelizable: it cannot be shared across MPI processes. Instead, each process holds its own copy of the entire basis, resulting in enormous redundancy and memory usage.
One solution to this problem is to change the basis generation and storage approach: this is one of the goals of the next release. In the meantime, another solution is to separate the computation of the basis from the actual calculation of the Green’s function, by pre-computing and storing on disk the matrices to be solved by the sparse direct solver.
This is accomplished within the MassSolverMUMPS in two stages. First, we must define the directories for saving both the matrices to be solved and the Model and System objects:
matr_dir = p["root_matr_dir"]   # directory where the matrices will be written
root_dir = p["root_sys"]        # directory where the Model/System objects are saved
model = Model.from_parameters(...)
model.add_(...)
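Here p is simply a dictionary of user-chosen paths assembled earlier in the script; GGCE does not prescribe its contents. A purely hypothetical version might look like:

# hypothetical directory layout -- adapt to your own filesystem
p = {
    "root_matr_dir": "scratch/ggce_matrices",  # destination for the pickled matrices
    "root_sys": "scratch/ggce_system",         # destination for the Model/System objects
}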
Next, we create the first MassSolverMUMPS instance, the purpose of which is to create the matrices and save them to disk.
sysgen_petsc = MassSolverMUMPS(
    system=System(model),
    root=root_dir,
    mpi_comm=COMM,
    matr_dir=matr_dir,
    brigade_size=1,
)
Setting the matr_dir keyword determines the location where the matrices will be saved. Notice that the creation of the matrices is in principle parallelizable: while the basis is not shareable, in the sense that each MPI process still holds a redundant copy of it, groups of MPI processes can nevertheless work on generating and writing different matrices to disk. In especially dire cases, a smaller number of MPI ranks should be used, while keeping the total allocated memory constant, so that fewer redundant copies of the basis compete for memory.
To prepare the matrices and save them to disk, we use the following command:
sysgen_petsc.prepare_greens_function(k, w, eta)
del sysgen_petsc # to free up memory used to store the basis
The matrices are dumped to disk using pickle in sparse matrix format, meaning that we save three arrays – row indices, column indices, and values – for the nonzero entries only. They are saved with filenames formatted as matr_at_k_{kval:.10f}_w_{wval:.10f}_e_{etaval:.10f}.pkl.
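For inspection or debugging, one of these files can be read back by hand. The sketch below assumes the pickle unpacks into the three arrays described above; the exact on-disk layout may differ in your GGCE version:

import glob
import os
import pickle

from scipy.sparse import coo_matrix

# pick any one of the matrices written by prepare_greens_function
fname = sorted(glob.glob(os.path.join(matr_dir, "matr_at_k_*.pkl")))[0]

with open(fname, "rb") as f:
    data = pickle.load(f)

# assumption: the pickled object yields row indices, column indices and values
rows, cols, vals = data
matrix = coo_matrix((vals, (rows, cols)))
print(fname, matrix.shape, matrix.nnz)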
Once the matrices are written to disk, the solver object, which contains the basis, is deleted to free up memory.
Next, we create the Solver that will actually do the solving. Notice that we need to create a new System object with the autoprime=False keyword AND to pass this keyword to MassSolverMUMPS, to make sure that the basis is not generated when we pass the System to the Solver.
system_unprimed = System(model, autoprime=False)
executor_petsc = MassSolverMUMPS(
    system=system_unprimed,
    root=root_dir,
    mpi_comm=COMM,
    matr_dir=matr_dir,
    autoprime=False,
    brigade_size=1,
)
results_petsc = executor_petsc.greens_function(k, w, eta)
This second Solver then looks in the matr_dir directory and loads the matrices from there, constructing the filenames from the k, w, eta values it was given.
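For reference, the filename the second Solver looks for at a given point follows directly from the pattern above (the values here are hypothetical):

kval, wval, etaval = 0.5, -2.0, 0.005  # hypothetical (k, w, eta) point
fname = f"matr_at_k_{kval:.10f}_w_{wval:.10f}_e_{etaval:.10f}.pkl"
print(fname)  # matr_at_k_0.5000000000_w_-2.0000000000_e_0.0050000000.pkl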
We have shown how to run a basis-matrix separated calculation within a single Python script. Alternatively, it can be run as two separate, sequential calculations (i.e. with different batch job scripts), using different memory and CPU counts for each stage to optimize resource usage.