GGCE-PETSc interface: advanced features

In this tutorial we go over two advanced features enabled by the GGCE-PETSc interface: the double-parallel scheme and the basis-matrix separation, which together unlock large-scale phonon configuration calculations.

Double-parallel scheme

In Sec. Running GGCE in parallel, we alluded to the double-parallel scheme of running GGCE. To recap: in the double-parallel scheme, the calculation is MPI-parallelized BOTH across different \((k,\omega)\) points, AND within the solution of the matrix equation at each individual point. To enable and control this feature when running GGCE-PETSc, we use the brigade_size keyword when instantiating the Solver.

from mpi4py import MPI

COMM = MPI.COMM_WORLD

# system is the System object constructed in the previous section
solver = MassSolverMUMPS(system=system, mpi_comm=COMM, brigade_size=1)

This is the only syntactic change required to enable this scheme.

By a “brigade”, we mean a group of MPI processes that work collectively on a single \((k,\omega)\) point. The brigade_size keyword splits all MPI processes into brigades of that size: each brigade solves the matrix for one \((k,\omega)\) point at a time, and together the brigades cover the entire \((k,\omega)\) space.

By default, brigade_size=WORLD_SIZE, meaning that there is only one brigade, and the Solver proceeds sequentially through all \((k,\omega)\) points whilst maximally distributing the matrix of equations across all available MPI processes. The main limitation of this keyword is that the number of MPI processes must be evenly divisible by brigade_size; otherwise an error is raised and the calculation is terminated.
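To make the grouping concrete, the sketch below splits the world communicator into brigades by hand with mpi4py. This is only an illustration of the bookkeeping described above, not GGCE's internal code; the brigade_size value and the way consecutive ranks are grouped are assumptions made for the example.

from mpi4py import MPI

COMM = MPI.COMM_WORLD
brigade_size = 4                      # example value
world_size = COMM.Get_size()

# The world size must split evenly into brigades, mirroring the
# requirement described above.
assert world_size % brigade_size == 0
n_brigades = world_size // brigade_size

# Consecutive blocks of brigade_size ranks form one brigade; each brigade
# gets its own sub-communicator and works on its own (k, w) points.
brigade_index = COMM.Get_rank() // brigade_size
brigade_comm = COMM.Split(color=brigade_index, key=COMM.Get_rank())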

Note

Due to the specifics of the PETSc API, the number of “jobs” fed into Solver.greens_function(k, w, eta) – that is, len(k)*len(w) – is required to be evenly divisible by the number of brigades (otherwise the PETSc solvers hang indefinitely). The Solver automatically extends (pads) the \(k, \omega\) arrays with the minimum number of extra points needed to satisfy this. The padding is applied to the \(\omega\) array by default, unless it has length 1, in which case the \(k\) array is padded instead. The extra points are always added outside the range specified by the user, so the input \(k,\omega\) arrays are NOT redefined but merely extended, and the user retains full control over where the calculations are carried out.
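As a rough illustration of how much padding this implies, the snippet below finds the smallest number of extra \(\omega\) points that makes the job count divisible by the number of brigades. The grid sizes and brigade count are example values only; the actual padding is performed internally by the Solver.

import numpy as np

k = np.linspace(0, np.pi, 7)        # example momentum grid
w = np.linspace(-6.0, -4.0, 25)     # example frequency grid
n_brigades = 4                      # example value

# Smallest number of extra omega points such that
# len(k) * (len(w) + n_pad) is divisible by n_brigades
n_pad = 0
while (len(k) * (len(w) + n_pad)) % n_brigades != 0:
    n_pad += 1

print(n_pad)  # 3 extra omega points, appended outside the user-specified range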

All calculations, including those at the padded points, are returned by greens_function. This is why the output array of greens_function() may sometimes have slightly larger dimensions than the original input arrays. This is easily remedied by indexing the output array up to the lengths of the original \(k,\omega\) arrays:

result = solver.greens_function(k, w, eta)
result_notpadded = result[:len(k), :len(w)]  # discard any padded points

The optimal choice of brigade_size is ultimately up to the user. For small systems, where the matrix can easily be stored and solved by a single CPU, the optimal speed-up is achieved with maximum splitting, i.e. brigade_size=1. When dealing with larger matrices (say of dimension more than 100,000), the optimal speed-up has to be determined by experimentation. In practice, brigade_size should be just large enough for the matrix solution to be feasible within MUMPS in terms of memory and time, so that the largest possible number of independent brigades can be formed.
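As a hypothetical sizing exercise (the numbers are assumptions, not recommendations): suppose 256 MPI ranks are available and trial runs show that MUMPS needs at least 16 ranks to factorize the matrix within the available memory. The smallest workable brigade then yields the largest number of concurrent brigades.

world_size = 256     # total MPI ranks available (example value)
brigade_size = 16    # smallest size for which the MUMPS factorization fits (example value)

n_brigades = world_size // brigade_size
print(n_brigades)    # 16 brigades, each working through its own share of (k, w) points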

For the calculation in the previous section, we see that the sizes of matrices being solved are quite small (merely hundreds of equations). This allows us to set brigade_size=1 and enjoy linear speed-up in the calculation.

Basis-matrix separation

Without the double-parallel scheme, GGCE is limited to phonon configurations for which the resulting matrix can be solved reasonably quickly by SciPy’s sparse direct solver (dimensions up to roughly 100,000). Using the double-parallel scheme removes this restriction – if a suitable number of MPI processes (CPUs) is available, the matrix can be appropriately segmented, stored and solved in parallel, allowing much larger matrices to be handled.

However, one memory bottleneck still remains in GGCE: the basis. The basis, as computed and stored inside the System object, is at its core a Python dict that encodes the structure of the equations induced by the chosen phonon cloud variational configuration. It does so independently of the particular values of \(k, \omega, \eta\), which is a great advantage: the basis only needs to be computed once and can then be used to generate the matrices for an entire range of \(k,\omega\) values.

And yet the basis itself becomes a problem for calculations with large configurations. Being a dict, the basis is inherently not parallelizable: it cannot be shared across MPI processes. Instead, each process holds its own copy of the entire basis, resulting in enormous redundancy and memory usage.

One solution to this problem is to change the basis generation and storage approach: this is one of the goals of the next release. In the meantime, another solution is to separate the computation of the basis from the actual calculation of the Green’s function, by pre-computing and storing on disk the matrices to be solved by the sparse direct solver.

This is accomplished within the MassSolverMUMPS in two stages. First, we must define the directories for saving the matrices to be solved as well as the Model and System objects

# p is a dict of user-defined input parameters
matr_dir = p["root_matr_dir"]  # directory where the matrices will be written
root_dir = p["root_sys"]       # directory where the Model and System objects are saved

model = Model.from_parameters(...)
model.add_(...)

Next, we create the first MassSolverMUMPS instance, the purpose of which is to create the matrices and save them to disk.

sysgen_petsc = MassSolverMUMPS(
    system=System(model),
    root=root_dir,
    mpi_comm=COMM,
    matr_dir=matr_dir,
    brigade_size=1,
)

Setting the matr_dir keyword determines the location where the matrices will be saved. Notice that the creation of the matrices is in principle parallelizable: while the basis is not shareable, in the sense that each MPI process still holds a redundant copy of it, groups of MPI processes can still work on generating and writing different matrices to disk. In especially memory-constrained cases, a smaller number of MPI ranks should be used while keeping the allocated memory constant, since fewer ranks means fewer redundant copies of the basis.

To prepare the matrices and save them to disk, we use the following command

sysgen_petsc.prepare_greens_function(k, w, eta)

del sysgen_petsc  # to free up memory used to store the basis

The matrices are dumped to disk using pickle in sparse matrix format, meaning that we save three arrays – row indices, column indices, and values – for nonzero entries only. They are saved with a filename formatted as matr_at_k_{kval:.10f}_w_{wval:.10f}_e_{etaval:.10f}.pkl.
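If you ever need to inspect one of these files by hand, something along the following lines should work. The exact layout of the pickled object is an assumption here (the three arrays of row indices, column indices, and values described above), so treat this as a sketch rather than a guaranteed recipe.

import pickle

from scipy.sparse import coo_matrix

kval, wval, etaval = 0.0, -5.5, 0.005   # example point on the (k, w) grid
fname = f"matr_at_k_{kval:.10f}_w_{wval:.10f}_e_{etaval:.10f}.pkl"

# matr_dir is the directory defined earlier in this tutorial
with open(f"{matr_dir}/{fname}", "rb") as f:
    row, col, vals = pickle.load(f)     # assumed layout: three arrays for the nonzero entries

# Rebuild the sparse matrix in coordinate format for inspection
dim = int(max(max(row), max(col))) + 1
A = coo_matrix((vals, (row, col)), shape=(dim, dim))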

Once the matrices are written to disk, the solver object, which contains the basis, is deleted to free up memory.

Next, we create the Solver that will actually do the solving. Notice that we need to create a new System object with the autoprime=False keyword AND to pass this keyword to MassSolverMUMPS, to make sure that the basis is not generated when we pass the System to the Solver.

system_unprimed = System(model, autoprime=False)
executor_petsc = MassSolverMUMPS(
    system=system_unprimed,
    root=root_dir,
    mpi_comm=COMM,
    matr_dir=matr_dir,
    autoprime=False,
    brigade_size=1,
)

results_petsc = executor_petsc.greens_function(k, w, eta)

This second Solver then looks in the matr_dir directory and loads the matrices from there, constructing the filenames from the \(k, \omega, \eta\) values it was given.

We have shown how to run a calculation with basis-matrix separation within a single Python script. Alternatively, it can be run as two separate, sequential calculations (i.e. with different batch job scripts), using different memory and CPU counts for each stage to optimize resource usage.
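Schematically, a single driver script could be submitted twice, once per batch job, with an argument selecting the stage; the stage bodies are exactly the two code blocks shown above. The script name and the command-line convention here are hypothetical.

# run_stage.py (hypothetical): submit once with "prepare" and once with "solve"
import sys

stage = sys.argv[1]

if stage == "prepare":
    # build the Model/System and call prepare_greens_function(k, w, eta),
    # as in the first block of this subsection
    ...
elif stage == "solve":
    # build the autoprime=False System and call greens_function(k, w, eta),
    # as in the second block of this subsection
    ...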