Technical description of code

Entire code

MATLAB and Bash files:

MATLAB files:

Performance Evaluation

Entire code

See the sections that follow for an in-depth explanation of why the specific methods were chosen to optimize performance. The combined effect of these methods (outside of the work done in PySpark) is shown in the speed-up plot below, where the serial implementation was run on 1 core:

[Figure: speed-up of the integrated methods relative to the 1-core serial implementation]

As shown, the method of using parfeval with gpuArrays in the original basis approaches MATLAB's black-box optimization for CPUs (see Physical Model Description in the main branch). The load-balanced optimization method (see GPU-accelerated diagonalization in the main branch) is hindered by the change of basis introduced by the matrix concatenation. Using distributed arrays requires the data to be distributed unevenly amongst workers, which may lead to an increase in timing due to communication overhead.
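
For illustration, here is a minimal sketch of the parfeval-with-gpuArray pattern, assuming a block-diagonal propagator stored as a cell array of blocks. The block sizes, variable names, and anonymous helper function are hypothetical placeholders and are not taken from ED_evolve_block_diag.m; the sketch only shows launching one asynchronous GPU multiplication per block and collecting the results.

% Minimal sketch: evolve each diagonal block with parfeval + gpuArray.
% Block sizes and variable names are illustrative placeholders.
pool = gcp();                                  % start or reuse a parallel pool
blockSizes = [512 1024 2048 512];              % hypothetical block sizes from eigensolving
nBlocks = numel(blockSizes);

% Hypothetical data: one propagator block and one state segment per block
U   = arrayfun(@(m) rand(m)   + 1i*rand(m),   blockSizes, 'UniformOutput', false);
psi = arrayfun(@(m) rand(m,1) + 1i*rand(m,1), blockSizes, 'UniformOutput', false);

% Launch one asynchronous GPU matrix-vector multiplication per block
for b = nBlocks:-1:1                           % reverse loop implicitly allocates the future array
    futures(b) = parfeval(pool, @(A,x) gather(gpuArray(A)*gpuArray(x)), 1, U{b}, psi{b});
end

% Collect results as each future completes
psiOut = cell(1, nBlocks);
for b = 1:nBlocks
    [idx, y] = fetchNext(futures);
    psiOut{idx} = y;
end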

See the Reproducibility note in "Parallelization of Matrix Operations" below.

Parallelization of Matrix Operations

Matrix multiplication is required to properly evolve the states in this simulation, and MATLAB provides several tools to parallelize matrix operations. Their descriptions are included within the first comment block of parallelization_methods_testing.m.

Each of these tools was used in testing the time required to multiply two random matrices of size 2^N x 2^N, with N = 12. The test was repeated for n = 50 iterations, and the average timings of each method were as follows (all without using GPUs):

[Average time to multiply two random 2^12 x 2^12 matrices with each method, averaged over 50 iterations (no GPUs)]
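
As a rough illustration of how such averages can be collected, here is a generic timing sketch; it is not the actual contents of parallelization_methods_testing.m, and the second timed section simply shows one of the tools (distributed arrays) substituted into the same loop.

% Generic timing sketch: average the wall-clock time of multiplying two
% random 2^N x 2^N matrices over n iterations (illustrative only).
N = 12;  n = 50;
A = rand(2^N);  B = rand(2^N);

% Baseline: MATLAB's built-in multithreaded multiplication
t = zeros(n, 1);
for k = 1:n
    tic;
    C = A*B;                       %#ok<NASGU>
    t(k) = toc;
end
fprintf('baseline mean: %.3f s\n', mean(t));

% Example substitution: distributed arrays (no GPU); gather forces completion
Ad = distributed(A);  Bd = distributed(B);
for k = 1:n
    tic;
    Cd = gather(Ad*Bd);            %#ok<NASGU>
    t(k) = toc;
end
fprintf('distributed mean: %.3f s\n', mean(t));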

Given this, parallelizing the relevant segment of our ED_evolve_block_diag.m code (lines 160-171 and the associated for loop) was done via co-distributed / distributed arrays with no GPU and via parfeval using GPU arrays. The speed-up relative to a serial implementation of the same lines in ED_evolve_block_diag.m on 1 core was found to be:

[Figure: speed-up of the parallelized lines of ED_evolve_block_diag.m versus number of spins]

We see that for larger numbers of spins, co-distributed / distributed arrays achieve greater speed-up, whereas for smaller numbers of spins, our custom load-blocking method with GPU arrays and parfeval gives a greater speed-up. The difference between the two parfeval-with-GPU-array trials here (orange and green lines in the plot) comes from the different block sizes produced by eigensolving our system Hamiltonian.

Since the distributed-arrays trial was done without GPUs, and since distributed arrays require communication amongst the workers in the parallel pool while parfeval does not, this speed-up shows that optimizing communication reduces the bottleneck, implying that our matrix multiplication is a fine-grained application.
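
For reference, a minimal sketch of the codistributed-array pattern is given below, assuming a 4-worker pool and a hypothetical uneven column partition (the actual partition used for ED_evolve_block_diag.m is not reproduced here). The point it illustrates is that the multiplication of codistributed arrays handles the inter-worker communication internally.

% Minimal sketch: codistributed matrix multiplication with an uneven
% column partition across a 4-worker pool (partition values are hypothetical).
A = rand(4096);  B = rand(4096);
spmd
    part = [1200 1200 900 796];                   % one entry per worker; must sum to 4096
    codistA = codistributor1d(2, part, size(A));  % distribute columns of A unevenly
    codistB = codistributor1d(2, part, size(B));
    Ad = codistributed(A, codistA);
    Bd = codistributed(B, codistB);
    Cd = Ad * Bd;                                 % workers communicate during the multiply
end
C = gather(Cd);                                   % bring the full product back to the client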

Reproducibility. Four GPUs were used, with the characterization given below.

[GPU device characterization for the 4 GPUs used]
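
For completeness, this kind of device characterization can be regenerated with MATLAB's built-in GPU query functions (a generic snippet, not specific to this repository):

% Print the key properties of every GPU visible to MATLAB.
for g = 1:gpuDeviceCount
    d = gpuDevice(g);              % selects device g and returns its properties
    fprintf('GPU %d: %s, %.1f GB memory, compute capability %s\n', ...
        g, d.Name, d.TotalMemory/2^30, d.ComputeCapability);
end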

Challenging Aspects

Entire code

Parallelization of Matrix Operations
