The two main operations to consider are the parallel construction of the unassembled Galerkin operator and the parallel application of that operator. The aim of the algorithm is to spread the computation evenly over the available processors while minimizing interprocessor communication.
Because I wish to minimize the communication overhead, I have chosen a slightly different method of constructing the unassembled operator. Instead of using the element as the basic unit, I use the node. This means that my unassembled system consists of rows of the global operator rather than element submatrices. The reason for this choice should become clear as I outline my method.
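To make the node-based storage concrete, the following is a minimal sketch and not the implementation described here. It assumes a hypothetical one-dimensional mesh of linear elements on [0, 1], and the names `element_matrix`, `build_node_rows`, `apply_rows`, and `partition` are invented for illustration. Each simulated processor owns a contiguous block of nodes and stores only the rows of the global operator belonging to those nodes.

```python
# Illustrative assumption: 1D linear elements on a uniform mesh of [0, 1].
# Element e connects nodes e and e + 1.

def element_matrix(h):
    """2x2 stiffness matrix of one linear element of length h."""
    return [[1.0 / h, -1.0 / h], [-1.0 / h, 1.0 / h]]


def partition(n_nodes, n_procs):
    """Split node indices into contiguous, nearly equal blocks."""
    base, extra = divmod(n_nodes, n_procs)
    blocks, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def build_node_rows(n_nodes, owned_nodes):
    """Return {node: {column: value}} for the nodes owned by one processor.

    Each row is a complete row of the global operator, accumulated from
    every element that touches the node.
    """
    h = 1.0 / (n_nodes - 1)
    ke = element_matrix(h)
    rows = {}
    for i in owned_nodes:
        row = {}
        for e in (i - 1, i):  # the (at most two) elements touching node i
            if 0 <= e < n_nodes - 1:
                local = i - e  # local index of node i within element e
                for b in (0, 1):
                    j = e + b
                    row[j] = row.get(j, 0.0) + ke[local][b]
        rows[i] = row
    return rows


def apply_rows(rows, x):
    """Apply the owned rows to a full vector x; returns {node: value}."""
    return {i: sum(v * x[j] for j, v in row.items()) for i, row in rows.items()}


# Usage: two simulated processors, five nodes.
n_nodes, n_procs = 5, 2
x = [float(i * i) for i in range(n_nodes)]
y = [0.0] * n_nodes
for owned in partition(n_nodes, n_procs):
    rows = build_node_rows(n_nodes, owned)
    for i, value in apply_rows(rows, x).items():
        y[i] = value
```

In this sketch each processor produces the finished result for its own nodes. The only data it needs from elsewhere are the entries of `x` at neighboring nodes owned by another processor, and no partial sums for shared nodes have to be combined afterwards.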