I have written a finite element code that runs efficiently on the CM. A key factor in this efficiency is the use of a regularly connected mesh, which allows processor-to-processor communication through fast ``CSHIFT'' operations rather than the slower global routing. I solved the Galerkin system of equations by the conjugate gradient (CG) method, which avoids constructing the global operator. I chose a nodal decomposition of the operator rather than the usual element matrix decomposition; with this decomposition, data need be distributed only once per CG iteration. If the stencil compiler is used, the speed of the algorithm is limited by the speed of the SUM operation. On a 4096-processor machine the code ran at 275 MFlops. On a 64K-processor machine, assuming that performance scales linearly with the number of processors (a factor of 16), I would expect to achieve 4.4 GFlops.
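The ingredients summarized above (neighbour communication by circular shifts, a nodal operator applied without assembling a global matrix, and CG with global sums) can be illustrated with a minimal sketch. It is not the CM code. It assumes a one-dimensional uniform mesh of linear elements, whose nodal stencil is $[-1, 2, -1]$, with homogeneous Dirichlet conditions imposed by a mask. The names `cshift`, `dot`, `apply_operator` and `conjugate_gradient` are hypothetical helpers introduced here for illustration.

```python
def cshift(a, shift):
    """Circular shift in the sense of Fortran 90 CSHIFT:
    result[i] = a[(i + shift) % n]."""
    k = shift % len(a)
    return a[k:] + a[:k]


def dot(a, b):
    """Global reduction, the analogue of SUM over an elementwise product."""
    return sum(x * y for x, y in zip(a, b))


def apply_operator(u, mask):
    """Matrix-free nodal operator (assumed stencil [-1, 2, -1]).

    Each node obtains its two neighbours from two circular shifts, so no
    global matrix is built. The mask zeroes the Dirichlet boundary nodes,
    which also discards the values that wrap around the ends of the mesh.
    """
    right = cshift(u, 1)    # right[i] = u[i + 1]
    left = cshift(u, -1)    # left[i]  = u[i - 1]
    return [m * (2.0 * c - l - r) for m, c, l, r in zip(mask, u, left, right)]


def conjugate_gradient(b, mask, tol=1e-10, max_iter=1000):
    """Unpreconditioned CG; one operator application per iteration."""
    x = [0.0] * len(b)
    r = [m * bi for m, bi in zip(mask, b)]   # residual for x = 0
    p = list(r)
    rr = dot(r, r)
    iterations = 0
    while iterations < max_iter and rr ** 0.5 >= tol:
        ap = apply_operator(p, mask)          # the only communication step
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
        iterations += 1
    return x, iterations


# Usage: -u'' = 1 on (0, 1) with u(0) = u(1) = 0, on 9 nodes.
n = 9
h = 1.0 / (n - 1)
mask = [0.0] + [1.0] * (n - 2) + [0.0]
load = [h * h] * n          # element load vector with the factor 1/h cleared
u, iterations = conjugate_gradient(load, mask)
print(iterations, u[n // 2])
```

Each CG iteration in this sketch performs one neighbour exchange (inside `apply_operator`) and two global sums (the calls to `dot`), which mirrors why the reduction becomes the limiting cost once the stencil application itself is fast. For this model problem the nodal values should agree with the exact solution $u(x) = x(1-x)/2$.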