For 2d brittlefracture problems, TNNMG is too slow. For load steps away from the actual fracture, the operatorsplitting algorithm is twice as fast, if not faster.
I measured the TNNMG performance with the gprof profiler. About a quarter of the time is spent in each of the three methods: the presmoother, IntegralDirectionalRestriction::subDifferential
, and IntegralFunctionalConstrainedLinearization::bind
. I think there a still ways to make the code quite a bit faster; here are patches and ideas:

IntegralFunctionalConstrainedLinearization::addRowTransposedXMatrixXRow
: Exploit that the result matrix is symmetric. 
IntegralDirectionalRestriction::subDifferential
: Precompute the productevaluationMatrix * direction
. 
Use gradientaware truncation. Early tests suggest that this reduces the number of iterations at least when the load is low. 
Use a different line search algorithm. Calls to subDifferential
are not cheap (though much less expensive with the patch mentioned above), and we know that the functional is smooth. 
Exploit matrix symmetry in IntegralCoordinateRestriction::addHessian
. 
Rewrite the evaluationMatrix
to contain derivatives of scalar shape functions, rather than lineared strains. I believe that several parts of the algorithm are limited by the memory speed, rather than by computing speed. Storing only shape function gradients instead of strains would reduce the required memory bandwidth considerably. It would also reduce our overall memory requirements which, according to Daniel's measurements, are nothing to brag with either. 
Even the entries of the evaluationMatrix
are sparse. In 2d, the entries are 4x3 matrices, but 7 of the 12 are always zero. If the matrixvector multiplication becomes a bottleneck then using a dedicated matrix type may speed up the code. 
Use SymmetricMatrix
objects for small symmetric objects. Not sure how much this would buy us.