Just a minor thing I noticed.
include/cutlass/half_t.h sets std::numeric_limits<half_t>::digits to 10, but it should be 11. The implicit leading 1 bit is normally counted, e.g. std::numeric_limits<float>::digits == 24 (23 stored mantissa bits plus the implicit one).
It appears to be wrong on lines 565 and 628. I'm not sure whether anything in cutlass depends on the current value, so I'm not filing a pull request. Let me know if I should.
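
For reference, here is a minimal sketch of the counting convention I mean. It uses only the standard library and does not include half_t.h. The helper digits_with_implicit_bit is a hypothetical name I made up for illustration, not something from cutlass, and the checks assume float and double are IEEE 754 binary32/binary64 on the target platform.

```cpp
#include <cfloat>
#include <limits>

// Hypothetical helper (not part of cutlass): for a binary floating-point
// format with a hidden leading 1 bit, numeric_limits<T>::digits is the
// number of stored mantissa bits plus one.
constexpr int digits_with_implicit_bit(int stored_mantissa_bits) {
    return stored_mantissa_bits + 1;
}

// Assumption: the platform uses IEEE 754 for float and double.
static_assert(std::numeric_limits<float>::is_iec559, "float is not IEEE 754");
static_assert(std::numeric_limits<double>::is_iec559, "double is not IEEE 754");

// float: 23 stored bits + 1 implicit bit = 24, matching FLT_MANT_DIG.
static_assert(std::numeric_limits<float>::digits == FLT_MANT_DIG, "");
static_assert(std::numeric_limits<float>::digits == digits_with_implicit_bit(23), "");

// double: 52 stored bits + 1 implicit bit = 53.
static_assert(std::numeric_limits<double>::digits == digits_with_implicit_bit(52), "");

// IEEE 754 binary16 stores 10 mantissa bits, so by the same convention
// a half-precision type would report 11 digits.
static_assert(digits_with_implicit_bit(10) == 11, "");
```

If the convention holds for float and double as above, the same reasoning gives 11 for a 16-bit half with 10 stored mantissa bits.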