How to write a fast Softmax kernel