Reported results include a 2.4× speedup on Long Range Arena, a 3× speedup on GPT-2, and attention that runs 3× faster than standard attention at common sequence lengths. The backward pass normally requires the matrices S, P ∈ ℝ^(N×N) to compute the gradients with respect to Q, K, and V. However, by storing the output O and the softmax normalization statistics (m, ℓ), we can recompute the attention matrices S and P in the backward pass from blocks of Q, K, and V held in SRAM.
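The recomputation idea can be illustrated with a small, pure-Python sketch. It is not the actual GPU kernel: it only shows that, once the per-row softmax statistics are saved, any block of P can be rebuilt from the matching blocks of Q and K without keeping the full N×N matrix. The function names, the block size, and the reading of the two statistics as the row maximum `m` and the row sum of exponentials `l` are assumptions made for this illustration, and the usual 1/sqrt(d) scaling is omitted for brevity.

```python
import math

def matmul_t(A, B):
    """Return A @ B^T for matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(ra, rb)) for rb in B] for ra in A]

def forward_stats(Q, K, V):
    """Plain attention forward pass that records the per-row statistics m and l."""
    S = matmul_t(Q, K)                                   # N x N scores
    m = [max(row) for row in S]                          # row-wise maximum
    E = [[math.exp(s - mi) for s in row] for row, mi in zip(S, m)]
    l = [sum(row) for row in E]                          # row-wise sum of exponentials
    P = [[e / li for e in row] for row, li in zip(E, l)]
    d = len(V[0])
    O = [[sum(p * v[j] for p, v in zip(row, V)) for j in range(d)] for row in P]
    # P is returned only so the recomputation can be checked against it;
    # the point of the technique is that it does not need to be stored.
    return O, m, l, P

def recompute_block(Q_blk, K_blk, m_blk, l_blk):
    """Rebuild one block of P from blocks of Q and K plus the saved statistics."""
    S_blk = matmul_t(Q_blk, K_blk)
    return [[math.exp(s - mi) / li for s in row]
            for row, mi, li in zip(S_blk, m_blk, l_blk)]

def recompute_P(Q, K, m, l, block=2):
    """Assemble the full P block by block (done here only for comparison)."""
    N = len(Q)
    P = [[0.0] * N for _ in range(N)]
    for i in range(0, N, block):
        for j in range(0, N, block):
            P_blk = recompute_block(Q[i:i + block], K[j:j + block],
                                    m[i:i + block], l[i:i + block])
            for r, row in enumerate(P_blk):
                P[i + r][j:j + len(row)] = row
    return P
```

In this sketch the saved statistics take O(N) memory in place of the O(N²) needed for S and P, at the cost of redoing the score computation for each block during the backward pass.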