runs very slowly. Storing R0 to an 8-byte memory location (with the HO four bytes set to zero) and then loading D0 from that location is much faster, though it does wipe out S1, because D0 overlays both S0 and S1. For example, a numeric-to-hexadecimal string conversion function I wrote ran almost three times faster once it stopped using the single-precision registers. Note that this issue does not seem to occur on the Raspberry Pi 3's Cortex-A53 CPU.
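
The store-then-load approach might look something like the following minimal sketch. It assumes GNU assembler syntax, a little-endian 32-bit ARM target with VFP, and two hypothetical names that do not come from the original code: a scratch variable `scratch64` and a helper `r0ToD0`. Treat it as an illustration of the idea rather than the code that produced the timings above.

```asm
        .syntax unified
        .arch   armv7-a
        .fpu    vfpv3-d16

        .data
        .balign 8
scratch64:                      @ Hypothetical 8-byte scratch location
        .word   0               @ LO word: receives the value in R0
        .word   0               @ HO word: never written, so it stays zero

        .text
        .global r0ToD0          @ Hypothetical helper name
        .type   r0ToD0, %function

@ r0ToD0: copies R0, zero-extended to 64 bits, into D0.
@ Overwrites R1 and both halves of D0 (that is, S0 and S1).

r0ToD0:
        ldr     r1, =scratch64  @ R1 = address of the scratch location
        str     r0, [r1]        @ Store R0 into the LO four bytes
        vldr    d0, [r1]        @ Load all eight bytes into D0
        bx      lr
```

Because the sketch uses a single static scratch location, it is not reentrant or thread-safe; a scratch slot on the stack would avoid that at the cost of a little extra stack management.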