Maximus32
Developer
I wanted to experiment with the MMI instructions to see how much they can benefit performance. I searched for examples but couldn't find any, so I created a few tests. I must say I'm surprised how powerful the MMI instructions are compared to the standard MIPS III instructions!
So basically what we get is 128-bit integer vector instructions that can operate on the following vectors:
- 16x int8_t
- 8x int16_t
- 4x int32_t
- 2x int64_t
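The lane counts above follow directly from dividing the 128-bit register width by the element size. A minimal sketch of that arithmetic, checked at compile time (the `laneCount` helper is a hypothetical name for illustration, not something from libeevec):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of libeevec: how many lanes of type Lane
// fit in one 128-bit (16-byte) register.
template <typename Lane>
constexpr std::size_t laneCount() {
    return 16 / sizeof(Lane);
}

// Compile-time checks matching the list above.
static_assert(laneCount<std::int8_t>()  == 16, "16 x int8_t");
static_assert(laneCount<std::int16_t>() == 8,  "8 x int16_t");
static_assert(laneCount<std::int32_t>() == 4,  "4 x int32_t");
static_assert(laneCount<std::int64_t>() == 2,  "2 x int64_t");
```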
I created a small library named "libeevec", an EE Vector Library:
https://gitlab.com/ps2max/libeevec
Using C++ and classes it's possible to do:
Code:
int32_t i1[] = {1, 2, 3, 4};
int32_t i2[] = {2, 3, 4, 5};
CInt32_4 vi1(i1), vi2(i2);
CInt32_4 vresult = vi1 + vi2;
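To show the general idea behind such a class without needing a PS2 toolchain, here is a portable sketch of a 4-lane wrapper with an overloaded `operator+`. The `Int32x4` name and its layout are assumptions made for illustration; this is not how libeevec's CInt32_4 is implemented. On the EE, the loop body would presumably collapse into a single parallel add such as PADDW.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical portable stand-in for a 4-lane int32_t vector class.
// NOT libeevec's CInt32_4: it only mimics the operator syntax on any host.
class Int32x4 {
public:
    // Load 4 consecutive values from memory.
    explicit Int32x4(const std::int32_t* p) : lanes_{{p[0], p[1], p[2], p[3]}} {}

    // Lane-wise add. The sum is done in unsigned arithmetic so that overflow
    // wraps around instead of being undefined behaviour for signed integers.
    Int32x4 operator+(const Int32x4& rhs) const {
        Int32x4 out(*this);
        for (std::size_t i = 0; i < 4; ++i) {
            out.lanes_[i] = static_cast<std::int32_t>(
                static_cast<std::uint32_t>(lanes_[i]) +
                static_cast<std::uint32_t>(rhs.lanes_[i]));
        }
        return out;
    }

    // Read a single lane back.
    std::int32_t operator[](std::size_t i) const {
        assert(i < 4);
        return lanes_[i];
    }

private:
    std::array<std::int32_t, 4> lanes_;
};
```

With the arrays from the snippet above, `Int32x4(i1) + Int32x4(i2)` holds the lanes 3, 5, 7 and 9.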
As a demo, I've tried to optimize a sound mixer that mixes int16_t samples from 32 different sound sources. This scales very well using the CInt16_8 class, and the result is a 10x speedup.
The reason it's more than 8x faster is that the MMI instructions also clamp the samples for free, so instead of writing for each sample:
Code:
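// Mix (dest must be wider than int16_t, e.g. int32_t, for the clamps to work)
dest += source[iSource].sample[iSample];
// Clamp max
if (dest > 32767)
    dest = 32767;
// Clamp min (int16_t minimum, which is also what PADDSH saturates to)
if (dest < -32768)
    dest = -32768;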
We can mix and clamp 8 samples at once with just 1 instruction:
Code:
// PADDSH : Parallel Add with Signed Saturation Halfword
dest += source[iSource].sample[iSample8];
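For anyone who wants to check the saturation behaviour on a PC first, the following is a portable emulation of what a saturating 8-lane add does: each lane is summed in a wider type and clamped to the int16_t range. The names `Int16x8`, `paddshEmulated` and `mixInto` are assumptions for this sketch and do not come from libeevec; on the EE the whole inner loop is the single PADDSH instruction.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 8-lane int16_t vector type used only for this sketch.
using Int16x8 = std::array<std::int16_t, 8>;

// Add two samples in 32 bits, then clamp to the int16_t range.
inline std::int16_t saturatingAdd16(std::int16_t a, std::int16_t b) {
    const std::int32_t sum = std::int32_t{a} + std::int32_t{b};
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return static_cast<std::int16_t>(sum);
}

// Portable emulation of a saturating add over 8 halfword lanes.
inline Int16x8 paddshEmulated(const Int16x8& a, const Int16x8& b) {
    Int16x8 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = saturatingAdd16(a[i], b[i]);
    }
    return out;
}

// Mix one source buffer into dest, 8 samples per step.
// sampleCount must be a multiple of 8 in this simplified version.
inline void mixInto(std::int16_t* dest, const std::int16_t* src,
                    std::size_t sampleCount) {
    assert(sampleCount % 8 == 0);
    for (std::size_t i = 0; i < sampleCount; i += 8) {
        Int16x8 d{}, s{};
        std::copy_n(dest + i, 8, d.begin());
        std::copy_n(src + i, 8, s.begin());
        d = paddshEmulated(d, s);
        std::copy_n(d.begin(), 8, dest + i);
    }
}
```

Calling `mixInto` once per sound source accumulates all sources into `dest`, saturating after every add, just like the per-sample clamp in the scalar version.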
Anyway, it was just a small experiment. I hope this inspires some of you to also experiment with the EE MMI. Any additions to the 'library' via GitLab pull requests are welcome, but other than this experiment I have no plans for it at this time.
Some more random thoughts on the vector instructions:
- With newer compilers it is possible to automatically use vector instructions. Search for "auto vectorization".
- What the MMI instructions are for integers, the VU0 is for floats. We can create a CFloat32_4 class that uses the VU0. Perhaps an experiment for another day ;-).
- The VU0 could also be used by the auto vectorization of compilers ... dreaming of clang with MMI and VU0 auto vectorization support ...
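On the auto-vectorization point: the kind of code a vectorizing compiler looks for is a simple counted loop with no dependencies between iterations, like the mixer loop sketched below. Whether any compiler actually turns this into MMI instructions for the EE is an assumption I have not verified; on desktop targets GCC enables its loop vectorizer with `-O3` or `-ftree-vectorize`. The `mixScalar` name is made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Plain scalar mixer loop written so a vectorizer has an easy job:
// fixed stride, no cross-iteration dependencies, clamp via min/max.
// Whether it gets vectorized depends entirely on compiler and target.
void mixScalar(std::int16_t* dest, const std::int16_t* src, std::size_t count) {
    assert(dest != nullptr && src != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
        const std::int32_t sum = std::int32_t{dest[i]} + std::int32_t{src[i]};
        const std::int32_t clamped =
            std::min<std::int32_t>(std::max<std::int32_t>(sum, INT16_MIN), INT16_MAX);
        dest[i] = static_cast<std::int16_t>(clamped);
    }
}
```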