PS2 [MX4SIO/SIO2SD] SD Card Adapter and SD-driver for the PS2 SIO2 interface

There is another way to prevent sio2man from interrupting the transfer. Calling sio2_pad_transfer_init will lock sio2man for our use. Calling sio2_transfer_reset will unlock sio2man. This is what I was using in my latest attempts to make the sio2sd driver compatible with sio2man, without the need to disable interrupts. This works perfectly with sio2man from ps2sdk, but it did not work with ROM0:SIO2MAN.
Great! I didn't think about that. The different SIO2MAN modules have different export tables AFAIK, so it will be necessary to detect which module (exports?) version the game is using and to use the corresponding exports table... and to hope that some unknown module won't be used. Which is why I think it won't be a bad idea to test the code I proposed in the psx-scene thread about suspending and resuming SIO2 transfers using the hardware.

Why do you think DMA transfers are slower than PIO transfers? Or do you mean the byte reversing will take more time when processed in a separate task?
Simply because the IOP doesn't run while DMA is active - more specifically - while a DMA (BCR register) block is being transferred.
Once DMA has been set up from the IOP side, the DMAC waits for the DREQ line to be asserted. Then the DMAC asserts the DACK line and transfers 0x10 words (or whatever the size of the BCR block is) in the way set up by the SSBUSC registers, or in the case of internal hardware like the SIO2 - however the SIO2-DMAC logic was wired. For the SIO2, the transfer should be to/from its FIFO, so this should not take long. Then the DMAC checks if DREQ is asserted. (At that point it arbitrates between channels, letting higher priority DMA channels run.) If it is, it continues with another SIO2 transfer and if not, it lets lower priority DMA channels, including the IOP core, run for a few cycles, before checking DREQ again. This is done for the entire DMA transfer size.

However I never checked how exactly DMA for the SIO2 works. In the document I wrote (attached here https://www.psx-place.com/threads/s...e-ps2-sio2-interface.29210/page-9#post-244748 ) I mention that DMA BCR blocks are sync'ed with the SIO2 transfer queue elements. But this is about all I know.
My test showed that using DMA is slower. If somebody can find a way to use DMA and complete the transfer faster, then I'd be more than happy to agree on that.

For instance an application like uLE that can copy files from HDD to SD. If both drivers only perform well with 100% CPU load, then having 2 drivers copy files would result in terrible performance.
I don't quite agree - in fact, due to the fact that we are using the IOP 100% it means that we are not doing anything unneeded. One driver reads the data as fast as possible, and the other writes it as fast (and soon) as possible, resulting in maximum performance. It is up to the program that uses both to not request chunks of data of suboptimal size (and I believe that programs like uLE already are optimized for fast copying ..?).

So in theory, if the SD card can transfer at 1.8MB/s, and the FAT IOP can convert the data at 6.75MB/s, this would result in a CPU load of around 27%. For a slim ps2 the CPU load would be around 16%.
Assuming the DMA controller can provide 1.8MB/s data this sounds like a promising solution. If only I had the time to build and test it ;).
This is interesting! I really can't remember why DMA was making transfers slower now (as it was years since the test), but if it was due to some mistake of mine, this really has potential!
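
The load figure quoted above is simply the ratio between the card's transfer rate and the bit-reversal rate. A minimal sketch of that arithmetic, assuming the bit-reversal is the only significant CPU cost once DMA moves the data (the function name is made up for illustration):

```c
/* Estimated IOP load in percent when DMA moves the data and the CPU only
 * has to bit-reverse it. Assumption: bit-reversal is the only CPU cost. */
double iop_load_percent(double card_mb_per_s, double bitrev_mb_per_s)
{
    return 100.0 * card_mb_per_s / bitrev_mb_per_s;
}
```

With 1.8MB/s from the card and 6.75MB/s bit-reversal this gives roughly 27%, as quoted.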

The problem with the PS2 slim is that the USB reading speed is too slow. From what I read, this memory card adapter has a reading speed sufficient to overcome this problem; is that so?
And what do you think: could a USB device be connected through the memory card slot?
I don't think this will result in increased speed. (Such an idea is unfeasible, unless the interface used to connect the USB host controller is fast enough.) The only reason we get decent speeds from SD cards, is because:
- the interface is simple and matches the SIO2 interface.
- the modern SD cards are very fast (in fetching data), so even with slow interface transfer speeds, the overall transfer speed is decent.
 
SIO2MAN from the PS2SDK was based on XSIO2MAN.
Thanks, I'll try it out.

Simply because the IOP doesn't run while DMA is active - more specifically - while a DMA (BCR register) block is being transferred.
I'm assuming the DMA controller transfers data between the fifo and memory at memory speeds. So over 100MB/s. If the SD card transfers at 2MB/s this would add a 2% load on the IOP. I'm not sure I understood everything you wrote, but is this assumption correct? If not, what CPU load are you expecting on the IOP during DMA transfers?

I don't quite agree - in fact, due to the fact that we are using the IOP 100% it means that we are not doing anything unneeded. One driver reads the data as fast as possible, and the other writes it as fast (and soon) as possible, resulting in maximum performance.
Yes, maximum performance per device. And if the IOP only has to process 1 device at a time, you would be right. But the file copy example with this design looks like so:
Code:
read  block1 from device1 = 100% CPU load, 100% speed on device1,   0% speed on device2
write block1 to   device2 = 100% CPU load,   0% speed on device1, 100% speed on device2
read  block2 from device1 = 100% CPU load, 100% speed on device1,   0% speed on device2
write block2 to   device2 = 100% CPU load,   0% speed on device1, 100% speed on device2
Etc...
End result: 100% CPU load with 50% speed for both devices combined. Reading and writing cannot be performed at the same time.

What I'm proposing is that reading 1 device, and writing another device can be done simultaneously, like so:
Code:
read  block1 from device1                           = 33% CPU load, 100% speed on device1,   0% speed on device2
read  block2 from device1 + write block1 to device2 = 66% CPU load, 100% speed on device1, 100% speed on device2
read  block3 from device1 + write block2 to device2 = 66% CPU load, 100% speed on device1, 100% speed on device2
Etc...
End result: 66% CPU load, 100% speed for both devices combined. Not only does the overall speed increase, it also leaves room for more devices/drivers to process things at the same time.
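
The difference between the two schedules can be put in numbers. A small sketch, assuming every read and every write takes one time step and that the pipelined variant has two buffers available (both function names are hypothetical):

```c
/* Time steps needed to copy n blocks when each read or write takes one step.
 * Serial: read and write never overlap.
 * Pipelined: the write of block k overlaps the read of block k+1,
 * which needs two buffers and drivers that do not block each other. */
unsigned serial_steps(unsigned n)
{
    return 2u * n;
}

unsigned pipelined_steps(unsigned n)
{
    return n == 0 ? 0u : n + 1u;
}
```

For 8 blocks that is 16 steps against 9, so the pipelined copy approaches twice the speed as the number of blocks grows.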
 
I'm assuming the DMA controller transfers data between the fifo and memory at memory speeds. So over 100MB/s. If the SD card transfers at 2MB/s this would add a 2% load on the IOP. I'm not sure I understood everything you wrote, but is this assumption correct? If not, what CPU load are you expecting on the IOP during DMA transfers?
Quite true. I can't really remember why I got low DMA speeds.
I think I will do tests again.
The potential problem with DMA is that you will always have the transfer take one sector more worth of time (or however it takes to reverse the bits order), which could be quite a while, especially for small size filesystem operations.
And you can't assume that the other device (the one the data is written to) will be writing the data while the data is being read from the SD card, because programs like uLE use a single buffer (I am not sure I am correct about this), so:
- reads bufferSz data from input device
- writes bufferSz data to destination device, and the next iteration starts only when the writing code returns.
In theory, if the destination is a HDD, which has a big cache, then it will accept all the data very quickly, in which case the actual storing of data on the HDD platters may coincide with the next read from SD Card. But this is often not the case, for example if the target is a USB device or Ethernet.
However in all cases, I don't see how using DMA for reading the SD Card would improve the transfer speed, because whatever requested the data will continue its execution only when all the data it requested has been read, so we need the read to complete as soon as possible.
This is not like a DMA transfer from HDD/Ethernet(SPEED) to IOP and then piping the data by DMA to EE, where if we start the two DMA transfers roughly at the same time, they will interleave and we get speeds of the whole transfer of up to 51MB/s.

From your previous posts:
Maximus32 said:
Imagine instead of debug messages this is the FMV sound. Then sound would not be able to play, or the FMV would stutter. Even though there is enough raw speed, there's not enough CPU time left to play it.

I think the two reasons I didn't want to use DMA were not wanting to affect SIO2MAN and that the bit-order reversing was taking too long.
I see now what you mean - I was focusing at completing the transfers as soon as possible, and in my tests, I only considered doing the bits order reversing on a word at a time, the speed of which was comparable to the SIO2 transfer speeds, which is why I thought that reversing each word as soon after it is received as possible, would result in the best speed possible. And then, there was even time to spare, which is why I added CRC calculation as well. :D But it seems I was looking from the wrong side.
Indeed it seems that using registers with the masks pre-loaded, as you suggest, actually saves a lot of time, so using DMA will make sense.

BTW, what you said about the IOP stalling until the transfer completes is true, but it is assumed that no software would make big data transfer requests.
If the SD driver uses DMA, and does not keep interrupts disabled, how would anything else, but an interrupt handler run at that time? No other thread will be able to interrupt the SD card reading thread, unless it sleeps while waiting for the data to be fetched by the SD card, so maybe this would have to be implemented too (if the SD card takes too long to find and start outputting the data).
Even without the SD driver thread sleeping, interrupt-driven functions of the IOP will function, so stuff like SBUS transfers will be able to run... maybe even sifCmd and RPC(?) (which will be a great improvement).

So it seems to me that the best way would be to have both variants of the code in the driver, optimized as much as possible, and the decision which one to use would be in the hands of the user of the IRX, with the DMA variant being the default, as it should be more compatible with the IOP execution model.

What I'm proposing is that reading 1 device, and writing another device can be done simultaneously, like so:
Your assumption will only work if:
- the SD card driver bit-reversal code is fast enough (which you already proved it can be to some extent)
- whatever is starting the transfer will be able to interleave the transfers (the current block reading from SD card and the previous block writing to the target device)... maybe this would work if both transfers are interrupt-driven and are started from the EE...? But I really find it hard to imagine how this would work, if done from the IOP or EE RPC.

To sum up - using DMA will probably not result in any speed improvement for short reads (which is the case with PIO too anyway), but it will enable the IOP to do some of its other tasks, which will make it more usable. Still, I don't know how this will affect the operation of programs like OPL.

EDIT: @Maximus32 Do you think it would be a good idea to extend the sectorCount field of your BDM to u64 or two u32 fields, because in theory, an SD card CSD v2 SDXC can be more than 1TB (2TB max) and SDUC can reach 128TB... (which seems a bit unimaginable, but the specs support it).
 
I corrected my old code to use the initialization you suggested (as well as the dummy byte, which I saw that according to the specs is necessary before the clock is removed) from: https://github.com/Krasutski/sdcard_spi_driver/blob/master/spi_sdcard_driver.c
In reality, the cards work even if the clock is paused before a dummy byte is given, but it should be safer to have it, so I included it. And that should also make it so that the next sent command executes correctly.

I looked at my sector transfer code... making it use DMA optimally will be a challenge.
We don't consider a read of a single sector, as that is slow.
A read of multiple sectors consists of:
- A: CMD18 with R1 response
- B: At this point the card fetches the data from flash and prepares it for sending. 0xFF is returned by the card through this time. Once done, a readDataToken is sent rather than the 0xFF.
- C: The card sends 0x200 bytes - one sector, followed by two CRC bytes.
- Then repeats from step B for the next sector.
- D: CMD12 is sent, which terminates the transfer.
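
Steps B and C above can be sketched as a small parser over an already captured byte stream. This is only an illustration under stated assumptions: the bytes are taken to be in normal bit order already, the data token is the standard 0xFE start-block token, the CRC is skipped rather than checked, and the function name is made up. It does not touch the SIO2 at all.

```c
#include <stddef.h>
#include <stdint.h>

#define SD_SECTOR_SIZE 0x200
#define SD_DATA_TOKEN  0xFE   /* start-block token for CMD17/CMD18 reads */

/* Hypothetical helper: walks a captured card-to-host byte stream and copies
 * out up to max_sectors sectors. Returns the number of sectors found. */
size_t sd_parse_read_stream(const uint8_t *rx, size_t rx_len,
                            uint8_t *out, size_t max_sectors)
{
    size_t pos = 0, sectors = 0;

    while (sectors < max_sectors) {
        /* Step B: skip the 0xFF bytes sent while the card fetches the data. */
        while (pos < rx_len && rx[pos] == 0xFF)
            pos++;
        if (pos >= rx_len || rx[pos] != SD_DATA_TOKEN)
            break;                      /* stream ended or unexpected byte */
        pos++;

        /* Step C: 0x200 data bytes followed by two CRC bytes. */
        if (rx_len - pos < SD_SECTOR_SIZE + 2)
            break;
        for (size_t i = 0; i < SD_SECTOR_SIZE; i++)
            out[sectors * SD_SECTOR_SIZE + i] = rx[pos + i];
        pos += SD_SECTOR_SIZE + 2;      /* CRC is skipped, not verified */
        sectors++;
    }
    return sectors;
}
```

The point of the sketch is that the number of 0xFF bytes before each token is not known in advance, which is what makes a single long DMA transfer over several sectors awkward.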

Point A:
Sending a command is slow in general, because of the time necessary to set-up the hardware and also the method of waiting for a response: A response can come at any point within a fixed timeout period (i.e. number of bytes read) after the command's last byte has been sent. This means that we don't know in advance how many bytes we have to transfer, to get the response. So there are two methods:
- sendSdCmd() Transfer a single byte at a time, checking each one, if it is the response byte. This is slow, because many transfers are run - one per byte - and there are 6 command bytes and commonly up to 8 bytes of delay before the 5 response bytes.
- sendSdCmdB() Transfer the maximum possible packet size = 6 + CMD_WAIT_RESP_TIMEOUT + 5. But according to the driver by Krasutski, CMD_WAIT_RESP_TIMEOUT is 100, which is a lot: the packet is then 111 bytes, only about 4.6 times smaller than a whole sector.
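
To make the size of the second method's packet concrete, here is the arithmetic as a tiny sketch (the constant and function names are only for illustration, the 6 and 5 byte lengths are the ones given above):

```c
enum { SD_CMD_LEN = 6, SD_RESP_LEN = 5 };

/* Size in bytes of one worst-case command packet: the command itself, the
 * maximum wait for the response, and the longest response. */
unsigned sd_cmd_packet_size(unsigned wait_resp_timeout)
{
    return SD_CMD_LEN + wait_resp_timeout + SD_RESP_LEN;
}
```

With a timeout of 100 bytes this gives 111 bytes per command, against 0x200 bytes for a sector.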

The first method can be considered as unsafe, because the clock gets paused between every two bytes. However it seems to work fine. So in theory, if the first method is further optimized for speed, this would be best.
One way of doing that would be to not do the initialization in sendCmd() each time. But we want to separate the SD Card code from the SIO2 code. So the solution would be to have a function that can re-run a transfer, with minimal code (and thus be faster).

Point B:
This, and the CRC bytes are part of the reason why DMA is suboptimal. At the start of period B, a transfer should be set up, which reads a single byte and checks if it is the readDataToken. One might suggest moving the DMA and SIO2 initialization code for the sector transfer before this, and this would make sense, but can't be done, because we are using SIO2 here as well.
Another idea would be to make this a separate queue element and place the sector transfer and CRC after it on the queue. This won't work because AFAIK, we have no way of re-running a single queue element... or maybe there is - through placing a null-element after it... but this again becomes very involved in the SIO2.

So in short - we can't just use DMA to transfer multiple sectors at once, because we have to wait for each one and poll the card to check when it is ready to send it.
We can't sleep the thread while waiting for the card, because we have to actively check the card for its state.
The only thing that can be done is probably releasing some cycles to other threads, but this can make the SD Card transfer too slow.

Point C:
Again requires setting-up SIO2, which is slow. We can't even do the whole sector transfer at once, because a single queue element can transfer at most 0x100 bytes - half of the sector size.
DMA would make getting the data to memory faster. But keep in mind that a single memory access under program control and a DMA access to the same memory actually take the same amount of time, except for the program execution overhead. Also, with the PIO code we actually save the time of one write to RAM and of reading the data back to reverse the bits. Using DMA will also mean that the bit-order reversing has to happen later. In practice, the DMA transfer of the next sector would first have to be started, and then the code to reverse the bits of the previous one should run. Taking into account the waiting for the card to fetch the sector data and the CRC bytes, the code will become quite complicated.

I tested without the command sending code and card polling - i.e. with only the raw PIO sector + CRC transfer code and the results are:
Code:
MX4SIO: sz 00000200  1699 kB/s
MX4SIO: sz 00000400  1787 kB/s
MX4SIO: sz 00000600  1799 kB/s
MX4SIO: sz 00000800  1762 kB/s
MX4SIO: sz 00000A00  1802 kB/s
MX4SIO: sz 00000C00  1805 kB/s
...
MX4SIO: sz 00004000  1810 kB/s
MX4SIO: sz 00004200  1808 kB/s
MX4SIO: sz 00004400  1810 kB/s
MX4SIO: sz 00004600  1811 kB/s
So 1800kB/s is the maximum with the current PIO code (no SD Card needs to be connected for the test). Maybe I'll implement a function of the driver to be able to do such a test, so we can compare the performance of the different PS2s.
With a bit more optimization, regardless of whether bit-reversing is done or not, I get MX4SIO: sz 00001800 1816 kB/s. The code is extremely short now, so there is no way to get better speed.

@Maximus32 How do you propose we let other IOP drivers use the free time while this one is neither doing DMA nor reversing the bits?
Maybe by sleeping the current thread and having the DMA interrupt or transfer completion wake it up?
BTW, if we register a DMA interrupt, that will mean that SIO2MAN can't have its own registered (which it doesn't anyway, so I guess this is OK).

EDIT 3: Actually, if it can be done only with a few DelayThread() at key places, it would be better, and won't require having to deal with interrupts, etc. What do you think about that?


Checking the code, one of the other issues with DMA is the trigger - the transferring of a DMA block from SIO2 FIFO to RAM is done at the end of each queue element. So if we use two elements, 0x100 bytes each, this is what will happen:
- SIO2 receives 0x100 bytes from SD Card - this takes a fixed, long period of time.
- DMA is triggered by the completion of the queue element and transfers the 0x100 bytes to RAM.
- The above is repeated.
So this way, because the DMA transfer starts when half the sector has been received, we get a further delay of the duration of the DMA transfer of 0x100 bytes in addition to the SIO2 SPI transfer duration. One way around this would be to cut the transfer in more blocks, and queue elements, respectively.
Surprisingly, actual tests show that cutting the transfer into blocks of less than 0x100 bytes makes it slower. Maybe because interrupting the SIO2 transfer slows it down (tests showed that accessing some SIO2 registers (or maybe the FIFO) too often slows the SIO2 down), and because this way the card has less time to fetch the data - because the (second 0x100-byte) DMA transfer takes place basically around the time the card knows to fetch the next sector.

It seems I can't get the DMA transfer rates as high as those of PIO mode. Still DMA mode will be necessary.
Some speed comparisons: https://docs.google.com/spreadsheets/d/1ZVjVafwJi0pioURmA-jm4bolBWTT1notffAecYBVcL4/edit?usp=sharing
D,BR = DMA + bit-reversing (the DMA-only test does not include bit-reversing)


EDIT 1:

@Maximus32 You did once test the USB Flash raw block-device transfer speeds, right? We can compare them with the SD Card ones, to see what kind of improvement we can expect from this driver.


EDIT 2:

Measuring the parts of the transfer. After adding the time-recording code, the speeds dropped as follows:
New card, DMA, no CRC noBitRev, but for the last-sector one:

from:
MX4SIO: sz 00004600 1733 kB/s
MX4SIO: sz 00000400 1352 kB/s
to:
MX4SIO: sz 00000200 305 kB/s
MX4SIO: sz 00000400 1111 kB/s
MX4SIO: sz 00000600 1249 kB/s
MX4SIO: sz 00000800 1333 kB/s
MX4SIO: sz 00000A00 1386 kB/s
MX4SIO: sz 00000C00 1427 kB/s
...
MX4SIO: sz 00003800 1595 kB/s
...
MX4SIO: sz 00004600 1607 kB/s


New card, DMA, no CRC noBitRev, but for the last-sector one (8-sector transfer):
Code:
0000:  loopBgn  5 sec  361369 usec  waitData 1005  dmaSetup 9  sioAndDmaRun{ bitRev 5  freeTime 269 }  crcRead 9  lastSectBR 7  toNextIter 4

0001:  loopBgn  5 sec  362677 usec  waitData 11  dmaSetup 6  sioAndDmaRun{ bitRev 3  freeTime 269 }  crcRead 7  lastSectBR 4  toNextIter 3

0002:  loopBgn  5 sec  362980 usec  waitData 10  dmaSetup 6  sioAndDmaRun{ bitRev 4  freeTime 269 }  crcRead 7  lastSectBR 4  toNextIter 3

0003:  loopBgn  5 sec  363283 usec  waitData 10  dmaSetup 6  sioAndDmaRun{ bitRev 4  freeTime 269 }  crcRead 7  lastSectBR 4  toNextIter 3

0004:  loopBgn  5 sec  363586 usec  waitData 10  dmaSetup 6  sioAndDmaRun{ bitRev 4  freeTime 269 }  crcRead 6  lastSectBR 4  toNextIter 4

0005:  loopBgn  5 sec  363889 usec  waitData 10  dmaSetup 5  sioAndDmaRun{ bitRev 4  freeTime 269 }  crcRead 7  lastSectBR 4  toNextIter 3

0006:  loopBgn  5 sec  364191 usec  waitData 10  dmaSetup 6  sioAndDmaRun{ bitRev 3  freeTime 269 }  crcRead 7  lastSectBR 4  toNextIter 3

0007:  loopBgn  5 sec  364493 usec  waitData 10  dmaSetup 6  sioAndDmaRun{ bitRev 4  freeTime 269 }  crcRead 6  lastSectBR 113

One sector (0x200 bytes) transfer (above) takes ~303us => 1650kB/s, which excludes the initial command sending overhead and the initial waiting for data ("waitData 1005" at the first entry above).
(This is perhaps also why I initially disregarded DMA as a usable mode, as the PIO raw transfer speeds (+ bit-reversal) are: sz 00004600 1811 kB/s.)
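
The ~303us per sector can be cross-checked by summing the phases of one log entry, and the kB/s figure follows from it. A short sketch, with the struct fields named after the log fields; the 1 kB = 1024 bytes convention is an inference from the numbers in the log, not something stated by the driver:

```c
/* Per-sector phase durations in microseconds, named after the log fields. */
struct sector_timing {
    unsigned waitData, dmaSetup, bitRev, freeTime, crcRead, lastSectBR, toNextIter;
};

/* Sum of all phases of one loop iteration. */
unsigned sector_total_us(const struct sector_timing *t)
{
    return t->waitData + t->dmaSetup + t->bitRev + t->freeTime +
           t->crcRead + t->lastSectBR + t->toNextIter;
}

/* Throughput in kB/s, assuming 1 kB = 1024 bytes (this matches the log). */
unsigned throughput_kBps(unsigned bytes, unsigned usec)
{
    return (unsigned)((unsigned long long)bytes * 1000000ULL / usec / 1024ULL);
}
```

Entry 0002 (10 + 6 + 4 + 269 + 7 + 4 + 3) sums to 303us, the same as the difference between its loopBgn and that of entry 0003, and 0x200 bytes in 303us is 1650 kB/s.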


loopBgn - Time of (sectors) loop beginning. All other times are in uSec
waitData - Waiting for the card to prepare the data for sending.
dmaSetup - The time duration of the code that sets-up DMA and the SIO sector transfer.
sioAndDmaRun {bitRev} - Time of the bit-reversal while sector data is being transferred to SIO2 FIFO. (No bit-rev for the above test.)
sioAndDmaRun {freeTime} - Duration after the bit-reversal to the end of the time while IOP is free, as SIO transfer and DMA haven't yet completed.
crcRead - Duration while CRC is being read.
lastSectBR - Duration of the bit-reversal of the last sector (only done for the last sector).
toNextIter - Duration to the next loop iteration.

The durations of the initial command sending and the stop-command are not measured.



New card, DMA, no CRC, with BitRev:
Code:
MX4SIO:  0000:  loopBgn  5 sec  254453 usec  waitData 1004  dmaSetup 9  sioAndDmaRun{ bitRev 4  freeTime 269 }  crcRead 10  lastSectBR 6  toNextIter 5
MX4SIO:  0001:  loopBgn  5 sec  255760 usec  waitData 12  dmaSetup 6  sioAndDmaRun{ bitRev 121  freeTime 152 }  crcRead 6  lastSectBR 5  toNextIter 4
MX4SIO:  0002:  loopBgn  5 sec  256066 usec  waitData 10  dmaSetup 5  sioAndDmaRun{ bitRev 119  freeTime 154 }  crcRead 6  lastSectBR 4  toNextIter 5
MX4SIO:  0003:  loopBgn  5 sec  256369 usec  waitData 10  dmaSetup 4  sioAndDmaRun{ bitRev 119  freeTime 154 }  crcRead 6  lastSectBR 5  toNextIter 4
MX4SIO:  0004:  loopBgn  5 sec  256671 usec  waitData 10  dmaSetup 5  sioAndDmaRun{ bitRev 119  freeTime 154 }  crcRead 6  lastSectBR 4  toNextIter 4
MX4SIO:  0005:  loopBgn  5 sec  256973 usec  waitData 10  dmaSetup 5  sioAndDmaRun{ bitRev 119  freeTime 154 }  crcRead 6  lastSectBR 4  toNextIter 5
MX4SIO:  0006:  loopBgn  5 sec  257276 usec  waitData 10  dmaSetup 5  sioAndDmaRun{ bitRev 119  freeTime 154 }  crcRead 6  lastSectBR 4  toNextIter 4
MX4SIO:  0007:  loopBgn  5 sec  257578 usec  waitData 11  dmaSetup 5  sioAndDmaRun{ bitRev 119  freeTime 153 }  crcRead 34  lastSectBR 113

It can be seen that, because the total time of the SIO transfer + DMA transfer is fixed, the sum of sioAndDmaRun: bitRev + freeTime is always the same = ~273.
The bit-reversal appears to take about half the available time, however it still makes the transfer take longer, which may be due to competing for access to the RAM with the DMAC or something else.
Also the "free time" - sioAndDmaRun, is not uniform - it would start with a SIO transfer from the SPI to FIFO, then once 0x100 bytes have been transferred, the transfer of the first DMA block will begin, which would pause program execution, and resume it after the block transfer ends. Meanwhile, it is assumed that without waiting for DMA to complete, the second queue entry transfer begins (next 0x100 bytes). However it is possible that, in order to prevent buffer overflow, the transfer is paused until the buffer is emptied enough by DMA.

All DMA tests use 32-bit bit-reversal (in bytes) code, with the mask variables assigned in advance. To be able to use a different optimization setting than the rest of the code, it can be moved to its own object file and have a different -O setting in the Makefile.
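
For reference, a 32-bit bit-reversal within bytes with pre-assigned masks can look like the sketch below. This is the common mask-and-shift technique, not necessarily the exact code used in the driver or in the function suggested earlier in the thread; the function name is made up.

```c
#include <stdint.h>

/* Reverse the bit order inside each of the four bytes of a 32-bit word.
 * The three masks are constants, so the compiler can keep them in registers
 * across a whole sector loop. */
uint32_t bitrev_bytes32(uint32_t w)
{
    const uint32_t m4 = 0x0F0F0F0Fu, m2 = 0x33333333u, m1 = 0x55555555u;

    w = ((w >> 4) & m4) | ((w & m4) << 4);  /* swap nibbles       */
    w = ((w >> 2) & m2) | ((w & m2) << 2);  /* swap bit pairs     */
    w = ((w >> 1) & m1) | ((w & m1) << 1);  /* swap adjacent bits */
    return w;
}
```

A sector is then 0x80 such words, and because the bytes stay in place, the same function works for both reads and writes.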

Unlike previous tests (without the time-recording code), now the duration of a sector transfer is 303us both with and without bit-reversing.

Adding DelayThread(1) after the bit-reversing, under sioAndDmaRun, makes the speed drop from sz 00004600 1606 kB/s to sz 00004600 1473 kB/s (bit-reversing disabled). One way to reduce the negative effect of this would be to do it once every several sector transfers.

A good way to do something useful while waiting for slow cards to fetch the data to be read, is placing a DelayThread(1); in the code below.
Fast cards usually end the loop before 4 iterations have passed, so there is no slowdown observed for them.
Code:
   i = 0;
   rdPollBytePrepare(port);
   do {
     dataToken = rdPollByteWait(0xFF, port); // returns 0xFF while the card is still fetching the data
     i++;
     if (i >= 5) DelayThread(1); // for slow cards, we can let other threads run for a bit
     if (i > 0x800) { mprtf("\n ERR: Timeout on waiting for data (token). "); break; }
   } while (dataToken == 0xFF);
   rdPollByteComplete(port);

EDIT 4: Bit-reversing: 0x200 bytes for 120us is 4170 kB/s - close to the 4500kB/s you got on a FAT PS2. Using your function gives about the same speed. Using -O3 with your function takes 82us (and there is a bit more code around it, which may be why I am not reaching the 73 us for 6765kB/s you achieved). However, using -O3 for the whole IRX drops the total speed from sz 00004600 1605 kB/s to 1583 kB/s (so the objects better be separated maybe). With -O2 the bit-reversal takes 80us and the speed is 1600 kB/s. Or I can copy the optimized asm from -O3 to an inline func.

EDIT 5: The PIO functions can also have DelayThread added to the code that waits for the card to get the data or to write it to flash.

EDIT 6: I don't remember why I thought that the driver wasn't working on PPC-IOP models. I tested the new driver and it works fine, just slow - 1320 kB/s is the max PIO speed for the fastest card I have, while 1180 kB/s is the fastest DMA speed. I think the reason is that MC slot 2 (port 3) is used, which has a bit-rate override by DECKARD. Which is why last time I was testing with PPC-IOP patching code, trying to remove the override, and maybe that is where some incompatibility happened. In theory, testing at MC slot 1 should solve the problem.
One other thing I changed, was that now -O2 is always enabled, so maybe this changed the compatibility, but I doubt it.

EDIT 7: Tests on SCPH-79000 PPC-IOP.
DECKARD has several overrides for settings to the SIO2 made from the IOP, including such that write to registers (0xBF80825C) when other registers have been written.
The maximum transfer clock is capped at 24MHz (=48MHz/2) and values 0 and 1 result in the same clock, unlike on SCPH-30000, where they result in malformed data and odd logic reactions. Lower values work as expected (value 3 -> 48/3=16MHz).
The minimum inter-byte duration appears to be 352 ns, which is a LOT, and also most likely the reason why transfer speeds on the PPC-IOP are slower. It is 176 ns on the MIPS-IOP. This is probably because 0xBF80825C bits 23:16 is 0 on the MIPS IOP, while DECKARD forces it to 3, and according to my notes:
Code:
Total number of SCK-cycles between bytes = 2 + max(pCtrl1.23:16, 2); So the minimal period is 2+2=4 [SCK-cycles]. Specifying values 0,1 or 2 has the same effect - 4 [SCK-cycles], while a value of 3 results in 5 [SCK-cycles]. This effect is present regardless the divisor value (tested only with values 4-6).
In theory, because the PPC-IOP runs on a higher frequency than the MIPS-IOP, the SIO2 should be able to run at higher frequency as well, but this does not seem to be the case (although the SIO2 uses the 48MHz clock, so perhaps that is why). It might be that this additional delay was necessary due to synchronization of the SIO2 FIFO with the faster PPC-IOP clock.

EDIT 8:
The PPC-IOP SIO2 has an additional inter-byte delay of 244ns, and the smallest number of cycles that can be added to that is 3, which results in the ~340ns inter-byte period measured above.
clockSpeedDiv = 2 (24MHz), interBytePeriod = 0xBF80825C.23:16:
9 = 588 ns
8 = 548 ns
7 = 508 ns
6 = 464 ns
5 = 424 ns
4 = 384 ns
3 = 340 ns (minimum)
2,1,0 = 340 ns

period = 244 + clkCycleDuration * (interBytePeriod[cy] -0.5)

0xBF80825C.23:16 = 4 -> 3.5cy:
244 + 3.5cy * 40ns(/2 24MHz) = 244 + 140 = 384ns
244 + 3.5cy * 63ns(/3 16MHz) = 244 + 220 = 464ns
244 + 3.5cy * 84ns(/4 12MHz) = 244 + 294 = 538ns
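
The formula above can be written as a small function to check it against the measured table. This is a sketch of the empirical fit only: the clamping of values below 3 comes from the table, the SCK period is passed in as used in the examples (40ns for 24MHz is a rounded figure, the exact period is about 41.7ns), and the function name is made up.

```c
/* Inter-byte period on the PPC-IOP in ns, per the empirical formula
 * period = 244 + clkCycleDuration * (interBytePeriod - 0.5).
 * Field values 0, 1 and 2 behave like 3, as seen in the measurements. */
double ppc_interbyte_ns(double sck_period_ns, unsigned cycles_field)
{
    unsigned n = cycles_field < 3 ? 3 : cycles_field;
    return 244.0 + sck_period_ns * (n - 0.5);
}
```

For a field value of 4 this reproduces the 384ns, ~464ns and 538ns examples, and for the minimum it gives 344ns, close to the measured 340ns.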

I removed the DECKARD code that overrides register values, and it appears that any lower values still result in the durations corresponding to the overridden values. So there is nothing to gain from patching DECKARD, and the fastest mode has an inter-byte delay of 352 ns (as opposed to 176 ns on a MIPS-IOP). Maybe this is due to additional synchronization between the SIO2 FIFO and the SIO2 SPI shift register and/or the new PPC-IOP core, running at higher speed, while the SIO2 is clocked by the 48MHz clock.
This means 40ns (24MHz) * 8 bits = 320ns; 320 + 352 = 672ns per byte, which means 1453kB/s is the maximum bandwidth of the SIO2 on a PPC-IOP, while on the MIPS-IOP it is 1969kB/s.
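
The same bandwidth calculation as a sketch, so other bit times or inter-byte delays can be tried. It assumes 1 kB = 1024 bytes and rounds to the nearest kB/s, which reproduces both figures above; the function name is made up.

```c
#include <stdint.h>

/* Maximum SIO2 throughput in kB/s (1 kB = 1024 bytes, rounded to nearest)
 * for a given SCK bit time and inter-byte delay, both in nanoseconds. */
uint32_t sio2_max_kBps(uint32_t bit_ns, uint32_t interbyte_ns)
{
    uint32_t byte_ns = 8u * bit_ns + interbyte_ns;          /* time per byte */
    uint32_t bytes_per_s = UINT32_C(1000000000) / byte_ns;
    return (bytes_per_s + 512u) / 1024u;
}
```

sio2_max_kBps(40, 352) gives 1453 for the PPC-IOP and sio2_max_kBps(40, 176) gives 1969 for the MIPS-IOP.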

(Of course DECKARD patching can still be of use for projects that emulate another device using an SD Card.)

EDIT 9: "Speed testing storage devices" thread from psx-scene: https://ia802907.us.archive.org/11/items/psx-scene-processed-archive/part0293.html#T157395P1
For USB and other devices' speeds.

EDIT 10: I don't know how, but now I am getting speeds closer (basically the same) for DMA as for PIO. Maybe I had forgotten to remove DelayThread from the polling code or maybe it is some timing parameters I changed.
Note that the speed tests in EDIT 6 are for the fastest card. For the slowest card, the DMA speed on MIPS-IOP is ~ 1100kB/s.
Above only read speeds are measured. Writes to flash generally take more time, which may not matter much though, when more than one sector is written.

EDIT 11: @Maximus32 How do I handle the SD Card detection (insertion)? For removal, I'd have to detect the card's presence on each operation, and if that fails, unregister the block device. But how to detect insertion? It would have to be a thread, with an infinite loop and a DelayThread(5000000) inside, but should this be in the driver, or should the BDM or the user driver call that to check for new cards?
 
Oh my.... you've written a book!

There's so much good stuff in there. I cannot reply to everything, but I'll keep this post with me during my tests.

EDIT: @Maximus32 Do you think it would be a good idea to extend the sectorCount field of your BDM to u64 or two u32 fields, because in theory, an SD card CSD v2 SDXC can be more than 1TB (2TB max) and SDUC can reach 128TB... (which seems a bit unimaginable, but the specs support it).
It can be updated in the future if needed, but for now I don't see the need.
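
For reference, the limit of a 32-bit sector count is easy to check. A sketch assuming 512-byte sectors (the function and macro names are made up): a u32 count covers cards up to just under 2 TiB, so it only becomes a problem with SDUC-sized cards.

```c
#include <stdint.h>

#define SECTOR_SIZE 512u

/* Returns 1 if a card of the given capacity can be described by a 32-bit
 * sector count with 512-byte sectors, 0 otherwise. */
int fits_u32_sector_count(uint64_t capacity_bytes)
{
    return capacity_bytes / SECTOR_SIZE <= UINT32_MAX;
}
```

A 1 TiB card fits; a 128 TiB card would need a 38-bit count.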

@Maximus32 How do you propose we let other IOP drivers use the free time while this one is neither doing DMA nor reversing the bits?
Maybe by sleeping the current thread and having the DMA interrupt or transfer completion wake it up?
BTW, if we register a DMA interrupt, that will mean that SIO2MAN can't have its own registered (which it doesn't anyway, so I guess this is OK).
By waiting for a semaphore/mutex/event triggered from the DMA completion interrupt.
If there are a lot of transfers to handle (like reading 32KiB) then the DMA completion interrupt can start the next DMA transfer and reverse the bytes. Only when the entire 32KiB is received and reversed will it wake up the sleeping thread. So only 1 thread switch per transfer would be needed.
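
The idea can be sketched with a host-side model of the handler. Everything here is hypothetical (struct, names, and the counter that stands in for signalling a semaphore from the interrupt handler); it also leaves out the per-sector wait for the data token, which a real handler would still have to deal with.

```c
#include <stddef.h>

/* Hypothetical transfer state shared between the thread and the handler. */
struct sd_xfer {
    size_t sectors_left;   /* sectors still to be received */
    unsigned wakeups;      /* times the waiting thread was signalled */
};

/* Model of a DMA-completion handler, called once per finished sector.
 * Returns 1 if another sector transfer was started, 0 if all is done. */
int sd_dma_done(struct sd_xfer *x)
{
    /* (the sector that just arrived would be bit-reversed here) */
    if (x->sectors_left > 0)
        x->sectors_left--;
    if (x->sectors_left > 0)
        return 1;          /* start the next DMA transfer; thread keeps sleeping */
    x->wakeups++;          /* stands in for signalling the semaphore */
    return 0;
}
```

For a 32KiB read (64 sectors) the handler runs 64 times, but the waiting thread is woken only once.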

@Maximus32 You did once test the USB Flash raw block-device transfer speeds, right? We can compare them with the SD Card ones, to see what kind of improvement we can expect from this driver.
Yes, I think you found the results already (between 800KB/s and 1100KB/s if I'm not mistaken). But I'm guessing the CPU load is a lot less when using USB. So I'm not sure what the end result for a running game will be.

EDIT 11: @Maximus32 How do I handle the SD Card detection (insertion)? For removal, I'd have to detect the card's presence on each operation and, if that fails, unregister the block device. But how to detect insertion? It would have to be a thread with an infinite loop and a DelayThread(5000000) inside, but should this be in the driver, or should the BDM or the user driver call that to check for new cards?
It has to be done from the driver. The driver will call bdm_connect_bd and bdm_disconnect_bd. There's multiple ways to do something periodically on the IOP. A thread with an endless while loop calling DelayThread is a possibility. Using SetAlarm would also be possible.
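The polling idea can be sketched as a small state machine that is independent of the IOP kernel. Everything named here (card_slot, card_poll_step, the event codes) is hypothetical; a real driver would call bdm_connect_bd / bdm_disconnect_bd where the events are returned, and would get the `present` value from probing the card.

```c
#include <stdbool.h>

/* Hypothetical event codes for this sketch only. */
enum card_event { CARD_NO_CHANGE, CARD_INSERTED, CARD_REMOVED };

struct card_slot {
    bool registered;   /* true while the block device is registered with BDM */
};

/* One polling step: compare the probed presence with what is registered. */
enum card_event card_poll_step(struct card_slot *slot, bool present)
{
    if (present && !slot->registered) {
        slot->registered = true;     /* driver would connect the device here */
        return CARD_INSERTED;
    }
    if (!present && slot->registered) {
        slot->registered = false;    /* driver would disconnect it here */
        return CARD_REMOVED;
    }
    return CARD_NO_CHANGE;
}
```

The detection thread would then just loop over the slots, call this once per slot, and sleep with DelayThread (or be re-armed through SetAlarm) between rounds.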

---

I've made a little drawing of what I think a 4KiB PIO or DMA transfer can look like in time and CPU load. Also an interesting thing to check is the block size. This is 512B by default, but can be increased to 2048B maximum. This will increase performance a little for both PIO and DMA I think.

sio2sd_transfer.png


First the command request has to be sent, then when the SD card is processing the request, we can sleep the IOP for a small time as you also noted. The sleep time needs to be just a little shorter than the time the SD card needs. I don't think it will make much sense to sleep in between the 512B blocks. This is applicable for both PIO and DMA I think.

Then when we wake up shortly before the data is ready, we have to wait for the data to become ready. Both modes need to do this.

Once the data is ready, we can start the PIO data transfer, or the DMA data transfer. For DMA data transfer, the DMA completion interrupt will have to:
1 - wait for the next data block (just as PIO mode does)
2 - start the next DMA transfer
3 - invert the bytes
4 - (only at end of entire transfer) signal user task of completion
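The four steps above can be sketched as a plain-C model of the completion handler. This is only a simulation under assumed names (dma_xfer, dma_xfer_begin, dma_xfer_on_complete); the real hardware accesses and the event flag are replaced by counters and a boolean.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical transfer state shared by the ISR and the waiting thread. */
struct dma_xfer {
    uint32_t sectors_total;   /* sectors requested by the user task */
    uint32_t sectors_done;    /* sectors fully received so far */
    uint32_t dma_started;     /* DMA block transfers kicked off so far */
    bool     user_signalled;  /* stands in for setting an event flag */
};

/* Begin a transfer: the thread starts the first DMA block itself. */
void dma_xfer_begin(struct dma_xfer *x, uint32_t sectors)
{
    x->sectors_total = sectors;
    x->sectors_done = 0;
    x->dma_started = 1;
    x->user_signalled = false;
}

/* Body of the completion interrupt for one received sector. */
void dma_xfer_on_complete(struct dma_xfer *x)
{
    x->sectors_done++;
    if (x->sectors_done < x->sectors_total) {
        /* Steps 1 and 2: wait for the next data block, start its DMA. */
        x->dma_started++;
    } else {
        /* Step 4: only the final completion wakes the user task. */
        x->user_signalled = true;
    }
    /* Step 3, inverting the bytes of the finished sector, would run here. */
}
```

For a 32KiB read (64 sectors) the handler runs 64 times but signals the thread only once, which is where the "1 thread switch per transfer" comes from.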
 
Would something like "adaptive reads" be possible?

A disc-sector is 2048Bytes anyway and usually they/games seem to do 2-8 sector (hence 4-16KB) reads/requests anyway, except for "synchronous reads"(?), so if it "let out" some latencies for those multi-sector-reads, it might speed up/increase bandwidth and decrease cpu-load in certain situations!
 
Also an interesting thing to check is the block size. This is 512B by default, but can be increased to 2048B maximum.
I am not sure SD Cards support 2048-byte blocks (or maybe don't in SPI mode?):
"The maximum block length is given by 512 Bytes regardless of READ_BL_LEN, defined in the CSD."
"SDHC and SDXC Cards only support 512-byte block length."
And even if 2048-byte blocks are supported, dealing with alignment issues (not aligned to 2048-bytes access to the card) and buffers being too small may be a problem.

By waiting for a semaphore/mutex/event triggered from the DMA completion interrupt.
If there are a lot of transfers to handle (like reading 32KiB), then the DMA completion interrupt can start the next DMA transfer and reverse the bytes. Only once the entire 32KiB is received and reversed will it wake up the sleeping thread. So only 1 thread switch per transfer would be needed.
Initially my thought was to avoid using interrupts, even though SIO2MAN does not use the SIO2 DMA interrupt(s - RX,TX), and instead just put a DelayThread(1) before the polling for DMA completion (or inside it). Note that the SIO2 transfer will always take the same amount of time, so a DelayThread() with a fixed value will do a decent job. However my tests show that this makes the transfers *much* slower, even when the argument of DelayThread is 1us. (And I did measure the time duration of the polling of the DMA completion and that was >100us, so it seems that DelayThread takes a long time even with very low values.)
One other way is like you suggested - the interrupt waking-up the thread, but that would still be too slow I think.
Which only leaves your initial idea - to start the next transfer from the interrupt. There is nothing wrong with that, but the code is not the best one to split between two functions...

How should the driver deal with a buffer that is not 4-bytes aligned? Should it support such buffers at all? I really hope that is not necessary, or we would need a local buffer and a memcpy, for the unaligned cases.

Would something like "adaptive reads" be possible?

A disc-sector is 2048Bytes anyway and usually they/games seem to do 2-8 sector (hence 4-16KB) reads/requests anyway, except for "synchronous reads"(?), so if it "let out" some latencies for those multi-sector-reads, it might speed up/increase bandwidth and decrease cpu-load in certain situations!
I don't know of a way to make reading any faster, regardless of how the code is written and how much data is read at a time. The SD card simply can't transfer more than 512 bytes at a time, and even if it could, the SIO2 can't do more than 256 at a time.
The SD cards I tested already make the delays between sector transfers very low (only the initial delay is long - I have DelayThread() there to let other threads run for a bit). The card always prepares the next sector when more than one is requested, so that part is dealt with by the card (multi-sector transfer is always used, except for single-sector writes, where the single-sector command is faster).
 
Naaah, tl;dr I don't mean it on the lowest level, but what BDM-test does! ;)

If those request-chunks could be adaptive depending on the games read-request, it should decrease some cpu-load while increasing bandwidth.
 
How should the driver deal with a buffer that is not 4-bytes aligned? Should it support such buffers at all? I really hope that is not necessary, or we would need a local buffer and a memcpy, for the unaligned cases.
The only 2 users of all block device drivers right now:
1) the FAT32 driver. It uses an intermediate cache to hopefully make reads faster. This cache should be aligned.
2) OPL ingame. These buffers should be aligned too.

If alignment is needed then I think it would be best to put that into BDM. But since the number of users is small and the performance impact is big, I think it's best to make sure buffers are aligned. We probably do need to check this and give an error message if they are not.
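A check like the one suggested could be as small as the sketch below. The function name is made up, and whether the check would live in BDM or in the driver is left open.

```c
#include <stdint.h>

/* Returns 1 if the buffer address is a multiple of 4 bytes, 0 otherwise. */
int buffer_is_word_aligned(const void *buf)
{
    return ((uintptr_t)buf & 3u) == 0;
}
```

A read or write entry point could call this first and return an error for unaligned buffers, instead of silently falling back to a local buffer and a memcpy.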

Note about DelayThread(1) from the manual:
"Although the suspend time can be specified in microseconds, if the value is less than 100 microseconds, it
will be rounded up to 100 microseconds."
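That rounding explains the slowdown seen with DelayThread(1). Here is a rough estimate, assuming one delay per 512-byte sector and that the delay costs exactly the 100 microsecond minimum (in practice it is probably more). Both helper names are hypothetical.

```c
#include <stdint.h>

#define DELAY_MIN_US 100u   /* minimum suspend time quoted from the manual */

/* Suspend time actually applied for a requested value in microseconds. */
uint32_t effective_delay_us(uint32_t requested_us)
{
    return requested_us < DELAY_MIN_US ? DELAY_MIN_US : requested_us;
}

/* Total time added when one delay is inserted per 512-byte sector. */
uint32_t polling_delay_overhead_us(uint32_t total_bytes, uint32_t requested_us)
{
    return (total_bytes / 512u) * effective_delay_us(requested_us);
}
```

For a 1024KiB read that is 2048 sectors, so even DelayThread(1) adds at least about 205ms to the transfer.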

Naaah, tl;dr I don't mean it on the lowest level, but what BDM-test does! ;)

If those request-chunks could be adaptive depending on the games read-request, it should decrease some cpu-load while increasing bandwidth.
The games don't request single sectors, but any number of (2KiB) sectors. In that sense the current implementation in OPL-ingame is already adaptive. Read requests I've seen mostly range between 8KiB and 32KiB.
 
Oh, great!
Yes, I meant on an abstraction, not on the driver/lowest level!

I told wisi via PM (Discord) and he first thought I meant on the lowest level, but agreed it should be possible on a higher level.

Great to know that OPL's code is already adaptive.


Edit: Oh and I mentioned the disc-sectors (2048Bytes) vs. Storage/SD-Sectors (512Bytes) for that reason. We never need to go lower than 2048Bytes and that should only be needed for the synchronous reads AFAIK!
On asynchronous reads, it reads at least 2 sectors AFAIK, so that's what I am hinting at.
I think especially games reading asynchronously, MIGHT yield a bit better performance with some optimizations.
 
New DMA/interrupt based driver here:
https://gitlab.com/ps2max/ps2sdk/-/blob/master-ps2max/iop/memorycard/sio2sd_bd/src/sdCard.c

FAT PS2 results:
Code:
IOP Read/Write speed test
-------------------------
Start reading 'sdc0p0' block device:
Read 1024KiB in 1832ms, blocksize=512, speed=572KB/s
Read 1024KiB in 1219ms, blocksize=1024, speed=860KB/s
Read 1024KiB in 917ms, blocksize=2048, speed=1143KB/s
Read 1024KiB in 764ms, blocksize=4096, speed=1372KB/s
Read 1024KiB in 689ms, blocksize=8192, speed=1521KB/s
Read 1024KiB in 643ms, blocksize=16384, speed=1630KB/s
Read 1024KiB in 624ms, blocksize=32768, speed=1680KB/s
Read 1024KiB in 614ms, blocksize=65536, speed=1707KB/s
Read 1024KiB in 608ms, blocksize=131072, speed=1724KB/s

slim PS2 results:
Code:
IOP Read/Write speed test
-------------------------
Start reading 'sdc0p0' block device:
Read 1024KiB in 1965ms, blocksize=512, speed=533KB/s
Read 1024KiB in 1386ms, blocksize=1024, speed=756KB/s
Read 1024KiB in 1101ms, blocksize=2048, speed=952KB/s
Read 1024KiB in 959ms, blocksize=4096, speed=1093KB/s
Read 1024KiB in 888ms, blocksize=8192, speed=1180KB/s
Read 1024KiB in 846ms, blocksize=16384, speed=1239KB/s
Read 1024KiB in 827ms, blocksize=32768, speed=1267KB/s
Read 1024KiB in 819ms, blocksize=65536, speed=1280KB/s
Read 1024KiB in 814ms, blocksize=131072, speed=1288KB/s
Compared to the PIO version it's slightly slower:
2K blocks: 1254KB/s -> 1143KB/s
128K blocks: 1795KB/s -> 1724KB/s

But since there's a low cpu load the IOP seems more responsive. For instance the debug messages appear line-by-line, instead of at the end of the entire test run.
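For reading the tables: the speed column matches the byte count divided by the elapsed milliseconds, truncated, so "KB" here appears to mean 1000 bytes. A one-line sketch of that calculation follows. The formula is inferred from the printed numbers, not taken from the test tool's source, and the function name is made up.

```c
#include <stdint.h>

/* Throughput in KB/s (1 KB = 1000 bytes), i.e. bytes per millisecond. */
uint32_t speed_kb_per_s(uint32_t bytes, uint32_t elapsed_ms)
{
    return bytes / elapsed_ms;   /* integer division truncates */
}
```

For example, 1024KiB in 1832ms gives 572KB/s and 1024KiB in 608ms gives 1724KB/s, the same as the first and last rows of the fat PS2 table.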
 
@Maximus32 I probably asked you this before, but forgot: What name should the SD Card driver use (give to the BDM) (currently "sdc")?
When the same name is registered twice, it is just registered at a different number under the same name, right? Do I need to set a number somehow or not (in the case two or more cards are inserted).
With the current driver up to four will be supported - one in each controller and MC slot.

BTW, I am currently rewriting a lot of the driver, separating its parts to files and will add the DMA changes you suggested, as well as SD and other cards and controllers detection code (I hope that will work).
@Takeshi I also plan on adding a detection for the MX4SIO device MCU - through a particular command, which, if detected, will enable extended features of the driver (namely a very simple serial debugging interface, and anything else one can think of).

EDIT: @Maximus32 Just noticed your new driver and test and they are great! :)
Though I don't know if there will be much need for my code now, that you wrote it...
Anyway, I'll try to complete mine, and then see if there is anything useful from it.
And yeah, the DMA does work great! I am actually surprised you got that speed with event flags, though the case is actually that interrupts are used and they are quite a lot faster than thread-switching. I don't know why PIO is slower, but this is better, as we won't need to choose between DMA and PIO anymore.
I also moved the initialization (setting the port-register) from the transfer functions to the beginning, so that the code there is reduced.

As for the way debug messages print, I don't know if that is necessarily related to the driver, but maybe it is, because at random times they print all at once, when I tested.

OK, I see how you got such speed! You actually added the CRC PIO queue element(cmd) after the DMA elements. This is an interesting idea certainly... I didn't think it is safe, so I didn't do it. The buffer is 0x100 bytes and I wasn't sure if a PIO transfer after DMA won't overrun it.
I think you forgot to read the two bytes after the very last DMA transfer, which is why you are having problems with command 12. ;)

EDIT 2:
I am also considering using the EventFlag for exclusive access, rather than the semaphore, so that we don't use a semaphore as well. What do you think?
 
From bdm.h:
Code:
    // Device name + number + partition number
    // Can be used to create device names like:
    //  - mass0p1
    char* name;
    unsigned int devNr;
    unsigned int parNr;

name = the name of the driver/device-type. sdc was for SDCard.
devNr = the device number, 0 = first SD card, 1 = second SD card, etc...
parNr = partition number, familiar to Linux users, 0 = the raw device itself. This is the only thing the driver will ever need. 1 = first partition, created by for instance the FAT32 driver.

So four SD cards will be: sdc0p0, sdc1p0, sdc2p0 and sdc3p0
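The naming scheme amounts to a single format string. This sketch uses a made-up helper name (bd_format_name) and is not the code BDM uses; it only shows how the three fields combine.

```c
#include <stdio.h>

/* Build "<name><devNr>p<parNr>" into out; returns the snprintf result. */
int bd_format_name(char *out, size_t out_size, const char *name,
                   unsigned int devNr, unsigned int parNr)
{
    return snprintf(out, out_size, "%s%up%u", name, devNr, parNr);
}
```

With name "sdc", devNr 0 and parNr 0 this produces sdc0p0; with name "mass", devNr 0 and parNr 1 it produces mass0p1, as in the header comment.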

There's still lots to be done in the driver, some things that come to mind:
- Fix initialization from krasutski's code (card size is not detected). If there's really a bug in krasutski's code we can send a PR to his git repository.
- Move byte reversing code to low-priority user task. The current processing from ISR is a hack (but works for now). Perhaps move it to an assembly function so we're not relying on compiler flags for this speed-essential function.
- Multiple cards as you're working on
- Card detection as you're working on
- Writing (not tested yet)
- Compatibility with sio2man from ps2sdk. It does use the same interrupt I'm using now, so that's going to be a problem. Since it's based on rom0:XSIO2MAN it probably also uses the interrupt.

Please do complete yours, and share the code. We can learn from each other's code and get the best end result ;). I also don't have much time to do anything.

About the name... what's MX4SIO? Is it the new MCU version or just the new name for SIO2SD? If there's an MCU version I would really like it to do byte reversing!

And yeah, the DMA does work great! I am actually surprised you got that speed with event flags, though the case is actually that interrupts are used and they are quite a lot faster than thread-switching. I don't know why PIO is slower, but this is better, as we won't need to choose between DMA and PIO anymore.
The event flag is used only 1x per transfer, the rest is handled using interrupts. The PIO code is still faster by the way, but I think the only difference in speed is caused by the byte reversing of the last block (as seen in the drawing a few posts back).

I also moved the initialization (setting the port-register) from the transfer functions to the beginning, so that the code there is reduced.
Yes, but I'm not sure if this optimization will work in combination with sio2man. Perhaps we still need to initialize these every time we lock out sio2man.

OK, I see how you got such speed! You actually added the CRC PIO queue element(cmd) after the DMA elements. This is an interesting idea certainly... I didn't think it is safe, so I didn't do it. The buffer is 0x100 bytes and I wasn't sure if a PIO transfer after DMA won't overrun it.
lol, I thought I copied it from your PIO example, but yes it's a little different ;).

I think you forgot to read the two bytes after the very last DMA transfer, which is why you are having problems with command 12. ;)
What 2 bytes? The 2 CRC bytes I read in the ISR, are there another 2 bytes after the last transfer? I actually copied the CMD12 hack from your example, because I couldn't figure out why the multi block read wouldn't work.

I am also considering using the EventFlag for exclusive access, rather than the semaphore, so that we don't use a semaphore as well. What do you think?
I don't think it will make a difference in speed. So I would choose whatever makes the code best readable.
 
- Fix initialization from krasutski's code (card size is not detected). If there's really a bug in krasutski's code we can send a PR to his git repository.
I don't like his code - he uses a LOT of shifting to get the CSD register. It can be done with a hacked-up struct and only a bit of shifting, only for the fields we need, while leaving the rest still accessible, should somebody need them.
https://gitlab.com/ps2max/ps2sdk/-/...rycard/sio2sd_bd/src/spi_sdcard_driver.c#L338

- Move byte reversing code to low-priority user task. The current processing from ISR is a hack (but works for now). Perhaps move it to an assembly function so we're not relying on compiler flags for this speed-essential function.
Hmm... I am not sure I quite agree. But I guess you base this on the fact that the user-code (BDM) requested a big buffer of data (rather than having to reverse each sector immediately), and we can do the bit-reversing even later. Still I think that doing it in the interrupt handler is better. It should not be making DMA any slower. Why do you think it should be done by a lower-priority task?


- Compatibility with sio2man from ps2sdk. It does use the same interrupt I'm using now, so that's going to be a problem. Since it's based on rom0:XSIO2MAN it probably also uses the interrupt.
Actually I think I have that partially. First we are in luck, as AFAIK, SIO2MAN does not use the DMA interrupts but only the SIO2 interrupt. So as long as we use only the DMA interrupts we will be just fine.
As for compatibility with SIO2MAN running together with it, there are several options:
- Use SIO2MAN's functions to prevent other users from using it. But this requires us knowing if it is SIO2MAN or XSIO2MAN.
- Use some semi-hacky save/restore SIO2 state code, which I wrote sometime ago, still untested (should be decent if it works).
- Don't use any exports of SIO2MAN and look for the module (if loaded on IOP) and hook its event flag and use it to prevent other users from using it while we do. Again not the best solution, but at least it does not care whether it is SIO2MAN or XSIO2MAN.

It is questionable what we will be doing with writing. The DMA interrupt for writing will get triggered too early, so even though we can even go as far as to have all sectors be sent by DMA at the very start (in a single transfer) (the SIO2 HW will request each one when it reaches its corresponding queue element), this will not give information about where each sector ends. Maybe 7 (=14/2 - full queue) sectors can be sent like that, and then we can poll SIO2 completion. This will make it slower but not by much. Another way would be cutting the transfer to more blocks, but this was found to make it slower.
Or of course we could register our own interrupt handler to the SIO2 and then somehow restore the old one... Maybe this will be the only decent solution.


About the name... what's MX4SIO? Is it the new MCU version or just the new name for SIO2SD? If there's an MCU version I would really like it to do byte reversing!
I don't think the MCU will be able to keep the same speed if it does the reversing.
My goal is to have a driver that works fine without an MCU, but also to support a limited set of MCU features - like the MCU can filter whether commands are for the SD Card or for an MC and not pass them to the SD Card then.
Tests with a more powerful MCU should also be done, but there will only be a point in such one if it can do MG stuff, which is a bit doubtful (if it can do them fast enough).
So for now the MCU is assumed to be a slow one, to exist or not (the driver should support both cases) and to have the bonus feature of a simple serial port interface, so that people can debug the IOP through the MC connector (with or without an external UART-USB adapter - to be decided).

As for the name, well, MaXimus32 4SIO :D I forgot what it meant. TnA and Takeshi and I kind of agreed it sounded best (as pronunciation) and decided on it (I hope I am not mistaken).
I just decided to use that name for the driver, as it is more unique.


The event flag is used only 1x per transfer, the rest is handled using interrupts. The PIO code is still faster by the way, but I think the only difference in speed is caused by the byte reversing of the last block (as seen in the drawing a few posts back).
That makes sense. Still great that we have decent speed DMA. I am so glad I don't have to leave it to the user to pick the lesser evil. :)

Yes, but I'm not sure if this optimization will work in combination with sio2man. Perhaps we still need to initialize these every time we lock out sio2man.
Yes. I am still initializing them, just not so often.


lol, I thought I copied it from your PIO example, but yes it's a little different ;).
I think I was doing only DMA. Good thing it works, and on both fat and slim. So maybe I'll include this in my driver too.


What 2 bytes? The 2 CRC bytes I read in the ISR, are there another 2 bytes after the last transfer? I actually copied the CMD12 hack from your example, because I couldn't figure out why the multi block read wouldn't work.
Here: https://gitlab.com/ps2max/ps2sdk/-/blob/master-ps2max/iop/memorycard/sio2sd_bd/src/sdCard.c#L560
spisd_read_multi_block_end() already calls CMD12. But before it you have to read the two CRC bytes. They are left in the SIO2 FIFO and if you did not reset the FIFO, they will be there and the command IO may fail... or if you are resetting the FIFO for each command, then I don't know what is the cause for the problem. BTW, there was a note somewhere in the specs that the SD card will continue sending sector data and CMD12 may have some of its error bits set, when issued, but this does not mean it failed.

I don't think it will make a difference in speed. So I would choose whatever makes the code best readable.
But we should also think of the OPL in-game driver, but maybe it won't make any difference to that either.
 
I don't like his code - he uses a LOT of shifting to get the CSD register. It can be done with a hacked-up struct and only a bit of shifting, only for the fields we need, while leaving the rest still accessible, should somebody need them.
https://gitlab.com/ps2max/ps2sdk/-/...rycard/sio2sd_bd/src/spi_sdcard_driver.c#L338
I don't like it either. The struct you're using is much better. But it was difficult for me to separate the SPI->PS2 code from the SD->SPI code. If you can find the original source I would be happy to switch back. Or we could change the initialization to use the struct, then send a PR to the original repo?

Hmm... I am not sure I quite agree. But I guess you base this on the fact that the user-code (BDM) requested a big buffer of data (rather than having to reverse each sector immediately), and we can do the bit-reversing even later. Still I think that doing it in the interrupt handler is better. It should not be making DMA any slower. Why do you think it should be done by a lower-priority task?
It's important to service interrupts from hardware as soon as possible. If a hardware interrupt cannot be serviced really fast, things start to slow down or break. For instance:
- our mx4sio driver transfer will slow down, because the next DMA transfer cannot be started
- an audio buffer is empty, it will start to stutter
- a network buffer is full, it will overflow
The key thing is to do only what's absolutely necessary in the ISR, and then return. So from this perspective, if we want all hardware of the IOP to be working fast, we need to re-enable interrupts ASAP. This is what all operating systems do, for instance:
https://www.oreilly.com/library/view/linux-device-drivers/0596000081/ch09s05.html

Task switches are not really slow on the IOP as far as I've seen. But they will get slow (or have unreliable speed) if another task disables interrupts for long periods of time to do its processing. So again, I think returning from the ISR fast will also show that the IOP is not slow, and neither is its task switching.

If every (512 byte) interrupt triggers a "bottom-half" interrupt handler to do the byte reversing, then only the last task-switch will cause a slight performance penalty. This could be solved if we let the task that requests (and waits for) the data also do the byte reversing.

Another speed improvement would be to receive the last sector using PIO, at the cost of some CPU load. This should result in the same speed of the PIO mode. Again this could be done with interrupts enabled from the user task.

Actually I think I have that partially. First we are in luck, as AFAIK, SIO2MAN does not use the DMA interrupts but only the SIO2 interrupt. So as long as we use only the DMA interrupts we will be just fine.
Ouch, I'm using the SIO2 interrupt. That's probably why it was safe for me to read the last 2 bytes. Indeed when using the DMA interrupt we cannot know the last 2 bytes have already arrived.

Here: https://gitlab.com/ps2max/ps2sdk/-/blob/master-ps2max/iop/memorycard/sio2sd_bd/src/sdCard.c#L560
spisd_read_multi_block_end() already calls CMD12. But before it you have to read the two CRC bytes. They are left in the SIO2 FIFO and if you did not reset the FIFO, they will be there and the command IO may fail... or if you are resetting the FIFO for each command, then I don't know what is the cause for the problem. BTW, there was a note somewhere in the specs that the SD card will continue sending sector data and CMD12 may have some of its error bits set, when issued, but this does not mean it failed.
The two CRC bytes are read from the ISR, before it wakes up the task that then sends the CMD12.

EDIT1:
After moving the reversing code to the user thread, the results are roughly the same. Latest code here:
https://gitlab.com/ps2max/ps2sdk/-/blob/master-ps2max/iop/memorycard/sio2sd_bd/src/sdCard.c
 
Is the Bytereversal-code in C (compiled with some -O* flag), or is it manually written in MIPS (I) - ASM?

I think this could also yield more speed - by freeing CPU-Cycles - and probably can be reused for other things which need Byte-Reversal (like Firewire)!


Another thing regarding CPU-Cycles...
SMB uses A LOT of CPU-Cycles and can still peak at 2-3MB/s, so...
  1. I think SIO2SD/MX4SIO/etc. would use LESS CPU-Load (than SMB).
  2. The interface is slower than LAN.
...so I think, we can continually have a stable high bandwidth in almost every situation, which is of course slower than SMB, BUT also WAY less demanding on the IOP!


This combined with @Maximus32's work regarding DMA, will probably be important to drive the amount of needed CPU-Cycles down to the point, that even those games which stutter via SMB, run well on the SD-Adapter (due to a lower amount of CPU-Cycles being required during transfer), even though the actual max. theoretical bandwidth is lower in total.
 
It's important to service interrupts from hardware as soon as possible. If a hardware interrupt cannot be serviced really fast, things start to slow down or break.
Finally, I understand now - it is because other interrupts can't interrupt the SIO2 interrupt, so we are not looking at how much code gets executed as a whole(because it will always be roughly the same), but we want to let more urgent stuff done in interrupts happen ASAP (as soon as their trigger occurs), so that other processes don't run-out of data or overflow.
That does make a lot of sense... but I have no good idea on how to run the bit-reversal in a thread.
One way is to use again some signalling to continue the thread with the bit-reversal. But who knows if thread-switching will slow-down overall execution.
The other way is to make use of the knowledge that the whole transfer always runs in a fixed time, so in theory, if we start bit-reversal at a certain point in time before the end of the complete transfer, it will complete with the transfer. This seems very unsafe (sectors getting reversed before they have been received), so the performance of the first method should be tested first.
(Only now I saw your edit.)

Another speed improvement would be to receive the last sector using PIO, at the cost of some CPU load. This should result in the same speed of the PIO mode. Again this could be done with interrupts enabled from the user task.
I am not sure this will work, because in my experience, the bit-reversing over PIO takes almost all the time available there... Actually, it might work, if we wait for a bit more data to gather: While receiving the first 0x100 bytes of the last sector, we reverse the first 0x100 bytes of the previous sector, and maybe even the second 0x100 bytes, and then start reversing the first 0x100 bytes of the last sector, and for the last 0x100 bytes of the last sector, we do it interleaved (as my original PIO code).
But maybe this is just going too far. :D

Ouch, I'm using the SIO2 interrupt. That's probably why it was safe for me to read the last 2 bytes. Indeed when using the DMA interrupt we cannot know the last 2 bytes have already arrived.
For better or for worse, I think we would have to hook SIO2MAN and/or INTRMAN...

The two CRC bytes are read from the ISR, before it wakes up the task that then sends the CMD12.
I saw it now (I thought it worked differently). Hmm, then I have no idea what is happening, but I didn't have any issues with the commands so-far (after fixing my initialization code).

EDIT1:
After moving the reversing code to the user thread, the results are roughly the same. Latest code here:
https://gitlab.com/ps2max/ps2sdk/-/blob/master-ps2max/iop/memorycard/sio2sd_bd/src/sdCard.c
* cmd.sector_size might take long time - better have the 512 be a macro.
I am not sure this is safe: https://gitlab.com/ps2max/ps2sdk/-/blob/master-ps2max/iop/memorycard/sio2sd_bd/src/sdCard.c#L567
cmd.sector_done The read of this variable.
Maybe it is safe, because it is soon after the event flag-waiting and is a function call, but what if another SIO2 completion interrupt triggers before the reading of this variable is done and the bit-reversal func gets a value one sector higher? Maybe this can happen if there are many other threads running and this one does not get to run before the interrupt occurs once more, causing this problem. Maybe it should be made using a separate local variable.
But then, with the assumption I made, there is a risk of missing one setting of the event flag, resulting with less reversed sectors. But in that case, when the completion is reached, the code can simply reverse in a loop all remaining sectors (if any).


@TnA I already copied the -O3 bit-reversal code to inline assembly in a function, and is already integrated and tested in my new code. I don't see how it can be made any shorter/faster. ;)
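For readers following along, this is roughly what the reversal looks like in portable C. It is a sketch that assumes "reversing" means flipping the bit order inside every byte, processed four bytes at a time. It is not the inline assembly mentioned above, and the function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the bit order inside each of the four bytes of a 32-bit word. */
uint32_t reverse_bits_in_bytes(uint32_t w)
{
    w = ((w & 0xF0F0F0F0u) >> 4) | ((w & 0x0F0F0F0Fu) << 4); /* swap nibbles */
    w = ((w & 0xCCCCCCCCu) >> 2) | ((w & 0x33333333u) << 2); /* swap bit pairs */
    w = ((w & 0xAAAAAAAAu) >> 1) | ((w & 0x55555555u) << 1); /* swap neighbours */
    return w;
}

/* Apply the reversal in place to a word-aligned buffer. */
void reverse_buffer(uint32_t *buf, size_t words)
{
    for (size_t i = 0; i < words; i++)
        buf[i] = reverse_bits_in_bytes(buf[i]);
}
```

The masks keep every shift inside its own byte, so the byte order of the word is untouched. A 512-byte sector is 128 such words, which is also why a 4-byte aligned buffer matters for this step.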

EDIT 1: @Maximus32 Do all your SD Cards work with the new driver you are using? If so, then I think there is a good chance they will work with mine too.
 
* cmd.sector_size might take long time - better have the 512 be a macro.
Done. I added this because I wanted the size to be configurable. I tested with 2048 byte reads. That does seem to work! So it probably makes the reads faster, BUT... larger blocks also mean I have a larger block to reverse AFTER the last transfer is done. So the end result for small blocks was slower performance. For larger reads (like 64KiB) there was a small improvement.
This was before I switched to using DMA interrupts instead of SIO2 interrupts.... so that does give another possibility to interrupt after 512 bytes, even when the SIO2 transfer is 2048+2 bytes. Perhaps I'll try the 2048 byte transfers again.

I am not sure this is safe: https://gitlab.com/ps2max/ps2sdk/-/blob/master-ps2max/iop/memorycard/sio2sd_bd/src/sdCard.c#L567
cmd.sector_done The read of this variable.
Maybe it is safe, because it is soon after the event flag-waiting and is a function call, but what if another SIO2 completion interrupt triggers before the reading of this variable is done and the bit-reversal func gets a value one sector higher? Maybe this can happen if there are many other threads running and this one does not get to run before the interrupt occurs once more, causing this problem. Maybe it should be made using a separate local variable.
Done. The DMA interrupt now counts the number of transferred sectors. The user thread counts the number of reversed sectors. This would allow for the user thread to wake up really late, and reverse multiple sectors at once.
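The two-counter idea can be modelled like this. It is a sketch with assumed names (rev_state, on_dma_sector_done, reverse_pending), not the code from sdCard.c; the actual reversal is left as a comment.

```c
#include <stdint.h>

struct rev_state {
    volatile uint32_t sectors_transferred; /* written only by the DMA interrupt */
    uint32_t sectors_reversed;             /* written only by the user thread */
};

/* Called from the DMA completion interrupt, once per received sector. */
void on_dma_sector_done(struct rev_state *s)
{
    s->sectors_transferred++;
}

/* Called by the thread after each wake-up; returns the sectors it handled. */
uint32_t reverse_pending(struct rev_state *s)
{
    uint32_t snapshot = s->sectors_transferred;  /* read the shared counter once */
    uint32_t n = snapshot - s->sectors_reversed;
    /* A real driver would reverse sectors [sectors_reversed, snapshot) here. */
    s->sectors_reversed = snapshot;
    return n;
}
```

Because each counter has a single writer and the thread works from a local snapshot, a late wake-up or a missed event flag only means more sectors get reversed in one go.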

EDIT 1: @Maximus32 Do all your SD Cards work with the new driver you are using? If so, then I think there is a good chance they will work with mine too.
I don't have many uSD cards... it works with my Samsung EVO 16GB uSD.

Other changes:
- I'm now using the SIO2_DMA interrupt instead of the SIO2 interrupt. It now works together with sio2man again. The performance penalty of locking/unlocking sio2man is minimal. I'm not sure reading the 2 CRC bytes is still safe to do, but it seems to work. However I think I have to wait for SIO2 completion to read those 2 bytes just to be safe.
 
I tested with 2048 byte reads.
How?
I mean the card will output 512 bytes and then 2 CRC bytes and then require polling to check for when the next 512 bytes are ready.
Did you send some command to the card to configure it to a different 'sector'-size? (Because if you didn't, AFAIK the card will do what it usually does and you will get some combination of sector data and CRC bytes.) Or when you use bigger blocks, that only refers to doing the bit-reversal in bigger blocks?
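To make the "data block followed by 2 CRC bytes" layout concrete: the two trailing bytes are a CRC16 over the data (CCITT polynomial 0x1021, initial value 0, high byte first). Below is a minimal sketch of checking one received block. It assumes the bytes are already in the card's bit order (i.e. after any bit reversal), and `crc16_ccitt` and `sd_block_crc_ok` are hypothetical names, not functions from the driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC16, polynomial 0x1021, initial value 0, MSB first. */
static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u)
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* 'block' holds data_len data bytes followed by the 2 CRC bytes.
 * Returns 1 if the trailing CRC matches the data, 0 otherwise. */
static int sd_block_crc_ok(const uint8_t *block, size_t data_len)
{
    uint16_t received = (uint16_t)(((uint16_t)block[data_len] << 8) | block[data_len + 1]);
    return crc16_ccitt(block, data_len) == received;
}
```

With the default block length this would be called as `sd_block_crc_ok(buf, 512)` on a 514-byte buffer.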

I'm not sure reading the 2 CRC bytes is still safe to do, but it seems to work. However I think I have to wait for SIO2 completion to read those 2 bytes just to be safe.
It might be OK actually. It depends a lot on the SIO2 implementation, but maybe it won't transfer if the FIFO is full, so it will wait for enough space to become available... though I think there are overflow flags. It might be that, when DMA is used, it does not start the next transfer until the DMA has completed.

I don't have many uSD cards... it works with my Samsung EVO 16GB uSD.
Is there any card you have that it doesn't work with? Or can't you test that?
 
How?
I mean the card will output 512 bytes and then 2 CRC bytes and then require polling to check for when the next 512 bytes are ready.
Did you send some command to the card to configure it to a different 'sector'-size? (Because if you didn't, AFAIK the card will do what it usually does and you will get some combination of sector data and CRC bytes.) Or when you use bigger blocks, that only refers to doing the bit-reversal in bigger blocks?
CMD16 can change the transfer block size. I don't think it has anything to do with the sector size. It can be changed to a maximum of 2048. Then you will get 2048 bytes data + 2 bytes CRC.
http://elm-chan.org/docs/mmc/mmc_e.html
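As a rough illustration of what sending CMD16 involves, the sketch below builds the 6-byte SPI command frame: command index with the start/transmission bits, a 32-bit big-endian argument (the block length in bytes for CMD16), then CRC7 plus the end bit. The names `sd_crc7_shifted` and `sd_build_command` are hypothetical, the frame is shown in the card's bit order (before any bit reversal for the SIO2), and which block lengths a given card accepts is something to check against the card, not something this code guarantees.

```c
#include <stddef.h>
#include <stdint.h>

/* CRC7 (polynomial x^7 + x^3 + 1) over 'len' bytes.
 * The result is returned already shifted left by one, so the
 * caller only has to OR in the end bit. */
static uint8_t sd_crc7_shifted(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x80u)
                crc ^= 0x89u;
            crc = (uint8_t)(crc << 1);
        }
    }
    return crc;
}

/* Build a 6-byte SD command frame for SPI mode. */
static void sd_build_command(uint8_t index, uint32_t arg, uint8_t out[6])
{
    out[0] = (uint8_t)(0x40u | (index & 0x3Fu)); /* start bit 0, transmission bit 1 */
    out[1] = (uint8_t)(arg >> 24);
    out[2] = (uint8_t)(arg >> 16);
    out[3] = (uint8_t)(arg >> 8);
    out[4] = (uint8_t)arg;
    out[5] = (uint8_t)(sd_crc7_shifted(out, 5) | 0x01u); /* CRC7 + end bit */
}
```

For example, `sd_build_command(16, 2048, frame)` would produce the frame for the 2048-byte setting discussed above, and `sd_build_command(16, 512, frame)` the one for the default size.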

Is there any card you have that it doesn't work with? Or can't you test that?
I have another card that works, but it's also a Samsung EVO. That's all I can test.
 
