newlib porting challenges

Maximus32 · Jan 18, 2020

uyjulian said:
I'm attempting to debug the newlib fork of OPL on my DTL-T10000, but it doesn't even show anything on the screen after 1 minute.

That's great! Debugging OPL has proven to be very difficult. Just recently I have had the same issue with a normal (slim) PS2. Have you updated and recompiled binutils for the EE recently? Recent updates:

binutils: https://gitlab.com/ps2max/toolchain/binutils/commits/ps2-v2.14 <- fixes a similar bug to what you have
gsKit: https://gitlab.com/ps2max/gsKit/commits/move-libc-to-newlib <- no big changes
opl: https://gitlab.com/ps2max/Open-PS2-Loader/commits/move-libc-to-newlib <- no big changes

uyjulian · Jan 18, 2020

I just updated.
In newlib branch, changing the display setting to 1080i results in a hang and corrupted screen:

Code:

dsedb R> break
*** Breaked
 at=70030c00  v0-1=0000000e,7000004c  a0-3=30402f00,0000000e,7fffffff,00376840
 t0-7=003767c8,00376902,00000000,00000000, 00000000,00000000,00000000,00000000
 s0-7=00395f00,00370000,00370000,00395cc0, 00395e00,00395e00,00395e40,00000000
 t8=00000000 t9=00000000   k0=00000080 k1=00000000   gp=00381070 sp=00398000
 fp=003980a0 ra=001268d0   lo=000000c0 hi=00000000   sa=00000000 PC=0012f960
 badvaddr=26410a80 badpaddr=181c2440
 $cause   = 0x70008800 [ BD2 CE3 EXC2=Reset IP7 IP3 EXC="External Interrupt" ]
 $status  = 0x70030c13 [ Cu210 EDI EIE IM3 IM2 KSU=User EXL IE ]
  0x0012f958: 0x41010005  bc0t    0x0012f970
  0x0012f95c| 0x00000000  nop
->0x0012f960: 0x00000000  nop
  0x0012f964: 0x4100fff6  bc0f    0x0012f940
  0x0012f968| 0x00000000  nop
  0x0012f96c: 0x00000000  nop
  0x0012f970: 0x03e00008  jr      $ra
dsedb S>

In both upstream and newlib branch, changing the display setting will result in semaphore maxCount overflow.

Code:

dsedb S> run opl.elf
Loading program (address=0x00100000 size=0x00279200) ...
Loading program (address=0x004d6420 size=0x00000004) ...
Loading program (address=0x004d8060 size=0x00000000) ...
Loading program (address=0x004edfe8 size=0x00000038) ...
Loading 7645 symbols ...
Entry address = 0x001000d8
GP value      = 0x00381070
*** Resetted
*** No Connect
*** Resetted
       
EE DECI2 Manager version 0.06 Oct 31 2003 19:02:01
  CPUID=2e14, BoardID=4126, ROMGEN=2005-1124, 128M
       
       
# EEKERNEL[SignalSema] : Semaphore maxCount overflow.
#   SemaphoreID:15 count:2 maxCount:1
       
*** Unexpected reply - type=BREAKR result=EXCEPTION
*** Target program stopped. Check the location by dr command.
dsedb S> dr
 at=70030c00  v0-1=0000000f,00000108  a0-3=0000000f,003bce00,7fffffff,00376840
 t0-7=003767c8,00376902,00000000,00000000, 00000000,00000000,00000000,00000000
 s0-7=003bc9d0,00370000,00370000,00395cc0, 00395e00,00395e00,00395e40,00000000
 t8=00000000 t9=00000000   k0=80016ed8 k1=00000000   gp=00381070 sp=00398020
 fp=003980a0 ra=001269f4   lo=00000180 hi=00000000   sa=00000000 PC=001ee398
 badvaddr=26410a80 badpaddr=181c2440
 $cause   = 0x70008424 [ BD2 CE3 EXC2=Reset IP7 IP2 EXC="Breakpoint" ]
 $status  = 0x70030c13 [ Cu210 EDI EIE IM3 IM2 KSU=User EXL IE ]
  0x001ee390: 0x24030042  li      $v1,0x42
  0x001ee394: 0x0000000c  syscall 0x0
->0x001ee398: 0x0000000d  break   0x0
  0x001ee39c: 0x00000000  nop
  0x001ee3a0: 0x24030044  li      $v1,0x44
  0x001ee3a4: 0x0000000c  syscall 0x0
  0x001ee3a8: 0x03e00008  jr      $ra
dsedb S>

Disabling the maxCount checking by setting bit 18 of the EE parameter allows the program to work fine.

Maximus32 · Jan 19, 2020

Do you also know what sema is causing this issue? In gsKit hires there are 3 semaphores used:
sema_hsync_id <- used as event, when a new render pass starts
sema_vsync_id <- used as event, on each vsync
sema_queue_id <- used for locking the render queue

All 3 have max_count=1. I can not find anywhere how to disable the max_count checking (bit 18?). Do you know where it's defined?

sp193 · Jan 19, 2020

uyjulian said:
I'm attempting to debug the newlib fork of OPL on my DTL-T10000, but it doesn't even show anything on the screen after 1 minute.

You need to compile OPL without stripping it. I forgot if there is an option to get the Makefile to not add the -s option to LDFLAGS (which would otherwise strip out all symbols), but that is what you need. Adding the -g option to CFLAGS may also help you get further, as line numbers will be available - you can do sload with dsedb.

With at least the symbols, you can use the bt command to get the stack trace up to that point. The dt command will allow you to see what all the threads are doing. However, this only works with a new(er) kernel release.

uyjulian said:

The TOOL EE kernels from the late SDKs had some software traps added by SCEI. This was to aid the developer with identify bad code that are impacted by the design flaws of the standard EE kernel.
For example, the semaphore implementation of the EE kernel does not check whether the count exceeds the maximum.

There are various traps, including a stack depth check, which must be disabled since the switch to the EE core will also trigger that trap.
The traps also affect games, which will trigger when older games are run. For this reason, I usually disable all these traps before debugging earlier titles.

Maximus32 said:
All 3 have max_count=1. I can not find anywhere how to disable the max_count checking (bit 18?). Do you know where it's defined?

The semaphore doesn't have an upper-bounds check in the EE kernel, so you need to prevent the semaphore's value from from being increased beyond the maximum.

@uyjulian was referring to the TOOL EE BOOT Parameter (EBOOTP), which is not an option for CEX and DEX consoles. Neither does this check exist in other types of PS2s, other than the TOOL.
Even if RDB/TDB startup card (includes the DECI2 function of OPL's GUI) is used on a CEX/DEX, such functionality will be absent.

Maximus32 · Jan 19, 2020

sp193 said:
You need to compile OPL without stripping it. I forgot if there is an option to get the Makefile to not add the -s option to LDFLAGS (which would otherwise strip out all symbols), but that is what you need. Adding the -g option to CFLAGS may also help you get further, as line numbers will be available - you can do sload with dsedb.

opl.elf and opl_stripped.elf are two different binaries. The 'run' and 'sim' targets will use the un-stripped version: opl.elf. The -g flag is added when compiled with DEBUG=1. So:
make DEBUG=1 clean run
Will compile OPL with -g flag, and send the unstripped version to the ps2 using ps2client.

You probably also need to compile gsKit with these flags though, becouse that's where all the HIRES code is.

uyjulian · Jan 20, 2020

Symbols kind of help.

Here is info on the HIRES issue:

Code:

dsedb R> break
*** Breaked
 at=70030c00  v0-1=0000000e,7000004c  a0-3=30403000,0000000e,7fffffff,00376940
 t0-7=003768c8,00376a02,00000000,00000000, 00000000,00000000,00000000,00000000
 s0-7=00396000,00370000,00370000,00395dc0, 00395f00,00395f00,00395f40,00000000
 t8=00000000 t9=00000000   k0=00000080 k1=00000000   gp=00381170 sp=00398100
 fp=003981a0 ra=00126968   lo=000000c0 hi=00000000   sa=00000000 PC=0012f9f8
 badvaddr=26410a80 badpaddr=180c2440
 $cause   = 0xf0008800 [ BD BD2 CE3 EXC2=Reset IP7 IP3 EXC="External Interrupt" ]
 $status  = 0x70030c13 [ Cu210 EDI EIE IM3 IM2 KSU=User EXL IE ]
   <dmaKit_wait_fast+0x10>:
  0x0012f9f0: 0x00000000  nop
  0x0012f9f4: 0x41010008  bc0t    0x0012fa18  # <dmaKit_wait_fast+0x38>
->0x0012f9f8| 0x00000000  nop
  0x0012f9fc: 0x00000000  nop
  0x0012fa00: 0x41010005  bc0t    0x0012fa18  # <dmaKit_wait_fast+0x38>
  0x0012fa04| 0x00000000  nop
  0x0012fa08: 0x00000000  nop
dsedb S> bt
 $BT0=0x0012f9f8 <dmaKit_wait_fast+0x18>      $SFA0=0x00398100 $SFS0=0x00000000
 $BT1=0x00126960 <gsKit_queue_send+0x50,$L59,$LM168> $SFA1=0x00398100 $SFS1=0x00000020
 $BT2=0x00126a3c <dma_to_gs_thread+0xac,$L73> $SFA2=0x00398120 $SFS2=0x00000080
 $BT3=0x00100200 <_exit+0x30>                 $SFA3=0x003981a0 $SFS3=0x00000000
dsedb S>

Some information on the semaphore issue:

Code:

dsedb S> dr
 at=70030c00  v0-1=0000000f,00000108  a0-3=0000000f,003bcf00,7fffffff,00376940
 t0-7=003768c8,00376a02,00000000,00000000, 00000000,00000000,00000000,00000000
 s0-7=003bcad0,00370000,00370000,00395dc0, 00395f00,00395f00,00395f40,00000000
 t8=00000000 t9=00000000   k0=80016ed8 k1=00000000   gp=00381170 sp=00398120
 fp=003981a0 ra=00126a8c   lo=00000180 hi=00000000   sa=00000000 PC=001ee4a8
 badvaddr=26410a80 badpaddr=180c2440
 $cause   = 0x70008424 [ BD2 CE3 EXC2=Reset IP7 IP2 EXC="Breakpoint" ]
 $status  = 0x70030c13 [ Cu210 EDI EIE IM3 IM2 KSU=User EXL IE ]
   <SignalSema>:
  0x001ee4a0: 0x24030042  li      $v1,0x42
  0x001ee4a4: 0x0000000c  syscall 0x0
->0x001ee4a8: 0x0000000d  break   0x0
  0x001ee4ac: 0x00000000  nop
   <WaitSema>:
  0x001ee4b0: 0x24030044  li      $v1,0x44
  0x001ee4b4: 0x0000000c  syscall 0x0
  0x001ee4b8: 0x03e00008  jr      $ra
dsedb S> bt
 $BT0=0x001ee4a8 <SignalSema+0x08>            $SFA0=0x00398120 $SFS0=0x00000000
 $BT1=0x00126a84 <dma_to_gs_thread+0xf4,$LM199> $SFA1=0x00398120 $SFS1=0x00000080
 $BT2=0x00100200 <_exit+0x30>                 $SFA2=0x003981a0 $SFS2=0x00000000
dsedb S> dt
tid prio      pc       sp      status cause sid wakeup count function
  0 128 00081fc8 00000000       Ready            0      5569 
  1   2 0015becc 01ffe5d0       Ready            0      5556 <_start>
  2   0 001ee4b8 003bc0e0        Wait  Sema   0  0         3 <$LM1>
  3 112 001ee448 0039a220        Wait Sleep      0         1 <$LM11>
  4  30 001ee448 00393b90        Wait Sleep      0        30 <$LM36>
  5  96 001ee448 003b8080        Wait Sleep      0         1 <$LM165>
  6   1 001ee4a8 00398120         Run            0         7 <$LM171>
dsedb S>

I simply added -g to the EE_CFLAGS of ps2sdk, gsKit, and Open PS2 Loader. I didn't do "GSKIT_DEBUG=1" for gsKit or "DEBUG=1" for Open PS2 Loader.

GDB would be nice to convert address to source file/line number...

wisi · Jan 20, 2020

ee-addr2line should also be able to do that. https://linux.die.net/man/1/addr2line

sp193 · Jan 20, 2020

If you used the sload command to load the source files, then you could get the line numbers, from what I do remember.

Maximus32 · Jan 20, 2020

I'm not sure I understand the debugging output I'm looking at. Does the code hang at dmaKit_wait_fast? The function looks like this:

Code:

void dmaKit_wait_fast(void)
{
asm __volatile__("sync.l; sync.p;" \
"0:" \
"bc0t 0f; nop;" \
"bc0t 0f; nop;" \
"bc0t 0f; nop;" \
"bc0f 0b; nop;" \
"0:");
}

I have seen the issue in multiple forms:
- The corrupted screen as in the screenshot from @uyjulian . The interface still responds most of the time in this case, switching modes can recover back to normal screen.
- Garbage (mostly black with small lines showing) on screen and no response
- Black screen and no response

Here's what I found out on recent testing:
- I increased the max_count of the semaphores from 1 to 100, but no change to the issue
- I added printf's (my only way of debugging) to all init/deinit related function, then the issue is 95% gone!

The printf's are sent using RPC calls to the IOP, so I'm thinking it must be timing related. This is the debug log when switching from 720p to 1080i:

Code:

gsKit_hires_deinit_global
gsKit_remove_hsync_handler
gsKit_queue_free
gsKit_queue_free
gsKit_queue_free
gsKit_queue_free
gsKit_queue_free
gsKit_queue_free
gsKit_queue_free
gsKit_deinit_global
gsKit_queue_free
gsKit_queue_free
gsKit_hires_init_global
gsKit_init_global_custom
gsKit_check_rom
gsKit_queue_init
gsKit_queue_init
gsKit_hires_init_screen
gsKit_init_screen
gsKit_set_buffer_attributes
gsKit_check_rom
gsKit_vram_alloc
gsKit_create_passes
gsKit_create_pass
gsKit_queue_init
gsKit_setpass
gsKit_create_pass
gsKit_queue_init
gsKit_setpass
gsKit_create_pass
gsKit_queue_init
gsKit_setpass
gsKit_create_pass
gsKit_queue_init
gsKit_setpass
gsKit_create_pass
gsKit_queue_init
gsKit_setpass
gsKit_create_pass
gsKit_queue_init
gsKit_setpass
gsKit_queue_init
dma_to_gs_thread
gsKit_add_hsync_handler
RENDERMAN New vmode: 11, 1920 x 540

I've tried removing the printf's 1-by-1 but that just seems to slowly increase the probability of it crashing. I already thought I fixed the issue a couple of times, but every time I removed the debugging (printf's) the issue came back.
Next thing I can try is using eesio debugging. I'm hoping eesio printing will not affect the bug.

wisi · Jan 20, 2020

I had a similar case with some bug in the OPL IGS extension I extended - was affected by a delay (printf on EE SIO in my case) and did crash if I removed one particular printf() in the code. I never managed to find the cause. But because it was an in-game thing, with the PS2 game running and IGS involved a very unnatural process of "unloading" the GS, VIF1 and corresponding DMA channel from the game's data (and attempting to save that data) and then loading that data back and hoping it doesn't crash, which depended heavily on the game running, I didn't even try to debug it and simply left some delay there. Adding a long enough delay at one point would "fix" the problem in my case.
Probably is unrelated to what you are experiencing though. I guess similar bugs can appear due to synchronization when not using the hardware in the expected ways.

EDIT: Printf() is generally quite a long/slow function and from what I remember, EE SIO printf waits until the string is printed (when a new character is sent to EE SIO, the function waits in a loop for the UART FIFO to empty), so this may be even slower than the printf through IOP.
So my suggestion is to add a very long delay at one or a few places and then remove them one by one, while trying to determine which ones of them have the same effect, and which ones have no effect, which will basically show between which two points in execution the delay has an effect on the bug.

sp193 · Jan 20, 2020

If you are game, you could enable the DECI2 module support for the OPL UI and debug with dsedb. Although it will not allow you to do IOP-side debugging, it would be sufficient to debug the EE-side program. It is not particularly stable, unfortunately. The stock kernel does not support the EE thread extension, so commands like dt, do not work.

If the EE somehow gets stuck in the sync function, it could mean that the GIF has crashed. The last time I saw that happen, was due to the DMA chain being corrupted. That happened, due to multiple threads trying to use the GIF.

Since the Semaphore max count value does not work in the EE kernel, increasing it will not do anything. Unfortunately.

uyjulian · Jan 20, 2020

The code hangs at dmaKit_wait_fast.

Increasing the max_count will probably delay the bug, not fix it. There may be a resource leak.

Maximus32 · Jan 23, 2020

sp193 said:
If the EE somehow gets stuck in the sync function, it could mean that the GIF has crashed. The last time I saw that happen, was due to the DMA chain being corrupted. That happened, due to multiple threads trying to use the GIF.

uyjulian said:
The code hangs at dmaKit_wait_fast.

This bug mostly results in garbage on screen and most of the time it hangs. So a corrupted DMA chain seems like a likely cause. Causing garbage on the screen and causing the GIF to hang. Then causing dmaKit_wait_fast to hang.

@fjtrujy created a simple gsKit example that results in the same bug. Making it more easy to test different things. The example is in the 'move-libc-to-newlib' branch. Pressing L1 or R1 will switch modes between 720p and 1080i. The bug always happens on the second mode switch. So 720p -> 1080i -> bug.

Adding printf's again results in the bug going away. Removing them one-by-one does not lead to a single place where the bug is, instead the chance of the bug happening slowly increases.

Adding delays in the same place where the printf's are does not fix the bug at all. I've added delay times up to the point where mode switching takes multiple seconds. So it's not time related.

Adding 'SyncDCache(0, 32*1024*1024)' in the places of the printf's does fix the bug. This also supports the theory of a DMA chain corruption. I've reduced the cache flush to 1 instance currently in gsKit_create_passes. So only 1 cache flush during mode switching seems to fix the bug. I'm still trying to find out why this is needed. Code here:
https://gitlab.com/ps2max/gsKit/commit/089e222a9d17416bc05a35cf3e8ff2cdc18a2500

All queues should be allocated using 'memalign(64, size) | 0x30000000'. The added 0x30000000 should result in a TLB allocated mirror, using uncached memory. So flushes of the queue should not be needed before transmission using dmaKit_send_chain_ucab.

sp193 · Jan 23, 2020

I thought of trying to debug this on my T10000, but I seem to be jumping through a lot of flaming hoops to get a working environment here... so I think I might not after all.

Maximus32 said:
Adding 'SyncDCache(0, 32*1024*1024)' in the places of the printf's does fix the bug. This also supports the theory of a DMA chain corruption. I've reduced the cache flush to 1 instance currently in gsKit_create_passes. So only 1 cache flush during mode switching seems to fix the bug. I'm still trying to find out why this is needed. Code here:
https://gitlab.com/ps2max/gsKit/commit/089e222a9d17416bc05a35cf3e8ff2cdc18a2500

All queues should be allocated using 'memalign(64, size) | 0x30000000'. The added 0x30000000 should result in a TLB allocated mirror, using uncached memory. So flushes of the queue should not be needed before transmission using dmaKit_send_chain_ucab.

Sony documented that if cached and uncached accesses are mixed, then you can get memory corruption. The buffers seem to be nicely aligned, but there is another possibility for contamination: the allocated memory region has dirty data cache lines over them. It may happen, if the memory was previously used for something else, which was accessed via the cached window.
Flushing/invalidating the data cache lines over the buffer once would fix this sort of thing, which seems to be in line with what you have done.

It is also for this reason, that gsKit_init_global_custom() has a call to FlushCache(), before the allocated memory is used.

Unfortunately, I cannot really see where the dirty cache lines may be introduced. But you might have a clearer picture of where it might be going wrong.

By the way, what's with the switch to the non-blocking mode for the FILEIO client, in the example?

Code:

  fioSetBlockMode(FIO_NOWAIT);

I think it may cause the check in GetRomName() to not work correctly. It may also corrupt something on the stack, when the RPC call completes.

Maximus32 · Jan 28, 2020

sp193 said:
Sony documented that if cached and uncached accesses are mixed, then you can get memory corruption. The buffers seem to be nicely aligned, but there is another possibility for contamination: the allocated memory region has dirty data cache lines over them. It may happen, if the memory was previously used for something else, which was accessed via the cached window.
Flushing/invalidating the data cache lines over the buffer once would fix this sort of thing, which seems to be in line with what you have done.

It is also for this reason, that gsKit_init_global_custom() has a call to FlushCache(), before the allocated memory is used.

Unfortunately, I cannot really see where the dirty cache lines may be introduced. But you might have a clearer picture of where it might be going wrong.

It still took a long time for me to fignally see where it was going wrong, but it turns out this was exactly the problem. All the GSQUEUE data is used from UCAB area. These queues are allocated only once at initialisation. A mode switch de-initializes gsKit, then re-initializes it again. This caused previously cached pieces of memory to be re-used as UCAB memory. Then some time later when the cache flushes, it corrupts the DMA chain to the GS.

I've added special gsKit_alloc_ucab and gsKit_free_ucab functions to handle UCAB memory in a safe way:
https://gitlab.com/ps2max/gsKit/commit/4159a35d9db27d54e8ad07a73c1072414e59b4a8

So at long last, it's fixed!

Thank you all for helping out.

sp193 said:
By the way, what's with the switch to the non-blocking mode for the FILEIO client, in the example?

Code:

fioSetBlockMode(FIO_NOWAIT);

I think it may cause the check in GetRomName() to not work correctly. It may also corrupt something on the stack, when the RPC call completes.

It was a test to make printf (to the iop) nonblocking. I hoped it would stop the printing from "fixing" the issue, but it didn't. It's removed now.

Maximus32 · Feb 4, 2020

We're almost ready for creating the pull requests for the newlib port, but I would like to be sure OPL works first. Could you please try the following test build and report here if it works or not?
https://gitlab.com/ps2max/Open-PS2-Loader/-/jobs/425258870/artifacts/browse

If there's anything that doesn't work, please check if the same problem exists in the latest official OPL build as well.

Tupakaveli · Feb 5, 2020

I'm getting BSOD on boot, logo never shows.

Krah · Feb 5, 2020

Deleted old opl cfg to test from fresh.

Went to settings changed vmode to 1080i. Audio Settings: Enabled everything turn volume up (it seems there is still an audsrv issue? Cause I had no sound).

Clicked Save Changes; Frozen console.

Tested same thing with Akuhaks bot build and had no problem.

Test 2: Load up fresh, switch to PAL widescreen ON.. go to Save Changes; Frozen console.

Test 3: Load up fresh, PAL ws ON, go to Settings HDD Start Mode Auto, Default Menu HDD Games, click OK; Frozen console.

Initiating devices was causing crashes but as I was editing this comment and kept testing all of a sudden they're fine.. trying to Save Changes in either 720p or 1080i is consistently crashing my console though.

Custom Languages seem to work fine but my Custom Themes aren't populating a list

Maximus32 · Feb 5, 2020

Tupakaveli said:
I'm getting BSOD on boot, logo never shows.

I was hoping the first reaction would be something like this

. But thank you for testing. What console are you using? How are you loading OPL (ps2client/usb/uLE,etc..)? Do you have existing saved configs?

Krah said:
Deleted old opl cfg to test from fresh.

Went to settings changed vmode to 1080i. Audio Settings: Enabled everything turn volume up (it seems there is still an audsrv issue? Cause I had no sound).

Clicked Save Changes; Frozen console.

Tested same thing with Akuhaks bot build and had no problem.

Thanks you for testing. Is the sound issue on both builds, or only in the newlib version?

All file IO functions are now using open/read/write instead of fileXioOpen/Read/Write. The changes to newlib have been fairly well tested, but the changes to OPL are still fresh. I'll look into this.

Krah · Feb 5, 2020

Maximus32 said:
Thanks you for testing. Is the sound issue on both builds, or only in the newlib version?

Wait sh*t sorry, I rarely test audio anymore so my cables weren't even plugged in.. audio on newlib branch is fine.

Only issues I'm having now are Saving in HiRes (EDTV 480p even froze I thought it may have only been 3 pass vmodes but I was wrong) and using Custom Themes.
Booting OPL off USB through wLE.

newlib porting challenges

Maximus32

uyjulian

Maximus32

sp193

Maximus32

uyjulian

wisi

Member

sp193

Maximus32

wisi

Member

sp193

uyjulian

Maximus32

sp193

Maximus32

Maximus32

Tupakaveli

VIP

Krah

Maximus32

Krah

Similar threads