PS3 Fault finding YLOD with the SYSCON - First steps and Error reporting

BTW:
Here are the measurements from the other 2 RSX's I mentioned...
ProcessRSX ModelVDDCFBVDDQVDDRVDDIORWVDDIOVDDAPLLGood/Bad
90nmCXD29712.20.233925204681412022000Bad
40nmCXD53013.1700693106000200001350003500000Good
 
@vyktormvmpay25 @squeept could you post some RSX ohm test measurements? Especially the 90nm! Most of the ones @SkaziChris and I have tested are bad. I want to establish a nominal value +/- one standard deviation for each rail, but we need more measurements to get a statistically significant number. They don't all have to be good either; bad chips are just as useful. Often most of the voltage lines are fine and just one is bad.
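
A minimal sketch of the "nominal value +/- a standard deviation" idea, assuming the readings for one rail have been collected into a list. The helper names are hypothetical and the numbers are made-up placeholders, not measurements from this thread:

```python
import statistics

def nominal_range(readings_ohms):
    """Return (mean, sample standard deviation) for one rail's readings."""
    if len(readings_ohms) < 2:
        raise ValueError("need at least two readings for a standard deviation")
    return statistics.mean(readings_ohms), statistics.stdev(readings_ohms)

def is_outlier(value, mean, stdev, k=2.0):
    """Flag a reading more than k standard deviations from the mean."""
    return abs(value - mean) > k * stdev

# Placeholder values only (hypothetical), in ohms.
example = [10.0, 12.0, 11.0, 13.0, 9.0]
mean, sd = nominal_range(example)
print(f"nominal {mean:.2f} ohm +/- {sd:.2f}")
```

With only a handful of chips per rail the standard deviation will be rough, which is why more readings (good and bad) are needed.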
 
I only have dead 90nm chips at the moment. I need to test one from a scrap board where I see 3.2 ohms on VDD core. It may have another issue: I set it aside because it didn't work after a reball, and I'm hoping the Cell, which reads 7 ohms, is the problem. I didn't check UART after the reball.
 
Well, the dead ones are useful too. We also need to establish what dead ones look like.
 
I probed a couple more chips, another 40nm and 90nm. But I noticed a difference between them that confused me before. So I thought I would take a closer look. Apparently, the locations labeled PLL in blue are a bit different between these model revisions...
View attachment 35829
As you can see the 40nm reads OL in a couple of spots that the 90nm doesn't. So I made this probing chart to simplify the locations to test, so it'll always return a comparable value.
View attachment 35830

Thanks.
I will redesign the spreadsheet... and re-do the measurements.
 
Hey just a quick question...
Connected my CECHK syscon and could successfully authenticate. Then I did the ERRLOG, unfortunately all entries look like this
00000000 FFFFFFFF FFFFFFFF
So no useful error codes.

Anyone got an idea what this means? Doesn't look good to me...
 
Quick question about this — how do I orient this image with the board?
Yes, I made a paint.net image with these scaled to the motherboard schematics (with layers I can add/remove)...
RSX PWR MB View 2.jpg
RSX PWR MB View.jpg
 
I'm going to reply here for the record, because this thread is more focused on syscon research. Originally written here:
I'm just going to be adding more confusion now. From devwiki

The syscon itself updates new errors by erasing the oldest. But you can tell it easier when you see the timestamps like when you used Advanced Tools. I suppose the latest year you're seeing in the date would be the latest errors.

If you plan to keep using ERRLOG GET (which I still don't recommend), then before you start, try ERRLOG CLEAR command.

More from devwiki recommending (it's not mentioned, but implied) to go into internal mode and use the command "clear errlog" before "errlog" in order to be sure you'll get the latest errors.
I wrote that the other day. They are mostly small notes as a reminder, and the details can be added later. If any of you have suggestions, just tell me or edit the page, and feel free to rewrite my explanations in dirty English :D

And I edited the page just now to add some samples of the error logs in a hex editor view. There are a couple of "weird" things that are worth mentioning:
https://www.psdevwiki.com/ps3/Syscon_Error_Codes#Error_log_format

There are 2 error log formats: one for Mullions (the syscons soldered by BGA) and one for Sherwoods (with pins all around).
In Mullion the log is split in half: the top (first 0x80 bytes) holds all the error codes, and the bottom (last 0x80 bytes) holds the timestamps. In Sherwood it is not split, and every error code is followed by its timestamp.
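
For anyone scripting against raw dumps, here is a minimal sketch of the two layouts described above. It assumes a 0x100-byte log with 32 entries of a 4-byte code plus a 4-byte timestamp (this matches the 0x80/0x80 split for Mullion; the same total size is an assumption for Sherwood). The function name `parse_errlog` is hypothetical, and the big-endian byte order is also an assumption, not something confirmed by the wiki samples:

```python
import struct

ENTRIES = 32          # 0x80 bytes of codes / 4 bytes per code
LOG_SIZE = 0x100      # 0x80 bytes of codes + 0x80 bytes of timestamps

def parse_errlog(raw, layout, byteorder=">"):
    """Return a list of (code, timestamp) pairs from a raw 0x100-byte log.

    layout: "mullion"  -> codes in first 0x80 bytes, timestamps in last 0x80
            "sherwood" -> each code immediately followed by its timestamp
    byteorder: struct prefix; big-endian is an ASSUMPTION, not confirmed.
    """
    if len(raw) != LOG_SIZE:
        raise ValueError("expected 0x100 bytes")
    words = struct.unpack(f"{byteorder}{LOG_SIZE // 4}I", raw)
    if layout == "mullion":
        # First 32 words are codes, last 32 words are their timestamps.
        return list(zip(words[:ENTRIES], words[ENTRIES:]))
    if layout == "sherwood":
        # Even words are codes, odd words are the matching timestamps.
        return list(zip(words[0::2], words[1::2]))
    raise ValueError(f"unknown layout: {layout}")
```

Either layout ends up as the same list of pairs, so any later analysis does not need to care which syscon the dump came from.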

As you can see in the samples, the syscon sometimes stores the error code 0xFFFFFFFF (only after the error log has been filled at least once), but it is not really an error code. We need to think of the error log as a continuous loop, and the 0xFFFFFFFF entry indicates where the loop ends.
I added some notes in this edit, but they are hidden from wiki readers (only visible to wiki editors when you click one of the page's "edit" links) because they are mostly speculation:
https://www.psdevwiki.com/ps3/index.php?title=Syscon_Error_Codes&diff=64543&oldid=64542

The PS3 Advanced Tools read the error log by using syscalls (not raw), and I'm not sure how they deal with this "loop ending" indicator.
The other Windows app, intended to be used over UART (which really reads the error log raw), probably doesn't treat it as a special indicator and displays it as error code 0xFFFFFFFF.

My point is... the error code 0xFFFFFFFF doesn't exist. If the error log has been cleared and/or was never filled, then it is normal to have a lot of FFs at the bottom. But if you see something like the examples in the wiki, where there is an isolated row of 0xFFFFFFFF surrounded by valid error codes, it is an indicator of where the loop ends (in other words, the error log has wrapped around at least once).
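
A rough sketch of that distinction, assuming the codes have already been read into a list in storage order. The function name is hypothetical, and since the loop-end interpretation itself is speculation, treat this as an illustration only:

```python
EMPTY = 0xFFFFFFFF

def find_loop_end(codes):
    """Return the index of an isolated 0xFFFFFFFF loop-end marker, or None.

    Trailing 0xFFFFFFFF entries (a cleared or never-filled log) are ignored;
    only an FF entry with a valid code somewhere after it counts as a marker.
    Note: a single FF in the very last slot is ambiguous and is treated
    here as an empty entry.
    """
    # Skip the run of empty entries at the bottom of the log.
    last_valid = len(codes) - 1
    while last_valid >= 0 and codes[last_valid] == EMPTY:
        last_valid -= 1
    # Any FF before the last valid code is surrounded by real entries.
    for i in range(last_valid):
        if codes[i] == EMPTY:
            return i
    return None
```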
 
I think you are not in internal mode... Can you show the command line output?

In external mode you need to do ERRLOG GET 00, then ERRLOG GET 01 and so on up to 1F (0x20 = 32 entries in total).

Still, I don't understand why use this mode. Just go into internal mode and it will be easier to get proper errors. If the guide is confusing, you can ask here.
 
I'm afraid your point went over my head... :D None of this changes what I said about the recommended way to read error codes reliably?
 
Yeah, in GLOD the error tends to get logged on shutdown in step #90. So you get 90 2120 a lot. We've seen them occur at 20 & 40 before too, so it's not always 90. I just pored back over my spreadsheet of errorlogs and see that most of the DVE/HDMI errors occur in the 80/90 step numbers. And they're often a prelude to 40 3034's. @Kleon1876 documented the entire sequence of events with his console!
  1. He had Errors 80 1001 & the odd 90 2120. The 80 1001's were still occurring all the time, but 6 months later...
  2. ...the error progressed to 80 2022.
  3. 2 months later, a 1601/1701 occurred the moment the BGA broke. The PLL lost lock because the BGA break occurred while the system was on. So it generated a livelock detection / BE attention error code just that once. That's how 1601/1701 get logged when the clock generator is fine BTW. It just means the console was on when the BGA broke. Other BGAs break when the console cools overnight, so it doesn't affect the PLL.
  4. Subsequent attempts to turn on the console generated 40 3034's.
  5. @kleon reflowed and the 3034/2022's disappeared. But the 80 1001's were still there (that's a separate issue that predated the DVE errors and remained afterward).
  6. Unfortunately he scratched the CPU while delidding before the reflow, so he also got 80 1103 and 90 2203 errors. That gave him a YLOD/GLOD, even though his RSX reflow was successful. He then sent the console to @squeept who repaired the scratched CPU trace. That took care of the 1103/2203 errors.
  7. Unfortunately the reflow began failing, and the 80 2022's returned.
  8. The BGA failed again shortly thereafter and the 40 3034's returned.
Kinda neat that the story is starting to make sense.
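
For sorting logs by step number, here is a small sketch that splits a full log entry such as A0801001 into the "80 1001" shorthand used above. The layout (2 hex digits of prefix, 2 of step number, 4 of error code) is an assumption inferred from how the codes are written in this thread, and `split_error` is a hypothetical helper name:

```python
def split_error(entry):
    """Split an 8-hex-digit log entry such as 'A0801001' into its parts.

    ASSUMED layout (inferred from the thread's shorthand, e.g. '80 1001'):
    2 digits prefix, 2 digits step number, 4 digits error code.
    """
    entry = entry.strip().upper()
    if len(entry) != 8 or any(c not in "0123456789ABCDEF" for c in entry):
        raise ValueError(f"not an 8-digit hex entry: {entry!r}")
    return {"prefix": entry[:2], "step": entry[2:4], "code": entry[4:]}

print(split_error("A0801001"))
# {'prefix': 'A0', 'step': '80', 'code': '1001'}
```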

I wanted to show an interesting example: when Victor was working on a US COK-001 board for the first time, he saw these errors. It also showed 1601/1701, but the 3034 was already there before it as well. Perhaps it briefly reconnected, registered the 1601, and then broke again.

US-cok001.jpg
 
Just copy and paste this into the command terminal.
Code:
ERRLOG GET 00
ERRLOG GET 01
ERRLOG GET 02
ERRLOG GET 03
ERRLOG GET 04
ERRLOG GET 05
ERRLOG GET 06
ERRLOG GET 07
ERRLOG GET 08
ERRLOG GET 09
ERRLOG GET 0A
ERRLOG GET 0B
ERRLOG GET 0C
ERRLOG GET 0D
ERRLOG GET 0E
ERRLOG GET 0F
ERRLOG GET 10
ERRLOG GET 11
ERRLOG GET 12
ERRLOG GET 13
ERRLOG GET 14
ERRLOG GET 15
ERRLOG GET 16
ERRLOG GET 17
ERRLOG GET 18
ERRLOG GET 19
ERRLOG GET 1A
ERRLOG GET 1B
ERRLOG GET 1C
ERRLOG GET 1D
ERRLOG GET 1E
ERRLOG GET 1F
That will automatically run each command to retrieve all 32 codes stored in the log. Copy the text the SYSCON returns and paste it into a txt file for safekeeping. When you post it, use the "insert" button in the toolbar and choose "code". That'll keep the thread tidy. And you won't have to upload pictures to Imgur until you reach 10 posts (the point at which you're allowed to attach them directly to the forum).
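
If you would rather generate that list than paste it, a minimal sketch (the helper name is hypothetical; it only builds the command text and does not talk to the SYSCON):

```python
def errlog_get_commands(count=32):
    """Build the ERRLOG GET command for each log slot, indexed in hex."""
    return [f"ERRLOG GET {i:02X}" for i in range(count)]

print("\n".join(errlog_get_commands()))
```

The slot index is two uppercase hex digits, so the 32 slots run from 00 to 1F.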
 
I'm still working up a spreadsheet with error codes. So far I have 223 consoles worth from this thread. I have done some preliminary statistics and the results are interesting, but I need to finish collating the error codes from the TOKIN thread before I'm ready to present them.

The short answer:

The 3034's before a 1601/1701 in the log, followed by 3034/4XXX after, probably mean he performed a pressure test and/or used different mounting pressure on the heatsinks during a test. It temporarily reconnected the BGA, allowing the console to POST. Then it disconnected while the console was on.

The "Oh God, here he goes again!" answer:
I can say that 1601/1701 are common errors. They usually happen when a BGA defect occurs while the system is running (power state 80). Rarely do they happen earlier than power state 80! Basically, Cell encounters a livelock situation because the BGA was teetering/broke while the system was on. After POST, one of the ways the console realizes there is a BGA failure is when "a request for an exclusive lock is denied repeatedly, as many overlapping shared locks keep on interfering each other." (source). I'm not a coder, so I'm not sure if that quoted sentence makes much sense. The way I think of it, that's one of the ways the console realizes there's an issue, as the power/signaling is intermittently interrupted while the BGA breaks. It complains of "BE attention" and "Livelock Detection" errors if the BGA break didn't affect one of the voltage rails. The FlexIO is a common non-voltage line that would result in this error if the console was on at the time the BGA broke.

Subsequent attempts to turn on the console will result in 3034/4XXX. That's the purpose of BitTraining, to pick up this kind of fault before allowing the console to proceed with the Power On Sequence. That prevents the console from getting to the point where 1601/1701 errors would occur. So that's why you will see the 1601/1701 once, at the same timestamp. Then followed by 3034/4XXX errors thereafter.
 
I've finally found something that will consistently trigger YLOD on my high mileage A01. Playing the TLOU sewers level, when Henry goes to pick you up out of the water, it will trigger YLOD 9/10 times. Clearly this is a high load environment for the PS3. It triggers A0801001 every time on my system. So my plan is to purchase new NEC OE128 caps and a hot air station to try and replace them all. At least I'm happy I found something that can consistently reproduce the failure. Maybe others can use my save file on their random YLOD systems to see if it reproduces the failure as well. FYI total run time on this console is 428 days, although my B01 with 703 days still works fine even on this level. I can always reboot and get to XMB + in game, so perhaps it is not the chipset that has failed, but truly the caps.
 
Try another PSU first. Piggyback a TaPol next. I'm concerned about the heat needed to replace the tokins damaging the BGA.

Also, would you mind posting the full errorlog? PS3 Advanced Tools dump is perfect.
 
I have tried another APS 226 and an APS 231; it doesn't matter, it still has random YLODs and fails consistently on this part of TLOU. Ignore the later errors... it was stupid shit I was responsible for (start up with no thermal paste...). I don't know why the datetimes are fucked up. I do not want to add a piggyback cap; I want to replace with original OE128 caps because I don't want to introduce other variables into the equation.

Code:
Firmware Version: 4.88 (build 50731)
Platform ID: Cok14
Product Code: 00 84
Product Sub Code: 00 01
Hardware Config: 00000000FFFFFFFF
Syscon Fimware Version: 0B8E.0001000000000006 (EEPROM: 0001000000000006)

Bringup Count: 1209, Shutdown Count: 1168
Runtime: 429 Days, 14 Hours, 28 Minutes, 27 Seconds

Error Log
01: A0801001  Tue Jan  3 02:06:00 2006
02: A0801001  Tue Jan  3 02:02:13 2006
03: A0801001  Tue Jan  3 01:58:24 2006
04: A0801001  Tue Jan  3 01:55:31 2006
05: A0801001  Tue Jan  3 00:57:28 2006
06: A0801001  Tue Jan  3 00:47:30 2006
07: A0801001  Tue Jan  3 00:37:27 2006
08: A0801001  Mon Jan  2 12:41:24 2006
09: A0801001  Sun Jan  1 09:06:47 2006
10: A0801001  Sat Dec 31 01:05:29 2005
11: A0801001  Mon Jan  2 04:22:06 2006
12: A0801001  Sat Dec 31 20:27:27 2005
13: A0801001  Sat Dec 31 00:29:45 2005
14: A0801001  Sat Dec 31 00:01:04 2005
15: A0801200  Fri Dec 31 23:59:59 1999
16: A0801200  Fri Dec 31 23:59:59 1999
17: A0801001  Sun Jan 15 00:58:23 2006
18: A0801001  Sat Dec 31 02:36:38 2005
19: A0801001  Sat Dec 31 02:01:04 2005
20: A0801001  Sat Dec 31 01:02:30 2005
21: A0801002  Sat Dec 31 00:00:32 2005
22: A0801001  Sat Dec 31 00:00:01 2005
23: A0902203  Fri Dec 31 23:59:59 1999
24: A0801200  Fri Dec 31 23:59:59 1999
25: A0801200  Fri Dec 31 23:59:59 1999
26: A0801200  Fri Dec 31 23:59:59 1999
27: A0902203  Fri Dec 31 23:59:59 1999
28: A0801200  Fri Dec 31 23:59:59 1999
29: A0801200  Fri Dec 31 23:59:59 1999
30: A0902203  Fri Dec 31 23:59:59 1999
31: A0801200  Fri Dec 31 23:59:59 1999
32: FFFFFFFF  Fri Dec 31 23:59:59 1999



Bonus video of it in action
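
For collating dumps like the one above into a spreadsheet, a minimal sketch that tallies how often each code appears. It assumes the "NN: CODE  timestamp" line shape shown in this dump and skips FFFFFFFF entries; the function name is hypothetical and the sample is a few lines copied from the log above:

```python
import re
from collections import Counter

# Matches lines like "01: A0801001  Tue Jan  3 02:06:00 2006"
LINE = re.compile(r"^\s*(\d{2}):\s+([0-9A-Fa-f]{8})\s+(.*\S)\s*$")

def tally_codes(dump_text):
    """Count error codes in an Advanced Tools style dump, skipping FFFFFFFF."""
    counts = Counter()
    for line in dump_text.splitlines():
        m = LINE.match(line)
        if m and m.group(2).upper() != "FFFFFFFF":
            counts[m.group(2).upper()] += 1
    return counts

sample = """\
01: A0801001  Tue Jan  3 02:06:00 2006
15: A0801200  Fri Dec 31 23:59:59 1999
21: A0801002  Sat Dec 31 00:00:32 2005
23: A0902203  Fri Dec 31 23:59:59 1999
32: FFFFFFFF  Fri Dec 31 23:59:59 1999
"""
print(tally_codes(sample).most_common())
```

Header lines such as the firmware version or runtime do not match the pattern, so the whole dump can be pasted in unchanged.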
 
...I do not want to add a piggyback cap, I want to replace with original OE128 cap because I don't want to introduce other variables into the equation.
I hear what you're saying, but using that much heat right next to the RSX/CPU will introduce a great deal of strain to the BGA. It doesn't look like you have a BGA defect yet, but that can and has changed after attempting to remove/replace tokins. I absolutely AM trying to scare you.

This is the reason I made the Tantalizers, so you can remove/replace tokins with a longer lasting solution using a minimal amount of heat.
 
