1
0
mirror of git://projects.qi-hardware.com/wernermisc.git synced 2024-11-15 15:35:19 +02:00
wernermisc/m1rc3/norruption/LOG
2011-09-18 12:35:19 -03:00

431 lines
16 KiB
Plaintext

--- Tue 2011-09-06 ------------------------------------------------------------
Running "loop": power-cycle, sleep 2 s, jtag-boot, sleep 70 seconds,
which is enough to boot into FN and render "The Tunnel" for a moment,
then power-cycle again (off-time is 5 s).
Note that the test loop is "open-loop" and will cycle also past any
problems. The first time a corrupt standby (or any other issue) is
observed may therefore be well after the actual event.
1: started around 11:53 (M1 configuration is original, without locking)
(around 500) visually checked boot process; standby was reached normally
--- Wed 2011-09-07 ------------------------------------------------------------
645: neocon stopped working (around 01:58)
666: detected neocon failure at run 666: restarted neocon; urjtag failed
this cycle; back to normal at 667
684: checked LEDs again (first time since ~500) and found that standby
may be failing. stopping test at 685 (around 02:50) for
investigation.
Downloaded the standby bitstream:
wget https://raw.github.com/milkymist/scripts/master/scripts/reflash_m1.sh
chmod 755 reflash_m1.sh
./reflash_m1.sh --read-flash
Found two corruptions in the standby bitstream:
diff -u <(hexdump -C standby.fpg) <(hexdump -C /home/root/.qi/milkymist/read-flash/2011...)
-00000080 00 00 4c 83 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |..L...L......G.C|
+00000080 00 00 4c 83 00 00 4c 87 00 00 c4 80 d8 47 cc 43 |..L...L......G.C|
-00002840 00 08 cc 26 00 00 00 00 00 00 00 00 0c 44 00 98 |...&.........D..|
+00002840 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 98 |...&.........D..|
CRC-checked the partitions:
git clone git://github.com/milkymist/milkymist
cd milkymist/tools/
gcc -Wall -I. -o flterm flterm.c
wget http://milkymist.org/updates/current/for-rc3/boot.4e53273.bin
./flterm --port /dev/ttyUSB0 --kernel boot.4e53273.bin
serialboot
a
only standby.fpg failed the CRC check
Reflashed the standby bitstream:
wget http://milkymist.org/updates/2011-07-13/for-rc3/fjmem.bit
(or http://milkymist.org/updates/fjmem.bit.bz2)
wget http://milkymist.org/updates/current/standby.fpg
jtag
cable milkymist
detect
instruction CFG_OUT 000100 BYPASS
instruction CFG_IN 000101 BYPASS
pld load fjmem.bit
initbus fjmem opcode=000010
frequency 6000000
detectflash 0
endian big
flashmem 0 standby.fpg noverify
M1 enters standby normally again.
Running "loop2": power-cycle, sleep 2 s, jtag-boot, sleep 10 seconds,
which is enough to begin (but not finish) booting RTEMS, then
power-cycle again (off-time is 5 s).
1: started around 05:01. Observed until about 200-300 (06:00-06:30)
that standby was okay.
~730 (08:48): observed that standby didn't load anymore (note: due to
a bug in labsw, power is not turned on in about 5-10% of the cycles,
so the real cycle count should be around 650-700.)
Standby bitstream difference:
-00000080 00 00 4c 83 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |..L...L......G.C|
+00000080 00 00 00 00 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |......L......G.C|
Reflashed standby and locked the NOR. Testing with loop2 again.
1 (09:18): started
... continuing through the night ...
--- Thu 2011-09-08 ------------------------------------------------------------
3483 (03:18): standby is good so far
4325 (07:40): manually ended test. Standby is still good, but starting
with cycle 3704, booting RTEMS failed with
I: Booting from flash...
I: Loading 1889692 bytes from flash...
E: CRC failed (expected aa12a56a, got 68ec25e6)
A CRC check yielded:
Images CRC:
Checking : standby.fpg CRC passed (got c58e8905)
Checking : soc-rescue.fpg CRC passed (got 30dcc535)
Checking : bios-rescue.bin(CRC) CRC passed (got c78353fa)
Checking : splash-rescue.raw CRC passed (got e8ff824f)
Checking : flickernoise.fbi(rescue)(CRC) CRC passed (got aa12a56a)
Checking : soc.fpg CRC passed (got 3a31e737)
Checking : bios.bin(CRC) CRC passed (got 86e23684)
Checking : splash.raw CRC passed (got 978f860c)
Checking : flickernoise.fbi(CRC) CRC failed (expected aa12a56a, got 68ec25e6)
Read back the FlickerNoise partition with
readmem 0x920000 0x0400000 fn.bin
Compare with the original:
wget http://www.milkymist.org/updates/2011-07-13/flickernoise.fbi
md5sum flickernoise.fbi
5b7367e71bda306b080bde124615859b flickernoise.fbi
diff -u <(hexdump -C flickernoise.fbi) <(hexdump -C fn.bin)
...
-0008a380 28 43 00 00 34 64 00 01 58 44 00 00 5c 60 00 1e |(C..4d..XD..\`..|
+0008a380 28 43 00 00 00 00 00 01 58 44 00 00 5c 60 00 1e |(C......XD..\`..|
...
Recovered the FN partition and unlocked the NOR:
flashmem 0x920000 flickernoise.fbi noverify
unlockflash 0 55
New test series with script loop4. This differs from loop2 in that
it uses "pld reconfigure" to return to standby, instead of
power-cycling. If we still observe corruption with this test, then
a software problem would be to blame.
1 (09:11): started
2509 (19:33): standby looks good
All CRC checks pass. Verified that NOR was unlocked:
(load fjmem, etc.)
peek 0 # show old value
poke 0 0x40 0 0x0000 # Word Program
peek 0 # read back status (0x80 if okay, 0x92 if locked)
poke 0 0xff # Read Array (switch back to normal operation)
Took labsw offline to analyze occasional failure to switch. Failure
was difficult to reproduce. Also opened labsw to tighten a loose nut.
Afterwards (Friday run), labsw showed much fewer switch failures.
--- Fri 2011-09-09 ------------------------------------------------------------
New test with script "loop5". This time, we only power cycle but don't
try to boot out of standby. The purpose of this test is to confirm that
NOR corruption does not occur when powering down while in standby.
1 (11:04): started
200 (11:28:): stopped to issue "unlockflash 0 105" to make sure all of
the NOR is unlocked, just in case
Also checked CRCs. All is well.
1 (11:31): started
2637 (16:53): stopped. standby looks good.
All partitions pass the CRC check.
Repeating loop2 to make sure the NOR corruption hasn't disappeared for
an unrelated reason. System is connected to oscilloscope monitoring the
M1 DC in voltage. This connection provides grounding of DC in.
1 (16:56): started
--- Sat 2011-09-10 ------------------------------------------------------------
2428 (04:57): standby still okay
2440 (05::01): disconnected oscilloscope
2463 (05:08): stopped test
All partitions pass the CRC check. Read back the standby partition and
also found no corruption in bitwise comparison. Furthermore, the unused
area showed the expected 0xffff pattern.
1 (05:14): restarted test, without oscilloscope.
2213 (16:11): standby still okay
All partitions pass the CRC check. Unused area of standby shows 0xffff.
Prepared new test (loop7): like loop2, but make a "false start" of
turning on both channels and immediately turn them off again, wait 16
seconds, and only then power up properly. This would roughly correspond
to labsw failing to turn on, as observed in the test runs in which NOR
corruption occurred.
1 (16:27): started loop7 test
... continuing through the night ...
--- Sun 2011-09-11 ------------------------------------------------------------
2001 (11:58): standby okay
All partitions pass the CRC check. Unused area of standby shows 0xffff.
Confirmed writability of NOR at address 0x80000 and at address 0.
Instructions used at address 0x80000:
jtag> peek 0x80000
URJ_BUS_READ(0x00080000) = 0xFFFF (65535)
jtag> poke 0x80000 0x40 0x80000 0xffee
jtag> peek 0x80000
URJ_BUS_READ(0x00080000) = 0x0080 (128)
jtag> poke 0 0xff
jtag> peek 0x80000
URJ_BUS_READ(0x00080000) = 0xFFEE (65518)
--- Mon 2011-09-12 ------------------------------------------------------------
loop8 is similar to loop7. It increases the "false on" period to 10 ms,
which is enough to make the M1 power LED flash. It reduces the cool off
period after the false on.
1 (08:11): started loop8 test
2120 (19:50): standby okay. All partitions pass CRC check.
Going back to the beginning. Test loop (1) runs all the way to rendering.
Maybe it is necessary after all ...
1 (19:52): started loop (1) test (serial console logged in file log9)
70 (21:21): standby okay
--- Tue 2011-09-13 ------------------------------------------------------------
223 (00:39): standby failure
Several corruptions were found:
-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
+00000000 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff ff |................|
...
00005510 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
*
+00005550 ff ff ff ff ff ff d6 10 ff ff ff ff ff ff ff ff |................|
+00005560 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
+*
No CRC errors in other partitions.
1 (00:50): restored standby partition. started loop again (log9.1)
613 (14:04): standby okay
639 (14:38): standby failure. Other CRCs okay. Corruption:
...
-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
+00000000 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
...
-000014b0 00 c0 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0|
+000014b0 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0|
...
1 (14:46): restored and verified standby partition. started loop again (log9.2)
23 (15:15): ok
117 (17:16): standby failure. Other CRCs okay. Corruption:
...
000050e0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
*
+00005140 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
...
1 (17:23): restored and verified standby partition. started loop again (log9.3)
1 (17:30): restarted after X crash
... continuing through the night ...
--- Wed 2011-09-14 ------------------------------------------------------------
... continuing through the whole day ...
--- Thu 2011-09-15 ------------------------------------------------------------
1493 (01:46): standby okay
1584 (03:44): standby failure. CRC failure in standby and soc-rescue.
Corruption in standby:
-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
+00000000 ff ff ff ff 00 00 ff ff ff ff ff ff ff ff ff ff |................|
...
-00000a00 00 98 cc 26 00 00 00 00 00 00 00 00 0c 44 00 a0 |...&.........D..|
+00000a00 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 a0 |...&.........D..|
...
Read back soc-rescue:
jtag> readmem 0x0A0000 0x0180000 soc-rescue.bin
Corruption:
-00020460 50 00 50 00 00 0f 55 5f ff ff dd ee 00 00 00 00 |P.P...U_........|
+00020460 00 00 50 00 00 0f 55 5f ff ff dd ee 00 00 00 00 |..P...U_........|
Restore standby, then soc-rescue:
wget http://milkymist.org/updates/current/soc-rescue.fpg
jtag> flashmem 0xA0000 soc-rescue.fpg noverify
1 (04:00): verified CRCs. started loop again (log9.4)
750 (20:12): standby okay. Interrupted to dump first 256 bytes of NOR with
script "peek". No differences found.
752 (20:13): resumed. standby okay
... continuing into the night ...
--- Fri 2011-09-16 ------------------------------------------------------------
980 (01:10): standby okay. Interrupted to run "peek". Found non-fatal damage:
-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
+00000000 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
981 (01:12): resumed. standby okay.
1121 (04:13): standby failed. Other CRCs okay. Corruptions:
-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
+00000000 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
...
-00000800 00 08 cc 26 00 00 00 00 00 00 00 00 0c 44 00 20 |...&.........D. |
+00000800 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 20 |...&.........D. |
...
-00001000 00 c8 cc 26 00 00 00 00 00 00 00 00 0c 44 00 90 |...&.........D..|
+00001000 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 90 |...&.........D..|
1 (04:20): restored and verified standby. started loop again (log9.5)
516 (15:28): standby okay. Interrupted to "peek". No differences found.
517 (15:29): resumed. standby okay.
597 (17:13): standby failed. Other CRCs okay. Corruption:
-00001160 00 d0 cc 26 00 00 00 00 00 00 00 00 0c 44 00 50 |...&.........D.P|
+00001160 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 50 |...&.........D.P|
1 (17:21): restored and verified standby. started loop again (log9.6)
... continuing into the night ...
--- Sat 2011-09-17 ------------------------------------------------------------
352 (00:57): standby okay
422 (02:27): standby failed. Other CRCs okay. Corruption:
-00002000 00 70 cc 26 00 00 00 00 00 00 00 00 0c 44 00 28 |.p.&.........D.(|
+00002000 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 28 |...&.........D.(|
1 (02:36): restored and verified standby. started loop again (log9.7)
501 (13:25): standby okay. Interrupted to run "peek". Found non-fatal damage:
-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
+00000000 ff ff ff ff ff ff 81 4c ff ff ff ff ff ff ff ff |.......L........|
1 (13:26): restarted (oops). standby okay. (still log9.7)
473 (23:39) standby okay
... continuing into the next day ...
--- Sun 2011-09-18 ------------------------------------------------------------
500 (00:14): standby failed. CRC failed in flickernoise-rescue, too.
Corruptions in standby:
-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
+00000000 ff ff ff ff 00 00 81 4c ff ff ff ff ff ff ff ff |.......L........|
...
-00002ff0 00 20 cc 26 00 00 00 00 00 00 00 00 0c 44 00 78 |. .&.........D.x|
+00002ff0 00 20 c4 04 00 00 00 00 00 00 00 00 0c 44 00 78 |. ...........D.x|
Corruption in flickernoise-rescue:
-00120000 ba 80 08 00 34 10 00 00 e0 00 00 1b 40 c8 00 0a |....4.......@...|
+00120000 00 00 08 00 34 10 00 00 e0 00 00 1b 40 c8 00 0a |....4.......@...|
1 (00:36): restored and verified everything. started loop again (log9.8)
233 (05:38): standby okay
260 (06:13): standby failed. Other CRCs okay. Corruption:
-000014b0 00 c0 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0|
+000014b0 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0|
1 (06:17): restored and verified standby. started loop again (log9.9)
3 (06:20): standby okay
75 (07:53): ok
82 (08:02): standby failed. Other CRCs okay. Corruption:
00001500 00 10 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0|
-00001510 00 90 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0|
+*
1 (08:19): restored and verified standby. started loop again (log9.10)
171 (12:00): standby okay
194 (12:30): standby failed. Other CRCs okay. Corruptions:
-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
+00000000 ff ff 28 40 ff ff ff ff ff ff ff ff ff ff ff ff |..(@............|
...
-00000440 00 88 cc 26 00 00 00 00 00 00 00 00 0c 44 00 40 |...&.........D.@|
+00000440 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 40 |...&.........D.@|
1 (12:33): restored and verified standby. started loop again (log9.11)
Summary of recent experiments (all with "loop"):
Last cycle ob- First cycle ob- Non-fatal Offsets of corrupted
served without served with corruption words in standby
fatal corruption fatal corr. (at end)
--------------- --------------- ------------ -------------------------------
70 ( 1h29) 223 ( 4h47) ? 0, 2, 0x5556
613 (13h14) 639 (13h48) ? 0, 0x14b0
23 ( 0h29) 117 ( 2h30) ? 0x5140
1493 (32h16) 1584 (34h14) ? 4, 0xa00; soc-rescue 0x20460
981 (21h12) 1121 (24h13) 980 (21h10) 0, 0x800, 0x1000
517 (11h09) 597 (12h53) - 0x1160
352 ( 7h36) 422 ( 9h06) - 0x2000
973 (21h03) 1002 (22h22) 501 (10h50) 6, 0x2ff2; fn-rescue 0x120000
233 ( 5h02) 260 ( 5h37) - 0x14b0
75 ( 1h36) 82 ( 1h45) - 0x1510
171 ( 3h41) 194 ( 4h11) - 2, 0x440