diff --git a/m1rc3/norruption/1/LOG b/m1rc3/norruption/1/LOG new file mode 100644 index 0000000..6e5972c --- /dev/null +++ b/m1rc3/norruption/1/LOG @@ -0,0 +1,478 @@ +--- Tue 2011-09-06 ------------------------------------------------------------ + +Running "loop": power-cycle, sleep 2 s, jtag-boot, sleep 70 seconds, +which is enough to boot into FN and render "The Tunnel" for a moment, +then power-cycle again (off-time is 5 s). + +Note that the test loop is "open-loop" and will cycle also past any +problems. The first time a corrupt standby (or any other issue) is +observed may therefore be well after the actual event. + +1: started around 11:53 (M1 configuration is original, without locking) +(around 500) visually checked boot process; standby was reached normally + +--- Wed 2011-09-07 ------------------------------------------------------------ + +645: neocon stopped working (around 01:58) +666: detected neocon failure at run 666: restarted neocon; urjtag failed + this cycle; back to normal at 667 +684: checked LEDs again (first time since ~500) and found that standby + may be failing. stopping test at 685 (around 02:50) for + investigation. + +Downloaded the standby bitstream: + + wget https://raw.github.com/milkymist/scripts/master/scripts/reflash_m1.sh + chmod 755 reflash_m1.sh + + ./reflash_m1.sh --read-flash + +Found two corruptions in the standby bitstream: + + diff -u <(hexdump -C standby.fpg) <(hexdump -C /home/root/.qi/milkymist/read-flash/2011...) + +-00000080 00 00 4c 83 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |..L...L......G.C| ++00000080 00 00 4c 83 00 00 4c 87 00 00 c4 80 d8 47 cc 43 |..L...L......G.C| + +-00002840 00 08 cc 26 00 00 00 00 00 00 00 00 0c 44 00 98 |...&.........D..| ++00002840 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 98 |...&.........D..| + +CRC-checked the partitions: + + git clone git://github.com/milkymist/milkymist + cd milkymist/tools/ + gcc -Wall -I. -o flterm flterm.c + wget http://milkymist.org/updates/current/for-rc3/boot.4e53273.bin + ./flterm --port /dev/ttyUSB0 --kernel boot.4e53273.bin + + serialboot + a + + only standby.fpg failed the CRC check + +Reflashed the standby bitstream: + + wget http://milkymist.org/updates/2011-07-13/for-rc3/fjmem.bit + (or http://milkymist.org/updates/fjmem.bit.bz2) + wget http://milkymist.org/updates/current/standby.fpg + + jtag + + cable milkymist + detect + instruction CFG_OUT 000100 BYPASS + instruction CFG_IN 000101 BYPASS + pld load fjmem.bit + initbus fjmem opcode=000010 + frequency 6000000 + detectflash 0 + endian big + flashmem 0 standby.fpg noverify + +M1 enters standby normally again. + +Running "loop2": power-cycle, sleep 2 s, jtag-boot, sleep 10 seconds, +which is enough to begin (but not finish) booting RTEMS, then +power-cycle again (off-time is 5 s). + +1: started around 05:01. Observed until about 200-300 (06:00-06:30) +that standby was okay. +~730 (08:48): observed that standby didn't load anymore (note: due to +a bug in labsw, power is not turned on in about 5-10% of the cycles, +so the real cycle count should be around 650-700.) + +Standby bitstream difference: + +-00000080 00 00 4c 83 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |..L...L......G.C| ++00000080 00 00 00 00 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |......L......G.C| + +Reflashed standby and locked the NOR. Testing with loop2 again. + +1 (09:18): started +... continuing through the night ... + +--- Thu 2011-09-08 ------------------------------------------------------------ + +3483 (03:18): standby is good so far +4325 (07:40): manually ended test. Standby is still good, but starting + with cycle 3704, booting RTEMS failed with + + I: Booting from flash... + I: Loading 1889692 bytes from flash... + E: CRC failed (expected aa12a56a, got 68ec25e6) + +A CRC check yielded: + +Images CRC: + Checking : standby.fpg CRC passed (got c58e8905) + Checking : soc-rescue.fpg CRC passed (got 30dcc535) + Checking : bios-rescue.bin(CRC) CRC passed (got c78353fa) + Checking : splash-rescue.raw CRC passed (got e8ff824f) + Checking : flickernoise.fbi(rescue)(CRC) CRC passed (got aa12a56a) + Checking : soc.fpg CRC passed (got 3a31e737) + Checking : bios.bin(CRC) CRC passed (got 86e23684) + Checking : splash.raw CRC passed (got 978f860c) + Checking : flickernoise.fbi(CRC) CRC failed (expected aa12a56a, got 68ec25e6) + +Read back the FlickerNoise partition with + + readmem 0x920000 0x0400000 fn.bin + +Compare with the original: + + wget http://www.milkymist.org/updates/2011-07-13/flickernoise.fbi + md5sum flickernoise.fbi + 5b7367e71bda306b080bde124615859b flickernoise.fbi + + diff -u <(hexdump -C flickernoise.fbi) <(hexdump -C fn.bin) + +... +-0008a380 28 43 00 00 34 64 00 01 58 44 00 00 5c 60 00 1e |(C..4d..XD..\`..| ++0008a380 28 43 00 00 00 00 00 01 58 44 00 00 5c 60 00 1e |(C......XD..\`..| +... + +Recovered the FN partition and unlocked the NOR: + + flashmem 0x920000 flickernoise.fbi noverify + unlockflash 0 55 + +New test series with script loop4. This differs from loop2 in that +it uses "pld reconfigure" to return to standby, instead of +power-cycling. If we still observe corruption with this test, then +a software problem would be to blame. + +1 (09:11): started +2509 (19:33): standby looks good + +All CRC checks pass. Verified that NOR was unlocked: + + (load fjmem, etc.) + peek 0 # show old value + poke 0 0x40 0 0x0000 # Word Program + peek 0 # read back status (0x80 if okay, 0x92 if locked) + poke 0 0xff # Read Array (switch back to normal operation) + +Took labsw offline to analyze occasional failure to switch. Failure +was difficult to reproduce. Also opened labsw to tighten a loose nut. +Afterwards (Friday run), labsw showed much fewer switch failures. + +--- Fri 2011-09-09 ------------------------------------------------------------ + +New test with script "loop5". This time, we only power cycle but don't +try to boot out of standby. The purpose of this test is to confirm that +NOR corruption does not occur when powering down while in standby. + +1 (11:04): started +200 (11:28:): stopped to issue "unlockflash 0 105" to make sure all of + the NOR is unlocked, just in case + +Also checked CRCs. All is well. + +1 (11:31): started +2637 (16:53): stopped. standby looks good. + +All partitions pass the CRC check. + +Repeating loop2 to make sure the NOR corruption hasn't disappeared for +an unrelated reason. System is connected to oscilloscope monitoring the +M1 DC in voltage. This connection provides grounding of DC in. + +1 (16:56): started + +--- Sat 2011-09-10 ------------------------------------------------------------ + +2428 (04:57): standby still okay +2440 (05::01): disconnected oscilloscope +2463 (05:08): stopped test + +All partitions pass the CRC check. Read back the standby partition and +also found no corruption in bitwise comparison. Furthermore, the unused +area showed the expected 0xffff pattern. + +1 (05:14): restarted test, without oscilloscope. +2213 (16:11): standby still okay + +All partitions pass the CRC check. Unused area of standby shows 0xffff. + +Prepared new test (loop7): like loop2, but make a "false start" of +turning on both channels and immediately turn them off again, wait 16 +seconds, and only then power up properly. This would roughly correspond +to labsw failing to turn on, as observed in the test runs in which NOR +corruption occurred. + +1 (16:27): started loop7 test +... continuing through the night ... + +--- Sun 2011-09-11 ------------------------------------------------------------ + +2001 (11:58): standby okay + +All partitions pass the CRC check. Unused area of standby shows 0xffff. + +Confirmed writability of NOR at address 0x80000 and at address 0. +Instructions used at address 0x80000: + + jtag> peek 0x80000 + URJ_BUS_READ(0x00080000) = 0xFFFF (65535) + jtag> poke 0x80000 0x40 0x80000 0xffee + jtag> peek 0x80000 + URJ_BUS_READ(0x00080000) = 0x0080 (128) + jtag> poke 0 0xff + jtag> peek 0x80000 + URJ_BUS_READ(0x00080000) = 0xFFEE (65518) + +--- Mon 2011-09-12 ------------------------------------------------------------ + +loop8 is similar to loop7. It increases the "false on" period to 10 ms, +which is enough to make the M1 power LED flash. It reduces the cool off +period after the false on. + +1 (08:11): started loop8 test +2120 (19:50): standby okay. All partitions pass CRC check. + +Going back to the beginning. Test loop (1) runs all the way to rendering. +Maybe it is necessary after all ... + +1 (19:52): started loop (1) test (serial console logged in file log9) +70 (21:21): standby okay + +--- Tue 2011-09-13 ------------------------------------------------------------ + +223 (00:39): standby failure + +Several corruptions were found: + +-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| ++00000000 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff ff |................| +... + 00005510 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| + * ++00005550 ff ff ff ff ff ff d6 10 ff ff ff ff ff ff ff ff |................| ++00005560 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| ++* + +No CRC errors in other partitions. + +1 (00:50): restored standby partition. started loop again (log9.1) +613 (14:04): standby okay +639 (14:38): standby failure. Other CRCs okay. Corruption: + +... +-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| ++00000000 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| +... +-000014b0 00 c0 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0| ++000014b0 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0| +... + +1 (14:46): restored and verified standby partition. started loop again (log9.2) +23 (15:15): ok +117 (17:16): standby failure. Other CRCs okay. Corruption: + +... + 000050e0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| + * ++00005140 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| +... + +1 (17:23): restored and verified standby partition. started loop again (log9.3) +1 (17:30): restarted after X crash +... continuing through the night ... + +--- Wed 2011-09-14 ------------------------------------------------------------ + +... continuing through the whole day ... + +--- Thu 2011-09-15 ------------------------------------------------------------ + +1493 (01:46): standby okay +1584 (03:44): standby failure. CRC failure in standby and soc-rescue. + +Corruption in standby: + +-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| ++00000000 ff ff ff ff 00 00 ff ff ff ff ff ff ff ff ff ff |................| +... +-00000a00 00 98 cc 26 00 00 00 00 00 00 00 00 0c 44 00 a0 |...&.........D..| ++00000a00 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 a0 |...&.........D..| +... + +Read back soc-rescue: + +jtag> readmem 0x0A0000 0x0180000 soc-rescue.bin + +Corruption: + +-00020460 50 00 50 00 00 0f 55 5f ff ff dd ee 00 00 00 00 |P.P...U_........| ++00020460 00 00 50 00 00 0f 55 5f ff ff dd ee 00 00 00 00 |..P...U_........| + +Restore standby, then soc-rescue: + +wget http://milkymist.org/updates/current/soc-rescue.fpg +jtag> flashmem 0xA0000 soc-rescue.fpg noverify + +1 (04:00): verified CRCs. started loop again (log9.4) +750 (20:12): standby okay. Interrupted to dump first 256 bytes of NOR with + script "peek". No differences found. +752 (20:13): resumed. standby okay +... continuing into the night ... + +--- Fri 2011-09-16 ------------------------------------------------------------ + +980 (01:10): standby okay. Interrupted to run "peek". Found non-fatal damage: + +-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| ++00000000 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| + +981 (01:12): resumed. standby okay. +1121 (04:13): standby failed. Other CRCs okay. Corruptions: + +-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| ++00000000 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| +... +-00000800 00 08 cc 26 00 00 00 00 00 00 00 00 0c 44 00 20 |...&.........D. | ++00000800 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 20 |...&.........D. | +... +-00001000 00 c8 cc 26 00 00 00 00 00 00 00 00 0c 44 00 90 |...&.........D..| ++00001000 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 90 |...&.........D..| + +1 (04:20): restored and verified standby. started loop again (log9.5) +516 (15:28): standby okay. Interrupted to "peek". No differences found. +517 (15:29): resumed. standby okay. +597 (17:13): standby failed. Other CRCs okay. Corruption: + +-00001160 00 d0 cc 26 00 00 00 00 00 00 00 00 0c 44 00 50 |...&.........D.P| ++00001160 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 50 |...&.........D.P| + +1 (17:21): restored and verified standby. started loop again (log9.6) +... continuing into the night ... + +--- Sat 2011-09-17 ------------------------------------------------------------ + +352 (00:57): standby okay +422 (02:27): standby failed. Other CRCs okay. Corruption: + +-00002000 00 70 cc 26 00 00 00 00 00 00 00 00 0c 44 00 28 |.p.&.........D.(| ++00002000 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 28 |...&.........D.(| + +1 (02:36): restored and verified standby. started loop again (log9.7) +501 (13:25): standby okay. Interrupted to run "peek". Found non-fatal damage: + +-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| ++00000000 ff ff ff ff ff ff 81 4c ff ff ff ff ff ff ff ff |.......L........| + +1 (13:26): restarted (oops). standby okay. (still log9.7) +473 (23:39) standby okay +... continuing into the next day ... + +--- Sun 2011-09-18 ------------------------------------------------------------ + +500 (00:14): standby failed. CRC failed in flickernoise-rescue, too. + +Corruptions in standby: + +-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| ++00000000 ff ff ff ff 00 00 81 4c ff ff ff ff ff ff ff ff |.......L........| +... +-00002ff0 00 20 cc 26 00 00 00 00 00 00 00 00 0c 44 00 78 |. .&.........D.x| ++00002ff0 00 20 c4 04 00 00 00 00 00 00 00 00 0c 44 00 78 |. ...........D.x| + +Corruption in flickernoise-rescue: + +-00120000 ba 80 08 00 34 10 00 00 e0 00 00 1b 40 c8 00 0a |....4.......@...| ++00120000 00 00 08 00 34 10 00 00 e0 00 00 1b 40 c8 00 0a |....4.......@...| + +1 (00:36): restored and verified everything. started loop again (log9.8) +233 (05:38): standby okay +260 (06:13): standby failed. Other CRCs okay. Corruption: + +-000014b0 00 c0 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0| ++000014b0 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0| + +1 (06:17): restored and verified standby. started loop again (log9.9) +3 (06:20): standby okay +75 (07:53): ok +82 (08:02): standby failed. Other CRCs okay. Corruption: + + 00001500 00 10 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0| +-00001510 00 90 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0| ++* + +1 (08:19): restored and verified standby. started loop again (log9.10) +171 (12:00): standby okay +194 (12:30): standby failed. Other CRCs okay. Corruptions: + +-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| ++00000000 ff ff 28 40 ff ff ff ff ff ff ff ff ff ff ff ff |..(@............| +... +-00000440 00 88 cc 26 00 00 00 00 00 00 00 00 0c 44 00 40 |...&.........D.@| ++00000440 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 40 |...&.........D.@| + +1 (12:33): restored and verified standby. started loop again (log9.11) +162 (16:03): standby okay +168 (16:25): standby failed. Other CRCs okay. Corruption: + +-00002000 00 70 cc 26 00 00 00 00 00 00 00 00 0c 44 00 28 |.p.&.........D.(| ++00002000 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 28 |...&.........D.(| + +1 (16:27): restored and verified standby. started loop again (log9.12) +341 (23:49): standby okay +... continuing into the next day ... + +--- Mon 2011-09-19 ------------------------------------------------------------ + +359 (00:12): standby okay +582 (05:02): standby okay. Interrupted to "peek". No differences found. +583 (05:04): resumed. standby okay. +1023 (14:36): standby okay. Interrupted to "peek". No differences found. +1024 (14:37): resumed. standby okay. +1452 (23:53): standby okay +... continuing into the next day ... + +--- Tue 2011-09-20 ------------------------------------------------------------ + +1520 (01:22): standby okay. Interrupted to "peek". No differences found. +1521 (01:23): resumed. standby okay. +2013 (12:03): standby okay. Interrupted to "peek". No differences found. +2014 (12:04): resumed. standby okay. +2452 (21:33): standby okay +... continuing into the next day ... + +--- Wed 2011-09-21 ------------------------------------------------------------ + +2602 (00:48): standby okay +2603 (00:49): standby okay. Interrupted to "peek". No differences found. +2604 (00:52): resumed. standby okay. +2999 (09:25): standby okay. Interrupted to "peek". No differences found. +3000 (09:27): resumed. standby okay. +3488 (20:01): standby okay. Interrupted to "peek". No differences found. +3489 (20:02): resumed. standby okay. +3561 (21:36): standby okay +... continuing into the next day ... + +--- Thu 2011-09-22 ------------------------------------------------------------ + +3673 (00:01): standby okay +3972 (06:30): standby okay. Interrupted to "peek". No differences found. +3973 (06:32): resumed. standby okay. +4000 (07:08): standby okay. giving up. + + +Summary of recent experiments (all with "loop"): + +Last cycle ob- First cycle ob- Non-fatal Offsets of corrupted +served without served with corruption words in standby +fatal corruption fatal corr. (at end) +--------------- --------------- ------------ ------------------------------- + 70 ( 1h29) 223 ( 4h47) ? 0, 2, 0x5556 + 613 (13h14) 639 (13h48) ? 0, 0x14b0 + 23 ( 0h29) 117 ( 2h30) ? 0x5140 +1493 (32h16) 1584 (34h14) ? 4, 0xa00; soc-rescue 0x20460 + 981 (21h12) 1121 (24h13) 980 (21h10) 0, 0x800, 0x1000 + 517 (11h09) 597 (12h53) - 0x1160 + 352 ( 7h36) 422 ( 9h06) - 0x2000 + 973 (21h03) 1002 (22h22) 501 (10h50) 6, 0x2ff2; fn-rescue 0x120000 + 233 ( 5h02) 260 ( 5h37) - 0x14b0 + 75 ( 1h36) 82 ( 1h45) - 0x1510 + 171 ( 3h41) 194 ( 4h11) - 2, 0x440 + 162 ( 3h30) 168 ( 3h52) - 0x2000 diff --git a/m1rc3/norruption/README b/m1rc3/norruption/1/README similarity index 100% rename from m1rc3/norruption/README rename to m1rc3/norruption/1/README diff --git a/m1rc3/norruption/loop b/m1rc3/norruption/1/loop similarity index 100% rename from m1rc3/norruption/loop rename to m1rc3/norruption/1/loop diff --git a/m1rc3/norruption/loop2 b/m1rc3/norruption/1/loop2 similarity index 100% rename from m1rc3/norruption/loop2 rename to m1rc3/norruption/1/loop2 diff --git a/m1rc3/norruption/loop4 b/m1rc3/norruption/1/loop4 similarity index 100% rename from m1rc3/norruption/loop4 rename to m1rc3/norruption/1/loop4 diff --git a/m1rc3/norruption/loop5 b/m1rc3/norruption/1/loop5 similarity index 100% rename from m1rc3/norruption/loop5 rename to m1rc3/norruption/1/loop5 diff --git a/m1rc3/norruption/loop7 b/m1rc3/norruption/1/loop7 similarity index 100% rename from m1rc3/norruption/loop7 rename to m1rc3/norruption/1/loop7 diff --git a/m1rc3/norruption/peek b/m1rc3/norruption/1/peek similarity index 100% rename from m1rc3/norruption/peek rename to m1rc3/norruption/1/peek diff --git a/m1rc3/norruption/mtemp b/m1rc3/norruption/2/mtemp similarity index 100% rename from m1rc3/norruption/mtemp rename to m1rc3/norruption/2/mtemp diff --git a/m1rc3/norruption/tco.py b/m1rc3/norruption/2/tco.py similarity index 100% rename from m1rc3/norruption/tco.py rename to m1rc3/norruption/2/tco.py diff --git a/m1rc3/norruption/LOG b/m1rc3/norruption/LOG index 6e5972c..4a8fd22 100644 --- a/m1rc3/norruption/LOG +++ b/m1rc3/norruption/LOG @@ -1,478 +1,2 @@ ---- Tue 2011-09-06 ------------------------------------------------------------ - -Running "loop": power-cycle, sleep 2 s, jtag-boot, sleep 70 seconds, -which is enough to boot into FN and render "The Tunnel" for a moment, -then power-cycle again (off-time is 5 s). - -Note that the test loop is "open-loop" and will cycle also past any -problems. The first time a corrupt standby (or any other issue) is -observed may therefore be well after the actual event. - -1: started around 11:53 (M1 configuration is original, without locking) -(around 500) visually checked boot process; standby was reached normally - ---- Wed 2011-09-07 ------------------------------------------------------------ - -645: neocon stopped working (around 01:58) -666: detected neocon failure at run 666: restarted neocon; urjtag failed - this cycle; back to normal at 667 -684: checked LEDs again (first time since ~500) and found that standby - may be failing. stopping test at 685 (around 02:50) for - investigation. - -Downloaded the standby bitstream: - - wget https://raw.github.com/milkymist/scripts/master/scripts/reflash_m1.sh - chmod 755 reflash_m1.sh - - ./reflash_m1.sh --read-flash - -Found two corruptions in the standby bitstream: - - diff -u <(hexdump -C standby.fpg) <(hexdump -C /home/root/.qi/milkymist/read-flash/2011...) - --00000080 00 00 4c 83 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |..L...L......G.C| -+00000080 00 00 4c 83 00 00 4c 87 00 00 c4 80 d8 47 cc 43 |..L...L......G.C| - --00002840 00 08 cc 26 00 00 00 00 00 00 00 00 0c 44 00 98 |...&.........D..| -+00002840 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 98 |...&.........D..| - -CRC-checked the partitions: - - git clone git://github.com/milkymist/milkymist - cd milkymist/tools/ - gcc -Wall -I. -o flterm flterm.c - wget http://milkymist.org/updates/current/for-rc3/boot.4e53273.bin - ./flterm --port /dev/ttyUSB0 --kernel boot.4e53273.bin - - serialboot - a - - only standby.fpg failed the CRC check - -Reflashed the standby bitstream: - - wget http://milkymist.org/updates/2011-07-13/for-rc3/fjmem.bit - (or http://milkymist.org/updates/fjmem.bit.bz2) - wget http://milkymist.org/updates/current/standby.fpg - - jtag - - cable milkymist - detect - instruction CFG_OUT 000100 BYPASS - instruction CFG_IN 000101 BYPASS - pld load fjmem.bit - initbus fjmem opcode=000010 - frequency 6000000 - detectflash 0 - endian big - flashmem 0 standby.fpg noverify - -M1 enters standby normally again. - -Running "loop2": power-cycle, sleep 2 s, jtag-boot, sleep 10 seconds, -which is enough to begin (but not finish) booting RTEMS, then -power-cycle again (off-time is 5 s). - -1: started around 05:01. Observed until about 200-300 (06:00-06:30) -that standby was okay. -~730 (08:48): observed that standby didn't load anymore (note: due to -a bug in labsw, power is not turned on in about 5-10% of the cycles, -so the real cycle count should be around 650-700.) - -Standby bitstream difference: - --00000080 00 00 4c 83 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |..L...L......G.C| -+00000080 00 00 00 00 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |......L......G.C| - -Reflashed standby and locked the NOR. Testing with loop2 again. - -1 (09:18): started -... continuing through the night ... - ---- Thu 2011-09-08 ------------------------------------------------------------ - -3483 (03:18): standby is good so far -4325 (07:40): manually ended test. Standby is still good, but starting - with cycle 3704, booting RTEMS failed with - - I: Booting from flash... - I: Loading 1889692 bytes from flash... - E: CRC failed (expected aa12a56a, got 68ec25e6) - -A CRC check yielded: - -Images CRC: - Checking : standby.fpg CRC passed (got c58e8905) - Checking : soc-rescue.fpg CRC passed (got 30dcc535) - Checking : bios-rescue.bin(CRC) CRC passed (got c78353fa) - Checking : splash-rescue.raw CRC passed (got e8ff824f) - Checking : flickernoise.fbi(rescue)(CRC) CRC passed (got aa12a56a) - Checking : soc.fpg CRC passed (got 3a31e737) - Checking : bios.bin(CRC) CRC passed (got 86e23684) - Checking : splash.raw CRC passed (got 978f860c) - Checking : flickernoise.fbi(CRC) CRC failed (expected aa12a56a, got 68ec25e6) - -Read back the FlickerNoise partition with - - readmem 0x920000 0x0400000 fn.bin - -Compare with the original: - - wget http://www.milkymist.org/updates/2011-07-13/flickernoise.fbi - md5sum flickernoise.fbi - 5b7367e71bda306b080bde124615859b flickernoise.fbi - - diff -u <(hexdump -C flickernoise.fbi) <(hexdump -C fn.bin) - -... --0008a380 28 43 00 00 34 64 00 01 58 44 00 00 5c 60 00 1e |(C..4d..XD..\`..| -+0008a380 28 43 00 00 00 00 00 01 58 44 00 00 5c 60 00 1e |(C......XD..\`..| -... - -Recovered the FN partition and unlocked the NOR: - - flashmem 0x920000 flickernoise.fbi noverify - unlockflash 0 55 - -New test series with script loop4. This differs from loop2 in that -it uses "pld reconfigure" to return to standby, instead of -power-cycling. If we still observe corruption with this test, then -a software problem would be to blame. - -1 (09:11): started -2509 (19:33): standby looks good - -All CRC checks pass. Verified that NOR was unlocked: - - (load fjmem, etc.) - peek 0 # show old value - poke 0 0x40 0 0x0000 # Word Program - peek 0 # read back status (0x80 if okay, 0x92 if locked) - poke 0 0xff # Read Array (switch back to normal operation) - -Took labsw offline to analyze occasional failure to switch. Failure -was difficult to reproduce. Also opened labsw to tighten a loose nut. -Afterwards (Friday run), labsw showed much fewer switch failures. - ---- Fri 2011-09-09 ------------------------------------------------------------ - -New test with script "loop5". This time, we only power cycle but don't -try to boot out of standby. The purpose of this test is to confirm that -NOR corruption does not occur when powering down while in standby. - -1 (11:04): started -200 (11:28:): stopped to issue "unlockflash 0 105" to make sure all of - the NOR is unlocked, just in case - -Also checked CRCs. All is well. - -1 (11:31): started -2637 (16:53): stopped. standby looks good. - -All partitions pass the CRC check. - -Repeating loop2 to make sure the NOR corruption hasn't disappeared for -an unrelated reason. System is connected to oscilloscope monitoring the -M1 DC in voltage. This connection provides grounding of DC in. - -1 (16:56): started - ---- Sat 2011-09-10 ------------------------------------------------------------ - -2428 (04:57): standby still okay -2440 (05::01): disconnected oscilloscope -2463 (05:08): stopped test - -All partitions pass the CRC check. Read back the standby partition and -also found no corruption in bitwise comparison. Furthermore, the unused -area showed the expected 0xffff pattern. - -1 (05:14): restarted test, without oscilloscope. -2213 (16:11): standby still okay - -All partitions pass the CRC check. Unused area of standby shows 0xffff. - -Prepared new test (loop7): like loop2, but make a "false start" of -turning on both channels and immediately turn them off again, wait 16 -seconds, and only then power up properly. This would roughly correspond -to labsw failing to turn on, as observed in the test runs in which NOR -corruption occurred. - -1 (16:27): started loop7 test -... continuing through the night ... - ---- Sun 2011-09-11 ------------------------------------------------------------ - -2001 (11:58): standby okay - -All partitions pass the CRC check. Unused area of standby shows 0xffff. - -Confirmed writability of NOR at address 0x80000 and at address 0. -Instructions used at address 0x80000: - - jtag> peek 0x80000 - URJ_BUS_READ(0x00080000) = 0xFFFF (65535) - jtag> poke 0x80000 0x40 0x80000 0xffee - jtag> peek 0x80000 - URJ_BUS_READ(0x00080000) = 0x0080 (128) - jtag> poke 0 0xff - jtag> peek 0x80000 - URJ_BUS_READ(0x00080000) = 0xFFEE (65518) - ---- Mon 2011-09-12 ------------------------------------------------------------ - -loop8 is similar to loop7. It increases the "false on" period to 10 ms, -which is enough to make the M1 power LED flash. It reduces the cool off -period after the false on. - -1 (08:11): started loop8 test -2120 (19:50): standby okay. All partitions pass CRC check. - -Going back to the beginning. Test loop (1) runs all the way to rendering. -Maybe it is necessary after all ... - -1 (19:52): started loop (1) test (serial console logged in file log9) -70 (21:21): standby okay - ---- Tue 2011-09-13 ------------------------------------------------------------ - -223 (00:39): standby failure - -Several corruptions were found: - --00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| -+00000000 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff ff |................| -... - 00005510 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| - * -+00005550 ff ff ff ff ff ff d6 10 ff ff ff ff ff ff ff ff |................| -+00005560 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| -+* - -No CRC errors in other partitions. - -1 (00:50): restored standby partition. started loop again (log9.1) -613 (14:04): standby okay -639 (14:38): standby failure. Other CRCs okay. Corruption: - -... --00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| -+00000000 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| -... --000014b0 00 c0 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0| -+000014b0 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0| -... - -1 (14:46): restored and verified standby partition. started loop again (log9.2) -23 (15:15): ok -117 (17:16): standby failure. Other CRCs okay. Corruption: - -... - 000050e0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| - * -+00005140 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| -... - -1 (17:23): restored and verified standby partition. started loop again (log9.3) -1 (17:30): restarted after X crash -... continuing through the night ... - ---- Wed 2011-09-14 ------------------------------------------------------------ - -... continuing through the whole day ... - ---- Thu 2011-09-15 ------------------------------------------------------------ - -1493 (01:46): standby okay -1584 (03:44): standby failure. CRC failure in standby and soc-rescue. - -Corruption in standby: - --00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| -+00000000 ff ff ff ff 00 00 ff ff ff ff ff ff ff ff ff ff |................| -... --00000a00 00 98 cc 26 00 00 00 00 00 00 00 00 0c 44 00 a0 |...&.........D..| -+00000a00 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 a0 |...&.........D..| -... - -Read back soc-rescue: - -jtag> readmem 0x0A0000 0x0180000 soc-rescue.bin - -Corruption: - --00020460 50 00 50 00 00 0f 55 5f ff ff dd ee 00 00 00 00 |P.P...U_........| -+00020460 00 00 50 00 00 0f 55 5f ff ff dd ee 00 00 00 00 |..P...U_........| - -Restore standby, then soc-rescue: - -wget http://milkymist.org/updates/current/soc-rescue.fpg -jtag> flashmem 0xA0000 soc-rescue.fpg noverify - -1 (04:00): verified CRCs. started loop again (log9.4) -750 (20:12): standby okay. Interrupted to dump first 256 bytes of NOR with - script "peek". No differences found. -752 (20:13): resumed. standby okay -... continuing into the night ... - ---- Fri 2011-09-16 ------------------------------------------------------------ - -980 (01:10): standby okay. Interrupted to run "peek". Found non-fatal damage: - --00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| -+00000000 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| - -981 (01:12): resumed. standby okay. -1121 (04:13): standby failed. Other CRCs okay. Corruptions: - --00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| -+00000000 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| -... --00000800 00 08 cc 26 00 00 00 00 00 00 00 00 0c 44 00 20 |...&.........D. | -+00000800 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 20 |...&.........D. | -... --00001000 00 c8 cc 26 00 00 00 00 00 00 00 00 0c 44 00 90 |...&.........D..| -+00001000 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 90 |...&.........D..| - -1 (04:20): restored and verified standby. started loop again (log9.5) -516 (15:28): standby okay. Interrupted to "peek". No differences found. -517 (15:29): resumed. standby okay. -597 (17:13): standby failed. Other CRCs okay. Corruption: - --00001160 00 d0 cc 26 00 00 00 00 00 00 00 00 0c 44 00 50 |...&.........D.P| -+00001160 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 50 |...&.........D.P| - -1 (17:21): restored and verified standby. started loop again (log9.6) -... continuing into the night ... - ---- Sat 2011-09-17 ------------------------------------------------------------ - -352 (00:57): standby okay -422 (02:27): standby failed. Other CRCs okay. Corruption: - --00002000 00 70 cc 26 00 00 00 00 00 00 00 00 0c 44 00 28 |.p.&.........D.(| -+00002000 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 28 |...&.........D.(| - -1 (02:36): restored and verified standby. started loop again (log9.7) -501 (13:25): standby okay. Interrupted to run "peek". Found non-fatal damage: - --00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| -+00000000 ff ff ff ff ff ff 81 4c ff ff ff ff ff ff ff ff |.......L........| - -1 (13:26): restarted (oops). standby okay. (still log9.7) -473 (23:39) standby okay -... continuing into the next day ... - ---- Sun 2011-09-18 ------------------------------------------------------------ - -500 (00:14): standby failed. CRC failed in flickernoise-rescue, too. - -Corruptions in standby: - --00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| -+00000000 ff ff ff ff 00 00 81 4c ff ff ff ff ff ff ff ff |.......L........| -... --00002ff0 00 20 cc 26 00 00 00 00 00 00 00 00 0c 44 00 78 |. .&.........D.x| -+00002ff0 00 20 c4 04 00 00 00 00 00 00 00 00 0c 44 00 78 |. ...........D.x| - -Corruption in flickernoise-rescue: - --00120000 ba 80 08 00 34 10 00 00 e0 00 00 1b 40 c8 00 0a |....4.......@...| -+00120000 00 00 08 00 34 10 00 00 e0 00 00 1b 40 c8 00 0a |....4.......@...| - -1 (00:36): restored and verified everything. started loop again (log9.8) -233 (05:38): standby okay -260 (06:13): standby failed. Other CRCs okay. Corruption: - --000014b0 00 c0 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0| -+000014b0 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0| - -1 (06:17): restored and verified standby. started loop again (log9.9) -3 (06:20): standby okay -75 (07:53): ok -82 (08:02): standby failed. Other CRCs okay. Corruption: - - 00001500 00 10 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0| --00001510 00 90 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0| -+* - -1 (08:19): restored and verified standby. started loop again (log9.10) -171 (12:00): standby okay -194 (12:30): standby failed. Other CRCs okay. Corruptions: - --00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| -+00000000 ff ff 28 40 ff ff ff ff ff ff ff ff ff ff ff ff |..(@............| -... --00000440 00 88 cc 26 00 00 00 00 00 00 00 00 0c 44 00 40 |...&.........D.@| -+00000440 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 40 |...&.........D.@| - -1 (12:33): restored and verified standby. started loop again (log9.11) -162 (16:03): standby okay -168 (16:25): standby failed. Other CRCs okay. Corruption: - --00002000 00 70 cc 26 00 00 00 00 00 00 00 00 0c 44 00 28 |.p.&.........D.(| -+00002000 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 28 |...&.........D.(| - -1 (16:27): restored and verified standby. started loop again (log9.12) -341 (23:49): standby okay -... continuing into the next day ... - ---- Mon 2011-09-19 ------------------------------------------------------------ - -359 (00:12): standby okay -582 (05:02): standby okay. Interrupted to "peek". No differences found. -583 (05:04): resumed. standby okay. -1023 (14:36): standby okay. Interrupted to "peek". No differences found. -1024 (14:37): resumed. standby okay. -1452 (23:53): standby okay -... continuing into the next day ... - ---- Tue 2011-09-20 ------------------------------------------------------------ - -1520 (01:22): standby okay. Interrupted to "peek". No differences found. -1521 (01:23): resumed. standby okay. -2013 (12:03): standby okay. Interrupted to "peek". No differences found. -2014 (12:04): resumed. standby okay. -2452 (21:33): standby okay -... continuing into the next day ... - ---- Wed 2011-09-21 ------------------------------------------------------------ - -2602 (00:48): standby okay -2603 (00:49): standby okay. Interrupted to "peek". No differences found. -2604 (00:52): resumed. standby okay. -2999 (09:25): standby okay. Interrupted to "peek". No differences found. -3000 (09:27): resumed. standby okay. -3488 (20:01): standby okay. Interrupted to "peek". No differences found. -3489 (20:02): resumed. standby okay. -3561 (21:36): standby okay -... continuing into the next day ... - ---- Thu 2011-09-22 ------------------------------------------------------------ - -3673 (00:01): standby okay -3972 (06:30): standby okay. Interrupted to "peek". No differences found. -3973 (06:32): resumed. standby okay. -4000 (07:08): standby okay. giving up. - - -Summary of recent experiments (all with "loop"): - -Last cycle ob- First cycle ob- Non-fatal Offsets of corrupted -served without served with corruption words in standby -fatal corruption fatal corr. (at end) ---------------- --------------- ------------ ------------------------------- - 70 ( 1h29) 223 ( 4h47) ? 0, 2, 0x5556 - 613 (13h14) 639 (13h48) ? 0, 0x14b0 - 23 ( 0h29) 117 ( 2h30) ? 0x5140 -1493 (32h16) 1584 (34h14) ? 4, 0xa00; soc-rescue 0x20460 - 981 (21h12) 1121 (24h13) 980 (21h10) 0, 0x800, 0x1000 - 517 (11h09) 597 (12h53) - 0x1160 - 352 ( 7h36) 422 ( 9h06) - 0x2000 - 973 (21h03) 1002 (22h22) 501 (10h50) 6, 0x2ff2; fn-rescue 0x120000 - 233 ( 5h02) 260 ( 5h37) - 0x14b0 - 75 ( 1h36) 82 ( 1h45) - 0x1510 - 171 ( 3h41) 194 ( 4h11) - 2, 0x440 - 162 ( 3h30) 168 ( 3h52) - 0x2000 +The log from the first round of tests has moved. It's now in +1/LOG