--- Tue 2011-09-06 ------------------------------------------------------------

Running "loop": power-cycle, sleep 2 s, jtag-boot, sleep 70 seconds, which is
enough to boot into FN and render "The Tunnel" for a moment, then power-cycle
again (off-time is 5 s).

Note that the test loop is "open-loop" and will cycle also past any problems.
The first time a corrupt standby (or any other issue) is observed may
therefore be well after the actual event.

1: started around 11:53 (M1 configuration is original, without locking)
(around 500) visually checked boot process; standby was reached normally

--- Wed 2011-09-07 ------------------------------------------------------------

645: neocon stopped working (around 01:58)
666: detected neocon failure at run 666: restarted neocon; urjtag failed this
  cycle; back to normal at 667
684: checked LEDs again (first time since ~500) and found that standby may be
  failing. stopping test at 685 (around 02:50) for investigation.

Downloaded the standby bitstream:

  wget https://raw.github.com/milkymist/scripts/master/scripts/reflash_m1.sh
  chmod 755 reflash_m1.sh
  ./reflash_m1.sh --read-flash

Found two corruptions in the standby bitstream:

  diff -u <(hexdump -C standby.fpg) <(hexdump -C /home/root/.qi/milkymist/read-flash/2011...)

-00000080 00 00 4c 83 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |..L...L......G.C|
+00000080 00 00 4c 83 00 00 4c 87 00 00 c4 80 d8 47 cc 43 |..L...L......G.C|

-00002840 00 08 cc 26 00 00 00 00 00 00 00 00 0c 44 00 98 |...&.........D..|
+00002840 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 98 |...&.........D..|

CRC-checked the partitions:

  git clone git://github.com/milkymist/milkymist
  cd milkymist/tools/
  gcc -Wall -I. -o flterm flterm.c
  wget http://milkymist.org/updates/current/for-rc3/boot.4e53273.bin
  ./flterm --port /dev/ttyUSB0 --kernel boot.4e53273.bin
  serialboot a

only standby.fpg failed the CRC check

Reflashed the standby bitstream:

  wget http://milkymist.org/updates/2011-07-13/for-rc3/fjmem.bit
    (or http://milkymist.org/updates/fjmem.bit.bz2)
  wget http://milkymist.org/updates/current/standby.fpg
  jtag
  cable milkymist
  detect
  instruction CFG_OUT 000100 BYPASS
  instruction CFG_IN 000101 BYPASS
  pld load fjmem.bit
  initbus fjmem opcode=000010
  frequency 6000000
  detectflash 0
  endian big
  flashmem 0 standby.fpg noverify

M1 enters standby normally again.

Running "loop2": power-cycle, sleep 2 s, jtag-boot, sleep 10 seconds, which is
enough to begin (but not finish) booting RTEMS, then power-cycle again
(off-time is 5 s).

1: started around 05:01. Observed until about 200-300 (06:00-06:30) that
  standby was okay.
~730 (08:48): observed that standby didn't load anymore (note: due to a bug
  in labsw, power is not turned on in about 5-10% of the cycles, so the real
  cycle count should be around 650-700.)

Standby bitstream difference:

-00000080 00 00 4c 83 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |..L...L......G.C|
+00000080 00 00 00 00 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |......L......G.C|

Reflashed standby and locked the NOR. Testing with loop2 again.

1 (09:18): started
... continuing through the night ...

--- Thu 2011-09-08 ------------------------------------------------------------

3483 (03:18): standby is good so far
4325 (07:40): manually ended test. Standby is still good, but starting with
  cycle 3704, booting RTEMS failed with

  I: Booting from flash...
  I: Loading 1889692 bytes from flash...
  E: CRC failed (expected aa12a56a, got 68ec25e6)

A CRC check yielded:

  Images CRC:
  Checking : standby.fpg CRC passed (got c58e8905)
  Checking : soc-rescue.fpg CRC passed (got 30dcc535)
  Checking : bios-rescue.bin(CRC) CRC passed (got c78353fa)
  Checking : splash-rescue.raw CRC passed (got e8ff824f)
  Checking : flickernoise.fbi(rescue)(CRC) CRC passed (got aa12a56a)
  Checking : soc.fpg CRC passed (got 3a31e737)
  Checking : bios.bin(CRC) CRC passed (got 86e23684)
  Checking : splash.raw CRC passed (got 978f860c)
  Checking : flickernoise.fbi(CRC) CRC failed (expected aa12a56a, got 68ec25e6)

Read back the FlickerNoise partition with

  readmem 0x920000 0x0400000 fn.bin

Compared with the original:

  wget http://www.milkymist.org/updates/2011-07-13/flickernoise.fbi
  md5sum flickernoise.fbi
  5b7367e71bda306b080bde124615859b flickernoise.fbi
  diff -u <(hexdump -C flickernoise.fbi) <(hexdump -C fn.bin)

...
-0008a380 28 43 00 00 34 64 00 01 58 44 00 00 5c 60 00 1e |(C..4d..XD..\`..|
+0008a380 28 43 00 00 00 00 00 01 58 44 00 00 5c 60 00 1e |(C......XD..\`..|
...

Recovered the FN partition and unlocked the NOR:

  flashmem 0x920000 flickernoise.fbi noverify
  unlockflash 0 55

New test series with script loop4. This differs from loop2 in that it uses
"pld reconfigure" to return to standby, instead of power-cycling. If we still
observe corruption with this test, then a software problem would be to blame.

1 (09:11): started
2509 (19:33): standby looks good

All CRC checks pass.

Verified that NOR was unlocked:

  (load fjmem, etc.)
  peek 0                # show old value
  poke 0 0x40 0 0x0000  # Word Program
  peek 0                # read back status (0x80 if okay, 0x92 if locked)
  poke 0 0xff           # Read Array (switch back to normal operation)

Took labsw offline to analyze occasional failure to switch. Failure was
difficult to reproduce. Also opened labsw to tighten a loose nut. Afterwards
(Friday run), labsw showed far fewer switch failures.
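
The bitstream comparisons above were done with diff on hexdump output. As a
complement, here is a minimal Python sketch (not part of the original test
setup) that lists the differing bytes of two images and shows in which
direction the bits changed. The function names and the sample constants are
made up for illustration; the two sample rows are copied from the first
standby diff of 2011-09-07. In the excerpts recorded in this log, all changed
bits went from 1 to 0, which is the direction a NOR program operation moves
bits. That is an observation on the logged excerpts only, not a diagnosis.

```python
def diff_bytes(original: bytes, readback: bytes, base: int = 0):
    """List (offset, old, new, cleared_mask, raised_mask) per differing byte.

    Only the common prefix is compared (zip stops at the shorter input),
    which suits a readback that is longer than the reference image.
    """
    changes = []
    for index, (old, new) in enumerate(zip(original, readback)):
        if old != new:
            cleared = old & ~new & 0xFF  # bits that went 1 -> 0
            raised = ~old & new & 0xFF   # bits that went 0 -> 1
            changes.append((base + index, old, new, cleared, raised))
    return changes


def only_cleared_bits(changes):
    """True if no differing byte has any bit that went 0 -> 1."""
    return all(raised == 0 for (_, _, _, _, raised) in changes)


# Row 0x80 of standby.fpg (reference) and of the flash readback,
# copied from the first diff in this log.
GOOD_ROW = bytes.fromhex("00004c8300004c870000cc85d847cc43")
BAD_ROW = bytes.fromhex("00004c8300004c870000c480d847cc43")

if __name__ == "__main__":
    changes = diff_bytes(GOOD_ROW, BAD_ROW, base=0x80)
    for offset, old, new, cleared, raised in changes:
        print(f"{offset:08x}: {old:02x} -> {new:02x} "
              f"(cleared {cleared:02x}, raised {raised:02x})")
    print("only 1 -> 0 changes:", only_cleared_bits(changes))
```

For whole files, the same functions can be fed with the contents of
standby.fpg and of the readback file, each read in binary mode.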
--- Fri 2011-09-09 ------------------------------------------------------------

New test with script "loop5". This time, we only power cycle but don't try to
boot out of standby. The purpose of this test is to confirm that NOR
corruption does not occur when powering down while in standby.

1 (11:04): started
200 (11:28): stopped to issue "unlockflash 0 105" to make sure all of the NOR
  is unlocked, just in case

Also checked CRCs. All is well.

1 (11:31): started
2637 (16:53): stopped. standby looks good.

All partitions pass the CRC check.

Repeating loop2 to make sure the NOR corruption hasn't disappeared for an
unrelated reason. System is connected to oscilloscope monitoring the M1 DC in
voltage. This connection provides grounding of DC in.

1 (16:56): started

--- Sat 2011-09-10 ------------------------------------------------------------

2428 (04:57): standby still okay
2440 (05:01): disconnected oscilloscope
2463 (05:08): stopped test

All partitions pass the CRC check. Read back the standby partition and also
found no corruption in bitwise comparison. Furthermore, the unused area showed
the expected 0xffff pattern.

1 (05:14): restarted test, without oscilloscope.
2213 (16:11): standby still okay

All partitions pass the CRC check. Unused area of standby shows 0xffff.

Prepared new test (loop7): like loop2, but make a "false start" of turning on
both channels and immediately turn them off again, wait 16 seconds, and only
then power up properly. This would roughly correspond to labsw failing to turn
on, as observed in the test runs in which NOR corruption occurred.

1 (16:27): started loop7 test
5 (16:32): standby okay
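
To make the difference between loop2 and loop7 explicit, the two sequences can
be written down as step lists. This is only a sketch in Python, not the actual
test scripts: the action names (power_off, power_on, jtag_boot) are
placeholders, both labsw channels are treated as one switch, and the order
within one iteration (off, wait the off-time, on, ...) is an assumption based
on the descriptions in this log.

```python
def loop2_steps(off_time_s=5, settle_s=2, boot_wait_s=10):
    """One loop2 iteration as (action, seconds) pairs (assumed ordering)."""
    return [
        ("power_off", 0),
        ("sleep", off_time_s),   # off-time is 5 s
        ("power_on", 0),
        ("sleep", settle_s),     # sleep 2 s before jtag-boot
        ("jtag_boot", 0),
        ("sleep", boot_wait_s),  # RTEMS begins to boot but does not finish
    ]


def loop7_steps(false_start_wait_s=16, **loop2_args):
    """loop2 with a false start inserted before the real power-up."""
    steps = loop2_steps(**loop2_args)
    on_index = steps.index(("power_on", 0))
    false_start = [
        ("power_on", 0),
        ("power_off", 0),        # immediately off again
        ("sleep", false_start_wait_s),
    ]
    return steps[:on_index] + false_start + steps[on_index:]


def nominal_seconds(steps):
    """Sum of the programmed delays; switching and JTAG time are ignored."""
    return sum(seconds for action, seconds in steps if action == "sleep")


if __name__ == "__main__":
    for name, steps in (("loop2", loop2_steps()), ("loop7", loop7_steps())):
        print(name, nominal_seconds(steps), "s nominal:", steps)
```

The nominal figure counts only the programmed sleeps, so real cycles take
longer than that. For comparison, the loop2 run that started at 09:18 reached
cycle 3483 at 03:18 the next day, which works out to roughly 18.6 s per cycle.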