--- Tue 2011-09-06 ------------------------------------------------------------

Running "loop": power-cycle, sleep 2 s, jtag-boot, sleep 70 seconds, which is
enough to boot into FN and render "The Tunnel" for a moment, then power-cycle
again (off-time is 5 s).

Note that the test loop is "open-loop" and will keep cycling past any
problems. The first time a corrupt standby (or any other issue) is observed
may therefore be well after the actual event.

1: started around 11:53 (M1 configuration is original, without locking)
(around 500) visually checked boot process; standby was reached normally

--- Wed 2011-09-07 ------------------------------------------------------------

645: neocon stopped working (around 01:58)
666: detected neocon failure at run 666: restarted neocon; urjtag failed
     this cycle; back to normal at 667
684: checked LEDs again (first time since ~500) and found that standby may
     be failing. Stopping test at 685 (around 02:50) for investigation.

Downloaded the standby bitstream:

  wget https://raw.github.com/milkymist/scripts/master/scripts/reflash_m1.sh
  chmod 755 reflash_m1.sh
  ./reflash_m1.sh --read-flash

Found two corruptions in the standby bitstream:

  diff -u <(hexdump -C standby.fpg) <(hexdump -C /home/root/.qi/milkymist/read-flash/2011...)
  -00000080 00 00 4c 83 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |..L...L......G.C|
  +00000080 00 00 4c 83 00 00 4c 87 00 00 c4 80 d8 47 cc 43 |..L...L......G.C|
  -00002840 00 08 cc 26 00 00 00 00 00 00 00 00 0c 44 00 98 |...&.........D..|
  +00002840 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 98 |...&.........D..|

CRC-checked the partitions:

  git clone git://github.com/milkymist/milkymist
  cd milkymist/tools/
  gcc -Wall -I. -o flterm flterm.c
  wget http://milkymist.org/updates/current/for-rc3/boot.4e53273.bin
  ./flterm --port /dev/ttyUSB0 --kernel boot.4e53273.bin
  serialboot
  a

only standby.fpg failed the CRC check

Reflashed the standby bitstream:

  wget http://milkymist.org/updates/2011-07-13/for-rc3/fjmem.bit
  (or http://milkymist.org/updates/fjmem.bit.bz2)
  wget http://milkymist.org/updates/current/standby.fpg
  jtag
  cable milkymist
  detect
  instruction CFG_OUT 000100 BYPASS
  instruction CFG_IN 000101 BYPASS
  pld load fjmem.bit
  initbus fjmem opcode=000010
  frequency 6000000
  detectflash 0
  endian big
  flashmem 0 standby.fpg noverify

M1 enters standby normally again.

Running "loop2": power-cycle, sleep 2 s, jtag-boot, sleep 10 seconds, which is
enough to begin (but not finish) booting RTEMS, then power-cycle again
(off-time is 5 s).

1: started around 05:01. Observed until about 200-300 (06:00-06:30) that
   standby was okay.
~730 (08:48): observed that standby didn't load anymore
   (note: due to a bug in labsw, power is not turned on in about 5-10% of
   the cycles, so the real cycle count should be around 650-700.)

Standby bitstream difference:

  -00000080 00 00 4c 83 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |..L...L......G.C|
  +00000080 00 00 00 00 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |......L......G.C|

Reflashed standby and locked the NOR. Testing with loop2 again.

1 (09:18): started
... continuing through the night ...

--- Thu 2011-09-08 ------------------------------------------------------------

3483 (03:18): standby is good so far
4325 (07:40): manually ended test. Standby is still good, but starting with
   cycle 3704, booting RTEMS failed with

  I: Booting from flash...
  I: Loading 1889692 bytes from flash...
  E: CRC failed (expected aa12a56a, got 68ec25e6)

A CRC check yielded:

  Images CRC:
  Checking : standby.fpg CRC passed (got c58e8905)
  Checking : soc-rescue.fpg CRC passed (got 30dcc535)
  Checking : bios-rescue.bin(CRC) CRC passed (got c78353fa)
  Checking : splash-rescue.raw CRC passed (got e8ff824f)
  Checking : flickernoise.fbi(rescue)(CRC) CRC passed (got aa12a56a)
  Checking : soc.fpg CRC passed (got 3a31e737)
  Checking : bios.bin(CRC) CRC passed (got 86e23684)
  Checking : splash.raw CRC passed (got 978f860c)
  Checking : flickernoise.fbi(CRC) CRC failed (expected aa12a56a, got 68ec25e6)

Read back the FlickerNoise partition with

  readmem 0x920000 0x0400000 fn.bin

Compare with the original:

  wget http://www.milkymist.org/updates/2011-07-13/flickernoise.fbi
  md5sum flickernoise.fbi
  5b7367e71bda306b080bde124615859b flickernoise.fbi
  diff -u <(hexdump -C flickernoise.fbi) <(hexdump -C fn.bin)
  ...
  -0008a380 28 43 00 00 34 64 00 01 58 44 00 00 5c 60 00 1e |(C..4d..XD..\`..|
  +0008a380 28 43 00 00 00 00 00 01 58 44 00 00 5c 60 00 1e |(C......XD..\`..|
  ...

Recovered the FN partition and unlocked the NOR:

  flashmem 0x920000 flickernoise.fbi noverify
  unlockflash 0 55

New test series with script loop4. This differs from loop2 in that it uses
"pld reconfigure" to return to standby, instead of power-cycling. If we still
observe corruption with this test, then a software problem would be to blame.

1 (09:11): started
2509 (19:33): standby looks good

All CRC checks pass.

Verified that NOR was unlocked:

  (load fjmem, etc.)
  peek 0                  # show old value
  poke 0 0x40 0 0x0000    # Word Program
  peek 0                  # read back status (0x80 if okay, 0x92 if locked)
  poke 0 0xff             # Read Array (switch back to normal operation)

--- Fri 2011-09-09 ------------------------------------------------------------

New test with script "loop5". This time, we only power cycle but don't try to
boot out of standby. The purpose of this test is to confirm that NOR
corruption does not occur when powering down while in standby.

1 (11:04): started
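
Side note on the hexdump diffs recorded in this log (standby at 0x80 and
0x2840, FN at 0x8a380): every changed byte seems to differ only by bits going
from 1 to 0. That would be consistent with stray program operations rather
than erases, since programming NOR can only clear bits, but four rows are too
few to prove anything. The sketch below is one way to check this on hexdump
rows or on whole images. The function names (bit_transitions,
only_clears_bits) and the ROWS labels are made up for illustration and are
not part of any Milkymist tool.

```python
def bit_transitions(original: bytes, readback: bytes):
    """List (offset, cleared_mask, raised_mask) for every differing byte."""
    if len(original) != len(readback):
        raise ValueError("images must have the same length")
    changes = []
    for offset, (a, b) in enumerate(zip(original, readback)):
        if a != b:
            cleared = a & ~b & 0xFF  # bits that were 1 and read back as 0
            raised = ~a & b & 0xFF   # bits that were 0 and read back as 1
            changes.append((offset, cleared, raised))
    return changes


def only_clears_bits(original: bytes, readback: bytes) -> bool:
    """True if no bit went from 0 to 1 between original and readback."""
    return all(raised == 0
               for _, _, raised in bit_transitions(original, readback))


# 16-byte rows copied from the diffs in this log: (label, good, corrupted).
# The labels are only for this example.
ROWS = [
    ("standby 0x00000080 (loop)",
     "00 00 4c 83 00 00 4c 87 00 00 cc 85 d8 47 cc 43",
     "00 00 4c 83 00 00 4c 87 00 00 c4 80 d8 47 cc 43"),
    ("standby 0x00002840 (loop)",
     "00 08 cc 26 00 00 00 00 00 00 00 00 0c 44 00 98",
     "00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 98"),
    ("standby 0x00000080 (loop2)",
     "00 00 4c 83 00 00 4c 87 00 00 cc 85 d8 47 cc 43",
     "00 00 00 00 00 00 4c 87 00 00 cc 85 d8 47 cc 43"),
    ("flickernoise 0x0008a380",
     "28 43 00 00 34 64 00 01 58 44 00 00 5c 60 00 1e",
     "28 43 00 00 00 00 00 01 58 44 00 00 5c 60 00 1e"),
]

if __name__ == "__main__":
    for label, good_hex, bad_hex in ROWS:
        good, bad = bytes.fromhex(good_hex), bytes.fromhex(bad_hex)
        for offset, cleared, raised in bit_transitions(good, bad):
            print(f"{label}: byte +{offset} "
                  f"cleared=0x{cleared:02x} raised=0x{raised:02x}")
        print(f"{label}: only 1->0 transitions: {only_clears_bits(good, bad)}")
```

To run the same check on complete images, read both files with
open(path, "rb").read() and pass the two byte strings to bit_transitions.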