--- Tue 2011-09-06 ------------------------------------------------------------

Running "loop": power-cycle, sleep 2 s, jtag-boot, sleep 70 seconds, which
is enough to boot into FN and render "The Tunnel" for a moment, then
power-cycle again (off-time is 5 s).

Note that the test loop is "open-loop" and will cycle also past any
problems. The first time a corrupt standby (or any other issue) is
observed may therefore be well after the actual event.

1: started around 11:53 (M1 configuration is original, without locking)

(around 500) visually checked boot process; standby was reached normally

--- Wed 2011-09-07 ------------------------------------------------------------

645: neocon stopped working (around 01:58)

666: detected neocon failure at run 666: restarted neocon; urjtag failed
  this cycle; back to normal at 667

684: checked LEDs again (first time since ~500) and found that standby
  may be failing. stopping test at 685 (around 02:50) for investigation.

Downloaded the standby bitstream:

  wget https://raw.github.com/milkymist/scripts/master/scripts/reflash_m1.sh
  chmod 755 reflash_m1.sh
  ./reflash_m1.sh --read-flash

Found two corruptions in the standby bitstream:

  diff -u <(hexdump -C standby.fpg) <(hexdump -C /home/root/.qi/milkymist/read-flash/2011...)

-00000080  00 00 4c 83 00 00 4c 87  00 00 cc 85 d8 47 cc 43  |..L...L......G.C|
+00000080  00 00 4c 83 00 00 4c 87  00 00 c4 80 d8 47 cc 43  |..L...L......G.C|

-00002840  00 08 cc 26 00 00 00 00  00 00 00 00 0c 44 00 98  |...&.........D..|
+00002840  00 00 cc 26 00 00 00 00  00 00 00 00 0c 44 00 98  |...&.........D..|

CRC-checked the partitions:

  git clone git://github.com/milkymist/milkymist
  cd milkymist/tools/
  gcc -Wall -I. -o flterm flterm.c
  wget http://milkymist.org/updates/current/for-rc3/boot.4e53273.bin
  ./flterm --port /dev/ttyUSB0 --kernel boot.4e53273.bin
  serialboot
  a

only standby.fpg failed the CRC check

Reflashed the standby bitstream:

  wget http://milkymist.org/updates/2011-07-13/for-rc3/fjmem.bit
  (or http://milkymist.org/updates/fjmem.bit.bz2)
  wget http://milkymist.org/updates/current/standby.fpg
  jtag
  cable milkymist
  detect
  instruction CFG_OUT 000100 BYPASS
  instruction CFG_IN 000101 BYPASS
  pld load fjmem.bit
  initbus fjmem opcode=000010
  frequency 6000000
  detectflash 0
  endian big
  flashmem 0 standby.fpg noverify

M1 enters standby normally again.

Running "loop2": power-cycle, sleep 2 s, jtag-boot, sleep 10 seconds, which
is enough to begin (but not finish) booting RTEMS, then power-cycle again
(off-time is 5 s).

1: started around 05:01. Observed until about 200-300 (06:00-06:30) that
  standby was okay.

~730 (08:48): observed that standby didn't load anymore
  (note: due to a bug in labsw, power is not turned on in about 5-10% of
  the cycles, so the real cycle count should be around 650-700.)

Standby bitstream difference:

-00000080  00 00 4c 83 00 00 4c 87  00 00 cc 85 d8 47 cc 43  |..L...L......G.C|
+00000080  00 00 00 00 00 00 4c 87  00 00 cc 85 d8 47 cc 43  |......L......G.C|

Reflashed standby and locked the NOR. Testing with loop2 again.

1 (09:18): started

... continuing through the night ...

--- Thu 2011-09-08 ------------------------------------------------------------

3483 (03:18): standby is good so far

4325 (07:40): manually ended test. Standby is still good, but starting
  with cycle 3704, booting RTEMS failed with

  I: Booting from flash...
  I: Loading 1889692 bytes from flash...
  E: CRC failed (expected aa12a56a, got 68ec25e6)

A CRC check yielded:

  Images CRC:
  Checking : standby.fpg CRC passed (got c58e8905)
  Checking : soc-rescue.fpg CRC passed (got 30dcc535)
  Checking : bios-rescue.bin(CRC) CRC passed (got c78353fa)
  Checking : splash-rescue.raw CRC passed (got e8ff824f)
  Checking : flickernoise.fbi(rescue)(CRC) CRC passed (got aa12a56a)
  Checking : soc.fpg CRC passed (got 3a31e737)
  Checking : bios.bin(CRC) CRC passed (got 86e23684)
  Checking : splash.raw CRC passed (got 978f860c)
  Checking : flickernoise.fbi(CRC) CRC failed (expected aa12a56a, got 68ec25e6)

Read back the FlickerNoise partition with

  readmem 0x920000 0x0400000 fn.bin

Compare with the original:

  wget http://www.milkymist.org/updates/2011-07-13/flickernoise.fbi
  md5sum flickernoise.fbi
  5b7367e71bda306b080bde124615859b  flickernoise.fbi
  diff -u <(hexdump -C flickernoise.fbi) <(hexdump -C fn.bin)

...
-0008a380  28 43 00 00 34 64 00 01  58 44 00 00 5c 60 00 1e  |(C..4d..XD..\`..|
+0008a380  28 43 00 00 00 00 00 01  58 44 00 00 5c 60 00 1e  |(C......XD..\`..|
...

Recovered the FN partition and unlocked the NOR:

  flashmem 0x920000 flickernoise.fbi noverify
  unlockflash 0 55

New test series with script loop4. This differs from loop2 in that it uses
"pld reconfigure" to return to standby, instead of power-cycling. If we
still observe corruption with this test, then a software problem would be
to blame.

1 (09:11): started

2509 (19:33): standby looks good

All CRC checks pass. Verified that NOR was unlocked:

  (load fjmem, etc.)
  peek 0                 # show old value
  poke 0 0x40 0 0x0000   # Word Program
  peek 0                 # read back status (0x80 if okay, 0x92 if locked)
  poke 0 0xff            # Read Array (switch back to normal operation)
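
Aside: the hexdump comparisons in this log can also be done byte by byte.
The following is a minimal Python sketch and was not part of the test setup;
the function names (diff_bytes, bit_changes, only_cleared) are hypothetical.
It lists the differing bytes and reports which bits went 1 -> 0 and which
went 0 -> 1. That distinction may help when interpreting a corruption, since
a NOR program operation can only clear bits, while setting them again needs
an erase. Only the common prefix of the two inputs is compared, because a
partition readback such as fn.bin is longer than the image file.

```python
from typing import Iterator, Tuple


def diff_bytes(original: bytes, readback: bytes,
               base: int = 0) -> Iterator[Tuple[int, int, int]]:
    """Yield (offset, old, new) for each differing byte.

    Only the common prefix is compared; 'base' is added to the
    reported offsets (useful when passing in a slice of a file).
    """
    for i, (old, new) in enumerate(zip(original, readback)):
        if old != new:
            yield base + i, old, new


def bit_changes(old: int, new: int) -> Tuple[int, int]:
    """Return (cleared, raised) bit masks for one byte."""
    cleared = old & ~new & 0xFF   # bits that went 1 -> 0
    raised = ~old & new & 0xFF    # bits that went 0 -> 1
    return cleared, raised


def only_cleared(original: bytes, readback: bytes) -> bool:
    """True if no differing byte has a bit that went 0 -> 1."""
    return all(bit_changes(old, new)[1] == 0
               for _, old, new in diff_bytes(original, readback))


if __name__ == "__main__":
    # Row 0x80 of standby.fpg as logged on 2011-09-07 (first corruption).
    good = bytes.fromhex("00004c8300004c870000cc85d847cc43")
    bad = bytes.fromhex("00004c8300004c870000c480d847cc43")
    for offset, old, new in diff_bytes(good, bad, base=0x80):
        cleared, raised = bit_changes(old, new)
        print(f"{offset:08x}: {old:02x} -> {new:02x}  "
              f"cleared={cleared:02x} set={raised:02x}")
    print("only 1->0 changes:", only_cleared(good, bad))
```

For whole files, the same functions can be applied to the contents read with
open(path, "rb").read(), for example standby.fpg against the read-back copy.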