--- Tue 2011-09-06 ------------------------------------------------------------

Running "loop": power-cycle, sleep 2 s, jtag-boot, sleep 70 seconds,
which is enough to boot into FN and render "The Tunnel" for a moment,
then power-cycle again (off-time is 5 s).

Note that the test loop is "open-loop": it keeps cycling past any
problem. A corrupt standby (or any other issue) may therefore be
noticed well after the actual event.
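
A minimal sketch of such an open-loop cycle, for illustration only. The
callables power_on, power_off and jtag_boot are hypothetical stand-ins
for the labsw and urjtag invocations, which this log does not show; the
timings are the ones given above.

```python
import time
from typing import Callable


def run_loop(power_on: Callable[[], None],
             power_off: Callable[[], None],
             jtag_boot: Callable[[], None],
             cycles: int,
             off_s: float = 5.0,
             settle_s: float = 2.0,
             run_s: float = 70.0,
             sleep: Callable[[float], None] = time.sleep) -> int:
    """Open-loop power-cycle test: no step's outcome is checked."""
    done = 0
    for _ in range(cycles):
        power_off()
        sleep(off_s)       # off-time between cycles
        power_on()
        sleep(settle_s)    # "sleep 2 s" before jtag-boot
        try:
            jtag_boot()
        except Exception:
            pass           # open-loop: a failed boot does not stop the test
        sleep(run_s)       # long enough to reach rendering
        done += 1
    return done
```

Because nothing is checked inside the loop, the cycle counter only says
how often power was toggled, not how often the boot succeeded.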

1: started around 11:53 (M1 configuration is original, without locking)
(around 500) visually checked boot process; standby was reached normally

--- Wed 2011-09-07 ------------------------------------------------------------

645: neocon stopped working (around 01:58)
666: detected neocon failure at run 666: restarted neocon; urjtag failed
     this cycle; back to normal at 667
684: checked LEDs again (first time since ~500) and found that standby
     may be failing. stopping test at 685 (around 02:50) for
     investigation.

Downloaded the standby bitstream:

  wget https://raw.github.com/milkymist/scripts/master/scripts/reflash_m1.sh
  chmod 755 reflash_m1.sh

  ./reflash_m1.sh --read-flash

Found two corruptions in the standby bitstream:

  diff -u <(hexdump -C standby.fpg) <(hexdump -C /home/root/.qi/milkymist/read-flash/2011...)

-00000080  00 00 4c 83 00 00 4c 87  00 00 cc 85 d8 47 cc 43  |..L...L......G.C|
+00000080  00 00 4c 83 00 00 4c 87  00 00 c4 80 d8 47 cc 43  |..L...L......G.C|

-00002840  00 08 cc 26 00 00 00 00  00 00 00 00 0c 44 00 98  |...&.........D..|
+00002840  00 00 cc 26 00 00 00 00  00 00 00 00 0c 44 00 98  |...&.........D..|
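
The same comparison can be sketched in Python, reporting the byte offset
of each differing 16-bit word. The helper names are hypothetical. The
only_clears_bits check rests on the assumption (not stated in this log)
that a NOR program operation can only change bits from 1 to 0.

```python
def diff_words(reference: bytes, readback: bytes, word_size: int = 2):
    """Return (byte_offset, ref_word, read_word) for each differing word.

    Only the common length is compared; a readback of the whole
    partition is usually longer than the image itself.
    """
    diffs = []
    n = min(len(reference), len(readback))
    for off in range(0, n - n % word_size, word_size):
        a = reference[off:off + word_size]
        b = readback[off:off + word_size]
        if a != b:
            diffs.append((off,
                          int.from_bytes(a, "big"),
                          int.from_bytes(b, "big")))
    return diffs


def only_clears_bits(ref_word: int, read_word: int) -> bool:
    """True if read_word has no bit set that is clear in ref_word."""
    return read_word & ~ref_word == 0
```

For the first difference above, the row at 0x80 yields a changed word at
byte offset 0x8a (0xcc85 read back as 0xc480).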

CRC-checked the partitions:

  git clone git://github.com/milkymist/milkymist
  cd milkymist/tools/
  gcc -Wall -I. -o flterm flterm.c
  wget http://milkymist.org/updates/current/for-rc3/boot.4e53273.bin
  ./flterm --port /dev/ttyUSB0 --kernel boot.4e53273.bin

  serialboot
  a

Result: only standby.fpg failed the CRC check.

Reflashed the standby bitstream:

  wget http://milkymist.org/updates/2011-07-13/for-rc3/fjmem.bit
  (or http://milkymist.org/updates/fjmem.bit.bz2)
  wget http://milkymist.org/updates/current/standby.fpg

  jtag

  cable milkymist
  detect
  instruction CFG_OUT  000100 BYPASS
  instruction CFG_IN   000101 BYPASS
  pld load fjmem.bit
  initbus fjmem opcode=000010
  frequency 6000000
  detectflash 0
  endian big
  flashmem 0 standby.fpg noverify

M1 enters standby normally again.

Running "loop2": power-cycle, sleep 2 s, jtag-boot, sleep 10 seconds,
which is enough to begin (but not finish) booting RTEMS, then
power-cycle again (off-time is 5 s).

1: started around 05:01. Observed until about 200-300 (06:00-06:30)
that standby was okay.
~730 (08:48): observed that standby didn't load anymore (note: due to
a bug in labsw, power is not turned on in about 5-10% of the cycles,
so the real cycle count should be around 650-700.)
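
The estimate can be reproduced with a little arithmetic. This is only a
sketch; effective_cycles is a hypothetical helper and the 5% and 10%
figures are the bounds quoted above.

```python
def effective_cycles(counted: int, fail_percent: int) -> int:
    """Cycles in which power actually came on, rounded down."""
    return counted * (100 - fail_percent) // 100


low = effective_cycles(730, 10)   # labsw fails in 10% of the cycles
high = effective_cycles(730, 5)   # labsw fails in 5% of the cycles
print(low, high)
```

Both values fall inside the 650-700 range given above.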

Standby bitstream difference:

-00000080  00 00 4c 83 00 00 4c 87  00 00 cc 85 d8 47 cc 43  |..L...L......G.C|
+00000080  00 00 00 00 00 00 4c 87  00 00 cc 85 d8 47 cc 43  |......L......G.C|

Reflashed standby and locked the NOR. Testing with loop2 again.

1 (09:18): started
... continuing through the night ...

--- Thu 2011-09-08 ------------------------------------------------------------

3483 (03:18): standby is good so far
4325 (07:40): manually ended test. Standby is still good, but starting
    with cycle 3704, booting RTEMS failed with

    I: Booting from flash...
    I: Loading 1889692 bytes from flash...
    E: CRC failed (expected aa12a56a, got 68ec25e6)

A CRC check yielded:

Images CRC:
  Checking : standby.fpg        CRC passed (got c58e8905)
  Checking : soc-rescue.fpg     CRC passed (got 30dcc535)
  Checking : bios-rescue.bin(CRC)       CRC passed (got c78353fa)
  Checking : splash-rescue.raw  CRC passed (got e8ff824f)
  Checking : flickernoise.fbi(rescue)(CRC)      CRC passed (got aa12a56a)
  Checking : soc.fpg    CRC passed (got 3a31e737)
  Checking : bios.bin(CRC)      CRC passed (got 86e23684)
  Checking : splash.raw CRC passed (got 978f860c)
  Checking : flickernoise.fbi(CRC)      CRC failed (expected aa12a56a, got 68ec25e6)
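
When the serial console is logged to a file, the failing partitions can
be pulled out of such a report automatically. A minimal sketch, assuming
the line format is exactly as shown above; failed_partitions is a
hypothetical helper, not part of any Milkymist tool.

```python
import re

# Matches e.g. "Checking : soc.fpg    CRC passed (got 3a31e737)" and
# "... CRC failed (expected aa12a56a, got 68ec25e6)".
LINE = re.compile(
    r"Checking\s*:\s*(?P<name>\S+)\s+CRC (?P<result>passed|failed) "
    r"\((?:expected (?P<expected>[0-9a-f]{8}), )?got (?P<got>[0-9a-f]{8})\)"
)


def failed_partitions(report: str):
    """Return (name, expected_crc, actual_crc) for each failed check."""
    return [(m["name"], m["expected"], m["got"])
            for m in LINE.finditer(report)
            if m["result"] == "failed"]
```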

Read back the FlickerNoise partition with

  readmem 0x920000 0x0400000 fn.bin

Compare with the original:

  wget http://www.milkymist.org/updates/2011-07-13/flickernoise.fbi
  md5sum flickernoise.fbi
  5b7367e71bda306b080bde124615859b  flickernoise.fbi

  diff -u <(hexdump -C flickernoise.fbi) <(hexdump -C fn.bin)

...
-0008a380  28 43 00 00 34 64 00 01  58 44 00 00 5c 60 00 1e  |(C..4d..XD..\`..|
+0008a380  28 43 00 00 00 00 00 01  58 44 00 00 5c 60 00 1e  |(C......XD..\`..|
...

Recovered the FN partition and unlocked the NOR:

  flashmem 0x920000 flickernoise.fbi noverify
  unlockflash 0 55

New test series with script loop4. This differs from loop2 in that
it uses "pld reconfigure" to return to standby, instead of
power-cycling. If we still observe corruption with this test, then
a software problem would be to blame.

1 (09:11): started
2509 (19:33): standby looks good

All CRC checks pass. Verified that NOR was unlocked:

  (load fjmem, etc.)
  peek 0		# show old value
  poke 0 0x40 0 0x0000	# Word Program
  peek 0		# read back status (0x80 if okay, 0x92 if locked)
  poke 0 0xff		# Read Array (switch back to normal operation)
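
The two status values mentioned above can be decoded bit by bit. The bit
names below are an assumption: they follow the status register layout of
Intel-style NOR command sets and are not taken from this log, which only
states that 0x80 means okay and 0x92 means locked.

```python
# Assumed status register bits (Intel-style command set).
STATUS_BITS = {
    7: "ready",
    5: "erase error",
    4: "program error",
    3: "programming voltage error",
    1: "block locked",
}


def decode_status(sr: int):
    """Return the names of the status bits set in sr, highest bit first."""
    return [name for bit, name in sorted(STATUS_BITS.items(), reverse=True)
            if sr & (1 << bit)]


def is_locked(sr: int) -> bool:
    """True if the (assumed) block-locked bit is set."""
    return bool(sr & 0x02)
```

Under this assumption, 0x92 reads as "ready, program error, block
locked", which matches a Word Program being refused on a locked block.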

Took labsw offline to analyze its occasional failure to switch. The
failure was difficult to reproduce. Also opened labsw to tighten a loose
nut. Afterwards (Friday run), labsw showed far fewer switch failures.

--- Fri 2011-09-09 ------------------------------------------------------------

New test with script "loop5". This time, we only power cycle but don't
try to boot out of standby. The purpose of this test is to confirm that
NOR corruption does not occur when powering down while in standby.

1 (11:04): started
200 (11:28): stopped to issue "unlockflash 0 105" to make sure all of
  the NOR is unlocked, just in case

Also checked CRCs. All is well.

1 (11:31): started
2637 (16:53): stopped. standby looks good.

All partitions pass the CRC check.

Repeating loop2 to make sure the NOR corruption hasn't disappeared for
an unrelated reason. The system is connected to an oscilloscope that
monitors the M1 DC-in voltage. This connection grounds DC in.

1 (16:56): started

--- Sat 2011-09-10 ------------------------------------------------------------

2428 (04:57): standby still okay
2440 (05:01): disconnected oscilloscope
2463 (05:08): stopped test

All partitions pass the CRC check. Read back the standby partition and
also found no corruption in bitwise comparison. Furthermore, the unused
area showed the expected 0xffff pattern.
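
The check of the unused area can be sketched as follows. first_unerased
is a hypothetical helper; it assumes the readback covers the whole
partition and that image_len is the size of standby.fpg in bytes.

```python
from typing import Optional


def first_unerased(readback: bytes, image_len: int) -> Optional[int]:
    """Return the offset of the first non-0xff byte after the image.

    Returns None if the whole unused area is still erased.
    """
    for off in range(image_len, len(readback)):
        if readback[off] != 0xFF:
            return off
    return None
```

A result other than None means something was programmed into the area
behind the bitstream.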

1 (05:14): restarted test, without oscilloscope.
2213 (16:11): standby still okay

All partitions pass the CRC check. Unused area of standby shows 0xffff.

Prepared new test (loop7): like loop2, but make a "false start" of
turning on both channels and immediately turn them off again, wait 16
seconds, and only then power up properly. This would roughly correspond
to labsw failing to turn on, as observed in the test runs in which NOR
corruption occurred.

1 (16:27): started loop7 test
... continuing through the night ...

--- Sun 2011-09-11 ------------------------------------------------------------

2001 (11:58): standby okay

All partitions pass the CRC check. Unused area of standby shows 0xffff.

Confirmed writability of NOR at address 0x80000 and at address 0.
Instructions used at address 0x80000:

  jtag> peek 0x80000
  URJ_BUS_READ(0x00080000) = 0xFFFF (65535)
  jtag> poke 0x80000 0x40 0x80000 0xffee 
  jtag> peek 0x80000
  URJ_BUS_READ(0x00080000) = 0x0080 (128)
  jtag> poke 0 0xff   
  jtag> peek 0x80000
  URJ_BUS_READ(0x00080000) = 0xFFEE (65518)

--- Mon 2011-09-12 ------------------------------------------------------------

loop8 is similar to loop7. It increases the "false on" period to 10 ms,
which is enough to make the M1 power LED flash. It reduces the cool off
period after the false on.

1 (08:11): started loop8 test
2120 (19:50): standby okay. All partitions pass CRC check.

Going back to the beginning. Test loop (1) runs all the way to rendering.
Maybe it is necessary after all ...

1 (19:52): started loop (1) test (serial console logged in file log9)
70 (21:21): standby okay

--- Tue 2011-09-13 ------------------------------------------------------------

223 (00:39): standby failure

Several corruptions were found:

-00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
+00000000  00 00 00 00 ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
...
 00005510  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
 *
+00005550  ff ff ff ff ff ff d6 10  ff ff ff ff ff ff ff ff  |................|
+00005560  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
+*

No CRC errors in other partitions.

1 (00:50): restored standby partition. started loop again (log9.1)
613 (14:04): standby okay
639 (14:38): standby failure. Other CRCs okay. Corruption:

...
-00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
+00000000  00 00 ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
...
-000014b0  00 c0 cc 26 00 00 00 00  00 00 00 00 0c 44 00 30  |...&.........D.0|
+000014b0  00 00 cc 26 00 00 00 00  00 00 00 00 0c 44 00 30  |...&.........D.0|
...

1 (14:46): restored and verified standby partition. started loop again (log9.2)
23 (15:15): ok
117 (17:16): standby failure. Other CRCs okay. Corruption:

...
 000050e0  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
 *
+00005140  00 00 ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
...

1 (17:23): restored and verified standby partition. started loop again (log9.3)
1 (17:30): restarted after X crash
... continuing through the night ...

--- Wed 2011-09-14 ------------------------------------------------------------

... continuing through the whole day ...

--- Thu 2011-09-15 ------------------------------------------------------------

1493 (01:46): standby okay
1584 (03:44): standby failure. CRC failure in standby and soc-rescue.

Corruption in standby:

-00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
+00000000  ff ff ff ff 00 00 ff ff  ff ff ff ff ff ff ff ff  |................|
...
-00000a00  00 98 cc 26 00 00 00 00  00 00 00 00 0c 44 00 a0  |...&.........D..|
+00000a00  00 00 cc 26 00 00 00 00  00 00 00 00 0c 44 00 a0  |...&.........D..|
...

Read back soc-rescue:

  jtag> readmem 0x0A0000 0x0180000 soc-rescue.bin

Corruption:

-00020460  50 00 50 00 00 0f 55 5f  ff ff dd ee 00 00 00 00  |P.P...U_........|
+00020460  00 00 50 00 00 0f 55 5f  ff ff dd ee 00 00 00 00  |..P...U_........|

Restore standby, then soc-rescue:

  wget http://milkymist.org/updates/current/soc-rescue.fpg
  jtag> flashmem 0xA0000 soc-rescue.fpg noverify

1 (04:00): verified CRCs. started loop again (log9.4)
750 (20:12): standby okay. Interrupted to dump first 256 bytes of NOR with
  script "peek". No differences found.
752 (20:13): resumed. standby okay
... continuing into the night ...

--- Fri 2011-09-16 ------------------------------------------------------------

980 (01:10): standby okay. Interrupted to run "peek". Found non-fatal damage:

-00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
+00000000  00 00 ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|

981 (01:12): resumed. standby okay.
1121 (04:13): standby failed. Other CRCs okay. Corruptions:

-00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
+00000000  00 00 ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
...
-00000800  00 08 cc 26 00 00 00 00  00 00 00 00 0c 44 00 20  |...&.........D. |
+00000800  00 00 cc 26 00 00 00 00  00 00 00 00 0c 44 00 20  |...&.........D. |
...
-00001000  00 c8 cc 26 00 00 00 00  00 00 00 00 0c 44 00 90  |...&.........D..|
+00001000  00 00 cc 26 00 00 00 00  00 00 00 00 0c 44 00 90  |...&.........D..|

1 (04:20): restored and verified standby. started loop again (log9.5)
516 (15:28): standby okay. Interrupted to "peek". No differences found.
517 (15:29): resumed. standby okay.
597 (17:13): standby failed. Other CRCs okay. Corruption:

-00001160  00 d0 cc 26 00 00 00 00  00 00 00 00 0c 44 00 50  |...&.........D.P|
+00001160  00 00 cc 26 00 00 00 00  00 00 00 00 0c 44 00 50  |...&.........D.P|

1 (17:21): restored and verified standby. started loop again (log9.6)
... continuing into the night ...

--- Sat 2011-09-17 ------------------------------------------------------------

352 (00:57): standby okay
422 (02:27): standby failed. Other CRCs okay. Corruption:

-00002000  00 70 cc 26 00 00 00 00  00 00 00 00 0c 44 00 28  |.p.&.........D.(|
+00002000  00 00 cc 26 00 00 00 00  00 00 00 00 0c 44 00 28  |...&.........D.(|

1 (02:36): restored and verified standby. started loop again (log9.7)
501 (13:25): standby okay. Interrupted to run "peek". Found non-fatal damage:

-00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
+00000000  ff ff ff ff ff ff 81 4c  ff ff ff ff ff ff ff ff  |.......L........|

1 (13:26): restarted (oops). standby okay. (still log9.7)
473 (23:39): standby okay
... continuing into the next day ...

--- Sun 2011-09-18 ------------------------------------------------------------

500 (00:14): standby failed. CRC failed in flickernoise-rescue, too.

Corruptions in standby:

-00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
+00000000  ff ff ff ff 00 00 81 4c  ff ff ff ff ff ff ff ff  |.......L........|
...
-00002ff0  00 20 cc 26 00 00 00 00  00 00 00 00 0c 44 00 78  |. .&.........D.x|
+00002ff0  00 20 c4 04 00 00 00 00  00 00 00 00 0c 44 00 78  |. ...........D.x|

Corruption in flickernoise-rescue:

-00120000  ba 80 08 00 34 10 00 00  e0 00 00 1b 40 c8 00 0a  |....4.......@...|
+00120000  00 00 08 00 34 10 00 00  e0 00 00 1b 40 c8 00 0a  |....4.......@...|

1 (00:36): restored and verified everything. started loop again (log9.8)


Summary of recent experiments (all with "loop"):

Last cycle observed   First cycle observed   Non-fatal      Offsets of corrupted words
without fatal corr.   with fatal corr.       corruption     in standby (at end)

  70 ( 1h29)           223 ( 4h47)           ?              0, 2, 0x5556
 613 (13h14)           639 (13h48)           ?              0, 0x14b0
  23 ( 0h29)           117 ( 2h30)           ?              0x5140
1493 (32h16)          1584 (34h14)           ?              4, 0xa00; soc-rescue 0x20460
 981 (21h12)          1121 (24h13)            980 (21h10)   0, 0x800, 0x1000
 517 (11h09)           597 (12h53)           -              0x1160
 352 ( 7h36)           422 ( 9h06)           -              0x2000
 973 (21h03)          1002 (22h22)            501 (10h50)   4, 6, 0x2ff2; fn-rescue 0x120000
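
A quick look at the numbers in this table, as a sketch. The cycle counts
are copied from the two leftmost columns. Since the checks were sporadic,
each pair only brackets the cycle in which the fatal corruption
happened.

```python
from statistics import mean, median

# (last cycle observed good, first cycle observed bad), one pair per run
RUNS = [(70, 223), (613, 639), (23, 117), (1493, 1584),
        (981, 1121), (517, 597), (352, 422), (973, 1002)]

first_bad = [bad for _, bad in RUNS]
windows = [bad - good for good, bad in RUNS]   # observation uncertainty

print("median cycles until observed failure:", median(first_bad))
print("mean cycles until observed failure:  ", mean(first_bad))
print("observation window (min..max cycles):", min(windows), max(windows))
```

The wide spread of first_bad (117 to 1584 cycles) is why a clean run of
2000+ cycles in the other test variants is suggestive but not proof.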