mirror of git://projects.qi-hardware.com/wernermisc.git synced 2024-11-15 14:30:36 +02:00
wernermisc/m1rc3/norruption/LOG

--- Tue 2011-09-06 ------------------------------------------------------------
Running "loop": power-cycle, sleep 2 s, jtag-boot, sleep 70 seconds,
which is enough to boot into FN and render "The Tunnel" for a moment,
then power-cycle again (off-time is 5 s).
Note that the test loop is "open-loop": it keeps cycling past any
problems. The first observation of a corrupt standby (or any other
issue) may therefore come well after the actual event.
1: started around 11:53 (M1 configuration is original, without locking)
(around 500) visually checked boot process; standby was reached normally
--- Wed 2011-09-07 ------------------------------------------------------------
645: neocon stopped working (around 01:58)
666: detected neocon failure at run 666: restarted neocon; urjtag failed
this cycle; back to normal at 667
684: checked LEDs again (first time since ~500) and found that standby
may be failing. stopping test at 685 (around 02:50) for
investigation.
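As a rough cross-check of the loop timing, the run numbers and wall-clock
times noted above give an average cycle period. A minimal sketch, assuming
the logged times ("around 11:53", "around 02:50") are accurate to within a
few minutes; the helper name average_period is made up for this
illustration:

```python
from datetime import datetime

def average_period(first_run, first_time, last_run, last_time):
    """Average seconds per cycle between two logged (run number, time) points."""
    cycles = last_run - first_run
    return (last_time - first_time).total_seconds() / cycles

# Values from the log: run 1 at about 11:53 on 2011-09-06,
# run 685 at about 02:50 on 2011-09-07.
period = average_period(1, datetime(2011, 9, 6, 11, 53),
                        685, datetime(2011, 9, 7, 2, 50))
print(f"{period:.1f} s per cycle")
```

The fixed sleeps in "loop" add up to 2 + 70 + 5 = 77 s, so whatever the
average exceeds that by would be the jtag-boot step plus switching
overhead.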
Downloaded the standby bitstream:
  wget https://raw.github.com/milkymist/scripts/master/scripts/reflash_m1.sh
  chmod 755 reflash_m1.sh
  ./reflash_m1.sh --read-flash
Found two corruptions in the standby bitstream:
diff -u <(hexdump -C standby.fpg) <(hexdump -C /home/root/.qi/milkymist/read-flash/2011...)
-00000080 00 00 4c 83 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |..L...L......G.C|
+00000080 00 00 4c 83 00 00 4c 87 00 00 c4 80 d8 47 cc 43 |..L...L......G.C|
-00002840 00 08 cc 26 00 00 00 00 00 00 00 00 0c 44 00 98 |...&.........D..|
+00002840 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 98 |...&.........D..|
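The two differences can be examined at the bit level. The sketch below
(the helper name changed_bits is invented here) splits each change into
bits that went from 1 to 0 and bits that went from 0 to 1. In both words
only 1-to-0 transitions appear, which is the only direction a NOR program
operation can move a bit without an erase. Reading this as a sign of
spurious programming is an interpretation, not something this log states.

```python
def changed_bits(original, readback):
    """Return (cleared, set_) bit masks for two equal-length byte strings."""
    cleared = bytes(o & ~r & 0xFF for o, r in zip(original, readback))
    set_ = bytes(~o & r & 0xFF for o, r in zip(original, readback))
    return cleared, set_

# The two affected 16-bit words from the diff above.
pairs = [
    (bytes.fromhex("cc85"), bytes.fromhex("c480")),  # word at offset 0x8a
    (bytes.fromhex("0008"), bytes.fromhex("0000")),  # word at offset 0x2840
]
for original, readback in pairs:
    cleared, set_ = changed_bits(original, readback)
    print(original.hex(), "->", readback.hex(),
          "cleared:", cleared.hex(), "set:", set_.hex())
```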
CRC-checked the partitions:
  git clone git://github.com/milkymist/milkymist
  cd milkymist/tools/
  gcc -Wall -I. -o flterm flterm.c
  wget http://milkymist.org/updates/current/for-rc3/boot.4e53273.bin
  ./flterm --port /dev/ttyUSB0 --kernel boot.4e53273.bin
  serialboot
  a
only standby.fpg failed the CRC check
Reflashed the standby bitstream:
  wget http://milkymist.org/updates/2011-07-13/for-rc3/fjmem.bit
  (or http://milkymist.org/updates/fjmem.bit.bz2)
  wget http://milkymist.org/updates/current/standby.fpg
  jtag
  cable milkymist
  detect
  instruction CFG_OUT 000100 BYPASS
  instruction CFG_IN 000101 BYPASS
  pld load fjmem.bit
  initbus fjmem opcode=000010
  frequency 6000000
  detectflash 0
  endian big
  flashmem 0 standby.fpg noverify
M1 enters standby normally again.
Running "loop2": power-cycle, sleep 2 s, jtag-boot, sleep 10 seconds,
which is enough to begin (but not finish) booting RTEMS, then
power-cycle again (off-time is 5 s).
1: started around 05:01. Standby was observed to be okay until about
run 200-300 (06:00-06:30).
~730 (08:48): observed that standby didn't load anymore (note: due to
a bug in labsw, power is not turned on in about 5-10% of the cycles,
so the real cycle count should be around 650-700.)
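The correction of the cycle count is simple arithmetic. A small sketch,
assuming the run counter advances on every iteration whether or not labsw
actually switched power on (the function name is hypothetical):

```python
def effective_cycles(counted, fail_percent=(5, 10)):
    """Range of real power cycles if 5-10% of counted cycles never powered on."""
    low_pct, high_pct = fail_percent
    fewest = counted * (100 - high_pct) / 100
    most = counted * (100 - low_pct) / 100
    return fewest, most

fewest, most = effective_cycles(730)
print(f"between {fewest} and {most} real power cycles")
```

This agrees with the "around 650-700" estimate above.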
Standby bitstream difference:
-00000080 00 00 4c 83 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |..L...L......G.C|
+00000080 00 00 00 00 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |......L......G.C|
Reflashed standby and locked the NOR. Testing with loop2 again.
1 (09:18): started
... continuing through the night ...
--- Thu 2011-09-08 ------------------------------------------------------------
3483 (03:18): standby is good so far
4325 (07:40): manually ended test. Standby is still good, but starting
with cycle 3704, booting RTEMS failed with
I: Booting from flash...
I: Loading 1889692 bytes from flash...
E: CRC failed (expected aa12a56a, got 68ec25e6)
A CRC check yielded:
Images CRC:
  Checking : standby.fpg                   CRC passed (got c58e8905)
  Checking : soc-rescue.fpg                CRC passed (got 30dcc535)
  Checking : bios-rescue.bin(CRC)          CRC passed (got c78353fa)
  Checking : splash-rescue.raw             CRC passed (got e8ff824f)
  Checking : flickernoise.fbi(rescue)(CRC) CRC passed (got aa12a56a)
  Checking : soc.fpg                       CRC passed (got 3a31e737)
  Checking : bios.bin(CRC)                 CRC passed (got 86e23684)
  Checking : splash.raw                    CRC passed (got 978f860c)
  Checking : flickernoise.fbi(CRC)         CRC failed (expected aa12a56a, got 68ec25e6)
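When such a report is captured from the serial console, the failing images
can be picked out mechanically. A minimal sketch, assuming the line format
is exactly as shown above (the regular expression and the function name
are written for this log, not taken from any Milkymist tool):

```python
import re

LINE_RE = re.compile(
    r"Checking\s*:\s*(?P<name>\S+)\s+CRC\s+(?P<result>passed|failed)\s+"
    r"\((?:expected (?P<expected>[0-9a-f]{8}), )?got (?P<got>[0-9a-f]{8})\)"
)

def failed_images(report):
    """Return {image name: (expected CRC, actual CRC)} for every failed check."""
    failures = {}
    for match in LINE_RE.finditer(report):
        if match["result"] == "failed":
            failures[match["name"]] = (match["expected"], match["got"])
    return failures

# Three lines copied from the report above.
report = """\
Checking : standby.fpg CRC passed (got c58e8905)
Checking : flickernoise.fbi(rescue)(CRC) CRC passed (got aa12a56a)
Checking : flickernoise.fbi(CRC) CRC failed (expected aa12a56a, got 68ec25e6)
"""
print(failed_images(report))
```

Note that the rescue copy passes with aa12a56a, the same value expected of
the regular FlickerNoise image, so only the regular partition is damaged.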
Read back the FlickerNoise partition with
readmem 0x920000 0x0400000 fn.bin
Compare with the original:
wget http://www.milkymist.org/updates/2011-07-13/flickernoise.fbi
md5sum flickernoise.fbi
5b7367e71bda306b080bde124615859b flickernoise.fbi
diff -u <(hexdump -C flickernoise.fbi) <(hexdump -C fn.bin)
...
-0008a380 28 43 00 00 34 64 00 01 58 44 00 00 5c 60 00 1e |(C..4d..XD..\`..|
+0008a380 28 43 00 00 00 00 00 01 58 44 00 00 5c 60 00 1e |(C......XD..\`..|
...
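The file offset in the diff can be translated into a NOR address. A short
sketch, assuming offset 0 of fn.bin corresponds to flash address 0x920000
(which follows from the readmem command above); the function name is made
up for illustration:

```python
FN_BASE = 0x920000  # flash address used with readmem/flashmem in this log

def differing_addresses(original, readback, base=FN_BASE):
    """Yield (flash address, original byte, read-back byte) for each mismatch."""
    for offset, (o, r) in enumerate(zip(original, readback)):
        if o != r:
            yield base + offset, o, r

# The 16-byte row at file offset 0x0008a380 from the diff above.
row_offset = 0x0008A380
good = bytes.fromhex("28 43 00 00 34 64 00 01 58 44 00 00 5c 60 00 1e")
bad = bytes.fromhex("28 43 00 00 00 00 00 01 58 44 00 00 5c 60 00 1e")

for addr, o, r in differing_addresses(good, bad, FN_BASE + row_offset):
    print(f"0x{addr:06x}: {o:02x} -> {r:02x}")
```

zip() stops at the shorter input, so comparing the whole image against the
0x400000-byte readback ignores the unused tail of the partition.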
Recovered the FN partition and unlocked the NOR:
flashmem 0x920000 flickernoise.fbi noverify
unlockflash 0 55
New test series with script loop4. This differs from loop2 in that
it uses "pld reconfigure" to return to standby, instead of
power-cycling. If we still observe corruption with this test, then
a software problem would be to blame.
1 (09:11): started
2509 (19:33): standby looks good
All CRC checks pass. Verified that NOR was unlocked:
  (load fjmem, etc.)
  peek 0                # show old value
  poke 0 0x40 0 0x0000  # Word Program
  peek 0                # read back status (0x80 if okay, 0x92 if locked)
  poke 0 0xff           # Read Array (switch back to normal operation)
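The two status values can be decoded bit by bit. This is a sketch under
the assumption that the NOR uses the Intel-style command set, which the
0x40 (Word Program) and 0xff (Read Array) commands suggest. The bit names
are paraphrased from that command set's status register and should be
checked against the data sheet of the actual part.

```python
# Status register bits of the Intel-style flash command set (assumed here).
STATUS_BITS = {
    7: "ready",
    6: "erase suspended",
    5: "erase / clear-lock-bits error",
    4: "program / set-lock-bit error",
    3: "programming voltage low",
    2: "program suspended",
    1: "block locked",
}

def decode_status(status):
    """List the names of all status bits that are set."""
    return [name for bit, name in STATUS_BITS.items() if status & (1 << bit)]

for value in (0x80, 0x92):
    print(f"0x{value:02x}:", ", ".join(decode_status(value)))
```

Under this reading, 0x80 is "ready" with no error bits, and 0x92 adds a
program error together with the block-locked flag.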
Took labsw offline to analyze occasional failure to switch. Failure
was difficult to reproduce. Also opened labsw to tighten a loose nut.
Afterwards (Friday run), labsw showed far fewer switch failures.
--- Fri 2011-09-09 ------------------------------------------------------------
New test with script "loop5". This time, we only power cycle but don't
try to boot out of standby. The purpose of this test is to confirm that
NOR corruption does not occur when powering down while in standby.
1 (11:04): started
200 (11:28): stopped to issue "unlockflash 0 105" to make sure all of
the NOR is unlocked, just in case
Also checked CRCs. All is well.
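The block counts passed to unlockflash can be related to the partition
addresses used earlier. The sketch assumes 128 KiB (0x20000) blocks and
that the second argument of unlockflash is a block count; neither is
stated in this log.

```python
BLOCK_SIZE = 0x20000  # 128 KiB blocks: an assumption, not stated in the log

def blocks_to_cover(end_address, block_size=BLOCK_SIZE):
    """Number of blocks, counted from address 0, needed to reach end_address."""
    return -(-end_address // block_size)  # ceiling division

fn_end = 0x920000 + 0x400000  # start and length used with readmem earlier
print(blocks_to_cover(fn_end))   # blocks up to the end of the FN region
print(hex(55 * BLOCK_SIZE))      # where "unlockflash 0 55" would end
```

Under these assumptions, 105 blocks end at 0xd20000, the end of the
region read back from 0x920000, while 55 blocks end at 0x6e0000, below
the start of the FN partition.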
1 (11:31): started
2637 (16:53): stopped. standby looks good.
All partitions pass the CRC check.
Repeating loop2 to make sure the NOR corruption hasn't disappeared for
an unrelated reason. The system is connected to an oscilloscope monitoring
the M1 DC in voltage. This connection also grounds DC in.
1 (16:56): started
--- Sat 2011-09-10 ------------------------------------------------------------
2428 (04:57): standby still okay
2440 (05:01): disconnected oscilloscope
2463 (05:08): stopped test
All partitions pass the CRC check. Read back the standby partition and
also found no corruption in bitwise comparison. Furthermore, the unused
area showed the expected 0xffff pattern.
1 (05:14): restarted test, without oscilloscope.
2213 (16:11): standby still okay
All partitions pass the CRC check. Unused area of standby shows 0xffff.
Prepared new test (loop7): like loop2, but make a "false start" of
turning on both channels and immediately turn them off again, wait 16
seconds, and only then power up properly. This would roughly correspond
to labsw failing to turn on, as observed in the test runs in which NOR
corruption occurred.
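The difference between loop2 and loop7 can be written down as step lists.
This only models the sequence described in this log. The real scripts are
not reproduced here, and the exact placement of the false start within
the cycle is an assumption.

```python
def loop2_steps(run_time=10, off_time=5):
    """One loop2 cycle as (action, seconds to wait afterwards) pairs."""
    return [
        ("power on", 2),
        ("jtag-boot", run_time),
        ("power off", off_time),
    ]

def loop7_steps(false_start_wait=16):
    """loop7: a false start, a pause, then a normal loop2 cycle."""
    return [
        ("power on", 0),                  # both channels on ...
        ("power off", false_start_wait),  # ... and immediately off again
    ] + loop2_steps()

for action, wait in loop7_steps():
    print(f"{action:10s} then wait {wait} s")
```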
1 (16:27): started loop7 test
5 (16:32): standby okay