"pos" counts from zero so the middle point is (nibbles-skip)/2 while
the code assumed that "pos" was an index into the actual buffer, in which
case the middle point would have been (nibbles+skip)/2.
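The difference between the two midpoints can be shown with a minimal sketch.
The names "nibbles", "skip", and "pos" come from the description above. The
helper functions, the sample numbers, and the assumption that the valid
samples occupy buffer indices skip..nibbles-1 are hypothetical and only meant
for illustration:

```python
def middle_pos(nibbles, skip):
    """Middle of the valid samples, counted from zero ("pos" convention)."""
    return (nibbles - skip) // 2


def middle_index(nibbles, skip):
    """Middle of the valid samples as an index into the whole buffer."""
    return (nibbles + skip) // 2


# Hypothetical numbers: a 1000-nibble buffer whose first 100 nibbles are skipped.
nibbles, skip = 1000, 100
pos = middle_pos(nibbles, skip)
index = middle_index(nibbles, skip)

# The two conventions differ exactly by the skipped prefix.
assert index == pos + skip
print(pos, index)
```

Using the buffer-index formula where a zero-based "pos" is expected places the
midpoint "skip" nibbles too late.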
Combined with the 84 MHz speedup, this yields:
1 3 gap+0 2 0 1 3
------- ------- ------- ------- ------- ------- -------
118 36 1 138
89 35 1 22
59 36 1 51
30 36 1 80
So in this case we gain only about 3 samples or 10% by making this risky
optimization. (The gain would be higher at higher sample rates, lower at
lower rates.)
We also de-optimize the start bit (DAT0=0) phase for now. In the
12 MHz scenario, this produces the following results:
1 3 gap+0 2 0 1 3
------- ------- ------- ------- ------- ------- -------
8 38 26 100 146
102 38 26 6
52 39 26 55
147 41 26 105
97 39 26 10 146
Note that the gap now includes the start bit phase, since the clock
change may complicate the calculation of how many 12 MHz samples
it corresponds to.
The gap is just as long as when waiting for an event but the "start bit"
on DAT0 has vanished completely:
1 3 gap 2 0 1 3
------- ------- ------- ------- ------- ------- -------
53 77 16 146
66 79 1 146
81 79 132
96 79 117
This seems to make no difference for the gap, but the "start bit" (DAT0
pulled low) seems to get 1-2 samples shorter:
1 3 gap 2 0 1 3
------- ------- ------- ------- ------- ------- -------
146 77 9 60
147 13 79 9 45
145 28 79 10 29
146 43 79 9 15
A series of measurements of
A# ./ubb-patgen -f 41kHz 1
A# ./ubb-patgen -f 41kHz -c
B# ./ubb-la -f 12 -n 10
yielded these results:
1 3 gap 2 0 1 3
------- ------- ------- ------- ------- ------- -------
106 77 11 98
120 78 11 83
134 79 11 68
3 79 11 53
18 78 11 39
33 78 11 24
47 79 11 9
62 79 5 6 140 147
77 79 11 125 147
Where for example the last entry corresponds to
...1{146}3{77}
0{11}1{125}3{147}...
This looks as if DAT1 was 1 for 77 samples before the first capture
ended, was 0 throughout the pulling low of DAT0 (11 cycles), stayed low
for another 125 cycles, and then went high for the 146.29 nominal
half-period. We thus get a gap length of 2*146-77-11-125 = 79.
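The same arithmetic as a minimal sketch. The function name and its parameters
are hypothetical. It assumes, as the calculation above does, that the nominal
half-period is rounded to 146 whole samples and that the observed runs plus the
gap span exactly two half-periods:

```python
def gap_length(half_period, observed_runs, half_periods=2):
    """Samples lost in the gap: expected span minus the samples actually seen."""
    return half_periods * half_period - sum(observed_runs)


# Last table entry: 3{77} before the gap, then 0{11} and 1{125} after it.
runs = [77, 11, 125]
print(gap_length(146, runs))  # 79
```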
We inherited the cast from ubb-patgen where the buffer was "const" and thus
had to be cast for the non-const argument of physmem_xlat. We never
needed a cast in ubb-la, though.
Since physmem_xlat now uses "const" as well, the cast is even doubly
superfluous.
Third time lucky, I hope. -fno-tree-cselim is much more specific than
disabling all optimization and results in a considerably less severe
performance reduction (about 30-40% of -O0).
While -O1 gets rid of the unexpected read in the simple code of a synthetic
test, it's still there in the more complex environment we have in ubb-la.c.
Turning off optimization completely seems to do the trick.
Note that the pull-ups on DAT1 through DAT3 and the pull-whichever-way on
DAT0 are likely to get in the way of any real-life use. But it's good enough
for exploring the system's characteristics and limitations.