WE ARE AT THE FAIRE RIGHT NOW! 🙂 JOIN US!
Meanwhile we’ll keep posting the story behind the project…
Me and Njay finally got together for a hacking jam session to find out what the heck is going wrong with the I²C comms bus between the RasPi and the Power Bridges. This is where his shiny new portable oscilloscope came in handy.
Running the entire system “as is”, with Joystick control, turned out to be too confusing to analyse, so I used my simple test program and commented out the “read” part; this way the only operation between the RasPi and the Bridges would be a single “write” of the PWM register, which I changed to be done every second. This gave Njay time and focus to look at the bus signals and search for the problem.
And soon enough it became obvious: some bytes were not getting through; some Write operations got all 3 bytes in, but some others got stuck after the 2nd byte and the 3rd one was never received.
Very patiently, Njay drilled down into the transition between the 2nd and 3rd bytes, trying to spot the difference between a good and bad transition. We both had forgotten all about the intricacies of the I²C protocol, so this was an excruciating exercise. Eventually we hypothesized that the “Acknowledge” bit sent by the Bridge was not correctly detected by the RasPi. In fact its timing looked a little too random for our taste, so Njay disabled all the code inside the Bridge except the I²C routines, to see if the problem was a race condition caused by the PWM/MOSFET control processing. And it worked, the comms became flawless after that. With nothing but I²C hardware interrupts to process, the tiny Atmel microprocessor behaved perfectly.
So, convinced that the problem was in the race between I²C interrupts and PWM timer interrupts, I talked Njay into giving higher priority to I²C interrupts. Which means “not disabling other interrupts while servicing PWM interrupts”, to allow I²C routines to interrupt even during a PWM routine. This looked like a good idea around midnight; the fact that Njay told me that there are 3 separate Interrupt Servicing Routines for PWM failed to raise the alarm in our heads. I was too tired and just wanted to see it working, and kept pressing him to do it. My bad. 🙂 We turned it on, and obviously…
What happened was that, by letting all interrupts run while PWM routines are running, we left the door open for several PWM routines to run over each other… which, when the Active Cycle of the PWM is very short, may actually result in MOSFET ON-configurations that are less desirable, like, say, a direct short-circuit to the battery. Which was what happened. The protection fuse blew, preventing damage to the larger components and any possibility of fire, but it was too late for the MOSFETs. And so Njay took the bridge home to repair. It cost him the whole weekend to replace the burned components and re-test the bridge. 😦 Sorry, dude.
In the meantime, he dug further into the problem and found the source of it: there is a hardware bug in the I²C machine of the BCM2835 Broadcom chip of the RasPi. It sometimes fails to recognize “Clock Stretching”, which is a very common thing that I²C Slaves do: the Master tells them something, they acknowledge it back, but they also hang on to the Clock line for a bit to prevent the Master from sending more data because they are busy doing something else. They release the line when they are ready to continue, and the Master must finish what it started. This is the part that the RasPi was failing to do: it kind of “forgot” to go on and left the bus in an inconsistent state all the way up until a new operation was started by the software. Through the force of repetition, eventually the bus recovers and normal communication is restored. But my measurements indicate that around 40% of packets were being lost. Bad, Pi, bad!!!
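For the curious, here is a minimal Python sketch of the two ideas in that paragraph: counting how many writes get lost, and recovering "through the force of repetition". It is not my actual test program. The names `try_write`, `make_flaky_bus`, `measure_loss` and `write_with_retry` are all made up for illustration, and the "bus" is a software stand-in that drops writes at random, assuming you have some way to tell whether a write got through.

```python
import random


def make_flaky_bus(loss_rate, seed=0):
    """Return a hypothetical try_write(value) -> bool that fails at random.

    This only simulates a lossy bus; it does not talk to real hardware.
    """
    rng = random.Random(seed)

    def try_write(value):
        # True means the write got through, False means it was lost.
        return rng.random() >= loss_rate

    return try_write


def measure_loss(try_write, value, n_writes):
    """Return the fraction of single write attempts that were lost."""
    lost = sum(1 for _ in range(n_writes) if not try_write(value))
    return lost / n_writes


def write_with_retry(try_write, value, max_attempts=10):
    """Repeat the write until it succeeds; return how many attempts it took."""
    for attempt in range(1, max_attempts + 1):
        if try_write(value):
            return attempt
    raise RuntimeError(f"write still failing after {max_attempts} attempts")


if __name__ == "__main__":
    bus = make_flaky_bus(loss_rate=0.4)
    print(f"measured loss: {measure_loss(bus, 0x80, 10_000):.1%}")
    print(f"attempts needed for one write: {write_with_retry(bus, 0x80)}")
```

Retrying hides the problem rather than fixing it, which is why I kept looking for something better.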
I like the I²C bus design, I think it’s quite clever; unfortunately it is always tricky and messy to work with unless you work with it every day. We’ve both suffered from this syndrome back when we worked together at a certain telecommunications company, so I can forgive the Broadcom hardware engineer who so spectacularly dropped the ball. 😉
Njay found more discussion on the net about working around this horrendous bug and suggested that I lower the baud rate from its standard 100kHz. So I took the opportunity to upgrade the very old Debian Linux on my RasPi (it only takes AN HOUR to do, with a slow SD card like mine) and try out the new kernel driver with a parameter that lets us specify the bit frequency of the bus. It got better at 50kHz, but not perfect. And I wanted perfect. 🙂
So, I tried many different frequencies, going all the way up to 550kHz, something that Njay would certainly disapprove of 😉 , but I wanted to see for myself what the limits were. And lo and behold, the communications work flawlessly at 500kHz or higher. The timing will always go right for the RasPi with such short bit times (in this specific system, with this specific Slave hardware and software).
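To put numbers on those bit times, here is a small Python sketch of the arithmetic. It only assumes the standard I²C framing of 9 clock cycles per byte (8 data bits plus the Acknowledge bit) and ignores the start and stop conditions and any clock stretching; the function names are mine.

```python
def bit_time_us(clock_hz):
    """Duration of one bus clock cycle, in microseconds."""
    return 1_000_000 / clock_hz


def write_time_us(clock_hz, n_bytes=3):
    """Rough duration of an n-byte write: 9 clock cycles per byte
    (8 data bits + 1 Acknowledge bit), start/stop conditions ignored."""
    return n_bytes * 9 * bit_time_us(clock_hz)


if __name__ == "__main__":
    for hz in (50_000, 100_000, 500_000, 550_000):
        print(f"{hz // 1000:>3} kHz: {bit_time_us(hz):5.2f} us per bit, "
              f"{write_time_us(hz):6.1f} us for a 3-byte write")
```

At 100kHz a bit lasts 10 µs and a 3-byte write takes about 270 µs; at 500kHz those shrink to 2 µs and about 54 µs.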
Even in the digital domain, tuning is imperative. 🙂