Good on you Alan Burlison.
This is not meant to be bagging you in any way. Your code did what it needed to do. Great success. My initial response in a forum comment was actually directed at the people who were offering non-working ideas of using a UART to get some hardware help.
My first suggestion of using a timer to help out is partly fleshed out below, but not fully functional. It is incomplete because, when I started to fill out the code, it became obvious that with a bit more optimization there are plenty of clocks to do the full monty as bit banging without having to unroll any loops.
The second bit of action shown here is my other suggestion. One of the "use a UART" people said that you could use an inverter to fix up the START-BIT problem. I thought "Well - if you are going to throw a 74XX at it, why not use the SPI and have 140 clock cycles free." Again this is not a complete solution, but is a "proof of concept" to show how the hardware can help.
Finally, the third piece is a version of bit banging out a WS2811 signal that I came up with. Sans a WS2811, because I don't have any. It does not do anything better than Alan's code. It is just a bit more optimized (half the size) and easier to read due to no loop unrolling and path-lengthening.
It does not break any new ground, there is no magic in it that no one has ever used. It is just a little bit me showing off and a little bit of practice for me. I have been away from the assembler for several years and am just trying to build up my confidence a little bit.
Anyways - On with the show
Step 1: Using TC0 to Generate the Waveform
So I have added a quite useless picture of the code instead.
It at least has the code/comments in glorious technicolor. If anyone wants the ASM file then send me a mail on here with your real email address and I will FWD it to you.
But back to the point.
This method of generating the pulses actually is slower (by one clock) than just pure bit banging. However it has one big advantage. All your free clock cycles (14 of them) are in one contiguous block. The bit banging version has a total of 15 free clocks, but they are broken up into two blocks AND the output-test must go at the start which limits some of the other tricks you could have used.
The astute out there will notice that the scope shows the waveform at 400 kHz. My AVR on the desk here is clocked at 8 MHz, not 16, so each bit still gets 20 clocks. It is apples for apples.
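The apples-for-apples claim is just a ratio, so a quick check makes it concrete. This is an illustrative Python sketch, not AVR code, and the function name `clocks_per_bit` is made up for the example.

```python
def clocks_per_bit(f_cpu_hz, bit_rate_hz):
    """Return how many CPU clocks are available for each transmitted bit."""
    clocks, remainder = divmod(f_cpu_hz, bit_rate_hz)
    if remainder:
        raise ValueError("bit rate does not divide the CPU clock evenly")
    return clocks

# Bench setup from the scope shot: 8 MHz AVR, 400 kHz waveform.
bench = clocks_per_bit(8_000_000, 400_000)
# Real target: 16 MHz AVR, 800 kHz WS2811 data rate.
target = clocks_per_bit(16_000_000, 800_000)

print(bench, target)  # prints "20 20"
```

Both setups give the same 20 clock budget per bit, so cycle counts measured on the slower bench setup carry over unchanged.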
Step 2: Using OC1A/B and SPI With a 74XX IC
This one uses some external 74XX logic. In this case I used a Hex Open Collector Inverter and did some wired OR logic. There are many ways this could be done with a single chip. The other obvious ones are a 7400 and a 74138.
Three different outputs need to be mixed together to make the final waveform, which is trace 2:
| Pin | Function | Scope trace |
|---|---|---|
| PD5/OC1A | Output Compare 1 A | Trace B |
| PD4/OC1B | Output Compare 1 B | Trace A |
| PB6/MISO | SPI Master In Slave Out | Trace 1 |
Also, Output Compare 1B must be fed back into the SPI clock input (SCK) to give the master clock for the SPI peripheral in SLAVE mode. This is the yellow wire in my photo above.
The reason we can get the SPI to work in this way is that in SLAVE mode the module can not insert a stop bit the way it does in MASTER mode. It is marching to the beat of someone else's drum. When the next clock comes in, it has to just comply and give out the next data bit (if it is ready) or fail otherwise. Speaking of failing: you only have 9 clock cycles to load the data register once the last byte is clear. This means you are cutting it a bit fine to use interrupts, unless you use a "stupid AVR trick" to shave a few cycles off the interrupt response time.
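How fine is "a bit fine"? A rough budget helps. The figures below are assumptions based on the usual classic ATmega datasheet numbers (4 clocks minimum interrupt response, 3 clocks for a JMP in the vector table, 2 for an LD from SRAM, 1 for the OUT to the SPI data register), so check them against your own part. The function name is made up for this sketch, and it ignores any instruction that has to finish before the interrupt is taken.

```python
SPI_RELOAD_WINDOW = 9  # clocks available, figure quoted in the text

def isr_clocks_to_reload(response=4, vector_jump=3, load=2, store=1):
    """Clocks from the interrupt firing to the new byte landing in the SPI
    data register. All defaults are assumed datasheet-style figures."""
    return response + vector_jump + load + store

plain_isr = isr_clocks_to_reload()
# Hypothetical saving: handler code placed directly at the vector address,
# so there is no jump out of the vector table.
no_jump_isr = isr_clocks_to_reload(vector_jump=0)

print(plain_isr, plain_isr <= SPI_RELOAD_WINDOW)
print(no_jump_isr, no_jump_isr <= SPI_RELOAD_WINDOW)
```

Under these assumptions a plain interrupt handler needs 10 clocks and misses the 9 clock window by one, while removing the vector jump brings it down to 7. That is only one possible way to shave cycles. The text does not say which trick it has in mind.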
Step 3: Bit Banging and Saving a Few More Clock Cycles
800 kHz on a 16 MHz AVR is 20 clock cycles per bit.
20 clock cycles on an AVR is a LOT. We are not talking about PIC12/16C here with 4 ticks per instruction and only one real register. The AVR does a lot per clock cycle. If there was not the requirement to shuffle the RGB order then the AVR could do this serial job without breaking a sweat.
In fact the only thing the AVR does not shine at is changing bits in I/O registers. This takes two clock cycles as shown in the details for the SBI instruction below. The CPU has to read the register, modify it and write it back. It is one of the few non-branching instructions in the AVR to take two clocks. (Note: the AVR XMega has fixed this issue and now is only 1 clock)
Using this instruction in time-critical paths is not much fun, as Alan's code showed. He had to jump and hop all over the place to equalise the path lengths.
sbrc r19, 7 ; test hi bit clear
rjmp 3f ; true, skip pin hi -> lo
cbi %[port], %[pin] ; false, pin hi -> lo
3: sbrc r19, 7 ; equalise delay of both code paths
4: nop ; pulse timing delay
So if the actual CBI and SBI instructions are going to take 2 clock cycles anyway, and then you have to waste 2 more clock cycles to equalise the path lengths, why not just do the read-modify-write yourself? This will take 3 cycles total.
IN R16, PortX ; Read the current state of the register
ORI R16, PinX ; Set the Xth bit high
OUT PortX, R16 ; Write the new value out to the register
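To make the 4 versus 3 comparison explicit, here is a small Python model of the two approaches. It is a sketch only: the cycle table covers classic (non-XMega) AVR cores, the two NOPs stand in for the 2 clocks of path equalisation mentioned above, and the helper names are made up for the example.

```python
# Cycle counts for the instructions involved, classic AVR core.
CYCLES = {"IN": 1, "ORI": 1, "ANDI": 1, "OUT": 1, "SBI": 2, "CBI": 2, "NOP": 1}

def cost(instructions):
    """Total clocks for a straight-line instruction sequence."""
    return sum(CYCLES[name] for name in instructions)

def set_bit_by_hand(port_value, mask):
    """Mirror of the IN / ORI / OUT sequence on an 8-bit port value."""
    reg = port_value      # IN  R16, PortX
    reg |= mask           # ORI R16, mask
    return reg & 0xFF     # OUT PortX, R16

sbi_path = cost(["SBI", "NOP", "NOP"])   # SBI plus 2 clocks of padding
manual_path = cost(["IN", "ORI", "OUT"])

print(sbi_path, manual_path)                      # prints "4 3"
print(bin(set_bit_by_hand(0b10100000, 0x01)))     # prints "0b10100001"
```

The manual sequence is a clock shorter and leaves the other seven port bits untouched, at the price of tying up one working register.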
The next thing you can do to save time is move everything you can outside the loop. Because this code is using 100% of the CPU time, there is no risk that something else is going to change PortX. Also, because no other code is running, we can use as many CPU registers as we like.
So do this IN-ing and AND/OR-ing way outside the loop.
; Before the loop: work out both port values once
IN PinLo, PortX ; Make a copy of the byte in PortX
ANDI PinLo, 0xFE ; Modify it to be the value to write to make pin lo
IN PinHi, PortX ; Make a copy of the byte in PortX
ORI PinHi, 0x01 ; Modify it to be the value to write to make pin hi
; Inside the loop: each pin change is now a single 1-clock OUT
OUT PortX, PinLo ; Set the output pin LOW
This has now made the whole bit toggling, serial shift, bit counting and looping take only 9 clocks. This leaves 11 clocks free for loading data and shuffling.
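The hoisting trick and the resulting clock budget can be checked with a few lines of Python. This is only a model of the register arithmetic: the example port value is arbitrary, bit 0 is assumed to be the output pin (matching the 0xFE and 0x01 masks above), and the function name is made up.

```python
BIT_CLOCKS = 20   # clocks per WS2811 bit at 16 MHz / 800 kHz
LOOP_CLOCKS = 9   # toggling, shifting, counting and looping, as quoted above

def precompute_port_values(port_value, pin_mask=0x01):
    """Return (pin_lo, pin_hi): the two bytes the loop will OUT to the port."""
    pin_lo = port_value & ~pin_mask & 0xFF    # ANDI PinLo, 0xFE
    pin_hi = (port_value | pin_mask) & 0xFF   # ORI  PinHi, 0x01
    return pin_lo, pin_hi

pin_lo, pin_hi = precompute_port_values(0b10100100)
free_clocks = BIT_CLOCKS - LOOP_CLOCKS

print(bin(pin_lo), bin(pin_hi))  # prints "0b10100100 0b10100101"
print(free_clocks)               # prints "11"
```

The two values differ only in the output pin's bit, so alternating between them with OUT toggles that pin and nothing else.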
Again this would be heaps of time if not for the out of order RGB thing. Because of the out of order RGB thing we can not just treat each byte read as the next one going out. We have to make a decision on where to save the newly read byte to a buffer and where from a buffer to get the next byte to send.
This is where the IJMP instruction comes to the rescue. Its page from the AVR Instruction set is shown above. We are using it like a case/switch statement in a software state-machine. In each state we can set what the NEXT state should be without having to do any evaluations.
We can do this because we always know what colour the next byte is going to be:
If we are currently processing the RED byte, the next byte WILL be GREEN.
If we are currently processing the GREEN byte, the next byte WILL be BLUE.
If we are currently processing the BLUE byte, the next byte WILL be RED.
eg. In the red state we can simply say
STATE = GREEN
We don't have to say
if (SOMETHING) then STATE = GREEN else STATE = BLUE
This saves a few clocks by not having to evaluate anything.
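The same idea can be modelled in Python, with a variable holding "the next handler to call" playing the role of the Z register that IJMP jumps through. This is a simplified sketch, not the code in the picture: it reorders whole pixels instead of interleaving the reads with bit output, and the `wire_order` default of "GRB" is an assumption made for the example.

```python
def run_state_machine(rgb_bytes, wire_order="GRB"):
    """Feed bytes arriving in R, G, B order through a three-state machine
    and return them in the order they would go out on the wire."""
    pixel = {}
    out = []

    def red(byte):
        pixel["R"] = byte
        return green           # STATE = GREEN, no test needed

    def green(byte):
        pixel["G"] = byte
        return blue            # STATE = BLUE

    def blue(byte):
        pixel["B"] = byte
        out.extend(pixel[colour] for colour in wire_order)
        return red             # STATE = RED

    state = red                # like loading Z with the first handler
    for byte in rgb_bytes:
        state = state(byte)    # like IJMP: jump to whatever Z holds
    return out

print(run_state_machine([1, 2, 3, 4, 5, 6]))  # prints "[2, 1, 3, 5, 4, 6]"
```

Each handler returns its successor unconditionally, which is the point of the technique: the dispatch costs the same every time and never evaluates a condition.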
The whole code is shown as a picture here. Again send me a PM if you want me to email it to you.
The comments in the code are hopefully enough to let even someone unfamiliar with AVR-ASM understand it.
Step 4: Using a UART Without an External Inverter
1, Use 5 UART bits for each WS2811 bit
2, UART in 8 bit mode
4, The entry point to the serialiser is not bit 0
I am not going to write the code for that as it is a waste of time on the AVR as you still have to count clocks on entry and it does not gain that much free time anyway. On the XMega with DMA it is a different proposition though. It could free most of your XMega CPU.
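The framing arithmetic behind that list can at least be checked without writing the serialiser. The sketch below assumes an ordinary UART frame of one start bit, 8 data bits, one stop bit and no parity. The function and parameter names are made up for the example.

```python
def uart_plan(ws_bit_rate_hz=800_000, uart_bits_per_ws_bit=5,
              data_bits=8, start_bits=1, stop_bits=1):
    """Return (baud, frame_bits, ws_bits_per_frame, leftover_uart_bits)."""
    baud = ws_bit_rate_hz * uart_bits_per_ws_bit
    frame_bits = start_bits + data_bits + stop_bits
    ws_bits_per_frame, leftover = divmod(frame_bits, uart_bits_per_ws_bit)
    return baud, frame_bits, ws_bits_per_frame, leftover

baud, frame_bits, ws_bits_per_frame, leftover = uart_plan()
print(baud, frame_bits, ws_bits_per_frame, leftover)  # prints "4000000 10 2 0"
```

Under those assumptions each 10-bit UART frame carries exactly two WS2811 bits with nothing left over, so the start and stop bits have to form part of the waveform instead of being gaps in it. The UART would also have to run at 4 Mbaud to hit 800 kHz.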
(I didn't know what to do as a photo for this step so I just did the brass robot from The Etchinator)
Step 5: Conclusion
What to conclude?
1, Alan did a fine job and it worked.
2, I am a tosser that just wants to show off how you can do things in fewer clocks on an AVR
3, People leaving comments on HaD should put up or shut up