Good on you, Alan Burlison.
This is not meant to be bagging you in any way. Your code did what it needed to do. Great success. My initial response in a forum comment was actually directed at the people who were offering non-working ideas of using a UART to get some hardware help.
My first suggestion of using a timer to help out is partly fleshed out below, but not fully functional. The reason it is not complete is that when I started to fill out the code it became obvious that, with a bit more optimization, there are plenty of clocks to do the full monty as bit-banging without having to unroll any loops.
The second bit of action shown here is my other suggestion. One of the "use a UART" people said that you could use an inverter to fix up the START-BIT problem. I thought "Well - if you are going to throw a 74XX at it, why not use the SPI and have 140 clock cycles free." Again this is not a complete solution, but is a "proof of concept" to show how the hardware can help.
Finally, the third piece is a version of bit-banging out to a WS2811 that I came up with. Sans a WS2811, because I don't have any. It does not do anything better than Alan's code. It is just a bit more optimized (half the size) and easier to read due to no loop unrolling and path-lengthening.
It does not break any new ground; there is no magic in it that no one has ever used before. It is just a little bit of me showing off and a little bit of practice for me. I have been away from assembler for several years and am just trying to build up my confidence a little bit.
Anyways - On with the show