I believe the optimization came because the denominator was a power of two. In my memory, the function counted up all of the bytes being sent and checked to see that the sum was a power of 16 (I think 16 bytes made a single USB endpoint or something; I still don’t fully understand USB).
For starters, you can split up a larger modulo into smaller ones:
X = (A + B); X % n = (A % n + B % n) % n
So our 16 bit number X can be split into an upper and lower byte:
X = (X & 0xFF) + (X >> 8)
so
X % 16 = ((X & 0xFF) % 16 + (X >>8) % 16) % 16
This is probably what the compiler was doing in the background anyway, but the real magic came from this neat trick:
x % 2^n = x & (2^n - 1).
so
x % 16 = x & 15
So a 16 bit modulo just became three bitwise ANDs.
Edit: and before anybody thinks I’m good a math, I’m pretty sure I found a forum post where someone was solving exactly my problem, and I just copy/pasted it in.
Edit2: I’m pretty sure I left it here, but I think you can further optimize by just ignoring the upper byte entirely. Again, only because 16 is a power of 2 and works nicely with bitwise arithmatic.
Oh right, duh. Thanks.