Storing mask bits in the low lanes of a regular vector register seems to create more problems than it solves.
i'm very happy you've started a longer-form, text-based, programming-centric habit of communication instead of making me parse your twitter threads. thanks, casey!
Found this interesting video that talks about an actual implementation (in hardware, as far as I can tell), and about exactly this problem (timestamp link): https://youtu.be/WzID6kk8RNs?t=567
The presentation is by Roger Espasa; from what I saw on the mailing list, he's the co-chair of the RISC-V Vector work group / committee / thingy. He does mention the extra wiring, connecting the lanes, and a lot of complication needed for an out-of-order core, but it's hard (for me) to gauge the actual complexity from the presentation (like, did they work for 3 years on just this problem, or was it work as usual?).