Is there an Armv8-A intrinsic for 16-byte wide VTBL?

According to my usual source, the Searchable Neon Arm Intrinsic Guide, there are only these four classes of table-lookup intrinsics, all with an 8-byte destination register (uint8x8_t and poly8x8_t variants omitted for brevity).

int8x8_t vtbl1_s8 (int8x8_t a, int8x8_t b)
int8x8_t vtbl2_s8 (int8x8x2_t a, int8x8_t b)
int8x8_t vtbl3_s8 (int8x8x3_t a, int8x8_t b)
int8x8_t vtbl4_s8 (int8x8x4_t a, int8x8_t b)

To my surprise, my source code

uint8x16_t oddeven(uint8x16_t a) {
    auto l = vget_low_u8(a);
    auto h = vget_high_u8(a);
    auto lh = vuzp_u8(l,h);
    return vcombine_u8(lh.val[0], lh.val[1]);
}

produced this practically single-instruction code for odd/even de-interleaving of a 16-byte vector:

adrp    x8, .LCPI0_0
ldr     q1, [x8, :lo12:.LCPI0_0]
tbl     v0.16b, { v0.16b }, v1.16b

So there it is: a tbl v0.16b, { v0.16b }, v1.16b variant that apparently performs a full 16->16 byte permutation of the original data in a single instruction. Is this (un)documented, or can it otherwise be produced with intrinsics?

See full code and listing in

2 answers

  • answered 2019-11-14 04:53 Jake 'Alquimista' LEE

    No, there are no intrinsics for 16-byte permutation, even though the tbl instruction on AArch64 accepts it.

  • answered 2019-11-14 05:33 Peter Cordes

    You can find it in the intrinsics guide by searching for tbl (the instruction mnemonic), then using "search within page" for 16 until you reach the uint8x16_t versions, which show the naming scheme for these intrinsics:

    uint8x16_t vqtbl1q_u8 (uint8x16_t t, uint8x16_t idx)

    (Thanks to @RossRidge for pointing out the correct name in the first place; the point of this answer is to suggest a way to find intrinsics based on a known instruction mnemonic. It works better for Intel's x86 intrinsic finder where the element size is part of the mnemonic, so searching on the asm mnemonic usually narrows down the list of intrinsic results sufficiently to scan visually.)
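
As an illustration of the vqtbl1q_u8 intrinsic named above, here is a minimal sketch of the question's odd/even shuffle written as an explicit 16-byte table lookup. The index table and the array-based signature are assumptions made for this example, as are the helper name selftest and the scalar fallback, which is there only so the sketch also builds on non-AArch64 hosts. Whether a compiler emits exactly one tbl for it is not guaranteed.

```cpp
#include <cassert>
#include <cstdint>

// Assumed index table for this sketch: even-indexed bytes first, then
// odd-indexed bytes, matching vuzp_u8(low, high) followed by vcombine_u8.
static const uint8_t kOddEvenIdx[16] = {0, 2, 4, 6, 8, 10, 12, 14,
                                        1, 3, 5, 7, 9, 11, 13, 15};

#if defined(__aarch64__)
#include <arm_neon.h>

// 16-byte TBL through the AArch64-only intrinsic vqtbl1q_u8.
static void oddeven(const uint8_t in[16], uint8_t out[16]) {
    uint8x16_t table = vld1q_u8(in);           // bytes to permute
    uint8x16_t index = vld1q_u8(kOddEvenIdx);  // one source index per output byte
    vst1q_u8(out, vqtbl1q_u8(table, index));
}
#else
// Scalar fallback with TBL semantics: an out-of-range index yields 0.
static void oddeven(const uint8_t in[16], uint8_t out[16]) {
    for (int i = 0; i < 16; ++i) {
        uint8_t k = kOddEvenIdx[i];
        out[i] = (k < 16) ? in[k] : 0;
    }
}
#endif

// Hypothetical self-check: input bytes 0..15 should come out in table order.
static void selftest() {
    uint8_t in[16], out[16];
    for (int i = 0; i < 16; ++i) in[i] = static_cast<uint8_t>(i);
    oddeven(in, out);
    for (int i = 0; i < 16; ++i) assert(out[i] == kOddEvenIdx[i]);
}
```

Calling selftest() exercises the permutation on the bytes 0..15. For this particular fixed pattern, the AArch64 intrinsics vuzp1q_u8 and vuzp2q_u8 are another option, but the table form generalizes to any 16-byte permutation.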