#208 new

Code quality deteriorates for macros with > 5 arguments

Reported by Peter Johnson | June 25th, 2011 @ 07:52 PM

Originally posted on Trac by Martin Sander Martin@MartinSander.de
Original Trac Ticket


To write large blocks of nearly repetitive code, I rely on macros containing %rep blocks. It turns out that code quality, as measured by execution speed, deteriorates if a macro has more than 5 parameters. Consider the following example:

%macro CopyComplexToBlock 8
    %define k_DestReg   %1
    %define k_SrcReg    %2
    %define k_BlockHtW  %3
    %define k_SrcLenW   %4
    %define k_cx        %5
    %assign W           %6
    %assign k_BlockLen  %7
    %assign k_bias      %8

;   code part of macro....

This macro would be called like this:

    CopyComplexToBlock rdi, rsi, r8, r9, rcx, 4, 48, 128

Without any modification to the code part of this macro, execution is more than 10% faster if the numerical constants (which are %assigned rather than %defined) are simply moved out of the macro header:

%macro CopyComplexToBlock 5
    %define k_DestReg   %1
    %define k_SrcReg    %2
    %define k_BlockHtW  %3
    %define k_SrcLenW   %4
    %define k_cx        %5

;  exactly the same macro code part as above...

This form of the macro would be called like this:

    %assign W           4
    %assign k_BlockLen  48
    %assign k_bias      128
    CopyComplexToBlock rdi, rsi, r8, r9, rcx

Instead of one line, I need four lines, but the code executes considerably faster!
Here is the actual code that follows the header:

    mov k_cx, k_BlockHtW
%%CopyComplexToBlockNextRow:
    %assign k_off  -k_bias
    %rep (k_BlockLen*W) / 8
        wmovupf xmm0, [k_SrcReg+k_off],           W
        wmovupf xmm1, [k_SrcReg+k_off+xmmSize],   W
        wmovupf xmm2, [k_SrcReg+k_off+2*xmmSize], W
        wmovupf xmm3, [k_SrcReg+k_off+3*xmmSize], W
        wmovapf xmm4, xmm0, W
        wshufpf xmm0, xmm1, 10001000b, W
        wmovapf xmm5, xmm2, W
        wshufpf xmm2, xmm3, 10001000b, W
        wshufpf xmm4, xmm1, 11011101b, W
        wmovapf [k_DestReg+k_off], xmm0, W
        wshufpf xmm5, xmm3, 11011101b, W
        wmovapf [k_DestReg+k_off+2*xmmSize], xmm2, W
        wmovapf [k_DestReg+k_off+xmmSize],   xmm4, W
        wmovapf [k_DestReg+k_off+3*xmmSize], xmm5, W
    %endrep
    add k_SrcReg,  k_SrcLenW
    add k_DestReg, k_BlockLen*W
    sub k_cx, 2*W
    jnz %%CopyComplexToBlockNextRow

In a separate macro definition file, wmovapf is translated into movaps, movapd, vmovaps, or vmovapd, depending on the target processor and on whether single or double precision data is selected by the parameter W. The same applies to wmovupf and wshufpf.

I am not sure where to locate the problem. My best guess is that the preprocessor gets confused by too many macro arguments. On the other hand, 8 arguments is not that many and should not cause the assembler any difficulty.
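One way to narrow this down would be to check whether the two calling conventions emit different bytes at all. The following is only a minimal sketch, not the original macro: the names Emit8, Emit5 and USE_EIGHT are hypothetical, and the body is reduced to three plain instructions. Assembling it twice as a flat binary, once with USE_EIGHT defined on the command line and once without, and then comparing the two output files should show whether the argument count changes the generated code. If the bytes are identical, the slowdown would more likely come from something outside the preprocessor, such as code placement.

```nasm
; Sketch under stated assumptions: Emit8, Emit5 and USE_EIGHT are
; hypothetical names; the body is a stand-in for the real macro code.
bits 64

; 8-parameter form: the constants arrive as macro arguments.
%macro Emit8 8
    %define k_DestReg   %1
    %define k_SrcReg    %2
    %assign W           %6
    %assign k_BlockLen  %7
    %assign k_bias      %8
    movaps xmm0, [k_SrcReg - k_bias]
    movaps [k_DestReg - k_bias], xmm0
    add    k_DestReg, k_BlockLen*W
%endmacro

; 5-parameter form: the constants must be %assigned before the call.
%macro Emit5 5
    %define k_DestReg   %1
    %define k_SrcReg    %2
    movaps xmm0, [k_SrcReg - k_bias]
    movaps [k_DestReg - k_bias], xmm0
    add    k_DestReg, k_BlockLen*W
%endmacro

%ifdef USE_EIGHT
    Emit8 rdi, rsi, r8, r9, rcx, 4, 48, 128
%else
    %assign W           4
    %assign k_BlockLen  48
    %assign k_bias      128
    Emit5 rdi, rsi, r8, r9, rcx
%endif
```

The same comparison could then be repeated with the real macro body and the wmov* helper macros included, since those are where a difference in expansion would have to show up.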

Trac Comments

Changed 9 months ago by peter@tortall.net

In some initial testing I'm not seeing much difference.  It's likely something in the wmov* macros which is having a ripple effect.  Can you post those?


Changed 3 months ago by anonymous

Sorry for not answering for a long time. I checked several times after my initial post, but did not see your comment then...

Here are the macros:

%ifdef AVX_available
    %define wmovups vmovups
    %define wmovaps vmovaps
    %define wmovupd vmovupd
    %define wmovapd vmovapd

    %macro  wshufps 3-4
        %if %0 = 4  ; 3-xmm operand form
            vshufps %1, %2, %3, %4
        %else
            vshufps %1, %1, %2, %3
        %endif
    %endmacro

%else  ; no AVX available
    %define wmovups movups
    %define wmovaps movaps
    %define wmovupd movupd
    %define wmovapd movapd

    %macro  wshufps 3-4
        %if %0 = 4  ; 3-xmm operand form
            movaps %1, %2
            shufps %1, %3, %4
        %else
            shufps %1, %2, %3
        %endif
    %endmacro
%endif

%macro wmovapf 3
    ;  %1:XMMReg, %2:src, %3:W
    %if %3 = 4  ; float accuracy
        wmovaps %1, %2
    %else       ; double accuracy
        wmovapd %1, %2
    %endif
%endmacro

%macro wmovupf 3
    ;  %1:XMMReg, %2:src, %3:W
    %if %3 = 4  ; float accuracy
        wmovups %1, %2
    %else       ; double accuracy
        wmovupd %1, %2
    %endif
%endmacro

%macro wshufpf 4-5
    ;  %1:XMMRegDest, %2:XMMRegSrc1, (%3:XMMRegSrc2), %3 or %4: bitmask, %4 or %5: W
    %if %0 = 5  ; four-operand form of (v)shufpd or (v)shufps
        %if %5 = 4  ; float precision
            wshufps %1, %2, %3, %4
        %else       ; double precision
            wshufpd %1, %2, %3, %4
        %endif
    %else       ; three-operand form
        %if %4 = 4  ; float
            wshufps %1, %2, %3
        %else       ; double
            wshufpd %1, %2, %3
        %endif
    %endif
%endmacro



The Yasm Modular Assembler Project
