From owner-chemistry&$at$&ccl.net Thu Sep 29 18:49:05 2005
From: "CCL"
To: CCL
Subject: CCL: Cleaning up dusty deck fortran and converting to C/C++
Message-Id: <-29405-050929160616-28406-34rr3HPj6mMmi67mbdt+PA---server.ccl.net>
X-Original-From: "Alex. A. Granovsky"
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset="iso-8859-1"
Date: Fri, 30 Sep 2005 00:04:31 +0400
MIME-Version: 1.0

Sent to CCL by: "Alex. A. Granovsky" [gran---classic.chem.msu.su]
--Replace strange characters with the "at" sign to recover email address--.

Dear Perry,

> You also need the exact GCC. Which version are you using?

gcc version 3.3.3

> Anyway, when I use -msse2, I seem to get references to %xmmN and such
> in my assembly language output. What are you trying?
>
> Anyway, assuming gcc 3 or better:
>
> cc -S -std=c99 -march=pentium4 -msse2 -mfpmath=sse -c bar.c
>
> For me, that turns the core of the loop into a set of sse vector ops.

Using the command line

    gcc -S -std=c99 -march=pentium4 -msse2 -mfpmath=sse -c test.c

I got an extremely unoptimized inner loop that uses scalar SSE2
instructions. After adding the -O3 option to the command line above,
the inner loop becomes:

.L10:
        movapd  %xmm1, %xmm2
        mulsd   (%ebx,%eax), %xmm2
        movsd   %xmm2, (%esi,%eax)
        addl    $8, %eax
        subl    $1, %edx
        jne     .L10

As you can see, the multiplication and the memory references still use
scalar SSE2, and the loop is not unrolled. With icc, the loop is
unrolled, but it is still scalar and preserves strict memory access
ordering.
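To illustrate why a compiler that cannot rule out overlap between the two arrays has to keep loads and stores in strict order, here is a minimal C sketch. The source of test.c is not shown in this message, so the functions `scale` and `scale_pairs`, their signatures, and the test data are hypothetical stand-ins. `scale_pairs` only mimics in plain C what a packed two-element step would do (both loads before both stores); it is not compiler output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the loop in test.c: y[i] = a * x[i].
 * Each element is loaded and stored strictly in order. */
void scale(size_t n, double a, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i];
}

/* Same arithmetic, two elements per step, with both loads done
 * before both stores (an assumption made to imitate a packed step). */
void scale_pairs(size_t n, double a, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        double x0 = x[i];
        double x1 = x[i + 1];
        y[i]     = a * x0;
        y[i + 1] = a * x1;
    }
    if (i < n)              /* leftover element when n is odd */
        y[i] = a * x[i];
}

/* With buf = {1,1,1,1,1}, a = 2 and y = buf + 1 (overlapping x = buf):
 *   scale       leaves buf = {1, 2, 4, 8, 16}
 *   scale_pairs leaves buf = {1, 2, 2, 4, 2}
 * With non-overlapping arrays both functions give identical results. */
```

Because the two orderings disagree as soon as the output overlaps the input, a compiler that cannot prove the arrays are distinct must either keep the scalar order or add a run-time overlap check before using the packed form.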
Using ifort 9.0 (Windows version, hence the MS-style listing) and the
original Fortran code:

$B1$11:                         ; Preds $B1$9 $B1$11
        movapd    xmm2, XMMWORD PTR [esi+ebx*8]                 ;11.17
        mulpd     xmm2, xmm0                                    ;11.8
        movapd    xmm3, XMMWORD PTR [esi+ebx*8+16]              ;11.17
        movapd    xmm4, XMMWORD PTR [esi+ebx*8+32]              ;11.17
        movapd    xmm5, XMMWORD PTR [esi+ebx*8+48]              ;11.17
        movapd    XMMWORD PTR [edi+ebx*8], xmm2                 ;11.8
        mulpd     xmm3, xmm0                                    ;11.8
        movapd    XMMWORD PTR [edi+ebx*8+16], xmm3              ;11.8
        mulpd     xmm4, xmm0                                    ;11.8
        movapd    XMMWORD PTR [edi+ebx*8+32], xmm4              ;11.8
        mulpd     xmm5, xmm0                                    ;11.8
        movapd    XMMWORD PTR [edi+ebx*8+48], xmm5              ;11.8
        add       ebx, 8                                        ;11.8
        cmp       ebx, ecx                                      ;11.8
        jb        $B1$11        ; Prob 97%                      ;11.8
        jmp       $B1$16        ; Prob 100%                     ;11.8
        ALIGN     4
        ALIGN     4
                                ; LOE eax ecx ebx esi edi xmm0 xmm1
$B1$14:                         ; Preds $B1$9 $B1$14
        movsd     xmm2, QWORD PTR [esi+ebx*8]                   ;11.17
        movhpd    xmm2, QWORD PTR [esi+ebx*8+8]                 ;11.17
        mulpd     xmm2, xmm0                                    ;11.8
        movsd     xmm3, QWORD PTR [esi+ebx*8+16]                ;11.17
        movhpd    xmm3, QWORD PTR [esi+ebx*8+24]                ;11.17
        movsd     xmm4, QWORD PTR [esi+ebx*8+32]                ;11.17
        movhpd    xmm4, QWORD PTR [esi+ebx*8+40]                ;11.17
        movsd     xmm5, QWORD PTR [esi+ebx*8+48]                ;11.17
        movhpd    xmm5, QWORD PTR [esi+ebx*8+56]                ;11.17
        movapd    XMMWORD PTR [edi+ebx*8], xmm2                 ;11.8
        mulpd     xmm3, xmm0                                    ;11.8
        movapd    XMMWORD PTR [edi+ebx*8+16], xmm3              ;11.8
        mulpd     xmm4, xmm0                                    ;11.8
        movapd    XMMWORD PTR [edi+ebx*8+32], xmm4              ;11.8
        mulpd     xmm5, xmm0                                    ;11.8
        movapd    XMMWORD PTR [edi+ebx*8+48], xmm5              ;11.8
        add       ebx, 8                                        ;11.8
        cmp       ebx, ecx                                      ;11.8
        jb        $B1$14        ; Prob 97%                      ;11.8

As you can see, there are two versions of the inner loop for the two
possible cases of data alignment, as there should be. Both are unrolled
and vectorized, i.e., they use vectorized memory access and
multiplication instructions.

Thus, I would suppose that in C99 passing arrays does not imply that
there is no aliasing between them; on the other hand, it is now
impossible to use the restrict qualifier to explicitly declare this.
This effectively disables advanced unrolling and vectorization.

Best regards,
Alex Granovsky