[gentoo-science] [Fwd: [atlas-devel] 3.7.31 and threading problem]

* [gentoo-science] [Fwd: [atlas-devel] 3.7.31 and threading problem]
From: M. Edward (Ed) Borasky @ 2007-05-18 19:06 UTC
  To: gentoo-science

[-- Attachment #1: Type: text/plain, Size: 42 bytes --]

Is this applicable to the Gentoo ebuilds?

[-- Attachment #2: [atlas-devel] 3.7.31 and threading problem.eml --]
[-- Type: message/rfc822, Size: 5961 bytes --]

From: Clint Whaley <whaley@cs.utsa.edu>
To: math-atlas-devel@lists.sourceforge.net
Subject: [atlas-devel] 3.7.31 and threading problem
Date: Thu, 17 May 2007 16:45:32 -0500
Message-ID: <200705172145.l4HLjW1A007591@pandora1.cs.utsa.edu>

Guys,

OK, ATLAS 3.7.31 has finally escaped.  It's been a while since the last 
developer release, but I've actually been working pretty fulltime on it
for a while now.

I've added support for MIPS, and some assembly kernels tuned for static
MIPS archs. I've also updated config to handle the MIPS/Linux system I
had access to.  I've also finally got OS X/G5 (AKA PPC970) support in
the new framework.  It took a lot of hoop-jumping.  I presently have full
config support only for OS X/G5 (including 64 & 32 bit arch defs); I have no
idea how OS X/G4 will do, and don't know if things will work under Linux or
AIX (all on the to-do list, if/when I get time & access).

More importantly, we now have much better kernels for PowerPC970FX.  Single
precision now has an assembly kernel, which I wrote because the 
compiler-controlled altivec kernel had to be massaged with every compiler
release, and so it was proving impossible to keep up to date.  In assembly,
I could make things a good deal faster, and so we now achieve almost 79%
of peak in the kernel (the old kernel got something like 62% of peak).

For double precision, a student in my Fundamentals of High Performance
Optimization class, Tony Castaldo, found a cool trick for PPC970 FPU
code: you get much better performance if you issue your instructions
in sets of four (generally, 4 integer or load ops, followed by 4 fpops, etc).
Tony also noticed that by mixing the iterations of the M-loop you could push
performance slightly higher yet.  With these tricks, ATLAS's kernel performance
went from roughly 75% of peak to over 82%.

Therefore, on my 2GHz PowerPC970, I can now achieve over 6Gflop in a DGEMM,
and almost 12Gflop in SGEMM for one processor (though you have to run big
problems to see these high numbers).
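
As a rough cross-check of these numbers, the quoted rates can be set against a theoretical peak. The sketch below assumes, and this is not stated in the message, that a PowerPC970 core can retire 4 double-precision flops per cycle (two FPUs, each doing a fused multiply-add) and 8 single-precision flops per cycle (one 4-wide AltiVec fused multiply-add). The function names are made up for illustration.

```python
# Rough cross-check of the quoted PowerPC970 figures.
# ASSUMPTION (not from the message): flops retired per cycle per core.
CLOCK_GHZ = 2.0
DP_FLOPS_PER_CYCLE = 4  # assumed: 2 FPUs x fused multiply-add
SP_FLOPS_PER_CYCLE = 8  # assumed: 4-wide AltiVec fused multiply-add


def peak_gflops(clock_ghz, flops_per_cycle):
    """Theoretical peak in Gflop/s for one core."""
    return clock_ghz * flops_per_cycle


def fraction_of_peak(achieved_gflops, clock_ghz, flops_per_cycle):
    """Achieved rate as a fraction of the theoretical peak."""
    return achieved_gflops / peak_gflops(clock_ghz, flops_per_cycle)


dp_peak = peak_gflops(CLOCK_GHZ, DP_FLOPS_PER_CYCLE)
sp_peak = peak_gflops(CLOCK_GHZ, SP_FLOPS_PER_CYCLE)

# "over 6Gflop" in DGEMM and "almost 12Gflop" in SGEMM, taken as 6 and 12.
dgemm_fraction = fraction_of_peak(6.0, CLOCK_GHZ, DP_FLOPS_PER_CYCLE)
sgemm_fraction = fraction_of_peak(12.0, CLOCK_GHZ, SP_FLOPS_PER_CYCLE)

print(f"DGEMM:  6.0 of {dp_peak:.1f} Gflop/s = {dgemm_fraction:.0%}")
print(f"SGEMM: 12.0 of {sp_peak:.1f} Gflop/s = {sgemm_fraction:.0%}")
```

Under those assumptions both full-GEMM figures land near 75% of peak, a little below the 82% and 79% quoted for the kernels alone, which is the direction one would expect.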

I also noticed something on the threading front which may be critical for
some of you.  For very large problems (e.g., N > 2K) ATLAS's threaded
performance dropped badly, to below serial.  The reason is tied into
memory allocation.  I'm pretty sure there's a fix that will allow the
threaded code to handle this better, but in the meantime, if you experience
this problem, pump up the maximum amount of workspace ATLAS is allowed by
increasing the macro ATL_MaxMalloc in ATLAS/atlas_lv3.h.  It is presently at
16MB; for my machine, I set it to 160MB, and then I never saw the problem
again :)
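
To see why problems of order N > 2K might run into a 16MB cap, it helps to compare the size of a single operand with the limit. The sketch below is only arithmetic: it assumes 1MB = 2^20 bytes and, as a deliberate simplification, that the workspace has to hold one full N x N copy of a matrix. The message does not say how ATLAS actually sizes its workspace, so the helper names and the one-copy model are assumptions.

```python
import math

MIB = 1024 * 1024  # assumed meaning of "MB" here

# Bytes per element for the four ATLAS precisions (s, d, c, z).
BYTES_PER_ELEMENT = {"s": 4, "d": 8, "c": 8, "z": 16}


def matrix_bytes(n, prec="d"):
    """Size in bytes of one dense n x n matrix of the given precision."""
    return n * n * BYTES_PER_ELEMENT[prec]


def largest_square_order(limit_bytes, prec="d"):
    """Largest n such that one n x n matrix fits within limit_bytes."""
    return math.isqrt(limit_bytes // BYTES_PER_ELEMENT[prec])


for limit_mib in (16, 160):
    n_max = largest_square_order(limit_mib * MIB, "d")
    print(f"{limit_mib:>3} MB cap: one double matrix fits up to N = {n_max}")

size_2k = matrix_bytes(2048, "d")
print(f"N = 2048 double matrix: {size_2k // MIB} MB")
```

On this simplified model a 2048 x 2048 double-precision matrix takes 32MB, twice the default cap, while a 160MB cap leaves room for orders up to roughly 4500.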

Cheers,
Clint

_______________________________________________
Math-atlas-devel mailing list
Math-atlas-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/math-atlas-devel


* Re: [gentoo-science] [Fwd: [atlas-devel] 3.7.31 and threading problem]
From: Markus Dittrich @ 2007-05-18 19:31 UTC
  To: gentoo-science

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Fri, 18 May 2007, M. Edward (Ed) Borasky wrote:

> Is this applicable to the Gentoo ebuilds?
>

Hi Ed,

What exactly are you referring to? The part about
ATL_MaxMalloc? If so, I haven't done any benchmarking
yet, but I don't see a problem increasing this
value to 160M as suggested by Clint.

cheers,
Markus


- -- 
Markus Dittrich (markusle)
Gentoo Linux Developer
Scientific applications
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFGTf8txlRwCwb7k40RAmfSAJ9r5K/RD+4rxJbFm7NDH+ONWoy1ZwCeKLNL
y0jHAOSCmQI+V5mKzcWsjcE=
=LpY6
-----END PGP SIGNATURE-----
-- 
gentoo-science@gentoo.org mailing list

* Re: [gentoo-science] [Fwd: [atlas-devel] 3.7.31 and threading problem]
From: M. Edward (Ed) Borasky @ 2007-05-18 19:57 UTC
  To: gentoo-science

Markus Dittrich wrote:
> On Fri, 18 May 2007, M. Edward (Ed) Borasky wrote:
>
>> Is this applicable to the Gentoo ebuilds?
>>
>
> Hi Ed,
>
> What exactly are you referring to? The part about
> ATL_MaxMalloc? If so, I haven't done any benchmarking
> yet, but I don't see a problem increasing this
> value to 160M as suggested by Clint.
>
> cheers,
> Markus
Yeah ... I just got an Athlon64 X2 4200+ and the first thing I did when 
I got the machine stabilized was build "blas-atlas" and "lapack-atlas" 
(3.7.30). New versions of ATLAS generally show up in Portage or the 
science overlay within a day after Clint releases them, so I don't 
usually test the upstream source.

The best I got out of my machine was something like 7 GFLOPS on a 32-bit 
test with 3.7.30, and there were some cases that looked like they should 
have done better. So I definitely want to test this 160M setting.
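
One way to try the 160M setting before building is to rewrite the macro's definition in the header. The following is a minimal sketch, not the ebuild's actual method: the header excerpt is hypothetical (the real ATLAS/atlas_lv3.h may lay out the define differently), and set_max_malloc is an invented helper name.

```python
import re

# HYPOTHETICAL header excerpt, for illustration only.
SAMPLE_HEADER = """\
#ifndef ATL_MaxMalloc
   #define ATL_MaxMalloc 16777216
#endif
"""

# Matches a "#define ATL_MaxMalloc <value>" line and captures everything
# up to (but not including) the value.
_DEFINE_RE = re.compile(
    r"^([ \t]*#[ \t]*define[ \t]+ATL_MaxMalloc[ \t]+)\S+.*$",
    re.MULTILINE,
)


def set_max_malloc(header_text, new_bytes):
    """Return header_text with the ATL_MaxMalloc value replaced."""
    patched, count = _DEFINE_RE.subn(
        lambda m: m.group(1) + str(new_bytes), header_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one ATL_MaxMalloc define, found {count}")
    return patched


patched_header = set_max_malloc(SAMPLE_HEADER, 160 * 1024 * 1024)
print(patched_header)
```

Raising an error when the define is missing or duplicated keeps the edit from silently doing nothing if the header layout differs from the assumed one.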

* Re: [gentoo-science] [Fwd: [atlas-devel] 3.7.31 and threading problem]
From: Markus Dittrich @ 2007-05-18 20:33 UTC
  To: gentoo-science

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Fri, 18 May 2007, M. Edward (Ed) Borasky wrote:
> Yeah ... I just got an Athlon64 X2 4200+ and the first thing I did 
> when I got the machine stabilized was build "blas-atlas" and 
> "lapack-atlas" (3.7.30). New versions of Atlas generally show up in 
> Portage or the science overlay within a day after Clint releases 
> them, so I don't usually test the upstream source.
>
> The best I got out of my machine was something like 7 GFLOPS on a 
> 32-bit test with 3.7.30, and there were some cases that looked like 
> they should have done better. So I definitely want to test this 
> 160M setting.
> --

Nice! Please let us know how your benchmarking goes.
The new version should hit portage
by tomorrow and I plan to go with 160M by default
unless I notice something strange during testing.

Best,
Markus

- ---

Markus Dittrich (markusle)
Gentoo Linux Developer
Scientific applications
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFGTg2nxlRwCwb7k40RAjOYAJ4/Va9/W8UM/v7EqmOgXLlH120cfgCdEIpQ
J4hpzeIHuRE+TgR8/Vo8rEU=
=1RAJ
-----END PGP SIGNATURE-----

* Re: [gentoo-science] [Fwd: [atlas-devel] 3.7.31 and threading problem]
From: M. Edward (Ed) Borasky @ 2007-05-19  4:38 UTC
  To: gentoo-science

[-- Attachment #1: Type: text/plain, Size: 1368 bytes --]

Markus Dittrich wrote:
> On Fri, 18 May 2007, M. Edward (Ed) Borasky wrote:
>> Yeah ... I just got an Athlon64 X2 4200+ and the first thing I did 
>> when I got the machine stabilized was build "blas-atlas" and 
>> "lapack-atlas" (3.7.30). New versions of Atlas generally show up in 
>> Portage or the science overlay within a day after Clint releases 
>> them, so I don't usually test the upstream source.
>>
>> The best I got out of my machine was something like 7 GFLOPS on a 
>> 32-bit test with 3.7.30, and there were some cases that looked like 
>> they should have done better. So I definitely want to test this 160M 
>> setting.
>> -- 
>
> Nice! Please let us know how your benchmarking goes.
> The new version should hit portage
> by tomorrow and I plan to go with 160M by default
> unless I notice something strange during testing.
>
> Best,
> Markus
I just installed 3.7.31. It looks like there are no significant 
performance differences -- I've attached the two "SUMMARY.LOG" files 
from 3.7.30 and 3.7.31.



[-- Attachment #2: SUMMARY.LOG --]
[-- Type: text/plain, Size: 8935 bytes --]


*******************************************************************************
*******************************************************************************
*******************************************************************************
*       BEGAN ATLAS3.7.30 INSTALL OF SECTION 0-0-0 ON 05/13/2007 AT 16:54     *
*******************************************************************************
*******************************************************************************
*******************************************************************************





IN STAGE 1 INSTALL:  SYSTEM PROBE/AUX COMPILE
   Level 1 cache size calculated as 64KB.

   dFPU: Combined muladd instruction with 5 cycle pipeline.
         Apparent number of registers : 32
         Register-register performance=1691.12MFLOPS
   sFPU: Combined muladd instruction with 5 cycle pipeline.
         Apparent number of registers : 32
         Register-register performance=1576.58MFLOPS


IN STAGE 2 INSTALL:  TYPE-DEPENDENT TUNING


STAGE 2-1: TUNING PREC='d' (precision 1 of 4)


   STAGE 2-1-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_dmm4x1x90_x87.c, NB=52, written by R. Clint Whaley
      Performance: 4030.58MFLOPS (182.38 percent of of detected clock rate)
        (Gen case got 2222.47MFLOPS)
      mmNN   : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
               Performance = 2182.04 (54.14 of copy matmul, 98.73 of clock)
      mmNT   : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
               Performance = 1957.55 (48.57 of copy matmul, 88.58 of clock)
      mmTN   : ma=1, lat=8, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
               Performance = 2063.71 (51.20 of copy matmul, 93.38 of clock)
      mmTT   : ma=1, lat=5, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
               Performance = 1808.69 (44.87 of copy matmul, 81.84 of clock)



   STAGE 2-1-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes


   STAGE 2-1-3: LARGE/SMALL CASE CROSSOVER DETECTION


   STAGE 2-1-3: COPY/NO-COPY CROSSOVER DETECTION
      done.


   STAGE 2-1-4: LEVEL 3 BLAS TUNE
      done.


   STAGE 2-1-5: GEMV TUNE
      gemvN : chose routine 3:ATL_gemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 90 percent of L1
              Performance = 848.61 (21.05 of copy matmul, 38.40 of clock)
      gemvT : chose routine 105:ATL_gemvT_2x16_1.c written by R. Clint Whaley
              Yunroll=2, Xunroll=16, using 90 percent of L1
              Performance = 830.40 (20.60 of copy matmul, 37.57 of clock)


   STAGE 2-1-6: GER TUNE
      ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using  0.84 percent of L1 Cache
              Performance = 618.42 (15.34 of copy matmul, 27.98 of clock)


STAGE 2-2: TUNING PREC='s' (precision 2 of 4)


   STAGE 2-2-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_smm14x1x84_sse.c, NB=84, written by R. Clint Whaley
      Performance: 7779.10MFLOPS (352.00 percent of of detected clock rate)
        (Gen case got 1961.99MFLOPS)
      mmNN   : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
               Performance = 2020.39 (25.97 of copy matmul, 91.42 of clock)
      mmNT   : ma=1, lat=7, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
               Performance = 1708.96 (21.97 of copy matmul, 77.33 of clock)
      mmTN   : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
               Performance = 1984.23 (25.51 of copy matmul, 89.78 of clock)
      mmTT   : ma=1, lat=7, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
               Performance = 1683.02 (21.64 of copy matmul, 76.15 of clock)



   STAGE 2-2-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes


   STAGE 2-2-3: LARGE/SMALL CASE CROSSOVER DETECTION


   STAGE 2-2-3: COPY/NO-COPY CROSSOVER DETECTION
      done.


   STAGE 2-2-4: LEVEL 3 BLAS TUNE
      done.


   STAGE 2-2-5: GEMV TUNE
      gemvN : chose routine 3:ATL_gemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 87 percent of L1
              Performance = 1208.92 (15.54 of copy matmul, 54.70 of clock)
      gemvT : chose routine 101:ATL_gemvT_mm.c written by R. Clint Whaley
              Yunroll=0, Xunroll=0, using 87 percent of L1
              Performance = 1258.14 (16.17 of copy matmul, 56.93 of clock)


   STAGE 2-2-6: GER TUNE
      ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using  0.97 percent of L1 Cache
              Performance = 1142.53 (14.69 of copy matmul, 51.70 of clock)


STAGE 2-3: TUNING PREC='z' (precision 3 of 4)


   STAGE 2-3-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_dmm4x1x90_x87.c, NB=60, written by R. Clint Whaley
      Performance: 3977.18MFLOPS (179.96 percent of of detected clock rate)
        (Gen case got 2140.87MFLOPS)
      mmNN   : ma=1, lat=3, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 2217.59 (55.76 of copy matmul, 100.34 of clock)
      mmNT   : ma=1, lat=4, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 2050.31 (51.55 of copy matmul, 92.77 of clock)
      mmTN   : ma=1, lat=8, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 2086.29 (52.46 of copy matmul, 94.40 of clock)
      mmTT   : ma=1, lat=6, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 1977.85 (49.73 of copy matmul, 89.50 of clock)



   STAGE 2-3-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes
      zdNKB set to 0 bytes


   STAGE 2-3-3: LARGE/SMALL CASE CROSSOVER DETECTION


   STAGE 2-3-3: COPY/NO-COPY CROSSOVER DETECTION
      done.


   STAGE 2-3-4: LEVEL 3 BLAS TUNE
      done.


   STAGE 2-3-5: GEMV TUNE
      gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 87 percent of L1
              Performance = 1710.40 (43.01 of copy matmul, 77.39 of clock)
      gemvT : chose routine 101:ATL_cgemvT_mm.c written by R. Clint Whaley
              Yunroll=0, Xunroll=0, using 87 percent of L1
              Performance = 1047.50 (26.34 of copy matmul, 47.40 of clock)


   STAGE 2-3-6: GER TUNE
      ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using  0.76 percent of L1 Cache
              Performance = 1256.75 (31.60 of copy matmul, 56.87 of clock)


STAGE 2-4: TUNING PREC='c' (precision 4 of 4)


   STAGE 2-4-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_smm14x1x84_sse.c, NB=84, written by R. Clint Whaley
      Performance: 7579.67MFLOPS (342.97 percent of of detected clock rate)
        (Gen case got 1886.85MFLOPS)
      mmNN   : ma=1, lat=2, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 2038.76 (26.90 of copy matmul, 92.25 of clock)
      mmNT   : ma=1, lat=5, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 1673.32 (22.08 of copy matmul, 75.72 of clock)
      mmTN   : ma=1, lat=5, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 2006.47 (26.47 of copy matmul, 90.79 of clock)
      mmTT   : ma=1, lat=2, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 1773.72 (23.40 of copy matmul, 80.26 of clock)



   STAGE 2-4-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes
      csNKB set to 0 bytes


   STAGE 2-4-3: LARGE/SMALL CASE CROSSOVER DETECTION


   STAGE 2-4-3: COPY/NO-COPY CROSSOVER DETECTION
      done.


   STAGE 2-4-4: LEVEL 3 BLAS TUNE
      done.


   STAGE 2-4-5: GEMV TUNE
      gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 87 percent of L1
              Performance = 3235.79 (42.69 of copy matmul, 146.42 of clock)
      gemvT : chose routine 101:ATL_cgemvT_mm.c written by R. Clint Whaley
              Yunroll=0, Xunroll=0, using 87 percent of L1
              Performance = 1215.66 (16.04 of copy matmul, 55.01 of clock)


   STAGE 2-4-6: GER TUNE
      ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using  0.75 percent of L1 Cache
              Performance = 2341.82 (30.90 of copy matmul, 105.96 of clock)


STAGE 3: GENERAL LIBRARY BUILD


STAGE 4: POST-BUILD TUNING
   done.


STAGE 4: Threading install

*******************************************************************************
*******************************************************************************
*******************************************************************************
*      FINISHED ATLAS3.7.30 INSTALL OF SECTION 0-0-0 ON 05/13/2007 AT 17:27   *
*******************************************************************************
*******************************************************************************
*******************************************************************************
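
The "percent of detected clock rate" figures in the log above can be inverted to recover the clock rate the installer detected. This is a small sketch using the four best-kernel lines from the 3.7.30 log; the helper name is invented, and the comparison with the CPU's nominal clock is an inference rather than something the log states.

```python
# (MFLOPS, percent of detected clock) for the best matmul kernel of each
# precision, copied from the 3.7.30 SUMMARY.LOG.
KERNEL_LINES = {
    "d": (4030.58, 182.38),
    "s": (7779.10, 352.00),
    "z": (3977.18, 179.96),
    "c": (7579.67, 342.97),
}


def detected_clock_mhz(mflops, percent_of_clock):
    """Invert 'percent of clock' to get the detected clock in MHz."""
    return mflops / (percent_of_clock / 100.0)


clocks = {
    prec: detected_clock_mhz(mflops, pct)
    for prec, (mflops, pct) in KERNEL_LINES.items()
}

for prec, mhz in clocks.items():
    print(f"{prec}: detected clock is about {mhz:.0f} MHz")
```

All four lines point to a detected clock of roughly 2.2 GHz, so a figure such as 352 percent for the single-precision SSE kernel means about 3.5 flops per cycle.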




[-- Attachment #3: SUMMARY.LOG --]
[-- Type: text/plain, Size: 8934 bytes --]


*******************************************************************************
*******************************************************************************
*******************************************************************************
*       BEGAN ATLAS3.7.31 INSTALL OF SECTION 0-0-0 ON 05/18/2007 AT 20:10     *
*******************************************************************************
*******************************************************************************
*******************************************************************************





IN STAGE 1 INSTALL:  SYSTEM PROBE/AUX COMPILE
   Level 1 cache size calculated as 64KB.

   dFPU: Combined muladd instruction with 5 cycle pipeline.
         Apparent number of registers : 32
         Register-register performance=1686.67MFLOPS
   sFPU: Combined muladd instruction with 5 cycle pipeline.
         Apparent number of registers : 32
         Register-register performance=1582.42MFLOPS


IN STAGE 2 INSTALL:  TYPE-DEPENDENT TUNING


STAGE 2-1: TUNING PREC='d' (precision 1 of 4)


   STAGE 2-1-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_dmm4x1x90_x87.c, NB=52, written by R. Clint Whaley
      Performance: 4019.86MFLOPS (181.89 percent of of detected clock rate)
        (Gen case got 2226.96MFLOPS)
      mmNN   : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
               Performance = 2191.38 (54.51 of copy matmul, 99.16 of clock)
      mmNT   : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
               Performance = 1964.52 (48.87 of copy matmul, 88.89 of clock)
      mmTN   : ma=1, lat=8, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
               Performance = 2071.76 (51.54 of copy matmul, 93.74 of clock)
      mmTT   : ma=1, lat=5, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
               Performance = 1810.81 (45.05 of copy matmul, 81.94 of clock)



   STAGE 2-1-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes


   STAGE 2-1-3: LARGE/SMALL CASE CROSSOVER DETECTION


   STAGE 2-1-3: COPY/NO-COPY CROSSOVER DETECTION
      done.


   STAGE 2-1-4: LEVEL 3 BLAS TUNE
      done.


   STAGE 2-1-5: GEMV TUNE
      gemvN : chose routine 3:ATL_gemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 90 percent of L1
              Performance = 843.81 (20.99 of copy matmul, 38.18 of clock)
      gemvT : chose routine 105:ATL_gemvT_2x16_1.c written by R. Clint Whaley
              Yunroll=2, Xunroll=16, using 90 percent of L1
              Performance = 824.29 (20.51 of copy matmul, 37.30 of clock)


   STAGE 2-1-6: GER TUNE
      ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using  0.84 percent of L1 Cache
              Performance = 629.63 (15.66 of copy matmul, 28.49 of clock)


STAGE 2-2: TUNING PREC='s' (precision 2 of 4)


   STAGE 2-2-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_smm14x1x84_sse.c, NB=84, written by R. Clint Whaley
      Performance: 7799.90MFLOPS (352.94 percent of of detected clock rate)
        (Gen case got 1937.24MFLOPS)
      mmNN   : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
               Performance = 2018.70 (25.88 of copy matmul, 91.34 of clock)
      mmNT   : ma=1, lat=7, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
               Performance = 1728.63 (22.16 of copy matmul, 78.22 of clock)
      mmTN   : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
               Performance = 1987.15 (25.48 of copy matmul, 89.92 of clock)
      mmTT   : ma=1, lat=7, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
               Performance = 1682.88 (21.58 of copy matmul, 76.15 of clock)



   STAGE 2-2-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes


   STAGE 2-2-3: LARGE/SMALL CASE CROSSOVER DETECTION


   STAGE 2-2-3: COPY/NO-COPY CROSSOVER DETECTION
      done.


   STAGE 2-2-4: LEVEL 3 BLAS TUNE
      done.


   STAGE 2-2-5: GEMV TUNE
      gemvN : chose routine 3:ATL_gemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 87 percent of L1
              Performance = 1207.78 (15.48 of copy matmul, 54.65 of clock)
      gemvT : chose routine 101:ATL_gemvT_mm.c written by R. Clint Whaley
              Yunroll=0, Xunroll=0, using 87 percent of L1
              Performance = 1244.38 (15.95 of copy matmul, 56.31 of clock)


   STAGE 2-2-6: GER TUNE
      ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using  0.97 percent of L1 Cache
              Performance = 1133.84 (14.54 of copy matmul, 51.30 of clock)


STAGE 2-3: TUNING PREC='z' (precision 3 of 4)


   STAGE 2-3-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_dmm4x1x90_x87.c, NB=60, written by R. Clint Whaley
      Performance: 4003.44MFLOPS (181.15 percent of of detected clock rate)
        (Gen case got 2136.97MFLOPS)
      mmNN   : ma=1, lat=3, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 2206.52 (55.12 of copy matmul, 99.84 of clock)
      mmNT   : ma=1, lat=4, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 2049.25 (51.19 of copy matmul, 92.73 of clock)
      mmTN   : ma=1, lat=8, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 2089.87 (52.20 of copy matmul, 94.56 of clock)
      mmTT   : ma=1, lat=6, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 1976.67 (49.37 of copy matmul, 89.44 of clock)



   STAGE 2-3-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes
      zdNKB set to 0 bytes


   STAGE 2-3-3: LARGE/SMALL CASE CROSSOVER DETECTION


   STAGE 2-3-3: COPY/NO-COPY CROSSOVER DETECTION
      done.


   STAGE 2-3-4: LEVEL 3 BLAS TUNE
      done.


   STAGE 2-3-5: GEMV TUNE
      gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 87 percent of L1
              Performance = 1710.41 (42.72 of copy matmul, 77.39 of clock)
      gemvT : chose routine 101:ATL_cgemvT_mm.c written by R. Clint Whaley
              Yunroll=0, Xunroll=0, using 87 percent of L1
              Performance = 1048.27 (26.18 of copy matmul, 47.43 of clock)


   STAGE 2-3-6: GER TUNE
      ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using  0.76 percent of L1 Cache
              Performance = 1255.67 (31.36 of copy matmul, 56.82 of clock)


STAGE 2-4: TUNING PREC='c' (precision 4 of 4)


   STAGE 2-4-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_smm14x1x84_sse.c, NB=84, written by R. Clint Whaley
      Performance: 7579.67MFLOPS (342.97 percent of of detected clock rate)
        (Gen case got 1882.85MFLOPS)
      mmNN   : ma=1, lat=2, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 2041.70 (26.94 of copy matmul, 92.38 of clock)
      mmNT   : ma=1, lat=5, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 1692.36 (22.33 of copy matmul, 76.58 of clock)
      mmTN   : ma=1, lat=5, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 2004.91 (26.45 of copy matmul, 90.72 of clock)
      mmTT   : ma=1, lat=2, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 1762.73 (23.26 of copy matmul, 79.76 of clock)



   STAGE 2-4-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes
      csNKB set to 0 bytes


   STAGE 2-4-3: LARGE/SMALL CASE CROSSOVER DETECTION


   STAGE 2-4-3: COPY/NO-COPY CROSSOVER DETECTION
      done.


   STAGE 2-4-4: LEVEL 3 BLAS TUNE
      done.


   STAGE 2-4-5: GEMV TUNE
      gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 87 percent of L1
              Performance = 3241.37 (42.76 of copy matmul, 146.67 of clock)
      gemvT : chose routine 101:ATL_cgemvT_mm.c written by R. Clint Whaley
              Yunroll=0, Xunroll=0, using 87 percent of L1
              Performance = 1212.23 (15.99 of copy matmul, 54.85 of clock)


   STAGE 2-4-6: GER TUNE
      ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using  0.75 percent of L1 Cache
              Performance = 2365.49 (31.21 of copy matmul, 107.04 of clock)


STAGE 3: GENERAL LIBRARY BUILD


STAGE 4: POST-BUILD TUNING
   done.


STAGE 4: Threading install

*******************************************************************************
*******************************************************************************
*******************************************************************************
*      FINISHED ATLAS3.7.31 INSTALL OF SECTION 0-0-0 ON 05/18/2007 AT 20:40   *
*******************************************************************************
*******************************************************************************
*******************************************************************************
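
The two logs can be compared mechanically instead of by eye. The sketch below parses only the best-matmul-kernel line ("Performance: ...MFLOPS") under each "TUNING PREC" heading; the excerpts are trimmed copies of the two attachments, and the function names are invented for illustration.

```python
import re

PREC_RE = re.compile(r"TUNING PREC='(\w)'")
# Kernel lines use "Performance:"; the per-case lines use "Performance =".
KERNEL_RE = re.compile(r"Performance:\s*([\d.]+)MFLOPS")

LOG_3_7_30 = """
STAGE 2-1: TUNING PREC='d' (precision 1 of 4)
      Performance: 4030.58MFLOPS (182.38 percent of of detected clock rate)
STAGE 2-2: TUNING PREC='s' (precision 2 of 4)
      Performance: 7779.10MFLOPS (352.00 percent of of detected clock rate)
STAGE 2-3: TUNING PREC='z' (precision 3 of 4)
      Performance: 3977.18MFLOPS (179.96 percent of of detected clock rate)
STAGE 2-4: TUNING PREC='c' (precision 4 of 4)
      Performance: 7579.67MFLOPS (342.97 percent of of detected clock rate)
"""

LOG_3_7_31 = """
STAGE 2-1: TUNING PREC='d' (precision 1 of 4)
      Performance: 4019.86MFLOPS (181.89 percent of of detected clock rate)
STAGE 2-2: TUNING PREC='s' (precision 2 of 4)
      Performance: 7799.90MFLOPS (352.94 percent of of detected clock rate)
STAGE 2-3: TUNING PREC='z' (precision 3 of 4)
      Performance: 4003.44MFLOPS (181.15 percent of of detected clock rate)
STAGE 2-4: TUNING PREC='c' (precision 4 of 4)
      Performance: 7579.67MFLOPS (342.97 percent of of detected clock rate)
"""


def kernel_mflops(log_text):
    """Map precision letter -> best matmul kernel rate in MFLOPS."""
    rates = {}
    prec = None
    for line in log_text.splitlines():
        heading = PREC_RE.search(line)
        if heading:
            prec = heading.group(1)
            continue
        kernel = KERNEL_RE.search(line)
        if kernel and prec is not None:
            rates[prec] = float(kernel.group(1))
    return rates


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


old_rates = kernel_mflops(LOG_3_7_30)
new_rates = kernel_mflops(LOG_3_7_31)
changes = {p: percent_change(old_rates[p], new_rates[p]) for p in old_rates}

for prec, change in changes.items():
    print(f"{prec}: {old_rates[prec]:.2f} -> {new_rates[prec]:.2f} MFLOPS ({change:+.2f}%)")
```

Every kernel rate moves by less than 1% between 3.7.30 and 3.7.31, which supports the reading that there is no significant difference on this machine. These are single-kernel tuning figures, so they say nothing about the large-problem threaded case that the ATL_MaxMalloc change is aimed at.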



