public inbox for gentoo-science@lists.gentoo.org
* [gentoo-science] [Fwd: [atlas-devel] Athlon64 X2 results]
@ 2007-05-18 21:41 M. Edward (Ed) Borasky
  0 siblings, 0 replies; only message in thread
From: M. Edward (Ed) Borasky @ 2007-05-18 21:41 UTC
  To: gentoo-science

[-- Attachment #2: [atlas-devel] Athlon64 X2 results.eml --]
[-- Type: message/rfc822, Size: 12886 bytes --]

From: "M. Edward (Ed) Borasky" <znmeb@cesmail.net>
To: math-atlas-devel@lists.sourceforge.net
Subject: [atlas-devel] Athlon64 X2 results
Date: Sun, 13 May 2007 18:25:51 -0700
Message-ID: <4647BA9F.80400@cesmail.net>

I just got an Athlon64 X2 4200+ (and a motherboard/RAM/hard drive, 
etc.). It's taken me a week or so to get it stabilized, but I've got 
Gentoo loaded and just built "blas-atlas" and "lapack-atlas" on it. 
Here's what I got in the SUMMARY.LOG. Compiler is GCC 4.1.2 and the 
kernel is 2.4.21 (Gentoo). Questions:

1. Do the numbers look right for a dual-core 2210 MHz Athlon64?
2. Does this chip really have SSE3? The /proc/cpuinfo flags that Linux 
provides show SSE and SSE2, but not SSE3.
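
On question 2, one thing worth checking: Linux does not list SSE3 under the name "sse3" in /proc/cpuinfo. It reports it as "pni" (Prescott New Instructions). Below is a minimal sketch for checking the flags line; the function names cpu_flags and has_sse3 are made up for illustration, and the parsing assumes the usual "flags : ..." layout of /proc/cpuinfo on x86 Linux.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_sse3(cpuinfo_text):
    """True if the flags include SSE3, which Linux reports as 'pni'."""
    # "sse3" is accepted too, as a defensive assumption. Set membership is
    # used so that "ssse3" (a different extension) does not count as a match.
    return bool(cpu_flags(cpuinfo_text) & {"pni", "sse3"})


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("SSE3 reported:", has_sse3(f.read()))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```

If "pni" is missing from the flags line, the kernel is not reporting SSE3 for that CPU.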


*******************************************************************************
*******************************************************************************
*******************************************************************************
*       BEGAN ATLAS3.7.30 INSTALL OF SECTION 0-0-0 ON 05/13/2007 AT 16:54     *
*******************************************************************************
*******************************************************************************
*******************************************************************************





IN STAGE 1 INSTALL:  SYSTEM PROBE/AUX COMPILE
   Level 1 cache size calculated as 64KB.

   dFPU: Combined muladd instruction with 5 cycle pipeline.
         Apparent number of registers : 32
         Register-register performance=1691.12MFLOPS
   sFPU: Combined muladd instruction with 5 cycle pipeline.
         Apparent number of registers : 32
         Register-register performance=1576.58MFLOPS


IN STAGE 2 INSTALL:  TYPE-DEPENDENT TUNING


STAGE 2-1: TUNING PREC='d' (precision 1 of 4)


   STAGE 2-1-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_dmm4x1x90_x87.c, NB=52, written by R. Clint Whaley
      Performance: 4030.58MFLOPS (182.38 percent of of detected clock rate)
        (Gen case got 2222.47MFLOPS)
      mmNN   : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
               Performance = 2182.04 (54.14 of copy matmul, 98.73 of clock)
      mmNT   : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
               Performance = 1957.55 (48.57 of copy matmul, 88.58 of clock)
      mmTN   : ma=1, lat=8, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
               Performance = 2063.71 (51.20 of copy matmul, 93.38 of clock)
      mmTT   : ma=1, lat=5, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
               Performance = 1808.69 (44.87 of copy matmul, 81.84 of clock)



   STAGE 2-1-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes


   STAGE 2-1-3: LARGE/SMALL CASE CROSSOVER DETECTION


   STAGE 2-1-3: COPY/NO-COPY CROSSOVER DETECTION
      done.


   STAGE 2-1-4: LEVEL 3 BLAS TUNE
      done.


   STAGE 2-1-5: GEMV TUNE
      gemvN : chose routine 3:ATL_gemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 90 percent of L1
              Performance = 848.61 (21.05 of copy matmul, 38.40 of clock)
      gemvT : chose routine 105:ATL_gemvT_2x16_1.c written by R. Clint Whaley
              Yunroll=2, Xunroll=16, using 90 percent of L1
              Performance = 830.40 (20.60 of copy matmul, 37.57 of clock)


   STAGE 2-1-6: GER TUNE
      ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using  0.84 percent of L1 Cache
              Performance = 618.42 (15.34 of copy matmul, 27.98 of clock)


STAGE 2-2: TUNING PREC='s' (precision 2 of 4)


   STAGE 2-2-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_smm14x1x84_sse.c, NB=84, written by R. Clint Whaley
      Performance: 7779.10MFLOPS (352.00 percent of of detected clock rate)
        (Gen case got 1961.99MFLOPS)
      mmNN   : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
               Performance = 2020.39 (25.97 of copy matmul, 91.42 of clock)
      mmNT   : ma=1, lat=7, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
               Performance = 1708.96 (21.97 of copy matmul, 77.33 of clock)
      mmTN   : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
               Performance = 1984.23 (25.51 of copy matmul, 89.78 of clock)
      mmTT   : ma=1, lat=7, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
               Performance = 1683.02 (21.64 of copy matmul, 76.15 of clock)



   STAGE 2-2-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes


   STAGE 2-2-3: LARGE/SMALL CASE CROSSOVER DETECTION


   STAGE 2-2-3: COPY/NO-COPY CROSSOVER DETECTION
      done.


   STAGE 2-2-4: LEVEL 3 BLAS TUNE
      done.


   STAGE 2-2-5: GEMV TUNE
      gemvN : chose routine 3:ATL_gemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 87 percent of L1
              Performance = 1208.92 (15.54 of copy matmul, 54.70 of clock)
      gemvT : chose routine 101:ATL_gemvT_mm.c written by R. Clint Whaley
              Yunroll=0, Xunroll=0, using 87 percent of L1
              Performance = 1258.14 (16.17 of copy matmul, 56.93 of clock)


   STAGE 2-2-6: GER TUNE
      ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using  0.97 percent of L1 Cache
              Performance = 1142.53 (14.69 of copy matmul, 51.70 of clock)


STAGE 2-3: TUNING PREC='z' (precision 3 of 4)


   STAGE 2-3-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_dmm4x1x90_x87.c, NB=60, written by R. Clint Whaley
      Performance: 3977.18MFLOPS (179.96 percent of of detected clock rate)
        (Gen case got 2140.87MFLOPS)
      mmNN   : ma=1, lat=3, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 2217.59 (55.76 of copy matmul, 100.34 of clock)
      mmNT   : ma=1, lat=4, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 2050.31 (51.55 of copy matmul, 92.77 of clock)
      mmTN   : ma=1, lat=8, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 2086.29 (52.46 of copy matmul, 94.40 of clock)
      mmTT   : ma=1, lat=6, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 1977.85 (49.73 of copy matmul, 89.50 of clock)



   STAGE 2-3-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes
      zdNKB set to 0 bytes


   STAGE 2-3-3: LARGE/SMALL CASE CROSSOVER DETECTION


   STAGE 2-3-3: COPY/NO-COPY CROSSOVER DETECTION
      done.


   STAGE 2-3-4: LEVEL 3 BLAS TUNE
      done.


   STAGE 2-3-5: GEMV TUNE
      gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 87 percent of L1
              Performance = 1710.40 (43.01 of copy matmul, 77.39 of clock)
      gemvT : chose routine 101:ATL_cgemvT_mm.c written by R. Clint Whaley
              Yunroll=0, Xunroll=0, using 87 percent of L1
              Performance = 1047.50 (26.34 of copy matmul, 47.40 of clock)


   STAGE 2-3-6: GER TUNE
      ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using  0.76 percent of L1 Cache
              Performance = 1256.75 (31.60 of copy matmul, 56.87 of clock)


STAGE 2-4: TUNING PREC='c' (precision 4 of 4)


   STAGE 2-4-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_smm14x1x84_sse.c, NB=84, written by R. Clint Whaley
      Performance: 7579.67MFLOPS (342.97 percent of of detected clock rate)
        (Gen case got 1886.85MFLOPS)
      mmNN   : ma=1, lat=2, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 2038.76 (26.90 of copy matmul, 92.25 of clock)
      mmNT   : ma=1, lat=5, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 1673.32 (22.08 of copy matmul, 75.72 of clock)
      mmTN   : ma=1, lat=5, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 2006.47 (26.47 of copy matmul, 90.79 of clock)
      mmTT   : ma=1, lat=2, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
               Performance = 1773.72 (23.40 of copy matmul, 80.26 of clock)



   STAGE 2-4-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes
      csNKB set to 0 bytes


   STAGE 2-4-3: LARGE/SMALL CASE CROSSOVER DETECTION


   STAGE 2-4-3: COPY/NO-COPY CROSSOVER DETECTION
      done.


   STAGE 2-4-4: LEVEL 3 BLAS TUNE
      done.


   STAGE 2-4-5: GEMV TUNE
      gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 87 percent of L1
              Performance = 3235.79 (42.69 of copy matmul, 146.42 of clock)
      gemvT : chose routine 101:ATL_cgemvT_mm.c written by R. Clint Whaley
              Yunroll=0, Xunroll=0, using 87 percent of L1
              Performance = 1215.66 (16.04 of copy matmul, 55.01 of clock)


   STAGE 2-4-6: GER TUNE
      ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using  0.75 percent of L1 Cache
              Performance = 2341.82 (30.90 of copy matmul, 105.96 of clock)


STAGE 3: GENERAL LIBRARY BUILD


STAGE 4: POST-BUILD TUNING
   done.


STAGE 4: Threading install

*******************************************************************************
*******************************************************************************
*******************************************************************************
*      FINISHED ATLAS3.7.30 INSTALL OF SECTION 0-0-0 ON 05/13/2007 AT 17:27   *
*******************************************************************************
*******************************************************************************
*******************************************************************************
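
As a rough sanity check for question 1, the "percent of detected clock rate" figures can be inverted to recover the clock rate ATLAS detected, which should agree across precisions. This is only a sketch: the regular expression assumes the "Performance: ...MFLOPS (... percent" layout seen in the log above, and the function name implied_clock_mhz is hypothetical.

```python
import re

# Matches the kernel summary lines ("Performance: ..."); the per-variant
# lines ("Performance = ...") are deliberately left out.
PERF_RE = re.compile(r"Performance:\s*([\d.]+)MFLOPS\s*\(([\d.]+) percent")


def implied_clock_mhz(log_text):
    """Return (mflops, percent, implied clock in MHz) for each summary line."""
    results = []
    for m in PERF_RE.finditer(log_text):
        mflops, percent = float(m.group(1)), float(m.group(2))
        # percent = 100 * mflops / clock, so clock = mflops / (percent / 100)
        results.append((mflops, percent, mflops / (percent / 100.0)))
    return results


# Two summary lines copied from the log above (double and single precision).
SAMPLE = """\
      Performance: 4030.58MFLOPS (182.38 percent of of detected clock rate)
      Performance: 7779.10MFLOPS (352.00 percent of of detected clock rate)
"""

for mflops, percent, clock in implied_clock_mhz(SAMPLE):
    print(f"{mflops:8.2f} MFLOPS at {percent:6.2f}% -> "
          f"{clock:7.1f} MHz, {percent / 100:.2f} flops/cycle")
```

Both lines should come out within about 1 MHz of the 2210 MHz figure mentioned in the question, so the percentages are at least internally consistent with that clock.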








