* [gentoo-science] [Fwd: [atlas-devel] Athlon64 X2 results]
From: M. Edward (Ed) Borasky @ 2007-05-18 21:41 UTC
To: gentoo-science
From: "M. Edward (Ed) Borasky" <znmeb@cesmail.net>
To: math-atlas-devel@lists.sourceforge.net
Subject: [atlas-devel] Athlon64 X2 results
Date: Sun, 13 May 2007 18:25:51 -0700
Message-ID: <4647BA9F.80400@cesmail.net>
I just got an Athlon64 X2 4200+ (and a motherboard/RAM/hard drive,
etc.). It's taken me a week or so to get it stabilized, but I've got
Gentoo loaded and just built "blas-atlas" and "lapack-atlas" on it.
Here's what I got in the SUMMARY.LOG. Compiler is GCC 4.1.2 and the
kernel is 2.6.21 (Gentoo). Questions:
1. Do the numbers look right for a dual-core 2210 MHz Athlon64?
2. Does this chip really have SSE3? The /proc/cpuinfo flags that Linux
provides show SSE and SSE2, but not SSE3.
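On question 2, one thing worth checking: Linux lists SSE3 in /proc/cpuinfo under the flag name "pni" (Prescott New Instructions), not "sse3", so a missing "sse3" entry does not by itself mean the chip lacks it. The following is a minimal sketch of that check, assuming the usual "flags : ..." line format; the function names cpu_flags and has_sse3 are made up for illustration.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_sse3(flags):
    # Linux reports SSE3 as "pni"; "sse3" is accepted too in case the
    # text being parsed comes from a tool that uses that spelling.
    return "pni" in flags or "sse3" in flags


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            flags = cpu_flags(f.read())
    except OSError:
        # Not on Linux, or /proc is unavailable.
        flags = set()
    print("SSE3 reported:", has_sse3(flags))
```

If "pni" shows up next to "sse" and "sse2", the kernel is reporting SSE3 support.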
*******************************************************************************
*******************************************************************************
*******************************************************************************
* BEGAN ATLAS3.7.30 INSTALL OF SECTION 0-0-0 ON 05/13/2007 AT 16:54 *
*******************************************************************************
*******************************************************************************
*******************************************************************************
IN STAGE 1 INSTALL: SYSTEM PROBE/AUX COMPILE
Level 1 cache size calculated as 64KB.
dFPU: Combined muladd instruction with 5 cycle pipeline.
Apparent number of registers : 32
Register-register performance=1691.12MFLOPS
sFPU: Combined muladd instruction with 5 cycle pipeline.
Apparent number of registers : 32
Register-register performance=1576.58MFLOPS
IN STAGE 2 INSTALL: TYPE-DEPENDENT TUNING
STAGE 2-1: TUNING PREC='d' (precision 1 of 4)
STAGE 2-1-1 : BUILDING BLOCK MATMUL TUNE
The best matmul kernel was ATL_dmm4x1x90_x87.c, NB=52, written by R. Clint Whaley
Performance: 4030.58MFLOPS (182.38 percent of of detected clock rate)
(Gen case got 2222.47MFLOPS)
mmNN : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
Performance = 2182.04 (54.14 of copy matmul, 98.73 of clock)
mmNT : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
Performance = 1957.55 (48.57 of copy matmul, 88.58 of clock)
mmTN : ma=1, lat=8, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
Performance = 2063.71 (51.20 of copy matmul, 93.38 of clock)
mmTT : ma=1, lat=5, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
Performance = 1808.69 (44.87 of copy matmul, 81.84 of clock)
STAGE 2-1-2: CacheEdge DETECTION
CacheEdge set to 2097152 bytes
STAGE 2-1-3: LARGE/SMALL CASE CROSSOVER DETECTION
STAGE 2-1-3: COPY/NO-COPY CROSSOVER DETECTION
done.
STAGE 2-1-4: LEVEL 3 BLAS TUNE
done.
STAGE 2-1-5: GEMV TUNE
gemvN : chose routine 3:ATL_gemvN_1x1_1a.c written by R. Clint Whaley
Yunroll=32, Xunroll=1, using 90 percent of L1
Performance = 848.61 (21.05 of copy matmul, 38.40 of clock)
gemvT : chose routine 105:ATL_gemvT_2x16_1.c written by R. Clint Whaley
Yunroll=2, Xunroll=16, using 90 percent of L1
Performance = 830.40 (20.60 of copy matmul, 37.57 of clock)
STAGE 2-1-6: GER TUNE
ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley
mu=16, nu=1, using 0.84 percent of L1 Cache
Performance = 618.42 (15.34 of copy matmul, 27.98 of clock)
STAGE 2-2: TUNING PREC='s' (precision 2 of 4)
STAGE 2-2-1 : BUILDING BLOCK MATMUL TUNE
The best matmul kernel was ATL_smm14x1x84_sse.c, NB=84, written by R. Clint Whaley
Performance: 7779.10MFLOPS (352.00 percent of of detected clock rate)
(Gen case got 1961.99MFLOPS)
mmNN : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
Performance = 2020.39 (25.97 of copy matmul, 91.42 of clock)
mmNT : ma=1, lat=7, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
Performance = 1708.96 (21.97 of copy matmul, 77.33 of clock)
mmTN : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
Performance = 1984.23 (25.51 of copy matmul, 89.78 of clock)
mmTT : ma=1, lat=7, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
Performance = 1683.02 (21.64 of copy matmul, 76.15 of clock)
STAGE 2-2-2: CacheEdge DETECTION
CacheEdge set to 2097152 bytes
STAGE 2-2-3: LARGE/SMALL CASE CROSSOVER DETECTION
STAGE 2-2-3: COPY/NO-COPY CROSSOVER DETECTION
done.
STAGE 2-2-4: LEVEL 3 BLAS TUNE
done.
STAGE 2-2-5: GEMV TUNE
gemvN : chose routine 3:ATL_gemvN_1x1_1a.c written by R. Clint Whaley
Yunroll=32, Xunroll=1, using 87 percent of L1
Performance = 1208.92 (15.54 of copy matmul, 54.70 of clock)
gemvT : chose routine 101:ATL_gemvT_mm.c written by R. Clint Whaley
Yunroll=0, Xunroll=0, using 87 percent of L1
Performance = 1258.14 (16.17 of copy matmul, 56.93 of clock)
STAGE 2-2-6: GER TUNE
ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley
mu=16, nu=1, using 0.97 percent of L1 Cache
Performance = 1142.53 (14.69 of copy matmul, 51.70 of clock)
STAGE 2-3: TUNING PREC='z' (precision 3 of 4)
STAGE 2-3-1 : BUILDING BLOCK MATMUL TUNE
The best matmul kernel was ATL_dmm4x1x90_x87.c, NB=60, written by R. Clint Whaley
Performance: 3977.18MFLOPS (179.96 percent of of detected clock rate)
(Gen case got 2140.87MFLOPS)
mmNN : ma=1, lat=3, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 2217.59 (55.76 of copy matmul, 100.34 of clock)
mmNT : ma=1, lat=4, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 2050.31 (51.55 of copy matmul, 92.77 of clock)
mmTN : ma=1, lat=8, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 2086.29 (52.46 of copy matmul, 94.40 of clock)
mmTT : ma=1, lat=6, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 1977.85 (49.73 of copy matmul, 89.50 of clock)
STAGE 2-3-2: CacheEdge DETECTION
CacheEdge set to 2097152 bytes
zdNKB set to 0 bytes
STAGE 2-3-3: LARGE/SMALL CASE CROSSOVER DETECTION
STAGE 2-3-3: COPY/NO-COPY CROSSOVER DETECTION
done.
STAGE 2-3-4: LEVEL 3 BLAS TUNE
done.
STAGE 2-3-5: GEMV TUNE
gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley
Yunroll=32, Xunroll=1, using 87 percent of L1
Performance = 1710.40 (43.01 of copy matmul, 77.39 of clock)
gemvT : chose routine 101:ATL_cgemvT_mm.c written by R. Clint Whaley
Yunroll=0, Xunroll=0, using 87 percent of L1
Performance = 1047.50 (26.34 of copy matmul, 47.40 of clock)
STAGE 2-3-6: GER TUNE
ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley
mu=16, nu=1, using 0.76 percent of L1 Cache
Performance = 1256.75 (31.60 of copy matmul, 56.87 of clock)
STAGE 2-4: TUNING PREC='c' (precision 4 of 4)
STAGE 2-4-1 : BUILDING BLOCK MATMUL TUNE
The best matmul kernel was ATL_smm14x1x84_sse.c, NB=84, written by R. Clint Whaley
Performance: 7579.67MFLOPS (342.97 percent of of detected clock rate)
(Gen case got 1886.85MFLOPS)
mmNN : ma=1, lat=2, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 2038.76 (26.90 of copy matmul, 92.25 of clock)
mmNT : ma=1, lat=5, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 1673.32 (22.08 of copy matmul, 75.72 of clock)
mmTN : ma=1, lat=5, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 2006.47 (26.47 of copy matmul, 90.79 of clock)
mmTT : ma=1, lat=2, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 1773.72 (23.40 of copy matmul, 80.26 of clock)
STAGE 2-4-2: CacheEdge DETECTION
CacheEdge set to 2097152 bytes
csNKB set to 0 bytes
STAGE 2-4-3: LARGE/SMALL CASE CROSSOVER DETECTION
STAGE 2-4-3: COPY/NO-COPY CROSSOVER DETECTION
done.
STAGE 2-4-4: LEVEL 3 BLAS TUNE
done.
STAGE 2-4-5: GEMV TUNE
gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley
Yunroll=32, Xunroll=1, using 87 percent of L1
Performance = 3235.79 (42.69 of copy matmul, 146.42 of clock)
gemvT : chose routine 101:ATL_cgemvT_mm.c written by R. Clint Whaley
Yunroll=0, Xunroll=0, using 87 percent of L1
Performance = 1215.66 (16.04 of copy matmul, 55.01 of clock)
STAGE 2-4-6: GER TUNE
ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley
mu=16, nu=1, using 0.75 percent of L1 Cache
Performance = 2341.82 (30.90 of copy matmul, 105.96 of clock)
STAGE 3: GENERAL LIBRARY BUILD
STAGE 4: POST-BUILD TUNING
done.
STAGE 4: Threading install
*******************************************************************************
*******************************************************************************
*******************************************************************************
* FINISHED ATLAS3.7.30 INSTALL OF SECTION 0-0-0 ON 05/13/2007 AT 17:27 *
*******************************************************************************
*******************************************************************************
*******************************************************************************
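For reading the numbers above: the "percent of clock" figures in the log work out to MFLOPS divided by the detected clock in MHz, times 100. The sketch below recovers the implied clock and the flops per cycle for the four best matmul kernels, using values copied from the log. The helper names are hypothetical, and the formula is inferred from the log's own figures rather than taken from the ATLAS sources.

```python
def implied_clock_mhz(mflops, percent_of_clock):
    """Clock rate (MHz) implied by an MFLOPS figure and its 'percent of clock'."""
    return mflops / (percent_of_clock / 100.0)


def flops_per_cycle(mflops, clock_mhz):
    """Average floating-point operations completed per clock cycle."""
    return mflops / clock_mhz


# (precision, MFLOPS, percent of detected clock rate), copied from the log above.
KERNELS = [
    ("d", 4030.58, 182.38),
    ("s", 7779.10, 352.00),
    ("z", 3977.18, 179.96),
    ("c", 7579.67, 342.97),
]

if __name__ == "__main__":
    for prec, mflops, pct in KERNELS:
        clock = implied_clock_mhz(mflops, pct)
        rate = flops_per_cycle(mflops, clock)
        print(f"{prec}: implied clock {clock:.0f} MHz, {rate:.2f} flops/cycle")
```

All four kernels imply the same clock of about 2210 MHz, so the clock detection at least agrees with the 2210 MHz figure in question 1.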