* Re: [gentoo-science] [Fwd: [atlas-devel] 3.7.31 and threading problem]
From: M. Edward (Ed) Borasky @ 2007-05-19 4:38 UTC
To: gentoo-science
[-- Attachment #1: Type: text/plain, Size: 1368 bytes --]
Markus Dittrich wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Fri, 18 May 2007, M. Edward (Ed) Borasky wrote:
>> Yeah ... I just got an Athlon64 X2 4200+ and the first thing I did
>> when I got the machine stabilized was build "blas-atlas" and
>> "lapack-atlas" (3.7.30). New versions of Atlas generally show up in
>> Portage or the science overlay within a day after Clint releases
>> them, so I don't usually test the upstream source.
>>
>> The best I got out of my machine was something like 7 GFLOPS on a
>> 32-bit test with 3.7.30, and there were some cases that looked like
>> they should have done better. So I definitely want to test this 160M
>> setting.
>> --
>
> Nice! Please let us know how your benchmarking goes.
> The new version should hit portage
> by tomorrow and I plan to go with 160M by default
> unless I notice something strange during testing.
>
> Best,
> Markus
>
> - ---
>
> Markus Dittrich (markusle)
> Gentoo Linux Developer
> Scientific applications
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.6 (GNU/Linux)
>
> iD8DBQFGTg2nxlRwCwb7k40RAjOYAJ4/Va9/W8UM/v7EqmOgXLlH120cfgCdEIpQ
> J4hpzeIHuRE+TgR8/Vo8rEU=
> =1RAJ
> -----END PGP SIGNATURE-----
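I just installed 3.7.31. It looks like there are no significant
performance differences: the best matmul kernels are within about 1%
of each other (e.g. 4030.58 vs. 4019.86 MFLOPS for double precision).
I've attached the two "SUMMARY.LOG" files from 3.7.30 and 3.7.31.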
[-- Attachment #2: SUMMARY.LOG --]
[-- Type: text/plain, Size: 8935 bytes --]
*******************************************************************************
*******************************************************************************
*******************************************************************************
* BEGAN ATLAS3.7.30 INSTALL OF SECTION 0-0-0 ON 05/13/2007 AT 16:54 *
*******************************************************************************
*******************************************************************************
*******************************************************************************
IN STAGE 1 INSTALL: SYSTEM PROBE/AUX COMPILE
Level 1 cache size calculated as 64KB.
dFPU: Combined muladd instruction with 5 cycle pipeline.
Apparent number of registers : 32
Register-register performance=1691.12MFLOPS
sFPU: Combined muladd instruction with 5 cycle pipeline.
Apparent number of registers : 32
Register-register performance=1576.58MFLOPS
IN STAGE 2 INSTALL: TYPE-DEPENDENT TUNING
STAGE 2-1: TUNING PREC='d' (precision 1 of 4)
STAGE 2-1-1 : BUILDING BLOCK MATMUL TUNE
The best matmul kernel was ATL_dmm4x1x90_x87.c, NB=52, written by R. Clint Whaley
Performance: 4030.58MFLOPS (182.38 percent of of detected clock rate)
(Gen case got 2222.47MFLOPS)
mmNN : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
Performance = 2182.04 (54.14 of copy matmul, 98.73 of clock)
mmNT : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
Performance = 1957.55 (48.57 of copy matmul, 88.58 of clock)
mmTN : ma=1, lat=8, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
Performance = 2063.71 (51.20 of copy matmul, 93.38 of clock)
mmTT : ma=1, lat=5, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
Performance = 1808.69 (44.87 of copy matmul, 81.84 of clock)
STAGE 2-1-2: CacheEdge DETECTION
CacheEdge set to 2097152 bytes
STAGE 2-1-3: LARGE/SMALL CASE CROSSOVER DETECTION
STAGE 2-1-3: COPY/NO-COPY CROSSOVER DETECTION
done.
STAGE 2-1-4: LEVEL 3 BLAS TUNE
done.
STAGE 2-1-5: GEMV TUNE
gemvN : chose routine 3:ATL_gemvN_1x1_1a.c written by R. Clint Whaley
Yunroll=32, Xunroll=1, using 90 percent of L1
Performance = 848.61 (21.05 of copy matmul, 38.40 of clock)
gemvT : chose routine 105:ATL_gemvT_2x16_1.c written by R. Clint Whaley
Yunroll=2, Xunroll=16, using 90 percent of L1
Performance = 830.40 (20.60 of copy matmul, 37.57 of clock)
STAGE 2-1-6: GER TUNE
ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley
mu=16, nu=1, using 0.84 percent of L1 Cache
Performance = 618.42 (15.34 of copy matmul, 27.98 of clock)
STAGE 2-2: TUNING PREC='s' (precision 2 of 4)
STAGE 2-2-1 : BUILDING BLOCK MATMUL TUNE
The best matmul kernel was ATL_smm14x1x84_sse.c, NB=84, written by R. Clint Whaley
Performance: 7779.10MFLOPS (352.00 percent of of detected clock rate)
(Gen case got 1961.99MFLOPS)
mmNN : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
Performance = 2020.39 (25.97 of copy matmul, 91.42 of clock)
mmNT : ma=1, lat=7, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
Performance = 1708.96 (21.97 of copy matmul, 77.33 of clock)
mmTN : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
Performance = 1984.23 (25.51 of copy matmul, 89.78 of clock)
mmTT : ma=1, lat=7, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
Performance = 1683.02 (21.64 of copy matmul, 76.15 of clock)
STAGE 2-2-2: CacheEdge DETECTION
CacheEdge set to 2097152 bytes
STAGE 2-2-3: LARGE/SMALL CASE CROSSOVER DETECTION
STAGE 2-2-3: COPY/NO-COPY CROSSOVER DETECTION
done.
STAGE 2-2-4: LEVEL 3 BLAS TUNE
done.
STAGE 2-2-5: GEMV TUNE
gemvN : chose routine 3:ATL_gemvN_1x1_1a.c written by R. Clint Whaley
Yunroll=32, Xunroll=1, using 87 percent of L1
Performance = 1208.92 (15.54 of copy matmul, 54.70 of clock)
gemvT : chose routine 101:ATL_gemvT_mm.c written by R. Clint Whaley
Yunroll=0, Xunroll=0, using 87 percent of L1
Performance = 1258.14 (16.17 of copy matmul, 56.93 of clock)
STAGE 2-2-6: GER TUNE
ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley
mu=16, nu=1, using 0.97 percent of L1 Cache
Performance = 1142.53 (14.69 of copy matmul, 51.70 of clock)
STAGE 2-3: TUNING PREC='z' (precision 3 of 4)
STAGE 2-3-1 : BUILDING BLOCK MATMUL TUNE
The best matmul kernel was ATL_dmm4x1x90_x87.c, NB=60, written by R. Clint Whaley
Performance: 3977.18MFLOPS (179.96 percent of of detected clock rate)
(Gen case got 2140.87MFLOPS)
mmNN : ma=1, lat=3, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 2217.59 (55.76 of copy matmul, 100.34 of clock)
mmNT : ma=1, lat=4, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 2050.31 (51.55 of copy matmul, 92.77 of clock)
mmTN : ma=1, lat=8, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 2086.29 (52.46 of copy matmul, 94.40 of clock)
mmTT : ma=1, lat=6, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 1977.85 (49.73 of copy matmul, 89.50 of clock)
STAGE 2-3-2: CacheEdge DETECTION
CacheEdge set to 2097152 bytes
zdNKB set to 0 bytes
STAGE 2-3-3: LARGE/SMALL CASE CROSSOVER DETECTION
STAGE 2-3-3: COPY/NO-COPY CROSSOVER DETECTION
done.
STAGE 2-3-4: LEVEL 3 BLAS TUNE
done.
STAGE 2-3-5: GEMV TUNE
gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley
Yunroll=32, Xunroll=1, using 87 percent of L1
Performance = 1710.40 (43.01 of copy matmul, 77.39 of clock)
gemvT : chose routine 101:ATL_cgemvT_mm.c written by R. Clint Whaley
Yunroll=0, Xunroll=0, using 87 percent of L1
Performance = 1047.50 (26.34 of copy matmul, 47.40 of clock)
STAGE 2-3-6: GER TUNE
ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley
mu=16, nu=1, using 0.76 percent of L1 Cache
Performance = 1256.75 (31.60 of copy matmul, 56.87 of clock)
STAGE 2-4: TUNING PREC='c' (precision 4 of 4)
STAGE 2-4-1 : BUILDING BLOCK MATMUL TUNE
The best matmul kernel was ATL_smm14x1x84_sse.c, NB=84, written by R. Clint Whaley
Performance: 7579.67MFLOPS (342.97 percent of of detected clock rate)
(Gen case got 1886.85MFLOPS)
mmNN : ma=1, lat=2, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 2038.76 (26.90 of copy matmul, 92.25 of clock)
mmNT : ma=1, lat=5, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 1673.32 (22.08 of copy matmul, 75.72 of clock)
mmTN : ma=1, lat=5, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 2006.47 (26.47 of copy matmul, 90.79 of clock)
mmTT : ma=1, lat=2, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 1773.72 (23.40 of copy matmul, 80.26 of clock)
STAGE 2-4-2: CacheEdge DETECTION
CacheEdge set to 2097152 bytes
csNKB set to 0 bytes
STAGE 2-4-3: LARGE/SMALL CASE CROSSOVER DETECTION
STAGE 2-4-3: COPY/NO-COPY CROSSOVER DETECTION
done.
STAGE 2-4-4: LEVEL 3 BLAS TUNE
done.
STAGE 2-4-5: GEMV TUNE
gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley
Yunroll=32, Xunroll=1, using 87 percent of L1
Performance = 3235.79 (42.69 of copy matmul, 146.42 of clock)
gemvT : chose routine 101:ATL_cgemvT_mm.c written by R. Clint Whaley
Yunroll=0, Xunroll=0, using 87 percent of L1
Performance = 1215.66 (16.04 of copy matmul, 55.01 of clock)
STAGE 2-4-6: GER TUNE
ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley
mu=16, nu=1, using 0.75 percent of L1 Cache
Performance = 2341.82 (30.90 of copy matmul, 105.96 of clock)
STAGE 3: GENERAL LIBRARY BUILD
STAGE 4: POST-BUILD TUNING
done.
STAGE 4: Threading install
*******************************************************************************
*******************************************************************************
*******************************************************************************
* FINISHED ATLAS3.7.30 INSTALL OF SECTION 0-0-0 ON 05/13/2007 AT 17:27 *
*******************************************************************************
*******************************************************************************
*******************************************************************************
[-- Attachment #3: SUMMARY.LOG --]
[-- Type: text/plain, Size: 8934 bytes --]
*******************************************************************************
*******************************************************************************
*******************************************************************************
* BEGAN ATLAS3.7.31 INSTALL OF SECTION 0-0-0 ON 05/18/2007 AT 20:10 *
*******************************************************************************
*******************************************************************************
*******************************************************************************
IN STAGE 1 INSTALL: SYSTEM PROBE/AUX COMPILE
Level 1 cache size calculated as 64KB.
dFPU: Combined muladd instruction with 5 cycle pipeline.
Apparent number of registers : 32
Register-register performance=1686.67MFLOPS
sFPU: Combined muladd instruction with 5 cycle pipeline.
Apparent number of registers : 32
Register-register performance=1582.42MFLOPS
IN STAGE 2 INSTALL: TYPE-DEPENDENT TUNING
STAGE 2-1: TUNING PREC='d' (precision 1 of 4)
STAGE 2-1-1 : BUILDING BLOCK MATMUL TUNE
The best matmul kernel was ATL_dmm4x1x90_x87.c, NB=52, written by R. Clint Whaley
Performance: 4019.86MFLOPS (181.89 percent of of detected clock rate)
(Gen case got 2226.96MFLOPS)
mmNN : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
Performance = 2191.38 (54.51 of copy matmul, 99.16 of clock)
mmNT : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
Performance = 1964.52 (48.87 of copy matmul, 88.89 of clock)
mmTN : ma=1, lat=8, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
Performance = 2071.76 (51.54 of copy matmul, 93.74 of clock)
mmTT : ma=1, lat=5, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
Performance = 1810.81 (45.05 of copy matmul, 81.94 of clock)
STAGE 2-1-2: CacheEdge DETECTION
CacheEdge set to 2097152 bytes
STAGE 2-1-3: LARGE/SMALL CASE CROSSOVER DETECTION
STAGE 2-1-3: COPY/NO-COPY CROSSOVER DETECTION
done.
STAGE 2-1-4: LEVEL 3 BLAS TUNE
done.
STAGE 2-1-5: GEMV TUNE
gemvN : chose routine 3:ATL_gemvN_1x1_1a.c written by R. Clint Whaley
Yunroll=32, Xunroll=1, using 90 percent of L1
Performance = 843.81 (20.99 of copy matmul, 38.18 of clock)
gemvT : chose routine 105:ATL_gemvT_2x16_1.c written by R. Clint Whaley
Yunroll=2, Xunroll=16, using 90 percent of L1
Performance = 824.29 (20.51 of copy matmul, 37.30 of clock)
STAGE 2-1-6: GER TUNE
ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley
mu=16, nu=1, using 0.84 percent of L1 Cache
Performance = 629.63 (15.66 of copy matmul, 28.49 of clock)
STAGE 2-2: TUNING PREC='s' (precision 2 of 4)
STAGE 2-2-1 : BUILDING BLOCK MATMUL TUNE
The best matmul kernel was ATL_smm14x1x84_sse.c, NB=84, written by R. Clint Whaley
Performance: 7799.90MFLOPS (352.94 percent of of detected clock rate)
(Gen case got 1937.24MFLOPS)
mmNN : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
Performance = 2018.70 (25.88 of copy matmul, 91.34 of clock)
mmNT : ma=1, lat=7, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
Performance = 1728.63 (22.16 of copy matmul, 78.22 of clock)
mmTN : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
Performance = 1987.15 (25.48 of copy matmul, 89.92 of clock)
mmTT : ma=1, lat=7, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
Performance = 1682.88 (21.58 of copy matmul, 76.15 of clock)
STAGE 2-2-2: CacheEdge DETECTION
CacheEdge set to 2097152 bytes
STAGE 2-2-3: LARGE/SMALL CASE CROSSOVER DETECTION
STAGE 2-2-3: COPY/NO-COPY CROSSOVER DETECTION
done.
STAGE 2-2-4: LEVEL 3 BLAS TUNE
done.
STAGE 2-2-5: GEMV TUNE
gemvN : chose routine 3:ATL_gemvN_1x1_1a.c written by R. Clint Whaley
Yunroll=32, Xunroll=1, using 87 percent of L1
Performance = 1207.78 (15.48 of copy matmul, 54.65 of clock)
gemvT : chose routine 101:ATL_gemvT_mm.c written by R. Clint Whaley
Yunroll=0, Xunroll=0, using 87 percent of L1
Performance = 1244.38 (15.95 of copy matmul, 56.31 of clock)
STAGE 2-2-6: GER TUNE
ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley
mu=16, nu=1, using 0.97 percent of L1 Cache
Performance = 1133.84 (14.54 of copy matmul, 51.30 of clock)
STAGE 2-3: TUNING PREC='z' (precision 3 of 4)
STAGE 2-3-1 : BUILDING BLOCK MATMUL TUNE
The best matmul kernel was ATL_dmm4x1x90_x87.c, NB=60, written by R. Clint Whaley
Performance: 4003.44MFLOPS (181.15 percent of of detected clock rate)
(Gen case got 2136.97MFLOPS)
mmNN : ma=1, lat=3, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 2206.52 (55.12 of copy matmul, 99.84 of clock)
mmNT : ma=1, lat=4, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 2049.25 (51.19 of copy matmul, 92.73 of clock)
mmTN : ma=1, lat=8, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 2089.87 (52.20 of copy matmul, 94.56 of clock)
mmTT : ma=1, lat=6, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 1976.67 (49.37 of copy matmul, 89.44 of clock)
STAGE 2-3-2: CacheEdge DETECTION
CacheEdge set to 2097152 bytes
zdNKB set to 0 bytes
STAGE 2-3-3: LARGE/SMALL CASE CROSSOVER DETECTION
STAGE 2-3-3: COPY/NO-COPY CROSSOVER DETECTION
done.
STAGE 2-3-4: LEVEL 3 BLAS TUNE
done.
STAGE 2-3-5: GEMV TUNE
gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley
Yunroll=32, Xunroll=1, using 87 percent of L1
Performance = 1710.41 (42.72 of copy matmul, 77.39 of clock)
gemvT : chose routine 101:ATL_cgemvT_mm.c written by R. Clint Whaley
Yunroll=0, Xunroll=0, using 87 percent of L1
Performance = 1048.27 (26.18 of copy matmul, 47.43 of clock)
STAGE 2-3-6: GER TUNE
ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley
mu=16, nu=1, using 0.76 percent of L1 Cache
Performance = 1255.67 (31.36 of copy matmul, 56.82 of clock)
STAGE 2-4: TUNING PREC='c' (precision 4 of 4)
STAGE 2-4-1 : BUILDING BLOCK MATMUL TUNE
The best matmul kernel was ATL_smm14x1x84_sse.c, NB=84, written by R. Clint Whaley
Performance: 7579.67MFLOPS (342.97 percent of of detected clock rate)
(Gen case got 1882.85MFLOPS)
mmNN : ma=1, lat=2, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 2041.70 (26.94 of copy matmul, 92.38 of clock)
mmNT : ma=1, lat=5, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 1692.36 (22.33 of copy matmul, 76.58 of clock)
mmTN : ma=1, lat=5, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 2004.91 (26.45 of copy matmul, 90.72 of clock)
mmTT : ma=1, lat=2, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
Performance = 1762.73 (23.26 of copy matmul, 79.76 of clock)
STAGE 2-4-2: CacheEdge DETECTION
CacheEdge set to 2097152 bytes
csNKB set to 0 bytes
STAGE 2-4-3: LARGE/SMALL CASE CROSSOVER DETECTION
STAGE 2-4-3: COPY/NO-COPY CROSSOVER DETECTION
done.
STAGE 2-4-4: LEVEL 3 BLAS TUNE
done.
STAGE 2-4-5: GEMV TUNE
gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley
Yunroll=32, Xunroll=1, using 87 percent of L1
Performance = 3241.37 (42.76 of copy matmul, 146.67 of clock)
gemvT : chose routine 101:ATL_cgemvT_mm.c written by R. Clint Whaley
Yunroll=0, Xunroll=0, using 87 percent of L1
Performance = 1212.23 (15.99 of copy matmul, 54.85 of clock)
STAGE 2-4-6: GER TUNE
ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley
mu=16, nu=1, using 0.75 percent of L1 Cache
Performance = 2365.49 (31.21 of copy matmul, 107.04 of clock)
STAGE 3: GENERAL LIBRARY BUILD
STAGE 4: POST-BUILD TUNING
done.
STAGE 4: Threading install
*******************************************************************************
*******************************************************************************
*******************************************************************************
* FINISHED ATLAS3.7.31 INSTALL OF SECTION 0-0-0 ON 05/18/2007 AT 20:40 *
*******************************************************************************
*******************************************************************************
*******************************************************************************
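For anyone who wants to check the comparison between the two logs above without reading them side by side, here is a minimal Python sketch. It pulls the headline "Performance: ...MFLOPS" line for each precision out of a SUMMARY.LOG and reports the percent change between versions. The helper names (kernel_mflops, percent_change) are my own and not part of ATLAS, and the embedded strings are just the relevant lines excerpted from the two attachments; in practice you would read the full files from wherever your build left them.

```python
import re

# Headline kernel lines use "Performance:" (with a colon); the per-variant
# lines such as "Performance = 2182.04 (...)" use "=" and are not matched.
KERNEL_RE = re.compile(r"Performance:\s*([0-9.]+)MFLOPS")
PREC_RE = re.compile(r"TUNING PREC='(\w)'")


def kernel_mflops(log_text):
    """Return {precision letter: best matmul kernel MFLOPS} for one log."""
    result = {}
    prec = None
    for line in log_text.splitlines():
        m = PREC_RE.search(line)
        if m:
            prec = m.group(1)  # remember which precision section we are in
            continue
        m = KERNEL_RE.search(line)
        if m and prec is not None:
            result[prec] = float(m.group(1))
    return result


def percent_change(old, new):
    """Percent change from old to new for precisions present in both."""
    return {p: 100.0 * (new[p] - old[p]) / old[p] for p in old if p in new}


# Lines excerpted from the 3.7.30 attachment.
LOG_3_7_30 = """\
STAGE 2-1: TUNING PREC='d' (precision 1 of 4)
Performance: 4030.58MFLOPS (182.38 percent of of detected clock rate)
STAGE 2-2: TUNING PREC='s' (precision 2 of 4)
Performance: 7779.10MFLOPS (352.00 percent of of detected clock rate)
STAGE 2-3: TUNING PREC='z' (precision 3 of 4)
Performance: 3977.18MFLOPS (179.96 percent of of detected clock rate)
STAGE 2-4: TUNING PREC='c' (precision 4 of 4)
Performance: 7579.67MFLOPS (342.97 percent of of detected clock rate)
"""

# Lines excerpted from the 3.7.31 attachment.
LOG_3_7_31 = """\
STAGE 2-1: TUNING PREC='d' (precision 1 of 4)
Performance: 4019.86MFLOPS (181.89 percent of of detected clock rate)
STAGE 2-2: TUNING PREC='s' (precision 2 of 4)
Performance: 7799.90MFLOPS (352.94 percent of of detected clock rate)
STAGE 2-3: TUNING PREC='z' (precision 3 of 4)
Performance: 4003.44MFLOPS (181.15 percent of of detected clock rate)
STAGE 2-4: TUNING PREC='c' (precision 4 of 4)
Performance: 7579.67MFLOPS (342.97 percent of of detected clock rate)
"""

old = kernel_mflops(LOG_3_7_30)
new = kernel_mflops(LOG_3_7_31)
changes = percent_change(old, new)

for prec, delta in changes.items():
    print(f"{prec}: {old[prec]:8.2f} -> {new[prec]:8.2f} MFLOPS ({delta:+.2f}%)")
```

On these numbers every precision changes by less than one percent in either direction, which is well within what I would expect from run-to-run timing noise during the install.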