From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from lists.gentoo.org ([140.105.134.102] helo=robin.gentoo.org)
    by nuthatch.gentoo.org with esmtp (Exim 4.62)
    (envelope-from )
    id 1HpGiw-000547-29
    for garchives@archives.gentoo.org; Sat, 19 May 2007 04:39:23 +0000
Received: from robin.gentoo.org (localhost [127.0.0.1])
    by robin.gentoo.org (8.14.0/8.14.0) with SMTP id l4J4cbpG017580;
    Sat, 19 May 2007 04:38:37 GMT
Received: from sccrmhc12.comcast.net (sccrmhc12.comcast.net [204.127.200.82])
    by robin.gentoo.org (8.14.0/8.14.0) with ESMTP id l4J4cZnU017575
    for ; Sat, 19 May 2007 04:38:36 GMT
Received: from [71.236.188.93] (c-71-236-188-93.hsd1.or.comcast.net[71.236.188.93])
    by comcast.net (sccrmhc12) with ESMTP
    id <2007051904383301200fgq60e>; Sat, 19 May 2007 04:38:33 +0000
Message-ID: <464E7F48.3090701@cesmail.net>
Date: Fri, 18 May 2007 21:38:32 -0700
From: "M. Edward (Ed) Borasky"
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.2) Gecko/20070221 SeaMonkey/1.1.1
Precedence: bulk
List-Post:
List-Help:
List-Unsubscribe:
List-Subscribe:
List-Id: Gentoo Linux mail
X-BeenThere: gentoo-science@gentoo.org
Reply-to: gentoo-science@lists.gentoo.org
MIME-Version: 1.0
To: gentoo-science@lists.gentoo.org
Subject: Re: [gentoo-science] [Fwd: [atlas-devel] 3.7.31 and threading problem]
References: <464DF942.3070401@cesmail.net> <464E0520.1080007@cesmail.net>
In-Reply-To:
Content-Type: multipart/mixed; boundary="------------020706020005080504010703"
X-Archives-Salt: db32b5df-483c-46ae-aae8-04f011c17cdc
X-Archives-Hash: 35fc5c7a3059c3e5d8e23916dfe7d119

This is a multi-part message in MIME format.
--------------020706020005080504010703
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Markus Dittrich wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Fri, 18 May 2007, M. Edward (Ed) Borasky wrote:
>> Yeah ... I just got an Athlon64 X2 4200+ and the first thing I did
>> when I got the machine stabilized was build "blas-atlas" and
>> "lapack-atlas" (3.7.30). New versions of Atlas generally show up in
>> Portage or the science overlay within a day after Clint releases
>> them, so I don't usually test the upstream source.
>>
>> The best I got out of my machine was something like 7 GFLOPS on a
>> 32-bit test with 3.7.30, and there were some cases that looked like
>> they should have done better. So I definitely want to test this 160M
>> setting.
>> --
>
> Nice! Please let us know how your benchmarking goes.
> The new version should hit portage
> by tomorrow and I plan to go with 160M by default
> unless I notice something strange during testing.
>
> Best,
> Markus
>
> - ---
>
> Markus Dittrich (markusle)
> Gentoo Linux Developer
> Scientific applications
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.6 (GNU/Linux)
>
> iD8DBQFGTg2nxlRwCwb7k40RAjOYAJ4/Va9/W8UM/v7EqmOgXLlH120cfgCdEIpQ
> J4hpzeIHuRE+TgR8/Vo8rEU=
> =1RAJ
> -----END PGP SIGNATURE-----

I just installed 3.7.31. It looks like there are no significant
performance differences -- I've attached the two "SUMMARY.LOG" files
from 3.7.30 and 3.7.31.
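
To put a number on "no significant performance differences", the best-kernel figures in the two attached logs can be compared programmatically. What follows is only a minimal Python sketch, not part of ATLAS: the function names best_matmul_mflops and percent_change are hypothetical, and the regular expressions assume the line formats seen in the two SUMMARY.LOG files attached to this message (other ATLAS releases may format the log differently). The sample strings in the demo are short excerpts copied from those logs.

```python
import re

# Assumed line formats, taken from the SUMMARY.LOG files attached to this message.
VERSION_RE = re.compile(r"BEGAN ATLAS(\d+(?:\.\d+)+) INSTALL")
# Group 1: precision marker; group 2: MFLOPS of the best matmul kernel.
# "Performance:" (with a colon) only appears on the best-kernel line;
# the per-case lines use "Performance =" and are deliberately not matched.
TOKEN_RE = re.compile(r"TUNING PREC='([dszc])'|Performance:\s*([\d.]+)MFLOPS")


def best_matmul_mflops(log_text):
    """Return (version, {precision: best matmul kernel MFLOPS})."""
    version_match = VERSION_RE.search(log_text)
    version = version_match.group(1) if version_match else None
    results = {}
    current = None
    # Walk the markers in file order so each kernel figure is assigned
    # to the most recent precision seen.
    for match in TOKEN_RE.finditer(log_text):
        if match.group(1):
            current = match.group(1)
        elif current is not None and current not in results:
            results[current] = float(match.group(2))
    return version, results


def percent_change(old, new):
    """Relative change in percent for every precision present in both runs."""
    return {p: 100.0 * (new[p] - old[p]) / old[p] for p in old if p in new}


if __name__ == "__main__":
    # Excerpts from the two attached logs (double and single precision only).
    old_log = (
        "* BEGAN ATLAS3.7.30 INSTALL OF SECTION 0-0-0 ON 05/13/2007 AT 16:54 *\n"
        "STAGE 2-1: TUNING PREC='d' (precision 1 of 4)\n"
        "Performance: 4030.58MFLOPS (182.38 percent of of detected clock rate)\n"
        "STAGE 2-2: TUNING PREC='s' (precision 2 of 4)\n"
        "Performance: 7779.10MFLOPS (352.00 percent of of detected clock rate)\n"
    )
    new_log = (
        "* BEGAN ATLAS3.7.31 INSTALL OF SECTION 0-0-0 ON 05/18/2007 AT 20:10 *\n"
        "STAGE 2-1: TUNING PREC='d' (precision 1 of 4)\n"
        "Performance: 4019.86MFLOPS (181.89 percent of of detected clock rate)\n"
        "STAGE 2-2: TUNING PREC='s' (precision 2 of 4)\n"
        "Performance: 7799.90MFLOPS (352.94 percent of of detected clock rate)\n"
    )
    old_version, old_perf = best_matmul_mflops(old_log)
    new_version, new_perf = best_matmul_mflops(new_log)
    for prec, change in percent_change(old_perf, new_perf).items():
        print(f"{prec}: {old_version} -> {new_version}: {change:+.2f}%")
```

Because the patterns do not depend on line breaks, the same functions should also work on the full text of each log read from disk, for example with open(path).read().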
--------------020706020005080504010703
Content-Type: text/plain; name="SUMMARY.LOG"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline; filename="SUMMARY.LOG"

*******************************************************************************
*******************************************************************************
*******************************************************************************
* BEGAN ATLAS3.7.30 INSTALL OF SECTION 0-0-0 ON 05/13/2007 AT 16:54 *
*******************************************************************************
*******************************************************************************
*******************************************************************************

IN STAGE 1 INSTALL: SYSTEM PROBE/AUX COMPILE
   Level 1 cache size calculated as 64KB.
   dFPU: Combined muladd instruction with 5 cycle pipeline.
         Apparent number of registers : 32
         Register-register performance=1691.12MFLOPS
   sFPU: Combined muladd instruction with 5 cycle pipeline.
         Apparent number of registers : 32
         Register-register performance=1576.58MFLOPS

IN STAGE 2 INSTALL: TYPE-DEPENDENT TUNING

STAGE 2-1: TUNING PREC='d' (precision 1 of 4)

   STAGE 2-1-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_dmm4x1x90_x87.c, NB=52, written by R. Clint Whaley
      Performance: 4030.58MFLOPS (182.38 percent of of detected clock rate)
        (Gen case got 2222.47MFLOPS)
      mmNN : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
             Performance = 2182.04 (54.14 of copy matmul, 98.73 of clock)
      mmNT : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
             Performance = 1957.55 (48.57 of copy matmul, 88.58 of clock)
      mmTN : ma=1, lat=8, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
             Performance = 2063.71 (51.20 of copy matmul, 93.38 of clock)
      mmTT : ma=1, lat=5, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
             Performance = 1808.69 (44.87 of copy matmul, 81.84 of clock)
   STAGE 2-1-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes
   STAGE 2-1-3: LARGE/SMALL CASE CROSSOVER DETECTION
   STAGE 2-1-3: COPY/NO-COPY CROSSOVER DETECTION
      done.
   STAGE 2-1-4: LEVEL 3 BLAS TUNE
      done.
   STAGE 2-1-5: GEMV TUNE
      gemvN : chose routine 3:ATL_gemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 90 percent of L1
              Performance = 848.61 (21.05 of copy matmul, 38.40 of clock)
      gemvT : chose routine 105:ATL_gemvT_2x16_1.c written by R. Clint Whaley
              Yunroll=2, Xunroll=16, using 90 percent of L1
              Performance = 830.40 (20.60 of copy matmul, 37.57 of clock)
   STAGE 2-1-6: GER TUNE
      ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using 0.84 percent of L1 Cache
            Performance = 618.42 (15.34 of copy matmul, 27.98 of clock)

STAGE 2-2: TUNING PREC='s' (precision 2 of 4)

   STAGE 2-2-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_smm14x1x84_sse.c, NB=84, written by R. Clint Whaley
      Performance: 7779.10MFLOPS (352.00 percent of of detected clock rate)
        (Gen case got 1961.99MFLOPS)
      mmNN : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
             Performance = 2020.39 (25.97 of copy matmul, 91.42 of clock)
      mmNT : ma=1, lat=7, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
             Performance = 1708.96 (21.97 of copy matmul, 77.33 of clock)
      mmTN : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
             Performance = 1984.23 (25.51 of copy matmul, 89.78 of clock)
      mmTT : ma=1, lat=7, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
             Performance = 1683.02 (21.64 of copy matmul, 76.15 of clock)
   STAGE 2-2-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes
   STAGE 2-2-3: LARGE/SMALL CASE CROSSOVER DETECTION
   STAGE 2-2-3: COPY/NO-COPY CROSSOVER DETECTION
      done.
   STAGE 2-2-4: LEVEL 3 BLAS TUNE
      done.
   STAGE 2-2-5: GEMV TUNE
      gemvN : chose routine 3:ATL_gemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 87 percent of L1
              Performance = 1208.92 (15.54 of copy matmul, 54.70 of clock)
      gemvT : chose routine 101:ATL_gemvT_mm.c written by R. Clint Whaley
              Yunroll=0, Xunroll=0, using 87 percent of L1
              Performance = 1258.14 (16.17 of copy matmul, 56.93 of clock)
   STAGE 2-2-6: GER TUNE
      ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using 0.97 percent of L1 Cache
            Performance = 1142.53 (14.69 of copy matmul, 51.70 of clock)

STAGE 2-3: TUNING PREC='z' (precision 3 of 4)

   STAGE 2-3-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_dmm4x1x90_x87.c, NB=60, written by R. Clint Whaley
      Performance: 3977.18MFLOPS (179.96 percent of of detected clock rate)
        (Gen case got 2140.87MFLOPS)
      mmNN : ma=1, lat=3, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
             Performance = 2217.59 (55.76 of copy matmul, 100.34 of clock)
      mmNT : ma=1, lat=4, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
             Performance = 2050.31 (51.55 of copy matmul, 92.77 of clock)
      mmTN : ma=1, lat=8, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
             Performance = 2086.29 (52.46 of copy matmul, 94.40 of clock)
      mmTT : ma=1, lat=6, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
             Performance = 1977.85 (49.73 of copy matmul, 89.50 of clock)
   STAGE 2-3-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes
      zdNKB set to 0 bytes
   STAGE 2-3-3: LARGE/SMALL CASE CROSSOVER DETECTION
   STAGE 2-3-3: COPY/NO-COPY CROSSOVER DETECTION
      done.
   STAGE 2-3-4: LEVEL 3 BLAS TUNE
      done.
   STAGE 2-3-5: GEMV TUNE
      gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 87 percent of L1
              Performance = 1710.40 (43.01 of copy matmul, 77.39 of clock)
      gemvT : chose routine 101:ATL_cgemvT_mm.c written by R. Clint Whaley
              Yunroll=0, Xunroll=0, using 87 percent of L1
              Performance = 1047.50 (26.34 of copy matmul, 47.40 of clock)
   STAGE 2-3-6: GER TUNE
      ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using 0.76 percent of L1 Cache
            Performance = 1256.75 (31.60 of copy matmul, 56.87 of clock)

STAGE 2-4: TUNING PREC='c' (precision 4 of 4)

   STAGE 2-4-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_smm14x1x84_sse.c, NB=84, written by R. Clint Whaley
      Performance: 7579.67MFLOPS (342.97 percent of of detected clock rate)
        (Gen case got 1886.85MFLOPS)
      mmNN : ma=1, lat=2, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
             Performance = 2038.76 (26.90 of copy matmul, 92.25 of clock)
      mmNT : ma=1, lat=5, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
             Performance = 1673.32 (22.08 of copy matmul, 75.72 of clock)
      mmTN : ma=1, lat=5, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
             Performance = 2006.47 (26.47 of copy matmul, 90.79 of clock)
      mmTT : ma=1, lat=2, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
             Performance = 1773.72 (23.40 of copy matmul, 80.26 of clock)
   STAGE 2-4-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes
      csNKB set to 0 bytes
   STAGE 2-4-3: LARGE/SMALL CASE CROSSOVER DETECTION
   STAGE 2-4-3: COPY/NO-COPY CROSSOVER DETECTION
      done.
   STAGE 2-4-4: LEVEL 3 BLAS TUNE
      done.
   STAGE 2-4-5: GEMV TUNE
      gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 87 percent of L1
              Performance = 3235.79 (42.69 of copy matmul, 146.42 of clock)
      gemvT : chose routine 101:ATL_cgemvT_mm.c written by R. Clint Whaley
              Yunroll=0, Xunroll=0, using 87 percent of L1
              Performance = 1215.66 (16.04 of copy matmul, 55.01 of clock)
   STAGE 2-4-6: GER TUNE
      ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using 0.75 percent of L1 Cache
            Performance = 2341.82 (30.90 of copy matmul, 105.96 of clock)

STAGE 3: GENERAL LIBRARY BUILD
STAGE 4: POST-BUILD TUNING
   done.
   STAGE 4: Threading install

*******************************************************************************
*******************************************************************************
*******************************************************************************
* FINISHED ATLAS3.7.30 INSTALL OF SECTION 0-0-0 ON 05/13/2007 AT 17:27 *
*******************************************************************************
*******************************************************************************
*******************************************************************************

--------------020706020005080504010703
Content-Type: text/plain; name="SUMMARY.LOG"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline; filename="SUMMARY.LOG"

*******************************************************************************
*******************************************************************************
*******************************************************************************
* BEGAN ATLAS3.7.31 INSTALL OF SECTION 0-0-0 ON 05/18/2007 AT 20:10 *
*******************************************************************************
*******************************************************************************
*******************************************************************************

IN STAGE 1 INSTALL: SYSTEM PROBE/AUX COMPILE
   Level 1 cache size calculated as 64KB.
   dFPU: Combined muladd instruction with 5 cycle pipeline.
         Apparent number of registers : 32
         Register-register performance=1686.67MFLOPS
   sFPU: Combined muladd instruction with 5 cycle pipeline.
         Apparent number of registers : 32
         Register-register performance=1582.42MFLOPS

IN STAGE 2 INSTALL: TYPE-DEPENDENT TUNING

STAGE 2-1: TUNING PREC='d' (precision 1 of 4)

   STAGE 2-1-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_dmm4x1x90_x87.c, NB=52, written by R. Clint Whaley
      Performance: 4019.86MFLOPS (181.89 percent of of detected clock rate)
        (Gen case got 2226.96MFLOPS)
      mmNN : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
             Performance = 2191.38 (54.51 of copy matmul, 99.16 of clock)
      mmNT : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
             Performance = 1964.52 (48.87 of copy matmul, 88.89 of clock)
      mmTN : ma=1, lat=8, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
             Performance = 2071.76 (51.54 of copy matmul, 93.74 of clock)
      mmTT : ma=1, lat=5, nb=28, mu=4, nu=1 ku=28, ff=0, if=5, nf=1
             Performance = 1810.81 (45.05 of copy matmul, 81.94 of clock)
   STAGE 2-1-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes
   STAGE 2-1-3: LARGE/SMALL CASE CROSSOVER DETECTION
   STAGE 2-1-3: COPY/NO-COPY CROSSOVER DETECTION
      done.
   STAGE 2-1-4: LEVEL 3 BLAS TUNE
      done.
   STAGE 2-1-5: GEMV TUNE
      gemvN : chose routine 3:ATL_gemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 90 percent of L1
              Performance = 843.81 (20.99 of copy matmul, 38.18 of clock)
      gemvT : chose routine 105:ATL_gemvT_2x16_1.c written by R. Clint Whaley
              Yunroll=2, Xunroll=16, using 90 percent of L1
              Performance = 824.29 (20.51 of copy matmul, 37.30 of clock)
   STAGE 2-1-6: GER TUNE
      ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using 0.84 percent of L1 Cache
            Performance = 629.63 (15.66 of copy matmul, 28.49 of clock)

STAGE 2-2: TUNING PREC='s' (precision 2 of 4)

   STAGE 2-2-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_smm14x1x84_sse.c, NB=84, written by R. Clint Whaley
      Performance: 7799.90MFLOPS (352.94 percent of of detected clock rate)
        (Gen case got 1937.24MFLOPS)
      mmNN : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
             Performance = 2018.70 (25.88 of copy matmul, 91.34 of clock)
      mmNT : ma=1, lat=7, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
             Performance = 1728.63 (22.16 of copy matmul, 78.22 of clock)
      mmTN : ma=1, lat=4, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
             Performance = 1987.15 (25.48 of copy matmul, 89.92 of clock)
      mmTT : ma=1, lat=7, nb=28, mu=4, nu=1 ku=28, ff=0, if=6, nf=1
             Performance = 1682.88 (21.58 of copy matmul, 76.15 of clock)
   STAGE 2-2-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes
   STAGE 2-2-3: LARGE/SMALL CASE CROSSOVER DETECTION
   STAGE 2-2-3: COPY/NO-COPY CROSSOVER DETECTION
      done.
   STAGE 2-2-4: LEVEL 3 BLAS TUNE
      done.
   STAGE 2-2-5: GEMV TUNE
      gemvN : chose routine 3:ATL_gemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 87 percent of L1
              Performance = 1207.78 (15.48 of copy matmul, 54.65 of clock)
      gemvT : chose routine 101:ATL_gemvT_mm.c written by R. Clint Whaley
              Yunroll=0, Xunroll=0, using 87 percent of L1
              Performance = 1244.38 (15.95 of copy matmul, 56.31 of clock)
   STAGE 2-2-6: GER TUNE
      ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using 0.97 percent of L1 Cache
            Performance = 1133.84 (14.54 of copy matmul, 51.30 of clock)

STAGE 2-3: TUNING PREC='z' (precision 3 of 4)

   STAGE 2-3-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_dmm4x1x90_x87.c, NB=60, written by R. Clint Whaley
      Performance: 4003.44MFLOPS (181.15 percent of of detected clock rate)
        (Gen case got 2136.97MFLOPS)
      mmNN : ma=1, lat=3, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
             Performance = 2206.52 (55.12 of copy matmul, 99.84 of clock)
      mmNT : ma=1, lat=4, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
             Performance = 2049.25 (51.19 of copy matmul, 92.73 of clock)
      mmTN : ma=1, lat=8, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
             Performance = 2089.87 (52.20 of copy matmul, 94.56 of clock)
      mmTT : ma=1, lat=6, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
             Performance = 1976.67 (49.37 of copy matmul, 89.44 of clock)
   STAGE 2-3-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes
      zdNKB set to 0 bytes
   STAGE 2-3-3: LARGE/SMALL CASE CROSSOVER DETECTION
   STAGE 2-3-3: COPY/NO-COPY CROSSOVER DETECTION
      done.
   STAGE 2-3-4: LEVEL 3 BLAS TUNE
      done.
   STAGE 2-3-5: GEMV TUNE
      gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 87 percent of L1
              Performance = 1710.41 (42.72 of copy matmul, 77.39 of clock)
      gemvT : chose routine 101:ATL_cgemvT_mm.c written by R. Clint Whaley
              Yunroll=0, Xunroll=0, using 87 percent of L1
              Performance = 1048.27 (26.18 of copy matmul, 47.43 of clock)
   STAGE 2-3-6: GER TUNE
      ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using 0.76 percent of L1 Cache
            Performance = 1255.67 (31.36 of copy matmul, 56.82 of clock)

STAGE 2-4: TUNING PREC='c' (precision 4 of 4)

   STAGE 2-4-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_smm14x1x84_sse.c, NB=84, written by R. Clint Whaley
      Performance: 7579.67MFLOPS (342.97 percent of of detected clock rate)
        (Gen case got 1882.85MFLOPS)
      mmNN : ma=1, lat=2, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
             Performance = 2041.70 (26.94 of copy matmul, 92.38 of clock)
      mmNT : ma=1, lat=5, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
             Performance = 1692.36 (22.33 of copy matmul, 76.58 of clock)
      mmTN : ma=1, lat=5, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
             Performance = 2004.91 (26.45 of copy matmul, 90.72 of clock)
      mmTT : ma=1, lat=2, nb=24, mu=4, nu=1 ku=24, ff=0, if=5, nf=1
             Performance = 1762.73 (23.26 of copy matmul, 79.76 of clock)
   STAGE 2-4-2: CacheEdge DETECTION
      CacheEdge set to 2097152 bytes
      csNKB set to 0 bytes
   STAGE 2-4-3: LARGE/SMALL CASE CROSSOVER DETECTION
   STAGE 2-4-3: COPY/NO-COPY CROSSOVER DETECTION
      done.
   STAGE 2-4-4: LEVEL 3 BLAS TUNE
      done.
   STAGE 2-4-5: GEMV TUNE
      gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 87 percent of L1
              Performance = 3241.37 (42.76 of copy matmul, 146.67 of clock)
      gemvT : chose routine 101:ATL_cgemvT_mm.c written by R. Clint Whaley
              Yunroll=0, Xunroll=0, using 87 percent of L1
              Performance = 1212.23 (15.99 of copy matmul, 54.85 of clock)
   STAGE 2-4-6: GER TUNE
      ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using 0.75 percent of L1 Cache
            Performance = 2365.49 (31.21 of copy matmul, 107.04 of clock)

STAGE 3: GENERAL LIBRARY BUILD
STAGE 4: POST-BUILD TUNING
   done.
   STAGE 4: Threading install

*******************************************************************************
*******************************************************************************
*******************************************************************************
* FINISHED ATLAS3.7.31 INSTALL OF SECTION 0-0-0 ON 05/18/2007 AT 20:40 *
*******************************************************************************
*******************************************************************************
*******************************************************************************

--------------020706020005080504010703--
--
gentoo-science@gentoo.org mailing list