public inbox for gentoo-soc@lists.gentoo.org
* [gentoo-soc] Week 5 Report for Refining ROCm Packages in Gentoo
@ 2022-07-18 15:07 wuyy
From: wuyy @ 2022-07-18 15:07 UTC
  To: gentoo-soc

Hi all,

Week 5 was mainly about using and testing the rocm.eclass I wrote:
packaging and testing the ROCm-5.1.3 libraries. I also began to land the
ROCm-5.1.3 toolchain in ::gentoo. However, new problems emerged and I am
a bit behind schedule, so after discussing it with my mentor I decided
to lower the priority of packaging tensorflow and jax with ROCm.

On https://github.com/littlewu2508/gentoo/tree/rocm-5.1.3 you can find
my latest progress: sci-libs/roc{BLAS,FFT,PRIM,SPARSE,Thrust}-5.1.3
using rocm.eclass. I have written amdgpu_targets.desc and added it to
profiles/desc, so each amdgpu_targets_ USE_EXPAND flag has a description
(the name and codename of the architecture, as well as the graphics
cards it includes). I shall post a screenshot of `equery uses rocBLAS`
to my blog.

It turned out that rocm.eclass simplifies those ebuilds, especially
src_test. I spent some time testing the libraries on a Radeon VII and a
Radeon RX 6700XT. Running the tests revealed a critical bug in
rocFFT-5.1.3 [1], which upstream has confirmed. Users should be
cautious: until the bug is fixed, amdgpu_targets_gfx906 should be masked
for rocFFT-5.1.3. On the 6700XT, several rocSPARSE tests failed, which
upstream has explained [2]. rocBLAS passes its tests on the Radeon VII,
but running the entire test suite caused an amdgpu kernel module failure
for an unknown reason. The load may be too high: after a reboot, the
test suite that had failed ran normally on its own, and only the full
run brought the GPU down. All other packages passed all tests on both
cards.
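
A minimal sketch of what such a mask could look like on a local system.
The helper name mask_use_flag is hypothetical, and the target file
(/etc/portage/profile/package.use.mask, the user-side location) is an
assumption; a mask in ::gentoo itself would go into the repository's
profiles instead.

```shell
# mask_use_flag: append a "<atom> <flag>" line to a package.use.mask file.
# Hypothetical helper for illustration only.
#   $1: package atom, $2: USE flag to mask, $3: path of the mask file
mask_use_flag() {
    printf '%s %s\n' "$1" "$2" >> "$3"
}

# Example call (run as root; path and atom are assumptions):
#   mask_use_flag "=sci-libs/rocFFT-5.1.3" "amdgpu_targets_gfx906" \
#       /etc/portage/profile/package.use.mask
```

The resulting line masks the flag only for that exact version, so later
rocFFT releases are not affected once the bug is fixed.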

Meanwhile, I am also working on dev-libs/rccl and
dev-libs/rocm-opencl-runtime. dev-libs/rccl, like sci-libs/roc-*, can
use rocm.eclass and works well; however, there are build failures caused
by calling `chrpath -r` on a library without an rpath (rocm.eclass sets
-DSKIP_RPATH=ON). I shall make it work in the coming week. For
rocm-opencl-runtime, I managed to turn on USE=test, but there are test
failures on the 6700XT that need further investigation. Also, some of
the tests in rocm-opencl-runtime need a DISPLAY. I tried virtualx.eclass,
as ionen suggested in the #gentoo-soc IRC channel, but it did not work
in my docker environment. virtualx does not work in Gentoo Prefix
either.
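
One possible way to avoid the `chrpath -r` failure is to check for an
rpath before rewriting it. This is only a sketch: has_rpath_tag is a
hypothetical helper that is not part of rocm.eclass, and how the rccl
build actually invokes chrpath is not shown here. It assumes the usual
`readelf -d` output, where such entries appear as (RPATH) or (RUNPATH).

```shell
# has_rpath_tag: succeed if the `readelf -d` output read from stdin
# contains an RPATH or RUNPATH entry. Hypothetical helper.
has_rpath_tag() {
    grep -Eq '\((RPATH|RUNPATH)\)'
}

# Usage sketch (lib and new_rpath are placeholders):
#   if readelf -d "${lib}" | has_rpath_tag; then
#       chrpath -r "${new_rpath}" "${lib}"
#   fi
```

Reading the dynamic section from stdin keeps the check independent of
where the library comes from.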

I came across another bug when compiling rccl-5.1.3 for gfx10xx [3].
After consulting the Gentoo LLVM maintainer, I opened an issue on
llvm-project to ask for acknowledgement of backporting a patch to
llvm-14 that fixes this problem [4].

As I prepared to land the ROCm-5.1.3 toolchain in ::gentoo via this PR
[5], I noticed another problem. hip and rocm-comgr have a hard-coded
clang include path in their sources, so any clang upgrade (even a minor
one such as 14.0.5 -> 14.0.6) causes runtime problems. I have consulted
mgorny about this. He suggested that I try hacking into the clang Driver
to see whether the include path can be extracted at runtime using the
C++ API. I'll try this in the coming week; if that fails, adding a
subslot to clang may be plan B. Once this is fixed, I think hip-5.1.3
and rocm-comgr-5.1.3 will be ready to land in ::gentoo.
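
To illustrate the idea of querying the path instead of hard-coding it,
here is a command-line sketch. It is not the C++ Driver approach
described above; it relies on clang's -print-resource-dir option and on
the builtin headers living in the include/ subdirectory of the resource
directory. The function name clang_include_dir is hypothetical.

```shell
# clang_include_dir: print the builtin include directory of a clang
# executable. $1 is the compiler command (defaults to "clang").
clang_include_dir() {
    local cc="${1:-clang}" resource_dir
    # Ask the compiler itself, so the answer follows version upgrades.
    resource_dir="$("${cc}" -print-resource-dir)" || return 1
    printf '%s/include\n' "${resource_dir}"
}

# Example call:
#   clang_include_dir clang
```

Because the path is computed when the function runs, an upgrade of
clang changes the result without rebuilding whatever calls it.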

Due to limited time I made little progress on rocm.eclass itself. I
have begun reading how PYTHON_USEDEP works in the python eclasses, in
preparation for ROCM_USEDEP. I plan to implement it in the coming week,
completing the last piece of rocm.eclass.
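
A rough sketch of what a ROCM_USEDEP-style string might look like,
modelled loosely on the comma-separated "flag(-)?" form that
PYTHON_USEDEP uses. Nothing here is the final design: the function name
rocm_usedep is hypothetical, and the target names passed in are example
values only.

```shell
# rocm_usedep: build a USE-dependency string from a list of GPU targets.
# Hypothetical helper, not an existing eclass function.
rocm_usedep() {
    local dep="" target
    for target in "$@"; do
        # Join entries with commas; "(-)?" means "match my own flag
        # state, treating the flag as disabled if the dependency lacks it".
        dep="${dep:+${dep},}amdgpu_targets_${target}(-)?"
    done
    printf '%s\n' "${dep}"
}

# Example call:
#   rocm_usedep gfx906 gfx1031
```

An ebuild could then append such a string in brackets to a dependency
atom, so that a library and its ROCm dependencies are built for the
same set of GPU targets.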

And here is the brief plan of future work for the following weeks,
after lowering the priority of tensorflow and jax:

week 6: finish rocm.eclass, send it for review; continue packaging ROCm libs;
week 7: modify rocm.eclass according to review comments; continue packaging ROCm libs, including rocWMMA;
week 8: finalize rocm.eclass; start working on cupy;
week 9: cupy ebuild; start writing the wiki;
week 10: land cupy in ::gentoo; bump dev-util/rocprofiler to 5.1.3;
week 11: continue writing the wiki; consider ROCgdb;
week 12: finish the wiki; summarize my GSoC.


[1] https://github.com/ROCmSoftwarePlatform/rocFFT/issues/369
[2] https://github.com/ROCmSoftwarePlatform/rocSPARSE/issues/258
[3] https://bugs.gentoo.org/851702#c15
[4] https://github.com/llvm/llvm-project/issues/56577
[5] https://github.com/gentoo/gentoo/pull/26441

Best regards,
-- 
Yiyang Wu

