
Add primitive for warp load balancing #1773


Open
wants to merge 3 commits into base: develop


upsj (Member) commented Jan 29, 2025

This adds a primitive that distributes variable-sized chunks of work across the lanes of a warp for better memory coalescing and warp utilization. It can be used as a component in operations like SpGEMM or symbolic Cholesky factorization and, combined with a segmented reduction, it could also implement the inner loop of an SpMV similar to MergePath. Looking for feedback and performance benchmark suggestions :)
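The idea can be illustrated with a host-side C++ sketch that emulates a single warp. It is not the implementation from this PR: it assumes a warp width of 32 and a common load-balancing scheme (an exclusive prefix sum over the per-lane chunk sizes, then a binary search that maps each work item back to its chunk), and `balance_warp`, `work_item` and `warp_assignment` are hypothetical names. On a GPU, the prefix sum would typically be a warp-level scan, and each round would be one loop iteration in which lane `i` handles work item `base + i`, so neighboring lanes touch neighboring items of the same chunk.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <utility>
#include <vector>

// Assumption: NVIDIA warp width; AMD wavefronts may have 64 lanes.
constexpr int warp_size = 32;

// (chunk index, index within that chunk); {-1, -1} marks an idle lane.
using work_item = std::pair<int, int>;
using warp_assignment = std::array<work_item, warp_size>;

// Hypothetical host-side emulation of warp load balancing: lane `lane` owns a
// chunk of chunk_sizes[lane] work items, and the items of all chunks are
// redistributed so that every lane processes one item per round.
std::vector<warp_assignment> balance_warp(
    const std::array<int, warp_size>& chunk_sizes)
{
    // Exclusive prefix sum of the chunk sizes (a warp scan on the device).
    std::array<int, warp_size + 1> offsets{};
    for (int lane = 0; lane < warp_size; ++lane) {
        assert(chunk_sizes[lane] >= 0);
        offsets[lane + 1] = offsets[lane] + chunk_sizes[lane];
    }
    const int total = offsets[warp_size];
    std::vector<warp_assignment> rounds;
    for (int base = 0; base < total; base += warp_size) {
        warp_assignment assignment;
        for (int lane = 0; lane < warp_size; ++lane) {
            const int item = base + lane;
            if (item >= total) {
                assignment[lane] = {-1, -1};  // no work left for this lane
                continue;
            }
            // Last chunk whose start offset is <= item; empty chunks share
            // their start offset with the next chunk and are skipped.
            const auto it =
                std::upper_bound(offsets.begin(), offsets.end(), item);
            const int chunk = static_cast<int>(it - offsets.begin()) - 1;
            assignment[lane] = {chunk, item - offsets[chunk]};
        }
        rounds.push_back(assignment);
    }
    return rounds;
}
```

With chunk sizes {3, 0, 40, 0, ...}, the 43 work items fit into two rounds: in the first, lanes 0 to 2 take the first chunk and lanes 3 to 31 start on the third one, while the empty second chunk is skipped by the binary search. The second round finishes the remaining 11 items of the third chunk and leaves lanes 11 to 31 idle.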

upsj requested review from yhmtsai and a team January 29, 2025 22:36
upsj self-assigned this Jan 29, 2025
upsj marked this pull request as ready for review January 29, 2025 22:36
ginkgo-bot added the labels reg:build (related to the build system), reg:testing (related to testing), mod:core (related to the core module), mod:cuda (related to the CUDA module) and mod:hip (related to the HIP module) Jan 29, 2025
upsj (Member, Author) commented Jan 29, 2025

If the performance is good, this could also be used as an internal component for merge path SpMV.
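A similarly hedged sketch of the SpMV use case: one emulated warp processes a block of up to 32 CSR rows, where each row's nonzeros form its chunk. Because CSR stores consecutive rows contiguously, the row pointers already are the prefix sum of the chunk sizes, so in each round lane `i` reads nonzero `base + i` (a coalesced access), and a sequential loop over the lanes stands in for the segmented reduction that would combine the partial products of each row on the device. This is a simplified warp-level variant for illustration only, not Ginkgo's merge path kernel; `warp_csr_spmv` and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Assumption: NVIDIA warp width; AMD wavefronts may have 64 lanes.
constexpr int warp_size = 32;

// Hypothetical host-side emulation of one warp computing y += A * x for the
// CSR rows [first_row, first_row + warp_size), clipped to the matrix size.
// Precondition: 0 <= first_row <= number of rows.
void warp_csr_spmv(const std::vector<int>& row_ptrs,
                   const std::vector<int>& col_idxs,
                   const std::vector<double>& values,
                   const std::vector<double>& x, std::vector<double>& y,
                   int first_row)
{
    const int num_rows = static_cast<int>(row_ptrs.size()) - 1;
    const int rows = std::min(warp_size, num_rows - first_row);
    // row_ptrs is already the prefix sum of the row lengths (chunk sizes).
    const auto row_begin = row_ptrs.begin() + first_row;
    const auto row_end = row_begin + rows + 1;
    const int nz_begin = row_ptrs[first_row];
    const int nz_end = row_ptrs[first_row + rows];
    for (int base = nz_begin; base < nz_end; base += warp_size) {
        std::vector<int> lane_row(warp_size, -1);  // -1 marks an idle lane
        std::vector<double> lane_prod(warp_size, 0.0);
        for (int lane = 0; lane < warp_size && base + lane < nz_end; ++lane) {
            const int nz = base + lane;  // neighboring lanes, neighboring nonzeros
            // Row containing nz; empty rows are skipped by the search.
            const auto it = std::upper_bound(row_begin, row_end, nz);
            lane_row[lane] = static_cast<int>(it - row_ptrs.begin()) - 1;
            lane_prod[lane] = values[nz] * x[col_idxs[nz]];
        }
        // Stand-in for a segmented reduction: lanes working on the same row
        // are contiguous, and the last lane of each segment writes the sum.
        double partial = 0.0;
        for (int lane = 0; lane < warp_size && lane_row[lane] >= 0; ++lane) {
            partial += lane_prod[lane];
            const bool segment_end = lane + 1 == warp_size ||
                                     lane_row[lane + 1] != lane_row[lane];
            if (segment_end) {
                y[lane_row[lane]] += partial;
                partial = 0.0;
            }
        }
    }
}
```

A caller would zero `y` and invoke the function for `first_row` = 0, 32, 64, and so on. Rows with more than 32 nonzeros simply span several rounds, which is the situation where spreading nonzeros across lanes is expected to help compared to assigning one row per lane.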

ginkgo-bot (Member)

Error: The following files need to be formatted:

CMakeLists.txt
benchmark/CMakeLists.txt
benchmark/blas/blas_common.hpp
benchmark/conversion/conversion.cpp
benchmark/matrix_statistics/matrix_statistics.cpp
benchmark/preconditioner/preconditioner.cpp
benchmark/solver/distributed/solver.cpp
benchmark/solver/solver_common.hpp
benchmark/sparse_blas/sparse_blas.cpp
benchmark/spmv/spmv_common.hpp
benchmark/test/CMakeLists.txt
benchmark/utils/general.hpp
benchmark/utils/generator.hpp
benchmark/utils/stencil_matrix.hpp
cmake/CTestScript.cmake
cmake/Modules/CudaArchitectureSelector.cmake
cmake/Modules/FindHWLOC.cmake
cmake/Modules/FindMETIS.cmake
cmake/Modules/FindNUMA.cmake
cmake/Modules/FindNVTX.cmake
cmake/Modules/FindPAPI.cmake
cmake/Modules/FindROCTX.cmake
cmake/Modules/FindVTune.cmake
cmake/Modules/hwloc_helpers.cmake
cmake/autodetect_executors.cmake
cmake/autodetect_system_libs.cmake
cmake/build_helpers.cmake
cmake/build_type_helpers.cmake
cmake/compiler_features.cmake
cmake/create_test.cmake
cmake/cuda.cmake
cmake/get_info.cmake
cmake/hip.cmake
cmake/hip_helpers.cmake
cmake/information_helpers.cmake
cmake/install_helpers.cmake
cmake/rename.cmake
cmake/sycl.cmake
cmake/template_instantiation.cmake
common/cuda_hip/CMakeLists.txt
common/cuda_hip/base/device_matrix_data_kernels.cpp
common/cuda_hip/distributed/assembly_kernels.cpp
common/cuda_hip/distributed/index_map_kernels.cpp
common/cuda_hip/distributed/matrix_kernels.cpp
common/cuda_hip/distributed/partition_kernels.cpp
common/cuda_hip/distributed/vector_kernels.cpp
common/cuda_hip/matrix/ell_kernels.cpp
common/cuda_hip/matrix/sellp_kernels.cpp
common/cuda_hip/matrix/sparsity_csr_kernels.cpp
common/cuda_hip/multigrid/pgm_kernels.cpp
common/unified/CMakeLists.txt
common/unified/distributed/assembly_kernels.cpp
common/unified/matrix/coo_kernels.cpp
common/unified/matrix/csr_kernels.cpp
common/unified/matrix/dense_kernels.template.cpp
common/unified/preconditioner/jacobi_kernels.cpp
core/CMakeLists.txt
core/base/device_matrix_data.cpp
core/base/device_matrix_data_kernels.hpp
core/base/segmented_array.hpp
core/config/config_helper.cpp
core/config/config_helper.hpp
core/config/property_tree.cpp
core/config/registry.cpp
core/config/solver_config.cpp
core/config/solver_config.hpp
core/device_hooks/CMakeLists.txt
core/distributed/assembly.cpp
core/distributed/helpers.hpp
core/distributed/index_map.cpp
core/distributed/index_map_kernels.hpp
core/distributed/matrix.cpp
core/distributed/partition.cpp
core/distributed/partition_kernels.hpp
core/distributed/preconditioner/schwarz.cpp
core/distributed/vector.cpp
core/distributed/vector_cache.cpp
core/matrix/coo.cpp
core/matrix/coo_kernels.hpp
core/matrix/csr.cpp
core/matrix/csr_kernels.hpp
core/multigrid/pgm.cpp
core/multigrid/pgm_kernels.hpp
core/preconditioner/jacobi.cpp
core/preconditioner/jacobi_kernels.hpp
core/solver/gmres.cpp
core/solver/ir.cpp
core/solver/multigrid.cpp
core/test/accessor/CMakeLists.txt
core/test/config/config.cpp
core/test/config/preconditioner.cpp
core/test/config/property_tree.cpp
core/test/config/solver.cpp
core/test/gtest/CMakeLists.txt
core/test/log/CMakeLists.txt
core/test/matrix/csr.cpp
core/test/mpi/base/bindings.cpp
core/test/mpi/distributed/matrix.cpp
core/test/mpi/distributed/preconditioner/schwarz.cpp
core/test/preconditioner/jacobi.cpp
core/test/utils.hpp
core/test/utils/assertions_test.cpp
cuda/CMakeLists.txt
devices/cuda/CMakeLists.txt
devices/dpcpp/CMakeLists.txt
devices/hip/CMakeLists.txt
devices/omp/CMakeLists.txt
devices/reference/CMakeLists.txt
doc/CMakeLists.txt
doc/examples/CMakeLists.txt
dpcpp/CMakeLists.txt
dpcpp/base/device_matrix_data_kernels.dp.cpp
dpcpp/distributed/assembly_kernels.dp.cpp
dpcpp/distributed/index_map_kernels.dp.cpp
dpcpp/distributed/matrix_kernels.dp.cpp
dpcpp/distributed/partition_kernels.dp.cpp
dpcpp/distributed/vector_kernels.dp.cpp
dpcpp/matrix/csr_kernels.dp.cpp
dpcpp/matrix/dense_kernels.dp.cpp
dpcpp/matrix/ell_kernels.dp.cpp
dpcpp/matrix/sellp_kernels.dp.cpp
dpcpp/matrix/sparsity_csr_kernels.dp.cpp
dpcpp/multigrid/pgm_kernels.dp.cpp
dpcpp/test/base/CMakeLists.txt
examples/CMakeLists.txt
examples/adaptiveprecision-blockjacobi/CMakeLists.txt
examples/batched-solver/CMakeLists.txt
examples/cb-gmres/CMakeLists.txt
examples/custom-logger/CMakeLists.txt
examples/custom-matrix-format/CMakeLists.txt
examples/custom-stopping-criterion/CMakeLists.txt
examples/distributed-solver/distributed-solver.cpp
examples/external-lib-interfacing/CMakeLists.txt
examples/file-config-solver/CMakeLists.txt
examples/ginkgo-overhead/CMakeLists.txt
examples/ginkgo-ranges/CMakeLists.txt
examples/heat-equation/CMakeLists.txt
examples/ilu-preconditioned-solver/CMakeLists.txt
examples/inverse-iteration/CMakeLists.txt
examples/ir-ilu-preconditioned-solver/CMakeLists.txt
examples/iterative-refinement/CMakeLists.txt
examples/minimal-cuda-solver/CMakeLists.txt
examples/mixed-multigrid-preconditioned-solver/CMakeLists.txt
examples/mixed-multigrid-solver/CMakeLists.txt
examples/mixed-precision-ir/CMakeLists.txt
examples/mixed-spmv/CMakeLists.txt
examples/multigrid-preconditioned-solver-customized/CMakeLists.txt
examples/multigrid-preconditioned-solver/CMakeLists.txt
examples/nine-pt-stencil-solver/CMakeLists.txt
examples/papi-logging/CMakeLists.txt
examples/par-ilu-convergence/CMakeLists.txt
examples/performance-debugging/CMakeLists.txt
examples/poisson-solver/CMakeLists.txt
examples/preconditioned-solver/CMakeLists.txt
examples/preconditioner-export/CMakeLists.txt
examples/reordered-preconditioned-solver/CMakeLists.txt
examples/schroedinger-splitting/CMakeLists.txt
examples/simple-solver-logging/CMakeLists.txt
examples/simple-solver/CMakeLists.txt
examples/three-pt-stencil-solver/CMakeLists.txt
extensions/CMakeLists.txt
extensions/test/CMakeLists.txt
extensions/test/config/CMakeLists.txt
extensions/test/kokkos/CMakeLists.txt
hip/CMakeLists.txt
hip/test/matrix/CMakeLists.txt
include/CMakeLists.txt
include/ginkgo/core/base/mpi.hpp
include/ginkgo/core/base/precision_dispatch.hpp
include/ginkgo/core/base/std_extensions.hpp
include/ginkgo/core/base/types.hpp
include/ginkgo/core/config/config.hpp
include/ginkgo/core/config/property_tree.hpp
include/ginkgo/core/distributed/index_map.hpp
include/ginkgo/core/distributed/matrix.hpp
include/ginkgo/core/distributed/partition.hpp
include/ginkgo/core/distributed/preconditioner/schwarz.hpp
include/ginkgo/core/distributed/vector.hpp
include/ginkgo/core/distributed/vector_cache.hpp
include/ginkgo/core/matrix/coo.hpp
include/ginkgo/core/matrix/csr.hpp
include/ginkgo/core/matrix/dense.hpp
include/ginkgo/core/multigrid/pgm.hpp
include/ginkgo/core/preconditioner/jacobi.hpp
include/ginkgo/ginkgo.hpp
matrices/CMakeLists.txt
omp/CMakeLists.txt
omp/base/device_matrix_data_kernels.cpp
omp/distributed/assembly_kernels.cpp
omp/distributed/index_map_kernels.cpp
omp/distributed/matrix_kernels.cpp
omp/distributed/partition_kernels.cpp
omp/distributed/vector_kernels.cpp
omp/matrix/csr_kernels.cpp
omp/matrix/dense_kernels.cpp
omp/matrix/ell_kernels.cpp
omp/matrix/fbcsr_kernels.cpp
omp/matrix/sellp_kernels.cpp
omp/matrix/sparsity_csr_kernels.cpp
omp/multigrid/pgm_kernels.cpp
reference/CMakeLists.txt
reference/base/device_matrix_data_kernels.cpp
reference/distributed/assembly_kernels.cpp
reference/distributed/index_map_kernels.cpp
reference/distributed/matrix_kernels.cpp
reference/distributed/partition_helpers.hpp
reference/distributed/partition_kernels.cpp
reference/distributed/vector_kernels.cpp
reference/matrix/coo_kernels.cpp
reference/matrix/csr_kernels.cpp
reference/matrix/dense_kernels.cpp
reference/matrix/ell_kernels.cpp
reference/matrix/fbcsr_kernels.cpp
reference/matrix/sellp_kernels.cpp
reference/matrix/sparsity_csr_kernels.cpp
reference/multigrid/pgm_kernels.cpp
reference/preconditioner/jacobi_kernels.cpp
reference/test/distributed/assembly_kernels.cpp
reference/test/distributed/index_map_kernels.cpp
reference/test/distributed/matrix_kernels.cpp
reference/test/distributed/partition_kernels.cpp
reference/test/distributed/vector_kernels.cpp
reference/test/log/CMakeLists.txt
reference/test/matrix/coo_kernels.cpp
reference/test/matrix/csr_kernels.cpp
reference/test/matrix/dense_kernels.cpp
reference/test/matrix/ell_kernels.cpp
reference/test/matrix/sellp_kernels.cpp
reference/test/matrix/sparsity_csr_kernels.cpp
reference/test/preconditioner/jacobi_kernels.cpp
reference/test/solver/ir_kernels.cpp
test/distributed/assembly_kernels.cpp
test/distributed/index_map_kernels.cpp
test/distributed/matrix_kernels.cpp
test/distributed/partition_kernels.cpp
test/distributed/vector_kernels.cpp
test/factorization/ic_kernels.cpp
test/factorization/ilu_kernels.cpp
test/matrix/CMakeLists.txt
test/matrix/coo_kernels.cpp
test/matrix/csr_kernels.cpp
test/matrix/csr_kernels2.cpp
test/matrix/matrix.cpp
test/mpi/assembly.cpp
test/mpi/matrix.cpp
test/mpi/multigrid/pgm.cpp
test/mpi/preconditioner/schwarz.cpp
test/mpi/solver/solver.cpp
test/preconditioner/jacobi_kernels.cpp
test/reorder/CMakeLists.txt
test/solver/CMakeLists.txt
test/solver/solver.cpp
test/test_exportbuild/CMakeLists.txt
test/test_install/CMakeLists.txt
test/test_install/test_install.cpp
test/test_pkgconfig/CMakeLists.txt
test/test_subdir/CMakeLists.txt
third_party/CMakeLists.txt
third_party/dummy-hook/CMakeLists.txt
third_party/gflags/CMakeLists.txt
third_party/gtest/CMakeLists.txt
third_party/identify_stream_usage/CMakeLists.txt
third_party/nlohmann_json/CMakeLists.txt

You can find a formatting patch under Artifacts here or run format! if you have write access to Ginkgo


2 participants