Accelerating Matrix Multiplication with CPU Multithreading and CUDA Block-Based GPU Parallelization
Abstract
As technology advances, the volume of data to be processed keeps growing. This article examines the problems associated with the speed of computing devices when performing arithmetic operations on large matrices. One effective approach to matrix multiplication is the block-based method, which partitions large matrices into blocks and computes the product block by block. The block-based parallel method is applied to matrices of different sizes on the computer's graphics processor using CUDA (Compute Unified Device Architecture) and, for devices without a graphics processor, on the central processor using the OpenMP (Open Multi-Processing) parallel library. The study measures the time required to multiply matrices of sizes 64x64, 128x128, 512x512, 1024x1024 and 2048x2048 with these parallel processing technologies, using both the simple sequential naive method and the parallel block-based method. It concludes with a systematic analysis of performance metrics for several block sizes (8x8, 16x16, 32x32, etc.), an assessment of the comparative efficiency of the CPU and GPU matrix multiplication implementations, and a determination of practical limits for real-world parallel processing, based on comparing block-size efficiency with the OpenMP parallel programming model on CPUs and CUDA technology on NVIDIA GPUs.
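To make the two methods named above concrete, the following is a minimal C++ sketch of a naive multiplication and a block-based one parallelized with an OpenMP pragma. It is an illustration only, not the implementation evaluated in the study: the names `Matrix`, `multiply_naive` and `multiply_blocked`, the flat row-major storage, and the loop ordering are all assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Assumption of this sketch: square n x n matrices stored row-major in a flat vector.
using Matrix = std::vector<double>;

// Naive sequential baseline: each element of C is a full dot product.
Matrix multiply_naive(const Matrix& a, const Matrix& b, int n) {
    Matrix c(n * n, 0.0);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    return c;
}

// Block-based variant: C is split into bs x bs tiles (edge tiles may be smaller).
// Each (ii, jj) tile of C is updated by exactly one thread, so no locking is needed.
Matrix multiply_blocked(const Matrix& a, const Matrix& b, int n, int bs) {
    assert(n > 0 && bs > 0);
    Matrix c(n * n, 0.0);
    #pragma omp parallel for collapse(2) schedule(static)
    for (int ii = 0; ii < n; ii += bs) {
        for (int jj = 0; jj < n; jj += bs) {
            const int i_end = std::min(ii + bs, n);
            const int j_end = std::min(jj + bs, n);
            // Accumulate the contribution of every block along the k dimension.
            for (int kk = 0; kk < n; kk += bs) {
                const int k_end = std::min(kk + bs, n);
                for (int i = ii; i < i_end; ++i) {
                    for (int k = kk; k < k_end; ++k) {
                        const double aik = a[i * n + k];
                        for (int j = jj; j < j_end; ++j) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

When compiled with OpenMP enabled (for example `-fopenmp` with GCC or Clang), the tiles of the result are distributed across threads; without that flag the pragma is ignored and the same code runs sequentially. The block size `bs` plays the role of the 8x8, 16x16 and 32x32 settings compared in the study, although the best value on a given machine has to be measured rather than assumed.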