Article

Optimizing Block Size and Memory Usage for Image Processing on GPU Using CUDA

Mekhriddin Rakhimov, Tashkent University of Information Technologies named after Muhammad Al-Khwarizmi, Department of Computer Systems, Tashkent, Uzbekistan
Mannon Ochilov, Tashkent University of Information Technologies named after Muhammad Al-Khwarizmi, Department of Robotics and Intelligent Systems, Tashkent, Uzbekistan
Shakhzod Javliev, Tashkent University of Information Technologies named after Muhammad Al-Khwarizmi, Department of Computer Systems, Tashkent, Uzbekistan
Azizbek Khojamurotov, Tashkent University of Information Technologies named after Muhammad Al-Khwarizmi, Department of Computer Systems, Tashkent, Uzbekistan

2026

Abstract

This article accelerates Canny edge detection by processing images in parallel on the GPU. All stages of the Canny algorithm were implemented on the graphics processor using Compute Unified Device Architecture (CUDA), and the image data was placed in the GPU's global, shared, and texture memory to speed up computation. To optimize GPU memory usage, images of different sizes were tested in global and shared memory with block sizes of $8 \times 8$, $16 \times 16$, and $32 \times 32$. For comparison, execution time was also measured on the CPU using the Open Source Computer Vision Library (OpenCV). Based on the results, $8 \times 8$ blocks were considered the most suitable for small images and $32 \times 32$ blocks for large images, while the $16 \times 16$ block configuration showed the highest efficiency across all images. On a $4096 \times 4096$ image, shared memory with the $16 \times 16$ block configuration ran $\sim 81$ times faster than the CPU and $\sim 49\%$ faster than global memory.
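As a rough illustration of what the three block sizes mean for a kernel launch, the Python sketch below works out the grid geometry and the shared-memory tile size for each configuration. The function name `launch_config`, the one-pixel halo (which would suit a $3 \times 3$ stencil such as a Sobel gradient), and the one-byte-per-pixel format are assumptions made for illustration; the abstract does not state them.

```python
def launch_config(width, height, block, halo=1, bytes_per_pixel=1):
    """Estimate launch geometry for a square block of block x block threads.

    halo and bytes_per_pixel are illustrative assumptions, not values
    taken from the article.
    """
    # Ceiling division: enough blocks to cover every pixel of the image.
    grid_x = (width + block - 1) // block
    grid_y = (height + block - 1) // block
    threads_per_block = block * block
    # A shared-memory tile holds the block's pixels plus a halo border,
    # so that threads on the block edge can read their neighbours.
    tile_side = block + 2 * halo
    tile_pixels = tile_side * tile_side
    return {
        "grid": (grid_x, grid_y),
        "blocks": grid_x * grid_y,
        "threads_per_block": threads_per_block,
        "shared_bytes": tile_pixels * bytes_per_pixel,
        # Pixels loaded into shared memory per pixel actually computed.
        "halo_overhead": tile_pixels / threads_per_block,
    }


if __name__ == "__main__":
    for block in (8, 16, 32):
        cfg = launch_config(4096, 4096, block)
        print(block, cfg)
```

Under these assumptions, a $4096 \times 4096$ image needs 262,144 blocks of 64 threads at $8 \times 8$, 65,536 blocks of 256 threads at $16 \times 16$, and 16,384 blocks of 1,024 threads at $32 \times 32$. The halo overhead shrinks as the block grows (about 1.56, 1.27, and 1.13 loaded pixels per computed pixel), while $32 \times 32$ reaches the 1,024-thread-per-block limit of current CUDA devices. The sketch only describes launch geometry and does not predict the timings reported in the abstract.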

Topics

Parallel Computing and Optimization Techniques
CCD and CMOS Imaging Sensors
Big Data and Digital Economy

Identifiers

DOI: 10.1109/acdsa67686.2026.11468221

Citations and references

Cited by 0
18 references
