He was advised by Prof. Sruthi Sekar received the P. Bhatnagar Prize for Best Integrated Ph.D. Student in the Department of Mathematics.
Classical cyclic allocations designed for homogeneous settings are not appropriate, but the advent of task-based runtime systems makes it possible to use more general allocations. Previous theoretical work has proposed square and cube partitioning algorithms aimed at minimizing data movement for matrix multiplication.
We propose techniques to adapt these continuous square partitionings to allocating discrete tiles of a matrix, and strategies to adapt the static allocation at runtime.
We use these techniques in an implementation of matrix multiplication based on the StarPU runtime system, and we show through extensive experiments that this implementation consistently achieves a lower communication volume while slightly improving the execution time, compared to standard state-of-the-art dynamic strategies.
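To illustrate why square-like allocations reduce data movement, here is a minimal sketch, not the StarPU-based implementation described above. It assumes a tiled product C = A·B in which a processor that owns a set of C tiles must receive one tile row of A for every distinct row index it owns and one tile column of B for every distinct column index. The function names (`communication_volume`, `striped`, `blocked`) are hypothetical, and the processors are taken to be homogeneous for simplicity.

```python
from collections import defaultdict

def communication_volume(owner):
    """Sum over processors of (distinct tile rows + distinct tile columns) owned.

    owner[i][j] is the id of the processor that computes tile C[i][j].
    """
    rows = defaultdict(set)
    cols = defaultdict(set)
    for i, row in enumerate(owner):
        for j, p in enumerate(row):
            rows[p].add(i)
            cols[p].add(j)
    return sum(len(rows[p]) + len(cols[p]) for p in rows)

def striped(n, p):
    """1D allocation: each of p processors owns n // p consecutive tile rows."""
    return [[i // (n // p)] * n for i in range(n)]

def blocked(n, q):
    """2D allocation: a q x q grid of processors, each owning a square block."""
    b = n // q
    return [[(i // b) * q + (j // b) for j in range(n)] for i in range(n)]

n = 8
vol_striped = communication_volume(striped(n, 4))  # 4 processors, row stripes
vol_blocked = communication_volume(blocked(n, 2))  # 4 processors, 2 x 2 blocks
print(vol_striped, vol_blocked)  # prints "40 32"
```

With the same number of tiles per processor, the square blocks touch fewer distinct rows and columns than the stripes, which is the quantity the square partitionings aim to minimize.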
The performance challenge of scatter and gather operations lies in their irregular memory access patterns, especially on architectures with high memory latency, such as GPUs. Previous work has proposed multi-pass scatter and gather schemes to optimize their performance on earlier GPUs; on newer GPUs, however, anecdotal evidence has shown that such schemes bring little performance benefit on small datasets, and few studies have been conducted on larger datasets.
Therefore, we propose a systematic study to re-evaluate the performance of multi-pass scatter and gather on three newer GPUs with various data sizes. We then develop an analytical model to analyze the execution of irregular memory accesses and estimate the multi-pass performance.
Our evaluation on the newer GPUs shows that (1) TLB caching can affect the performance of irregular memory accesses more significantly than data caching; and (2) on datasets larger than the L3 TLB size, the multi-pass schemes, with a suitable number of passes, can reduce up to [...]. Our model can predict the multi-pass performance on various GPUs, with an average accuracy of [...]. It can further suggest a suitable number of passes for the best performance.
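The idea behind a multi-pass scatter can be sketched on the CPU as follows. This is an illustrative sketch only, not the GPU kernels evaluated in the study. It assumes the output is split into equally sized contiguous chunks and that each pass writes only the elements whose destination falls in the current chunk, so that the writes of one pass stay within a small address range. The function names are hypothetical.

```python
import random

def scatter_single_pass(values, dest, out_size):
    """out[dest[i]] = values[i], in one sweep over the input."""
    out = [0] * out_size
    for v, d in zip(values, dest):
        out[d] = v
    return out

def scatter_multi_pass(values, dest, out_size, num_passes):
    """Same result, but each pass writes to one contiguous output chunk only."""
    out = [0] * out_size
    chunk = -(-out_size // num_passes)  # ceiling division
    for p in range(num_passes):
        lo, hi = p * chunk, min((p + 1) * chunk, out_size)
        for v, d in zip(values, dest):
            if lo <= d < hi:  # element lands in this pass's chunk
                out[d] = v
    return out

random.seed(0)
n = 1000
dest = list(range(n))
random.shuffle(dest)  # a random permutation: every slot is written once
values = list(range(n))

single = scatter_single_pass(values, dest, n)
multi = scatter_multi_pass(values, dest, n, num_passes=4)
print(single == multi)  # prints "True"
```

Each extra pass rereads the whole input, so the locality gained on the output side has to outweigh that cost. This is why the number of passes matters.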
Matrix factorization (MF) discovers latent features from observations and has shown great promise in the fields of collaborative filtering, data compression, feature extraction, word embedding, etc. While many problem-specific optimization techniques have been proposed, alternating least squares (ALS) remains popular due to its general applicability.
Current MF implementations are either optimized for a single machine or require a large computer cluster yet still run slowly. This is because a single machine provides limited compute power for large-scale data, while multiple machines suffer from the network communication bottleneck.
To address this challenge, we propose a novel machine learning framework to accelerate ALS on graphics processing units (GPUs). We analyze the procedure of MF and focus on enhancing efficiency via both memory optimization and approximate computing.
The former exploits the GPU memory hierarchy to increase data reuse, while the latter reduces unnecessary computing without hurting the convergence of the learning algorithm.
Extensive experiments on large-scale datasets show that our system not only outperforms all competing CPU solutions by a large margin but also has a 2x-4x performance gain compared to the state-of-the-art GPU solution.
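For readers unfamiliar with ALS, the alternating structure can be sketched in a few lines. This is a minimal CPU sketch under simplifying assumptions, not the GPU framework described above: it uses a rank-1 model (each rating is approximated by `u[i] * v[j]`) so that every update has a closed form, and it includes neither the memory optimizations nor the approximate computing. The names `als_rank1` and `loss` and the regularization weight `lam` are hypothetical.

```python
def loss(ratings, u, v, lam):
    """Regularized squared error over the observed entries."""
    fit = sum((r - u[i] * v[j]) ** 2 for i, j, r in ratings)
    reg = lam * (sum(x * x for x in u) + sum(x * x for x in v))
    return fit + reg

def als_rank1(ratings, num_rows, num_cols, lam=0.1, iters=10):
    """ratings: list of (row, col, value) observations."""
    u = [1.0] * num_rows
    v = [1.0] * num_cols
    history = []
    for _ in range(iters):
        # Fix v and solve the 1-D regularized least-squares problem per row.
        num = [0.0] * num_rows
        den = [lam] * num_rows
        for i, j, r in ratings:
            num[i] += r * v[j]
            den[i] += v[j] * v[j]
        u = [a / b for a, b in zip(num, den)]
        # Fix u and solve for each column factor in the same way.
        num = [0.0] * num_cols
        den = [lam] * num_cols
        for i, j, r in ratings:
            num[j] += r * u[i]
            den[j] += u[i] * u[i]
        v = [a / b for a, b in zip(num, den)]
        history.append(loss(ratings, u, v, lam))
    return u, v, history

# Five observed entries of a 3 x 2 rating matrix.
ratings = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 2.0), (1, 1, 4.0), (2, 0, 3.0)]
u, v, history = als_rank1(ratings, num_rows=3, num_cols=2)
print(history[0], history[-1])
```

Because each half-step minimizes the objective exactly over one factor while the other is held fixed, the recorded loss never increases from one iteration to the next. With a higher rank, each closed-form division becomes a small linear solve per row or column.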
Our implementations are open-sourced and available as a library to accelerate many applications. Our implementations are publicly available on GitHub https:
Besides its intended purpose of saving valuable storage on hard disks, compression can be utilized to increase the effective bandwidth to attached storage, as realized by state-of-the-art file systems.
In the foreseeable future, on-the-fly compression and decompression will gain utmost importance for the processing of data-intensive applications such as streamed Deep Learning tasks or Next Generation Sequencing pipelines, which establishes the need for fast parallel implementations.
Huffman coding is an integral part of a number of compression methods. However, efficient parallel implementation of Huffman decompression is difficult due to inherent data dependencies.
Unnikrishnan C, Rupesh Nasre, and Y. N. Srikant, Falcon: A Graph Manipulation Language for Heterogeneous Systems, ACM Transactions on Architecture and Code Optimization, 12(4).
Saurabh Kalikar, Rupesh Nasre, DomLock: a new multi-granularity locking technique for hierarchies, Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, March, Barcelona, Spain.
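One data dependency that makes Huffman decompression hard to parallelize can be shown with a small sequential decoder. Codewords have variable length, so the position where a codeword starts is only known once all preceding bits have been decoded. The sketch below is a plain illustration with a hypothetical three-symbol code table, not a parallel implementation.

```python
def huffman_decode(bits, code_table):
    """Decode a bit string with a prefix-free code table {codeword: symbol}."""
    out, current = [], ""
    for b in bits:
        current += b
        if current in code_table:  # a full codeword has been read
            out.append(code_table[current])
            current = ""
    return "".join(out)

code_table = {"0": "a", "10": "b", "11": "c"}
bits = "0101100"  # the codewords 0, 10, 11, 0, 0

decoded = huffman_decode(bits, code_table)
# Start decoding at bit offset 2, which lies in the middle of the codeword "10".
misaligned = huffman_decode(bits[2:], code_table)

print(decoded)     # prints "abcaa"
print(misaligned)  # prints "acaa"
```

A worker that starts at an arbitrary offset may begin inside a codeword and emit wrong symbols, here `a` instead of `b`. A parallel decoder therefore has to find or verify valid starting positions for each chunk.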
THESIS CERTIFICATE This is to certify that the thesis entitled SFFMap: Set-First Fill Mapping for an Energy Efficient Pipelined Data Cache, submitted by Pritam Majumder, to the Indian Institute of Technology, Madras, for the award of the degree of Master of Science (by Research), is a bona fide record of the research work carried out by him under my supervision.
from hardware and limiting the scope of the application domain.
This thesis proposes a new software/hardware co-design approach to achieving 3P platforms, called the loop-task accelerator (LTA) platform, that provides high productivity and portability without sacrificing performance or efficiency across a wide range of applications.
Rupesh is an Assistant Professor in the CSE department at IIT Madras. He completed his PhD at IISc Bangalore and a Post-Doctoral Fellowship at the University of Texas at Austin. His research focus is in Compilers and Parallelization.
Instead of performing local dataflow analyses on all procedures during a multi-file optimized code generation, those dataflow analyses are done only on a generally much smaller set of procedures that were actually impacted by source code edits.
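The idea of re-analyzing only the procedures impacted by an edit can be sketched as follows. The text does not say how impacted procedures are identified, so this sketch makes an assumption: a procedure counts as impacted when the hash of its source text changes. The `analyze` function is a hypothetical stand-in for a real local dataflow analysis, and all names here are illustrative.

```python
import hashlib

def fingerprint(source):
    """Content hash used to detect whether a procedure's source changed."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()

def analyze(name, source):
    """Hypothetical stand-in for a local dataflow analysis: counts assignments."""
    return source.count("=")

def incremental_analyze(procedures, cache):
    """procedures: {name: source}; cache: {name: (fingerprint, result)}.

    Returns (results, list of procedure names that were re-analyzed).
    """
    results, reanalyzed = {}, []
    for name, source in procedures.items():
        fp = fingerprint(source)
        cached = cache.get(name)
        if cached is not None and cached[0] == fp:
            results[name] = cached[1]  # unchanged: reuse the earlier result
        else:
            results[name] = analyze(name, source)
            cache[name] = (fp, results[name])
            reanalyzed.append(name)
    return results, reanalyzed

cache = {}
v1 = {"f": "x = 1", "g": "y = x", "h": "z = 2"}
_, first = incremental_analyze(v1, cache)

v2 = dict(v1, g="y = x; w = y")  # only procedure g is edited
results, second = incremental_analyze(v2, cache)
print(first, second)  # prints "['f', 'g', 'h'] ['g']"
```

A real compiler would probably also have to treat procedures whose analysis depends on an edited procedure as impacted. This sketch ignores that case.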