Harsh Tomar

AI & ML Explorer | Computer Vision Enthusiast

Research Papers Implementation

Vision Transformers

Vision Transformers (ViT) apply the Transformer architecture, originally designed for natural language processing, to image classification tasks. The key idea is to split an image into fixed-size patches, linearly embed each patch, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder (a minimal sketch of this front end follows below).

Attention Mechanism · CNNs · Multi-head Self-Attention (MSA) · Residual Connections · PyTorch
Research Paper · View on GitHub
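To make the patch-embedding step concrete, here is a minimal PyTorch sketch of the ViT front end described above. Names and hyperparameters (224×224 images, 16×16 patches, 768-dim embeddings) are illustrative defaults, not necessarily what the repository uses:

```python
import torch
import torch.nn as nn

class ViTEmbedding(nn.Module):
    """Split an image into patches, embed them, and add position embeddings.

    A minimal sketch of the ViT input pipeline; hyperparameters are illustrative.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A conv with kernel = stride = patch_size is equivalent to slicing
        # the image into patches and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)     # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend the [CLS] token
        return x + self.pos_embed            # add learned position embeddings

embed = ViTEmbedding()
tokens = embed(torch.randn(2, 3, 224, 224))  # -> (2, 197, 768)
```

The resulting token sequence can be passed directly to a standard Transformer encoder; the [CLS] token's final representation is what gets fed to the classification head.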

LoRA & QLoRA in PyTorch

Implementation of Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA) in PyTorch. LoRA freezes the pretrained weights and trains only small low-rank update matrices, sharply reducing the number of trainable parameters; QLoRA additionally quantizes the frozen base model, cutting memory usage further. Together these methods make fine-tuning large language models practical in resource-constrained environments (a sketch of the core LoRA idea follows below).

LoRA · QLoRA · Parameter-Efficient Fine-Tuning · PyTorch · LLMs · Quantization
LoRA Research Paper · QLoRA Research Paper · View on GitHub
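The following is a minimal sketch of the core LoRA idea: a frozen linear layer augmented with a trainable low-rank update. The class name and the values of `r` and `alpha` are illustrative, not taken from the repository:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update.

    Forward pass: h = W x + (alpha / r) * B A x, where W is frozen and
    A (r x in_features) and B (out_features x r) are the only trainable parts.
    """
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))  # only lora_A / lora_B receive gradients
```

Because `lora_B` starts at zero, the wrapped layer initially behaves exactly like the pretrained one; training then learns a rank-`r` correction on top of it. QLoRA follows the same pattern but stores the frozen base weights in a quantized format.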

VLMverse

PyTorch implementations of cutting-edge vision-language models from scratch, demystifying multimodal AI with clean, educational code and detailed architectural breakdowns. The goal is to turn research papers into working code. Currently features PaLiGemma (a SigLIP vision encoder paired with the Gemma language model), with more models coming soon (a sketch of the vision-to-language bridge follows below).

Multi-Head Attention · Encoder-Decoder Architecture · RoPE Embeddings · VLMs · SigLIP · CLIP Vision Encoder
Google PaLiGemma Research Paper · View on GitHub
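Below is a minimal sketch of the general pattern these models share: patch features from the vision encoder are projected into the language model's embedding space and prefixed to the text tokens. The class name and dimensions are illustrative assumptions, not PaLiGemma's actual configuration:

```python
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    """Project vision-encoder patch features into the language model's
    embedding space so image tokens can be prefixed to the text sequence.

    Dimensions are illustrative placeholders, not a specific model's config.
    """
    def __init__(self, vision_dim=1152, lm_dim=2048):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, image_feats, text_embeds):
        # image_feats: (B, num_patches, vision_dim) from the vision encoder
        # text_embeds: (B, seq_len, lm_dim) from the LM's embedding table
        image_tokens = self.proj(image_feats)   # (B, num_patches, lm_dim)
        # Prefix image tokens to the text sequence for the decoder.
        return torch.cat([image_tokens, text_embeds], dim=1)

proj = MultimodalProjector()
fused = proj(torch.randn(1, 256, 1152), torch.randn(1, 16, 2048))  # -> (1, 272, 2048)
```

The fused sequence is then processed by the language model, which attends over image and text tokens jointly to generate its output.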