LLM

Tuning-Free Longer Context Lengths For LLMs — A Review of Self-Extend (LLM Maybe LongLM)

LLMs are typically trained on fixed-length sequences, so performance degrades on longer texts because relative positions beyond the training length are out-of-distribution (OOD). The paper proposes 'Self-Extend', which uses grouped attention to map these unseen positions back into the trained range. The approach combines normal attention for nearby tokens, preserving precision where it matters most, with grouped attention for distant tokens, preserving context awareness. Without any fine-tuning, the method substantially reduces perplexity on long sequences and improves performance on long-context NLP tasks without hurting short-context performance.
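To make the position remapping concrete, here is a minimal sketch of the grouped-position idea, not the authors' implementation; `group_size` and `window` are illustrative names for the group width and the neighbor span.

```python
import torch

def self_extend_rel_positions(seq_len: int, group_size: int, window: int) -> torch.Tensor:
    """Sketch of Self-Extend-style position remapping (illustrative parameters:
    group_size = positions merged into one grouped position,
    window = neighbor span that keeps exact, ungrouped distances)."""
    q_pos = torch.arange(seq_len).unsqueeze(1)   # query positions (rows)
    k_pos = torch.arange(seq_len).unsqueeze(0)   # key positions (cols)
    rel = q_pos - k_pos                          # standard relative distances

    # Grouped attention: floor-divide distances so far-away tokens reuse
    # relative positions the model actually saw during training.
    grouped = rel // group_size
    # Shift grouped distances so they start right after the neighbor window,
    # keeping the exact and grouped ranges from overlapping.
    grouped = grouped + (window - window // group_size)

    # Normal attention for nearby tokens, grouped attention for distant ones.
    return torch.where(rel < window, rel, grouped)

# Example: distances in a longer sequence are folded back into the range a
# shorter-context model was trained on, while the closest `window` tokens
# keep their exact distances.
pos = self_extend_rel_positions(seq_len=16, group_size=4, window=4)
```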

Demystifying GQA — Grouped Query Attention for Efficient LLM Pre-training

The article explores Grouped Query Attention (GQA), an efficient pre-training strategy used in large language models (LLMs) such as LLaMA-2 and Mistral 7B. It describes GQA as a hybrid of multi-head attention (MHA) and multi-query attention (MQA), striking a balance between computational efficiency and model quality. The article also discusses the challenges of MHA, such as the memory-bandwidth cost of loading the key/value cache, and how GQA addresses them by letting groups of query heads share key/value heads, optimizing training and inference in large-scale models.
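As a rough illustration, a GQA forward pass can be sketched as below; shapes and names are assumptions for this example, not the LLaMA-2 or Mistral implementation.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_query_heads, n_kv_heads):
    """Sketch of GQA: each group of query heads shares one key/value head.
    Illustrative shapes: q is (batch, n_query_heads, seq, head_dim),
    k and v are (batch, n_kv_heads, seq, head_dim)."""
    group_size = n_query_heads // n_kv_heads
    # Expand the shared K/V heads so every query head in a group attends to
    # the same keys and values -- far fewer K/V heads need to be cached.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

# MHA is the special case n_kv_heads == n_query_heads; MQA is n_kv_heads == 1.
b, s, d = 2, 128, 64
q = torch.randn(b, 8, s, d)   # 8 query heads
k = torch.randn(b, 2, s, d)   # 2 shared key heads
v = torch.randn(b, 2, s, d)   # 2 shared value heads
out = grouped_query_attention(q, k, v, n_query_heads=8, n_kv_heads=2)
```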

Understanding LoRA — Low Rank Adaptation For Finetuning Large Models

Fine-tuning large pre-trained models is computationally demanding, since the traditional approach updates millions (or billions) of parameters. While effective, it requires substantial compute and time, posing a bottleneck for adapting these models to specific tasks. LoRA offers an effective solution by decomposing the weight update during fine-tuning into two small low-rank matrices, so only a tiny fraction of the parameters needs to be trained.
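A minimal sketch of the idea, assuming a simple PyTorch wrapper around a linear layer; names such as `LoRALinear` are illustrative, not from the LoRA reference code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-adapted linear layer: the pretrained weight stays
    frozen and only the low-rank factors A and B are trained."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                    # freeze pretrained W
        # B starts at zero so the adapter initially adds nothing (B @ A = 0).
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # W x + (B A) x : the update Delta_W = B @ A has rank <= `rank`, so
        # only (in + out) * rank parameters are trained instead of in * out.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
y = layer(torch.randn(1, 4096))
```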