LLM

Tuning-Free Longer Context Lengths For LLMs — A Review of Self-Extend (LLM Maybe LongLM)

LLMs are typically trained on fixed-length sequences, so performance degrades on longer texts because relative positions beyond the training length are out-of-distribution (OOD). The paper proposes 'Self-Extend', which uses grouped attention to map these unseen positions back into the trained range. The approach combines normal attention for nearby tokens, preserving precision where it matters most, with grouped attention for distant tokens, preserving context awareness. Without any fine-tuning, the method substantially reduces perplexity on long sequences and improves performance on long-context NLP tasks without hurting short-context performance.
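To make the position remapping concrete, here is a minimal sketch of the grouped-position idea, not the authors' implementation; `group_size` and `window` are illustrative names for the group width and the neighbor span.

```python
import torch

def self_extend_rel_positions(seq_len: int, group_size: int, window: int) -> torch.Tensor:
    """Sketch of Self-Extend-style position remapping (illustrative parameters:
    group_size = positions merged into one grouped position,
    window = neighbor span that keeps exact, ungrouped distances)."""
    q_pos = torch.arange(seq_len).unsqueeze(1)   # query positions (rows)
    k_pos = torch.arange(seq_len).unsqueeze(0)   # key positions (cols)
    rel = q_pos - k_pos                          # standard relative distances

    # Grouped attention: floor-divide distances so far-away tokens reuse
    # relative positions the model actually saw during training.
    grouped = rel // group_size
    # Shift grouped distances so they start right after the neighbor window,
    # keeping the exact and grouped ranges from overlapping.
    grouped = grouped + (window - window // group_size)

    # Normal attention for nearby tokens, grouped attention for distant ones.
    return torch.where(rel < window, rel, grouped)

# Example: distances in a longer sequence are folded back into the range a
# shorter-context model was trained on, while the closest `window` tokens
# keep their exact distances.
pos = self_extend_rel_positions(seq_len=16, group_size=4, window=4)
```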

Demystifying GQA — Grouped Query Attention for Efficient LLM Pre-training

The article explores Grouped Query Attention (GQA), an efficient pre-training strategy used in large language models (LLMs) such as LLaMA-2 and Mistral 7B. It describes GQA as a hybrid of multi-head attention (MHA) and multi-query attention (MQA), striking a balance between computational efficiency and model quality. The article also discusses the challenges of MHA, such as the memory-bandwidth cost of loading the key/value cache, and how GQA addresses them by letting groups of query heads share key/value heads, optimizing training and inference in large-scale models.
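As a rough illustration, a GQA forward pass can be sketched as below; shapes and names are assumptions for this example, not the LLaMA-2 or Mistral implementation.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_query_heads, n_kv_heads):
    """Sketch of GQA: each group of query heads shares one key/value head.
    Illustrative shapes: q is (batch, n_query_heads, seq, head_dim),
    k and v are (batch, n_kv_heads, seq, head_dim)."""
    group_size = n_query_heads // n_kv_heads
    # Expand the shared K/V heads so every query head in a group attends to
    # the same keys and values -- far fewer K/V heads need to be cached.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

# MHA is the special case n_kv_heads == n_query_heads; MQA is n_kv_heads == 1.
b, s, d = 2, 128, 64
q = torch.randn(b, 8, s, d)   # 8 query heads
k = torch.randn(b, 2, s, d)   # 2 shared key heads
v = torch.randn(b, 2, s, d)   # 2 shared value heads
out = grouped_query_attention(q, k, v, n_query_heads=8, n_kv_heads=2)
```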

Understanding LoRA — Low Rank Adaptation For Finetuning Large Models

Fine-tuning large pre-trained models is computationally demanding, since the traditional approach updates millions (or billions) of parameters. While effective, it requires substantial compute and time, posing a bottleneck for adapting these models to specific tasks. LoRA offers an effective solution by decomposing the weight update during fine-tuning into two small low-rank matrices, so only a tiny fraction of the parameters needs to be trained.
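A minimal sketch of the idea, assuming a simple PyTorch wrapper around a linear layer; names such as `LoRALinear` are illustrative, not from the LoRA reference code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-adapted linear layer: the pretrained weight stays
    frozen and only the low-rank factors A and B are trained."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                    # freeze pretrained W
        # B starts at zero so the adapter initially adds nothing (B @ A = 0).
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # W x + (B A) x : the update Delta_W = B @ A has rank <= `rank`, so
        # only (in + out) * rank parameters are trained instead of in * out.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
y = layer(torch.randn(1, 4096))
```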