Lit Review: Mind Evolution

Motivation:

  • Challenge: guiding LLMs to engage in deeper thinking and enhance their problem-solving capabilities
  • Existing research exhibits various strategies for leveraging inference-time compute: chain-of-thought prompting, sequential revision based on feedback
  • A search strategy like Mind Evolution offers an advantage: it can improve problem-solving by exploring a larger set of candidate solutions

Conceptual Walkthrough: Here’s how Mind Evolution tackles a problem like TravelPlanner

  • Analyzes the Problem: a user query with travel preferences and constraints, plus a set of options for travel, accommodation, food, and attractions
  • Generates Initial Solutions: Mind Evolution creates an initial set of diverse candidate trip plans
  • Evaluates and Refines: each plan is evaluated by a program that:
    • Checks how well the plan meets the user’s requirements and preferences
    • Gives scores and provides textual feedback on any issues.
  • Simulates a critical conversation: Mind Evolution uses the LLM to act as both a “critic” and an “author.”
    • The critic analyzes the plan, understands the evaluator’s feedback, and suggests ways to fix problems.
    • The author takes the critic’s advice and creates an improved version of the plan.
    • This back-and-forth continues for several rounds.
  • Evolves solutions over generations: Mind Evolution repeats the process of generating, evaluating, and refining solutions over multiple “generations,” like an evolutionary process.
    • Better-scoring plans are more likely to be selected for further refinement.
    • Parts of different plans can be combined (crossover) and small changes can be introduced (mutation) to explore new possibilities.
  • Uses multiple “islands” for diversity: Solutions are evolved on multiple independent “islands” to prevent getting stuck in a rut.
    • The best solutions from each island are shared with others (migration), encouraging exploration of a wider range of ideas.
  • Resets islands for efficiency: Islands with poorly performing solutions are periodically reset.
    • This involves replacing the bad solutions with good ones from the global population.
  • Stops when a solution is found or after a set number of generations.
    • The best plan is then presented as the solution to the TravelPlanner problem.
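The walkthrough above can be sketched as a minimal evolutionary loop. The LLM calls (candidate generation, critic/author refinement) and the plan evaluator are stubbed out with toy numeric stand-ins here; all function names and hyperparameter values are illustrative, not the paper's.

```python
import random

# Hypothetical stand-ins for the LLM and the evaluator described above;
# a real system would prompt a language model and run a plan checker.
def generate_candidate(rng):
    return rng.random()                          # a "plan", scored in [0, 1]

def evaluate(plan):
    return plan, f"score={plan:.2f}"             # (fitness, textual feedback)

def critic_author_refine(plan, feedback, rng):
    # Critic reads the feedback, author proposes an improved plan.
    return min(1.0, plan + rng.uniform(0, 0.1))

def crossover(a, b, rng):
    return (a + b) / 2                           # combine parts of two plans

def evolve(num_islands=4, pop_size=8, generations=10, migrate_every=3, seed=0):
    rng = random.Random(seed)
    islands = [[generate_candidate(rng) for _ in range(pop_size)]
               for _ in range(num_islands)]
    for gen in range(generations):
        for i, pop in enumerate(islands):
            scored = sorted((evaluate(p) for p in pop), reverse=True)
            parents = [p for p, _ in scored[:pop_size // 2]]  # selection
            children = []
            while len(children) < pop_size - len(parents):
                a, b = rng.sample(parents, 2)
                child = crossover(a, b, rng)
                _, feedback = evaluate(child)
                child = critic_author_refine(child, feedback, rng)
                children.append(child)
            islands[i] = parents + children
        if gen % migrate_every == migrate_every - 1:
            # Migration: share the global best; reset the weakest island's slot.
            best = max(max(pop) for pop in islands)
            worst_i = min(range(num_islands), key=lambda i: max(islands[i]))
            islands[worst_i][0] = best
    return max(max(pop) for pop in islands)
```

In a real deployment the stopping condition would also fire as soon as the evaluator reports a fully feasible plan, rather than always running the full generation budget.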

Findings/Results:

  • Mind Evolution excels in solving complex NL planning problems, surpassing baseline methods like Best-of-N and Sequential Revision on benchmarks like TravelPlanner and Natural Plan
  • Eliminates the need for formal solvers or problem translation, which often demand significant domain expertise and effort; it achieves this by using an LLM to generate, recombine, and refine candidate responses based on feedback from an evaluator
  • Emphasizes the significance of combining divergent and convergent thinking in problem-solving. Uses a genetics-based approach that iteratively evolves a population of candidate solutions towards higher quality
  • Success rates increase as the number of candidate solutions increases
  • Mind Evolution’s effectiveness is further validated through a new benchmark, StegPoet, a challenging task involving the encoding of hidden messages in creative writing, demonstrating the approach’s applicability beyond conventionally formalized domains

Lit Review: Lost in the Middle

Motivation:

  • Context lengths of LLMs have been increasing rapidly, with models now accepting many thousands of tokens as input
  • Increasing context length allows for processing of lengthy external documents, enhancing memory for long conversations, search, summarization, etc.
  • How effectively these long-context LLMs actually use information across the full span of the growing context window has not been properly explored
  • This research asks whether models can robustly access and use information located anywhere within these extended context windows, particularly when the relevant information sits in the middle
  • Do these models exhibit primacy bias (favoring information at the beginning of the context window) or recency bias (end of context window)?

External Context Review

No external research was needed or conducted to understand this paper

Conceptual Walkthrough:

Authors devised controlled experiments focusing on two tasks:

Multi-document Question Answering: Models receive multiple documents, only one containing the answer to a given question. Researchers systematically vary the position of the answer-containing document within the input context.

Key-Value Retrieval: A simplified task where models need to extract the value corresponding to a specific key from a JSON object. The position of the key-value pair is manipulated within the input.
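The key-value retrieval setup can be reproduced with a small prompt builder that controls where the queried pair sits in the serialized JSON. The key/value format below is a synthetic stand-in (the paper used random identifiers), and the function name and prompt wording are illustrative.

```python
import json
import random

def make_kv_prompt(num_pairs, target_position, seed=0):
    """Build a JSON key-value retrieval prompt with the queried pair placed
    at a chosen position (0 = first, num_pairs - 1 = last)."""
    rng = random.Random(seed)
    pairs = [(f"key-{i}", f"val-{rng.randrange(10**6)}")
             for i in range(num_pairs)]
    rng.shuffle(pairs)
    target = pairs.pop()
    pairs.insert(target_position, target)      # control where the answer sits
    blob = json.dumps(dict(pairs), indent=1)
    query_key, expected = target
    prompt = (f"JSON data:\n{blob}\n\n"
              f'What is the value of "{query_key}"?')
    return prompt, expected

# Sweep the answer position across the context (beginning, middle, end)
# to chart accuracy against position, as in the paper's experiments.
prompts = {pos: make_kv_prompt(num_pairs=75, target_position=pos)[0]
           for pos in (0, 37, 74)}
```

Running the same sweep against a model and scoring whether the returned value matches `expected` yields the accuracy-vs-position curve discussed below.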

Key Findings:

The U-shaped Performance Curve: A consistent finding across both tasks is the emergence of a U-shaped performance curve. Model accuracy is generally higher when the relevant information is at the beginning (primacy bias) or the end (recency bias) of the input context. Performance significantly deteriorates when models need to access information in the middle of the context. This pattern suggests that current long-context language models do not robustly utilize the entirety of their input context.

Extended Context Doesn’t Guarantee Better Performance: Interestingly, extended-context models (those with larger maximum context window sizes) do not consistently outperform their standard counterparts. This suggests that simply increasing the context window size might not be sufficient for improving the effective utilization of long contexts.

  • Decoder-only models are more vulnerable to this phenomenon; encoder-decoder models are more resilient
  • Comparing a base language model (MPT-30B) with its instruction-tuned counterpart (MPT-30B-Instruct), both exhibit the U-shaped curve, although the gap between best- and worst-case positions is slightly reduced
  • Mitigations: organize information before feeding it to the model, preprocessing the input to give it structure and a “roadmap”
    • Effective Reranking: position relevant information closer to the beginning of the input, exploiting the models’ observed primacy bias
    • Ranked-List Truncation: truncate the ranked list of retrieved documents, retrieving fewer documents where appropriate
    • Query-Aware Contextualization: place the query both before and after the data to be processed
  • Impact of Model Scale and Fine-Tuning: analysis of Llama-2 models reveals that the U-shaped performance curve is more pronounced in larger models (13B and 70B) than in smaller models (7B)

Conclusion:

This paper examines how well language models use long input contexts and finds they struggle to access information in the middle of these contexts. This is evident in the consistent U-shaped performance curve observed across tasks, where accuracy is high when relevant information is at the beginning or end (primacy and recency bias) but drops significantly when it’s in the middle.

This has implications for real-world applications like question answering and information retrieval. The paper suggests further research on:

  • Optimizing model architecture for long contexts
  • Refining query-aware contextualization techniques
  • Developing fine-tuning strategies to address positional biases

The research highlights the need for new evaluation methods that assess the robustness of long-context models to variations in the position of relevant information.