Category Archives: Literature Reviews

Lit Review: Machine Learning Systems Design

Source Material: https://huyenchip.com/machine-learning-systems-design/toc.html

A very helpful course by Chip Huyen (who has a ton of great material – I’m considering buying her textbook on this topic) on framing common data science problems and designing machine learning systems to solve them. Since my notes on this course are too long to reasonably include in a blog post, I’ve attached a Google Doc with my key learnings below.

Notes: ML Systems Design Notes

Lit Review: More Data Usually Beats Better Algorithms

Link: https://anand.typepad.com/datawocky/2008/03/more-data-usual.html

The article discusses a Stanford professor’s experience with two student project teams tackling the Netflix Prize Challenge, a competition to improve Netflix’s movie recommendation algorithm. One team focused on building a sophisticated model, while the other took a simpler approach but incorporated additional data from IMDb (e.g., movie genres). Surprisingly, the second team outperformed the first—highlighting a key lesson in machine learning: adding independent, high-quality data can often yield better results than obsessing over model complexity.

Why did the second team’s simple algorithm beat the sophisticated one? Because the extra IMDb data supplied an independent, high-quality signal that model complexity alone could not recover.

This principle also applies beyond competitions, particularly in industry settings where data is often messy. Google’s PageRank algorithm in 1998 exemplified this concept with two key insights that significantly improved search rankings:

1. Hyperlinks as a measure of popularity (a link acts as a “vote” for a page).

2. Anchor text as an additional ranking signal, similar to a page title.

Even with basic models, these data-driven insights led to major breakthroughs. The key takeaway is that data quality and feature engineering often matter more than marginal model improvements.
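The link-as-vote insight is easy to illustrate with a toy power-iteration PageRank. This is a minimal sketch for intuition only — the graph, damping factor, and function names are illustrative, not Google’s implementation:

```python
def pagerank(links, damping=0.85, iters=50):
    """Toy PageRank: each hyperlink acts as a 'vote' for the page it points to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        # Every page keeps a small baseline rank, then receives "votes".
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            targets = outlinks or pages  # a dangling page spreads its rank evenly
            for t in targets:
                new[t] += damping * rank[page] / len(targets)
        rank = new
    return rank

# Two pages "voting" for the same page make it rank highest.
ranks = pagerank({"a": ["hub"], "b": ["hub"], "hub": ["a"]})
```

Even with no content analysis at all, the page receiving the most links ends up on top — the structure of the data does the work.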

Reflections

This article serves as an important reminder for data scientists—especially in a time when cutting-edge architectures on Hugging Face can be tempting distractions. The often-cited 80/20 rule in data science (where 80% of the work is data collection and preprocessing) is likely an underestimation. While training and evaluating sophisticated models can be done in a few lines of code, the real impact comes from thoughtful data sourcing and preparation.

Before obsessing over model tuning, it’s essential to leverage as much high-quality data as possible and continually revisit this phase throughout the machine learning pipeline. Small but meaningful insights from the data itself—like those in PageRank—can lead to substantial performance gains.

Lit Review: Mind Evolution

Motivation:

  • The challenge is to guide LLMs to engage in deeper thinking that enhances their problem-solving capabilities
  • Existing research exhibits various strategies for leveraging inference-time compute: chain-of-thought prompting, sequential revision based on feedback
  • Deploying a search strategy like Mind Evolution offers an advantage, as it can improve problem-solving by exploring a larger set of candidate solutions

Conceptual Walkthrough: Here’s how Mind Evolution tackles a problem like TravelPlanner

  • Analyzes the Problem: A user query with travel preferences and constraints, plus a set of options for travel, accommodation, food, and attractions
  • Generates Initial Solutions: Mind Evolution creates an initial set of diverse candidate trip plans
  • Evaluates and Refines: Each plan is evaluated by a program that:
    • Checks how well the plan meets the user’s requirements and preferences
    • Gives scores and provides textual feedback on any issues.
  • Simulates a critical conversation: Mind Evolution uses the LLM to act as both a “critic” and an “author.”
    • The critic analyzes the plan, understands the evaluator’s feedback, and suggests ways to fix problems.
    • The author takes the critic’s advice and creates an improved version of the plan.
    • This back-and-forth continues for several rounds.
  • Evolves solutions over generations: Mind Evolution repeats the process of generating, evaluating, and refining solutions over multiple “generations,” like an evolutionary process.
    • Better-scoring plans are more likely to be selected for further refinement.
    • Parts of different plans can be combined (crossover) and small changes can be introduced (mutation) to explore new possibilities.
  • Uses multiple “islands” for diversity: Solutions are evolved on multiple independent “islands” to prevent getting stuck in a rut.
    • The best solutions from each island are shared with others (migration), encouraging exploration of a wider range of ideas.
  • Resets islands for efficiency: Islands with poorly performing solutions are periodically reset.
    • This involves replacing the bad solutions with good ones from the global population.
  • Stops when a solution is found or after a set number of generations.
    • The best plan is then presented as the solution to the TravelPlanner problem.
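The loop above can be sketched as a toy genetic search. Everything here is a stand-in: the numeric “plan”, the evaluator, and the refine step replace the paper’s LLM-driven critic–author conversation and constraint-checking evaluator — but the select/crossover/refine/migrate skeleton mirrors the description:

```python
import random

random.seed(0)
TARGET = 42  # stand-in for "a plan satisfying all constraints"

def evaluate(plan):
    # Stand-in evaluator: in Mind Evolution, a program scores each candidate
    # against the user's constraints and also returns textual feedback.
    return -abs(plan - TARGET)

def refine(plan):
    # Stand-in for the critic/author rounds: nudge the plan and hope it improves.
    return plan + random.choice([-3, -1, 1, 3])

def crossover(a, b):
    # Combine parts of two parent plans.
    return (a + b) // 2

def evolve(n_islands=3, pop=8, generations=30):
    islands = [[random.randint(0, 100) for _ in range(pop)] for _ in range(n_islands)]
    for _ in range(generations):
        for i, island in enumerate(islands):
            island.sort(key=evaluate, reverse=True)  # selection: best plans first
            parents = island[: pop // 2]
            children = [refine(crossover(random.choice(parents), random.choice(parents)))
                        for _ in range(pop - len(parents))]
            islands[i] = parents + children
        # Migration: each island receives the best plan from its neighbour.
        bests = [max(isl, key=evaluate) for isl in islands]
        for i, isl in enumerate(islands):
            isl[-1] = bests[(i + 1) % len(islands)]
    return max((p for isl in islands for p in isl), key=evaluate)

best = evolve()
```

Because the top-scoring parents are always retained and migration only overwrites a child slot, the best solution found so far is never lost between generations.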

This approach is governed by a handful of hyperparameters: the number of candidate solutions per generation, the number of islands, the number of critic–author refinement turns, and the maximum number of generations.

Findings/Results:

  • Mind Evolution excels in solving complex NL planning problems, surpassing baseline methods like Best-of-N and Sequential Revision on benchmarks like TravelPlanner and Natural Plan
  • Eliminates the need for formal solvers or problem translation, which often demand significant domain expertise and effort. It achieves this by leveraging an LLM to generate, recombine, and refine candidate responses based on feedback from an evaluator
  • Emphasizes the significance of combining divergent and convergent thinking in problem-solving. Uses a genetics-based approach that iteratively evolves a population of candidate solutions towards higher quality
  • Success rates increase as the number of candidate solutions increases
  • Mind Evolution’s effectiveness is further validated through a new benchmark, StegPoet, a challenging task involving the encoding of hidden messages in creative writing, demonstrating the approach’s applicability beyond conventionally formalized domains

Lit Review: Lost in the Middle

Motivation:

  • Context lengths of LLMs have been increasing exponentially, encompassing thousands of tokens as input
  • Increasing context length allows for processing of lengthy external documents, enhancing memory for long conversations, search, summarization, etc.
  • How effectively these long-context LLMs actually use information across the full span of the growing context window has not yet been properly explored
  • This research aims to address the problem of whether these models can robustly access and use information located within these extended context windows, specifically when the relevant information is positioned in the middle.
  • Do these models exhibit primacy bias (favoring information at the beginning of the context window) or recency bias (end of context window)?

External Context Review

No external research was needed or conducted to understand this paper.

Conceptual Walkthrough:

Authors devised controlled experiments focusing on two tasks:

Multi-document Question Answering: Models receive multiple documents, only one containing the answer to a given question. Researchers systematically vary the position of the answer-containing document within the input context.

Key-Value Retrieval: A simplified task where models need to extract the value corresponding to a specific key from a JSON object. The position of the key-value pair is manipulated within the input.
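A minimal sketch of how such a key-value retrieval prompt can be constructed — short hex keys stand in for the paper’s UUIDs, and the template wording is illustrative, not the authors’ exact prompt:

```python
import json
import random

random.seed(1)

def make_kv_prompt(n_pairs, gold_position):
    """Build a key-value retrieval prompt with the gold pair at a chosen index."""
    pairs = [(f"key-{i:04x}", f"val-{random.randrange(16**4):04x}")
             for i in range(n_pairs)]
    random.shuffle(pairs)
    gold_key, gold_value = pairs.pop()
    pairs.insert(gold_position, (gold_key, gold_value))  # control the position
    context = json.dumps(dict(pairs), indent=1)
    prompt = ("Extract the value for the specified key from the JSON object below.\n"
              f"{context}\n"
              f'Key: "{gold_key}"\nCorresponding value:')
    return prompt, gold_value

# Gold pair buried in the middle of 50 distractor pairs.
prompt, answer = make_kv_prompt(n_pairs=50, gold_position=25)
```

Sweeping `gold_position` from the start to the end of the object while holding everything else fixed is what isolates the positional effect.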

Key Findings:

The U-shaped Performance Curve: A consistent finding across both tasks is the emergence of a U-shaped performance curve. Model accuracy is generally higher when the relevant information is at the beginning (primacy bias) or the end (recency bias) of the input context. Performance significantly deteriorates when models need to access information in the middle of the context. This pattern suggests that current long-context language models do not robustly utilize the entirety of their input context.

Extended Context Doesn’t Guarantee Better Performance: Interestingly, extended-context models (those with larger maximum context window sizes) do not consistently outperform their standard counterparts. This suggests that simply increasing the context window size might not be sufficient for improving the effective utilization of long contexts.

  • Decoder-only models are more vulnerable to this phenomenon; encoder-decoder models are more resilient
  • Comparing a base language model (MPT-30B) with its instruction-tuned counterpart (MPT-30B-Instruct), both models exhibit the U-shaped curve, although the gap between the best- and worst-case positions is slightly reduced
  • Impact of model scale and fine-tuning: analysis of Llama-2 models reveals that the U-shaped performance curve is more pronounced in the larger models (13B and 70B) than in the smaller 7B model
  • Solutions: organize information before feeding it to the model, preprocessing the input to give it structure and a “roadmap”:
    • Effective reranking: position relevant information closer to the beginning of the input, leveraging the models’ observed primacy bias
    • Ranked-list truncation: truncate the ranked list of retrieved documents, retrieving fewer documents where appropriate
    • Query-aware contextualization: place the query both before and after the data to be processed
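Of these mitigations, query-aware contextualization is the cheapest to apply at prompt-construction time. A minimal sketch — the template wording is illustrative, not the paper’s:

```python
def query_aware_prompt(query, documents):
    """Place the query both before and after the documents, so the model sees it
    near whichever end of the context it attends to most strongly."""
    docs = "\n\n".join(f"Document [{i + 1}]: {d}" for i, d in enumerate(documents))
    return f"Question: {query}\n\n{docs}\n\nQuestion: {query}\nAnswer:"

p = query_aware_prompt("Which team won?", ["The red team lost.", "The blue team won."])
```

The duplication costs only a few tokens per request but hedges against both primacy and recency bias at once.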

Conclusion:

This paper examines how well language models use long input contexts and finds they struggle to access information in the middle of these contexts. This is evident in the consistent U-shaped performance curve observed across tasks, where accuracy is high when relevant information is at the beginning or end (primacy and recency bias) but drops significantly when it’s in the middle.

This has implications for real-world applications like question answering and information retrieval. The paper suggests further research on:

  • Optimizing model architecture for long contexts
  • Refining query-aware contextualization techniques
  • Developing fine-tuning strategies to address positional biases

The research highlights the need for new evaluation methods that assess the robustness of long-context models to variations in the position of relevant information.