Lit Review: Things I wish we had known before we started our first Machine Learning project

Source: https://medium.com/infinity-aka-aseem/things-we-wish-we-had-known-before-we-started-our-first-machine-learning-project-336d1d6f2184

Lessons Learned from Building a Machine Learning Pipeline with Apache Spark: A Practical Review

As machine learning (ML) adoption spreads across industry, engineering teams are often tasked with building production-grade ML pipelines on top of big data systems like Apache Spark. While such architectures promise scalability, they come with their own set of operational challenges. This literature-style review distills experiential insights from a team’s journey building their first ML pipeline, offering a candid retrospective aimed at helping others avoid common pitfalls.

1. Embrace Uncertainty and Design for Iteration

A major theme is the unpredictability of ML infrastructure projects. Initial time estimates proved unreliable as unknown challenges emerged. The team emphasizes the need for iterative, flexible development cycles rather than rigid project plans.

Key takeaway: Estimating timelines for ML infrastructure is difficult; design systems that can adapt and iterate rapidly.


2. Validate and Clean Raw Data Early

Despite having years of historical raw data, the team discovered significant issues caused by inconsistent logging formats. These problems surfaced only midway through pipeline development, causing setbacks.

Key takeaway: Ensure data cleanliness and consistency upfront. Build tools to validate and preprocess data before training begins.
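
A minimal sketch of the idea, assuming PySpark with made-up column names and paths (the article doesn’t show the team’s tooling): fail fast on schema drift and missing fields before any training job runs.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("raw-data-validation").getOrCreate()

# Hypothetical expected schema for the raw event logs.
expected_schema = StructType([
    StructField("user_id", StringType(), nullable=False),
    StructField("event_type", StringType(), nullable=False),
    StructField("event_time", TimestampType(), nullable=False),
])

# Reading with an explicit schema turns malformed or differently-logged records into nulls,
# so a simple null check surfaces inconsistent logging before it derails the pipeline.
raw = spark.read.schema(expected_schema).json("s3a://my-bucket/raw-events/")  # hypothetical path
bad = raw.filter(
    F.col("user_id").isNull() | F.col("event_type").isNull() | F.col("event_time").isNull()
)

bad_count = bad.count()
if bad_count > 0:
    raise ValueError(f"{bad_count} rows failed validation; fix logging before training")
```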


3. Separate ETL from Model Training

Initial attempts to train models on raw, unfiltered data introduced inefficiencies. The team eventually separated data preprocessing from training, saving processed subsets for repeated use.

Key takeaway: Preprocess once, train many times. Don’t tie heavy ETL steps to model iteration loops.
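
A rough sketch of the “preprocess once, train many times” split, again assuming PySpark and hypothetical paths: the heavy ETL job writes a cleaned Parquet snapshot once, and every training iteration reads that snapshot instead of re-filtering raw data.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-then-train").getOrCreate()

# --- ETL job: run once (or on a schedule), never inside the model-iteration loop ---
raw = spark.read.json("s3a://my-bucket/raw-events/")              # hypothetical input
features = (
    raw.filter(F.col("event_type") == "purchase")
       .groupBy("user_id")
       .agg(F.count("*").alias("purchase_count"))
)
features.write.mode("overwrite").parquet("s3a://my-bucket/features/v1/")  # hypothetical output

# --- Training job: iterate on models against the cheap, already-processed subset ---
train_df = spark.read.parquet("s3a://my-bucket/features/v1/")
```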


4. Provide Scalable, Exploratory Access to Data

Giving team members raw access to data in S3 proved impractical due to the scale (TBs). Instead, the team deployed notebook environments (Jupyter, Zeppelin) backed by Spark clusters for interactive exploration.

Key takeaway: Scalable compute-backed environments are essential for big data exploration. S3 access alone is not sufficient.


5. Monitor Cluster Resource Usage to Optimize Costs

The team learned that ETL and ML workloads require different resource profiles, leading to cost inefficiencies when sharing compute infrastructure. Monitoring tools like Ganglia and EC2 metrics were critical to diagnosing CPU, memory, and I/O bottlenecks.

Key takeaway: Use detailed monitoring to match workloads to optimal hardware profiles.


6. Benchmark Prediction Latency Early

While model accuracy is often prioritized, prediction latency can be a hidden blocker. The team found that Apache Spark’s inference latency didn’t meet their real-time requirements and had to use alternatives like MLeap.

Key takeaway: If latency matters, benchmark early using a simple model and your target deployment framework.
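
A minimal sketch of what “benchmark early” can look like with a toy Spark ML model (illustrative only; the article doesn’t show the team’s benchmark): time single-record predictions the way an online service would issue them, before investing in accuracy.

```python
import time
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("latency-benchmark").getOrCreate()

# Tiny throwaway model; accuracy is irrelevant here, only the serving path matters.
train = spark.createDataFrame([(0.0, 1.0, 0), (1.0, 0.0, 1), (0.5, 0.5, 1)], ["f1", "f2", "label"])
model = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
]).fit(train)

# Simulate real-time requests: one record per call, as an online endpoint would see them.
single = spark.createDataFrame([(0.3, 0.7)], ["f1", "f2"])
start = time.time()
for _ in range(20):
    model.transform(single).collect()
print(f"avg per-request latency: {(time.time() - start) / 20:.3f}s")
```

Even a trivial model exposes the framework’s per-request overhead, which is the number that pushed the team toward alternatives like MLeap.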


7. Recognize That S3 Is Not a File System

Despite appearances, S3 is an object store, not a file system. Spark’s default output commit protocol writes to temporary files and then renames them into place, but a “rename” on S3 is really a copy followed by a delete, which caused slowdowns that were mitigated by reconfiguring Spark’s write behavior.

Key takeaway: S3’s limitations affect pipeline performance. Tune Spark settings accordingly.
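
One commonly cited mitigation, sketched here with PySpark (the article doesn’t say exactly which settings the team changed): the default v1 output committer commits work through two rename passes, and renames on S3 are copies, so switching to the v2 commit algorithm removes the job-level rename pass.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("s3-friendly-writes")
    # v2 commit algorithm: tasks write straight into the final location,
    # avoiding the job-level rename pass of the default v1 committer.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)

df = spark.range(1000).withColumnRenamed("id", "value")
df.write.mode("overwrite").parquet("s3a://my-bucket/output/")  # hypothetical destination
```

Newer Hadoop releases (3.1+) also ship dedicated S3A committers that avoid renames entirely, which is worth evaluating if the commit phase still dominates write times.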


8. Beware of API Friction in Non-Scala Languages

Spark is built in Scala, and while it supports Python and Java, most examples and community help are Scala-focused. The team’s Java stack introduced additional difficulty translating solutions.

Key takeaway: Consider choosing Scala or Python if using Spark MLlib, or prepare for added complexity when using Java.


9. Prioritize Cross-Team Knowledge Sharing

Many ML terms are unfamiliar to non-technical stakeholders. The team stresses the importance of demystifying ML for engineers, business leads, and operations, using simple language and visual examples.

Key takeaway: ML success depends on shared understanding. Simplify terminology and share knowledge across teams.


10. Version Your Data and Build Configurable Interfaces

Data versioning allowed the team to test models across different datasets without redeploying code. They also built a UI for model tuning and dataset selection, reducing engineering overhead.

Key takeaway: Build internal tooling for data versioning and training configuration to support experimentation without redeployments.
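
A minimal sketch of the pattern (paths, parameters, and the config source are all hypothetical; the article only describes the idea): the training job reads a versioned dataset path and its hyperparameters from configuration, so swapping dataset versions or tuning a model is a config change rather than a redeploy.

```python
import json
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

# In practice this would come from the team's tuning UI or a config store; hardcoded here.
config = json.loads("""
{
  "dataset_version": "v3",
  "model_params": {"regParam": 0.1, "maxIter": 50}
}
""")

spark = SparkSession.builder.appName("config-driven-training").getOrCreate()

# Datasets are written under versioned prefixes, so older versions stay reproducible.
train_df = spark.read.parquet(f"s3a://my-bucket/features/{config['dataset_version']}/")

lr = LogisticRegression(featuresCol="features", labelCol="label", **config["model_params"])
model = lr.fit(train_df)
```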


Conclusion

This retrospective offers valuable operational lessons for teams building ML pipelines on Spark or similar big data platforms. Many of the pitfalls are technical, but just as many are organizational: clear communication, shared tooling, and iterative thinking mattered as much as any technical choice in the team’s eventual success.

Lit Review: Machine Learning Systems Design

Source Material: https://huyenchip.com/machine-learning-systems-design/toc.html

A very helpful course by Chip Huyen (who has a ton of great material – I’m considering buying her textbook on this topic) on framing common data science problems and designing machine learning systems to solve them. Since my notes on this course are too long to reasonably be included as a blog post, I’ve attached a Google Doc with my key learnings below.

Notes: ML Systems Design Notes

Lit Review: More Data Usually Beats Better Algorithms

Link: https://anand.typepad.com/datawocky/2008/03/more-data-usual.html

The article discusses a Stanford professor’s experience with two student project teams tackling the Netflix Prize Challenge, a competition to improve Netflix’s movie recommendation algorithm. One team focused on building a sophisticated model, while the other took a simpler approach but incorporated additional data from IMDb (e.g., movie genres). Surprisingly, the second team outperformed the first—highlighting a key lesson in machine learning: adding independent, high-quality data can often yield better results than obsessing over model complexity.

Why did the second team’s simple algorithm outperform the sophisticated one? Because the extra IMDb features gave it independent signal that the more sophisticated model, trained only on the ratings data, never saw.

This principle also applies beyond competitions, particularly in industry settings where data is often messy. Google’s PageRank algorithm in 1998 exemplified this concept with two key insights that significantly improved search rankings:

1. Hyperlinks as a measure of popularity (a link acts as a “vote” for a page).

2. Anchor text as an additional ranking signal, similar to a page title.
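
The article doesn’t spell out the math, but the standard PageRank recurrence makes the “link as a vote” idea concrete:

PR(p) = (1 − d) / N + d · Σ_{q → p} PR(q) / L(q)

where d is a damping factor (commonly 0.85), N is the total number of pages, and L(q) is the number of outbound links on page q: each page splits its own score evenly among the pages it votes for.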

Even with basic models, these data-driven insights led to major breakthroughs. The key takeaway is that data quality and feature engineering often matter more than marginal model improvements.

Reflections

This article serves as an important reminder for data scientists—especially in a time when cutting-edge architectures on Hugging Face can be tempting distractions. The often-cited 80/20 rule in data science (where 80% of the work is data collection and preprocessing) is likely an underestimation. While training and evaluating sophisticated models can be done in a few lines of code, the real impact comes from thoughtful data sourcing and preparation.

Before obsessing over model tuning, it’s essential to leverage as much high-quality data as possible and continually revisit this phase throughout the machine learning pipeline. Small but meaningful insights from the data itself—like those in PageRank—can lead to substantial performance gains.

Book Review: American Prometheus

Start Date: January 20, 2025
End Date: January 31, 2025


A truly remarkable biography. Though I haven’t read other accounts of Oppenheimer’s life, it’s hard to imagine one as comprehensive, balanced, or masterfully written as this (Pulitzer…). Oppenheimer was a man who left an indelible impression on everyone he met. His intellect, charisma, and eclectic interests captivated many, yet his flaws and controversial decisions left others conflicted. The authors portray both his genius and the immense pressures he faced—at Los Alamos and during the later years of his life, which seemed marked by a form of self-imposed martyrdom. This book captures the full scope of his complex life and legacy, painting a vivid portrait of a man who shaped history and bore the weight of his impact on the world.

Book Review: The Midnight Library

Start Date: January 19, 2025
End Date: January 20, 2025


A heartfelt exploration of a heavy subject. Before starting, I’d often heard this book described as “wholesome,” and after the initial chapters, I wondered if that label was misplaced. However, as the story unfolds, it candidly and even endearingly examines the weight of regret and the possibility of finding peace with one’s choices. The message is clear: this is the only life we have, and though it may not match our younger dreams or the expectations of others, it is uniquely ours. Any alternate life would simply not “fit.” A light yet thought-provoking read that balances optimism and honesty.

Book Review: Moby Dick

Start Date: January 1, 2025
End Date: January 19, 2025


A challenging yet rewarding read. Many chapters left me lost in the maze of old English, biblical allusions, and the dense terminology of whaling. Yet, amidst those moments of disorientation, I found myself captivated by Ishmael’s narrative of life aboard the whale ship. Melville’s prose, though difficult to follow at times, amplifies the mystical and almost religious undertones that permeate the story. The sheer breadth of Melville’s knowledge—spanning language, religion, philosophy, and the esoteric intricacies of whaling—makes this novel a monumental work of literature. While its complexity may deter the casual reader, it lends an unmistakable weight to the story’s themes and ideas.

Lit Review: Mind Evolution

Motivation:

  • It is challenging to guide LLMs to engage in deeper “thinking” that enhances their problem-solving capabilities
  • Existing research offers various strategies for leveraging inference-time compute: chain-of-thought prompting, sequential revision based on feedback
  • Deploying a search strategy like Mind Evolution offers an advantage because it can improve problem-solving ability by exploring a larger set of candidate solutions

Conceptual Walkthrough: Here’s how Mind Evolution tackles a problem like TravelPlanner (a minimal sketch of the full loop follows this walkthrough)

  • Analyzes Problem: User query with travel preferences and constraints, and a set of options for travel, accommodation, food, attractions
  • Generates Initial Solutions: Mind Evolution creates an initial set of diverse candidate trip plans
  • Evaluates and refines: Each plan is evaluated by a program that:
    • Checks how well the plan meets the user’s requirements and preferences
    • Gives scores and provides textual feedback on any issues.
  • Simulates a critical conversation: Mind Evolution uses the LLM to act as both a “critic” and an “author.”
    • The critic analyzes the plan, understands the evaluator’s feedback, and suggests ways to fix problems.
    • The author takes the critic’s advice and creates an improved version of the plan.
    • This back-and-forth continues for several rounds.
  • Evolves solutions over generations: Mind Evolution repeats the process of generating, evaluating, and refining solutions over multiple “generations,” like an evolutionary process.
    • Better-scoring plans are more likely to be selected for further refinement.
    • Parts of different plans can be combined (crossover) and small changes can be introduced (mutation) to explore new possibilities.
  • Uses multiple “islands” for diversity: Solutions are evolved on multiple independent “islands” to prevent getting stuck in a rut.
    • The best solutions from each island are shared with others (migration), encouraging exploration of a wider range of ideas.
  • Resets islands for efficiency: Islands with poorly performing solutions are periodically reset.
    • This involves replacing the bad solutions with good ones from the global population.
  • Stops when a solution is found or after a set number of generations.
    • The best plan is then presented as the solution to the TravelPlanner problem.

This approach is governed by a handful of hyperparameters: the number of candidate solutions per generation, the number of critic-author refinement turns, the number of islands, the migration and reset schedules, and the maximum number of generations.
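
The sketch below is my own paper-inspired pseudocode of that loop, not the authors’ implementation; llm_propose, llm_refine, and evaluate are hypothetical placeholders for the LLM calls and the programmatic evaluator, and crossover/mutation are folded into the refinement step for brevity.

```python
def llm_propose(task):
    """Ask the LLM for one fresh candidate plan for the task."""
    raise NotImplementedError  # placeholder for an LLM call

def llm_refine(task, plan, feedback):
    """Critic/author rounds: critique `plan` using evaluator feedback, then rewrite it."""
    raise NotImplementedError  # placeholder for an LLM call

def evaluate(task, plan):
    """Programmatic evaluator: returns (score, textual_feedback).
    Assumed convention: score 1.0 means every constraint is satisfied."""
    raise NotImplementedError  # placeholder for the evaluator program

def mind_evolution(task, n_islands=4, pop_size=8, n_generations=10, reset_every=3):
    # Independent island populations of candidate plans.
    islands = [[llm_propose(task) for _ in range(pop_size)] for _ in range(n_islands)]
    best, best_score = None, float("-inf")

    for gen in range(n_generations):
        for i, island in enumerate(islands):
            scored = sorted(((evaluate(task, p), p) for p in island),
                            key=lambda sp: sp[0][0], reverse=True)
            (top_score, _), top_plan = scored[0]
            if top_score > best_score:
                best, best_score = top_plan, top_score
            if top_score >= 1.0:            # all constraints satisfied: stop early
                return top_plan
            # Selection + refinement: keep the better-scoring half and refine each
            # survivor through critic <-> author rounds driven by evaluator feedback.
            survivors = scored[: pop_size // 2]
            islands[i] = [plan for _, plan in survivors] + [
                llm_refine(task, plan, feedback) for (score, feedback), plan in survivors
            ]

        # Migration: each island passes its current best plan to its neighbour.
        leaders = [max(isl, key=lambda p: evaluate(task, p)[0]) for isl in islands]
        for i in range(n_islands):
            islands[(i + 1) % n_islands].append(leaders[i])

        # Periodic reset: repopulate the weakest island from the global leaders.
        if (gen + 1) % reset_every == 0:
            worst = min(range(n_islands),
                        key=lambda j: max(evaluate(task, p)[0] for p in islands[j]))
            islands[worst] = list(leaders)

    return best  # budget exhausted: return the best plan found anywhere
```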

Findings/Results:

  • Mind Evolution excels in solving complex NL planning problems, surpassing baseline methods like Best-of-N and Sequential Revision on benchmarks like TravelPlanner and Natural Plan
  • Eliminates need for formal solvers or problem translation, which often demand significant domain expertise and effort. It achieves this by leveraging an LLM to generate, recombine, and refine candidate responses based on feedback from an evaluator
  • Emphasizes the significance of combining divergent and convergent thinking in problem-solving. Uses a genetics-based approach that iteratively evolves a population of candidate solutions towards higher quality
  • Success rates increase as the number of candidate solutions grows
  • Mind Evolution’s effectiveness is further validated through a new benchmark, StegPoet, a challenging task involving the encoding of hidden messages in creative writing, demonstrating the approach’s applicability beyond conventionally formalized domains

Lit Review: Lost in the Middle

Motivation:

  • Context lengths of LLMs have been increasing exponentially, encompassing thousands of tokens as input
  • Increasing context length allows for processing of lengthy external documents, enhancing memory for long conversations, search, summarization, etc.
  • How effectively these long-context LLMs use information across the full span of that growing window has not been properly explored
  • This research asks whether models can robustly access and use information located within these extended context windows, specifically when the relevant information is positioned in the middle.
  • Do these models exhibit primacy bias (favoring information at the beginning of the context window) or recency bias (end of context window)?

External Context Review

No external research was needed or conducted to understand this paper

Conceptual Walkthrough:

Authors devised controlled experiments focusing on two tasks:

Multi-document Question Answering: Models receive multiple documents, only one containing the answer to a given question. Researchers systematically vary the position of the answer-containing document within the input context.

Key-Value Retrieval: A simplified task where models need to extract the value corresponding to a specific key from a JSON object. The position of the key-value pair is manipulated within the input.
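
An illustrative sketch of how such a key-value prompt can be constructed (the paper’s exact generator may differ): build a JSON object of random UUID pairs and place the queried key at a chosen depth to probe positional robustness.

```python
import json
import uuid

def make_kv_prompt(n_pairs=75, gold_position=37):
    keys = [str(uuid.uuid4()) for _ in range(n_pairs)]
    values = [str(uuid.uuid4()) for _ in range(n_pairs)]
    gold_key, gold_value = keys[gold_position], values[gold_position]
    kv_json = json.dumps(dict(zip(keys, values)), indent=1)   # insertion order is preserved
    prompt = (
        "Extract the value corresponding to the specified key in the JSON object below.\n\n"
        f"JSON data:\n{kv_json}\n\n"
        f'Key: "{gold_key}"\nCorresponding value:'
    )
    return prompt, gold_value

# Place the answer in the middle of the context to reproduce the worst-case position.
prompt, expected = make_kv_prompt(n_pairs=75, gold_position=37)
```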

Key Findings:

The U-shaped Performance Curve: A consistent finding across both tasks is the emergence of a U-shaped performance curve. Model accuracy is generally higher when the relevant information is at the beginning (primacy bias) or the end (recency bias) of the input context. Performance significantly deteriorates when models need to access information in the middle of the context. This pattern suggests that current long-context language models do not robustly utilize the entirety of their input context.

Extended Context Doesn’t Guarantee Better Performance: Interestingly, extended-context models (those with larger maximum context window sizes) do not consistently outperform their standard counterparts. This suggests that simply increasing the context window size might not be sufficient for improving the effective utilization of long contexts.

  • Decoder-only models more vulnerable to this phenomenon, encoder-decoder models more resilient
  • Comparing a base language model (MPT-30B) with its instruction-tuned counterpart (MPT-30B-instruct), both models exhibit the U-shaped curve, although the performance disparity between best and worst-case scenarios is slightly reduced
  • Solutions: organize information before feeding it to the model, preprocessing the input to give it structure and a “roadmap”:
    • Effective reranking: position the most relevant information closer to the beginning of the input, leveraging the observed primacy bias in the models
    • Ranked-list truncation: truncate the ranked list of retrieved documents, retrieving fewer documents where appropriate
    • Query-aware contextualization: place the query both before and after the data to be processed (see the prompt sketch after this list)
  • Impact of model scale and fine-tuning: analysis of Llama-2 models reveals that the U-shaped performance curve is more pronounced in the larger models (13B and 70B) than in the smaller 7B model
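
A minimal sketch of query-aware contextualization (the prompt wording is mine, not the paper’s): repeat the question both before and after the documents so a decoder-only model has already seen the query before it reads the data.

```python
def build_prompt(question: str, documents: list[str]) -> str:
    docs_block = "\n\n".join(
        f"Document [{i + 1}]: {doc}" for i, doc in enumerate(documents)
    )
    return (
        f"Question: {question}\n\n"        # query before the data
        f"{docs_block}\n\n"
        f"Question: {question}\nAnswer:"   # query repeated after the data
    )

print(build_prompt("Who wrote Moby-Dick?", [
    "Herman Melville wrote Moby-Dick, published in 1851.",
    "Jane Austen wrote Pride and Prejudice.",
]))
```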

Conclusion:

This paper examines how well language models use long input contexts and finds they struggle to access information in the middle of these contexts. This is evident in the consistent U-shaped performance curve observed across tasks, where accuracy is high when relevant information is at the beginning or end (primacy and recency bias) but drops significantly when it’s in the middle.

This has implications for real-world applications like question answering and information retrieval. The paper suggests further research on:

  • Optimizing model architecture for long contexts
  • Refining query-aware contextualization techniques
  • Developing fine-tuning strategies to address positional biases

The research highlights the need for new evaluation methods that assess the robustness of long-context models to variations in the position of relevant information.