Lessons Learned from Building a Machine Learning Pipeline with Apache Spark: A Practical Review
As machine learning (ML) adoption spreads across industry, engineering teams are often tasked with building production-grade ML pipelines on top of big data systems like Apache Spark. While such architectures promise scalability, they come with their own set of operational challenges. This literature-style review distills experiential insights from a team’s journey building their first ML pipeline, offering a candid retrospective aimed at helping others avoid common pitfalls.
1. Embrace Uncertainty and Design for Iteration
A major theme is the unpredictability of ML infrastructure projects. Initial time estimates proved unreliable as unknown challenges emerged. The team emphasizes the need for iterative, flexible development cycles rather than rigid project plans.
Key takeaway: Estimating timelines for ML infrastructure is difficult; design systems that can adapt and iterate rapidly.
2. Validate and Clean Raw Data Early
Despite having years of historical raw data, the team found significant quality issues caused by logging formats that had drifted over time. These problems surfaced only midway through pipeline development, causing costly setbacks.
Key takeaway: Ensure data cleanliness and consistency upfront. Build tools to validate and preprocess data before training begins.
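As an illustration, a lightweight validation pass can catch schema drift and malformed records before any training work begins. The following is a minimal sketch in Spark's Scala API; the bucket, path, column names (userId, eventTime), and failure threshold are all hypothetical, not details from the team's pipeline.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ValidateRawLogs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("validate-raw-logs")
      .getOrCreate()

    // Hypothetical raw event logs; path and columns are illustrative only.
    val raw = spark.read.json("s3a://example-bucket/raw-logs/2019/*")

    // Records missing required fields hint at inconsistent logging formats.
    val total = raw.count()
    val malformed = raw
      .filter(col("userId").isNull || col("eventTime").isNull)
      .count()

    println(s"$malformed of $total records failed validation")

    // Fail fast if too much of the data is unusable, before training starts.
    require(malformed.toDouble / total < 0.01, "Raw data failed validation threshold")

    spark.stop()
  }
}
```

Running a job like this against the full history before pipeline development starts would surface logging inconsistencies early, rather than midway through the project.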
3. Separate ETL from Model Training
Initial attempts to train models directly on raw, unfiltered data meant every training run repeated the same expensive preprocessing. The team eventually separated data preprocessing from training, saving processed subsets for repeated use (see the sketch below).
Key takeaway: Preprocess once, train many times. Don’t tie heavy ETL steps to model iteration loops.
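One common way to realize this split is to persist the cleaned, feature-ready data as an immutable Parquet snapshot and load it in every training run. A minimal sketch, with illustrative paths and a placeholder standing in for the real cleaning logic:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("etl-then-train").getOrCreate()

// ETL job: clean and featurize the raw logs once (paths are illustrative).
val features = spark.read.json("s3a://example-bucket/raw-logs/")
  .na.drop() // stand-in for the real cleaning and feature logic
features.write.mode("overwrite").parquet("s3a://example-bucket/processed/features-v1/")

// Training jobs: load the same immutable snapshot as many times as needed,
// without re-executing any of the heavy ETL above.
val trainingData = spark.read.parquet("s3a://example-bucket/processed/features-v1/")
```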
4. Provide Scalable, Exploratory Access to Data
Giving team members direct access to raw data in S3 proved impractical at terabyte scale. Instead, the team deployed notebook environments (Jupyter, Zeppelin) backed by Spark clusters for interactive exploration.
Key takeaway: Scalable compute-backed environments are essential for big data exploration. S3 access alone is not sufficient.
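Inside such a notebook, exploration stays interactive because the cluster does the heavy lifting and only a small sample ever reaches the analyst. A hypothetical cell is sketched below; the path and sample fraction are illustrative, and `spark` is the session most notebook environments provide out of the box.

```scala
// The cluster reads from S3; only the sampled rows reach the notebook.
val events = spark.read.parquet("s3a://example-bucket/processed/features-v1/")

// Explore a small random sample instead of pulling terabytes locally.
val sample = events.sample(withReplacement = false, fraction = 0.001, seed = 42)
sample.describe().show()
```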
5. Monitor Cluster Resource Usage to Optimize Costs
The team learned that ETL and ML workloads have different resource profiles, and that sharing one cluster configuration between them is cost-inefficient. Monitoring tools like Ganglia and EC2 metrics were critical to diagnosing CPU, memory, and I/O bottlenecks.
Key takeaway: Use detailed monitoring to match workloads to optimal hardware profiles.
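Once monitoring reveals where each workload is bound, the findings translate directly into per-job executor profiles. The property names below are standard Spark settings, but the values are illustrative assumptions, not measured recommendations:

```scala
import org.apache.spark.sql.SparkSession

// Executor profile for a memory-hungry training job. Derive the numbers
// from your own Ganglia / EC2 metrics rather than copying these.
val spark = SparkSession.builder()
  .appName("model-training")
  .config("spark.executor.memory", "24g") // training caches features in RAM
  .config("spark.executor.cores", "4")    // fewer, fatter executors than ETL
  .getOrCreate()
```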
6. Benchmark Prediction Latency Early
While model accuracy is often prioritized, prediction latency can be a hidden blocker. The team found that Apache Spark's inference latency did not meet their real-time requirements, forcing them to adopt alternatives such as MLeap.
Key takeaway: If latency matters, benchmark early using a simple model and your target deployment framework.
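A crude but effective early check is to time single-record inference with a deliberately trivial model, since that is the access pattern a real-time service faces. A minimal sketch using Spark MLlib's Scala API; the toy data and local master are assumptions made so the example is self-contained:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("latency-probe").master("local[*]").getOrCreate()

// Fit a deliberately trivial model; the goal is to measure serving
// latency, not accuracy.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0))
)).toDF("label", "features")
val model = new LogisticRegression().fit(training)

// Time single-row inference, the pattern a real-time service would use.
val oneRow = spark.createDataFrame(Seq(Tuple1(Vectors.dense(0.5, 1.0))))
  .toDF("features")
val start = System.nanoTime()
model.transform(oneRow).collect()
println(f"single-row inference took ${(System.nanoTime() - start) / 1e6}%.1f ms")
```

If even this toy setup misses the latency budget, no amount of model tuning will fix it, and it is time to evaluate serving layers such as MLeap.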
7. Recognize That S3 Is Not a File System
Despite appearances, S3 is an object store, not a file system: there are no true directories, and a "rename" is actually a copy followed by a delete. Spark's default commit behavior, which writes output to temporary files and renames them into place, caused slowdowns that were mitigated by reconfiguring how Spark writes its output.
Key takeaway: S3’s limitations affect pipeline performance. Tune Spark settings accordingly.
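The article does not spell out which settings the team changed, but a standard mitigation is to switch Hadoop's FileOutputCommitter to algorithm version 2, which skips the job-level rename pass:

```scala
// Assumes an existing SparkSession named `spark`. FileOutputCommitter v2
// commits each task's output directly to its final location, avoiding the
// job-level rename pass; on S3 a "rename" is really a copy plus a delete,
// so the default behavior effectively rewrites the entire output.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")
```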
8. Beware of API Friction in Non-Scala Languages
Spark is built in Scala, and while it supports Python and Java, most examples and community help are Scala-focused. The team's Java stack made translating community solutions into their codebase an ongoing source of friction.
Key takeaway: Consider choosing Scala or Python if using Spark MLlib, or prepare for added complexity when using Java.
9. Prioritize Cross-Team Knowledge Sharing
Many ML terms are unfamiliar to non-technical stakeholders. The team stresses the importance of demystifying ML for engineers, business leads, and operations, using simple language and visual examples.
Key takeaway: ML success depends on shared understanding. Simplify terminology and share knowledge across teams.
10. Version Your Data and Build Configurable Interfaces
Data versioning allowed the team to test models across different datasets without redeploying code. They also built a UI for model tuning and dataset selection, reducing engineering overhead.
Key takeaway: Build internal tooling for data versioning and training configuration to support experimentation without redeployments.
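A minimal version of this idea treats each dataset snapshot as an immutable, versioned path and makes the version and hyperparameters runtime configuration rather than code. The layout, environment variables, and column assumptions below are hypothetical:

```scala
import org.apache.spark.ml.classification.LogisticRegression

// Assumes an existing SparkSession named `spark`. Hypothetical convention:
// immutable snapshots live under versioned prefixes, and the version plus
// hyperparameters arrive as configuration, so switching datasets or tuning
// a model never requires a code redeploy.
val datasetVersion = sys.env.getOrElse("DATASET_VERSION", "v1")
val regParam = sys.env.getOrElse("REG_PARAM", "0.01").toDouble

// Assumes the snapshot already contains "label" and "features" columns.
val data = spark.read.parquet(s"s3a://example-bucket/processed/features-$datasetVersion/")
val model = new LogisticRegression().setRegParam(regParam).fit(data)
```

A UI such as the one the team built is essentially a front end over exactly this kind of configuration surface.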
Conclusion
This retrospective offers valuable operational lessons for teams building ML pipelines on Spark or similar big data platforms. While many of the pitfalls are technical, just as many are organizational: clear communication, shared tooling, and iterative thinking mattered as much as technical choices in the team's eventual success.