Apache Spark is a widely used open-source engine for large-scale data processing. It provides APIs in Python, Scala, Java, and R for batch processing, streaming, machine learning, and graph processing. Created at UC Berkeley's AMPLab and now maintained as an Apache Software Foundation project, it runs in production data infrastructure at many companies.
Key Features
✓ Unified Engine: Batch, streaming, ML, and graph processing in one engine
✓ Spark SQL: SQL interface for structured data
✓ MLlib: Distributed machine learning library
✓ Structured Streaming: Stream processing built on the DataFrame API
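
To make the "unified engine" point concrete, here is a minimal sketch in Python of the DataFrame API and Spark SQL working on the same data in one local session. It assumes PySpark and a Java runtime are installed. The function name top_words, the app name, and the temporary view name words are illustrative choices and do not come from the Spark documentation.

```python
def top_words(lines, n=3):
    """Count words across `lines` and return the n most frequent as (word, count) tuples.

    Hypothetical helper for illustration: it mixes the DataFrame API
    (to tokenize) with Spark SQL (to aggregate) in a single session.
    """
    # Imported lazily so this module still loads where PySpark is absent.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .master("local[1]")          # assumption: run on one local core
        .appName("feature-sketch")   # hypothetical application name
        .getOrCreate()
    )
    try:
        # DataFrame API: one row per input line, then one row per word.
        df = spark.createDataFrame([(line,) for line in lines], ["line"])
        words = df.select(
            F.explode(F.split(F.col("line"), r"\s+")).alias("word")
        )

        # Spark SQL: query the same data through a temporary view.
        words.createOrReplaceTempView("words")
        rows = spark.sql(
            "SELECT word, COUNT(*) AS n FROM words "
            "GROUP BY word ORDER BY n DESC, word ASC "
            "LIMIT {}".format(int(n))
        ).collect()
        return [(row["word"], row["n"]) for row in rows]
    finally:
        # Release the local session's resources.
        spark.stop()
```

Calling top_words(["a b a", "b a"]) would start a local session, split the lines into words with the DataFrame API, and rank them with a SQL query. The same DataFrame could also be passed to MLlib or, with a streaming source, to Structured Streaming, which is what the feature list means by one engine.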