Data Source API 定义如何从存储系统进行读写的相关 API 接口,比如 Hadoop 的 InputFormat/OutputFormat,Hive 的 Serde 等。
这些 API 非常适合用户在 Spark 中使用 RDD 编程的时候使用。
使用这些 API 进行编程虽然能够解决我们的问题,但是对用户来说使用成本还是挺高的,而且 Spark 也不能对其进行优化。
为了解决这些问题,Spark 1.3 版本开始引入了 Data Source API V1,通过这个 API 我们可以很方便的读取各种来源的数据,而且 Spark 使用 SQL 组件的一些优化引擎对数据源的读取进行优化,比如列裁剪、过滤下推等等。
SPARK-11215
Multiple columns support added to various Transformers: StringIndexer
SPARK-11150
Implement Dynamic Partition Pruning
SPARK-13677
Support Tree-Based Feature Transformation
SPARK-16692
Add MultilabelClassificationEvaluator
SPARK-19591
Add sample weights to decision trees
SPARK-19712
Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
SPARK-19827
R API for Power Iteration Clustering
SPARK-20286
Improve logic for timing out executors in dynamic allocation
SPARK-20636
Eliminate unnecessary shuffle with adjacent Window expressions
SPARK-22148
Acquire new executors to avoid hang because of blacklisting
SPARK-22796
Multiple columns support added to various Transformers: PySpark QuantileDiscretizer
SPARK-23128
A new approach to do adaptive execution in Spark SQL
SPARK-23155
Apply custom log URL pattern for executor log URLs in SHS
SPARK-23539
Add support for Kafka headers
SPARK-23674
Add Spark ML Listener for Tracking ML Pipeline Status
SPARK-23710
Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
SPARK-24333
Add fit with validation set to Gradient Boosted Trees: Python API
SPARK-24417
Build and Run Spark on JDK11
SPARK-24615
Accelerator-aware task scheduling for Spark
SPARK-24920
Allow sharing Netty's memory pool allocators
SPARK-25250
Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times
SPARK-25341
Support rolling back a shuffle map stage and re-generate the shuffle files
SPARK-25348
Data source for binary files
SPARK-25390
data source V2 API refactoring
SPARK-25501
Add Kafka delegation token support
SPARK-25603
Generalize Nested Column Pruning
SPARK-26132
Remove support for Scala 2.11 in Spark 3.0.0
SPARK-26215
define reserved keywords after SQL standard
SPARK-26412
Allow Pandas UDF to take an iterator of pd.DataFrames
SPARK-26651
Use Proleptic Gregorian calendar
SPARK-26759
Arrow optimization in SparkR's interoperability
SPARK-26848
Introduce new option to Kafka source: offset by timestamp (starting/ending)
SPARK-27064
create StreamingWrite at the beginning of streaming execution
SPARK-27119
Do not infer schema when reading Hive serde table with native data source
SPARK-27225
Implement join strategy hints
SPARK-27240
Use pandas DataFrame for struct type argument in Scalar Pandas UDF
SPARK-27338
Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
SPARK-27396
Public APIs for extended Columnar Processing Support
SPARK-27463
Support Dataframe Cogroup via Pandas UDFs
SPARK-27589
Re-implement file sources with data source V2 API
SPARK-27677
Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation
SPARK-27699
Partially push down disjunctive predicated in Parquet/ORC
SPARK-27763
Port test cases from PostgreSQL to Spark SQL
SPARK-27884
Deprecate Python 2 support
SPARK-27921
Convert applicable *.sql tests into UDF integrated test base
SPARK-27963
Allow dynamic allocation without an external shuffle service
SPARK-28177
Adjust post shuffle partition number in adaptive execution
SPARK-28199
Move Trigger implementations to Triggers.scala and avoid exposing these to the end users