public class SchemaRDD extends RDD<org.apache.spark.sql.catalyst.expressions.Row>
An RDD of Row objects that has an associated schema. In addition to standard RDD functions,
SchemaRDDs can be used in relational queries, as shown in the examples below.
Importing a SQLContext brings an implicit into scope that automatically converts a standard RDD whose elements are Scala case classes into a SchemaRDD. This conversion can also be done explicitly using the createSchemaRDD function on a SQLContext.
A SchemaRDD can also be created by loading data in from external sources. Examples include loading data from Parquet files using the parquetFile method on SQLContext, and loading JSON datasets using the jsonFile and jsonRDD methods on SQLContext.
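For example, a minimal sketch of loading data from external sources (the file paths below are placeholders):
val sc: SparkContext // An existing SparkContext.
val sqlContext = new SQLContext(sc)
// Load a SchemaRDD from a Parquet file; the schema is read from the file itself.
val parquetData = sqlContext.parquetFile("path/to/data.parquet")
// Load SchemaRDDs from JSON, either from a file or from an existing RDD[String].
val jsonData = sqlContext.jsonFile("path/to/data.json")
val jsonFromStrings = sqlContext.jsonRDD(sc.parallelize("""{"name":"Bob","age":30}""" :: Nil))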
== SQL Queries ==
A SchemaRDD can be registered as a table in the SQLContext
that was used to create it. Once
an RDD has been registered as a table, it can be used in the FROM clause of SQL statements.
// One method for defining the schema of an RDD is to make a case class with the desired column
// names and types.
case class Record(key: Int, value: String)
val sc: SparkContext // An existing SparkContext.
val sqlContext = new SQLContext(sc)
// Importing the SQL context gives access to all the SQL functions and implicit conversions.
import sqlContext._
val rdd = sc.parallelize((1 to 100).map(i => Record(i, s"val_$i")))
// Any RDD containing case classes can be registered as a table. The schema of the table is
// automatically inferred using Scala reflection.
rdd.registerAsTable("records")
val results: SchemaRDD = sql("SELECT * FROM records")
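Because the result of sql is itself a SchemaRDD (and therefore an RDD of Row objects), ordinary RDD operations can be applied to it. A small sketch, reading the columns of each Row by ordinal:
// Row values can be read positionally; column 0 is key and column 1 is value here.
results.map(row => "Key: " + row(0) + ", Value: " + row(1)).collect().foreach(println)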
== Language Integrated Queries ==
case class Record(key: Int, value: String)
val sc: SparkContext // An existing SparkContext.
val sqlContext = new SQLContext(sc)
// Importing the SQL context gives access to all the SQL functions and implicit conversions.
import sqlContext._
val rdd = sc.parallelize((1 to 100).map(i => Record(i, "val_" + i)))
// Example of language integrated queries.
rdd.where('key === 1).orderBy('value.asc).select('key).collect()
| Constructor and Description |
|---|
| SchemaRDD(SQLContext sqlContext, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan baseLogicalPlan) |
| Modifier and Type | Method and Description |
|---|---|
| SchemaRDD | aggregate(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> aggregateExprs) Performs an aggregation over all Rows in this RDD. |
| SchemaRDD | as(scala.Symbol alias) Applies a qualifier to the attributes of this relation. |
| org.apache.spark.sql.catalyst.plans.logical.LogicalPlan | baseLogicalPlan() |
| SchemaRDD | baseSchemaRDD() |
| SchemaRDD | coalesce(int numPartitions, boolean shuffle, scala.math.Ordering<org.apache.spark.sql.catalyst.expressions.Row> ord) Return a new RDD that is reduced into numPartitions partitions. |
| org.apache.spark.sql.catalyst.expressions.Row[] | collect() Return an array that contains all of the elements in this RDD. |
| scala.collection.Iterator<org.apache.spark.sql.catalyst.expressions.Row> | compute(Partition split, TaskContext context) :: DeveloperApi :: Implemented by subclasses to compute a given partition. |
| long | count() :: Experimental :: Return the number of elements in the RDD. |
| SchemaRDD | distinct() Return a new RDD containing the distinct elements in this RDD. |
| SchemaRDD | distinct(int numPartitions, scala.math.Ordering<org.apache.spark.sql.catalyst.expressions.Row> ord) Return a new RDD containing the distinct elements in this RDD. |
| SchemaRDD | filter(scala.Function1<org.apache.spark.sql.catalyst.expressions.Row,Object> f) Return a new RDD containing only the elements that satisfy a predicate. |
| SchemaRDD | generate(org.apache.spark.sql.catalyst.expressions.Generator generator, boolean join, boolean outer, scala.Option<String> alias) :: Experimental :: Applies the given Generator, or table generating function, to this relation. |
| Partition[] | getPartitions() Implemented by subclasses to return the set of partitions in this RDD. |
| SchemaRDD | groupBy(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> groupingExprs, scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> aggregateExprs) Performs a grouping followed by an aggregation. |
| void | insertInto(String tableName) :: Experimental :: Appends the rows from this RDD to the specified table. |
| void | insertInto(String tableName, boolean overwrite) :: Experimental :: Adds the rows from this RDD to the specified table, optionally overwriting the existing data. |
| SchemaRDD | intersection(RDD<org.apache.spark.sql.catalyst.expressions.Row> other) Return the intersection of this RDD and another one. |
| SchemaRDD | intersection(RDD<org.apache.spark.sql.catalyst.expressions.Row> other, int numPartitions) Return the intersection of this RDD and another one. |
| SchemaRDD | intersection(RDD<org.apache.spark.sql.catalyst.expressions.Row> other, Partitioner partitioner, scala.math.Ordering<org.apache.spark.sql.catalyst.expressions.Row> ord) Return the intersection of this RDD and another one. |
| SchemaRDD | join(SchemaRDD otherPlan, org.apache.spark.sql.catalyst.plans.JoinType joinType, scala.Option<org.apache.spark.sql.catalyst.expressions.Expression> on) Performs a relational join on two SchemaRDDs. |
| SchemaRDD | limit(org.apache.spark.sql.catalyst.expressions.Expression limitExpr) |
| SchemaRDD | limit(int limitNum) Limits the results by the given integer. |
| org.apache.spark.sql.catalyst.plans.logical.LogicalPlan | logicalPlan() |
| SchemaRDD | orderBy(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.SortOrder> sortExprs) Sorts the results by the given expressions. |
| void | printSchema() Prints out the schema in the tree format. |
| org.apache.spark.sql.SQLContext.QueryExecution | queryExecution() :: DeveloperApi :: A lazily computed query execution workflow. |
| void | registerAsTable(String tableName) Registers this RDD as a temporary table using the given name. |
| SchemaRDD | repartition(int numPartitions, scala.math.Ordering<org.apache.spark.sql.catalyst.expressions.Row> ord) Return a new RDD that has exactly numPartitions partitions. |
| SchemaRDD | sample(boolean withReplacement, double fraction, long seed) :: Experimental :: Returns a sampled version of the underlying dataset. |
| void | saveAsParquetFile(String path) Saves the contents of this SchemaRDD as a Parquet file, preserving the schema. |
| void | saveAsTable(String tableName) :: Experimental :: Creates a table from the contents of this SchemaRDD. |
| String | schemaString() Returns the output schema in the tree format. |
| SchemaRDD | select(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> exprs) Changes the output of this relation to the given expressions, similar to the SELECT clause in SQL. |
| SQLContext | sqlContext() |
| SchemaRDD | subtract(RDD<org.apache.spark.sql.catalyst.expressions.Row> other) Return an RDD with the elements from this that are not in other. |
| SchemaRDD | subtract(RDD<org.apache.spark.sql.catalyst.expressions.Row> other, int numPartitions) Return an RDD with the elements from this that are not in other. |
| SchemaRDD | subtract(RDD<org.apache.spark.sql.catalyst.expressions.Row> other, Partitioner p, scala.math.Ordering<org.apache.spark.sql.catalyst.expressions.Row> ord) Return an RDD with the elements from this that are not in other. |
| org.apache.spark.sql.catalyst.expressions.Row[] | take(int num) Take the first num elements of the RDD. |
| JavaSchemaRDD | toJavaSchemaRDD() Returns this RDD as a JavaSchemaRDD. |
| SchemaRDD | toSchemaRDD() Returns this RDD as a SchemaRDD. |
| String | toString() |
| SchemaRDD | unionAll(SchemaRDD otherPlan) Combines the tuples of two RDDs with the same schema, keeping duplicates. |
| SchemaRDD | where(org.apache.spark.sql.catalyst.expressions.Expression condition) Filters the output, only returning those rows where condition evaluates to true. |
| SchemaRDD | where(scala.Function1<org.apache.spark.sql.catalyst.expressions.DynamicRow,Object> dynamicUdf) :: Experimental :: Filters tuples using a function over a Dynamic version of a given Row. |
| <T1> SchemaRDD | where(scala.Symbol arg1, scala.Function1<T1,Object> udf) Filters tuples using a function over the value of the specified column. |
Methods inherited from class org.apache.spark.rdd.RDD: aggregate, cache, cartesian, checkpoint, checkpointData, collect, context, countApprox, countApproxDistinct, countByValue, countByValueApprox, creationSiteInfo, dependencies, filterWith, first, flatMap, flatMapWith, fold, foreach, foreachPartition, foreachWith, getCheckpointFile, getStorageLevel, glom, groupBy, groupBy, groupBy, id, isCheckpointed, iterator, keyBy, map, mapPartitions, mapPartitionsWithContext, mapPartitionsWithIndex, mapPartitionsWithSplit, mapWith, max, min, name, partitioner, partitions, persist, persist, pipe, pipe, pipe, preferredLocations, randomSplit, reduce, saveAsObjectFile, saveAsTextFile, saveAsTextFile, setName, sparkContext, takeOrdered, takeSample, toArray, toDebugString, toJavaRDD, toLocalIterator, top, toString, union, unpersist, zip, zipPartitions, zipPartitions, zipPartitions, zipPartitions, zipPartitions, zipPartitions, zipWithIndex, zipWithUniqueId
Methods inherited from class java.lang.Object: equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.spark.Logging: initialized, initializeIfNecessary, initializeLogging, initLock, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logTrace, logTrace, logWarning, logWarning
public SchemaRDD(SQLContext sqlContext, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan baseLogicalPlan)
public SQLContext sqlContext()
public org.apache.spark.sql.catalyst.plans.logical.LogicalPlan baseLogicalPlan()
public SchemaRDD baseSchemaRDD()
public scala.collection.Iterator<org.apache.spark.sql.catalyst.expressions.Row> compute(Partition split, TaskContext context)
:: DeveloperApi :: Implemented by subclasses to compute a given partition.
public Partition[] getPartitions()
Implemented by subclasses to return the set of partitions in this RDD.
public SchemaRDD select(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> exprs)
Changes the output of this relation to the given expressions, similar to the SELECT clause in SQL.
schemaRDD.select('a, 'b + 'c, 'd as 'aliasedName)
exprs - a set of logical expressions that will be evaluated for each input row.
public SchemaRDD where(org.apache.spark.sql.catalyst.expressions.Expression condition)
Filters the output, only returning those rows where condition evaluates to true.
schemaRDD.where('a === 'b)
schemaRDD.where('a === 1)
schemaRDD.where('a + 'b > 10)
public SchemaRDD join(SchemaRDD otherPlan, org.apache.spark.sql.catalyst.plans.JoinType joinType, scala.Option<org.apache.spark.sql.catalyst.expressions.Expression> on)
Performs a relational join on two SchemaRDDs.
otherPlan - the SchemaRDD that should be joined with this one.
joinType - One of Inner, LeftOuter, RightOuter, or FullOuter. Defaults to Inner.
on - An optional condition for the join operation. This is equivalent to the ON clause in standard SQL. In the case of Inner joins, specifying a condition is equivalent to adding where clauses after the join.
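For example, a sketch that joins two hypothetical SchemaRDDs (people and orders, sharing a personId column) and uses as to qualify the attribute names, assuming import sqlContext._ is in scope as in the earlier examples:
val p = people.as('p)
val o = orders.as('o)
// Inner join (the default joinType) with an explicit join condition.
val joined = p.join(o, on = Some("p.personId".attr === "o.personId".attr))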
public SchemaRDD orderBy(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.SortOrder> sortExprs)
schemaRDD.orderBy('a)
schemaRDD.orderBy('a, 'b)
schemaRDD.orderBy('a.asc, 'b.desc)
public SchemaRDD limit(org.apache.spark.sql.catalyst.expressions.Expression limitExpr)
public SchemaRDD limit(int limitNum)
schemaRDD.limit(10)
public SchemaRDD groupBy(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> groupingExprs, scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> aggregateExprs)
schemaRDD.groupBy('year)(Sum('sales) as 'totalSales)
public SchemaRDD aggregate(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> aggregateExprs)
schemaRDD.aggregate(Sum('sales) as 'totalSales)
public SchemaRDD as(scala.Symbol alias)
val x = schemaRDD.where('a === 1).as('x)
val y = schemaRDD.where('a === 2).as('y)
x.join(y).where("x.a".attr === "y.a".attr)
public SchemaRDD unionAll(SchemaRDD otherPlan)
public <T1> SchemaRDD where(scala.Symbol arg1, scala.Function1<T1,Object> udf)
schemaRDD.where('a)((a: Int) => ...)
public SchemaRDD where(scala.Function1<org.apache.spark.sql.catalyst.expressions.DynamicRow,Object> dynamicUdf)
Filters tuples using a function over a Dynamic version of a given Row. DynamicRows use Scala's Dynamic trait to emulate an ORM of a dynamically typed language. Since the type of the column is not known at compile time, all attributes are converted to strings before being passed to the function.
schemaRDD.where(r => r.firstName == "Bob" && r.lastName == "Smith")
public SchemaRDD sample(boolean withReplacement, double fraction, long seed)
public long count()
public SchemaRDD generate(org.apache.spark.sql.catalyst.expressions.Generator generator, boolean join, boolean outer, scala.Option<String> alias)
Applies the given Generator, or table generating function, to this relation.
generator - A table generating function. The API for such functions is likely to change in future releases.
join - when set to true, each output row of the generator is joined with the input row that produced it.
outer - when set to true, at least one row will be produced for each input row, similar to an OUTER JOIN in SQL. When no output rows are produced by the generator for a given row, a single row will be output, with NULL values for each of the generated columns.
alias - an optional alias that can be used as a qualifier for the attributes that are produced by this generate operation.
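A hedged sketch of invoking generate; myGenerator stands in for any catalyst Generator (constructing a concrete generator is version specific and not shown here):
val myGenerator: org.apache.spark.sql.catalyst.expressions.Generator = ??? // hypothetical generator
// Join each generated row back to the input row that produced it, and keep input rows
// that generate no output (NULLs are emitted for the generated columns).
val expanded = schemaRDD.generate(myGenerator, join = true, outer = true, alias = Some("gen"))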
public SchemaRDD toSchemaRDD()
public JavaSchemaRDD toJavaSchemaRDD()
public org.apache.spark.sql.catalyst.expressions.Row[] collect()
Return an array that contains all of the elements in this RDD.
public org.apache.spark.sql.catalyst.expressions.Row[] take(int num)
Take the first num elements of the RDD.
public SchemaRDD coalesce(int numPartitions, boolean shuffle, scala.math.Ordering<org.apache.spark.sql.catalyst.expressions.Row> ord)
Return a new RDD that is reduced into numPartitions partitions.
This results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
Note: With shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large. Calling coalesce(1000, shuffle = true) will result in 1000 partitions with the data distributed using a hash partitioner.
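A short sketch of both modes (partition counts are illustrative; as in RDD.coalesce, shuffle defaults to false):
// Shrink to 100 partitions with a narrow dependency (no shuffle).
val fewer = schemaRDD.coalesce(100)
// Drastic coalesce: request a shuffle so upstream work still runs in parallel.
val single = schemaRDD.coalesce(1, shuffle = true)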
public SchemaRDD distinct()
Return a new RDD containing the distinct elements in this RDD.
public SchemaRDD distinct(int numPartitions, scala.math.Ordering<org.apache.spark.sql.catalyst.expressions.Row> ord)
Return a new RDD containing the distinct elements in this RDD.
public SchemaRDD filter(scala.Function1<org.apache.spark.sql.catalyst.expressions.Row,Object> f)
Return a new RDD containing only the elements that satisfy a predicate.
public SchemaRDD intersection(RDD<org.apache.spark.sql.catalyst.expressions.Row> other)
Return the intersection of this RDD and another one. Note that this method performs a shuffle internally.
Overrides intersection in class RDD<org.apache.spark.sql.catalyst.expressions.Row>.
public SchemaRDD intersection(RDD<org.apache.spark.sql.catalyst.expressions.Row> other, Partitioner partitioner, scala.math.Ordering<org.apache.spark.sql.catalyst.expressions.Row> ord)
Return the intersection of this RDD and another one. Note that this method performs a shuffle internally.
Overrides intersection in class RDD<org.apache.spark.sql.catalyst.expressions.Row>.
partitioner - Partitioner to use for the resulting RDD.
public SchemaRDD intersection(RDD<org.apache.spark.sql.catalyst.expressions.Row> other, int numPartitions)
Return the intersection of this RDD and another one. Note that this method performs a shuffle internally.
Overrides intersection in class RDD<org.apache.spark.sql.catalyst.expressions.Row>.
numPartitions - How many partitions to use in the resulting RDD.
public SchemaRDD repartition(int numPartitions, scala.math.Ordering<org.apache.spark.sql.catalyst.expressions.Row> ord)
Return a new RDD that has exactly numPartitions partitions.
Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data.
If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.
Overrides repartition in class RDD<org.apache.spark.sql.catalyst.expressions.Row>.
public SchemaRDD subtract(RDD<org.apache.spark.sql.catalyst.expressions.Row> other)
Return an RDD with the elements from this that are not in other.
Uses this RDD's partitioner/partition size, because even if other is huge, the resulting RDD will be no larger than this one.
public SchemaRDD subtract(RDD<org.apache.spark.sql.catalyst.expressions.Row> other, int numPartitions)
Return an RDD with the elements from this that are not in other.
public SchemaRDD subtract(RDD<org.apache.spark.sql.catalyst.expressions.Row> other, Partitioner p, scala.math.Ordering<org.apache.spark.sql.catalyst.expressions.Row> ord)
Return an RDD with the elements from this that are not in other.
public org.apache.spark.sql.SQLContext.QueryExecution queryExecution()
The query execution is considered a Developer API as phases may be added or removed in future releases. This execution is only exposed to provide an interface for inspecting the various phases for debugging purposes. Applications should not depend on particular phases existing or producing any specific output, even for exactly the same query.
Additionally, the RDD exposed by this execution is not designed for consumption by end users. In particular, it does not contain any schema information, and it reuses Row objects internally. This object reuse improves performance, but can make programming against the RDD more difficult. Instead end users should perform RDD operations on a SchemaRDD directly.
public org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan()
public String toString()
Overrides toString in class Object.
public void saveAsParquetFile(String path)
Saves the contents of this SchemaRDD as a Parquet file, preserving the schema. Files that are written out using this method can be read back in as a SchemaRDD using the parquetFile function.
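A minimal round-trip sketch, reusing the results SchemaRDD from the earlier example (the output path is a placeholder):
// Write the SchemaRDD out; the schema is stored alongside the Parquet data.
results.saveAsParquetFile("path/to/records.parquet")
// Read it back later as a SchemaRDD; no schema needs to be supplied.
val restored = sqlContext.parquetFile("path/to/records.parquet")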
public void registerAsTable(String tableName)
Registers this RDD as a temporary table, using the given name, in the SQLContext that was used to create this SchemaRDD.
public void insertInto(String tableName, boolean overwrite)
public void insertInto(String tableName)
public void saveAsTable(String tableName)
Note that this currently only works with SchemaRDDs that are created from a HiveContext, as there is no notion of a persisted catalog in a standard SQL context. Instead, you can write an RDD out to a Parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.
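A rough sketch of that workaround (paths and table names are placeholders, and moreResults is a hypothetical SchemaRDD with the same schema):
// Write out as Parquet, then register the file-backed SchemaRDD as a table.
results.saveAsParquetFile("path/to/records_table.parquet")
sqlContext.parquetFile("path/to/records_table.parquet").registerAsTable("records_table")
// Another SchemaRDD with a matching schema can now be appended to it.
moreResults.insertInto("records_table")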
public String schemaString()
public void printSchema()