In this article we will see how to join two datasets in Spark with the Java API, the different types of joins available in Spark Java programming, and the differences between them, with sample Java code.

Dummy data
Let's first prepare some dummy data to test the different joins available in the Spark Java API. We have created two classes, Data.java and DataFamily.java (the code is available at the end of this article).

We first create some dummy data in Java lists and then convert that data to Spark Datasets using "sqlContext"; a support file for obtaining the SQL context (SparkContext.java) is also given at the end of this article.
		/* DUMMY DATA creation */
		List<Data> dataList = Arrays.asList(new Data(1, "AA", "Apple", 1), new Data(2, "AB", "Orange", 1),
				new Data(3, "AC", "Banana", 2), new Data(4, "AD", "Guava", 3));

		List<DataFamily> dataFamilyList = Arrays.asList(new DataFamily(1, "Pu Family", "USA"),
				new DataFamily(2, "Ge Family", "UK"), new DataFamily(4, "Lu Family", "France"));
		
		
		/* Convert from Java list to Spark Dataset */
		Dataset<Row> rawData = SparkContext.sqlContext().createDataFrame(dataList, Data.class);
		rawData.show();

		Dataset<Row> dataFamilyData = SparkContext.sqlContext().createDataFrame(dataFamilyList, DataFamily.class);
		dataFamilyData.show();
Here "rawData" and "dataFamilyData", datasets have following values, we will use these two datasets to apply all the joins and examine result.


+----+------------+------+------+
|code|dataFamilyId|dataId| value|
+----+------------+------+------+
|  AA|           1|     1| Apple|
|  AB|           1|     2|Orange|
|  AC|           2|     3|Banana|
|  AD|           3|     4| Guava|
+----+------------+------+------+

+------------+--------+---------+
|dataFamilyId|location|     name|
+------------+--------+---------+
|           1|     USA|Pu Family|
|           2|      UK|Ge Family|
|           4|  France|Lu Family|
+------------+--------+---------+
In the Apache Spark Java API, the supported join types include: 'inner', 'outer', 'full', 'fullouter', 'leftouter', 'left', 'rightouter', 'right', 'leftsemi', 'leftanti' and 'cross'. Let's look at each of them one by one.
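
All of the examples below follow the same pattern: the two-argument join(right, joinExprs) defaults to an inner join, while the three-argument overload accepts the join type as one of the strings above. A minimal sketch of the pattern ("left", "right" and "key" here are placeholders for your own datasets and join column):
		/* GENERAL JOIN PATTERN in Spark Java */
		Dataset<Row> joined = left.join(right,
				left.col("key").equalTo(right.col("key")), // join condition
				"inner"); // any of the join type strings listed above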

1) Inner Join in Spark - Java API

In a Spark inner join, only matching rows from both datasets are combined to make the new dataset.
		/* INNER JOIN in Spark Java */
		Dataset<Row> innerJoinData = rawData.join(dataFamilyData,
				rawData.col("dataFamilyId").equalTo(dataFamilyData.col("dataFamilyId")));
		innerJoinData.show();
Output will look something like this:

+----+------------+------+------+------------+--------+---------+
|code|dataFamilyId|dataId| value|dataFamilyId|location|     name|
+----+------------+------+------+------------+--------+---------+
|  AA|           1|     1| Apple|           1|     USA|Pu Family|
|  AB|           1|     2|Orange|           1|     USA|Pu Family|
|  AC|           2|     3|Banana|           2|      UK|Ge Family|
+----+------------+------+------+------------+--------+---------+
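
Note that the joined dataset contains two dataFamilyId columns, one from each side, because the join condition compares two distinct columns. If you want the key to appear only once, the Java API also provides a join overload that takes the shared column name directly and de-duplicates it in the output; a minimal sketch using the same data:
		/* INNER JOIN on a shared column name, keeping a single dataFamilyId column */
		Dataset<Row> innerJoinByName = rawData.join(dataFamilyData, "dataFamilyId");
		innerJoinByName.show();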

2) Outer Join in Spark - Java API

In a Spark outer join, all matching and non-matching rows from both datasets are combined to make the new dataset. Rows without a match on the other side are padded with null.
		/* OUTER JOIN Spark Java */
		Dataset<Row> outerJoinData = rawData.join(dataFamilyData,
				rawData.col("dataFamilyId").equalTo(dataFamilyData.col("dataFamilyId")), "outer");
		outerJoinData.show();
Output will look something like this:

+----+------------+------+------+------------+--------+---------+
|code|dataFamilyId|dataId| value|dataFamilyId|location|     name|
+----+------------+------+------+------------+--------+---------+
|  AA|           1|     1| Apple|           1|     USA|Pu Family|
|  AB|           1|     2|Orange|           1|     USA|Pu Family|
|  AD|           3|     4| Guava|        null|    null|     null|
|  AC|           2|     3|Banana|           2|      UK|Ge Family|
|null|        null|  null|  null|           4|  France|Lu Family|
+----+------------+------+------+------------+--------+---------+

3) Full Join in Spark - Java API

A Spark 'full' join is an alias for the outer join above: all matching and non-matching rows from both datasets are combined, and rows without a match are padded with null, so the result is identical to 'outer'.
		/* FULL JOIN Spark Java */
		Dataset<Row> fullJoinData = rawData.join(dataFamilyData,
				rawData.col("dataFamilyId").equalTo(dataFamilyData.col("dataFamilyId")), "full");
		fullJoinData.show();
Output will look something like this:

+----+------------+------+------+------------+--------+---------+
|code|dataFamilyId|dataId| value|dataFamilyId|location|     name|
+----+------------+------+------+------------+--------+---------+
|  AA|           1|     1| Apple|           1|     USA|Pu Family|
|  AB|           1|     2|Orange|           1|     USA|Pu Family|
|  AD|           3|     4| Guava|        null|    null|     null|
|  AC|           2|     3|Banana|           2|      UK|Ge Family|
|null|        null|  null|  null|           4|  France|Lu Family|
+----+------------+------+------+------------+--------+---------+

4) Full outer Join in Spark - Java API

A Spark 'fullouter' join is another alias for the outer join: all matching and non-matching rows from both datasets are combined, and rows without a match are padded with null. It produces exactly the same result as 'outer' and 'full'.
		/* FULL OUTER Spark Java */
		Dataset<Row> fullouterJoinData = rawData.join(dataFamilyData,
				rawData.col("dataFamilyId").equalTo(dataFamilyData.col("dataFamilyId")), "fullouter");
		fullouterJoinData.show();
Output will look something like this:

+----+------------+------+------+------------+--------+---------+
|code|dataFamilyId|dataId| value|dataFamilyId|location|     name|
+----+------------+------+------+------------+--------+---------+
|  AA|           1|     1| Apple|           1|     USA|Pu Family|
|  AB|           1|     2|Orange|           1|     USA|Pu Family|
|  AD|           3|     4| Guava|        null|    null|     null|
|  AC|           2|     3|Banana|           2|      UK|Ge Family|
|null|        null|  null|  null|           4|  France|Lu Family|
+----+------------+------+------+------------+--------+---------+

5) Left outer Join in Spark - Java API

In a Spark left outer join, all rows from the left dataset are kept, while only matching rows from the right dataset are combined with them. Non-matching rows are padded with null on the right side.
		/* LEFT OUTER Spark Java */
		Dataset<Row> leftouterJoinData = rawData.join(dataFamilyData,
				rawData.col("dataFamilyId").equalTo(dataFamilyData.col("dataFamilyId")), "leftouter");
		leftouterJoinData.show();
Output will look something like this:

+----+------------+------+------+------------+--------+---------+
|code|dataFamilyId|dataId| value|dataFamilyId|location|     name|
+----+------------+------+------+------------+--------+---------+
|  AA|           1|     1| Apple|           1|     USA|Pu Family|
|  AB|           1|     2|Orange|           1|     USA|Pu Family|
|  AC|           2|     3|Banana|           2|      UK|Ge Family|
|  AD|           3|     4| Guava|        null|    null|     null|
+----+------------+------+------+------------+--------+---------+

6) Left Join in Spark - Java API

A Spark 'left' join is an alias for the left outer join: all rows from the left dataset are kept, only matching rows from the right dataset are combined with them, and non-matching rows are padded with null on the right side.
		/* LEFT JOIN Spark Java */
		Dataset<Row> leftJoinData = rawData.join(dataFamilyData,
				rawData.col("dataFamilyId").equalTo(dataFamilyData.col("dataFamilyId")), "left");
		leftJoinData.show();
Output will look something like this:

+----+------------+------+------+------------+--------+---------+
|code|dataFamilyId|dataId| value|dataFamilyId|location|     name|
+----+------------+------+------+------------+--------+---------+
|  AA|           1|     1| Apple|           1|     USA|Pu Family|
|  AB|           1|     2|Orange|           1|     USA|Pu Family|
|  AC|           2|     3|Banana|           2|      UK|Ge Family|
|  AD|           3|     4| Guava|        null|    null|     null|
+----+------------+------+------+------------+--------+---------+

7) Right outer Join in Spark - Java API

In a Spark right outer join, all rows from the right dataset are kept, while only matching rows from the left dataset are combined with them. Non-matching rows are padded with null on the left side.
		/* RIGHT OUTER Spark Java */
		Dataset<Row> rightouterJoinData = rawData.join(dataFamilyData,
				rawData.col("dataFamilyId").equalTo(dataFamilyData.col("dataFamilyId")), "rightouter");
		rightouterJoinData.show();
Output will look something like this:

+----+------------+------+------+------------+--------+---------+
|code|dataFamilyId|dataId| value|dataFamilyId|location|     name|
+----+------------+------+------+------------+--------+---------+
|  AB|           1|     2|Orange|           1|     USA|Pu Family|
|  AA|           1|     1| Apple|           1|     USA|Pu Family|
|  AC|           2|     3|Banana|           2|      UK|Ge Family|
|null|        null|  null|  null|           4|  France|Lu Family|
+----+------------+------+------+------------+--------+---------+

8) Right Join in Spark - Java API

A Spark 'right' join is an alias for the right outer join: all rows from the right dataset are kept, only matching rows from the left dataset are combined with them, and non-matching rows are padded with null on the left side.

		/* RIGHT Spark Java */
		Dataset<Row> rightJoinData = rawData.join(dataFamilyData,
				rawData.col("dataFamilyId").equalTo(dataFamilyData.col("dataFamilyId")), "right");
		rightJoinData.show();
Output will look something like this:

+----+------------+------+------+------------+--------+---------+
|code|dataFamilyId|dataId| value|dataFamilyId|location|     name|
+----+------------+------+------+------------+--------+---------+
|  AB|           1|     2|Orange|           1|     USA|Pu Family|
|  AA|           1|     1| Apple|           1|     USA|Pu Family|
|  AC|           2|     3|Banana|           2|      UK|Ge Family|
|null|        null|  null|  null|           4|  France|Lu Family|
+----+------------+------+------+------------+--------+---------+

9) Left semi Join in Spark - Java API

In a Spark left semi join, only the rows from the left dataset that have a match in the right dataset are returned, and no columns from the right dataset are added to the joined dataset.
		/* LEFT SEMI Spark Java */
		Dataset<Row> leftsemiJoinData = rawData.join(dataFamilyData,
				rawData.col("dataFamilyId").equalTo(dataFamilyData.col("dataFamilyId")), "leftsemi");
		leftsemiJoinData.show();
Output will look something like this:

+----+------------+------+------+
|code|dataFamilyId|dataId| value|
+----+------------+------+------+
|  AA|           1|     1| Apple|
|  AB|           1|     2|Orange|
|  AC|           2|     3|Banana|
+----+------------+------+------+
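
The same semi join can also be expressed in Spark SQL, which is handy when the rest of a pipeline is SQL-based. A minimal sketch, assuming the two datasets are first registered as temporary views (the view names here are arbitrary):
		/* LEFT SEMI JOIN expressed in Spark SQL */
		rawData.createOrReplaceTempView("data");
		dataFamilyData.createOrReplaceTempView("data_family");
		Dataset<Row> semiViaSql = SparkContext.sqlContext().sql(
				"SELECT d.* FROM data d LEFT SEMI JOIN data_family f "
						+ "ON d.dataFamilyId = f.dataFamilyId");
		semiViaSql.show();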

10) Left anti Join

In a Spark left anti join, only the rows from the left dataset that have no match in the right dataset are returned, and no columns from the right dataset are added to the joined dataset.
		/* LEFT ANTI JOIN Spark Java */
		Dataset<Row> leftantiJoinData = rawData.join(dataFamilyData,
				rawData.col("dataFamilyId").equalTo(dataFamilyData.col("dataFamilyId")), "leftanti");
		leftantiJoinData.show();
Output will look something like this:

+----+------------+------+-----+
|code|dataFamilyId|dataId|value|
+----+------------+------+-----+
|  AD|           3|     4|Guava|
+----+------------+------+-----+
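
Left semi and left anti joins are complements of each other: together they partition the left dataset, which makes the pair useful as a quick sanity check. A small sketch using the datasets built above:
		/* Sanity check: semi and anti rows together cover the whole left dataset */
		long matched = leftsemiJoinData.count();   // 3 rows with a matching family
		long unmatched = leftantiJoinData.count(); // 1 row without a match
		// matched + unmatched == rawData.count(), i.e. 3 + 1 == 4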

11) Cross Join in Spark - Java API

In Spark, a 'cross' join that is given a join condition, as in the code below, behaves like an inner join: only matching row combinations appear in the result. A true Cartesian product, in which every left row is paired with every right row, is produced when the join has no condition; see the crossJoin sketch after the output below.
		/* CROSS JOIN Spark Java */
		Dataset<Row> crossJoinData = rawData.join(dataFamilyData,
				rawData.col("dataFamilyId").equalTo(dataFamilyData.col("dataFamilyId")), "cross");
		crossJoinData.show();
Output will look something like this:

+----+------------+------+------+------------+--------+---------+
|code|dataFamilyId|dataId| value|dataFamilyId|location|     name|
+----+------------+------+------+------------+--------+---------+
|  AA|           1|     1| Apple|           1|     USA|Pu Family|
|  AB|           1|     2|Orange|           1|     USA|Pu Family|
|  AC|           2|     3|Banana|           2|      UK|Ge Family|
+----+------------+------+------+------------+--------+---------+
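
If what you actually want is the full Cartesian product (every row of the left dataset paired with every row of the right, here 4 x 3 = 12 rows), use crossJoin, which takes no join condition; a minimal sketch:
		/* TRUE CARTESIAN PRODUCT in Spark Java */
		Dataset<Row> cartesianData = rawData.crossJoin(dataFamilyData);
		cartesianData.show(); // 12 rows: each Data row paired with each DataFamily row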


Other files used in examples:
Data.java
package com.tb.examples;

public class Data {
	private long dataId;
	private String code;
	private String value;
	private long dataFamilyId;
	
	public Data() {
		super();
	}

	public Data(long dataId, String code, String value, long dataFamilyId) {
		super();
		this.dataId = dataId;
		this.code = code;
		this.value = value;
		this.dataFamilyId = dataFamilyId;
	}

	// Standard getters and setters for all fields go here
	// (createDataFrame relies on bean-style getters to infer the schema)

}

DataFamily.java
package com.tb.examples;

public class DataFamily {
	private long dataFamilyId;
	private String name;
	private String location;
	
	public DataFamily() {
		super();
	}

	public DataFamily(long dataFamilyId, String name, String location) {
		super();
		this.dataFamilyId = dataFamilyId;
		this.name = name;
		this.location = location;
	}

	// Standard getters and setters for all fields go here
	// (createDataFrame relies on bean-style getters to infer the schema)

}

SparkContext.java
package com.tb.examples;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;

public class SparkContext {
	private static JavaSparkContext javaSparkContext;
	private static SparkSession sparkSession;
	private static SQLContext sqlContext;
	

	public static JavaSparkContext javaSparkContext() {
		// Wrap the session's existing SparkContext rather than creating a second one
		if (javaSparkContext == null)
			javaSparkContext = JavaSparkContext.fromSparkContext(sparkSession().sparkContext());
		return javaSparkContext;
	}

	public static SparkSession sparkSession() {
		if (sparkSession == null)
			sparkSession = SparkSession.builder().getOrCreate();
		return sparkSession;
	}

	public static SQLContext sqlContext() {
		if (sqlContext == null)
			sqlContext = sparkSession().sqlContext();
		return sqlContext;
	}

}
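
Note that SparkSession.builder().getOrCreate() picks up the master URL and application name from the environment (for example, from spark-submit). To run these examples directly from an IDE you may need to set them explicitly; a minimal local-mode sketch (the app name is arbitrary):
		sparkSession = SparkSession.builder()
				.appName("spark-joins-example") // hypothetical app name
				.master("local[*]")             // run locally using all cores
				.getOrCreate();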

In this article we have seen the different types of joins available in the Apache Spark Java API, with example code and the differences between them. In upcoming articles we will see more about Spark programming with Java.