Spark Streaming Pulsar receiver
Pulsar version 2.0
The documentation that you’re reading is for the 2.0 release of Apache Pulsar. For more information on Pulsar 2.0, see this guide.
The Spark Streaming receiver for Pulsar is a custom receiver that enables Apache Spark Streaming to receive data from Pulsar.
An application can receive data in Resilient Distributed Dataset (RDD) format via the Spark Streaming Pulsar receiver and can process it in a variety of ways.
Prerequisites
To use the receiver, include a dependency for the pulsar-spark library in your Java build configuration.
Maven
If you’re using Maven, add this to your pom.xml:
<!-- in your <properties> block -->
<pulsar.version>2.0.0-incubating</pulsar.version>

<!-- in your <dependencies> block -->
<dependency>
  <groupId>org.apache.pulsar</groupId>
  <artifactId>pulsar-spark</artifactId>
  <version>${pulsar.version}</version>
</dependency>
Gradle
If you’re using Gradle, add this to your build.gradle file:
def pulsarVersion = "2.0.0-incubating"

dependencies {
    compile group: 'org.apache.pulsar', name: 'pulsar-spark', version: pulsarVersion
}
Usage
Pass an instance of SparkStreamingPulsarReceiver to the receiverStream method in JavaStreamingContext:
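import org.apache.pulsar.client.api.ClientConfiguration;
import org.apache.pulsar.client.api.ConsumerConfiguration;
import org.apache.pulsar.spark.SparkStreamingPulsarReceiver;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("pulsar-spark");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5)); // 5-second batch interval
ClientConfiguration clientConf = new ClientConfiguration();
ConsumerConfiguration consConf = new ConsumerConfiguration();
String url = "pulsar://localhost:6650/";                    // Pulsar broker service URL
String topic = "persistent://sample/standalone/ns1/topic1";
String subs = "sub1";                                       // subscription name
JavaReceiverInputDStream<byte[]> msgs = jssc
    .receiverStream(new SparkStreamingPulsarReceiver(clientConf, consConf, url, topic, subs));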
Example
You can find a complete example here. The example counts the received messages that contain the string “Pulsar”.
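The counting step itself can be sketched in plain Java, independent of Spark and Pulsar. This is a minimal illustration and not the code of the linked example. The class name PulsarMessageCount and the helpers containsKeyword and countMatching are hypothetical, and the sketch assumes that message payloads are UTF-8 encoded text, which the receiver does not guarantee since it delivers raw byte[] values.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical class name; not part of pulsar-spark or the linked example.
class PulsarMessageCount {

    // Hypothetical helper: true if the payload, decoded as UTF-8 (an assumption),
    // contains the keyword.
    static boolean containsKeyword(byte[] payload, String keyword) {
        return new String(payload, StandardCharsets.UTF_8).contains(keyword);
    }

    // Hypothetical helper: counts the payloads in one batch that contain the keyword.
    static long countMatching(List<byte[]> batch, String keyword) {
        return batch.stream().filter(p -> containsKeyword(p, keyword)).count();
    }

    public static void main(String[] args) {
        // Stand-in for one batch of messages received from a topic.
        List<byte[]> batch = Arrays.asList(
            "Hello Pulsar".getBytes(StandardCharsets.UTF_8),
            "Hello Spark".getBytes(StandardCharsets.UTF_8),
            "Pulsar and Spark".getBytes(StandardCharsets.UTF_8));
        System.out.println(countMatching(batch, "Pulsar")); // prints 2
    }
}
```

In a streaming job, a predicate like containsKeyword would typically be applied to each batch of the byte[] stream returned by receiverStream rather than to an in-memory list.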