Real-time Stream Processing
In this article we will learn how to use clusters of Kafka, Logstash and Apache Spark to build a real time processing engine.
For a real-time processing engine we need two things – “Event Source” and “Event Processor”
Event Source – We need an event source for the events to be processed. A stream of tweets could be an example of event source. In this article we will use logstash for streaming events. Logstash can stream events from various data sources like twitter, log files, tcp ports etc.
Event Processor – Event processor will consume events from Kafka topics and will do further processing on events. We will use Apache Spark for real-time event processing.
We will create a real time architecture for counting words in a stream of text for each 10 seconds time window.
Real-time processing architecture
Architecture consists of three components
- Apache Spark
Kafka is a publisher – subscriber architecture. Events will be published on kafka topics and any subscriber for that specific topic will get those specific events.
To simulate real-time events we will use a large text file, we can use logstash to create a stream of that text file and output it on kafka server. Each line in that file can be considered as an event.
We will run a consumer job in Apache spark and do processing on that event as we want. Consumer will listen on kafka topics for events.
We will create clusters of kakfa and apache spark nodes. Steps for installation and configuration for this architecture are as follows.
Kafka Cluster Configuration
Kafka with minimal configuration can be download from here. Follow these steps to create a Kafka cluster-
- Download and extract Kafka on each cluster
- Edit config/zookeeper.properties and add Kafka server details on master node
If you have more servers, add all in similar manner
- Edit config/server.properties on each node and add/edit following.
Note: broker.id must be unique for each node.
- On master node add one more property in config/server.properties
You can download and install Logstash from here.
Logstash configuration needs input and output plugins, we will create file input and Kafka output plugins for our use case. Following steps need to be done for creating a Logstash configuration
- We need to create a configuration file say, sample.conf
- Add file input plugin
- Add Kafka output plugin
Apache Spark Configuration
Download and extract Apache spark on each node from here. On master node edit conf/slave and add slave nodes here-
Note: Make sure slave machines are accessible from master node through ssh without password.
Now let’s write a consumer job in Spark and name it kafka_wordcount.py
Here, SparkContext represents Spark cluster and it will be used to create RDDs and other spark operations. StreamingContext will create DStreams from input data source for providing streaming functionality. For more information on spark streaming you can refer to spark streaming guide.
Now follow these steps to run the whole setup-
- Start Kafka server on each node from kafka home dir
- Create a kafka topic – kafkatest
- Now create stream of text by running logstash on any of node from logstash home dir
It will start streaming content of text file to kafka ports.
Till now our producer is ready. Now let’s start consumer.
Run spark cluster with following command from spark home dir
Now submit spark consumer job.
This job should print count of words in stream for every 10 seconds.