
In today's data-driven world, the ability to process and analyze data in real time is crucial for making timely decisions. Whether it's for monitoring system performance, tracking user behavior, or analyzing financial markets, real-time data pipelines play a vital role.
In this blog, we'll walk you through building a simple data pipeline using Go, Kafka, ClickHouse, and Apache Superset. Our goal is to demonstrate how you can generate both stream and batch data with Go, push it into Kafka queues, consume it with dedicated consumers, store it in ClickHouse, and finally visualize it with Apache Superset. By the end of this guide, you'll have a fully functional data pipeline ready for real-time and batch data processing.
We'll cover everything from setting up the environment and writing Go scripts to configuring Kafka, ClickHouse, and Superset. Let's dive in and start building a data pipeline that can handle the demands of modern data processing.
Let's have a quick look at what we're trying to build with the following flow diagram:

Seems straightforward, doesn't it? Indeed, it is – once you grasp the function of each component. From here on out, it's all about diving into the setup and implementation process.
Setting Up the Environment (for local testing - of course it can be migrated to a production environment with a few changes, which is out of scope for this blog)
Installation of Kafka
Installation of ClickHouse (open source)
Installation of Apache Superset
Installation of Kafka: This is pretty straightforward, and I'm sure many of us already have this on our local system - either through Docker or through the binary scripts. For now, we can download the latest Kafka from https://kafka.apache.org/downloads and untar it locally. After this, all we need to do is:
Start ZooKeeper (open a terminal and run the following from inside the Kafka directory):
bin/zookeeper-server-start.sh config/zookeeper.properties
Start the Kafka server (open another terminal and run the following from inside the Kafka directory):
bin/kafka-server-start.sh config/server.properties
Now we need to create the topics: "stream-queue" for stream data, which will receive continuous, high-throughput data, and "batch-queue" for batch (or let's say infrequent) data.
bin/kafka-topics.sh --create --topic stream-queue --bootstrap-server localhost:9092
bin/kafka-topics.sh --create --topic batch-queue --bootstrap-server localhost:9092
Installation of ClickHouse:
ClickHouse provides an official Docker image for the ClickHouse server. One thing to note here: by default, "ClickHouse will be accessible only via the Docker network." So if all of your connected services are not inside one Docker network (which in most cases they won't be), we need to publish explicit ports while launching the ClickHouse container:
docker run -d -p 8123:8123 -p 19000:9000 --name codemyles-clickhouse-server --ulimit nofile=262144:262144 clickhouse/clickhouse-server
And done. You have a ClickHouse server up and running. Want to test it? The ClickHouse Docker image comes with clickhouse-client already installed, so all we need to do is run the following command to verify:
docker exec -it codemyles-clickhouse-server clickhouse-client
If it opens up a shell with a smiley (quite literally - ":)") then we are good to go! Let's move on to our next and last candidate for installation - Apache Superset.
Installation of Apache Superset:
We will be using the Docker image provided by the Apache Superset team, but with a small change. By default, the Superset image doesn't come with the ClickHouse driver installed, so we need to add it via a Dockerfile. Fret not, we've got you covered. Just put the following in a file and name it `Clickhouse-Superset-Dockerfile`:
FROM apache/superset
# Switching to root to install the required packages
USER root
RUN pip install clickhouse-connect
# Superset itself listens on port 8088 inside the container
EXPOSE 8088
# Switching back to the `superset` user
USER superset
And after this, all we need to do is build the image:
docker build -t superset -f Clickhouse-Superset-Dockerfile .
And now let's bring the Superset container up:
docker run -d -p 8080:8088 -e "SUPERSET_SECRET_KEY=your_secret_key_here" --name superset-with-clickhouse superset
Almost done. We need to run a few commands once the container is up:
docker exec -it superset-with-clickhouse superset fab create-admin \
--username admin \
--firstname Superset \
--lastname Admin \
--email admin@superset.com \
--password admin
docker exec -it superset-with-clickhouse superset db upgrade
docker exec -it superset-with-clickhouse superset init
We can verify the installation by logging in to the portal: http://localhost:8080/login/

And now, in our data pipeline setup, to simulate the behavior of multiple backend services pushing event data into Kafka, we have written a simple Go script to help us generate real-time data. This step mirrors real-world scenarios where various components of a system generate events continuously, providing valuable insights for monitoring, analysis, and decision-making.
This Go script is equipped with two distinct functionalities:
Stream Data Generation
Batch Data Generation
Show me the code

The complete code can be seen at the link below:
All we need to do now is run the above code in a terminal: `go run main.go`
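Just to give a feel for the shape of that script, here is a minimal, illustrative sketch - it is not the exact code from the linked file, and it assumes the segmentio/kafka-go client with a broker on localhost:9092. It emits random finance transactions continuously to "stream-queue" and periodically to "batch-queue":

// main.go - illustrative generator sketch, not the exact code from the linked file
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"math/rand"
	"time"

	"github.com/segmentio/kafka-go"
)

// transaction mirrors the columns we will later create in test.finance_data_kafka.
type transaction struct {
	TransactionID     string  `json:"transaction_id"`
	AccountID         string  `json:"account_id"`
	TransactionDate   string  `json:"transaction_date"`
	TransactionTime   string  `json:"transaction_time"`
	TransactionAmount float64 `json:"transaction_amount"`
	TransactionType   string  `json:"transaction_type"`
	MerchantName      string  `json:"merchant_name"`
	MerchantCategory  string  `json:"merchant_category"`
	CardType          string  `json:"card_type"`
	CardNetwork       string  `json:"card_network"`
}

// randomTransaction fabricates one plausible-looking transaction.
func randomTransaction() transaction {
	now := time.Now()
	return transaction{
		TransactionID:     fmt.Sprintf("txn-%d", rand.Int63()),
		AccountID:         fmt.Sprintf("acc-%d", rand.Intn(1000)),
		TransactionDate:   now.Format("2006-01-02"),
		TransactionTime:   now.Format("2006-01-02 15:04:05"),
		TransactionAmount: rand.Float64() * 500,
		TransactionType:   []string{"debit", "credit"}[rand.Intn(2)],
		MerchantName:      []string{"Acme", "Globex", "Initech"}[rand.Intn(3)],
		MerchantCategory:  []string{"grocery", "travel", "fuel"}[rand.Intn(3)],
		CardType:          []string{"credit", "debit"}[rand.Intn(2)],
		CardNetwork:       []string{"visa", "mastercard"}[rand.Intn(2)],
	}
}

// produce writes one random transaction to the given topic at the given interval.
func produce(ctx context.Context, topic string, every time.Duration) {
	w := &kafka.Writer{Addr: kafka.TCP("localhost:9092"), Topic: topic, Balancer: &kafka.LeastBytes{}}
	defer w.Close()
	for {
		payload, _ := json.Marshal(randomTransaction()) // marshalling a plain struct cannot fail
		if err := w.WriteMessages(ctx, kafka.Message{Value: payload}); err != nil {
			log.Printf("write to %s failed: %v", topic, err)
		}
		time.Sleep(every)
	}
}

func main() {
	ctx := context.Background()
	go produce(ctx, "stream-queue", 200*time.Millisecond) // continuous, high-throughput stream
	produce(ctx, "batch-queue", 30*time.Second)           // infrequent, batch-style data
}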
Now, before we get to the Kafka consumers, we need to make sure our ClickHouse database has a valid table for the incoming data. So let's get back to the clickhouse-client we discussed earlier and create a test database and a finance_data_kafka table:
CREATE DATABASE test;
CREATE TABLE test.finance_data_kafka
(
    transaction_id String PRIMARY KEY,
    account_id String,
    transaction_date Date,
    transaction_time DateTime,
    transaction_amount Float64,
    transaction_type String,
    merchant_name String,
    merchant_category String,
    card_type String,
    card_network String
)
ENGINE = MergeTree;
Great! Now we have a table ready to accept INSERTs. Let's fire up your favorite code editor and put Go workers and channels to work on the different queues. This gives us the flexibility 1) to manage the speed and number of consumers based on the incoming traffic throughput, and 2) to perform any transformations we would like on the incoming data.
The complete code file can be found at the codefile link below. While there are many ways to connect to ClickHouse from Go, we opted for Go's native database/sql driver interface and the HTTP protocol for communication. One can definitely choose the TCP (native) protocol as well - details can be found on the ClickHouse documentation page.
Here is how we are connecting to ClickHouse:

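The snippet below is a simplified sketch of that connection rather than the exact code from the codefile; it assumes the clickhouse-go v2 driver, the HTTP port 8123 we published in the docker run command, and ClickHouse's default user with an empty password:

// clickhouse.go - simplified connection sketch using clickhouse-go v2 over HTTP
package main

import (
	"database/sql"
	"log"

	"github.com/ClickHouse/clickhouse-go/v2"
)

// connectClickHouse returns a database/sql handle talking to ClickHouse over HTTP.
func connectClickHouse() (*sql.DB, error) {
	db := clickhouse.OpenDB(&clickhouse.Options{
		Addr:     []string{"localhost:8123"}, // HTTP port published by the ClickHouse container
		Protocol: clickhouse.HTTP,
		Auth: clickhouse.Auth{
			Database: "test",
			Username: "default",
			Password: "",
		},
	})
	if err := db.Ping(); err != nil {
		return nil, err
	}
	return db, nil
}

func main() {
	db, err := connectClickHouse()
	if err != nil {
		log.Fatalf("clickhouse connection failed: %v", err)
	}
	defer db.Close()
	log.Println("connected to ClickHouse over HTTP")
}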
And here is how we are launching our different sets of workers for `stream-queue` and `batch-queue`:

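Again, a rough sketch of the idea rather than the exact code from the codefile - assuming the segmentio/kafka-go client, one consumer group per topic, and placeholder handlers where the real streamConsumer/batchConsumer logic would go:

// workers.go - illustrative sketch of launching worker pools per topic
package main

import (
	"context"
	"log"
	"sync"

	"github.com/segmentio/kafka-go"
)

// launchWorkers starts n consumer goroutines on a topic, all sharing one
// consumer group so Kafka spreads partitions across them.
func launchWorkers(ctx context.Context, wg *sync.WaitGroup, topic, group string, n int,
	handle func(kafka.Message) error) {
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			r := kafka.NewReader(kafka.ReaderConfig{
				Brokers: []string{"localhost:9092"},
				Topic:   topic,
				GroupID: group,
			})
			defer r.Close()
			for {
				msg, err := r.ReadMessage(ctx)
				if err != nil {
					return // context cancelled or reader closed
				}
				if err := handle(msg); err != nil {
					log.Printf("[%s worker %d] %v", topic, id, err)
				}
			}
		}(i)
	}
}

func main() {
	ctx := context.Background()
	var wg sync.WaitGroup

	// Placeholder handlers: in the real pipeline, streamConsumer/batchConsumer
	// would decode the JSON payload and insert it into ClickHouse.
	streamConsumer := func(m kafka.Message) error { log.Printf("stream: %s", m.Value); return nil }
	batchConsumer := func(m kafka.Message) error { log.Printf("batch: %s", m.Value); return nil }

	// More workers for the high-throughput topic, fewer for the batch one.
	launchWorkers(ctx, &wg, "stream-queue", "stream-consumers", 4, streamConsumer)
	launchWorkers(ctx, &wg, "batch-queue", "batch-consumers", 1, batchConsumer)
	wg.Wait()
}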
Complete code can be found here: https://codefile.io/f/42TgecOMoT
Again - one can scale the number of consumers based on the traffic and available CPU cores, but for now this should serve our purpose. Details of the streamConsumer and batchConsumer functions can be found in the complete codefile link mentioned above.
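For a rough idea of what those consumer functions might do (a hypothetical sketch, not the code from the codefile), each handler essentially decodes the JSON payload and writes a row into test.finance_data_kafka; the real code would typically buffer messages and insert them in batches:

// handler.go - hypothetical sketch of a consumer handler
package consumer

import (
	"database/sql"
	"encoding/json"

	"github.com/segmentio/kafka-go"
)

// handleMessage decodes one Kafka message and inserts it into ClickHouse.
// A typed struct (as in the generator sketch) would be cleaner; a generic map
// keeps this sketch short. Row-by-row INSERTs are shown for simplicity only.
func handleMessage(db *sql.DB, msg kafka.Message) error {
	var t map[string]interface{}
	if err := json.Unmarshal(msg.Value, &t); err != nil {
		return err
	}
	_, err := db.Exec(`INSERT INTO test.finance_data_kafka
		(transaction_id, account_id, transaction_date, transaction_time,
		 transaction_amount, transaction_type, merchant_name, merchant_category,
		 card_type, card_network)
		VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`,
		t["transaction_id"], t["account_id"], t["transaction_date"], t["transaction_time"],
		t["transaction_amount"], t["transaction_type"], t["merchant_name"], t["merchant_category"],
		t["card_type"], t["card_network"])
	return err
}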
So with this we are good to push sample data into Kafka, consume it, and finally push it into ClickHouse. All we need now is to connect Apache Superset to ClickHouse and build some dashboards. Well, that is pretty straightforward:
Log in to Superset. Find "Database Connections" under Settings.
Click the "+ Database" button and choose "ClickHouse" from the "Choose a Database" dropdown. Note: this is exactly why we added 'pip install clickhouse-connect' to our Superset Dockerfile.
Enter the database name as "test". Leave the username and password blank.
Enter the IP of your host machine - which in our scenario is the machine on which the ClickHouse server Docker container is running. Enter the port - remember to check this from the ClickHouse docker run command we ran earlier (8123 for HTTP).
And we are done!
Time to dive into chart creation. Crafting charts is a breeze, all thanks to the intuitive Apache Superset UI. There's just one golden rule to keep in mind: avoid premature aggregation unless you intend to perform further analysis. Why? Well, Superset empowers you to select aggregation functions directly on the dataset (for us, the raw table from ClickHouse, but one can even run a custom query), shaping how data is visualized on the chart.
For instance, if you're interested in tracking the influx of transactions every 5 minutes, there's no need to pre-aggregate by timestamp (GROUP BY?). Instead, define these aggregations seamlessly while configuring the chart in Superset. Ready to explore? Go ahead, give it a try!

While the setup may appear deceptively simple, the combination of ClickHouse's robust ingestion capabilities, the efficiency of Go workers and channels as potent data consumers, and Apache Superset's analytical prowess renders this entire setup highly scalable. This means that as your data needs grow, this pipeline can effortlessly accommodate increasing volumes and complexity, ensuring that your analytics infrastructure remains agile and responsive to evolving business requirements. With scalability at its core, this data pipeline is poised to grow alongside your organization, empowering you to unlock deeper insights and drive informed decision-making at scale.
So, roll up your sleeves, experiment, and let your data pipeline pave the way for a data-driven future. Happy coding!