2023-09-12

Run spark-submit for Apache Spark (PySpark) using Docker

3 mins read. Pre-Requisites. docker-compose file. Below is a docker-compose file to set up a Spark cluster with 1 master and 2 worker […]
2023-01-11

Machine Learning for Big Data using PySpark with real-world projects

10 mins read. Introduction. I have prepared a GitHub repository that provides a set of self-study tutorials on Machine Learning for big data […]
2022-02-17

A guide on PySpark Window Functions with Partition By

11 mins read. When analyzing data within groups, PySpark window functions can be more useful than groupBy for examining relationships. First, a […]
2019-07-17

A quick review of Apache Kafka

27 mins read. Introduction. Kafka is a name that comes up a lot nowadays, and many leading digital companies seem to use it. […]
2018-07-08

What is Word2vec word embedding?

24 mins read. I find the concept of embeddings to be one of the most fascinating ideas in machine learning. If you’ve ever […]