2023-01-11

Machine Learning for Big Data using PySpark with real-world projects

10 mins read I have prepared a GitHub repository that provides a set of self-study tutorials on Machine Learning for big data […]
2022-11-16

Repository for implementation of statistics concepts for Data Science in Python

3 mins read The field of statistics is becoming increasingly important in the world of data science and machine learning. I have recently […]
2022-10-24

How to return pandas dataframes from Scikit-Learn transformations: New API simplifies data preprocessing

3 mins read Scikit-learn, a popular Python library for machine learning, is often one of the first tools introduced to data science beginners. […]
2022-10-15

Setting up Apache Airflow using Docker-Compose

11 mins read Although I am pretty late to the party (Airflow became an Apache Top-Level Project in 2019), I still had trouble finding […]
2022-09-23

Implementing Attention Mechanism in Python

7 mins read The attention mechanism was introduced to improve the performance of the encoder-decoder model for machine translation. The idea behind the […]
2022-09-18

A guide on PySpark Window Functions with Partition By

11 mins read PySpark window functions are useful when you want to examine relationships within groups of data rather than between groups of […]
2022-08-30

Setup collaborative MLflow with PostgreSQL as Tracking Server and MinIO as Artifact Store using docker containers

14 mins read In this post, I will show how to configure MLflow in a way that allows multiple data scientists using different […]
2022-08-14

The default Random Forest feature importance is not reliable: Understanding Permutation Feature Importance

47 mins read The scikit-learn Random Forest feature importance and R’s default Random Forest feature importance strategies are biased. To get reliable results […]
2022-08-12

A review on information theory concepts for machine learning: Entropy, Cross-Entropy, KL divergence, Information gain, and Mutual Information

58 mins read Information theory is a field of study concerned with quantifying information for communication. It is a subfield of mathematics […]
2022-08-05

Understanding Gradient Boost Regression by numerical examples and Python Code

13 mins read Gradient boost is a machine learning algorithm based on the ensemble technique called ‘Boosting’. Like other boosting models, Gradient […]
2022-08-02

Measure the correlation between numerical and categorical variables and the correlation between two categorical variables in Python: Chi-Square and ANOVA

27 mins read Data analysis is an essential part of any research or business endeavor, and one of the most fundamental techniques is […]
2022-08-01

Audio source separation (vocal remover) system based on Deep Learning

12 mins read Are you looking for that instrumental version of your favorite song? Or are you a DJ […]
2022-08-01

A simple tutorial on Sampling Importance and Monte Carlo with Python codes

16 mins read In this post, I’m going to explain importance sampling. Importance sampling is an approximation method instead of a […]
2022-07-30

A comprehensive tutorial on MLflow for MLOps: From experimentation to production

39 mins read After reading this post you will be able to: Understand how you and your Data Science teams can improve your […]
2022-07-28

Understanding TF-IDF with Python example

7 mins read Term Frequency – Inverse Document Frequency (TF-IDF) is a popular statistical technique utilized in natural language processing and information retrieval […]
2022-07-28

Steps to package and publish Python codes to PyPI (pip)

6 mins read You wrote a new Python package that solves a specific problem and it’s now time to share it with the […]
2022-07-21

Partial Dependence Plots with Python code

17 mins read Some people complain that machine learning models are black boxes. These people will argue we cannot see how […]
2022-07-19

Understanding Transposed Convolution with Python example

25 mins read Transposed convolution is a revolutionary concept for applications like image segmentation and super-resolution, but sometimes it becomes a little trickier […]
2022-07-19

Understanding the basics of audio data with Python code

36 mins read A huge amount of audio data is being generated every day in almost every organization. Audio data yields substantial […]
2022-07-14

Setup Apache Spark on a multi-node cluster

12 mins read This article covers basic steps to install and configure Apache Spark 3.1.1 on a multi-node cluster, which includes installing Spark […]