2022-02-28

Understanding TF-IDF with Python example

6 mins read Term Frequency – Inverse Document Frequency (TF-IDF) is a widely used statistical method in natural language processing and information retrieval. […]
2022-02-28

Understanding Pandas and NumPy views vs copies to handle SettingWithCopyWarning

33 mins read Table of Contents Prerequisites Example of a SettingWithCopyWarning Views and Copies in NumPy and Pandas Understanding Views and Copies in […]
2022-02-23

Useful shortcut keys in Linux terminal

11 mins read Ubuntu comes with a powerful set of keyboard shortcuts that you can use to increase your productivity with minimal effort. […]
2022-02-18

A guide on PySpark Window Functions with Partition By

11 mins read PySpark window functions are useful when you want to examine relationships within groups of data rather than between groups of […]
2022-02-17

Setting up a multi-node Apache Spark Cluster on a local Windows machine with VirtualBox

6 mins read Prerequisite: Understand how to install Ubuntu inside Windows using Oracle VM VirtualBox. Apache Spark is a fast and […]
2022-02-17

Useful magic commands in Jupyter Notebook/Lab

30 mins read Jupyter Notebook/Lab is the go-to tool for data scientists and developers worldwide to perform data analysis. It provides […]
2022-02-14

Understanding GROUP BY, GROUPING SET, ROLL UP, and CUBE in SQL

18 mins read GROUP BY A table in a database has columns of information in it. Each column in a table represents an […]
2022-02-11

A tutorial on Apache Cassandra data modeling – RowKeys, Columns, Keyspaces, Tables, and Keys

24 mins read In this post, I will discuss the basic concepts of data modeling in Apache Cassandra. It is important to understand […]
2022-02-11

Understanding Cassandra Partition Key, Composite Key, and Clustering Key

13 mins read 1. Overview Data distribution and data modeling in the Cassandra NoSQL database are different from those in a traditional relational […]
2022-02-04

Connect to Cassandra Cluster with DBeaver Community Edition

2 mins read DataStax offers the JDBC driver from Magnitude (formerly Simba) to users at no cost, so you should be able to […]
2022-02-03

Feature Selection for categorical data with Python code

17 mins read Feature selection is the process of identifying and selecting a subset of input features that are most relevant to the target […]
2022-02-03

Basic feature engineering tasks for numeric and categorical data with Python code

34 mins read Machine learning pipelines Any intelligent system basically consists of an end-to-end pipeline starting from ingesting raw data and leveraging data […]
2022-01-30

Understanding Expectation-Maximization (EM) algorithm

18 mins read The EM algorithm is often used in machine learning as an algorithm for data clustering. Sometimes, one of the clustering problems […]
2022-01-29

A guide to different Cross-Validation methods in Machine Learning

19 mins read In machine learning (ML), generalization usually refers to the ability of an algorithm to be effective across various inputs. It […]
2022-01-27

Understanding the Dummy Variable Trap with example

4 mins read Linear regression is a method we can use to quantify the relationship between one or more predictor variables and a response variable. […]