Evaluate Clustering Algorithms

Performance measurement for supervised learning algorithms is straightforward because predictions can be compared directly against the labels. For an unsupervised learning problem, however, there are no labels and therefore no ground truth. We therefore need other evaluation methods to … Read more
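One label-free evaluation method is the silhouette score, which judges cluster quality from the data alone. A minimal sketch using scikit-learn (the synthetic blobs and the choice of KMeans are illustrative assumptions, not taken from the article):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data; the true blob labels are discarded to mimic
# an unsupervised setting with no ground truth
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Cluster without any labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X)

# The silhouette score rates cohesion vs. separation from the data
# alone: values near +1 indicate dense, well-separated clusters
score = silhouette_score(X, cluster_labels)
print(f"silhouette score: {score:.3f}")
```

Because the score is computed purely from distances within and between clusters, no labels are ever needed.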

Decision Tree

During the training of the decision tree algorithm for a classification task, the dataset is split into subsets based on the features. After training, the overall importance of a feature in a decision tree can be computed in the following way: 1) Go through all the splits for which the feature was used and … Read more
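In scikit-learn this aggregated per-feature importance is exposed as `feature_importances_`. A short sketch (the iris dataset is an illustrative assumption):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(random_state=42).fit(data.data, data.target)

# feature_importances_ sums, for every feature, the impurity reduction
# of each split that used it (weighted by the samples reaching that
# split) and normalizes the result so all importances sum to 1
for name, importance in zip(data.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

The values are normalized, so they can be read directly as each feature's share of the total impurity reduction.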

Transform the Target Variable

There are three problems that can occur in a machine learning project that we can tackle by transforming the target variable: 1) improve the results of a machine learning model when the target variable is skewed; 2) reduce the impact of outliers in the target variable; 3) use the mean absolute error … Read more
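A common way to apply such a transformation is scikit-learn's `TransformedTargetRegressor`, which fits the model on the transformed target and inverts the transform at prediction time. A minimal sketch with a log transform for a skewed, strictly positive target (the synthetic data and linear model are assumptions for illustration):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
# A right-skewed, strictly positive target
y = np.exp(0.5 * X.ravel() + rng.normal(0, 0.2, size=200))

# The regressor is trained on log(y); predictions are mapped back
# through exp automatically
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
model.fit(X, y)
pred = model.predict(X[:5])
```

Since predictions pass through `exp`, they stay positive by construction, which is often exactly what a skewed target requires.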

Feature Transformation in Machine Learning

In machine learning, feature transformation is a common technique used to improve the accuracy of models. One of the reasons for transformation is to handle skewed data, which can negatively affect the performance of many machine learning algorithms. For this article, I programmed an example to work … Read more
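One simple transformation for right-skewed features is `log1p`, which compresses the long tail. A small sketch (the log-normal sample is an illustrative assumption, not the article's dataset):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
# A right-skewed feature, e.g. values drawn from a log-normal
feature = rng.lognormal(mean=3, sigma=1, size=1000)

# log1p (log(1 + x)) compresses the long right tail, pulling the
# distribution closer to symmetric
transformed = np.log1p(feature)

print(f"skew before: {skew(feature):.2f}, after: {skew(transformed):.2f}")
```

The skewness drops sharply after the transform, which many models (especially linear ones) benefit from.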

Overview of Error, Loss, and Cost Functions

In summary, the error function measures the overall performance of the model, the loss function measures the performance of the model on a single training example, and the cost function calculates the average performance of the model over the entire training set. The error function measures the overall … Read more
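The loss-vs-cost distinction can be made concrete with a few lines of NumPy, using squared error as the per-sample loss (the toy numbers are made up for illustration):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 6.0])

# Loss: performance on a single training example (squared error here),
# so we get one value per sample
losses = (y_true - y_pred) ** 2

# Cost: the average of the per-sample losses over the whole training
# set, i.e. the mean squared error
cost = losses.mean()
```

Each entry of `losses` is a loss value for one example; `cost` is the single number the optimizer actually minimizes.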

SGDRegressor and SGDClassifier

SGDRegressor and SGDClassifier use stochastic gradient descent (SGD) as an optimization algorithm. This makes both models very efficient for large datasets because, with SGD, the model parameters are updated after each training sample instead of after a pass over the entire dataset. Moreover, both algorithms have a lot of flexibility regarding different loss functions, methods to prevent overfitting, … Read more

How to Find and Impute Missing Values in a Dataset

Datasets may have missing values, and this can cause problems for many machine learning algorithms. As such, it is good practice to identify and replace missing values for each column in your dataset prior to modeling your prediction task. Find Missing Values in a Dataset Finding missing values in a dataset is not very complicated. … Read more
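A short sketch of both steps with pandas and scikit-learn's `SimpleImputer` (the tiny DataFrame is invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000, 62000, np.nan, np.nan],
})

# Step 1: count missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Step 2: replace missing values with each column's mean
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

Other `strategy` options such as `"median"` or `"most_frequent"` are often safer when a column contains outliers.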

Find Outlier in Datasets using Local Outlier Factor

The Local Outlier Factor (LOF) is an unsupervised algorithm to detect outliers in your dataset. LOF detects outliers based on the local deviation of the density of a sample compared to that of its neighbors. The local density is estimated from the distances between a sample and its surrounding neighbors (k-nearest neighbors). Outliers are samples that … Read more
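A minimal sketch with scikit-learn's `LocalOutlierFactor`; the dense cluster plus two far-away points are an invented example:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# A dense cluster around the origin plus two obvious outliers
inliers = rng.normal(0, 0.5, size=(100, 2))
outliers = np.array([[5.0, 5.0], [-6.0, 4.0]])
X = np.vstack([inliers, outliers])

# LOF compares each sample's local density with that of its k nearest
# neighbors; fit_predict returns -1 for outliers and 1 for inliers
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)
```

The two isolated points sit in a region of much lower density than their neighbors, so they receive label `-1`.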

Find Outlier in Datasets using Isolation Forest

The Isolation Forest is an unsupervised anomaly detection technique based on the decision tree algorithm. The main idea is that a sample that travels deeper into the tree is less likely to be an outlier, because samples that lie close to each other need many splits to be separated. On the other hand, samples … Read more
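A minimal sketch with scikit-learn's `IsolationForest`, reusing the same kind of invented data as above:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# A dense cluster plus two points that are easy to isolate
inliers = rng.normal(0, 0.5, size=(200, 2))
outliers = np.array([[6.0, 6.0], [-7.0, 5.0]])
X = np.vstack([inliers, outliers])

# Points separated by only a few random splits (shallow tree depth)
# get label -1 (outlier); points deep in the dense region get 1
forest = IsolationForest(random_state=0)
labels = forest.fit_predict(X)
```

The two distant points require very few random splits to be isolated, which is exactly what the forest flags.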