Classification

Classification in machine learning is commonly considered a supervised learning approach in which the computer program learns from the input data given to it and then uses this learning to classify new observations. The data set may be binary (bi-class), such as identifying whether a person is male or female or whether an email is spam or not spam, or it may be multi-class, such as classifying a document into one of several categories. Some examples of classification problems are speech recognition, handwriting recognition, biometric identification, and document classification.

Classification Methods

The most important methods for classification in machine learning are briefly mentioned here for reference.

  1. Decision Trees: Decision Trees are a type of supervised machine learning in which the data is repeatedly split according to a certain parameter. The tree can be explained by two entities: decision nodes, where the data is split, and leaves, which are the decisions or final outcomes.

  2. Naive Bayes Classifier: Naive Bayes is a statistical classification technique based on Bayes' theorem. It is one of the simplest supervised learning algorithms. The Naive Bayes classifier assumes that the effect of a particular feature on a class is independent of the other features, i.e. that the features are conditionally independent given the class.

  3. Random Forest: Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both classification and regression problems. It is based on the concept of ensemble learning: combining multiple models (here, decision trees) to solve a particular problem.

  4. Support Vector Machines (SVM): Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms and is used for both classification and regression, though primarily for classification. The goal of the SVM algorithm is to find the best line or decision boundary that segregates the data set into classes, so that new data points can easily be placed in the correct category in the future.

  5. K-Nearest Neighbors (KNN): K-Nearest Neighbors is one of the most basic yet essential classification algorithms in Machine Learning. It belongs to the supervised learning domain and finds intense application in pattern recognition, data mining and intrusion detection.

  6. Logistic Regression: Despite its name, logistic regression is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier.

  7. Neural Networks: Neural networks are loosely modelled after the human brain, in that they can recognize patterns in data. They perform classification by applying multiple linear and non-linear transformations to the input, mapping it into a representation in which records are as far as possible from those of other classes and as close as possible to those of their own class.
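As a minimal sketch of how these methods are used in practice, the following example trains several of the classifiers above on a small toy dataset. It assumes scikit-learn is available; the dataset (Iris) and all parameter choices are illustrative, not recommendations.

```python
# Compare several classifiers on a toy dataset.
# Assumes scikit-learn is installed; settings are illustrative only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit each model on the training split and report test accuracy.
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f"{name}: {clf.score(X_test, y_test):.2f}")
```

All of these estimators share the same `fit`/`predict`/`score` interface, which is why they can be swapped in and out of the loop without any other changes.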

Difficulties with Classification

  1. Overfitting: This occurs when the model is too complex and captures noise along with the underlying pattern in the data. It performs well on training data but poorly on unseen data.

  2. Underfitting: The opposite of overfitting: the model is too simple to capture the underlying pattern in the data, resulting in poor performance on both training and unseen data.

  3. Poor Data: Classification algorithms require a lot of data to make accurate predictions; not having enough data, or having classes of very imbalanced sizes, can lead to inaccurate models. To alleviate this problem, synthetic data generation methods can be used, which create artificial data that mimics the properties of the original data. One such method is data augmentation: creating new data by modifying existing data, for example by rotating, flipping, or color-shifting images. SMOTE (Synthetic Minority Over-sampling Technique) addresses class size imbalance by creating synthetic examples of the minority class. More recently, GANs (Generative Adversarial Networks), a type of deep learning network, have been used to generate new data instances that resemble the original training data. Bootstrapping is another data generation method, which involves drawing samples from the dataset with replacement.

  4. High Dimensionality: A dataset has two sizes: the number of records or data points, and the size of those records. The latter is the number of features, which in turn is the dimension of the data. Note that the dimension of the data is not necessarily the dimension of the input to the learning algorithm: depending on their type, features may need to be transformed, resulting in a higher-dimensional input. For example, it is common to encode each word in a text as a vector of numbers; the length of that vector then determines the dimension of the encoded dataset. Having too many features can make it complex and expensive, in terms of computation and running time, to find a good model.
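To make the resampling idea from the Poor Data point concrete, here is a small sketch, using only NumPy, of balancing an imbalanced dataset by bootstrapping the minority class, i.e. drawing from it with replacement until both classes are the same size. The class sizes and feature values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative imbalanced dataset: 100 majority samples, 10 minority samples,
# each with two features drawn from different normal distributions.
X_maj = rng.normal(0.0, 1.0, size=(100, 2))
X_min = rng.normal(3.0, 1.0, size=(10, 2))
y_maj = np.zeros(100, dtype=int)

# Bootstrap the minority class: draw indices with replacement
# until it matches the majority class in size.
idx = rng.integers(0, len(X_min), size=len(X_maj))
X_min_boot = X_min[idx]
y_min_boot = np.ones(len(X_maj), dtype=int)

X_balanced = np.vstack([X_maj, X_min_boot])
y_balanced = np.concatenate([y_maj, y_min_boot])

print(np.bincount(y_balanced))  # both classes now contain 100 samples
```

Because the bootstrap only repeats existing minority points, it adds no new information; methods like SMOTE go one step further and interpolate between minority samples to create genuinely new synthetic points.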