
Learning

Artificial Intelligence as a discipline aims to create systems, be it software or other forms, that imitate intelligent behavior as we perceive and define it in the real world. Most classical Artificial Intelligence approaches the problem in what we call a bottom-up fashion, using standard computer science algorithms. Classical AI is limited by the fact that such algorithms need a well-formed domain of input and operation, and as a result they are unable to handle the uncertainty that is ubiquitous in the real world.

Statistical Machine Learning approaches the problem of understanding the world much like biological organisms appear to have done for thousands of years: it creates a model based on the input data. The premise is simple: the more the data, the better the model. A machine learning algorithm takes the data through what is called the training phase and produces an appropriate model; the better the learning algorithm, the better the model. A lot of research goes into finding appropriate learning algorithms for different data domains. Eventually the produced model is used to produce the desired output from new input data.

Typically, a model is an association of input to some output. Take, for example, trying to figure out whether it is raining outside from a measurement of humidity. In mathematical terms, humidity would be some variable x and the probability of rain would be y. The learning algorithm is provided with a dataset, a set of pairs of humidity x and rain probability y, for as many measurements as we can possibly have. It learns (produces) a model that, when provided with some value x, gives us the probability of rain.
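A minimal sketch of this example follows. The dataset is invented, the outcomes are recorded as rained / did not rain rather than as probabilities, and scikit-learn's LogisticRegression is only one possible choice of learner:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# x: humidity measurements, y: whether it actually rained (1) or not (0)
x = np.array([[20], [35], [50], [65], [80], [90], [95]])
y = np.array([0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(x, y)    # the training phase
print(model.predict_proba([[70]])[0, 1])  # learned model: probability of rain at 70% humidity
```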

This very simple example is the absolute basis, and it extends to sets of pairs with many input variables and many output variables. Such data is referred to as high-dimensional, but in reality this has nothing to do with dimensions beyond borrowing the term from mathematical physics, where the input data also represents physical dimensions. In fact, multi-dimensional datasets are just many variables considered together. In our example above it would simply be "temperature, humidity, sunshine, wind, ...", with each variable referred to as a different dimension.

Model

In the simplest case, a model is a mathematical function with some coefficients. Imagine a large equalizer with many knobs to tune for a certain type of music and room. Just as the sound technician sets the values of the knobs, the learning algorithm sets the values of the coefficients. The process is very similar: the sound technician makes many attempts, gathering data, e.g. on echo, and plays with the different settings to achieve an optimal setup for the specific room. When learning has completed, the parameters of the function (or of our equalizer) are the learned model, or just the model.

The learning algorithm itself has parameters; these are referred to as meta-parameters, and they may affect how successful the learning algorithm is in producing a good model.

This notion of how good a model is for the provided data is measured with what we call a loss. There are many losses, and they are largely specific to the domain at hand. The general idea, however, is simply that we measure how close the output of the model is to the expected output for a known input-output pair from the dataset. One might conclude that the smaller the loss, the better the learned model. This is not entirely true, however: a model can achieve a very small loss on the provided data and still fail on new data, and a great deal of effort in the research community has gone into addressing this.
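As a sketch, one widely used loss for regression-style problems is the mean squared error; the numbers below are made up for illustration:

```python
import numpy as np

def mse_loss(y_pred, y_true):
    # average squared gap between the model's output and the expected output
    return np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)

print(mse_loss([0.9, 0.2, 0.4], [1.0, 0.0, 0.5]))  # the closer the outputs, the smaller the loss
```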

Data commonly represents an instance of the problem we are trying to solve. Usually it is gathered after having been generated by some phenomenon, process, or abstract problem. Even though it seems easy at first, creating a dataset that captures all the patterns is not a straightforward process. Another difficulty is that even a fully representative dataset does not prevent the learning algorithm from producing an erroneous model that fails to capture the process correctly. This is generally called overfitting. A model that is not overfit and minimizes loss is considered to have optimal generalization, that is, how well it manages to produce the correct output for previously unseen input.
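A minimal sketch of the idea, assuming noisy samples of a smooth process and plain polynomial fitting: the very flexible fit tracks the training points closely but tends to generalize worse on the held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy samples of a smooth process

x_train, y_train = x[::2], y[::2]    # half the data for learning
x_test,  y_test  = x[1::2], y[1::2]  # half kept aside to judge generalization

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)                   # learn the model
    test_loss = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(float(test_loss), 3))
```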

Learning algorithm

The learning algorithm is at the heart of all statistical machine learning. Its purpose is to learn the parameters of a model from the dataset. In deep learning, and neural networks in general, these are called weights. As we are going to see, weight is a term borrowed from neuroscience, where it describes the strength of a synapse between neurons.
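As a sketch of what a learning algorithm does (gradient descent is used here purely as an illustration; it is only one of many possible learners), the loop below adjusts the weights of a one-variable linear model until it reproduces the example pairs:

```python
import numpy as np

# Example pairs generated from the process y = 2x + 1 that the model should capture
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0      # the weights (model parameters), starting from arbitrary values
lr = 0.05            # step size of the learner

for _ in range(2000):
    err = w * x + b - y             # how far the model's output is from the expected output
    w -= lr * np.mean(2 * err * x)  # gradient of the squared loss with respect to w
    b -= lr * np.mean(2 * err)      # gradient of the squared loss with respect to b

print(round(w, 2), round(b, 2))     # approaches 2.0 and 1.0
```

The loop repeatedly nudges w and b in the direction that reduces the squared loss; more elaborate learners differ mainly in how such updates are computed and applied.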

The learning algorithm itself has a set of parameters that are used to tune the learning to the specific dataset in order to achieve the best possible generalization. These are called meta-parameters, and the process of uncovering them is called meta-learning, although the actual process is different from the one used by the learning algorithm to determine the weights.
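Continuing the hypothetical gradient-descent sketch above, the step size is such a meta-parameter: it belongs to the learner, not to the model. The crude loop below simply tries a few values and reports how well each learned model ends up fitting, which is meta-learning in its most rudimentary form:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

def train(x, y, lr, steps=500):
    # gradient-descent learner; lr is a meta-parameter, w and b are the weights
    w, b = 0.0, 0.0
    for _ in range(steps):
        err = w * x + b - y
        w -= lr * np.mean(2 * err * x)
        b -= lr * np.mean(2 * err)
    return w, b

for lr in (0.001, 0.01, 0.05):    # candidate meta-parameter values
    w, b = train(x, y, lr)
    print(lr, round(float(np.mean((w * x + b - y) ** 2)), 4))
```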

Transformations

The weights, or more formally the parameters, along with a formula or an algorithm to combine them with the input, are the model. Statistical machine learning tries to find a linear combination of the weights and of transformations of the input variables in order to produce the output.

A transformation is a linear or non-linear function. The transformation is linear when the input is multiplied by a coefficient (another name for a weight) and a number is added to it. It is non-linear if, apart from these weight-based operations, the input itself is also transformed non-linearly, e.g. by squaring it or taking its logarithm. There is no magic in these non-linear transformations beyond restricting the form of the output instead of leaving it unbounded.
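A small sketch with made-up weight values, contrasting a purely linear transformation with one that also transforms the input non-linearly:

```python
import numpy as np

w, b = 0.5, 2.0                # made-up coefficient (weight) and added number
x = np.array([1.0, 4.0, 9.0])

linear = w * x + b             # multiply the input by a coefficient and add a number
nonlinear = w * np.log(x) + b  # the input itself is first transformed non-linearly

print(linear)                  # [2.5 4.  6.5]
print(nonlinear)               # the logarithm bends the input before the linear part is applied
```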

Take, for example, a half circle drawn on top of an axis. A model can be learned where the input is the position on the axis and the output is the height of the half circle. Using a large number of Gaussian bells placed one next to the other, the learning algorithm can determine a weight for each of them such that the correct height is returned.
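A sketch of this construction, assuming a unit half circle, fifteen Gaussian bells of a fixed width, and ordinary least squares standing in for the learning algorithm:

```python
import numpy as np

x = np.linspace(-1, 1, 200)
y = np.sqrt(1 - x**2)             # height of a unit half circle above the axis

centers = np.linspace(-1, 1, 15)  # Gaussian bells placed one next to the other
width = 0.15
Phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width**2))

weights, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # find one weight per bell
y_hat = Phi @ weights                              # weighted sum of the bells

print(round(float(np.max(np.abs(y_hat - y))), 3))  # worst-case gap from the true height
```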

Modelling Categories

Machine learning can be separated into two categories of algorithms based on their intended application problem:

Regression is defined on continuous input-response types of problems, where the output is piece-wise continuous (though not necessarily smooth). For example, predicting the final resting position of a ball falling off a table is such a problem.

Classification, on the contrary, is all about finding where things fit best. Imagine a bag of balls of different colors and being given a new ball to place on the pile whose color is most similar. This "most similar" is at the heart of every classification problem: an item belongs to the class whose items it is most similar to.
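A minimal sketch of this "most similar" idea, using made-up (red, green, blue) values as features and a nearest-neighbour rule:

```python
import numpy as np

balls = np.array([[255, 0, 0], [250, 30, 20],   # reddish balls
                  [0, 0, 255], [20, 10, 240]])  # bluish balls
labels = np.array(["red", "red", "blue", "blue"])

new_ball = np.array([230, 40, 35])
distances = np.linalg.norm(balls - new_ball, axis=1)  # how far the new ball is from each known one
print(labels[np.argmin(distances)])                   # "red": the most similar pile
```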

Learning methodologies

The most important methodologies in machine learning are supervised and unsupervised learning. There is also reinforcement learning, but this is in fact a special case of supervised learning. We will leave that for later.

Both methodologies can be applied to either regression or classification problems to find the best possible solution. The supervised method needs a dataset that, along with the examples of the input, also contains the correct output. These are usually found in the literature as the X and Y parts of the dataset. The supervised learning algorithm then works by associating the input with the output through determining the weights, or more generally the model parameters. In the case of classification the input can be anything and the output is the class membership of the input, whereas in the case of regression the provided output in the dataset is the correct output for the provided input.

In the unsupervised case only the input X is provided, and the learning algorithm usually finds a model that identifies groups in the dataset. The members of these groups are such that they are most similar to one another. This mostly relates to classification problems. However, regression can also be achieved, in the case of density estimation, where the input is mapped to its probability of occurrence.
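As a sketch of the unsupervised case, the loop below runs a few iterations of k-means (one of many possible grouping algorithms; the two blobs of points and the choice of two groups are assumptions for illustration), so that members of each group end up most similar to one another:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 0.5, (20, 2)),   # one blob of points around (0, 0)
                    rng.normal(5, 0.5, (20, 2))])  # another blob around (5, 5)

centers = X[rng.choice(len(X), 2, replace=False)]  # start from two random points
for _ in range(10):
    # assign every point to its nearest centre, then move each centre to the
    # mean of the points assigned to it
    assign = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
    centers = np.array([X[assign == k].mean(axis=0) for k in range(2)])

print(np.round(centers, 1))                        # roughly (0, 0) and (5, 5)
```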