This is a series of articles titled Machine Learning explained with tissues.
Machine Learning and AI have become mainstream topics in recent years. Many companies have their own Data Science division, and every day we hear about wonderful new tools that use this technology.
But, for many, Machine Learning is just a black box that does magic, and data scientists are some kind of mighty wizards who can do almost anything with data. The purpose of this series is to give a quick intuition of what is inside that black box and to explain the foundations of Machine Learning with drawings on a tissue, with no maths or code.
DISCLAIMER: This article is informative; many of the concepts explained below may not be entirely accurate. The aim of the article is to give non-data scientists an intuition of the capabilities of Machine Learning.
What is Machine Learning?
Over the years I've read and heard thousands of definitions of machine learning. In my humble opinion, the most intuitive one is "Machine Learning is providing a computer with the ability to learn from data". Yeah, I know, this doesn't sound very precise, but it points to the two key ingredients that every machine learning process needs:
- The data from which the knowledge is extracted.
- The models and procedures that we will use to extract the knowledge.
Machine Learning Problems
There are tons of problems that can be solved with machine learning. Some examples are voice recognition, face recognition, spam detection, product recommendations and many, many others. From a theoretical point of view, we could classify all these problems under two labels: classification and regression.
Today we will focus on classification; we will talk about regression in another article.
To start, let's picture the problem: we have some houses on a map and two different villages, and each house belongs to a single village.
The first machine learning problem is to generate a rule that allows us to classify the houses into their corresponding village. To do so we need some data; in this case we will use the position of each house, so we will have a list of houses, each characterized by a position (x, y). In Machine Learning, each house is called a sample, and both x and y, the components of the position of the house, are called the features of the sample.
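For readers who like to see things in code (entirely optional, the drawings are all you need), here is a minimal sketch of how such a list of samples could be written down. The `House` class and every coordinate are made up for illustration; they are not taken from the drawing.

```python
from dataclasses import dataclass


@dataclass
class House:
    x: float       # feature 1: horizontal position on the map
    y: float       # feature 2: vertical position on the map
    village: str   # the label: the village the house belongs to


# Hypothetical houses, invented for this example only.
houses = [
    House(1.0, 1.5, "black"),
    House(2.0, 0.5, "black"),
    House(1.0, 4.0, "white"),
    House(2.5, 4.5, "white"),
]

# Each sample is described by its features (x, y) and its label.
features = [(h.x, h.y) for h in houses]
labels = [h.village for h in houses]
```

Keeping the features and the labels in two parallel lists is just one convenient layout; what matters is that every sample carries both.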
Using this information (the position of each house and its village) we are able to create a rule that splits the houses of one village from the houses of the other village. This rule is called a classifier and intuitively will look like this.
So at this point, we have a set of houses and a line that allows us to tell whether one house belongs to the first village, or to the second one.
I know what you are thinking: "I already knew the village of each house beforehand, why would this be useful?" Well, suppose that some people build brand new houses on our map, marked with circles in the drawing:
These new houses have no village assigned at first, and we will have to decide whether they belong to village one or village two. Now that we have our super amazing classifier we can make this decision automatically, and we can say that the two houses below the line belong to the black village, and the house above the line belongs to the white village. Easy, huh?
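As an optional aside, the decision "below the line means black village, above the line means white village" could be sketched like this. The slope, the intercept and the positions of the new houses are assumptions chosen for the example, not values read from the drawing.

```python
def classify(x, y, slope=0.5, intercept=2.0):
    """Return the village for a house at position (x, y).

    The divider is the made-up line y = slope * x + intercept:
    houses above it go to the white village, the rest to the black one.
    """
    line_height = slope * x + intercept
    return "white" if y > line_height else "black"


# Three hypothetical new houses with no village assigned yet.
new_houses = [(1.0, 1.0), (3.0, 2.0), (2.0, 5.0)]
predictions = [classify(x, y) for x, y in new_houses]
print(predictions)  # ['black', 'black', 'white']
```

The whole classifier boils down to comparing the height of the house with the height of the line at the same x.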
Let's step back and look at the big picture. At the starting point, what we had was:
- A set of houses labeled with a village (we knew the village to which each one belongs). This is called the Training Set.
- Our aim was to create a rule capable of splitting our two villages. This is called the Classifier.
- And our goal was to be able to assign a village to any new house that appears on our map. These new houses are called the Test Set.
These three elements are the must-haves of any classification problem.
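If you want to see the three pieces working together, here is a small optional sketch. It uses a nearest-centroid rule, which is my own pick as one of the simplest classifiers and not necessarily the rule behind the drawing: each village is summarized by the average position of its houses, and a new house goes to the village whose average is closest. With two villages, this is the same as drawing a straight line halfway between the two averages. All positions are invented.

```python
def train(training_set):
    """Learn one centroid (average position) per village."""
    sums = {}
    for (x, y), village in training_set:
        sx, sy, n = sums.get(village, (0.0, 0.0, 0))
        sums[village] = (sx + x, sy + y, n + 1)
    return {v: (sx / n, sy / n) for v, (sx, sy, n) in sums.items()}


def predict(centroids, point):
    """Assign the village whose centroid is closest to the point."""
    px, py = point
    return min(
        centroids,
        key=lambda v: (centroids[v][0] - px) ** 2 + (centroids[v][1] - py) ** 2,
    )


# Training set: hypothetical houses whose village is already known.
training_set = [
    ((1.0, 1.0), "black"), ((2.0, 1.0), "black"), ((3.0, 2.0), "black"),
    ((1.0, 4.0), "white"), ((2.0, 5.0), "white"), ((3.0, 5.0), "white"),
]
# Test set: new houses with no village assigned.
test_set = [(2.0, 0.5), (2.5, 4.5)]

classifier = train(training_set)
predictions = [predict(classifier, house) for house in test_set]
print(predictions)  # ['black', 'white']
```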
This house problem was really well behaved: we had a bunch of houses and two villages, and we were able to split them with one line. This is the ideal case for any statistician or data scientist but, unfortunately, it is not a common one.
Data distributions can be really complex, and as the complexity increases, another question emerges: is the line that I've drawn the only possible line on the map that splits both villages? Or, in other words: how can I be sure that I'm drawing the line correctly?
Actually, there is no quick answer to those questions. If you play with this image for a while, you may find thousands of ways to draw this line and perfectly divide the space between the two villages. So, of all the candidates, which line does it best?
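As an optional illustration, the sketch below checks a handful of candidate lines against a small invented training set. The houses and the candidate (slope, intercept) pairs are assumptions made up for this example.

```python
# Hypothetical training set: positions and known villages.
training_set = [
    ((1.0, 1.0), "black"), ((2.0, 1.0), "black"), ((3.0, 2.0), "black"),
    ((1.0, 4.0), "white"), ((2.0, 5.0), "white"), ((3.0, 5.0), "white"),
]


def separates(slope, intercept, samples):
    """True if every white house is above the line and every black one is not."""
    for (x, y), village in samples:
        above = y > slope * x + intercept
        if above != (village == "white"):
            return False
    return True


# Candidate dividers written as (slope, intercept).
candidate_lines = [(0.0, 3.0), (0.5, 2.0), (-0.2, 3.5), (0.3, 2.5), (2.0, 0.0)]
perfect = [line for line in candidate_lines if separates(*line, training_set)]
print(len(perfect))  # 4
```

Four of the five candidates split the villages without a single mistake, so looking at the training set alone cannot tell us which one to prefer.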
We cannot know a priori how to draw the best divider, but we have techniques to evaluate how good a divider is. I will discuss this topic in another article, Generalization and Overfitting. For now, we've realized that the more complex the distribution of the data is, the more difficult it is to find a good classifier.
Now, you could be thinking that if real distributions are complex, and complex data needs complex models to get the job done, this problem scales really badly.
Ok, that's true... or not. This is the part of machine learning that mixes art and science. If we had an extremely complex distribution, our first option would probably be to build a really complex classifier; alternatively, we could try to represent the data in a different way. Imagine that you could transform the data from the representation on the left of the image below to the one on the right. Wouldn't that be nice?
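To give an optional taste of what "representing the data in a different way" can mean, here is a small sketch. The data is my own assumption and not the one in the image: one village sits near the centre of the map and the other surrounds it, so no single straight line can separate them. If each house is described by its distance to the centre instead of its (x, y) position, one simple threshold does the job.

```python
import math

# Hypothetical data: the black village near the centre of the map,
# the white village arranged around it.
samples = [
    ((0.5, 0.0), "black"), ((0.0, -0.5), "black"), ((-0.3, 0.4), "black"),
    ((3.0, 0.0), "white"), ((0.0, 3.0), "white"),
    ((-2.0, -2.0), "white"), ((2.0, -2.0), "white"),
]


def to_radius(point):
    """New representation: distance from the house to the centre (0, 0)."""
    x, y = point
    return math.hypot(x, y)


transformed = [(to_radius(p), village) for p, village in samples]


def classify_by_radius(radius, threshold=1.5):
    """A very simple rule in the new representation (threshold is made up)."""
    return "white" if radius > threshold else "black"


all_correct = all(classify_by_radius(r) == v for r, v in transformed)
print(all_correct)  # True
```

The classifier stayed trivial; all the work went into choosing a better description of each house.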
This technique is the basis of data processing, which will be the subject of the next article in the series Machine Learning explained with tissues.
Today we've discussed the basics of machine learning and introduced the concepts of classification and data distribution in a nutshell. In the following posts, we will dive into more advanced concepts such as data processing, overfitting, classification models and regression.
I hope you liked the article and, of course, any feedback is welcome.