The craze for machine learning has soared to new heights ever since a Harvard Business Review article called data scientist “the sexiest job of the 21st century”. Machine learning might sound like sorcery to some, but it builds on fairly elementary concepts from mathematics. So, as long as you did not skip the mathematics lectures at your university, you are good to go. Before we begin discussing these algorithms, you should know which platform to implement these ideas on. Machine learning and AI are logic-intensive fields, so they call for a simple language, such as Python.
Linear regression is perhaps the easiest algorithm in the domain of ML. It is helpful when the data roughly follows a straight line. A few anomalies (outliers) can be accommodated or, more appropriately, excluded from the set. In linear regression, a best-fit line is drawn through the data set. A best-fit line is the one that minimizes the total squared vertical distance between the line and the data points; it does not have to pass through any of them. See figure below.
Source – Mathematica
In the figure above, the blue dots represent data points, which are concentrated in the vicinity of the line and become sparser towards the end. The line shown is the best representation of a function that can predict the behaviour of the given data set.
Source – dataio
Once the values of a and b have been established in the equation y = ax + b, the value of y for any new value of x can be predicted. This is the basis of machine learning: utilizing past experience to predict future events.
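As a minimal sketch of how a and b in y = ax + b might be found, the snippet below applies the ordinary least-squares formulas in plain Python. The function names `fit_line` and `predict` and the sample numbers (hours studied against marks) are illustrative assumptions, not something taken from a particular library or data set.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the line always passes through the point (mean_x, mean_y).
    b = mean_y - a * mean_x
    return a, b


def predict(a, b, x):
    """Predict y for a new x using the fitted line."""
    return a * x + b


# Hypothetical data: hours studied vs. marks obtained.
hours = [1, 2, 3, 4, 5]
marks = [52, 58, 61, 70, 74]
a, b = fit_line(hours, marks)
print(f"y = {a:.2f}x + {b:.2f}")
print("Predicted marks for 6 hours:", round(predict(a, b, 6), 1))
```

In practice a library routine would usually do the fitting, but the hand-written version makes it clear that "learning" here amounts to computing two numbers from past data.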
If you haven’t noticed yet, linear regression is used when the values are continuous (like marks, percentages, etc.). When a relationship with boolean values (TRUE or FALSE) needs to be established, logistic regression is used instead.
This type of regression can be leveraged to determine whether an event will occur or not. In predicting such events there are only two possibilities: either the event occurs (1) or it does not (0).
So, if we are predicting whether a patient with certain symptoms is Covid positive, we do so by labeling sick patients as 1 and healthy patients as 0 in our data set.
Source – Logicworld
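To make the idea concrete, here is a rough sketch of logistic regression on a single feature, fitted with plain gradient descent. The "symptom score" feature, the toy labels, and the function names `sigmoid`, `train_logistic` and `classify` are all assumptions made up for illustration; the learning rate and number of epochs are likewise arbitrary choices that happen to work for this small data set.

```python
import math


def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def classify(w, b, x, threshold=0.5):
    """Return 1 (event occurs) if the predicted probability reaches the threshold."""
    return 1 if sigmoid(w * x + b) >= threshold else 0


# Hypothetical symptom scores; 1 = tested positive, 0 = tested negative.
scores = [0, 1, 2, 3, 6, 7, 8, 9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = train_logistic(scores, labels)
print("Probability of being positive at score 5:", round(sigmoid(w * 5 + b), 3))
print("Predicted label at score 8:", classify(w, b, 8))
```

The model outputs a probability between 0 and 1, and the threshold turns that probability into the TRUE/FALSE answer.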
Using the logistic function, you can estimate how, or whether, a virus will spread to a particular person. The scenario below should help you understand the logistic function in an easy way.
Suppose there is malware on a dedicated server that tends to replicate itself. Say there are 10,000 storage sectors on the server; the malware has only recently entered and has not replicated yet.
Let us assume the malware can only spread to one sector every day. On the very first day, say sector 1 was infected.
From sector 1, the malware then infected sector 2 (a randomly chosen sector).
The iteration continues until all 10,000 sectors have been infected with the malware.
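The scenario can be simulated in a few lines. The sketch below rests on an assumption that the text does not spell out: every infected sector tries to infect one randomly chosen sector per day, and hitting a sector that is already infected has no effect. Under that assumption the infected count starts slowly, grows quickly, and then levels off near 10,000, which is the S-shape associated with the logistic function. The function name `simulate_spread` and the seed are hypothetical.

```python
import random


def simulate_spread(total_sectors=10_000, seed=0):
    """Return the number of infected sectors at the end of each day."""
    rng = random.Random(seed)
    infected = {1}  # day 1: only sector 1 is infected
    history = [len(infected)]
    while len(infected) < total_sectors:
        # Assumption: every infected sector targets one random sector per day.
        targets = [rng.randint(1, total_sectors) for _ in range(len(infected))]
        # Targets that are already infected do not change the set.
        infected.update(targets)
        history.append(len(infected))
    return history


history = simulate_spread()
for day in range(0, len(history), 3):
    print(f"day {day + 1:>2}: {history[day]:>5} sectors infected")
```

The slowdown near the end comes from most random targets already being infected, so fewer and fewer new sectors are left to hit.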
Most analytical problems involve decision making, and KNN (k-nearest neighbours) models are among the most widely used algorithms for such problems in the field of machine learning. KNN is versatile: it handles both classification and regression, which is why it shows up in so many ML projects.
Source – towardsdatascience
Refer to the diagram shown above. The assumption that the KNN algorithm makes is that similar things exist in close proximity. For a moment, assume the dots represent birds. Similar dots mean birds of the same species. Dots with similar shapes but different colours should be treated as different species.
Now, a flock of birds will comprise birds of the same species. This is what the KNN algorithm intends to exploit. Similar data values tend to lie in the vicinity of one another in the feature space.
Say a function takes three input values: age, height (cm) and number of children. The table shows the range of values these fields can take.
|Field|Range|
|Height (cm)|30-200|
|Number of children|0-7|
The three fields may overlap with one another in certain cases, but values from one field will generally lie close to values that are alike.
I have demarcated boundaries between the various data sets. If I pick an arbitrary point anywhere, then based on where the point lies I can determine the data set to which the point belongs.
KNN is easier to implement than it seems. The steps are:
1. load the data
2. choose K, the number of neighbours to consider
3. for each example in the data –
a. calculate the distance between the query point and the current example
b. add the distance and the index of the example to an ordered collection
4. sort the collection by distance, in ascending order
5. pick the first K entries from the sorted collection
6. fetch the labels of the K entries selected
7. return the mean of the labels if it’s regression, and the mode if it’s classification.
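The steps above can be sketched in plain Python as follows. The function name `knn_predict`, the `task` parameter, and the toy data (height in cm and number of children, with made-up group labels "A" and "B") are illustrative assumptions. Euclidean distance is used here, although other distance measures are possible.

```python
import math
from collections import Counter


def knn_predict(data, labels, query, k, task="classification"):
    """Predict the label of `query` from its k nearest examples in `data`."""
    # Step 3: distance from the query to every example, stored with its index.
    distances = [(math.dist(point, query), index)
                 for index, point in enumerate(data)]
    # Step 4: sort by distance (ties fall back to the example index).
    distances.sort()
    # Steps 5 and 6: labels of the K nearest examples.
    nearest_labels = [labels[index] for _, index in distances[:k]]
    # Step 7: mean for regression, mode for classification.
    if task == "regression":
        return sum(nearest_labels) / len(nearest_labels)
    return Counter(nearest_labels).most_common(1)[0][0]


# Steps 1 and 2: hypothetical data (height in cm, number of children) and K = 3.
data = [(150, 0), (155, 1), (160, 0), (180, 3), (185, 4), (190, 3)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(data, labels, query=(158, 1), k=3))
```

Because height spans a much wider range than the number of children, it dominates the distance here; real uses of KNN usually rescale the features first.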
If the above three algorithms pushed your brain to its limit, there is more bad news. Bagging and random forest algorithms can have a significant learning curve. In short, this is where you will need to do most of the brainstorming. Luckily, we will only scratch the surface and leave the rest for you to figure out on your own.
Before we begin dismantling this strange-sounding algorithm, let us quickly get a feel for ensembles. Ensembling, in layman’s terms, means combining results from multiple learners for improved outcomes. This usually increases the accuracy of our ML algorithms, making them better suited to real-life use.
The idea behind ensembling is that an ensemble of learners working in unison performs better than a single learner. Think of it like multiple brains at work. Bagging is one of the three main ways in which ensembling is done, the other two being boosting and stacking.
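A bare-bones sketch of bagging might look like the following. Each learner is a one-split decision tree (a "stump") trained on a bootstrap sample, meaning a sample of the same size drawn with replacement, and the ensemble answers by majority vote. The helper names (`majority`, `fit_stump`, `bagging_fit`, `bagging_predict`), the choice of stumps as learners, and the toy one-dimensional data are assumptions made for illustration.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(sample):
    """Fit a one-split tree on (x, label) pairs; return (threshold, left, right)."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        left = [label for x, label in sample if x <= threshold]
        right = [label for x, label in sample if x > threshold]
        left_label = majority(left)
        right_label = majority(right) if right else left_label
        # Count how many training points this split would misclassify.
        errors = sum(label != (left_label if x <= threshold else right_label)
                     for x, label in sample)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return best[1:]


def bagging_fit(data, n_learners=25, seed=0):
    """Train one stump per bootstrap sample of the data."""
    rng = random.Random(seed)
    # Each learner sees a bootstrap sample: same size, drawn with replacement.
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_learners)]


def bagging_predict(stumps, x):
    """Combine the learners' predictions by majority vote."""
    votes = [left if x <= threshold else right for threshold, left, right in stumps]
    return majority(votes)


# Hypothetical data: (feature value, class label).
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
stumps = bagging_fit(data)
print(bagging_predict(stumps, 2), bagging_predict(stumps, 8))
```

Individual stumps can disagree because each one saw a slightly different sample, and the vote smooths out those differences.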
Random forest is one of the most robust machine learning algorithms, as it is able to perform both regression and classification. The name “forest” comes from the many decision trees the algorithm builds. Generally, the more decision trees there are, the more robust the predictions.
To classify a new object, each tree gives its own classification. The decisions from the trees may not agree, so it becomes the job of the forest to choose the classification that gets the maximum number of votes across the entire forest.
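The voting step on its own is tiny. In the sketch below the individual tree outputs are simply hard-coded as a hypothetical list, since building the trees is beyond what is covered here; `forest_vote` is an illustrative name, not a library function.

```python
from collections import Counter


def forest_vote(tree_predictions):
    """Return the class with the most votes and the full vote tally."""
    tally = Counter(tree_predictions)
    winner, _ = tally.most_common(1)[0]
    return winner, dict(tally)


# Hypothetical outputs of five decision trees for one new object.
predictions = ["sparrow", "crow", "sparrow", "sparrow", "crow"]
winner, tally = forest_vote(predictions)
print(winner, tally)
```

For regression, the forest would average the trees' numeric outputs instead of counting votes.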