In unsupervised learning algorithms, there is no label column; we only have feature columns.

In the previous article, I mentioned that in supervised learning we have a column called the label column. In a "Supervised Machine Learning Algorithm", this column is actually our target column: the machine learns from the values in this column and then predicts the label for the new rows we give it.

In Unsupervised Machine Learning, the goal is to identify patterns in the input data, so there is no label column.

We do not supervise the machine's learning; the machine itself groups similar input data.

To better understand how an “unsupervised learning algorithm” works, consider the following:

This example is about types of chairs.

The algorithm divides the inputs into three categories in its output. We haven't even told the machine what the inputs are; the algorithm has made the division based on the features it observed.

Because we don't teach the machine, it may make mistakes in its diagnoses and categories. This is one of the disadvantages of "unsupervised learning algorithms": our model may make more errors than "supervised learning algorithms" would.

In this example, we can see that our model has made an error in the third category and placed a ladder-back chair in the park-bench category because it resembles park benches!

Recommender systems are a widespread application of these types of algorithms. Such systems track and store the pages you visit on a site such as Amazon and then offer you products related to what you have viewed on that site; you will most likely like those products because they match what you are looking for.

Analyzing the behavior of social network users is another attractive use of these algorithms, especially for advertising companies. They can analyze what each person posts on their social network, find out what they like, and send them targeted ads.

Also, in the days before a presidential election, this method can be used to predict the outcome of the election.

**Clustering** and **dimensionality reduction** are two important categories of unsupervised learning algorithms.

We want to look at one of the clustering algorithms:

**K-means Clustering Algorithm**

One of the best and simplest unsupervised learning algorithms is k-means. In this algorithm, we must first decide and tell the machine how many clusters we want to divide the data into.

Suppose our data set is about types of chairs: we already know how many types of chairs are in our data, but we do not know which chair belongs to which type!

If we want to continue the first example and solve it with the k-means algorithm, we know that the number of our clusters is 3.

Our other assumption is that we have only two features of the chairs.

Now that we have the data and assumptions, we can review the algorithm together:

- Select K points randomly as the centers of the clusters.
- Repeat the following steps:
  - Assign each data point to the cluster with the nearest center.
  - Update the center of each cluster by averaging the data points assigned to that cluster.
- Stop when no data point changes its cluster in one iteration.

Another stopping condition can also be set. If we know that the algorithm may not behave properly under certain conditions, it is better to cap the number of iterations to avoid getting trapped in an infinite loop. For example, we can set the maximum number of iterations to 20.
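The steps above can be sketched in Python with NumPy. This is a minimal illustrative sketch, not code from the article: the function name, the toy data in the comments, and the 20-iteration cap are assumptions chosen to match the description.

```python
import numpy as np

def kmeans(X, k, max_iter=20, seed=0):
    """A minimal k-means sketch: X is an (m, n) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize the centers as k randomly chosen input points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):  # cap the iterations to avoid an infinite loop
        # Assignment step: each point goes to the cluster with the nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # stop: no point changed its cluster in this iteration
        labels = new_labels
        # Update step: move each center to the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```

Running this on two well-separated groups of points puts each group into its own cluster.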

The gray points indicate the path of the centers.

As you can see, in each iteration the randomly placed centers move closer to the true center of their cluster.

After clustering converges, this algorithm produces two outputs:

1- The centers of the clusters

2- The assignment of the data to clusters: for each data point, which cluster it belongs to.

**The k-means algorithm**

Inputs:

- Number of clusters: K
- Training set: {x^{(1)}, x^{(2)}, …, x^{(m)}}, where each x^{(i)} ∈ ℝ^{n}

Randomly initialize K cluster centroids μ_{1}, μ_{2}, …, μ_{K} ∈ ℝ^{n}

Repeat {

For i = 1 to m:
 c^{(i)} := index (from 1 to K) of the cluster centroid closest to x^{(i)}

For k = 1 to K:
 μ_{k} := mean (average) of the points x^{(i)} assigned to cluster k

}

Tips:

- In the training set, no "y" is specified for each "x": we only have inputs, and no output or label is given. This is exactly the difference between supervised and unsupervised learning. In supervised learning, we specify that x^{(1)} belongs to class y^{(1)}, x^{(2)} belongs to class y^{(2)}, and so on.
- In supervised learning, we had x_{0} in the training set, which was a bias term; by convention, it is dropped here.
- Each of our data points is a vector in an n-dimensional space, and the centers live in the same n-dimensional space. "∈ ℝ^{n}" means an n-dimensional vector.

**The objective function**

In this method, we need a score function to evaluate the quality of the clustering, so we define an objective function that measures it:

Symbols:

μ_{k}: center of cluster k

c^{(i)}: index of the cluster assigned to the data point x^{(i)}

μ_{c^{(i)}}: center of the cluster assigned to the data point x^{(i)}

J(c^{(1)}, …, c^{(m)}, μ_{1}, …, μ_{K}) = (1/m) Σ_{i=1}^{m} ‖x^{(i)} − μ_{c^{(i)}}‖²

This function is also called the **cost function**, and k-means tries to minimize it.
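Using the symbols above, the cost is the mean squared distance between each point and its assigned center. Below is a small illustrative sketch; the function name and the toy arrays are made up for the example.

```python
import numpy as np

def kmeans_cost(X, centers, labels):
    """J = (1/m) * sum_i ||x^(i) - mu_{c^(i)}||^2.
    centers[labels] looks up the center assigned to each data point."""
    return np.mean(np.sum((X - centers[labels]) ** 2, axis=1))

# Toy example: two points, both assigned to a single center at the origin.
X = np.array([[3.0, 4.0], [0.0, 0.0]])
centers = np.array([[0.0, 0.0]])
labels = np.array([0, 0])
print(kmeans_cost(X, centers, labels))  # (3**2 + 4**2 + 0) / 2 = 12.5
```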

**A better way to initialize cluster centers**

Randomly selecting a cluster center can cause a problem: that point may end up in a place to which no data belongs!

There is a simple solution to this problem. Instead of selecting arbitrary random points as centers, we randomly select input data points to serve as the centers. For example, if I want three centers, I select three of the input data points as cluster centers.

Note that the number of clusters can be at most the number of input data points: if we define a cluster for each data point, K equals m, so K ≤ m.
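Picking the centers from the data itself takes one line with NumPy. This is a sketch with made-up two-feature chair data; note that `replace=False` draws K distinct rows, which is exactly why K ≤ m is required.

```python
import numpy as np

# Hypothetical data: five chairs, each described by two features.
X = np.array([[1.0, 2.0], [5.0, 6.0], [9.0, 1.0], [4.0, 4.0], [7.0, 8.0]])
K = 3

rng = np.random.default_rng(42)
# replace=False picks K distinct data points, so it fails (as it should) if K > m.
centers = X[rng.choice(len(X), size=K, replace=False)]
print(centers.shape)  # (3, 2): three centers, each with two features
```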

**Local optimization and global optimization**

Another problem that may occur with the k-means algorithm is that it can stop at a local optimum. Consider the following figure:

As you can see, we asked for three clusters, and the algorithm delivered them. But this answer is not optimal: the value of the objective function is not minimized in this case! The algorithm has stopped at a local optimum.

Now look at this figure:

The arrangement of the data is the same as in the previous figure, but this time the cluster centers are located correctly and our objective function is minimal. This condition is called the global optimum.

A simple way to address this problem is to run the whole algorithm several times from the beginning (including the initialization of the cluster centers) instead of running it once.

Each time, we compute the objective function, and in the end the run with the minimum objective value is taken as the answer.
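The restart strategy can be sketched as follows: run k-means several times from fresh random initializations, score each run with the cost function, and keep the cheapest result. This is a self-contained illustrative sketch; the compact `kmeans_once` helper, the 10-run default, and the 20-iteration cap are assumptions, not part of the article.

```python
import numpy as np

def kmeans_once(X, k, rng, iters=20):
    # One full k-means run starting from k randomly chosen data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # Final assignment and cost with the converged centers.
    labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
    cost = np.mean(np.sum((X - centers[labels]) ** 2, axis=1))
    return cost, centers, labels

def kmeans_restarts(X, k, n_runs=10, seed=0):
    # Repeat the whole algorithm n_runs times; keep the run with the lowest cost.
    rng = np.random.default_rng(seed)
    return min((kmeans_once(X, k, rng) for _ in range(n_runs)), key=lambda run: run[0])
```

Because every restart draws a different initialization, a run that gets stuck in a local optimum is simply outvoted by a cheaper run.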

Another drawback of this algorithm is that it is not well suited to very large data sets, because it performs many calculations in every iteration.

Stay tuned for the continuation of this article, where we will explore this algorithm further.