The previous article covered logistic single (binary) classification and explained some implementation details. The next topic to explore is logistic multi-class classification.
If logistic single classification is a single classifier, then logistic multi-class classification is:
(1) A multi-class classifier.
(2) The weight is a matrix whose dimension is {number of features} × {number of categories}.
(3) The input data is a matrix whose dimension is {number of data} × {number of features}.
The difference is that the single (binary) classification weight is a single vector, while the multi-class weight is a matrix, because there are multiple categories to score.
Pay attention to what Softmax is here:
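For reference, the standard Softmax definition (the original does not spell out the formula) turns a score vector z over K categories into a probability distribution:

$$\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \dots, K$$

so every output lies between 0 and 1, and the K outputs sum to 1.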
By adding multiple layers of classifiers to increase the depth of learning, layer-by-layer filtering can in some situations improve the accuracy of the estimated value; you will not know until you actually test it.
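As an illustration of that stacking idea, here is a minimal NumPy sketch; the layer sizes and names such as W1 and W2 are assumptions for illustration, not from the original article:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    # subtract the row-wise max for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# hypothetical shapes: 100 samples, 3 features, 4 hidden units, 3 classes
X = np.random.randn(100, 3)
W1 = np.random.randn(3, 4)   # first layer: features -> hidden
W2 = np.random.randn(4, 3)   # second layer: hidden -> classes

h = relu(X @ W1)             # hidden layer, shape (100, 4)
p = softmax(h @ W2)          # class probabilities, shape (100, 3)
assert np.allclose(p.sum(axis=1), 1.0)  # each row sums to 1
```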
Pay particular attention to the matrix operations here. Let's compare the Single Classifier and the Multi-class Classifier:
Single Classifier
Suppose the data {X} has 100 entries and 3 feature fields, and the weight vector W has 3 elements; the program must check that the dimensions line up:
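A minimal NumPy sketch of those shapes (the variable names are illustrative, assuming the data layout above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.random.randn(100, 3)   # 100 entries, 3 feature fields
W = np.random.randn(3)        # one weight per feature

z = X @ W                     # (100, 3) @ (3,) -> (100,)
p = sigmoid(z)                # predicted probability per entry, in (0, 1)
print(p.shape)                # (100,)
```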
Multi-class Classifier
Suppose the data {X} has 100 entries, 3 feature fields, and 3 categories, and the weight matrix W has 3 × 3 elements; the program must check that the dimensions line up:
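And the corresponding multi-class sketch (again with illustrative names; softmax as defined earlier):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # stabilized
    return e / e.sum(axis=1, keepdims=True)

X = np.random.randn(100, 3)   # 100 entries, 3 feature fields
W = np.random.randn(3, 3)     # {features} x {categories}

Z = X @ W                     # (100, 3) @ (3, 3) -> (100, 3)
P = softmax(Z)                # one probability row per entry
print(P.shape)                # (100, 3); each row sums to 1
pred = P.argmax(axis=1)       # predicted category index per entry
```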
Summary
- The source of everything is to define a loss function, then use gradient descent to differentiate it with respect to each feature's weight variable, training until the most appropriate weight value for each feature is found. The trained weights can then be applied to plenty of past data or to future data:
(1) Regression: the loss function comes from the squared error; the direct meaning is to minimize the error of the predicted values generated by the feature weights.
(2) Classification: the loss function comes from cross entropy, as the two logistic cases below describe.
- Logistic single classification (a binary classifier) can be regarded as a single classifier. Through the Sigmoid activation function, the predicted value is compressed to between 0 and 1 and interpreted as a probability. The loss function is derived from the likelihood function over those probabilities, which is cross entropy; applying gradient descent to that loss yields the feature weights that maximize the probability of a correct classification prediction, which equivalently means the chance of error becomes smaller.
- Logistic multi-class classification combines multiple single-classifier weight vectors into one weight matrix and uses the Softmax activation function to judge the class by probability; for each data entry, the predicted probabilities of all classes sum to 1. The predicted result belongs to whichever category has the highest probability value, which is then matched against the actual value; training the weight matrix this way likewise minimizes the error of future predictions (see the training sketch after this list).
- A hidden layer can be computed with a feature weight matrix and then passed through a squashing function such as Sigmoid, Softmax, or ReLU to turn its values into something like probabilities. Hidden layers can be stacked, with the final layer generating the final predicted classification; the idea is that by converting each hidden layer's output into probabilities, layer-by-layer filtering and judgment yields the best classification prediction under the concept of total probability.
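To tie the summary together, here is a minimal, self-contained training sketch for the multi-class case: softmax outputs, cross-entropy loss, and plain gradient descent on the weight matrix. The random data, learning rate, and iteration count are all assumptions for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, n_features, n_classes = 100, 3, 3
X = rng.normal(size=(n, n_features))
y = rng.integers(0, n_classes, size=n)        # true category indices
Y = np.eye(n_classes)[y]                      # one-hot targets, (100, 3)

W = np.zeros((n_features, n_classes))         # {features} x {categories}
lr = 0.1

for step in range(200):
    P = softmax(X @ W)                        # predicted probabilities
    loss = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))  # cross entropy
    grad = X.T @ (P - Y) / n                  # gradient of the loss w.r.t. W
    W -= lr * grad                            # gradient descent step

pred = softmax(X @ W).argmax(axis=1)
print(loss, (pred == y).mean())               # final loss, training accuracy
```

The gradient line uses the standard softmax-plus-cross-entropy identity (the gradient with respect to the scores is simply P − Y), which is why no explicit derivative of softmax appears in the loop.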