ML Principles

Last updated : 10 Oct 2025

ML Basics

Bayesian Decision Theory

Bayesian Classifier Decision

Discriminant Function $g_i(x)$:

Equivalent Discriminants for Zero-One loss :

Decision Region for Binary classifier : the boundary between the regions is where the discriminant functions of the two classes are equal, i.e. $g_1(x) = g_2(x)$

Gaussian (Normal) Distribution

Plays a major role in many Bayesian classifiers. Types are :

  1. Univariate Normal Distribution models a single continuous variable (e.g. height, temperature)

  2. Multivariate Normal Distribution models multiple features together.

Key variables : mean $\mu$, variance $\sigma^2$, covariance matrix $\Sigma$.
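
As a minimal sketch of how a discriminant function is used here (assuming the common choice $g_i(x) = \ln p(x|c_i) + \ln P(c_i)$ with univariate Gaussian class-conditionals), a toy two-class decision; the means, variances, and priors are made-up illustration values:

```python
import numpy as np

def g(x, mu, var, prior):
    # Gaussian log-likelihood plus log-prior: g_i(x) = ln p(x|c_i) + ln P(c_i)
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var) + np.log(prior)

def decide(x):
    # Pick the class with the larger discriminant; the decision boundary
    # is exactly where g_1(x) = g_2(x).
    g1 = g(x, mu=0.0, var=1.0, prior=0.6)   # illustrative class-1 parameters
    g2 = g(x, mu=2.0, var=1.5, prior=0.4)   # illustrative class-2 parameters
    return 1 if g1 > g2 else 2

print(decide(0.5), decide(1.8))  # -> 1 2
```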

Maximum Likelihood Estimation (MLE)

Maximum a Posteriori Estimation (MAP)

Decision Trees


Info Gain

Measures how much a feature tells us about the class label:

$I(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$

where,

Entropy for X: $H(X) = -\sum\limits_{i=1}^n P(X=i) \log_2 P(X=i)$
Conditional entropy for a single value v: $H(X|Y=v) = -\sum\limits_{i=1}^n P(X=i|Y=v) \log_2 P(X=i|Y=v)$
Conditional entropy: $H(X|Y) = \sum\limits_{v\in Y} P(Y=v) \cdot H(X|Y=v)$

Note : To prevent Overfitting -

  1. Limit tree depth

  2. Set a minimum number of samples required to split a node.

  3. Use cross-validation.

Steps to create a Decision Tree :

  1. Calculate entropy for the entire dataset
    $H(S) = -\sum\limits_{i=1}^n P(S=y_i) \log_2 P(S=y_i)$

  2. Calculate expected entropy after splitting on feature $X_i$
    $\sum\limits_k \frac{|S_k|}{|S|}\ H(S_k)$

  3. Calculate Info Gain for feature $X_i$:
    $I(S,X_i) = H(S) - \sum\limits_k \frac{|S_k|}{|S|}\ H(S_k)$

  4. Split on the feature with the highest Info Gain, then repeat recursively on each subset.
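
A minimal sketch of steps 1-3 in Python/NumPy; the tiny feature and label arrays are made-up illustration data:

```python
import numpy as np

def entropy(labels):
    # H(S) = -sum_i P(S = y_i) * log2 P(S = y_i)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(x, y):
    # I(S, X) = H(S) - sum_k |S_k|/|S| * H(S_k)
    gain = entropy(y)
    for v in np.unique(x):
        mask = x == v
        gain -= mask.mean() * entropy(y[mask])
    return gain

x = np.array(["sunny", "sunny", "rain", "rain", "rain"])  # one categorical feature
y = np.array([0, 0, 1, 1, 0])                             # class labels
print(entropy(y), info_gain(x, y))                        # ~0.971, ~0.420
```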

Linear Regression

Gradient Descent for Linear Regression:

Closed Form Solution:

| Closed Form Solution | Gradient Descent |
| --- | --- |
| Non-iterative | Requires multiple iterations |
| No need for a learning rate | Need to choose the right learning rate |
| Can be expensive for large n | Works well for large n |

NOTE => Standardize the features so they have similar scales: transform them to have mean 0 and variance 1:
$\qquad x_j^{(i)} \leftarrow \frac{x_j^{(i)}-\mu_j}{s_j}, \qquad j=1,2,\dots,d$, where $\mu_j$ = feature mean, $s_j$ = feature standard deviation (or range)
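
A minimal sketch contrasting the two columns of the table above on synthetic data, with the standardization step applied first; the shapes, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) * [10.0, 0.1]            # features on very different scales
y = X @ np.array([3.0, -2.0]) + 5 + rng.normal(scale=0.1, size=100)

# Standardize: x_j <- (x_j - mu_j) / s_j
X = (X - X.mean(axis=0)) / X.std(axis=0)
Xb = np.hstack([np.ones((100, 1)), X])                 # prepend a bias column

# Closed form (normal equation): theta = (X^T X)^{-1} X^T y
theta_cf = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Gradient descent on the mean squared error
theta_gd = np.zeros(3)
lr = 0.1
for _ in range(1000):
    grad = 2 / len(y) * Xb.T @ (Xb @ theta_gd - y)
    theta_gd -= lr * grad

print(theta_cf, theta_gd)   # the two solutions should agree closely
```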

Logistic Regression

Gradient Descent for Logistic Regression

Multi-Class Classification

Logistic regression is naturally binary; we have to adapt it for multi-class classification using strategies such as :

  1. One-Versus-Rest (OvR)

    • Train K-1 classifiers f1, f2, ..... f_(K-1)

      • f1 classifies 1 vs {2, 3, .... K}

      • f2 classifies 2 vs {1, 3, 4, ..... K}, and so on up to f_(K-1)

    • Pick the classifier with the highest probability score.

    • Advantage : Good if classes are imbalanced

    • Disadvantage : Can be sensitive to dominant classes.

  2. One-Versus-One

    • Train K(K-1)/2 classifiers f(1,2), f(1,3), ...., f(K-1, K)

      • f(1,2) classifies 1 vs 2

      • f(1,3) classifies 1 vs 3, so on and so forth

    • Each class is assigned a binary code; choose the class whose code is closest to the predicted bits (as in error-correcting output codes; plain one-versus-one instead takes a majority vote over the pairwise classifiers)

    • Advantage : Robust to errors and allows flexibility (powerful for ensemble methods)

    • Disadvantage : Computationally expensive

  3. Multinomial Logistic Regression (Softmax):

    • For C classes {1, 2, ... C}:

    • Softmax function : $p(y=c \mid x; \theta_1,\dots,\theta_C) = \frac{\exp(\theta_c^T x)}{\sum_{j=1}^C \exp(\theta_j^T x)}$

    • Better because it offers joint probability modeling, better scalability, and cleaner decision boundaries.

    • Instead of modelling the log odds of a single class, we compute a score for each class with a linear function, exponentiate the scores, and normalize with the softmax function. This yields an interpretable probability for each class and lets us optimize the cross-entropy loss across all classes simultaneously, as in the sketch below.
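
A minimal sketch of multinomial logistic regression on made-up data: linear scores per class, softmax normalization, and gradient steps on the cross-entropy loss (whose gradient is $X^T(P - Y)/n$); the sizes and learning rate are illustrative assumptions:

```python
import numpy as np

def softmax(scores):
    # Subtract the row max before exponentiating for numerical stability
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, C = 6, 3, 4                          # illustrative sizes
X = rng.normal(size=(n, d))
y = rng.integers(0, C, size=n)             # integer class labels
Y = np.eye(C)[y]                           # one-hot targets

Theta = np.zeros((d, C))                   # one theta_c per class, as columns
lr = 0.5
for _ in range(200):
    P = softmax(X @ Theta)                 # p(y=c | x) for every class
    Theta -= lr * X.T @ (P - Y) / n        # cross-entropy gradient step

print(softmax(X @ Theta).round(2))         # per-row class probabilities
```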

Evaluate Performance

| Classification | Regression |
| --- | --- |
| Accuracy : proportion of correct predictions out of all, (TP+TN)/total. Use when classes are balanced | Mean Squared Error (penalizes large errors) |
| Precision : proportion of positive predictions that were correct, TP/(TP+FP). Use when FPs are costly (e.g. spam detection) | MAE (all deviations treated equally) |
| Recall : proportion of actual positives that were correctly predicted, TP/(TP+FN). Use when missing positives is risky (e.g. cancer screening) | R-Squared Score (measures how much of the variability is explained by the model) |
| F1 score : harmonic mean of precision & recall, (2 * Precision * Recall)/(Precision + Recall). Use when we want a middle ground between avoiding FPs and FNs | |

ROC Curve is a plot of the TP rate against the FP rate. It shows the tradeoff between sensitivity and specificity.

Confusion Matrix

| Actual \ Predicted | Yes | No |
| --- | --- | --- |
| Yes | TP | FN |
| No | FP | TN |
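
A minimal sketch computing the classification metrics above from the confusion-matrix counts; the prediction arrays are made-up illustration values (1 = positive, 0 = negative):

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))

accuracy  = (tp + tn) / len(y_true)                        # (TP+TN)/total
precision = tp / (tp + fp)                                 # TP/(TP+FP)
recall    = tp / (tp + fn)                                 # TP/(TP+FN)
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)                     # 0.75 0.75 0.75 0.75
```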

Evaluate Comparison between 2 models

K-fold cross-validation

Hypothesis Testing

Support Vector Machines (SVM)

Kernels in SVM

K-Nearest neighbour

Principal Component Analysis (PCA)

Ensemble Learning

Adaboost (Boosting)

Data PreProcessing

Clustering

Params of clustering :

Clustering Techniques


  1. Hierarchical : builds a tree of clusters (dendrogram)
    • Agglomerative (Bottom-Up) : start with each data point as its own cluster and merge the closest pairs
    • Divisive (Top-Down) : start with one cluster and recursively split
  2. Partitional : divides data into a set number of clusters
    • K-Means : assigns points to clusters while minimizing within-cluster variance (see the sketch after this list)
  3. Density-Based : finds clusters as high-density regions, e.g. DBSCAN
  4. Model-Based : assumes the data follows a mixture of underlying distributions (e.g. Gaussian Mixture Models).
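
A minimal sketch of K-Means from item 2 above: alternate between assigning points to the nearest center (L2 norm) and recomputing each center as its cluster mean. Initialization here is plain random sampling (K-Means++ would seed more carefully), and the two-blob data is made up:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest center by squared L2 distance
        labels = ((X[:, None] - centers) ** 2).sum(axis=2).argmin(axis=1)
        # Update step: each center becomes the mean of its cluster
        # (keep the old center if a cluster goes empty)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])
labels, centers = kmeans(X, k=2)
print(centers)   # centers should land near (0, 0) and (4, 4)
```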

K-Means Clustering

K-Means ++

K-Medoids

| K-Medoids | K-Means |
| --- | --- |
| Centers are selected from samples | Centers are calculated (may not be data pts) |
| Similarity metric : L1 norm | Similarity metric : L2 norm |
| More robust to noise & outliers | |

Hierarchical Clustering

Probabilistic Clustering

Gaussian Mixture Model (GMM)

Expectation Maximization


Density-Based Spatial Clustering of Applications with Noise (DBSCAN)