ML Principles


Last updated: 10 Oct 2025

ML Basics

Bayesian Decision Theory

Bayesian Classifier Decision

Discriminant Function $g_i(x)$:

Gaussian (Normal) Distribution

Plays a major role in many Bayesian classifiers. Two common types:

  1. Univariate Normal Distribution: models a single continuous variable (e.g. height, temperature).

  2. Multivariate Normal Distribution: models multiple features jointly, capturing correlations between them.

Key parameters: mean $\mu$, variance $\sigma^2$, covariance matrix $\Sigma$.
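Both densities can be sketched directly with NumPy. This is a minimal illustration (the function names are mine; in practice libraries such as `scipy.stats` provide these):

```python
import numpy as np

def univariate_normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at scalar x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def multivariate_normal_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at a d-dimensional point x."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)
```

With an identity covariance matrix the multivariate density factorizes into a product of univariate densities, which is a quick sanity check.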

Maximum Likelihood Estimation (MLE)

Maximum Posterior Estimation (MAP)

Decision Trees

[Figure: example decision tree]

Info Gain

Measures how much a feature tells us about the class label:

$I(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$

where,

Entropy of $X$: $H(X) = -\sum\limits_{i=1}^{n} P(X=i) \log_2 P(X=i)$
$H(X|Y=v) = -\sum\limits_{i=1}^{n} P(X=i|Y=v) \log_2 P(X=i|Y=v)$
$H(X|Y) = \sum\limits_{v\in Y} P(Y=v)\, H(X|Y=v)$

Note: to prevent overfitting,

  1. Limit the tree depth.

  2. Set a minimum number of samples required to split a node.

  3. Use cross-validation.

Steps to create a Decision Tree:

  1. Calculate the entropy of the entire dataset:
    $H(S) = -\sum\limits_{i=1}^{n} P(S=y_i) \log_2 P(S=y_i)$

  2. Calculate the expected entropy after splitting on feature $X_i$:
    $\sum\limits_k \frac{|S_k|}{|S|}\, H(S_k)$

  3. Calculate the information gain for feature $X_i$:
    $I(S,X_i) = H(S) - \sum\limits_k \frac{|S_k|}{|S|}\, H(S_k)$

  4. Split on features in descending order of information gain.
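The entropy and information-gain steps above can be sketched in a few lines of Python (a toy illustration with helper names of my own choosing):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """H(S) = -sum_i p_i log2 p_i, estimated from class frequencies."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain(labels, feature_values):
    """I(S, X) = H(S) - sum_k |S_k|/|S| * H(S_k), splitting S by feature value."""
    total = entropy(labels)
    n = len(labels)
    expected = 0.0
    for v in set(feature_values):
        subset = [y for y, x in zip(labels, feature_values) if x == v]
        expected += len(subset) / n * entropy(subset)
    return total - expected
```

A feature that perfectly separates the labels has gain equal to $H(S)$; a feature independent of the labels has gain 0.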

Linear Regression

Gradient Descent for Linear Regression:

Closed Form Solution:

| Closed Form Solution | Gradient Descent |
| --- | --- |
| Non-iterative | Requires multiple iterations |
| No need for a learning rate | Need to choose the right learning rate |
| Can be expensive for large $n$ | Works well for large $n$ |

NOTE => Standardize the features so they have similar scales. Transform them to have mean 0 and variance 1:
$\qquad x_j^{(i)} \leftarrow \frac{x_j^{(i)}-\mu_j}{s_j}, \qquad j=1,2,\dots,d$, where $\mu_j$ is the feature mean and $s_j$ the feature standard deviation (or range).
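As an illustration of the closed form vs. gradient descent comparison and the standardization note, here is a sketch on synthetic data (the data, learning rate, and iteration count are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) * [10.0, 0.1]   # two features on very different scales
y = 3.0 + 2.0 * X[:, 0] - 5.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Standardize: mean 0, variance 1 per feature, as in the note above.
mu, s = X.mean(axis=0), X.std(axis=0)
Xs = (X - mu) / s
A = np.c_[np.ones(len(Xs)), Xs]               # design matrix with intercept column

# Closed-form solution (normal equation): theta = (A^T A)^{-1} A^T y
theta_cf = np.linalg.solve(A.T @ A, A.T @ y)

# Batch gradient descent on the same standardized design matrix.
theta_gd = np.zeros(A.shape[1])
lr = 0.1
for _ in range(2000):
    grad = A.T @ (A @ theta_gd - y) / len(y)  # gradient of the mean squared error / 2
    theta_gd -= lr * grad

print(theta_cf, theta_gd)                     # the two estimates should agree closely
```

Standardization is what makes the single learning rate work well here: with the raw, differently-scaled features, gradient descent would need a much smaller step size and many more iterations.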

Logistic Regression

Gradient Descent for Logistic Regression
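A minimal sketch of batch gradient descent on the logistic (cross-entropy) loss, assuming labels in {0, 1} (the helper names and hyperparameters are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, lr=0.5, iters=5000):
    """Batch gradient descent for logistic regression.
    X: (n, d) feature matrix, y: (n,) labels in {0, 1}."""
    A = np.c_[np.ones(len(X)), X]        # intercept column
    theta = np.zeros(A.shape[1])
    for _ in range(iters):
        p = sigmoid(A @ theta)           # predicted P(y=1 | x)
        grad = A.T @ (p - y) / len(y)    # same form as linear regression's gradient
        theta -= lr * grad
    return theta
```

The gradient has the same shape as in linear regression, with the sigmoid of the linear score replacing the raw prediction.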

Multi-Class Classification

Logistic regression is inherently binary; to handle multi-class classification we adapt it with strategies such as:

  1. One-Versus-Rest (OvR)

    • Find K-1 classifiers f_1, f_2, ..., f_(K-1)

      • f_1 classifies 1 vs {2, 3, ..., K}

      • f_2 classifies 2 vs {1, 3, 4, ..., K}, and so on up to f_(K-1)

    • Pick the classifier with the highest probability score.

    • Advantage: simple and efficient; only K-1 classifiers to train.

    • Disadvantage: each binary subproblem is imbalanced, so it can be sensitive to dominant classes.

  2. One-Versus-One

    • Find K(K-1)/2 classifiers f(1,2), f(1,3), ..., f(K-1, K)

      • f(1,2) classifies 1 vs 2

      • f(1,3) classifies 1 vs 3, and so on

    • Run every pairwise classifier on a test point and choose the class that wins the most pairwise votes (majority voting).

    • Advantage: each classifier trains on only two classes, so the subproblems are smaller and more balanced.

    • Disadvantage: computationally expensive (a quadratic number of classifiers).

  3. Multinomial Logistic Regression (Softmax):

    • For C classes {1, 2, ... C}:

    • Softmax function: $p(y=c \mid x;\, \theta_1,\dots,\theta_C) = \frac{\exp(\theta_c^T x)}{\sum_{j=1}^C \exp(\theta_j^T x)}$

    • Often preferred because it models the class probabilities jointly, scales better, and gives cleaner decision boundaries.

    • Instead of modelling the log-odds of a single class, we compute a score for each class with a linear function, exponentiate the scores, and normalize with the softmax function. This yields an interpretable probability for every class and lets us optimize the cross-entropy loss across all classes simultaneously.
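The softmax step described above fits in a few lines (a sketch; subtracting the maximum score is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax_probs(Theta, x):
    """P(y=c | x) for each class c.
    Theta: (C, d) matrix of per-class weight vectors, x: (d,) feature vector."""
    scores = Theta @ x
    scores -= scores.max()   # stability: exp of large scores would overflow
    e = np.exp(scores)
    return e / e.sum()
```

The output is a proper probability distribution: non-negative entries that sum to 1, ordered the same way as the linear scores.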

Evaluate Performance

| Classification | Regression |
| --- | --- |
| Accuracy: proportion of correct predictions out of all, (TP+TN)/total. Use when classes are balanced. | Mean Squared Error (penalizes large errors) |
| Precision: proportion of positive predictions that were correct, TP/(TP+FP). Use when false positives are costly (e.g. spam detection). | MAE (all deviations treated equally) |
| Recall: proportion of actual positives that were correctly predicted, TP/(TP+FN). Use when missing positives is risky (e.g. cancer screening). | R-squared score (measures how much of the variability is explained by the model) |
| F1 score: harmonic mean of precision and recall, (2·Precision·Recall)/(Precision+Recall). Use when we want a middle ground between avoiding FP and FN. | |

The ROC Curve is a plot of the true positive rate against the false positive rate. It shows the tradeoff between sensitivity and specificity.

Confusion Matrix

| Actual \ Predicted | Yes | No |
| --- | --- | --- |
| Yes | TP | FN |
| No | FP | TN |
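The classification metrics above can be computed directly from these confusion-matrix counts (a small helper; the function name is mine):

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)                       # TP / predicted positives
    recall = tp / (tp + fn)                          # TP / actual positives
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

For example, 8 true positives, 2 false negatives, 1 false positive and 9 true negatives give accuracy 0.85 and recall 0.8.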

Evaluate Comparison Between Two Models

K-fold cross-validation

Hypothesis Testing

Support Vector Machines (SVM)

Kernels in SVM

K-Nearest Neighbours (KNN)

Principal Component Analysis (PCA)