Every time an AI recognizes your face, translates a sentence, or recommends a movie — it is doing mathematics. Not magic. Just four core areas of math, working together at extraordinary speed.
A chef doesn’t need to understand chemistry to cook — but a great chef who understands chemistry can do things others cannot. Similarly, understanding AI’s mathematics lets you build systems others only dream of.
A vector is simply a list of numbers: [3, 5, 2]. It describes a point in space, a pixel’s colour, or a word’s meaning.
A matrix is a grid of numbers. It is a machine that transforms vectors — rotating them, stretching them, or projecting them into new dimensions.
A transformation is the act of applying a matrix to a vector, producing a new vector.
Imagine a recipe book that converts Indian measures to American ones. Every cup becomes 240ml, every tablespoon 15ml. This conversion grid is a matrix. Apply it to any recipe (vector) and you instantly get the converted recipe (new vector). AI does exactly this — billions of times per second.
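The conversion-grid idea can be sketched in a few lines of plain Python (no libraries). The 240 ml and 15 ml factors are the ones from the analogy; the two-unit recipe is a made-up example:

```python
# Matrix-vector multiplication: the "conversion grid" applied to a recipe.
# Each row of the matrix produces one entry of the output vector.

def transform(matrix, vector):
    """Apply a matrix to a vector: each output entry is a dot product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Diagonal conversion matrix: 1 cup -> 240 ml, 1 tablespoon -> 15 ml.
to_ml = [
    [240, 0],   # millilitres contributed per cup
    [0, 15],    # millilitres contributed per tablespoon
]

recipe = [2, 3]  # 2 cups of flour, 3 tablespoons of oil
print(transform(to_ml, recipe))  # -> [480, 45] (ml of each)
```

The same `transform` function works for any matrix, which is the point: rotation, stretching, and projection are all just different grids of numbers.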
Your family’s daily fruit bowl contains 2 bananas, 4 apples, 0 mangoes, 3 oranges. Write this as a vector. What is its “dimension”?
Answer: [2, 4, 0, 3] — dimension = 4 (four types of fruit).
You buy [2, 1, 3] kg of rice, dal, and sugar. Prices are [₹60, ₹100, ₹45] per kg. What is your total bill? (Hint: multiply each pair and add.)
Answer: (2×60) + (1×100) + (3×45) = 120 + 100 + 135 = ₹355.
A recipe uses [2 cups flour, 1 cup sugar]. A scaling matrix doubles everything. What are the new quantities?
Answer: [4 cups flour, 2 cups sugar] — you just applied a matrix!
ChatGPT & Large Language Models: Every word you type is converted to a vector of ~12,288 numbers. These vectors are then multiplied through 96 layers of matrices (transformers). The output vector is converted back into the most likely next word. You are talking to a machine that is doing matrix multiplication 96 times per token.
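The grocery-bill and recipe-doubling exercises above can be checked with a tiny dot-product helper (plain Python, written for this illustration):

```python
# A dot product: multiply each pair of entries and add them up.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Grocery bill: quantities dotted with prices.
quantities = [2, 1, 3]          # kg of rice, dal, sugar
prices = [60, 100, 45]          # rupees per kg
print(dot(quantities, prices))  # -> 355

# A scaling matrix that doubles a recipe.
double = [[2, 0],
          [0, 2]]
recipe = [2, 1]  # cups of flour, sugar
print([dot(row, recipe) for row in double])  # -> [4, 2]
```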
A derivative measures how fast something is changing. If you drive at 60 km/h, your speed is the derivative of your position.
A gradient is the multi-dimensional version — it points in the direction of steepest increase, like an arrow pointing uphill.
Gradient descent means walking downhill — opposite to the gradient — to find the lowest point (the minimum error).
Backpropagation is how neural networks calculate gradients for all their weights at once using the chain rule.
You are cooking soup and it is too salty. You taste it (measure the error). You add a little water (adjust in the direction that reduces error). You taste again. Adjust again. This loop — taste → measure error → adjust → repeat — is gradient descent. AI does this millions of times with numbers instead of salt.
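The taste → measure → adjust loop translates almost directly into code. A minimal sketch, using the squared error (guess − target)² as the thing being "tasted"; the target of 5 and the learning rate of 0.1 are arbitrary choices for illustration:

```python
# Gradient descent on error(guess) = (guess - target)**2.

def gradient(guess, target):
    # Derivative of (guess - target)**2 with respect to guess.
    return 2 * (guess - target)

target = 5.0          # the "right amount of salt"
guess = 3.0           # first attempt
learning_rate = 0.1   # how big each adjustment is

for step in range(50):
    # Walk downhill: move opposite to the gradient.
    guess -= learning_rate * gradient(guess, target)

print(round(guess, 3))  # very close to 5.0
```

Each pass through the loop is one "taste and adjust"; fifty passes is enough here, while a neural network runs the same loop millions of times over billions of weights.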
Your morning walk covers 6 km in 1 hour, then 4 km in the next hour. Is your speed increasing or decreasing? What is the “derivative” of your distance?
Speed = 6 km/h, then 4 km/h → decreasing. The derivative of your distance is your speed; the drop of 2 km/h per hour (−2 km/h²) is the rate of change of the speed itself.
AI predicts your house has 3 rooms. It actually has 5. Error = (3−5)² = 4. If the AI increases its prediction by 1 (to 4 rooms), error = (4−5)² = 1. Which direction should it adjust?
Increase the prediction! Error fell from 4 to 1. At a prediction of 3, the gradient of (x−5)² is 2(3−5) = −4, so moving against the gradient means moving right (increase).
If each step toward the correct answer overshoots and makes the AI oscillate back and forth, what should you change about the learning rate?
Decrease the learning rate — smaller steps prevent overshooting the minimum.
AlphaGo (DeepMind, 2016): Defeated the world Go champion using a neural network trained with gradient descent. The network played millions of games against itself, each time calculating gradients and adjusting billions of weights. The “blindfolded hiker” eventually found a valley of strategy no human had ever explored.
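The oscillation from the learning-rate exercise is easy to reproduce on the same squared error. A sketch with two made-up learning rates, one safe and one too large:

```python
# For error(x) = (x - 5)**2, gradient descent updates x -= lr * 2 * (x - 5).
# A learning rate above 1.0 overshoots by more than it corrects each step.

def run(lr, x=3.0, steps=5):
    history = [x]
    for _ in range(steps):
        x -= lr * 2 * (x - 5)
        history.append(round(x, 2))
    return history

print(run(0.25))  # converges smoothly: 3.0, 4.0, 4.5, ...
print(run(1.1))   # oscillates around 5 and diverges: 3.0, 7.4, 2.12, ...
```

With lr = 1.1 the distance to the minimum grows by 20% per step, so the guesses swing ever wider, exactly the back-and-forth the exercise describes.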
Probability is a number between 0 and 1 expressing how likely something is. 0 = impossible, 1 = certain.
Bayes’ Theorem says: your belief after seeing evidence = (your prior belief × how likely the evidence is given the belief) ÷ total probability of the evidence.
Statistics is the science of learning patterns from data — means, standard deviations, correlations, distributions.
In AI: every output of a language model or classifier is a probability distribution over possible answers.
You know your toaster burns toast 20% of the time when you leave it on “high.” This morning you smell something. Given that you do smell burning, what’s the probability your toast is burnt? Bayesian reasoning lets you update your estimate — turning a vague worry into a precise number. AI medical diagnosis works exactly this way: given these symptoms, what is the probability of each disease?
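Putting numbers on the toaster: the text gives only the 20% prior, so the two likelihoods below (how often you smell burning when the toast is burnt, and when it is not) are assumed purely for illustration:

```python
# Bayes' Theorem: posterior = likelihood * prior / evidence.

p_burnt = 0.20                 # prior: toaster burns toast 20% of the time
p_smell_given_burnt = 0.90     # assumption for this example
p_smell_given_ok = 0.10        # assumption for this example

# Total probability of smelling burning at all.
p_smell = (p_smell_given_burnt * p_burnt
           + p_smell_given_ok * (1 - p_burnt))

posterior = p_smell_given_burnt * p_burnt / p_smell
print(round(posterior, 2))  # -> 0.69
```

The smell alone turns a 20% worry into roughly a 69% belief, which is exactly the prior-to-posterior update a diagnostic AI performs per disease.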
In a bag of 10 biscuits, 6 are chocolate and 4 are plain. You pick one randomly. What is the probability it is chocolate?
P(chocolate) = 6/10 = 0.6 = 60%.
It rains 40% of days in July. When it rains, school closes 70% of the time. What is the probability of both rain AND school being closed tomorrow?
P(rain AND closed) = 0.40 × 0.70 = 0.28 = 28%.
You think your friend is lying 10% of the time (prior). You then notice they avoided eye contact. People avoiding eye contact when lying = 80% of the time. Update your belief: are they more likely lying now?
Yes! If people avoid eye contact far more often when lying than when telling the truth, Bayes’ Theorem raises the posterior probability of lying well above the 10% prior.
Google’s Medical AI (2020): A deep learning model trained on 100,000 chest X-rays uses probabilistic inference to detect lung cancer with higher accuracy than radiologists. For each scan it outputs P(cancer) as a number — not a yes/no — allowing doctors to see its confidence level and make better decisions alongside the AI.
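A quick numerical check of the three exercises above. The eye-contact exercise cannot be computed from the text alone, so the 30% rate of truthful people avoiding eye contact is an assumed number:

```python
# Biscuit draw: favourable outcomes over total outcomes.
p_chocolate = 6 / 10
print(p_chocolate)  # -> 0.6

# Rain AND school closed: multiplication rule with a conditional probability.
p_rain, p_closed_given_rain = 0.40, 0.70
print(round(p_rain * p_closed_given_rain, 2))  # -> 0.28

# Bayesian update on the lying friend.
prior = 0.10
p_evidence_given_lie = 0.80    # given in the exercise
p_evidence_given_truth = 0.30  # assumed for illustration
posterior = (p_evidence_given_lie * prior) / (
    p_evidence_given_lie * prior
    + p_evidence_given_truth * (1 - prior))
print(round(posterior, 2))  # -> 0.23
```

Even with that generous assumption, one glance away only lifts the belief from 10% to about 23%: evidence updates beliefs, it rarely settles them.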
An objective function (or loss function) is a mathematical formula that measures “how good” a solution is. AI tries to minimize it.
Constraints are the rules your solution must obey (e.g., budget limit, weight limit, time limit).
A local minimum is a valley — but there may be a deeper valley elsewhere (global minimum).
Modern AI uses Adam, SGD, L-BFGS, and other algorithms to navigate these landscapes for problems with billions of variables.
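A toy 1-D landscape makes the local-minimum trap concrete. The function (x² − 1)² + 0.3x is chosen purely for illustration: it has a shallow valley on the right and a deeper one on the left, and where gradient descent ends up depends entirely on where it starts:

```python
# f(x) = (x**2 - 1)**2 + 0.3*x has two valleys; the left one is deeper.

def f(x):
    return (x**2 - 1)**2 + 0.3 * x

def grad(x):
    # Derivative of f.
    return 4 * x * (x**2 - 1) + 0.3

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

right = descend(2.0)    # starts on the right, settles in the shallow valley
left = descend(-2.0)    # starts on the left, finds the deeper valley
print(round(right, 2), round(f(right), 2))
print(round(left, 2), round(f(left), 2))
```

Both runs stop where the gradient is zero, but only the left-hand start finds the global minimum: this is why optimizers like Adam add momentum and other tricks to escape shallow valleys.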
Every morning Google Maps solves an optimization problem: given road speeds, traffic, traffic lights, and turns — find the route with minimum travel time. There may be thousands of possible routes. The optimization algorithm (Dijkstra’s, A*) finds the best one in milliseconds. Reinforcement learning AI does the same — finding the best sequence of actions to maximize a reward.
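Dijkstra's algorithm itself fits in a screenful of Python. A sketch on a made-up four-node road network (the place names and travel times are invented):

```python
import heapq

def dijkstra(graph, start, goal):
    """Minimum travel time from start to goal.
    graph maps node -> list of (neighbour, minutes)."""
    best = {start: 0}
    queue = [(0, start)]  # priority queue ordered by travel time so far
    while queue:
        time, node = heapq.heappop(queue)
        if node == goal:
            return time
        if time > best.get(node, float("inf")):
            continue  # stale entry, a shorter route was already found
        for nxt, mins in graph[node]:
            t = time + mins
            if t < best.get(nxt, float("inf")):
                best[nxt] = t
                heapq.heappush(queue, (t, nxt))
    return float("inf")  # goal unreachable

roads = {
    "home":   [("market", 5), ("temple", 2)],
    "temple": [("market", 2), ("office", 8)],
    "market": [("office", 3)],
    "office": [],
}
print(dijkstra(roads, "home", "office"))  # -> 7, via temple and market
```

The direct-looking route (home → market → office) costs 8 minutes; the algorithm finds the 7-minute detour automatically, which is precisely what a maps app does at city scale.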
You want to bake the most delicious cake. You have ₹200 to spend on ingredients. You rate each ingredient’s contribution to deliciousness out of 10. How would you write this as an optimization problem?
Maximize: Σ(deliciousness_score × amount_used), subject to: Σ(price × amount) ≤ ₹200.
You are adjusting the sugar in a recipe. At 2 spoons it tastes ok (local minimum). At 4 spoons it tastes perfect (global minimum). Why might an AI get stuck at 2 spoons?
Because near 2 spoons, moving in either direction tastes worse — the AI sees a valley and stops, missing the deeper valley at 4.
A delivery driver must visit 5 houses in a city. The goal is minimum total distance. What are possible constraints?
Must visit all 5 houses. Must return to start. Roads are one-way. Max 4 hours. This is the famous “Travelling Salesman Problem”!
DeepMind’s AlphaFold (2020): Solved the 50-year-old protein folding problem by optimizing the 3D positions of amino acids to minimize an energy function. The search space has more configurations than atoms in the universe. AlphaFold’s optimization algorithms found the correct fold for nearly every known protein in hours — a discovery worth a Nobel Prize.
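At 5 houses the Travelling Salesman Problem can still be brute-forced, since there are only 5! = 120 possible routes; real solvers need far cleverer optimization. A sketch with an invented distance matrix:

```python
from itertools import permutations

# Symmetric distance matrix in km: index 0 is the start point, 1-5 are houses.
# The values are made up for illustration.
d = [
    [0, 2, 9, 10, 7, 3],
    [2, 0, 6, 4, 8, 5],
    [9, 6, 0, 8, 6, 7],
    [10, 4, 8, 0, 5, 9],
    [7, 8, 6, 5, 0, 4],
    [3, 5, 7, 9, 4, 0],
]

def tour_length(order):
    # Constraints from the exercise: visit every house, return to start.
    route = (0,) + order + (0,)
    return sum(d[a][b] for a, b in zip(route, route[1:]))

# Objective: minimize total distance over all 120 visiting orders.
best = min(permutations(range(1, 6)), key=tour_length)
print(best, tour_length(best))
```

Add one more house and the route count multiplies by 6; at 20 houses brute force is hopeless, which is why this problem made the optimization hall of fame.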