Tuesday, 4 November 2025

Homework 5

Measures of Location and Dispersion, and their use

Introduction

The objective of this homework is to study the meaning and use of the mean and the variance, and how to apply them in both academic and more realistic scenarios.

This study presents the main measures of location and dispersion and explains how to compute the mean and variance using both classical (batch) and online approaches. Online methods are particularly relevant when data arrives sequentially or when storing all samples is not feasible.

Measures of Location

Arithmetic Mean

The arithmetic mean is the sum of all values divided by the number of observations. It represents the central value of additive data. The formula is simply:

x̄ = (x₁ + x₂ + ... + xₙ) / n

It is used as a baseline summary statistic for general distributions. It is easy to compute and widely used, but it is sensitive to outliers.
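
The sensitivity to outliers can be seen with a minimal Python sketch. The function name `arithmetic_mean` and the sample values are illustrative assumptions, not part of the homework:

```python
def arithmetic_mean(values):
    """Sum of all values divided by the number of observations."""
    if not values:
        raise ValueError("mean of an empty sample is undefined")
    return sum(values) / len(values)


# Illustrative data: the same sample with and without one extreme value.
print(arithmetic_mean([1, 2, 3, 4]))       # 2.5
print(arithmetic_mean([1, 2, 3, 4, 100]))  # 22.0
```

A single extreme value moves the mean from 2.5 to 22.0, well outside the bulk of the data.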

Median

The median is the middle value once the data is sorted. If n (the number of elements) is odd, it is the middle value; otherwise it is the average of the two middle values. It is used as a robust measure of location for skewed distributions or data with outliers, since it is barely affected by them, but it is less efficient than the mean for symmetric, light-tailed data. Writing x_(k) for the k-th smallest value, the two formulas are:

median = x_((n+1)/2)                      if n is odd
median = [ x_(n/2) + x_(n/2 + 1) ] / 2    if n is even
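
The odd/even rule can be sketched in a few lines of Python. This is only a minimal illustration; the name `median` and the test values are assumptions chosen for the example:

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    if not values:
        raise ValueError("median of an empty sample is undefined")
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                  # odd n: the single middle value
    return (s[mid - 1] + s[mid]) / 2   # even n: average of the two middle values


print(median([3, 1, 2]))           # 2
print(median([4, 1, 3, 2]))        # 2.5
print(median([1, 2, 3, 4, 100]))   # 3
```

In the last call the extreme value 100 leaves the median at 3, which illustrates why the median is considered robust.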

Mode

The mode is the most frequently occurring value. It is used with categorical data and for identifying peaks in distributions. It has the advantage of being applicable to discrete and categorical data, but it may not be unique and it can be unstable for small samples.

The mode is particularly meaningful for discrete data; in continuous distributions it corresponds to the peak of the density function.
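
Because the mode may not be unique, a sketch that returns every most-frequent value can be useful. The helper name `modes` is hypothetical; it relies only on `collections.Counter` from the Python standard library:

```python
from collections import Counter


def modes(values):
    """Return all values that share the highest frequency, in first-seen order."""
    if not values:
        raise ValueError("mode of an empty sample is undefined")
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


print(modes(["a", "b", "a"]))   # ['a']
print(modes([1, 2, 2, 3, 3]))   # [2, 3]
```

The second call returns two values, which is the non-uniqueness mentioned above.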

Measures of Dispersion

Variance

The variance measures the average squared deviation from the mean, so it shows how spread out a set of numbers is around the average. A high variance means that the data points are far from the mean and from each other, while a low variance means that the data points are close to the mean (and therefore to each other). The formula to calculate the variance is:

σ² = [ (x₁ − x̄)² + (x₂ − x̄)² + ... + (xₙ − x̄)² ] / n

When the data is a sample drawn from a larger population, the sum is usually divided by n − 1 instead of n.
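
A minimal batch (two-pass) sketch of this definition in Python follows. It assumes the population form, which divides by n as in "average squared deviation". The function name and the data are illustrative:

```python
def population_variance(values):
    """Average squared deviation from the mean (divides by n)."""
    n = len(values)
    if n == 0:
        raise ValueError("variance of an empty sample is undefined")
    mean = sum(values) / n                              # first pass: the mean
    return sum((x - mean) ** 2 for x in values) / n     # second pass: squared deviations


data = [2, 4, 4, 4, 5, 5, 7, 9]    # mean is 5
print(population_variance(data))   # 4.0
```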


Standard Deviation

Like the variance, the standard deviation measures how spread out the data values are, but it does so in the same units as the data. The variance, being the average of the squared differences, is measured in squared units: this is practical in some scenarios, but generally less intuitive than the standard deviation.

The standard deviation is the square root of the variance, so the formula is:

σ = √(σ²), where σ² is the variance
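
For a quick check, Python's standard `statistics` module provides `pvariance` and `pstdev` (population variance and standard deviation). The data below is an illustrative assumption, imagined as lengths in centimetres:

```python
import math
import statistics

# Hypothetical measurements in cm.
lengths_cm = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

variance_cm2 = statistics.pvariance(lengths_cm)   # expressed in cm squared
std_cm = statistics.pstdev(lengths_cm)            # expressed in cm, like the data

print(variance_cm2, std_cm)
# The standard deviation is the square root of the variance.
print(math.isclose(std_cm, math.sqrt(variance_cm2)))
```

The variance comes out in squared centimetres, while the standard deviation is back in centimetres and can be compared directly with the measurements.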

Interquartile Range (IQR)

It measures the spread of the central 50% of the data, so it is robust to outliers. The formula is simply IQR = Q3 − Q1, where Q3 is the third (or upper) quartile and Q1 is the first (or lower) quartile. To find the interquartile range, first order the data from lowest to highest, then find the median to divide the data into a lower and an upper half. Q1 is the median of the lower half, and Q3 is the median of the upper half.
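
The procedure above can be sketched in Python as follows. One detail is an assumption: when n is odd, this sketch leaves the overall median out of both halves, which is one common convention among several. The function names are illustrative:

```python
def median(values):
    """Median of a non-empty list."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2


def iqr(values):
    """Interquartile range via the median-of-halves method."""
    s = sorted(values)
    n = len(s)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = s[:half]
    # Assumption: for odd n the middle element belongs to neither half.
    upper = s[half:] if n % 2 == 0 else s[half + 1:]
    q1 = median(lower)
    q3 = median(upper)
    return q3 - q1


print(iqr([1, 2, 3, 4, 5, 6, 7, 8]))   # Q1 = 2.5, Q3 = 6.5 -> 4.0
print(iqr([1, 2, 3, 4, 5, 6, 7]))      # Q1 = 2,   Q3 = 6   -> 4
```

Library routines often use interpolation-based quartile definitions, so their results may differ slightly from this method on small samples.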

Range

As the name says, it measures the distance between the highest value and the lowest one. The formula is just a subtraction between the two values: R = max(xᵢ) − min(xᵢ).

Although simple and useful for quick assessments, the range is highly sensitive to outliers and is therefore rarely used alone.

Online (Recursive) Arithmetic Mean

Objective

Given a stream of values x₁, x₂, ..., xₙ, the online update of the arithmetic mean is:

meanₙ = meanₙ₋₁ + (xₙ − meanₙ₋₁) / n

with the natural base case:

mean₁ = x₁

This formula lets us update the mean in O(1) memory and O(1) time per sample, without the need to store all the past data.
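
A minimal Python sketch of this update is shown below. The class name `RunningMean` and its attributes are illustrative assumptions. Only the count and the current mean are stored:

```python
class RunningMean:
    """Online arithmetic mean: O(1) memory and O(1) time per sample."""

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # current mean (meaningful once n >= 1)

    def update(self, x):
        """Apply mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n and return the new mean."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self.mean


rm = RunningMean()
for x in [2.0, 4.0, 6.0, 8.0]:
    print(rm.update(x))   # 2.0, 3.0, 4.0, 5.0
```

With n = 1 the update reduces to mean₁ = x₁, so the base case needs no special handling.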

We can also give a short algebraic derivation.

Starting from the definition of the sample mean, meanₙ = (x₁ + x₂ + ... + xₙ) / n, let Sₙ = x₁ + ... + xₙ be the running sum. Then:

Sₙ = Sₙ₋₁ + xₙ
meanₙ = Sₙ / n = (Sₙ₋₁ + xₙ) / n

But Sₙ₋₁ = (n − 1) * meanₙ₋₁, hence:

meanₙ = [ (n − 1) * meanₙ₋₁ + xₙ ] / n = meanₙ₋₁ + (xₙ − meanₙ₋₁) / n

This prepares the ground for the online variance computation, where the numerical advantages become even more evident.

Why does it make sense?

The term (xₙ − meanₙ₋₁) represents how far the new value is from the current mean (a sort of "innovation"). Dividing by n damps that innovation more and more as the sample grows, since each new point has weight 1/n. As a result, early samples move the mean a lot, while later samples only fine-tune it.
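
The 1/n weighting can be made concrete with a small Python sketch. The helper `mean_shift` is hypothetical: it computes how much the mean would move if a value of 100 arrived as the n-th sample while the previous mean was 0:

```python
def mean_shift(prev_mean, x, n):
    """Correction applied to the mean when x arrives as the n-th sample."""
    return (x - prev_mean) / n


# Same innovation (100 - 0), arriving at different points in the stream.
for n in (1, 10, 100, 1000):
    print(n, mean_shift(0.0, 100.0, n))
```

The same innovation shifts the mean by 100 when it is the first sample, but by only 0.1 when it is the thousandth.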

Moreover, keeping a giant running sum Sₙ can overflow or lose precision for large n or large magnitudes. The recursive formula works directly on the mean and a small correction (xₙ − meanₙ₋₁) / n, which reduces both the risk of overflow and the accumulation of rounding errors.
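
The overflow risk can be illustrated with a deliberately extreme Python sketch. The values are contrived assumptions, chosen close to the largest representable double, and the function names are illustrative:

```python
import math


def batch_mean(values):
    """Mean via a running sum S_n, divided by n at the end."""
    total = 0.0
    for x in values:
        total += x
    return total / len(values)


def online_mean(values):
    """Mean via the recursive update, never forming the full sum."""
    mean = 0.0
    for n, x in enumerate(values, start=1):
        mean += (x - mean) / n
    return mean


huge = [1e308, 1e308, 1e308]   # contrived: each value is near the float64 maximum
print(batch_mean(huge))        # inf (the running sum overflows)
print(online_mean(huge))       # 1e+308
```

This does not make the recursive form immune to overflow: the difference (xₙ − meanₙ₋₁) can itself overflow for huge values of opposite sign. It only avoids accumulating a sum that grows with n.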

What about Online Variance?


While the arithmetic mean admits a simple and intuitive online update rule, the variance requires a more sophisticated approach. A naive online computation of variance can easily lead to numerical instability due to catastrophic cancellation and accumulation of floating-point errors.
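
To make the cancellation problem concrete, the following Python sketch assumes that "naive" means the shortcut formula (mean of squares minus square of the mean), accumulated in a single pass. That interpretation and the data are assumptions for illustration. The result is compared with a batch two-pass computation of the population variance:

```python
def naive_variance(values):
    """One-pass shortcut: E[x^2] - (E[x])^2. Prone to catastrophic cancellation."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    mean = total / n
    return total_sq / n - mean * mean


def two_pass_variance(values):
    """Batch reference: compute the mean first, then average the squared deviations."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


# Small spread on top of a large offset: the true population variance is 22.5.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
print(two_pass_variance(data))   # 22.5
print(naive_variance(data))      # far from 22.5 in double precision
```

The two terms subtracted in the shortcut formula are both close to 10¹⁸, where consecutive doubles are 128 apart, so a difference of 22.5 cannot survive the subtraction.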

For this reason, the online variance formula is derived using a well-established method (Welford's algorithm), which will be formally introduced and implemented in Homework 6 together with the full derivation of the recurrence equations.
