Friday, October 17, 2025

What is Regression?

Regression is a fundamental concept in statistics and machine learning used to model the relationship between a dependent variable (what you want to predict) and one or more independent variables (predictors). It's like drawing a "best-fit" line through data points to make predictions or understand trends.

The most common type is linear regression, which assumes a straight-line relationship. Other common types include polynomial, multiple, and logistic regression.

In programming, "regression" can also refer to regression testing (re-testing code to ensure new changes don't break old functionality), but given our previous chat on Python basics, I'll focus on statistical regression here. If you mean something else, let me know!

Linear Regression: The Basics

Linear regression finds the equation y = mx + b, where:
  • y: Predicted value (dependent variable).
  • x: Input feature (independent variable).
  • m: Slope (how much y changes per unit of x).
  • b: Y-intercept (the value of y when x = 0).
The "best fit" minimizes the sum of squared errors (residuals) between actual and predicted values, calculated via the least squares method.
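To make "sum of squared errors" concrete, here is a tiny illustrative snippet; the points and the candidate line are made up (they are not the house-price data used later). Least squares simply picks the m and b that make this quantity as small as possible.

```python
# Illustration only: sum of squared errors (SSE) for one candidate line y = 0.2x + 10
points = [(10, 13), (20, 14), (30, 17)]   # made-up (x, y) pairs
m, b = 0.2, 10                            # a candidate slope and intercept
sse = sum((y - (m * x + b)) ** 2 for x, y in points)
print(sse)  # 2.0 for this line; least squares finds the m and b that minimize this
```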

Step-by-Step: How to Perform Linear Regression

  1. Gather Data: Collect pairs of x (features) and y (targets).
  2. Fit the Model: Use an algorithm to compute m and b.
  3. Evaluate: Check metrics like R-squared (how well the line fits the data, on a 0–1 scale; closer to 1 is better).
  4. Predict: Use the equation for new x values.

Implementing Linear Regression in Python

Python makes this easy with libraries like NumPy (for math) and scikit-learn (for ML models). Here's a simple example using synthetic data: predicting house prices (y) based on size (x).

Example Code
```python
# Import libraries
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt  # For visualization

# Step 1: Sample data (house size in sq ft, price in $1000s)
X = np.array([[800], [1200], [1500], [2000], [2500]])  # Features (2D array required)
y = np.array([150, 220, 280, 350, 420])               # Targets

# Step 2: Fit the model
model = LinearRegression()
model.fit(X, y)

# Get slope (m) and intercept (b)
m = model.coef_[0]      # Slope
b = model.intercept_    # Intercept
print(f"Equation: y = {m:.2f}x + {b:.2f}")

# R-squared score
r_squared = model.score(X, y)
print(f"R-squared: {r_squared:.2f}")

# Step 3: Predict for new data (e.g., 1800 sq ft house)
new_size = np.array([[1800]])
predicted_price = model.predict(new_size)
print(f"Predicted price for 1800 sq ft: ${predicted_price[0]:.0f}k")

# Optional: Visualize
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, model.predict(X), color='red', label='Regression Line')
plt.xlabel('House Size (sq ft)')
plt.ylabel('Price ($1000s)')
plt.legend()
plt.show()
```
Expected Output
Equation: y = 0.16x + 30.52
R-squared: 0.99
Predicted price for 1800 sq ft: $316k
(And a plot showing the data points and the fitted line.)

How It Works (Math Breakdown)
  • Slope Calculation: m = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}, where n is the number of data points.
  • Intercept: b = \frac{(\sum y) - m(\sum x)}{n}.
  • For our data (n = 5, \sum x = 8000, \sum y = 1420, \sum xy = 2554000, \sum x^2 = 14580000), plugging in gives m \approx 0.16 and b \approx 30.5, matching the scikit-learn output above.
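As a quick check on these formulas, here is a short NumPy snippet (not part of the original example) that computes m and b directly from the closed-form expressions above; it should agree with the coefficients scikit-learn reports.

```python
import numpy as np

# Same data as the example above (size in sq ft, price in $1000s)
x = np.array([800, 1200, 1500, 2000, 2500], dtype=float)
y = np.array([150, 220, 280, 350, 420], dtype=float)

n = len(x)
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
b = (np.sum(y) - m * np.sum(x)) / n
print(f"m = {m:.4f}, b = {b:.2f}")  # roughly m = 0.1584, b = 30.52
```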
This is a self-contained example; run it in your Python environment to see the plot! For real data, load it from a CSV with pandas (import pandas as pd): df = pd.read_csv('data.csv'); X = df[['size']]; y = df['price'].
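Spelled out a little more, that CSV route might look like the sketch below; the file name data.csv and the column names 'size' and 'price' are placeholders for whatever your dataset actually uses.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical CSV with 'size' and 'price' columns; adjust the names to your data
df = pd.read_csv('data.csv')
X = df[['size']]   # double brackets keep X two-dimensional, as scikit-learn expects
y = df['price']

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)
```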

Next Steps

  • Advanced: Try multiple regression with more features, or use statsmodels for detailed stats (a small sketch follows this list).
  • Practice: scikit-learn's built-in California Housing dataset (fetch_california_housing) is a good starter; the older Boston Housing dataset has been removed from recent scikit-learn versions.
  • Want code for logistic regression, a full tutorial, or regression testing in Python?
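For the "Advanced" bullet, here is a rough sketch of multiple regression with statsmodels on made-up data (two features); the numbers are illustrative only, and with so few rows the statistics are not meaningful. statsmodels' OLS summary reports coefficients, p-values, and confidence intervals that scikit-learn does not expose directly.

```python
import numpy as np
import statsmodels.api as sm

# Made-up data: price ($1000s) modeled from size (sq ft) and number of bedrooms
size = np.array([800, 1200, 1500, 2000, 2500], dtype=float)
bedrooms = np.array([2, 2, 3, 3, 4], dtype=float)
price = np.array([150, 220, 280, 350, 420], dtype=float)

X = sm.add_constant(np.column_stack([size, bedrooms]))  # adds the intercept column
result = sm.OLS(price, X).fit()
print(result.summary())  # coefficients, R-squared, p-values, confidence intervals
```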

Lecture Notes: Optimising Numerical Code

Prerequisites: basic Python programming; understanding of NumPy arrays and vector ...