Friday, October 17, 2025

What Are Predictors in Regression?

In the context of regression (building on our linear regression chat), predictors (also called independent variables, features, or explanatory variables) are the input variables used to forecast or explain the target (the dependent variable, or outcome). Think of them as the inputs that influence what you're trying to predict (strictly speaking, regression captures association, not causation).

  • Single Predictor: Simple linear regression (e.g., house price based on size alone: y = b_0 + b_1 \times \text{size}).
  • Multiple Predictors: Multiple linear regression extends this to several inputs (e.g., house price based on size, bedrooms, and location: y = b_0 + b_1 \times \text{size} + b_2 \times \text{bedrooms} + b_3 \times \text{location\_score}).
The goal is to find coefficients (b_1, b_2, \dots) that best weigh each predictor's impact on the target, minimizing prediction errors.
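
To make the single-predictor form concrete, here is a minimal sketch that just plugs a size into y = b_0 + b_1 \times \text{size}; the coefficient values are made up for illustration, not fitted to any data.
python
# Hypothetical coefficients for illustration only (not fitted to any data)
b0 = 50.0   # intercept, in $1000s
b1 = 0.1    # slope: price increase per extra sq ft, in $1000s

size = 1800  # predictor value: square footage
price = b0 + b1 * size
print(f"Predicted price: ${price:.0f}k")  # $230k with these made-up numbers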
Implementing Multiple Regression with Predictors in Python

Let's expand our house price example to use multiple predictors: size (sq ft), bedrooms, and age (years). We'll use scikit-learn again.

Example Code
python
# Import libraries
import pandas as pd  # For data handling
from sklearn.linear_model import LinearRegression

# Step 1: Sample data (as a DataFrame for clarity)
data = {
    'size': [800, 1200, 1500, 2000, 2500],      # Predictor 1: sq ft
    'bedrooms': [2, 3, 3, 4, 5],                # Predictor 2: number of beds
    'age': [10, 5, 8, 2, 1],                    # Predictor 3: years old
    'price': [150, 220, 280, 350, 420]          # Target: price in $1000s
}
df = pd.DataFrame(data)

# Predictors (X): All columns except target
X = df[['size', 'bedrooms', 'age']]
y = df['price']  # Target

# Step 2: Fit the model
model = LinearRegression()
model.fit(X, y)

# Coefficients (impact of each predictor)
print("Coefficients:")
for predictor, coef in zip(X.columns, model.coef_):
    print(f"{predictor}: {coef:.2f}")

print(f"Intercept (b0): {model.intercept_:.2f}")

# R-squared (overall fit)
r_squared = model.score(X, y)
print(f"R-squared: {r_squared:.2f}")

# Step 3: Predict for new data (e.g., 1800 sq ft, 3 beds, 4 years old)
# Use a DataFrame with the same column names to avoid a feature-name warning
new_house = pd.DataFrame([[1800, 3, 4]], columns=X.columns)
predicted_price = model.predict(new_house)
print(f"Predicted price: ${predicted_price[0]:.0f}k")
Expected Output
Coefficients:
size: 0.21
bedrooms: -33.78
age: -1.48
Intercept (b0): 74.13
R-squared: 1.00
Predicted price: $341k
  • Interpretation:
    • Size: +0.21 means ~$210 more per sq ft, holding bedrooms and age fixed.
    • Bedrooms: -33.78 looks counterintuitive; size and bedrooms are almost perfectly correlated in this toy data, so the model offloads most of their shared signal onto size. This is a classic multicollinearity effect (see the check below).
    • Age: -1.48 means ~$1.5k less per year older.
  • High R-squared (≈1.00) indicates a near-perfect fit, which is expected with only 5 data points and 4 fitted parameters.
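The sign flip on bedrooms can be confirmed by checking how correlated the predictors are. A quick sketch, reusing the df defined above:
python
# How strongly do the predictors move together? (reusing df from above)
print(df[['size', 'bedrooms']].corr())
# size and bedrooms correlate at roughly 0.99 in this toy data, so OLS
# cannot cleanly separate their effects and the bedrooms sign flips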
Manual Math Check (Closed-Form)

For verification, scikit-learn uses ordinary least squares (OLS). With our small dataset:
  • Mean X: size=1600, bedrooms=3.4, age=5.2; Mean y=284.
  • Compute betas via the matrix formula \beta = (X^T X)^{-1} X^T y, where X includes a leading column of ones for the intercept.
  • Plugging in yields coefficients matching the output above; a numerical check follows below.
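
As a numerical sanity check, here is a minimal NumPy sketch of that formula, building the design matrix with a leading column of ones for the intercept; it should reproduce scikit-learn's coefficients up to floating-point noise.
python
import numpy as np

# Design matrix: intercept column of ones, then the three predictors
X_mat = np.column_stack([
    np.ones(5),
    [800, 1200, 1500, 2000, 2500],  # size
    [2, 3, 3, 4, 5],                # bedrooms
    [10, 5, 8, 2, 1],               # age
])
y_vec = np.array([150, 220, 280, 350, 420])

# Solve the normal equations (X^T X) beta = X^T y;
# np.linalg.solve is numerically safer than forming the inverse explicitly
beta = np.linalg.solve(X_mat.T @ X_mat, X_mat.T @ y_vec)
print(beta)  # [intercept, size, bedrooms, age] -- should match the fit above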
Tips and Next Steps
  • Real Data: Use X = df.drop('price', axis=1) for quick predictor setup.
  • Evaluation: Beyond R², check Mean Absolute Error (MAE) with mean_absolute_error(y_true, y_pred).
  • Advanced: Handle categorical predictors (e.g., location) with one-hot encoding via pd.get_dummies() (see the sketch after this list).
  • Common Pitfall: Correlated or too many predictors? Use regularization like Ridge regression (Ridge(alpha=1.0)), also shown below.
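
As a sketch of the last two tips combined, the snippet below adds a hypothetical location column, one-hot encodes it with pd.get_dummies(), and fits a Ridge model; the location values are invented for illustration.
python
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

# Hypothetical categorical predictor, invented for illustration
df2 = df.copy()
df2['location'] = ['rural', 'suburb', 'suburb', 'city', 'city']

# One-hot encode the categorical column into 0/1 indicator predictors
X2 = pd.get_dummies(df2.drop('price', axis=1), columns=['location'])
y2 = df2['price']

# Ridge shrinks coefficients, which also tames correlated predictors
ridge = Ridge(alpha=1.0)
ridge.fit(X2, y2)

# In-sample MAE, just to show the evaluation call
print(f"MAE: {mean_absolute_error(y2, ridge.predict(X2)):.1f}")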
This builds directly on single-predictor regression—try tweaking the data in your IDE!
