Regression is a fundamental concept in statistics and machine learning used to model the relationship between a dependent variable (what you want to predict) and one or more independent variables (predictors). It's like drawing a "best-fit" line through data points to make predictions or understand trends.
The most common type is linear regression, which assumes a straight-line relationship. Other types include:
- Multiple linear regression: Uses multiple predictors.
- Logistic regression: For binary classification (e.g., yes/no outcomes).
- Polynomial regression: For curved relationships.
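For instance, polynomial regression can be fit with the same scikit-learn API by expanding the features first. A minimal sketch (the data here is invented and exactly quadratic, y = 2x²):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data with a curved relationship: y = 2 * x^2 (no noise)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 8, 18, 32, 50])

# Degree-2 polynomial regression: expand x into [x, x^2], then fit a line
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

# Predict at x = 6; should be close to 2 * 36 = 72
print(model.predict(np.array([[6]])))
```

Because the toy data is perfectly quadratic, the degree-2 fit recovers it almost exactly; real data would leave residual error.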
The simple linear regression equation is:

y = mx + b

where:
- ( y ): Predicted value (dependent variable).
- ( x ): Input feature (independent variable).
- ( m ): Slope (how much ( y ) changes per unit of ( x )).
- ( b ): Y-intercept (value of ( y ) when ( x = 0 )).
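To make the roles of ( m ) and ( b ) concrete, a tiny numeric check (the values m = 2 and b = 5 are arbitrary):

```python
m, b = 2.0, 5.0  # arbitrary slope and intercept for illustration

def predict(x: float) -> float:
    """Apply the line y = m*x + b."""
    return m * x + b

print(predict(0))  # at x = 0 the line returns the intercept: 5.0
print(predict(3))  # 2*3 + 5 = 11.0
```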
- Gather Data: Collect pairs of ( x ) (features) and ( y ) (targets).
- Fit the Model: Use an algorithm to compute ( m ) and ( b ).
- Evaluate: Check metrics like R-squared (how well the line fits data, 0–1 scale; closer to 1 is better).
- Predict: Use the equation for new ( x ) values.
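The Evaluate step can also be done by hand: R-squared compares the residual error of the fitted line to the spread of ( y ) around its mean. A sketch with toy numbers (invented for illustration), using np.polyfit for the fit:

```python
import numpy as np

# Toy data that is roughly linear
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Least-squares fit of a degree-1 polynomial: returns slope and intercept
m, b = np.polyfit(x, y, 1)

y_pred = m * x + b
ss_res = np.sum((y - y_pred) ** 2)    # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot       # close to 1 when the line fits well
print(round(r_squared, 3))
```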
```python
# Import libraries
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt  # For visualization

# Step 1: Sample data (house size in sq ft, price in $1000s)
X = np.array([[800], [1200], [1500], [2000], [2500]])  # Features (2D array required)
y = np.array([150, 220, 280, 350, 420])  # Targets

# Step 2: Fit the model
model = LinearRegression()
model.fit(X, y)

# Get slope (m) and intercept (b)
m = model.coef_[0]  # Slope
b = model.intercept_  # Intercept
print(f"Equation: y = {m:.2f}x + {b:.2f}")

# R-squared score
r_squared = model.score(X, y)
print(f"R-squared: {r_squared:.2f}")

# Step 3: Predict for new data (e.g., 1800 sq ft house)
new_size = np.array([[1800]])
predicted_price = model.predict(new_size)
print(f"Predicted price for 1800 sq ft: ${predicted_price[0]:.0f}k")

# Optional: Visualize
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, model.predict(X), color='red', label='Regression Line')
plt.xlabel('House Size (sq ft)')
plt.ylabel('Price ($1000s)')
plt.legend()
plt.show()
```

Expected Output:

```
Equation: y = 0.16x + 30.52
R-squared: 0.99
Predicted price for 1800 sq ft: $316k
```

(And a plot showing the data points and the fitted line.)

How It Works (Math Breakdown)
- Slope Calculation: m = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}, where ( n ) is the number of data points.
- Intercept: b = \frac{(\sum y) - m(\sum x)}{n}.
- For our data, plugging in the values gives m \approx 0.16 and b \approx 30.52.
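The closed-form formulas above can be checked directly against the example data:

```python
import numpy as np

# Same example data: house size (sq ft) and price ($1000s)
x = np.array([800, 1200, 1500, 2000, 2500], dtype=float)
y = np.array([150, 220, 280, 350, 420], dtype=float)
n = len(x)

# Closed-form least-squares slope and intercept
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
b = (np.sum(y) - m * np.sum(x)) / n
print(round(m, 2), round(b, 2))  # 0.16 30.52
```

These match the coefficients scikit-learn reports, since LinearRegression also minimizes the sum of squared residuals.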
- Advanced: Try multiple regression with more features or use statsmodels for detailed stats.
- Practice: Datasets like California Housing (via sklearn's fetch_california_housing) are great starters. (The classic Boston Housing dataset was removed from recent scikit-learn versions.)
- Want code for logistic regression, a full tutorial, or regression testing in Python?