Lecture Notes: Optimising Numerical Code
Prerequisites:
- Basic Python programming
- Understanding of NumPy arrays and vector operations
- Familiarity with loops and basic function definitions
What you'll be able to do:
- Explain the benefits of vectorised computation and identify when to use it.
- Apply NumPy vectorised patterns to replace explicit Python loops.
- Understand Cython and Numba as tools to accelerate numerical code.
- Profile Python code to identify bottlenecks and optimise performance.
1. Introduction: What is Optimising Numerical Code and Why Should You Care?
Core Definition
Optimising numerical code refers to the practice of writing Python programs that perform numerical computations efficiently. It involves using strategies such as vectorised operations, just-in-time compilation, and profiling to reduce computation time, memory usage, and unnecessary overhead. The goal is to make code run faster while maintaining correctness.
A Simple Analogy
Think of solving a big puzzle. You could pick up one piece at a time (Python loops), or you could use a system that grabs many pieces at once and places them efficiently (vectorised operations). Both solve the puzzle, but the latter is faster and more efficient.
Limitation: This analogy captures the speed difference but not the low-level memory layout and compilation tricks that Cython or Numba use.
Why This Matters to You
Problem it solves: Python loops are slow for large numerical datasets. As data grows, computation time increases drastically if inefficient code is used.
What you'll gain:
- Faster computations, especially for large datasets.
- Cleaner, more readable code through vectorised patterns.
- The ability to profile and optimise real-world applications.
Real-world context: Companies handling financial analytics, scientific simulations, and machine learning rely on these optimisations to save hours of computation time.
2. The Foundation: Core Concepts Explained
Concept A: Vectorised Patterns
Definition: A vectorised pattern replaces explicit loops with operations on entire arrays, leveraging the optimised C and Fortran code that NumPy uses under the hood.
Key characteristics:
- Broadcasting: Automatically aligns arrays of different shapes for operations.
- Memory efficiency: Avoids per-element Python objects and interpreter overhead.
- Speed: Uses low-level compiled implementations for faster computation.
A concrete example:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Vectorised addition
c = a + b
print(c)
```

Output: [11 22 33 44]
Common confusion: Beginners often write loops for element-wise operations, not realising that NumPy can handle entire arrays at once.
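To make the contrast concrete, here is a minimal side-by-side sketch (the function names are illustrative, not part of any library):

```python
import numpy as np

# Loop version: one Python-level operation per element
def squares_loop(values):
    out = []
    for v in values:
        out.append(v * v)
    return out

# Vectorised version: a single NumPy expression; the loop runs in C
def squares_vectorised(values):
    arr = np.asarray(values)
    return arr * arr

data = [1, 2, 3, 4]
print(squares_loop(data))        # [1, 4, 9, 16]
print(squares_vectorised(data))  # [ 1  4  9 16]
```

Both produce the same squares; the vectorised form pushes the per-element work out of the interpreter.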
Concept B: Cython/Numba Primer
Definition: Cython and Numba are tools that compile Python code for speed:
- Cython: Translates Python (with optional static type annotations) to C, which is compiled ahead of time for faster execution.
- Numba: Uses just-in-time (JIT) compilation to optimise numerical functions at runtime.
How it relates to vectorisation: While vectorisation works on whole arrays, Cython/Numba can speed up loops and functions that cannot easily be vectorised.
Key characteristics:
- Cython: Static typing, explicit compilation step, generates .c files.
- Numba: Dynamic, minimal code change, uses the @jit decorator.
- Targeted acceleration: Both work well for math-heavy loops.
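For contrast with the Numba example below, a Cython version of the same idea might look like this sketch (a hypothetical squares.pyx module; it must be compiled with cythonize before use, so it is shown only to illustrate the static-typing style):

```
# squares.pyx -- hypothetical Cython module; compile with cythonize
def sum_squares(long n):
    cdef long i           # static C types: the loop body
    cdef long total = 0   # compiles down to plain C arithmetic
    for i in range(n):
        total += i * i
    return total
```

The cdef declarations are what let Cython drop Python object handling inside the loop.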
A concrete example:

```python
from numba import jit

@jit
def sum_squares(n):
    total = 0
    for i in range(n):
        total += i ** 2
    return total

print(sum_squares(1000000))
```
Common confusion: Beginners think @jit magically optimises everything. It only accelerates numeric-heavy loops and functions, not general Python operations such as printing or I/O.
How Vectorised Patterns and JIT Work Together
Vectorisation removes Python loops where possible, while JIT optimises the loops that remain. Think of vectorisation as a first-level speedup and JIT as a second-level accelerator. Together, they can approach C-level performance in Python code.
3. Seeing It in Action: Worked Examples
Example 1: Vectorised Sum of Squares
Scenario: You need the sum of squares of the numbers 1 to 1,000,000.
Our approach: Use a vectorised NumPy expression instead of a Python loop.
```python
import numpy as np

# Step 1: Create an array (int64 so the sum does not overflow on
# platforms where NumPy's default integer type is 32-bit)
arr = np.arange(1, 1000001, dtype=np.int64)

# Step 2: Compute the sum of squares, vectorised
result = np.sum(arr**2)

# Result
print(result)
What's happening here: NumPy computes arr**2 in C internally, and np.sum reduces the array efficiently. No Python loop overhead exists.
The output/result: 333333833333500000
Key takeaway: Vectorised operations dramatically reduce runtime for large datasets.
⚠️ Common Misconception: Some believe Python loops are necessary for clarity. Vectorisation can be both clearer and faster.
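One way to check the speedup yourself is a quick wall-clock comparison (a rough sketch using time.perf_counter; the function names are illustrative, and exact timings vary by machine):

```python
import time
import numpy as np

# Pure Python loop version
def sum_squares_pyloop(n):
    total = 0
    for i in range(1, n + 1):
        total += i * i
    return total

# Vectorised version (int64 guards against overflow on
# platforms with a 32-bit default integer type)
def sum_squares_numpy(n):
    arr = np.arange(1, n + 1, dtype=np.int64)
    return int(np.sum(arr * arr))

n = 1_000_000
t0 = time.perf_counter()
loop_result = sum_squares_pyloop(n)
t1 = time.perf_counter()
vec_result = sum_squares_numpy(n)
t2 = time.perf_counter()

assert loop_result == vec_result == 333333833333500000
print(f"loop: {t1 - t0:.3f}s, vectorised: {t2 - t1:.3f}s")
```

On typical hardware the vectorised version is one to two orders of magnitude faster.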
Example 2: Using Numba to Speed Up Loops
Scenario: Computing the same sum of squares with a Python loop is too slow.
Our approach: Apply @jit from Numba to accelerate the function.
```python
from numba import jit

@jit
def sum_squares_loop(n):
    total = 0
    for i in range(1, n + 1):
        total += i**2
    return total

print(sum_squares_loop(1000000))
```
What's happening here: Numba compiles the Python loop to machine code at runtime, so the loop executes at near C speed.
The output/result: The same numeric result, with runtime reduced significantly after the first (compiling) call.
Key takeaway: JIT is a quick way to speed up numeric loops that cannot be vectorised.
⚠️ Common Misconception: Adding @jit to a function that mostly calls Python built-ins does not speed up computation.
Example 3: Broadcasting for Efficient Array Operations
Scenario: You have two arrays of shapes (1000, 1) and (1, 500). You want all pairwise sums.
Our approach: Use broadcasting instead of nested loops.
```python
import numpy as np

A = np.arange(1000).reshape(1000, 1)
B = np.arange(500).reshape(1, 500)

# Broadcasting addition
C = A + B
print(C.shape)
```
What's happening here: NumPy virtually stretches each array to the matching shape without materialising the stretched copies in memory; only the (1000, 500) result is allocated.
The output/result: (1000, 500)
Key takeaway: Broadcasting avoids explicit loops and unnecessary memory use.
⚠️ Common Misconception: Beginners write nested for loops over each element, which is far slower.
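The broadcasting rule itself is easy to check on small shapes: two dimensions are compatible when they are equal or one of them is 1. A small sketch:

```python
import numpy as np

# Compatible: (3, 1) and (1, 4) broadcast to (3, 4)
A = np.arange(3).reshape(3, 1)
B = np.arange(4).reshape(1, 4)
C = A + B
print(C.shape)   # (3, 4)
print(C[2, 3])   # 5, i.e. A[2, 0] + B[0, 3]

# Incompatible: (3,) and (4,) -- neither dimension is 1
try:
    np.arange(3) + np.arange(4)
except ValueError as err:
    print("broadcast error:", err)
```

Checking shapes this way on toy arrays is a cheap habit before running a computation on large data.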
Example 4: Combining Vectorisation with Numba
Scenario: You want to apply a complex formula to a 1D array that cannot be fully vectorised.
Our approach: Vectorise where possible, and use @jit for the remaining loops.
```python
import numpy as np
from numba import jit

@jit
def complex_transform(arr):
    # Allocate a float result explicitly: zeros_like(arr) would be
    # integer-typed for an integer input, silently truncating the output.
    res = np.zeros(arr.size, dtype=np.float64)
    for i in range(arr.size):
        res[i] = np.sin(arr[i])**2 + np.log1p(arr[i])
    return res

arr = np.arange(1, 1000000, dtype=np.float64)
result = complex_transform(arr)
```
What's happening here: Numba compiles the whole loop, including the supported NumPy functions np.sin and np.log1p, to machine code.
The output/result: An array of transformed values, computed much faster than pure Python.
Key takeaway: Combining vectorisation with JIT provides flexibility and speed for complex computations.
⚠️ Common Misconception: Some think JIT alone suffices. Combining techniques yields the best performance.
4. Common Pitfalls: What Can Go Wrong and How to Avoid It
The Mistake: Using Python loops for large arrays.
Why It's a Problem: Slow execution and high per-element overhead.
The Right Approach: Use vectorised NumPy operations.
Why This Works: Computations run in compiled C code with low overhead.
The Mistake: Assuming @jit optimises all code.
Why It's a Problem: Functions dominated by Python object manipulation or I/O are not accelerated.
The Right Approach: Only use JIT on numeric-heavy functions.
Why This Works: JIT optimises numeric loops; non-numeric operations stay in Python.
The Mistake: Misunderstanding broadcasting shapes.
Why It's a Problem: Operations fail or produce unexpected results.
The Right Approach: Ensure shapes follow the broadcasting rules: paired dimensions must be equal or 1.
Why This Works: NumPy aligns arrays predictably, avoiding memory bloat.
The Mistake: Premature optimisation without profiling.
Why It's a Problem: It may complicate code unnecessarily.
The Right Approach: Profile first, then optimise the hotspots.
Why This Works: It focuses effort where it has real impact.
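To make the last point actionable, here is a minimal profiling sketch with cProfile (the function names are illustrative; in a real project you would profile your own pipeline):

```python
import cProfile
import io
import pstats
import numpy as np

def slow_part(n):
    # Python-level loop: the likely hotspot
    total = 0.0
    for i in range(n):
        total += i ** 0.5
    return total

def fast_part(n):
    # Vectorised equivalent of the same computation
    return float(np.sqrt(np.arange(n)).sum())

def pipeline(n):
    return slow_part(n) + fast_part(n)

profiler = cProfile.Profile()
profiler.enable()
pipeline(200_000)
profiler.disable()

# Print the five entries with the highest cumulative time;
# slow_part typically dominates, telling you where to optimise.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

Only after the profile confirms where the time goes is it worth reaching for vectorisation or @jit.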
5. Your Turn: Practice & Self-Assessment
Practice Task (Estimated: 20 minutes)
The Challenge: Optimise a function that computes f(x) = x^2 + sin(x) over 1 million elements.
Specifications:
- Write a pure Python loop version.
- Vectorise it using NumPy.
- Apply Numba's @jit to the loop version.
Hint: Measure execution time using time.perf_counter, or %timeit in Jupyter. Compare the results.
Extension (optional): Try a hybrid approach: vectorised computation inside a JIT-compiled function.
Check Your Understanding:
- Explain why vectorisation is faster than Python loops.
- When would you prefer JIT over vectorisation?
- Identify potential memory bottlenecks in broadcasting.
- Predict the effect of mixing Python loops with NumPy operations.
6. Consolidation: Key Takeaways & Next Steps
The Essential Ideas

| Concept | What It Means | Why It Matters |
| --- | --- | --- |
| Vectorised Patterns | Operations on whole arrays without explicit loops | Fast, readable, memory-efficient |
| Broadcasting | Automatic shape alignment for arrays | Avoids loops, reduces memory overhead |
| JIT Compilation (@jit) | Compiles Python functions to machine code at runtime | Speeds up numeric-heavy loops |
| Cython | Translates Python to compiled C code | Achieves near-C performance for complex functions |
| Profiling | Measures runtime and memory usage | Helps identify bottlenecks |
💬 Summary Thought:
Optimising numerical code combines smart use of libraries, compilation, and profiling. By learning these techniques, you reduce execution time, write cleaner code, and can scale computations to large datasets with confidence.
Next Steps
- Practice vectorised patterns: Replace loops in your previous assignments with NumPy operations.
- Explore Numba: Apply @jit to loops in your code and measure the speedups.
- Profile code: Use %timeit or cProfile to find bottlenecks.
- Combine techniques: Apply vectorisation, broadcasting, and JIT together for optimal performance.