
How to Write Faster Python Code for Data Science: Essential Tips for Beginners

As data scientists, we often find ourselves writing Python code to analyze, transform, and model data. While getting the code to work is the first step, making it fast and efficient is equally crucial, especially when dealing with large datasets. Today, I want to share some essential optimization techniques that can dramatically speed up your Python scripts, even if you're just starting out. No need to be a seasoned software engineer to apply these!

"Insights are the new currency—let’s mint some faster!"

Let's unbox some common performance bottlenecks and see how we can optimize them.

1. Replace Loops with List Comprehensions

One of the most common tasks in data science is creating new lists by transforming existing ones. Many beginners instinctively reach for for loops, but Python offers a much more efficient alternative: list comprehensions. They are concise, readable, and, most importantly, faster!

Why it works: The loop inside a list comprehension runs as specialized bytecode in the interpreter, avoiding per-iteration overhead that an explicit loop pays, such as repeatedly looking up and calling result.append.

Consider squaring a list of numbers:

python
import time

# Before Optimization: Using a for loop
def square_numbers_loop(numbers):
    result = [] 
    for num in numbers: 
        result.append(num ** 2) 
    return result

test_numbers = list(range(1000000))

start_time = time.time()
squared_loop = square_numbers_loop(test_numbers)
loop_time = time.time() - start_time
print(f"Loop time: {loop_time:.4f} seconds")

# After Optimization: Using a list comprehension
def square_numbers_comprehension(numbers):
    return [num ** 2 for num in numbers]

start_time = time.time()
squared_comprehension = square_numbers_comprehension(test_numbers)
comprehension_time = time.time() - start_time
print(f"Comprehension time: {comprehension_time:.4f} seconds")
print(f"Improvement: {loop_time / comprehension_time:.2f}x faster")

Output for 1,000,000 numbers might look something like this:

Loop time: 0.0840 seconds
Comprehension time: 0.0736 seconds
Improvement: 1.14x faster

While the improvement might seem small for a single operation, imagine this across millions of data points or repeated transformations!
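The same comprehension syntax also builds sets and dictionaries in a single pass, which ties into the data-structure tips below. A minimal sketch (the variable names are just illustrative):

```python
# Comprehension syntax works for sets and dicts, not just lists
numbers = [1, 2, 2, 3, 3, 3]

unique_squares = {n ** 2 for n in numbers}    # set comprehension: deduplicates
square_lookup = {n: n ** 2 for n in numbers}  # dict comprehension: fast lookups later

print(sorted(unique_squares))  # [1, 4, 9]
print(square_lookup[3])        # 9
```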

2. Choose the Right Data Structure for the Job

Choosing the right container can make code hundreds of times faster with a one-line change. The key is knowing when to use lists, sets, or dictionaries. For membership testing (if item in collection), sets are dramatically faster than lists.

Why it works: Sets use hash tables, allowing Python to directly jump to where an item should be, rather than searching through every element sequentially as it does with lists. It's like having an index for your data!

Let's find common elements between two lists:

python
import time

# Before Optimization: Checking membership in a list
def find_common_elements_list(list1, list2):
    common = []
    for item in list1:
        if item in list2:
            common.append(item)
    return common

large_list1 = list(range(10000))     
large_list2 = list(range(5000, 15000))

start_time = time.time()
common_list = find_common_elements_list(large_list1, large_list2)
list_time = time.time() - start_time
print(f"List approach time: {list_time:.4f} seconds")

# After Optimization: Checking membership in a set
def find_common_elements_set(list1, list2):
    set2 = set(list2)  # One-time conversion cost
    return [item for item in list1 if item in set2]

start_time = time.time()
common_set = find_common_elements_set(large_list1, large_list2)
set_time = time.time() - start_time
print(f"Set approach time: {set_time:.4f} seconds")
print(f"Improvement: {list_time / set_time:.2f}x faster")

The difference is staggering for larger datasets:

List approach time: 0.8478 seconds
Set approach time: 0.0010 seconds
Improvement: 863.53x faster
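Dictionaries give you the same hash-based speed when you need to look up a value by key rather than just test membership. A small sketch, with made-up record data for illustration:

```python
# Looking up a record by ID: linear scan vs. hash-based dict lookup
records = [(i, f"user_{i}") for i in range(10000)]

# Slow: scans the list from the start on every lookup
def find_in_list(records, user_id):
    for uid, name in records:
        if uid == user_id:
            return name
    return None

# Fast: one-time conversion, then O(1) average-time lookups
records_by_id = dict(records)

print(find_in_list(records, 9999))  # user_9999
print(records_by_id[9999])          # user_9999
```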

3. Use Python's Built-in Functions Whenever Possible

Python's standard library is packed with highly optimized built-in functions. Before you write your own custom loop or function for a common operation like summing or finding the maximum, check if a built-in function already exists.

Why it works: Built-in functions are implemented in C, making them significantly faster than equivalent pure Python implementations.

Calculating the sum and maximum of a list:

python
import time

# Before Optimization: Manual implementation
def calculate_sum_manual(numbers):
    total = 0
    for num in numbers:  
        total += num     
    return total

def find_max_manual(numbers):
    max_val = numbers[0] 
    for num in numbers[1:]: 
        if num > max_val:    
            max_val = num   
    return max_val

test_numbers = list(range(1000000))  

start_time = time.time()
manual_sum = calculate_sum_manual(test_numbers)
manual_max = find_max_manual(test_numbers)
manual_time = time.time() - start_time
print(f"Manual approach time: {manual_time:.4f} seconds")

# After Optimization: Using built-in functions
start_time = time.time()
builtin_sum = sum(test_numbers)    
builtin_max = max(test_numbers)    
builtin_time = time.time() - start_time
print(f"Built-in approach time: {builtin_time:.4f} seconds")
print(f"Improvement: {manual_time / builtin_time:.2f}x faster")

Output:

Manual approach time: 0.0805 seconds
Built-in approach time: 0.0413 seconds
Improvement: 1.95x faster
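Built-ins also compose with generator expressions, letting you transform and aggregate in one pass without building an intermediate list. A quick sketch:

```python
# Built-ins accept generator expressions: transform + aggregate in one pass
numbers = range(1, 1_000_001)

total_of_squares = sum(n ** 2 for n in numbers)       # no intermediate list
largest_even = max(n for n in numbers if n % 2 == 0)  # filter + reduce together

print(largest_even)  # 1000000
```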

4. Perform Efficient String Operations with join()

String concatenation using the + operator can be incredibly slow, especially when building long strings piece by piece. This is because strings in Python are immutable, meaning each += operation creates a new string object in memory. The join() method is the optimized way to concatenate strings from an iterable.

Why it works: The join() method calculates the final string size upfront and builds it in a single, efficient operation.

Building a CSV string:

python
import time

# Before Optimization: String concatenation with +
def create_csv_plus(data):
    result = ""
    for row in data:
        for i, item in enumerate(row):
            result += str(item)
            if i < len(row) - 1:
                result += ","
        result += "\n"
    return result

test_data = [[f"item_{i}_{j}" for j in range(10)] for i in range(1000)]

start_time = time.time()
csv_plus = create_csv_plus(test_data)
plus_time = time.time() - start_time
print(f"String concatenation time: {plus_time:.4f} seconds")

# After Optimization: Using the join() method
def create_csv_join(data):
    return "\n".join(",".join(str(item) for item in row) for row in data)

start_time = time.time()
csv_join = create_csv_join(test_data)
join_time = time.time() - start_time
print(f"Join method time: {join_time:.4f} seconds")
print(f"Improvement: {plus_time / join_time:.2f}x faster")

Output for 1000 rows with 10 columns each:

String concatenation time: 0.0043 seconds
Join method time: 0.0022 seconds
Improvement: 1.94x faster
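When the pieces arrive incrementally and you can't express the whole string as one expression, the standard idiom is to collect them in a list and join once at the end. A minimal sketch:

```python
# Accumulate parts in a list, then join once -- avoids repeated string copies
parts = []
for i in range(5):
    parts.append(f"row_{i}")

report = "\n".join(parts)
print(report)
```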

5. Use Generators for Memory-Efficient Processing

When dealing with very large datasets, you often don't need to load all the data into memory at once. Generators allow you to produce values on-demand, saving significant memory. This is especially useful in data streaming or processing files line by line.

Why it works: Generators employ lazy evaluation; they only compute and yield a value when requested. The generator object itself is tiny, only holding enough information to resume computation.

Processing a large dataset:

python
import sys
import time

# Before Optimization: Storing all processed data in a list
def process_large_dataset_list(n):
    processed_data = []  
    for i in range(n):
        processed_value = i ** 2 + i * 3 + 42
        processed_data.append(processed_value)
    return processed_data

n = 100000
list_result = process_large_dataset_list(n)
list_memory = sys.getsizeof(list_result)
print(f"List memory usage: {list_memory:,} bytes")

# After Optimization: Using a generator
def process_large_dataset_generator(n):
    for i in range(n):
        processed_value = i ** 2 + i * 3 + 42
        yield processed_value  # Yield instead of append

# Create the generator (doesn't process yet)
gen_result = process_large_dataset_generator(n)
gen_memory = sys.getsizeof(gen_result)
print(f"Generator memory usage: {gen_memory:,} bytes")
print(f"Memory improvement: {list_memory / gen_memory:.0f}x less memory")

# To use the generator, you iterate through it:
total = 0
start_time = time.time()
for value in process_large_dataset_generator(n):
    total += value
# Each value is processed on-demand and can be garbage collected
generator_processing_time = time.time() - start_time
print(f"Generator processing time: {generator_processing_time:.4f} seconds (for iteration)")

Output (note the significant memory difference):

List memory usage: 800,984 bytes
Generator memory usage: 224 bytes
Memory improvement: 3576x less memory
Generator processing time: 0.0123 seconds (for iteration)
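For one-off cases you don't even need a named generator function: swapping a list comprehension's square brackets for parentheses gives you a generator expression with the same lazy behavior. A small sketch of the memory difference:

```python
import sys

n = 100_000

# Square brackets build the full list in memory; parentheses build a lazy generator
eager = [i ** 2 + i * 3 + 42 for i in range(n)]
lazy = (i ** 2 + i * 3 + 42 for i in range(n))

print(f"{sys.getsizeof(eager):,} bytes")  # hundreds of kilobytes
print(f"{sys.getsizeof(lazy):,} bytes")   # a couple hundred bytes

assert sum(lazy) == sum(eager)  # same values, produced on demand
```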

Wrapping Up

Optimizing Python code doesn't have to be a dark art. As we've seen, small, mindful changes in how you approach common programming tasks can lead to significant improvements in both speed and memory efficiency. The key is to develop an intuition for choosing the right tool or approach for the specific problem you're solving.

Remember these core principles:

  • Embrace built-ins: Leverage Python's highly optimized built-in functions.
  • Pick the right data structure: Use sets for fast membership checks and dictionaries for quick lookups.
  • Prefer comprehensions: Opt for list, set, and dictionary comprehensions over traditional loops for creating new collections.
  • Efficient string handling: Use join() for concatenating multiple strings.
  • Go lazy with generators: For large datasets, use generators to process data piece by piece and conserve memory.
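One habit that supports all of these: measure before and after. The standard library's timeit module is more reliable than one-off time.time() calls because it runs the statement many times and averages out timer noise. A minimal sketch, reusing the loop-versus-comprehension comparison from tip 1 (exact timings will vary by machine):

```python
import timeit

# Benchmark the two versions from tip 1 over 100 repetitions each
setup = "numbers = list(range(10000))"

loop_stmt = """
result = []
for n in numbers:
    result.append(n ** 2)
"""
comp_stmt = "result = [n ** 2 for n in numbers]"

loop_t = timeit.timeit(loop_stmt, setup=setup, number=100)
comp_t = timeit.timeit(comp_stmt, setup=setup, number=100)
print(f"loop: {loop_t:.4f}s  comprehension: {comp_t:.4f}s")
```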

By integrating these practices into your daily coding habits, you'll not only write faster Python code but also become a more efficient and effective data scientist. Keep learning, keep coding, and let's keep unboxing those black boxes!
