A Surprising Lesson in Python Optimization

Dr. Jessica Rudd
7 min read · Jan 20, 2021

I started as a Data Scientist at a startup, A42 Labs, in November and hit the ground running FAST in the world of data science consulting. I was quickly onboarded and rolled onto a client project as the lead Data Scientist within a week of starting. The experience has been highly challenging and rewarding, with rapid improvements in my Python and SQL code. Perhaps the most beneficial aspect of this first client project has been the need to optimize code to fit within the parameters of our local programming requirements. This post highlights the approaches used to eliminate code loops and prepare client data for row-wise regex string comparisons.

Working toward more elegant code; less bull-in-a-china-shop code.

The Problem

The client in question (Client A) is a large, Fortune-ranked software company. Plugins that extend their software are developed either internally or by 3rd-party partners. These plugins represent a large revenue stream for Client A, engage thousands of partners who develop and sell them and hundreds of thousands of customers who purchase and use them, and cover a wide range of functionality and industries. Internal teams at Client A would like to leverage this data to report utilization and revenue back to partners and to build machine learning models that make plugin recommendations to customers.

However, through many iterations of legacy data systems, historical data quality issues have accumulated. Specifically, links between the plugin data and the partners who own or create the plugins have been lost, and no unique key exists between the datasets. Working with client-side SMEs, A42 Labs built a rule-based algorithm to match nearly 13,000 plugins with nearly 12,000 potential partners using regex string matching between the plugin information and the partner information where available. Due to security constraints from Client A, this process had to be conducted in a local environment, specifically a MacBook Pro with 16GB of RAM. This led to initial code iterations that could take more than a day to run. Since this was a highly iterative process, working with SMEs to validate and update the algorithm's matching rules, improving code efficiency was a priority when I started on the project, in order to eliminate bottlenecks.

I spent many nights hanging with this guy when I started my new job.

The Solution

The rule-based algorithm was built to match plugins to their partners based on information found in the plugin and partner registration data. This repeatable, semi-automated process relied on automatically identifying a set of candidate partner matches using regular expression string searching. Since Pandas is designed for vectorized manipulations, it's important to avoid explicit loops in the code logic: Pandas holds all data in memory, and row-by-row Python loops bypass its optimized internals, so iterating over rows is very inefficient. The A42 Labs team decided to leverage the apply method to compare plugin and partner string information row by row without writing explicit loops. An additional benefit of this approach is that, once the solution is handed to the client-side production team, it can more easily be converted to a parallel process, such as with PySpark, since row-wise operations can be sharded across parallel workers. To execute this logic, we built a pairwise comparison table of all plugins matched with all possible partners.
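To make that concrete, here is a minimal sketch of the kind of row-wise, apply-based regex matching described above. The plugin_name and partner_name columns and the single "partner name appears in the plugin name" rule are illustrative assumptions, not the client's actual schema or rule set.

import re
import pandas as pd

# Hypothetical pairwise comparison table (illustrative columns and values only)
pairs = pd.DataFrame({
    'plugin_name': ['Acme Reporting Tool', 'DataViz Pro', 'Acme Security Suite'],
    'partner_name': ['Acme', 'Globex', 'Acme'],
})

def partner_in_plugin(row):
    # Flag the pair as a candidate match if the partner name appears in the plugin name
    pattern = re.escape(str(row['partner_name']))
    return bool(re.search(pattern, str(row['plugin_name']), flags=re.IGNORECASE))

# Row-wise application across the pairwise comparison table
pairs['candidate_match'] = pairs.apply(partner_in_plugin, axis=1)
print(pairs)

Because each row is evaluated independently, the same function could later be wrapped in a Spark UDF or otherwise distributed with minimal changes.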

This process alone was eating A LOT of computation time by the time I joined the project. Here's a breakdown of how I optimized the creation of the pairwise comparison table.

Imports

Since this project used proprietary data from Client A, we’ll illustrate the process using some fake data. Faker is an awesome Python library for creating fake profile data! I found out about Faker from Khuyen Tran, an amazing data science blogger and programmer. Check out all her great work here.

import numpy as np
import pandas as pd
import time
from faker import Faker
fake = Faker()

Create fake data

profiles = [fake.profile() for i in range(1000)]
profiles = pd.DataFrame(profiles)

Split data

We’ll create 1000 fake profiles and then “break” the data into two dataframes to simulate the disjointed data from our project with Client A. We will also take 95% of the records from dataframe 2 to simulate the unequal sizes of our original data from Client A.

# Split dataframe into 2 dataframes
data1 = profiles.iloc[:, [0, 1, 2, 3, 4, 5, 6, 7]]
data2 = profiles.iloc[:, [8, 9, 10, 11, 12]]

# Keep a 95% sample of data2 to simulate unequal dataset sizes
data2 = data2.sample(frac=0.95)
data1.head()
data2.head()

Now that we’ve created our fake data sources, let’s see how large our pairwise comparison table will be:

# Number of plugins
print(data1.shape)

# Number of partners
print(data2.shape)

print(data1.shape[0] * data2.shape[0])
(1000, 8)
(950, 5)
950000

We’ll be creating a pairwise comparison table of every row from dataset 1 matched with every row of dataset 2. This is 950,000 records. For context, our problem with Client A required comparing nearly 13,000 rows of plugins with nearly 12,000 potential partners — that’s a pairwise comparison table of nearly 155 million!

Method 1: Append

When I first started at A42 Labs in November and began taking over the code base for this project, the existing code relied on nested for-loops, iterrows, and append to create the pairwise comparison table and output it to a csv in chunks. The iterrows and append calls, plus the repeated IO operations, led to a successful yet computationally inefficient process. Given the size of the data for Client A, this took over 18 hours to run, a pace that was not sustainable for the required iterations. On our smaller set of fake data below, this method takes over 5 minutes.

%%time
cols = ['job', 'company', 'ssn', 'residence', 'current_location', 'blood_group',
        'website', 'username', 'name', 'sex', 'address', 'mail', 'birthdate']

pairwise_lst = []
rowcnt = 0

# Build every plugin/partner row pair, writing to CSV in batches of 15,000 rows.
# Note: mode='a' appends to any existing pairwise_comparison.csv, so delete the
# file before re-running or rows will accumulate across runs.
for i_out, row_out in data1.iterrows():
    for i_in, row_in in data2.iterrows():
        pairwise_lst.append(row_out.append(row_in))
        rowcnt += 1
        if rowcnt % 15000 == 0:
            pairwise = pd.DataFrame(pairwise_lst, columns=cols)
            pairwise.to_csv('pairwise_comparison.csv', mode='a', header=False, index=False)
            pairwise_lst = []

# Write out any remaining rows that did not fill a complete batch
if pairwise_lst:
    pd.DataFrame(pairwise_lst, columns=cols).to_csv('pairwise_comparison.csv', mode='a', header=False, index=False)

pairwise = pd.read_csv('pairwise_comparison.csv', header=None, names = cols)
print(pairwise.shape)
pairwise.head()
(4170000, 13)
CPU times: user 5min 58s, sys: 1.84 s, total: 6min
Wall time: 6min

Method 2: Merge

The more I understood the project and data requirements, the more I realized that the proposed pairwise comparison table was simply a Cartesian join (cross join) of the two disjoint datasets. Assigning both datasets an arbitrary key value of 1, we can merge using an outer join on that key.

%%time
pairwise_merge = data1.assign(key=1).merge(data2.assign(key=1), how='outer', on='key')
print(pairwise_merge.shape)
pairwise_merge.head()
(950000, 14)
CPU times: user 102 ms, sys: 14.8 ms, total: 116 ms
Wall time: 116 ms

Our operation has now improved from nearly 6 minutes to a little over 100 milliseconds, roughly a 3,000x speedup!
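As an aside, pandas 1.2 (released shortly before this post was written) added a dedicated cross-join option to merge, so the dummy key is no longer strictly necessary. A minimal sketch, assuming pandas >= 1.2:

# Requires pandas >= 1.2; produces the same Cartesian product without the dummy key column
pairwise_cross = data1.merge(data2, how='cross')
print(pairwise_cross.shape)  # expected (950000, 13)

Functionally this is equivalent to the key trick, just without the extra key column appearing in the output.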

Method 3: Concatenation

Another method I found, thanks to every programmer's favorite resource, StackOverflow, is to concatenate every row from one dataset onto the other. This replicates the rows of the first dataframe by the length of the second dataframe, then concatenates the two along their columns.

%%time
def pairwise_merge_concat(*args):
    # Cross join two dataframes by replicating rows and concatenating column-wise.
    # Only handles the case where the first dataframe is the larger of the two.
    if len(args) == 2:
        if len(args[0]) > len(args[1]):
            # Repeat the larger frame once per row of the smaller frame and sort by its first column
            df_1 = pd.concat([args[0]] * len(args[1]), ignore_index=True).sort_values(by=[args[0].columns[0]]).reset_index(drop=True)
            # Repeat the smaller frame end-to-end once per row of the larger frame
            df_2 = pd.concat([args[1]] * len(args[0]), ignore_index=True)
            # Align the two repeated frames side by side
            return pd.concat([df_1, df_2], axis=1)

pairwise_concat = pairwise_merge_concat(data1, data2)
print(pairwise_concat.shape)
pairwise_concat.head()
(950000, 13)
CPU times: user 879 ms, sys: 47 ms, total: 926 ms
Wall time: 919 ms

Concatenation gives us computation cost savings similar to the merge method. I ended up using the concatenation approach because I felt it could be more flexible with different-sized dataframes in the future. While the original method took over 18 hours on the Client A project, the improved code took only 7.5 minutes! I was certainly excited to get my workday back and not have to worry about running this section of code overnight.

Conclusion

Even in the early stages of a project, thinking critically about data structures, and about how the data will be used further down the processing and algorithm-development pipeline, can lead to vast improvements in computation time and mitigate pipeline bottlenecks.

Feel free to fork and play with the code for this article in this GitHub repo.
