Parallel Processing in Pandas

Vivek Pandit
2 min read · Oct 1, 2019


(Image: all cores utilized)

More often than not, while working on machine learning projects, you will be dealing with huge datasets. The issue with bigger datasets is that they not only need more memory to hold them during processing, but also more compute.

You may have an n-core machine, but how do you make sure it is not just one core doing all the heavy lifting while the other n-1 sit on the couch and watch the work being done?

Python is a very programmer-friendly language with wonderful community support, which makes it the first choice for many. Parallel processing in Python is pretty much a cakewalk.

from multiprocessing import Pool

def f(x):
    return x * x

if __name__ == '__main__':
    p = Pool(5)                  # pool of 5 worker processes
    print(p.map(f, [1, 2, 3]))   # prints [1, 4, 9]

But did you know there are libraries out there built specially to parallelize pandas operations? All you have to do is change your code from this:

def fmatch(row):   # custom function
    # do something
    return choice  # return a scalar value

ledger['opportunity'] = ledger.apply(fmatch, axis=1)
# ledger is a pandas DataFrame
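To make the pattern above concrete, here is a small runnable version. The ledger columns and the matching rule are invented purely for illustration; your own fmatch would hold whatever per-row logic your project needs:

```python
import pandas as pd

# Hypothetical ledger: columns and values are made up for this example.
ledger = pd.DataFrame({
    'amount': [120, 45, 990],
    'region': ['east', 'west', 'east'],
})

def fmatch(row):
    # apply() with axis=1 calls this once per row, passing the row as a Series.
    if row['region'] == 'east' and row['amount'] > 100:
        return 'priority'
    return 'standard'

ledger['opportunity'] = ledger.apply(fmatch, axis=1)
print(ledger['opportunity'].tolist())  # ['priority', 'standard', 'priority']
```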

To this:

from pandarallel import pandarallel

pandarallel.initialize()

def fmatch(row):   # custom function
    # do something
    return choice  # return a scalar value

ledger['opportunity'] = ledger.parallel_apply(fmatch, axis=1)
# ledger is a pandas DataFrame

The pandarallel library creates multiple processes to parallelize your computation. In my case, I used pandarallel on a program that had otherwise run for more than 24 hours and was still going before I killed it manually. With pandarallel, it executed in less than 30 minutes, and all my cores were used efficiently.
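The underlying idea is simple: split the DataFrame into chunks, have each worker apply the function to its own chunk, and stitch the partial results back together. Here is a minimal sketch of that strategy using a thread pool for simplicity (pandarallel itself uses separate processes to sidestep the GIL for CPU-bound work); the DataFrame and row function are made up for illustration:

```python
import pandas as pd
from multiprocessing.pool import ThreadPool

df = pd.DataFrame({'x': range(10)})

def fmatch(row):
    # Stand-in for a per-row computation.
    return row['x'] * 2

def apply_chunk(chunk):
    # Each worker applies the row function to its own slice of the DataFrame.
    return chunk.apply(fmatch, axis=1)

n_workers = 4
size = -(-len(df) // n_workers)  # ceiling division: rows per chunk
chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]

with ThreadPool(n_workers) as pool:
    parts = pool.map(apply_chunk, chunks)  # one chunk per worker

result = pd.concat(parts)  # original index order is preserved
print(result.tolist())     # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```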
