Parallel Processing in Pandas
More often than not, while working on Machine Learning projects, you will be dealing with huge datasets. The issue with bigger datasets is that they not only need more memory to hold them while processing, but also more compute.
You may have an n-core machine, but how do you make sure it is not just one core doing all the heavy lifting while the other n-1 sit on the couch and watch the work being done?
Python is a very programmer-friendly language with wonderful community support, which makes it the first choice for many. Parallel processing in Python is pretty much a cakewalk:
from multiprocessing import Pool

def f(x):
    return x * x

if __name__ == '__main__':
    p = Pool(5)  # a pool of 5 worker processes
    print(p.map(f, [1, 2, 3]))  # → [1, 4, 9]
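For larger inputs, it usually pays to hand each worker a whole slice of the data rather than one item at a time, since that reduces inter-process communication. A minimal stdlib-only sketch of that idea (`square` and `process_chunk` are illustrative names, not part of any library):

```python
from multiprocessing import Pool

def square(x):
    # CPU-bound work applied to one item
    return x * x

def process_chunk(chunk):
    # each worker processes an entire slice of the data at once
    return [square(x) for x in chunk]

if __name__ == '__main__':
    data = list(range(10))
    n_workers = 2
    # split the data into contiguous chunks, one per worker
    chunk_size = (len(data) + n_workers - 1) // n_workers
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with Pool(n_workers) as pool:
        parts = pool.map(process_chunk, chunks)
    # flatten the per-chunk results back into one list
    results = [x for part in parts for x in part]
    print(results)  # → [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```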
But did you know there are libraries out there built specially to parallelize pandas operations? All you have to do is change your code from this:
def fmatch(row):
    # custom function: do something with the row
    return choice  # return a scalar value

ledger['opportunity'] = ledger.apply(fmatch, axis=1)
# ledger is a pandas DataFrame
To this:
from pandarallel import pandarallel

pandarallel.initialize()

def fmatch(row):
    # do something with the row
    return choice  # return a scalar value

ledger['opportunity'] = ledger.parallel_apply(fmatch, axis=1)
# ledger is a pandas DataFrame
The pandarallel library spawns multiple worker processes to parallelize your computation. In my case, a program that had been running for more than 24 hours without finishing (I eventually killed it manually) completed in under 30 minutes with pandarallel, using all of my cores efficiently.
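Under the hood, the idea is to split the DataFrame into chunks and apply the function to each chunk in a separate process. If you cannot install pandarallel, a hedged do-it-yourself sketch of the same pattern (assuming pandas and NumPy are available; `score` is a made-up stand-in for a real per-row function like `fmatch`):

```python
import numpy as np
import pandas as pd
from multiprocessing import Pool

def score(row):
    # hypothetical per-row computation standing in for fmatch
    return row['a'] * row['b']

def apply_chunk(chunk):
    # each worker applies the function to its own slice of the DataFrame
    return chunk.apply(score, axis=1)

if __name__ == '__main__':
    ledger = pd.DataFrame({'a': range(8), 'b': range(8)})
    n_workers = 4
    # np.array_split divides the DataFrame into roughly equal chunks
    chunks = np.array_split(ledger, n_workers)
    with Pool(n_workers) as pool:
        parts = pool.map(apply_chunk, chunks)
    # concat preserves the original index, so assignment lines up
    ledger['opportunity'] = pd.concat(parts)
    print(ledger['opportunity'].tolist())  # → [0, 1, 4, 9, 16, 25, 36, 49]
```

pandarallel saves you this boilerplate and adds niceties such as progress bars, but the manual version is useful to understand where the speedup comes from: the per-row work, not the splitting, must dominate for parallelism to pay off.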