multiprocessing is cool

So, in general, I’ve avoided multithreaded processing; it’s one of those things that historically has been tricky to get right, and I don’t typically have embarrassingly parallel problems.

Today, however, I was parsing a set of 25,000 HTML files using BeautifulSoup, to pull out a small set of data (~500 bytes of JSON per HTML file). I briefly tried to simplify some of the code, but then realized that the lion’s share of the CPU time was being spent on the initial parse; no amount of cleanup elsewhere was going to make the script much faster.
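
For the curious, the per-file work looked roughly like this (a minimal sketch, not my actual code: the extract_place name, the HTML structure, and the fields are invented for illustration, and it assumes the bs4 package):

from bs4 import BeautifulSoup
import json

def extract_place(filename):
    # Parse the whole file up front (this is where the CPU time goes),
    # then pull a tiny record out of the resulting tree.
    with open(filename) as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    record = {
        "name": soup.find("h1").get_text(strip=True),
        "address": soup.find(class_="address").get_text(strip=True),
    }
    return json.dumps(record)  # roughly 500 bytes of JSON per file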

Enter multiprocessing. With a five-line change to my Python code, I was able to move from one core to four. Instead of:

def handle_place():
    for filename in glob.glob("beerplaces/*"):
        # Do stuff with filename
    return data

I have:

from multiprocessing import Pool
import glob

def handle_place(filename):
    # Do stuff with filename, return this file's result
    ...

if __name__ == "__main__":
    p = Pool(4)
    data = p.map(handle_place, glob.glob("beerplaces/*"))

Once I made the change, I went from using one CPU fully to using all four; instead of taking 25 minutes to generate my output, the total run was under 7 minutes.
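
A couple of refinements worth noting, assuming a reasonably modern Python and reusing the handle_place above: Pool() with no argument sizes itself to os.cpu_count(), and the pool works as a context manager, so the same change ports cleanly to machines with more cores:

from multiprocessing import Pool
import glob

if __name__ == "__main__":
    # No pool size given: defaults to os.cpu_count() workers.
    with Pool() as p:
        data = p.map(handle_place, glob.glob("beerplaces/*"))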

multiprocessing is cool.
