So, in general, I’ve avoided multithreaded processing; it’s one of those things that historically has been tricky to get right, and I don’t typically have embarrassingly parallel problems.
Today, however, I was parsing a set of 25,000 HTML files using BeautifulSoup, to pull out a small set of data (~500 bytes of JSON per HTML file). I briefly tried to simplify some of the code, but then realized that the lion’s share of the CPU time was being spent on the initial parse; no matter how much I cleaned up the rest of my code, the script wasn’t going to get much faster.
Enter multiprocessing. With a five-line change to my Python code, I was able to move from one core to four. Instead of:
    import glob

    def handle_place():
        data = []
        for i in glob.glob("beerplaces/*"):
            # Do stuff with i
            pass
        return data
I now have:

    import glob
    from multiprocessing import Pool

    def handle_place(filename):
        # Do stuff with filename
        pass

    if __name__ == "__main__":
        p = Pool(4)
        data = p.map(handle_place, glob.glob("places/*"))
Once I made the change, I went from using one CPU fully to using all four — and instead of taking 25 minutes to generate my output, the total time was under 7.
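For anyone who wants to try the same pattern, here is a slightly fuller sketch. It is not my actual script: `parse_html` is a hypothetical stand-in for the BeautifulSoup work, and `handle_all`, the `workers` argument, and the `chunksize` of 50 are illustrative choices, not tuned values. The worker function has to live at module level so the pool can hand it to the child processes.

```python
import glob
from multiprocessing import Pool


def parse_html(html):
    # Hypothetical stand-in for the expensive BeautifulSoup parse.
    return {"length": len(html), "links": html.count("<a ")}


def handle_place(filename):
    # Runs in a worker process: read one file, return a small result.
    with open(filename, encoding="utf-8", errors="replace") as f:
        return parse_html(f.read())


def handle_all(pattern, workers=4):
    filenames = sorted(glob.glob(pattern))
    if not filenames:
        return []
    # Leaving the with block shuts the worker processes down.
    with Pool(workers) as pool:
        # map() returns results in the same order as filenames;
        # chunksize hands files to workers in batches (50 is a guess).
        return pool.map(handle_place, filenames, chunksize=50)


if __name__ == "__main__":
    data = handle_all("places/*")
    print(len(data), "files parsed")
```

Because each worker only sends back a small result, the cost of passing data between processes stays negligible next to the parse itself, which is probably why this kind of job parallelizes so easily.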
multiprocessing is cool.