multiprocessing is cool
So, in general, I’ve avoided parallel processing in Python; it’s one of those things that has historically been tricky to get right, and I don’t typically have embarrassingly parallel problems anyway.
Today, however, I was parsing a set of 25,000 HTML files using BeautifulSoup, to pull out a small set of data (~500 bytes of JSON per HTML file). I briefly tried to simplify some of the code, but then realized that the lion’s share of the CPU time was being spent on the initial parse; no amount of cleanup in the rest of the script was going to make it much faster.
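The per-file work looked something like this (a minimal sketch; the tag names and fields here are made-up stand-ins, since the real pages obviously have their own structure):

import json
from bs4 import BeautifulSoup

def parse_file(filename):
    # Parse one HTML file and pull a handful of fields out of it.
    # The "h1"/"address" tags are hypothetical examples.
    with open(filename) as f:
        soup = BeautifulSoup(f, "html.parser")
    record = {
        "name": soup.find("h1").get_text(strip=True),
        "address": soup.find("address").get_text(strip=True),
    }
    # Each record serializes to a few hundred bytes of JSON.
    return json.dumps(record)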
Enter multiprocessing. With a five-line change to my Python code, I was able to move from one core to four. Instead of:
import glob

def handle_place():
    data = []
    for filename in glob.glob("places/*"):
        # Do stuff with filename, appending the result to data
        data.append(...)
    return data
I now have:
import glob
from multiprocessing import Pool

def handle_place(filename):
    # Do stuff with filename, producing this file's data
    return data

if __name__ == "__main__":
    p = Pool(4)
    data = p.map(handle_place, glob.glob("places/*"))
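A couple of notes on that: handle_place has to be a top-level function so the worker processes can pickle it, and Pool.map returns the results in the same order as the inputs. On a long run it’s also nice to see progress; here’s a variation using imap_unordered, which hands back each result as soon as a worker finishes it (a sketch assuming Python 3, where Pool works as a context manager):

import glob
from multiprocessing import Pool

def handle_place(filename):
    # Placeholder for the real per-file parsing.
    return filename

if __name__ == "__main__":
    data = []
    # Pool() defaults to one worker per CPU core.
    with Pool() as p:
        # imap_unordered yields results as they complete,
        # so we can report progress every thousand files.
        for n, item in enumerate(p.imap_unordered(handle_place, glob.glob("places/*")), 1):
            data.append(item)
            if n % 1000 == 0:
                print(f"{n} files done")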
Once I made the change, I went from using one CPU fully to using all four — and instead of taking 25 minutes to generate my output, the total time was under 7 minutes.
multiprocessing is cool.