A while back I built a small web app to parse one of our system configuration files, because the application’s interface doesn’t have a search function (yeah, it’s that bad…). It worked OK at first, but slowed down as the XML file grew to 2.5 MB and ~10k lines, to the point where it was taking over 20 seconds to handle the file. The slow part was definitely BeautifulSoup’s parsing step, but it took a little poking to work out why.
The code sample’s pretty simple. Read the contents of the XML file into memory, and pass it to BS to parse. I’m running this on Python 3.4, but 3.6 didn’t seem to make any difference, nor did updating BS.
from bs4 import BeautifulSoup
# read the config file to memory (binary mode, so this is a bytes object)
with open(filename, 'rb') as fh:
    filecontents = fh.read()
# parse the xml
soup = BeautifulSoup(filecontents, 'lxml')
Turns out, if you decode the bytes to a string, everything’s much faster! Here’s the new code:
from bs4 import BeautifulSoup
# read the config file to memory (binary mode, so this is a bytes object)
with open(filename, 'rb') as fh:
    filecontents = fh.read()
# decode the bytes format to text - this speeds up soup 10x
filecontents = filecontents.decode('utf-8')
# parse the xml
soup = BeautifulSoup(filecontents, 'lxml')
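An equivalent way to hand BeautifulSoup a string is to let open() do the decoding by reading the file in text mode with an explicit encoding. This is only a sketch: it assumes the file really is UTF-8 (as the decode call above implies), and the helper names load_config_text and parse_config are made up for illustration.

```python
def load_config_text(filename, encoding="utf-8"):
    """Return the file contents as str, decoded while reading."""
    # Text mode plus an explicit encoding means read() returns str, not bytes.
    with open(filename, "r", encoding=encoding) as fh:
        return fh.read()


def parse_config(filename):
    """Parse the config with BeautifulSoup, passing it str rather than bytes."""
    # Imported here so the reading helper above works without bs4 installed.
    from bs4 import BeautifulSoup

    return BeautifulSoup(load_config_text(filename), "lxml")
```

Calling parse_config(filename) should then behave like the decode version above, just with one fewer step.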
I don’t have time or inclination to look into why this is the case, but it’s certainly a curious one.
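For anyone who does want to poke at it, a small timing harness makes the comparison repeatable. This is a rough sketch rather than a proper benchmark: best_time and compare_bytes_vs_text are hypothetical helpers, and the parse callable is passed in so the harness itself doesn’t depend on BeautifulSoup.

```python
import time


def best_time(func, arg, repeats=3):
    """Return the fastest wall-clock time (in seconds) of func(arg) over a few runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare_bytes_vs_text(parse, raw, encoding="utf-8"):
    """Time parse() on the raw bytes and on the same content decoded to str."""
    return {
        "bytes": best_time(parse, raw),
        "text": best_time(parse, raw.decode(encoding)),
    }
```

With BeautifulSoup installed, parse could be something like lambda doc: BeautifulSoup(doc, 'lxml'), and raw the result of reading the config file in 'rb' mode. An untested guess is that the gap comes from the encoding detection BeautifulSoup runs when it is given bytes, but measuring it is the only way to be sure.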