It’s amazing how quickly a seemingly simple project can turn into a really intense learning experience. My week started (as they normally do) with a minor annoyance: Starwood Preferred Guest (a hotel rewards program) posts all points activity on separate pages, sixteen transactions apiece. After pausing to consider how much of a #firstworldproblem that was, I decided it might be a good opportunity to learn web scraping with Python.
Challenge 1: Authentication for spg.com
My feeble background in HTTP headers from the Python Challenge had not prepared me for this. I realize from the ‘https:’ in my Chrome window that there is some kind of authentication used to keep people logged into the correct user accounts. But…how do I do this?
I used the Chrome element inspector to see if I could pick up any hints from the Username and Password box on the spg website, but no dice. After some failed (mostly blind) attempts with urllib, urllib2, httplib, httplib2, and bolacha, I figured it was time to start approaching this like an engineer. What do engineers do best? BLOW THINGS UP. Huh. I can’t blow up my laptop – not a good idea. What else do engineers do? They use TOOLS. I figured if Chrome doesn’t have any trouble passing my username and password to spg.com, maybe I can just eavesdrop on the communication and learn to emulate it.
Enter Fiddler: a tool created to capture browser communications. The GUI is rather intuitive, and in a few minutes I was able to run some targeted tests on Chrome (in incognito mode to remove any extraneous variables). I ended up with a lot more information than ever before, but I was still stumped. There wasn’t anything really obvious in the header information or the response except…
successPath=%2Findex.html&login=heythatsmylogin&password=ooothatsmypassword!
Brilliant! Now…how to pass this? I was still a little sore on urllib and httplib, so I sought out yet another Python library to help. Thanks to Google, I stumbled upon Requests: HTTP for Humans. Not only was that totally reassuring, but the documentation was simple and easy to use. In no time, I had all the makings of a new authentication scheme that worked. Huzzah! Now what?
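For anyone following along, here is a minimal sketch of how a captured form body like the one above might be replayed with Requests. The login URL is a placeholder I made up (the real endpoint isn't shown here), the function names are hypothetical, and the field names are simply the ones from the Fiddler capture. A Session is used so that any cookies set at login are kept for later requests.

```python
import requests

# Hypothetical endpoint: substitute the URL the login form actually posts to.
LOGIN_URL = "https://www.example.com/login"


def build_login_request(username, password):
    """Prepare (but do not send) a form POST mirroring the captured body."""
    payload = {
        "successPath": "/index.html",  # goes over the wire as %2Findex.html
        "login": username,
        "password": password,
    }
    # data= makes Requests form-encode the dict, much as the browser does.
    return requests.Request("POST", LOGIN_URL, data=payload).prepare()


def log_in(username, password):
    """Send the login request and return the session holding its cookies."""
    session = requests.Session()
    response = session.send(build_login_request(username, password))
    response.raise_for_status()  # fail loudly on a 4xx/5xx reply
    return session
```

Splitting out build_login_request makes it easy to print the prepared body and compare it against what Fiddler captured before sending anything.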
Challenge 2: Scraping the sites for data
Iterating through the pages was fairly simple (start with 0, increment the URL value by 16 each time), but the real challenge was making sense of the source. To begin with, the source for each page was almost 5000 lines long. Most of this was style information and scripts that I really didn’t care about. The real meat of the data was in a table that looked something like this:
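The paging scheme described above (start at 0, step by 16) can be sketched roughly like this. The base URL and the query parameter name are stand-ins, since the real ones aren't shown in this post, and the number of pages is assumed to be known up front.

```python
from urllib.parse import urlencode

# Hypothetical URL and query parameter; the real ones come from the site.
ACTIVITY_URL = "https://www.example.com/account/activity"
PAGE_SIZE = 16  # transactions shown per page


def page_urls(page_count, param="start"):
    """Yield one URL per page, with offsets 0, 16, 32, ..."""
    for page in range(page_count):
        offset = page * PAGE_SIZE
        yield f"{ACTIVITY_URL}?{urlencode({param: offset})}"
```

Each URL would then be fetched with the logged-in session and the response text handed to the parser.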
<tr>
<td class="even first">Activity</td>
<td class="even first">Points</td>
<td class="even first">Post Date</td>
<td class="even first">Details</td>
Within the Details section there are only two things I care about:
<a class="propertyName" href="addressblahblahblah">Property Name</a>
<div class="stayDates">Type</div>
</tr>
The data I wanted was alternately assigned to the ‘even’ and ‘odd’ class, which was slightly annoying, but understandable. More annoying were the td sections immediately following each of them with class=’odd inilineBookingRow’ that contained all kinds of garbage. As an initial step, I decided to pull all the items with either the even or odd class attribute – I could clean it up later.
Challenge 3: Cleaning the Data
After a few hours of poking around for a good scraping tool, I settled on BeautifulSoup. It seemed to have the best reputation around for parsing and navigating tree structures from the web, so I gave it a shot. Unfortunately, BeautifulSoup treats class as a multi-valued attribute and matches on each class name separately, so I couldn’t find a way to specify (either in SoupStrainer or soup.find_all) that I wanted the td values with class=’even’ but not those with class=’even inilineBookingRow’. The only silver lining on this was that the elements in the inlineBookingRow section were mostly non-printing characters. I resigned myself to grabbing all odd/even tags and outputting only the text of those elements.
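For what it's worth, find_all also accepts a function as a filter, which is one possible way to keep the even/odd cells while skipping the booking rows. This is a sketch rather than what my script does; the helper names are made up, and the unwanted class is listed under both spellings that appear in this post because I'm not certain which one the page uses.

```python
from bs4 import BeautifulSoup

WANTED = {"even", "odd"}
# Both spellings seen in this post; which one the site uses is an assumption.
UNWANTED = {"inilineBookingRow", "inlineBookingRow"}


def is_data_cell(tag):
    """True for td tags with an even/odd class but no booking-row class."""
    classes = set(tag.get("class") or [])
    return tag.name == "td" and bool(classes & WANTED) and not (classes & UNWANTED)


def extract_cell_text(html):
    """Return the stripped text of every matching cell, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    return [cell.get_text(strip=True) for cell in soup.find_all(is_data_cell)]
```

Because the filter sees the full list of class names on each tag, it can accept "even first" and reject "even inilineBookingRow" in a single pass.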
The rest of the process actually wasn’t too bad – I needed a lot of string, list, and tuple manipulation, but those are things I’m pretty familiar with at this point. Since my approach de-associated all of the data (but thankfully preserved the order), I pulled the four elements back into a list of tuples by popping them off one at a time. Once they were back in tuples, I could write them to a pipe-delimited file (there are commas in the data, so no csv) and post a message to the user that everything was done.
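The regrouping and output steps could look something like the sketch below. It slices the flat list in fours instead of popping items one at a time, and it borrows the csv module purely as a writer with "|" as the delimiter; the function names and header labels are my own choices for illustration.

```python
import csv


def group_rows(values, width=4):
    """Regroup a flat, ordered list into tuples of `width` fields each."""
    usable = len(values) - len(values) % width  # ignore a trailing partial row
    return [tuple(values[i:i + width]) for i in range(0, usable, width)]


def write_rows(rows, handle):
    """Write rows to an open text file as pipe-delimited lines."""
    writer = csv.writer(handle, delimiter="|", lineterminator="\n")
    writer.writerow(("Activity", "Points", "Post Date", "Details"))
    writer.writerows(rows)
```

With a pipe as the delimiter, values such as 1,000 pass through without quoting; write_rows expects a file opened in text mode with newline="".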
All the filtering slows the script down quite a bit: it runs for around 45 seconds. I would wager that much of this slowness is due to BeautifulSoup, which is notorious on the internets for its glacial processing. I can try another parser later on, but my initial attempt to install lxml was met with abject failure.
Anyway, here’s the code. Now that I have the dataset, it might be nice to spit out some pretty charts in R. I’ll…keep working on that.