More on SPGpoints.py

After some further tooling, I’ve created a rather robust script that outputs a data file, a summary file, and some summary information to stdout.  Also, it’s nicely generalized for anyone with an SPG account who cares to run it.  Check it out here.

I’ve got a few challenges left:

  1. Pretty charts and graphs in R
  2. Compiling this in a full package so anyone can run it on a Windows machine (regardless of whether they have Python or the requests/Beautiful Soup packages installed)
  3. Code optimization using a different parser

Learning Web Scraping with Python

It’s amazing how quickly a seemingly simple project can turn into a really intense learning experience.  My week started (as they normally do) with a minor annoyance: Starwood Preferred Guest (a hotel rewards program) posts all points activity on separate pages, sixteen transactions apiece.  After pausing to consider how much of a #firstworldproblem that was, I decided it might be a good opportunity to learn web scraping with Python.

Challenge 1: Authentication for spg.com

My feeble background in HTTP Headers from the Python Challenge had not prepared me for this.  I realized from the ‘https:’ in my Chrome window that there is some kind of authentication used to keep people out of my account and logged into the correct user accounts.  But…how do I do this?

I used the Chrome element inspector to see if I could pick up any hints from the Username and Password box on the spg website, but no dice.  After some failed (mostly blind) attempts with urllib, urllib2, httplib, httplib2, and bolacha, I figured it was time to start approaching this like an engineer.  What do engineers do best?  BLOW THINGS UP.  Huh.  I can’t blow up my laptop – not a good idea.  What else do engineers do?  They use TOOLS.  I figured if Chrome doesn’t have any trouble passing my username and password to spg.com, maybe I can just eavesdrop on the communication and learn to emulate it.

Enter Fiddler: a tool created to capture browser communications.  The GUI is rather intuitive, and in a few minutes I was able to run some targeted tests on Chrome (in incognito mode to remove any extraneous variables).  I ended up with a lot more information than ever before, but I was still stumped.  There wasn’t anything really obvious in the header information or the response except…

successPath=%2Findex.html&login=heythatsmylogin&password=ooothatsmypassword!

Brilliant!  Now…how to pass this?  I was still a little sore on urllib and httplib, so I sought out yet another Python library to help.  By sheer luck (okay, Google), I stumbled upon Requests: HTTP for Humans.  Not only was that totally reassuring, but the documentation was simple and easy to use.  In no time, I had all the makings of a new authentication scheme that worked.  Huzzah!   Now what?
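For anyone curious what that captured string actually is: it is an ordinary form-encoded request body.  The snippet below is a minimal sketch using only the standard library (not the code from SPGpoints.py) that rebuilds the same body from a dict and decodes it again.  The field names and placeholder values are copied from the capture above; as far as I know, a dict like this passed as the data argument of a Requests post call is form-encoded in much the same way.

```python
from urllib.parse import parse_qs, urlencode

# Placeholder values copied from the captured body above (not real credentials).
form_fields = {
    "successPath": "/index.html",
    "login": "heythatsmylogin",
    "password": "ooothatsmypassword!",
}

# Encode the fields as application/x-www-form-urlencoded.
# safe="!" leaves the exclamation mark unescaped, matching the capture.
body = urlencode(form_fields, safe="!")
print(body)

# Decoding the body recovers the original fields.
decoded = {key: values[0] for key, values in parse_qs(body).items()}
print(decoded)
```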

Challenge 2: Scraping the sites for data

Iterating through the pages was fairly simple (start with 0, increment the URL value by 16 each time), but the real challenge was making sense of the source.  To begin with, the source for each page was almost 5000 lines long.  Most of this was style information and scripts that I really didn’t care about.  The real meat of the data was in a table that looked something like this:
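The paging scheme can be sketched roughly like this.  The URL template and the startIndex parameter name are made up for illustration (the real spg.com address isn’t shown here); only the start-at-0, step-by-16 logic comes from the description above.

```python
from itertools import count, islice

PAGE_SIZE = 16  # transactions per page, as described above

# Hypothetical URL template; the real address and parameter name are assumptions.
URL_TEMPLATE = "https://example.com/activity?startIndex={offset}"


def page_urls(template=URL_TEMPLATE, page_size=PAGE_SIZE):
    """Yield page URLs with offsets 0, 16, 32, ... without end."""
    for offset in count(0, page_size):
        yield template.format(offset=offset)


# Take only the first three pages; a real script would stop when a page
# comes back with no transactions.
first_three = list(islice(page_urls(), 3))
for url in first_three:
    print(url)
```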

<tr>
<td class="even first">Activity</td>
<td class="even first">Points</td>
<td class="even first">Post Date</td>
<td class="even first">Details</td>

Within the Details section there are only two things I care about:

<a class="propertyName" href="addressblahblahblah">Property Name</a>
<div class="stayDates">Type</div>
</tr>

The data I wanted was alternately assigned to the ‘even’ and ‘odd’ class, which was slightly annoying, but understandable.  More annoying were the td sections immediately following each of them with class=’odd inilineBookingRow’ that contained all kinds of garbage.  As an initial step, I decided to pull all the items with either the even or odd class attribute – I could clean it up later.

Challenge 3: Cleaning the Data

After a few hours of poking around for a good scraping tool, I settled on BeautifulSoup.  It seemed to have the best reputation around for parsing and navigating tree structures from the web, so I gave it a shot.  Unfortunately, the HTML parsers for BeautifulSoup break each attribute into its own section, so I couldn’t find a way to specify (either in SoupStrainer or soup.find_all) that I wanted the td values with class=’even’ but not those with class=’even inilineBookingRow’.  The only silver lining on this was that the elements in the inlineBookingRow section were mostly non-printing characters.  I resigned myself to grabbing all odd/even tags and outputting only the text of those elements.
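For comparison, here is a rough sketch of the filter I was after, written with the standard library’s html.parser instead of BeautifulSoup (so this is a swapped-in technique, not what the script does).  It keeps the text of td cells whose class list includes even or odd and skips any cell that also carries the booking-row class.  The sample markup is invented for the example, and the class name is spelled the way it appears above.

```python
from html.parser import HTMLParser

WANTED = {"even", "odd"}
UNWANTED = "inilineBookingRow"  # spelled as quoted in the post

# Invented sample markup, loosely modelled on the table described above.
SAMPLE = """
<table>
<tr>
<td class="even first">Stay</td>
<td class="even">1,000</td>
<td class="even inilineBookingRow">&nbsp;</td>
</tr>
<tr><td class="odd first">Bonus</td><td class="odd">250</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of wanted <td> cells, skipping booking rows."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._capturing = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # The class attribute is a space-separated list of names.
            classes = set((dict(attrs).get("class") or "").split())
            self._capturing = bool(classes & WANTED) and UNWANTED not in classes
            self._buffer = []

    def handle_data(self, data):
        if self._capturing:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._capturing:
            # Collapse runs of whitespace inside the cell.
            self.cells.append(" ".join("".join(self._buffer).split()))
            self._capturing = False


collector = CellCollector()
collector.feed(SAMPLE)
collector.close()
cells = collector.cells
print(cells)
```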

The rest of the process actually wasn’t too bad – I needed a lot of string, list, and tuple manipulation, but those are things I’m pretty familiar with at this point.  Since my approach de-associated all of the data (but thankfully preserved the order), I pulled the four elements back into a list of tuples by popping them off one at a time.  Once they were back in tuples, I could write them to a pipe-delimited file (there are commas in the data, so no csv) and post a message to the user that everything was done.
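A minimal sketch of that regrouping step, assuming four fields per transaction in a fixed order.  It slices the flat list rather than popping items off one at a time, and it leans on the csv module with a pipe delimiter instead of hand-built strings, so treat it as one possible approach rather than the script itself.  The sample values are invented.

```python
import csv
import io

FIELDS_PER_ROW = 4  # activity, points, post date, details


def regroup(flat, size=FIELDS_PER_ROW):
    """Turn a flat, ordered list of cell texts into a list of tuples."""
    if len(flat) % size:
        raise ValueError("cell count is not a multiple of the row size")
    return [tuple(flat[i:i + size]) for i in range(0, len(flat), size)]


# Invented sample data; note the commas inside some fields.
flat_cells = [
    "Stay", "1,000", "01/02/2012", "Some Hotel, Boston",
    "Bonus", "250", "01/05/2012", "Promotion",
]
rows = regroup(flat_cells)

# Write pipe-delimited text to an in-memory buffer; a real script would
# open a file with newline="" instead.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
writer.writerows(rows)
output = buffer.getvalue()
print(output)
```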

All the filtering slows the script down quite a bit; it takes around 45 seconds to run.  I would wager that much of this slowness is due to BeautifulSoup, which is notorious on the internets for its glacial processing.  I can try another parser later on, but my initial attempt to install lxml was met with abject failure.

Anyway, here’s the code.  Now that I have the dataset, it might be nice to spit out some pretty charts in R.  I’ll…keep working on that.

Truly Stopping SOPA and PIPA

As the excitement around today’s SOPA/PIPA protest fades away (as it inevitably will in the next few weeks), it’s absolutely necessary to take stock of the players involved in this battle and the organizations that support them.  Clay Shirky has an excellent TED talk that breaks down the history of the Audio Home Recording Act and the Digital Millennium Copyright Act and goes into the more onerous provisions of PIPA and SOPA.

He also makes a passing reference to the Combating Online Infringement and Counterfeits Act, which is the forefather of both PIPA and SOPA.  In fact, very little was changed between COICA and PIPA – it contained the same absurdly broad definitions, provisions for DNS blocking, and penalties for organizations that failed to enforce them.  COICA died in committee last year, thanks in part to Senator Ron Wyden (D-OR), but it didn’t see nearly as much protest as SOPA/PIPA have elicited.  The EFF and some geekier members of the public stepped up to complain, and most Congressmen responded with poorly researched responses from overworked staffers like the one I received from Scott Brown a little over a year ago:

 

Dear Mark,

Thank you for contacting me regarding the Combating Online Infringement and Counterfeiting Act (S. 3804). I always value the input of my constituents on all issues and appreciate hearing from you.

S. 3804 would provide the Department of Justice a new legal pathway to shut down websites that provide unauthorized access to copyrighted materials. If a federal court determines that a website allows illegal access to copyrighted materials, the Department of Justice has the ability to take the website down if it is registered in the United States. If the website is registered in a foreign country then Internet Service Providers must limit access to the website.

Intellectual property (IP) theft is a growing issue affecting many American industries, including music, movie studios and software companies. IP theft results in a loss of revenue that costs jobs in Massachusetts and across the country. The Department of Justice has successfully shut down organizations in the United States that are dedicated to piracy, but has encountered difficulty with modern peer-to-peer downloading through “torrent” technology. With this technology, individuals can download without a central server and by basing their activities in regions with less enforcement of copyright laws.

You may be interested to know that the United States recently released finalized text of the Anti-Counterfeiting Trade Agreement. Talks involved member countries of the European Union as well as Australia, Canada, Singapore, and South Korea. The agreement aims to better combat the sale of counterfeit and pirated goods. Work on the Trade Agreement will continue into 2011. International cooperation on intellectual property is critical to combat illegal downloading.

Again, thank you for sharing your thoughts with me. Should legislation on this topic come before the full Senate for debate, I will consider it with your thoughts in mind. If I can be of further assistance, please feel free to contact me or visit my website at www.scottbrown.senate.gov.

Sincerely,
Scott P. Brown
United States Senator

I especially enjoyed the quotations around “torrent” and the paragraph regarding ACTA, which is potentially more horrible than SOPA and PIPA put together, but we will never know, since it was drafted and signed in secret.

Why did COICA fail where SOPA and PIPA were about to succeed?  An experienced congressional cynic would assume that the only difference between COICA and SOPA/PIPA is lobbying dollars.  The content industry spent $1.8 million on COICA, and it failed miserably.  Similar numbers aren’t out yet for SOPA/PIPA, but some people have ventured a guess, and remarkably, it’s a larger number.

On the upside, today’s protests seem to have worked.  Oh, and Scott Brown?  He announced via Twitter that he would vote ‘no’ on PIPA.  The same Senator whose office sent out the text above is now switching sides because he is running for re-election and is desperately trying to pick up votes in a notoriously blue state.  That’s how Congress should work.  If enough of the populace cares about something – really cares – their voices should be heard.

What is it going to take to kill the ideas behind SOPA and PIPA?  Extreme vigilance on the part of the digerati.  In a sense, we blew our collective load today on a massive internet protest.  I sincerely doubt we will ever see a repeat of today’s intensity towards a copyright issue.  With any luck, we have educated enough of the public about the really terrible portions of SOPA/PIPA and the content industry will be forced to leave DNS alone.

They will find another way to drastically reduce our ability to copy media.  They will try to enforce it through any means available.  Those who understand the internet and the intended function of copyright will need to stand up (again) and say simply, “No.”

…and we’ll have to hope that the rest of the world joins us.

Phonetic Python Revisited

After reading through some of the other stock python functions, I realized that an unchanging set of information is begging to be placed in a ‘dict’ instead of an external file.  Also, the hyphen in “X-Ray” broke my regex function :/

The only problem with using dict – I would need to change the format of the external file I was using.  I thought briefly about copying the table into Excel and concatenating together all of the items that I needed to create the dictionary, since that’s been my weapon of choice as of late.  Seconds later, I realized that I had printed the entire list when I was testing the initial program – why couldn’t I use a small Python script to do the work for me?  So that’s what I did :)

If you’re curious, here’s the helper script to write the dict and the finished (simpler) code that uses the dict.
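In case the links go stale, here is a minimal sketch of the idea (not the linked code itself): build the dict from a plain list of code words, keyed on each word’s first letter, then look letters up directly.  Because the lookup is a plain dict access, the hyphen in “X-Ray” needs no special handling.  The spell function name and the exact spellings of the code words are my own choices for the example.

```python
# Code words in alphabetical order; spellings are an assumption for this sketch.
CODE_WORDS = [
    "Alfa", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf",
    "Hotel", "India", "Juliett", "Kilo", "Lima", "Mike", "November",
    "Oscar", "Papa", "Quebec", "Romeo", "Sierra", "Tango", "Uniform",
    "Victor", "Whiskey", "X-Ray", "Yankee", "Zulu",
]

# Key each code word by its first letter: {"A": "Alfa", "B": "Bravo", ...}
PHONETIC = {word[0]: word for word in CODE_WORDS}


def spell(word):
    """Return the code words for each letter, skipping non-letters."""
    return " ".join(PHONETIC[ch] for ch in word.upper() if ch in PHONETIC)


print(spell("WTF"))
print(spell("x"))
```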

Brushing up on Python

Given that work hasn’t reached full intensity, I decided to start brushing up on the little Python that I know.  I haven’t touched it in a while (since tooling around with the OLPC), and my schooling was limited mostly to the Google Python Class.  If you have the urge to learn Python and some time to watch the videos, I highly recommend it :)

As with most languages (computing or otherwise), I’ve found that it’s easier to get excited about learning when you have a goal in mind.  For reasons that I don’t fully understand, I was compelled to create a little program that takes in a word and outputs the phonetic alphabet characters associated with each letter (WTF –> Whiskey Tango Foxtrot).

After a couple slightly frustrated hours and some Google searches, I was able to figure out that I had to loop through both the letters in the word and the reference file.  It seems painfully obvious now :/  Here are some links to the finished code (rename it to a .py file if you want to run it) and the reference file for the phonetic alphabet.
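The two-loop idea looks roughly like this.  It is a sketch rather than the linked code, and it assumes a reference file with one code word per line; an in-memory file with only three entries stands in for the real one.

```python
import io

# Stand-in for the reference file: one code word per line (subset only).
# The real file's format is an assumption here.
REFERENCE = io.StringIO("Foxtrot\nTango\nWhiskey\n")


def spell_from_reference(word, reference):
    """Match each letter of the word against the lines of the reference."""
    entries = [line.strip() for line in reference if line.strip()]
    result = []
    for letter in word.upper():      # outer loop: letters in the word
        for entry in entries:        # inner loop: lines from the file
            if entry[0].upper() == letter:
                result.append(entry)
                break
    return " ".join(result)


spelled = spell_from_reference("WTF", REFERENCE)
print(spelled)
```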

All of this turned out to be a really good thing, considering the free Stanford class I signed up for on Natural Language Processing requires either Java or Python proficiency.

Setup for WordPress on NearlyFreeSpeech.NET

WordPress is a really nice blogging/web publishing tool, but it was kind of a pain to set up on nearlyfreespeech.net (excellent host, btw).  After wading through the WordPress instructions, some blogs, and Stack Overflow, I think I’ve gotten it down to a science.

Here are the most helpful links I found:

Key lessons from this experience:

  • NearlyFreeSpeech requires that you set up a MySQL process, THEN create a new database.  Try not to forget the second step.
  • When you upload files, determine how you want your website to look and put WordPress in the right place.  By default, NFS will dump you in the /home/public directory when you SSH into your account.  If you want your website to be hosted directly on your domain (e.g., notpace.com), put the WordPress files there.  If you want it in a subdirectory, create a folder for it in the /home/public directory and put all the WordPress files there (/home/public/wordpress would show up at notpace.com/wordpress in my case).
  • The permissions on the WordPress files that you upload are not set up properly.  Immediately after uploading your WordPress installation files, check to make sure that you have changed the permissions so that the group is correct (check the link above).
  • Before you complete the installation for WordPress, make absolutely sure that you have made all the required edits to wp-config.php.  It’s a huge pain to back out those changes.
  • To upload anything successfully to your site via FTP, you need to enable FTP on NFS in not one, but TWO places: the profile tab and the site tab.
  • WordPress defaults to your latest posts as the homepage for your site, but you can change this in the Settings -> Reading menu.
  • You can rotate text using CSS, but Height/Width of the container for that text is calculated BEFORE it is rotated, which makes for some very absurd layouts if you’re not careful.  There is a reason that many web developers give up and use pictures rather than creating entire sites with text and CSS styling.

Welcome to notpace

In the waning hours of 2011, I finally have time to do something with this domain name.  In light of the de-socialization of Google Reader, I guess my longform writing will have a new home.  Also, I can play around with some web development.

Glorious :)