2018/09/07/scrapebys

The Sotheby's art auction house started moving towards digital auctions a little over 9 years ago, posting high-quality scans, about 2000px on their longest side, of the various works of art that Sotheby's has sold online. There was a blessed period in 2011 when some intern started uploading the images at 4000px, but it was short-lived. If you know where to go to read this blog, then you've likely already heard me mention this a million times by now.

Early last year, I had been going through the auctions by hand, picking and choosing paintings that I wanted to download, but this year I came back to their website to see if I could automate the process and just grab all of their paintings at once. Please note: my imaginary legal department wanted me to pass on the fact that scraping the Sotheby's website is against their terms of use (check condition 7d), so I suggest you do this at your own discretion. On the bright side, their robots.txt file suggests that they only care about protecting their receipts, which actually contain what one might consider valuable data. While programming a scraper isn't particularly difficult, I've catalogued some brief notes on my process of cracking Sotheby's Fabergé egg.

The code, as it stands, is hosted in a Bitbucket repo.

We can start by checking out their catalog of past sales. If you open up the Network tab in your favourite modern browser's console and navigate to their auction archive, you'll spot an interesting GET call for a JSON file: ajax.auctions.json.

The file only retrieves a maximum of 500 auctions at a time. We can work around this by doing multiple requests and changing the end date to the earliest returned auction.

import json
import requests

auction_json = json.loads(requests.get(url, headers=headers).text)
# Sotheby's limits the number of results to 500 per request;
# once we get fewer than that, we can finish up.
if len(auction_json['events']) < 500:
    end_date = 0
else:
    end_date = auction_json['events'][-1]['startTimeStampInMilliSecs']
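The snippet above can be wrapped in a loop that walks the end date backwards until a short page signals we've hit the oldest auctions. This is a sketch: `fetch_events` stands in for the actual ajax.auctions.json request (URL and parameters omitted here), and the names are illustrative.

```python
def collect_auctions(fetch_events, page_size=500):
    """Repeatedly fetch auction pages, moving the end date back each time.

    `fetch_events(end_date)` is assumed to wrap the ajax.auctions.json GET
    and return the 'events' list; pass end_date=None for the first request.
    """
    all_events = []
    end_date = None
    while True:
        events = fetch_events(end_date)
        all_events.extend(events)
        # A short page means there's nothing older left to fetch.
        if len(events) < page_size:
            break
        end_date = events[-1]['startTimeStampInMilliSecs']
    return all_events
```

Walking by timestamp rather than by page offset sidesteps any server-side cap on how deep an offset can go.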

When visiting the URL for an auction, the items on sale (referred to as "lots") appear in a list. But how is this list populated? I figured there was another API call to get the data sent via a JSON file, but I wasn't quite right. I burned through most of the JSON and JS files loaded alongside the auction page with no results. It was only after explicitly searching for the name of a lot listed on the auction that I found out the data for each lot was stored as a JSON-formatted string within a Javascript array.
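Pulling that string back out is a regex job. A minimal sketch: the variable name `lotData` below is a placeholder, not Sotheby's actual markup, so inspect the auction page source and adjust the pattern to whatever the array is really called.

```python
import json
import re

def extract_lot_json(page_html, var_name="lotData"):
    """Find a `var_name = [...];` assignment in the page and parse the
    array as JSON. The variable name is a guess; check the real page."""
    pattern = re.compile(re.escape(var_name) + r'\s*=\s*(\[.*?\]);', re.DOTALL)
    match = pattern.search(page_html)
    if match is None:
        return None
    return json.loads(match.group(1))
```

The non-greedy `.*?` stops at the first `];`, which is fragile if the embedded JSON itself contains that sequence; a tolerant JSON parser or a proper JS tokenizer would be sturdier, but this was enough for my purposes.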

It turns out this doesn't always hold: some auctions have no detail page or no overview at all. Those are usually older auctions, however, which were never digitized.

Now that we can access arbitrary auctions and their contents, we can just save the auction data to a JSON file and start downloading the images in each auction. Lots that remain under copyright, lots that aren't photographed, and lots that are wine (there's an "isWine" flag in the JSON just for that!) have a placeholder image, so we can just ignore them while downloading the other lots:

for lot in lots_dict['lots']:
    # don't load blank images, don't load copyright placeholders, don't drink and drive
    if "underCopyright" in lot["image"] or "lot.jpg" in lot["image"] or lot["isWine"] == "true":
        continue

There are still a few missing features in the program that I might tackle if I feel some pressing desire for their inclusion. Just don't hold your breath for them:

  • filtering based on more than just auction title
  • avoiding the downloading of lots that are being resold from previous auctions
  • better file names (the current-slug-format-isn't-very-convenient.jpg)
  • turning the program into a python package

For now I've been using the files I've scraped as randomized desktop backgrounds. I may not publicize this far and wide, but if you've found this text, I hope you find my program useful.