
When to start watching the James Bond series - Part 1

Sun 19 May 2019

I have watched the James Bond series several times in chronological order (by release date). But what week of the year should you start watching the series so that:

  1. Summery movies fall in the summer months (examples: Dr. No, The Living Daylights)
  2. Snow-filled movies are watched during the winter months (examples: On Her Majesty's Secret Service, For Your Eyes Only)

There are 25 movies in the series and I enjoy watching at most 1 per week. While it is impossible to have a perfect alignment of movie to season, there should be one that is the best.

Discretization & Labeling

To solve this problem I settled on counting the number of scenes with weather that could be considered "hot", "neutral", or "cold." Scene extraction is not a trivial process, so instead I sampled a single frame every 150 frames from each movie. This resulted in a total of 1270 frames to be annotated. To make it easier, I used Amazon's Mechanical Turk (MTurk) to first annotate whether each frame was set indoors, outdoors, or neither. While the results were not 100% accurate, I decided that precision could always be increased later. I then manually annotated each outdoor frame with one of the labels "hot", "neutral", or "cold." To be more specific:

  • Hot is a scene where people seem hot, usually a desert or tropical setting.
  • Neutral is a catchall category for outdoor scenes that are neither hot nor cold, think Spring and Fall weather.
  • Cold means either snow or ice is in the scene and people are wearing winter clothing.

The result is that there are 156 "hot" frames, 53 "neutral" frames, and 26 "cold" frames. The hardest distinction is between "hot" and "neutral" frames - I believe many "hot" scenes should have been labeled as "neutral".

With the frame labeling complete I generated summaries for each movie with each frame annotated like the one below:

Summary of Dr. No where each sampled frame is annotated. Notice that it is not a perfect labeling, as MTurk was not perfect.

Each movie is represented by an array where each sampled frame has one of 4 values: 'h' (for "hot"), 'n' (for "neutral"), 'c' (for "cold"), and null for frames where weather could not be determined (or the scene was indoors).

Feature Vector

There are several ways to summarize each movie into a feature vector. For now, I'm using the trivial one: count the number of "hot", "neutral", and "cold" frames in each movie and divide each count by the total number of sampled frames for that movie. An extension is to weight the significance of each label, because winter scenes tend to be really important in Bond movies. Here is what the feature vectors look like when visualized next to each other.

Movies' feature vectors next to each other.
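The counting step described above can be sketched in a few lines of Python. This is only a minimal sketch, not the code used for the figure: the function name `feature_vector` and the example list are made up for illustration, and it assumes each movie is a list of 'h', 'n', 'c', or None entries, one per sampled frame.

```python
from collections import Counter

LABELS = ("h", "n", "c")  # hot, neutral, cold


def feature_vector(frames):
    """Return the fraction of sampled frames labeled hot, neutral, and cold.

    `frames` holds one entry per sampled frame: 'h', 'n', 'c', or None
    (indoor frames, or frames where the weather could not be determined).
    """
    if not frames:
        return [0.0, 0.0, 0.0]
    counts = Counter(label for label in frames if label is not None)
    # Divide by ALL sampled frames, so the fractions need not sum to 1.
    return [counts[label] / len(frames) for label in LABELS]


# Hypothetical movie with 10 sampled frames (not real annotation data)
example = ["h", "h", None, "n", None, None, "c", "h", None, None]
print(feature_vector(example))  # [0.3, 0.1, 0.1]
```

Because the denominator includes indoor and undetermined frames, a movie that is mostly set indoors ends up with a small vector overall, which seems like the right behavior for this problem.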

The Problem

This is probably a good time to pause and formalize what I'm looking for: what is the best week of the year to start watching the Bond series so that the weather in the movies matches the current weather? If this sounds weird just think: Dr. No is a great movie during the summer and an excuse to drink Red Stripe. On the other hand, On Her Majesty's Secret Service takes place during Christmas and it would be nice to watch that movie around December. Here are some rules and ideas to consider:

  • Movies are always watched in order of their release.
  • Movies are watched at most 1 per week.
  • Option: movies do not need to be watched every week; there could be breaks between movies.
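To make the problem statement concrete, here is one possible way to phrase it as a brute-force search over start weeks. Everything in it is an assumption for illustration: the per-week `weekly_warmth` scores, the `movie_warmth` reduction (hot fraction minus cold fraction), and the simplification that one movie is watched every week with no breaks. It is a sketch of the shape of the problem, not the approach used in the next part.

```python
def movie_warmth(vector):
    """Collapse a (hot, neutral, cold) feature vector into one number."""
    hot, _neutral, cold = vector
    return hot - cold


def best_start_week(movie_vectors, weekly_warmth):
    """Try every start week and return (week_index, score) for the best one.

    Assumes one movie per week with no breaks, and that the calendar wraps
    around at the end of the year.
    """
    weeks = len(weekly_warmth)
    best_week, best_score = None, float("-inf")
    for start in range(weeks):
        # Reward warm movies in warm weeks and cold movies in cold weeks.
        score = sum(
            movie_warmth(vector) * weekly_warmth[(start + offset) % weeks]
            for offset, vector in enumerate(movie_vectors)
        )
        if score > best_score:
            best_week, best_score = start, score
    return best_week, best_score


# Toy data, not real measurements: a 4-week "year" and 2 movies
weekly_warmth = [-1.0, 0.0, 1.0, 0.0]        # cold, mild, hot, mild
movies = [(0.5, 0.1, 0.0), (0.0, 0.1, 0.4)]  # a hot movie, then a cold one
print(best_start_week(movies, weekly_warmth))  # (2, 0.5)
```

With 25 movies and 52 weeks the search space is tiny, so trying every start week is cheap. Allowing breaks between movies makes the problem harder, since the gaps become part of the search.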

Next Time

In the next part I'll discuss how I represented the weather for any given week of the year in New York. I'll also share the first results of trying to find the best time of the year to begin watching the Bond series!

Neighborhoods

Sun 17 February 2019

Figuring out the neighborhood a business is located in is not trivial. To illustrate, here are 9 addresses in New York:

  • Bathtub Gin (132 9th Avenue, New York, NY 10010)
  • Beauty Bar (231 East 14th Street New York, NY 10003)
  • Botanic Lab (86 Orchard Street, New York, NY 10002)
  • City Winery (155 Varick Street, New York, NY 10013)
  • Coney Island USA (1208 Surf Ave., Brooklyn, New York 11224)
  • Dromedary Bar (266 Irving Avenue, Brooklyn, NY 11237)
  • Hill & Dale (115 Allen Street, New York, NY 10002)
  • KGB Bar (85 East 4th Street, New York, NY 10003)
  • Talon Bar (220 Wyckoff Ave, Brooklyn, New York 11237)

OpenStreetMap (OSM) was my first and most straightforward approach to mapping addresses to neighborhoods. Part of the ease comes from the Nominatim API, which makes querying OSM a trivial task. Immediately two things become obvious. First, OSM uses neighbourhood as opposed to neighborhood. Second, the distinction between suburbs and neighborhoods is hard to understand.

Location OSM Neighborhood OSM Suburb
Bathtub Gin Chelsea
Beauty Bar Park Slope BK
Botanic Lab Chinatown
City Winery Hudson Square
Coney Island USA West Brighton
Dromedary Bar Bushwick
Hill & Dale Lower East Side
KGB Bar New Dorp Staten Island
Talon Bar Ridgewood

Diving deeper into the OSM data, it becomes clear that there is a complicated ranking system for determining the primary neighborhood for an address. Notice how for the Bathtub Gin there are actually 3 possible neighborhoods. Even more interesting, querying OSM from Python returns Chelsea as a suburb and not a neighborhood. In any case, OSM is not a good solution for resolving neighborhood queries because it leaves many blanks. Additionally, some of its results are too correct (most people would say City Winery is in Chelsea, and no one knows where Hudson Square is).

OpenStreetMap data for the Bathtub Gin

My second attempt was with Yelp, which seemed to have good answers. Check this out:

Location            Yelp
Bathtub Gin         Chelsea
Beauty Bar          Gramercy
Botanic Lab         Lower East Side
City Winery         South Village
Coney Island USA    Coney Island
Dromedary Bar       Bushwick
Hill & Dale         Lower East Side
KGB Bar             East Village
Talon Bar           Bushwick

This looks perfect! Except the neighborhood information is not available through their API. Getting neighborhood data should not require scraping the internet for the answer.

At this point I realized that one place that has excellent neighborhood data is real estate sites. Check out StreetEasy: they neatly display a hierarchy of neighborhoods.


Unfortunately StreetEasy does not make their neighborhood data available. Again, I'm not going to scrape their website for a shapefile. But their competitor, Zillow, does make their shapefile available here.

Pulling their data and narrowing the regions of interest down to only the New York City area was surprisingly easy. The results were not bad, at least on par with Yelp:

Location            Zillow
Bathtub Gin         Chelsea
Beauty Bar          Park Slope
Botanic Lab         Lower East Side
City Winery         SoHo
Coney Island USA    Coney Island
Dromedary Bar       Bushwick
Hill & Dale         Lower East Side
KGB Bar             New Dorp
Talon Bar           Bushwick
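The shapefile lookup boils down to a point-in-polygon test: geocode the address to a longitude/latitude, then check which neighborhood polygons contain that point. The sketch below shows the idea with a plain ray-casting test so it runs without any dependencies. It is a stand-in for a geometry library rather than the code used for the table above, and the function names and rectangle "regions" are made up for illustration.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside the polygon?

    `polygon` is a list of (lon, lat) vertices; the last vertex is
    implicitly connected back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def neighborhoods_for(lon, lat, regions):
    """Return the names of all regions whose polygon contains the point."""
    return [name for name, polygon in regions.items()
            if point_in_polygon(lon, lat, polygon)]


# Made-up rectangles standing in for real neighborhood boundaries
regions = {
    "Region A": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "Region B": [(1.0, 1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)],
}
print(neighborhoods_for(1.5, 1.5, regions))  # ['Region A', 'Region B']
print(neighborhoods_for(0.5, 0.5, regions))  # ['Region A']
print(neighborhoods_for(5.0, 5.0, regions))  # []
```

Note that overlapping regions can return more than one name for the same point, and a point outside every polygon returns none. A wrong answer usually means the geocoder placed the address in the wrong spot, not that the polygon test failed.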

Neighborhoods are not trivial to figure out. None of the solutions I tried worked well. Either some addresses had no neighborhood data or the neighborhood was too specific. Even if there is a correct neighborhood for every address, it should not be assumed that it is the one people use. None of the approaches I tried solved this problem elegantly, nor could any of them be generalized to other cities.

from geopy.geocoders import Nominatim
import shapefile  # provided by the pyshp package
from shapely.geometry import Polygon, Point

def run():

    addresses = [
        ('Bathtub Gin', '132 9th Avenue, New York, NY 10010'),
        ('Beauty Bar', '231 East 14th Street New York, NY 10003'),
        ('Botanic Lab', '86 Orchard Street, New York, NY 10002'),
        ('City Winery', '155 Varick Street, New York, NY 10013'),
        ('Coney Island USA', '1208 Surf Ave., Brooklyn, New York 11224'),
        ('Dromedary Bar', '266 Irving Avenue, Brooklyn, NY 11237'),
        ('Hill & Dale', '115 Allen Street, New York, NY 10002'),
        ('KGB Bar', '85 East 4th Street, New York, NY 10003'),
        ('Talon Bar', '220 Wyckoff Ave, Brooklyn, New York 11237')
    ]

    # Generic client to query from OSM
    client = Nominatim(user_agent="my-application")

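    # Attempt 1: ask OSM (via Nominatim) for the neighbourhood and suburb
    for name, address in addresses:
        location = client.geocode(address, addressdetails=True, extratags=True)
        if location is None:
            # geocode() returns None when the address cannot be resolved
            print(name, None, None)
            continue
        details = location.raw['address']
        print(name, details.get('neighbourhood'), details.get('suburb'))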

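    # Attempt 3: point-in-polygon lookup against Zillow's neighborhood shapefile
    # (attempt 2, Yelp, has no code because its API does not expose neighborhoods)
    regions = {}

    with shapefile.Reader("ZillowNeighborhoods-NY") as sf:
        for i in range(len(sf)):
            record = sf.record(i)
            # Extract regions from the shapefile that are in the city of New York
            if record[2] == 'New York':
                # Key each polygon by its neighborhood name
                regions[record[-2]] = Polygon(sf.shape(i).points)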

    for name, address in addresses:
        # Look up the address's lon/lat
        location = client.geocode(address, addressdetails=True, extratags=True)
        if location is None:
            # Skip addresses that could not be geocoded
            continue
        point = Point(location.longitude, location.latitude)

        # Names of the neighborhoods whose polygon contains the address's lon/lat
        names = [region for region, polygon in regions.items() if polygon.intersects(point)]

        if names:
            print(name, names)


if __name__ == '__main__':
    run()