Manhattan Cafe Wars: Starbucks & Subway
@kaggle.shiratoriseto_manhattan_cafe_wars
Nine analysis-ready files covering Manhattan's cafe landscape, transit infrastructure, building attributes, demographics, and location fitness scoring:
| Data | Source | License |
|---|---|---|
| Starbucks & Cafes | OpenStreetMap via Overpass API | ODbL 1.0 |
| MTA Ridership | data.ny.gov (MTA Subway Hourly Ridership) | OPEN NY Terms |
| Building attributes | MapPLUTO (NYC Dept. of City Planning) | NYC Open Data Terms |
| Demographics | US Census ACS 2022 5-Year Estimates | Public domain |
| Tract boundaries | TIGER/Line Shapefiles (Census Bureau) | Public domain |
| Pedestrian Counts | NYC DOT Bi-Annual Pedestrian Counts (NYC Open Data) | NYC Open Data Terms |
OpenStreetMap attribution: © OpenStreetMap contributors. Data available under the Open Database License (ODbL) v1.0.
MTA data: Provided by the Metropolitan Transportation Authority via New York State Open Data portal.
NYC DOT data: Provided by the NYC Department of Transportation via NYC Open Data.
All coordinates are in WGS84 (EPSG:4326) — standard latitude/longitude.
| Column | Type | Description |
|---|---|---|
| osm_id | int | OpenStreetMap element ID |
| osm_element | str | OSM element type (node/way) |
| name | str | Store name |
| brand | str | Brand name (Starbucks) |
| addr_street | str | Street name (18% missing) |
| addr_housenumber | str | House number (18% missing) |
| addr_postcode | str | ZIP code (18% missing) |
| addr_city | str | City name |
| phone | str | Phone number (24% missing) |
| opening_hours | str | Opening hours in OSM format (46% missing) |
| wheelchair | str | Wheelchair accessibility |
| outdoor_seating | str | Outdoor seating available |
| indoor_seating | str | Indoor seating available |
| drive_through | str | Drive-through available |
| takeaway | str | Takeaway available |
| lat | float | Latitude (WGS84) |
| lon | float | Longitude (WGS84) |
| Column | Type | Description |
|---|---|---|
| osm_id | int | OpenStreetMap element ID |
| osm_element | str | OSM element type |
| name | str | Cafe name ('Unknown Cafe' if missing) |
| brand | str | Brand name (NaN for independents) |
| amenity | str | OSM amenity tag |
| cuisine | str | Cuisine type tag |
| addr_street | str | Street name |
| addr_housenumber | str | House number |
| lat | float | Latitude (WGS84) |
| lon | float | Longitude (WGS84) |
| brand_category | str | One of: starbucks, dunkin, branded, independent |
| Column | Type | Description |
|---|---|---|
| station_complex_id | str | MTA station complex identifier |
| station_name | str | Station name with subway lines |
| lat | float | Latitude (WGS84) |
| lon | float | Longitude (WGS84) |
| avg_daily_ridership | int | Average daily ridership (all fare types combined) |
| total_ridership_q4_2024 | int | Total ridership Oct-Dec 2024 |
| data_period | str | Data collection period |
| data_days | int | Number of days in the period (92) |
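
The relationship between the ridership columns can be sketched as follows. This is a minimal illustration that assumes `avg_daily_ridership` is `total_ridership_q4_2024` divided by `data_days` and rounded to an integer; the rounding rule and the example total are assumptions, not values from the dataset.

```python
from datetime import date

# Days in the Oct-Dec 2024 window (end date is exclusive)
DATA_DAYS = (date(2025, 1, 1) - date(2024, 10, 1)).days

def avg_daily_ridership(total_ridership, data_days=DATA_DAYS):
    """Average riders per day, rounded to the nearest integer (assumed rounding)."""
    return round(total_ridership / data_days)

# Hypothetical quarterly total, not a value from the dataset
example_avg = avg_daily_ridership(920_000)
```

The date arithmetic confirms the 92-day period stated for `data_days`.
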
| Column | Type | Description |
|---|---|---|
| osm_id – lon | various | Same as manhattan_starbucks_osm.csv (17 columns) |
| pluto_bbl | float | NYC Borough-Block-Lot identifier |
| pluto_landuse | float | PLUTO land use code |
| pluto_bldgclass | str | Building class code |
| pluto_numfloors | float | Number of floors (9 missing) |
| pluto_yearbuilt | float | Year built |
| pluto_unitstotal | float | Total residential units |
| pluto_retailarea | float | Retail floor area in sq ft (8 missing) |
| pluto_assesstot | float | Total assessed value |
| pluto_comarea | float | Commercial floor area in sq ft (8 missing) |
| pluto_lotarea | float | Lot area in sq ft |
| pluto_zonedist1 | str | Primary zoning district |
| pluto_dist_m | float | Distance to matched PLUTO lot (meters) |
| mta_station_id | str | Nearest MTA station complex ID |
| mta_station_name | str | Nearest station name |
| mta_dist_m | float | Distance to nearest station (meters) |
| mta_avg_daily_ridership | int | Nearest station's average daily ridership |
| n_starbucks_250m | int | Starbucks count within 250m |
| n_dunkin_250m | int | Dunkin' count within 250m |
| n_other_cafe_250m | int | Other cafe count within 250m |
| n_starbucks_500m | int | Starbucks count within 500m |
| n_dunkin_500m | int | Dunkin' count within 500m |
| n_other_cafe_500m | int | Other cafe count within 500m |
| n_starbucks_1000m | int | Starbucks count within 1km |
| n_dunkin_1000m | int | Dunkin' count within 1km |
| n_other_cafe_1000m | int | Other cafe count within 1km |
| nearest_competitor_dist_m | float | Distance to nearest non-Starbucks cafe (meters) |
| nearest_starbucks_dist_m | float | Distance to nearest other Starbucks (meters) |
| census_tract_id | int | Census Tract GEOID |
| tract_population | int | Tract total population (ACS 2022) |
| tract_median_income | float | Tract median household income in dollars (3 missing) |
| tract_pct_walk_commute | float | Percent of workers who walk to work |
| tract_pct_bachelors_plus | float | Percent of adults with bachelor's degree or higher |
| station_cluster | int | MTA station ridership-pattern cluster (0-3) |
| station_cluster_name | str | Cluster label: Morning Peak / Balanced / Midday-Heavy / Evening Peak |
| ped_count_nearest | float | Pedestrian count at the nearest NYC DOT counter location |
| ped_dist_m | float | Distance to the nearest pedestrian count location (meters) |
| tract_starbucks_count | int | Number of Starbucks in the same Census Tract |
| tract_total_cafes | int | Total number of cafes in the same Census Tract |
| tract_competitor_cafes | int | Number of non-Starbucks cafes in the same Census Tract |
| tract_mta_ridership | float | Total MTA ridership aggregated at the Census Tract level |
| tract_avg_ped_count | float | Average pedestrian count at tract level |
| tract_pop_density | float | Population density of the Census Tract (people per sq km) |
| demand_proxy_index | float | Composite demand proxy index (ridership + pedestrian + population) |
| supply_index | float | Supply concentration index (cafe density relative to demand) |
| location_fitness_score | float | Location Fitness Score: demand-supply balance metric (higher = more underserved demand) |
| tract_lisa_cluster | str | LISA spatial autocorrelation cluster for the tract (High-High, Low-Low, etc.) |
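
The radius-based competitor counts (`n_*_250m`, `n_*_500m`, `n_*_1000m`) could be reproduced roughly as below. This is a sketch, assuming great-circle (haversine) distance on the WGS84 coordinates and an inclusive radius; the dataset does not state which distance method was used, and the coordinates here are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def count_within(origin, points, radius_m):
    """Count (lat, lon) points lying within radius_m of origin."""
    return sum(1 for p in points if haversine_m(*origin, *p) <= radius_m)

# Hypothetical coordinates for illustration only; the store itself is excluded
store = (40.7500, -73.9900)
others = [(40.7510, -73.9900), (40.7540, -73.9900), (40.7600, -73.9900)]
counts = {r: count_within(store, others, r) for r in (250, 500, 1000)}
```

A brute-force scan like this is fine for a few thousand points; a spatial index would be the usual choice for larger inputs.
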
| Column | Type | Description |
|---|---|---|
| GEOID | str | Census Tract FIPS code |
| ALAND | int | Land area in square meters |
| AWATER | int | Water area in square meters |
| NAMELSAD | str | Tract name |
| starbucks_count | int | Number of Starbucks in this tract |
| lisa_cluster | str | LISA classification: High-High, Low-Low, High-Low, Low-High, or Not Significant |
| tract_starbucks_count | int | Starbucks count (same as starbucks_count) |
| tract_total_cafes | int | Total number of cafes in the tract |
| tract_competitor_cafes | int | Number of non-Starbucks cafes |
| tract_mta_ridership | float | Total MTA daily ridership attributed to the tract |
| tract_avg_ped_count | float | Average pedestrian count in the tract |
| tract_population | int | Tract total population (ACS 2022) |
| tract_median_income | float | Tract median household income |
| tract_pct_walk_commute | float | Percent of workers who walk to work |
| tract_pct_bachelors_plus | float | Percent with bachelor's degree or higher |
| tract_pop_density | float | Population density (people per sq km) |
| tract_avg_ped_count_filled | float | Pedestrian count with missing values filled |
| dpi_ridership | float | Demand proxy sub-index: MTA ridership component |
| dpi_pedestrian | float | Demand proxy sub-index: pedestrian flow component |
| dpi_population | float | Demand proxy sub-index: population component |
| demand_proxy_index | float | Composite Demand Proxy Index (0-1 scale) |
| supply_index | float | Supply concentration index (cafe count relative to area) |
| supply_normalized | float | Normalized supply index (0-1 scale) |
| location_fitness_score | float | Location Fitness Score: demand minus supply (higher = underserved) |
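
One way the score columns might fit together is sketched below. It assumes min-max scaling for each 0-1 index and an equal-weight average of the three demand sub-indices; the actual weights and scaling are not documented here, so treat both as assumptions. Only the final step (demand minus normalized supply) comes from the column descriptions.

```python
def min_max(values):
    """Scale values to the 0-1 range; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def fitness_scores(ridership, pedestrian, population, supply):
    """Return demand-minus-supply scores, one per tract."""
    dpi_ridership = min_max(ridership)
    dpi_pedestrian = min_max(pedestrian)
    dpi_population = min_max(population)
    # Equal weighting of the three sub-indices is an assumption
    demand = [(r + p + n) / 3
              for r, p, n in zip(dpi_ridership, dpi_pedestrian, dpi_population)]
    supply_normalized = min_max(supply)
    return [d - s for d, s in zip(demand, supply_normalized)]

# Three hypothetical tracts, not values from the dataset
scores = fitness_scores([100, 300, 500], [10, 20, 30], [1000, 2000, 3000], [8, 4, 0])
```

Under these assumptions the score ranges from -1 (low demand, high supply) to 1 (high demand, no supply).
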
| Column | Type | Description |
|---|---|---|
| the_geom | str | Point geometry (WKT) |
| OBJECTID | int | Unique object identifier |
| Loc | int | Location identifier |
| Borough | str | Borough name (Manhattan) |
| Street_Nam | str | Street name of the count location |
| From_Stree | str | Cross street (from) |
| To_Street | str | Cross street (to) |
| Iex | str | Index/intersection identifier |
| {Period}_AM | int | AM peak pedestrian count for the survey period (e.g., May07_AM, Sept07_AM, ... May25_AM) |
| {Period}_PM | int | PM peak pedestrian count for the survey period |
| {Period}_MD | int | Midday pedestrian count for the survey period |
Surveys run twice a year (May and September/October) from May 2007 to May 2025, each with three time slots (AM, PM, Midday).
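
Because each survey period is stored as three wide columns, reshaping to a long format is often the first step. The sketch below assumes every period column follows the `{Month}{YY}_{slot}` pattern seen in the examples (`May07_AM`, `Sept07_AM`); actual column names may deviate, and the sample row is hypothetical.

```python
import re

# Month abbreviation, two-digit year, then one of the three time slots
PERIOD_COL = re.compile(r"^([A-Za-z]+)(\d{2})_(AM|PM|MD)$")

def parse_period_column(name):
    """Return (month, year, slot) for a period column, or None otherwise."""
    m = PERIOD_COL.match(name)
    if m is None:
        return None
    month, yy, slot = m.groups()
    return month, 2000 + int(yy), slot

def to_long(row):
    """Turn one wide row (a dict) into one record per survey period and slot."""
    records = []
    for col, value in row.items():
        parsed = parse_period_column(col)
        if parsed is not None:
            month, year, slot = parsed
            records.append({"Loc": row["Loc"], "month": month,
                            "year": year, "slot": slot, "count": value})
    return records

# Hypothetical row for illustration
sample = {"Loc": 1, "Street_Nam": "Example St", "May07_AM": 120, "May07_PM": 340}
long_rows = to_long(sample)
```
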
Census Tract polygons with LISA (Local Indicators of Spatial Association) results:
| Property | Type | Description |
|---|---|---|
| GEOID | str | Census Tract FIPS code |
| NAMELSAD | str | Tract name |
| geometry | Polygon | Tract boundary (WGS84) |
| starbucks_count | int | Number of Starbucks in this tract |
| lisa_cluster | str | LISA classification: High-High, Low-Low, High-Low, Low-High, or Not Significant |
| lisa_q | int | LISA quadrant (1-4) |
| lisa_p | float | LISA p-value |
| lisa_I | float | Local Moran's I statistic |
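
The `lisa_cluster` label can plausibly be derived from `lisa_q` and `lisa_p`. The sketch below assumes the common PySAL quadrant coding (1 = High-High, 2 = Low-High, 3 = Low-Low, 4 = High-Low) and a 0.05 significance cutoff; neither is confirmed by the dataset description, so verify against the shipped `lisa_cluster` values before relying on it.

```python
# Assumed quadrant coding (PySAL convention)
QUADRANT_LABELS = {1: "High-High", 2: "Low-High", 3: "Low-Low", 4: "High-Low"}

def lisa_label(lisa_q, lisa_p, alpha=0.05):
    """Map a LISA quadrant and p-value to a cluster label."""
    if lisa_p >= alpha:
        return "Not Significant"
    return QUADRANT_LABELS[lisa_q]

# Hypothetical values for illustration
labels = [lisa_label(1, 0.01), lisa_label(3, 0.20), lisa_label(4, 0.03)]
```
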
| Column | Type | Description |
|---|---|---|
| station_complex_id | str | MTA station complex identifier (join key) |
| cluster | int | Cluster assignment (0-3) |
| cluster_name | str | Morning Peak (Residential) / Balanced (Transit Hub) / Midday-Heavy (Tourism) / Evening Peak (Office) |
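
Since `station_complex_id` is the join key, cluster labels can be attached to station records with a simple lookup. This sketch uses inline CSV text with hypothetical rows in place of the real files, and keeps the ID as a string so that values are compared exactly as stored.

```python
import csv
import io

# Hypothetical rows standing in for the station and cluster files
stations_csv = """station_complex_id,station_name,avg_daily_ridership
A1,Example St (1),12000
B2,Sample Av (A),8000
"""
clusters_csv = """station_complex_id,cluster,cluster_name
A1,3,Evening Peak (Office)
"""

# Index cluster rows by the join key
clusters = {r["station_complex_id"]: r
            for r in csv.DictReader(io.StringIO(clusters_csv))}

joined = []
for row in csv.DictReader(io.StringIO(stations_csv)):
    match = clusters.get(row["station_complex_id"], {})
    # Stations with no cluster assignment get None (a left join)
    row["cluster_name"] = match.get("cluster_name")
    joined.append(row)
```
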
| Notebook | Theme | Link |
|---|---|---|
| Manhattan Cafe Wars | Theme 0: EDA & competitor mapping | Open |
| Starbucks 10-K NLP | Theme 1: keyword trends, LDA topics, NLP × store count | Open |
| Starbucks Spatial Clustering | Theme 2A: Moran's I, LISA, Ripley's K | Open |
| Starbucks Location Fitness | Theme 2B: demand-supply scoring & backtest | Open |
| Starbucks Data Pipeline | Pipeline: EDGAR & OSM to CSV, data quality report | Open |
Related dataset: Starbucks 30-Year 10-K NLP Corpus — NLP data for Theme 1