Manhattan Cafe Wars: Starbucks & Subway

Kaggle

@kaggle.shiratoriseto_manhattan_cafe_wars

171 Starbucks enriched with PLUTO, MTA, Census, LFS & pedestrian counts

Dataset Description

What's inside

Eight analysis-ready files covering Manhattan's cafe landscape, transit infrastructure, building attributes, demographics, and location fitness scoring:

Core files (v1)

  1. manhattan_starbucks_osm.csv — 171 Starbucks locations from OpenStreetMap with addresses and amenity attributes
  2. manhattan_cafes_osm.csv — 1,335 cafes (Starbucks, Dunkin', branded chains, and independents) with brand classification
  3. manhattan_mta_ridership_summary.csv — 123 subway station complexes with average daily ridership (Q4 2024)

Added in v2

  1. manhattan_tracts_lisa.geojson — Census Tract polygons with LISA spatial autocorrelation cluster labels (High-High, Low-Low, etc.)
  2. mta_station_clusters.csv — 123 stations classified into 4 ridership-pattern clusters (Morning Peak, Balanced, Midday-Heavy, Evening Peak)

New in v3

  1. stores_enriched_v4.csv: 171 Starbucks x 63 columns. Carries over all 51 columns of stores_enriched_v3.csv and adds 12 new columns, including tract-level cafe counts, MTA ridership, pedestrian counts, population density, demand/supply indices, Location Fitness Score, LISA cluster, and nearest pedestrian counter. Replaces stores_enriched_v3.csv. Note that the file suffix (v4) is one ahead of the dataset version (v3).
  2. tract_demand_supply.csv — 309 Census Tracts x 24 columns: comprehensive demand-supply scoring for all Manhattan tracts, including ridership-based and pedestrian-based demand proxy indices, supply index, and Location Fitness Scores (LFS)
  3. manhattan_pedestrian_counts.csv — NYC DOT Bi-Annual Pedestrian Counts at 36 Manhattan locations (2007-2025), with AM/PM/Midday counts per survey period

Use cases

  • Spatial competition analysis (Starbucks vs. competitors)
  • Spatial autocorrelation (Moran's I, LISA hotspot detection)
  • Point pattern analysis (Ripley's K / Besag's L function)
  • Transit-oriented location scoring
  • Walk-distance network analysis with OSMnx
  • Demand proxy modeling using subway ridership and pedestrian flow
  • Location Fitness Score (LFS) analysis and site selection
  • Demand-supply gap identification for new store candidates
  • Urban retail geography research
  • Pedestrian flow temporal analysis (seasonal and multi-year trends)

Data sources & licenses

File | Source | License
Starbucks & Cafes | OpenStreetMap via Overpass API | ODbL 1.0
MTA Ridership | data.ny.gov (MTA Subway Hourly Ridership) | OPEN NY Terms
Building attributes | MapPLUTO (NYC Dept. of City Planning) | NYC Open Data Terms
Demographics | US Census ACS 2022 5-Year Estimates | Public domain
Tract boundaries | TIGER/Line Shapefiles (Census Bureau) | Public domain
Pedestrian Counts | NYC DOT Bi-Annual Pedestrian Counts (NYC Open Data) | NYC Open Data Terms

OpenStreetMap attribution: (c) OpenStreetMap contributors. Data available under the Open Database License (ODbL) v1.0.

MTA data: Provided by the Metropolitan Transportation Authority via New York State Open Data portal.

NYC DOT data: Provided by the NYC Department of Transportation via NYC Open Data.

Coordinate Reference System

All coordinates are in WGS84 (EPSG:4326) — standard latitude/longitude.

Column descriptions

manhattan_starbucks_osm.csv (171 rows x 17 columns)

Column | Type | Description
osm_id | int | OpenStreetMap element ID
osm_element | str | OSM element type (node/way)
name | str | Store name
brand | str | Brand name (Starbucks)
addr_street | str | Street name (18% missing)
addr_housenumber | str | House number (18% missing)
addr_postcode | str | ZIP code (18% missing)
addr_city | str | City name
phone | str | Phone number (24% missing)
opening_hours | str | Opening hours in OSM format (46% missing)
wheelchair | str | Wheelchair accessibility
outdoor_seating | str | Outdoor seating available
indoor_seating | str | Indoor seating available
drive_through | str | Drive-through available
takeaway | str | Takeaway available
lat | float | Latitude (WGS84)
lon | float | Longitude (WGS84)

manhattan_cafes_osm.csv (1,335 rows x 11 columns)

Column | Type | Description
osm_id | int | OpenStreetMap element ID
osm_element | str | OSM element type
name | str | Cafe name ('Unknown Cafe' if missing)
brand | str | Brand name (NaN for independents)
amenity | str | OSM amenity tag
cuisine | str | Cuisine type tag
addr_street | str | Street name
addr_housenumber | str | House number
lat | float | Latitude (WGS84)
lon | float | Longitude (WGS84)
brand_category | str | One of: starbucks, dunkin, branded, independent
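
The description does not say how brand_category is derived from brand. The sketch below shows one plausible rule set, under the assumption that a missing brand means an independent cafe (consistent with the "NaN for independents" note) and that simple substring matching identifies the two named chains. The function name classify_brand and the matching rules are hypothetical, not the dataset author's code.

```python
def classify_brand(brand):
    """Return a brand_category label for a raw OSM brand value.

    Hypothetical rule set: the dataset does not publish its matching logic.
    """
    # pandas reads a missing brand as float NaN; the csv module yields "".
    # Both cases are treated as an independent cafe.
    if not isinstance(brand, str) or not brand.strip():
        return "independent"
    key = brand.strip().lower()
    if "starbucks" in key:
        return "starbucks"
    if "dunkin" in key:
        return "dunkin"
    # Any other non-empty brand tag is assumed to be a branded chain.
    return "branded"


if __name__ == "__main__":
    for raw in ["Starbucks", "Dunkin'", "Blue Bottle Coffee", ""]:
        print(repr(raw), "->", classify_brand(raw))
```

Comparing the output of a rule like this against the published brand_category column is a quick way to find out how closely the assumed rules match the real ones.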

manhattan_mta_ridership_summary.csv (123 rows x 8 columns)

Column | Type | Description
station_complex_id | str | MTA station complex identifier
station_name | str | Station name with subway lines
lat | float | Latitude (WGS84)
lon | float | Longitude (WGS84)
avg_daily_ridership | int | Average daily ridership (all fare types combined)
total_ridership_q4_2024 | int | Total ridership Oct-Dec 2024
data_period | str | Data collection period
data_days | int | Number of days in the period (92)
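
The three ridership columns appear to be related by simple arithmetic: October through December has 31 + 30 + 31 = 92 days, so the average should be roughly the quarterly total divided by data_days. The sketch below is a sanity check built on that assumption. The description does not state whether the published average is rounded or truncated, so the check allows a tolerance of one rider. The function names and the sample row are illustrative only.

```python
def expected_daily_average(total_ridership, data_days):
    """Quarterly total divided by the number of days in the period."""
    if data_days <= 0:
        raise ValueError("data_days must be positive")
    return total_ridership / data_days


def is_consistent(row, tolerance=1.0):
    """True if avg_daily_ridership matches total / days within the tolerance."""
    expected = expected_daily_average(
        float(row["total_ridership_q4_2024"]), int(row["data_days"])
    )
    return abs(float(row["avg_daily_ridership"]) - expected) <= tolerance


# Illustrative values only, not taken from the dataset.
sample = {
    "avg_daily_ridership": "100000",
    "total_ridership_q4_2024": "9200000",
    "data_days": "92",
}

if __name__ == "__main__":
    print(is_consistent(sample))
```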

stores_enriched_v4.csv (171 rows x 63 columns)

Column | Type | Description
osm_id – takeaway, lat, lon | various | Same as manhattan_starbucks_osm.csv (17 columns)
pluto_bbl | float | NYC Borough-Block-Lot identifier
pluto_landuse | float | PLUTO land use code
pluto_bldgclass | str | Building class code
pluto_numfloors | float | Number of floors (9 missing)
pluto_yearbuilt | float | Year built
pluto_unitstotal | float | Total residential units
pluto_retailarea | float | Retail floor area in sq ft (8 missing)
pluto_assesstot | float | Total assessed value
pluto_comarea | float | Commercial floor area in sq ft (8 missing)
pluto_lotarea | float | Lot area in sq ft
pluto_zonedist1 | str | Primary zoning district
pluto_dist_m | float | Distance to matched PLUTO lot (meters)
mta_station_id | str | Nearest MTA station complex ID
mta_station_name | str | Nearest station name
mta_dist_m | float | Distance to nearest station (meters)
mta_avg_daily_ridership | int | Nearest station's average daily ridership
n_starbucks_250m | int | Starbucks count within 250m
n_dunkin_250m | int | Dunkin' count within 250m
n_other_cafe_250m | int | Other cafe count within 250m
n_starbucks_500m | int | Starbucks count within 500m
n_dunkin_500m | int | Dunkin' count within 500m
n_other_cafe_500m | int | Other cafe count within 500m
n_starbucks_1000m | int | Starbucks count within 1km
n_dunkin_1000m | int | Dunkin' count within 1km
n_other_cafe_1000m | int | Other cafe count within 1km
nearest_competitor_dist_m | float | Distance to nearest non-Starbucks cafe (meters)
nearest_starbucks_dist_m | float | Distance to nearest other Starbucks (meters)
census_tract_id | int | Census Tract GEOID
tract_population | int | Tract total population (ACS 2022)
tract_median_income | float | Tract median household income in dollars (3 missing)
tract_pct_walk_commute | float | Percent of workers who walk to work
tract_pct_bachelors_plus | float | Percent of adults with bachelor's degree or higher
station_cluster | int | MTA station ridership-pattern cluster (0-3)
station_cluster_name | str | Cluster label: Morning Peak / Balanced / Midday-Heavy / Evening Peak
ped_count_nearest | float | Pedestrian count at the nearest NYC DOT counter location
ped_dist_m | float | Distance to the nearest pedestrian count location (meters)
tract_starbucks_count | int | Number of Starbucks in the same Census Tract
tract_total_cafes | int | Total number of cafes in the same Census Tract
tract_competitor_cafes | int | Number of non-Starbucks cafes in the same Census Tract
tract_mta_ridership | float | Total MTA ridership aggregated at the Census Tract level
tract_avg_ped_count | float | Average pedestrian count at tract level
tract_pop_density | float | Population density of the Census Tract (people per sq km)
demand_proxy_index | float | Composite demand proxy index (ridership + pedestrian + population)
supply_index | float | Supply concentration index (cafe density relative to demand)
location_fitness_score | float | Location Fitness Score: demand-supply balance metric (higher = more underserved demand)
tract_lisa_cluster | str | LISA spatial autocorrelation cluster for the tract (High-High, Low-Low, etc.)
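
Because all coordinates are in degrees (WGS84), the metre-based columns such as n_other_cafe_250m and nearest_competitor_dist_m need a distance computation on the sphere or in a projected CRS. The sketch below shows one way to reproduce columns of this kind with the haversine formula. It assumes straight-line (great-circle) distance; the description does not say whether the published values use straight-line or walking-network distance, nor whether a store counts itself in n_starbucks_*. All function names and sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def count_within(store, others, radius_m):
    """Number of (lat, lon) points in `others` within radius_m of `store`."""
    return sum(
        1 for lat, lon in others
        if haversine_m(store[0], store[1], lat, lon) <= radius_m
    )


def nearest_distance_m(store, others):
    """Distance in metres from `store` to the closest point in `others`."""
    return min(haversine_m(store[0], store[1], lat, lon) for lat, lon in others)


# Illustrative coordinates only: three cafes due north of one store.
store = (40.7580, -73.9855)
cafes = [(40.7590, -73.9855), (40.7620, -73.9855), (40.7680, -73.9855)]

if __name__ == "__main__":
    for radius in (250, 500, 1000):
        print(radius, count_within(store, cafes, radius))
    print(round(nearest_distance_m(store, cafes), 1))
```

For a few thousand points a brute-force loop like this is fast enough; a spatial index only becomes worthwhile on much larger point sets.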

tract_demand_supply.csv (309 rows x 24 columns)

Column | Type | Description
GEOID | str | Census Tract FIPS code
ALAND | int | Land area in square meters
AWATER | int | Water area in square meters
NAMELSAD | str | Tract name
starbucks_count | int | Number of Starbucks in this tract
lisa_cluster | str | LISA classification: High-High, Low-Low, High-Low, Low-High, or Not Significant
tract_starbucks_count | int | Starbucks count (same as starbucks_count)
tract_total_cafes | int | Total number of cafes in the tract
tract_competitor_cafes | int | Number of non-Starbucks cafes
tract_mta_ridership | float | Total MTA daily ridership attributed to the tract
tract_avg_ped_count | float | Average pedestrian count in the tract
tract_population | int | Tract total population (ACS 2022)
tract_median_income | float | Tract median household income
tract_pct_walk_commute | float | Percent of workers who walk to work
tract_pct_bachelors_plus | float | Percent with bachelor's degree or higher
tract_pop_density | float | Population density (people per sq km)
tract_avg_ped_count_filled | float | Pedestrian count with missing values filled
dpi_ridership | float | Demand proxy sub-index: MTA ridership component
dpi_pedestrian | float | Demand proxy sub-index: pedestrian flow component
dpi_population | float | Demand proxy sub-index: population component
demand_proxy_index | float | Composite Demand Proxy Index (0-1 scale)
supply_index | float | Supply concentration index (cafe count relative to area)
supply_normalized | float | Normalized supply index (0-1 scale)
location_fitness_score | float | Location Fitness Score: demand minus supply (higher = underserved)
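
The columns above suggest a pipeline of the form: normalize three demand components, combine them into demand_proxy_index, normalize supply, and subtract. The sketch below illustrates that shape only. It assumes min-max scaling to the 0-1 range and equal weights for the three sub-indices; neither the scaling method nor the weights are stated in this description, and all function names are hypothetical. It also assumes tract_pop_density is population divided by ALAND converted to square kilometres.

```python
def min_max(values):
    """Scale values to the 0-1 range; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pop_density_per_km2(population, aland_m2):
    """People per square kilometre, given land area in square metres."""
    return population / (aland_m2 / 1_000_000)


def location_fitness(ridership, pedestrians, population, supply,
                     weights=(1 / 3, 1 / 3, 1 / 3)):
    """Demand proxy minus normalized supply, one score per tract.

    Equal weights are an assumption made for illustration.
    """
    w_r, w_p, w_q = weights
    dpi_r, dpi_p, dpi_q = min_max(ridership), min_max(pedestrians), min_max(population)
    supply_norm = min_max(supply)
    scores = []
    for r, p, q, s in zip(dpi_r, dpi_p, dpi_q, supply_norm):
        demand = w_r * r + w_p * p + w_q * q  # composite demand proxy on a 0-1 scale
        scores.append(demand - s)             # higher means more underserved
    return scores


if __name__ == "__main__":
    # Three made-up tracts: low, medium, and high demand.
    print(location_fitness([0, 50, 100], [0, 50, 100], [0, 50, 100], [10, 0, 10]))
```

Under this construction the score ranges from -1 (no demand, maximum supply) to 1 (maximum demand, no supply), which matches the "higher = underserved" reading in the table.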

manhattan_pedestrian_counts.csv (36 rows x 113 columns)

Column | Type | Description
the_geom | str | Point geometry (WKT)
OBJECTID | int | Unique object identifier
Loc | int | Location identifier
Borough | str | Borough name (Manhattan)
Street_Nam | str | Street name of the count location
From_Stree | str | Cross street (from)
To_Street | str | Cross street (to)
Iex | str | Index/intersection identifier
{Period}_AM | int | AM peak pedestrian count for the survey period (e.g., May07_AM, Sept07_AM, ... May25_AM)
{Period}_PM | int | PM peak pedestrian count for the survey period
{Period}_MD | int | Midday pedestrian count for the survey period

Surveys run twice a year (May and September/October) from May 2007 to May 2025. Each survey period contributes three columns, one per time slot (AM, PM, Midday).
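
With one column per period and time slot, temporal analysis is usually easier after reshaping each row into long format. The sketch below does that for a single row as read by csv.DictReader. The column pattern (month abbreviation, two-digit year, underscore, slot) is inferred from the example names above, so real column names may need a looser pattern. The function name melt_counts and the sample values are hypothetical.

```python
import re

# Matches names such as May07_AM or Sept07_MD (pattern inferred from the examples).
PERIOD_COL = re.compile(r"^(?P<period>[A-Za-z]+\d{2})_(?P<slot>AM|PM|MD)$")


def melt_counts(row):
    """Turn one wide row into a list of {Loc, period, slot, count} records."""
    records = []
    for col, value in row.items():
        match = PERIOD_COL.match(col)
        if match is None or value in ("", None):
            continue  # skip identifier columns and empty cells
        records.append({
            "Loc": row["Loc"],
            "period": match["period"],
            "slot": match["slot"],
            "count": int(value),
        })
    return records


# Illustrative row only, not taken from the dataset.
sample_row = {
    "Loc": "1",
    "Street_Nam": "Broadway",
    "May07_AM": "1200",
    "May07_PM": "2500",
    "May07_MD": "",
    "Sept07_AM": "1300",
}

if __name__ == "__main__":
    for record in melt_counts(sample_row):
        print(record)
```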

manhattan_tracts_lisa.geojson (289 tracts)

Census Tract polygons with LISA (Local Indicators of Spatial Association) results:

Property | Type | Description
GEOID | str | Census Tract FIPS code
NAMELSAD | str | Tract name
geometry | Polygon | Tract boundary (WGS84)
starbucks_count | int | Number of Starbucks in this tract
lisa_cluster | str | LISA classification: High-High, Low-Low, High-Low, Low-High, or Not Significant
lisa_q | int | LISA quadrant (1-4)
lisa_p | float | LISA p-value
lisa_I | float | Local Moran's I statistic
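
For readers unfamiliar with how lisa_I and lisa_q relate, the sketch below computes a local Moran's I and a quadrant for each observation from a neighbour list with row-standardized weights. It is a teaching sketch, not the method used to build this file: scaling conventions differ between implementations, the quadrant numbering (1 = High-High, 2 = Low-High, 3 = Low-Low, 4 = High-Low) is assumed to follow the PySAL convention, and the permutation test behind lisa_p is omitted. The function name and sample data are hypothetical.

```python
def local_morans_i(values, neighbors):
    """Return a list of (I_i, quadrant) pairs.

    values: one number per area (e.g. starbucks_count).
    neighbors: neighbors[i] is the list of indices adjacent to area i.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]          # deviations from the mean
    m2 = sum(d * d for d in z) / n          # second moment
    if m2 == 0:
        raise ValueError("values are constant; local Moran's I is undefined")
    results = []
    for i, nbrs in enumerate(neighbors):
        # Row-standardized spatial lag: mean deviation of the neighbours.
        lag = sum(z[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
        local_i = z[i] * lag / m2
        if z[i] > 0 and lag > 0:
            quadrant = 1   # High-High
        elif z[i] <= 0 and lag > 0:
            quadrant = 2   # Low-High
        elif z[i] <= 0 and lag <= 0:
            quadrant = 3   # Low-Low
        else:
            quadrant = 4   # High-Low
        results.append((local_i, quadrant))
    return results


# Four made-up areas in a row: two busy ones next to two empty ones.
counts = [5, 4, 0, 0]
chain = [[1], [0, 2], [1, 3], [2]]

if __name__ == "__main__":
    for local_i, quadrant in local_morans_i(counts, chain):
        print(round(local_i, 4), quadrant)
```

A quadrant only becomes a cluster label such as High-High when the accompanying p-value is below the chosen significance level; otherwise the tract is labelled Not Significant.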

mta_station_clusters.csv (123 rows x 3 columns)

Column | Type | Description
station_complex_id | str | MTA station complex identifier (join key)
cluster | int | Cluster assignment (0-3)
cluster_name | str | Morning Peak (Residential) / Balanced (Transit Hub) / Midday-Heavy (Tourism) / Evening Peak (Office)
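
Since station_complex_id is the join key, cluster labels can be attached to manhattan_mta_ridership_summary.csv with a plain dictionary lookup. The sketch below uses in-memory CSV text in place of the real files. The station IDs, names, ridership figures, and the pairing of cluster number 1 with "Balanced (Transit Hub)" are made up for illustration. Keeping the key as a string on both sides avoids silent mismatches when one file is parsed as integers.

```python
import csv
import io

# Stand-ins for the two files; values are illustrative, not from the dataset.
ridership_csv = io.StringIO(
    "station_complex_id,station_name,avg_daily_ridership\n"
    "A1,Example Station (1/2/3),100000\n"
    "B2,Another Station (A/C),40000\n"
)
clusters_csv = io.StringIO(
    "station_complex_id,cluster,cluster_name\n"
    "A1,1,Balanced (Transit Hub)\n"
)


def attach_clusters(stations, clusters):
    """Left-join cluster labels onto station rows by station_complex_id."""
    by_id = {row["station_complex_id"]: row for row in clusters}
    joined = []
    for station in stations:
        match = by_id.get(station["station_complex_id"])
        joined.append({
            **station,
            "cluster": int(match["cluster"]) if match else None,
            "cluster_name": match["cluster_name"] if match else None,
        })
    return joined


joined = attach_clusters(csv.DictReader(ridership_csv), csv.DictReader(clusters_csv))

if __name__ == "__main__":
    for row in joined:
        print(row["station_name"], "->", row["cluster_name"])
```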

Missing data policy

  • lat/lon: Never missing (required fields)
  • name: Never missing (filled with 'Unknown Cafe' if absent in OSM)
  • brand: NaN for independent cafes (by design, not an error)
  • Address fields: NaN where OSM contributors haven't added the information
  • Ridership: All 123 stations have ridership > 0
  • PLUTO fields: numfloors (9 missing), retailarea/comarea (8 missing) — buildings where PLUTO lacks data
  • tract_median_income: 3 missing — Census tracts with suppressed income data
  • ped_count_nearest: Available for all 171 stores (value assigned from the nearest NYC DOT counter; ped_dist_m gives the distance to that counter)
  • tract_avg_ped_count: Some tracts have no nearby pedestrian counter; filled via tract_avg_ped_count_filled in tract_demand_supply.csv

Related notebooks

Notebook | Theme
Manhattan Cafe Wars | Theme 0: EDA & competitor mapping
Starbucks 10-K NLP | Theme 1: keyword trends, LDA topics, NLP × store count
Starbucks Spatial Clustering | Theme 2A: Moran's I, LISA, Ripley's K
Starbucks Location Fitness | Theme 2B: demand-supply scoring & backtest
Starbucks Data Pipeline | Pipeline: EDGAR & OSM to CSV, data quality report

Related dataset: Starbucks 30-Year 10-K NLP Corpus — NLP data for Theme 1

Updates

  • v1.0 (2026-03-12): Initial release — 3 CSV files with Q4 2024 ridership
  • v2.0 (2026-03-13): Added stores_enriched_v3.csv (PLUTO + MTA + competitors + Census joined), manhattan_tracts_lisa.geojson (LISA clusters), mta_station_clusters.csv (ridership pattern clusters)
  • v3.0 (2026-03-13): Replaced stores_enriched_v3 with stores_enriched_v4 (51 -> 63 columns: tract-level aggregates, demand/supply indices, Location Fitness Score, pedestrian counts). Added tract_demand_supply.csv (309 tracts with full LFS scoring). Added manhattan_pedestrian_counts.csv (NYC DOT bi-annual pedestrian counts at 36 Manhattan locations, 2007-2025).
