171 Starbucks enriched with PLUTO, MTA, Census, LFS & pedestrian counts

What's inside

Eight analysis-ready files covering Manhattan's cafe landscape, transit infrastructure, building attributes, demographics, and location fitness scoring:

Core files (v1)

manhattan_starbucks_osm.csv — 171 Starbucks locations from OpenStreetMap with addresses and amenity attributes
manhattan_cafes_osm.csv — 1,335 cafes (Starbucks, Dunkin', branded chains, and independents) with brand classification
manhattan_mta_ridership_summary.csv — 123 subway station complexes with average daily ridership (Q4 2024)

Added in v2

manhattan_tracts_lisa.geojson — Census Tract polygons with LISA spatial autocorrelation cluster labels (High-High, Low-Low, etc.)
mta_station_clusters.csv — 123 stations classified into 4 ridership-pattern clusters (Morning Peak, Balanced, Midday-Heavy, Evening Peak)

New in v3

stores_enriched_v4.csv — 171 Starbucks x 63 columns: everything from v3's 51 columns + 12 new columns including tract-level cafe counts, MTA ridership, pedestrian counts, population density, demand/supply indices, Location Fitness Score, LISA cluster, and nearest pedestrian counter (replaces stores_enriched_v3.csv)
tract_demand_supply.csv — 309 Census Tracts x 24 columns: comprehensive demand-supply scoring for all Manhattan tracts, including ridership-based and pedestrian-based demand proxy indices, supply index, and Location Fitness Scores (LFS)
manhattan_pedestrian_counts.csv — NYC DOT Bi-Annual Pedestrian Counts at 36 Manhattan locations (2007-2025), with AM/PM/Midday counts per survey period

Use cases

Spatial competition analysis (Starbucks vs. competitors)
Spatial autocorrelation (Moran's I, LISA hotspot detection)
Point pattern analysis (Ripley's K / Besag's L function)
Transit-oriented location scoring
Walk-distance network analysis with OSMnx
Demand proxy modeling using subway ridership and pedestrian flow
Location Fitness Score (LFS) analysis and site selection
Demand-supply gap identification for new store candidates
Urban retail geography research
Pedestrian flow temporal analysis (seasonal and multi-year trends)

Data sources & licenses

Data	Source	License
Starbucks & Cafes	OpenStreetMap via Overpass API	ODbL 1.0
MTA Ridership	data.ny.gov (MTA Subway Hourly Ridership)	OPEN NY Terms
Building attributes	MapPLUTO (NYC Dept. of City Planning)	NYC Open Data Terms
Demographics	US Census ACS 2022 5-Year Estimates	Public domain
Tract boundaries	TIGER/Line Shapefiles (Census Bureau)	Public domain
Pedestrian Counts	NYC DOT Bi-Annual Pedestrian Counts (NYC Open Data)	NYC Open Data Terms

OpenStreetMap attribution: (c) OpenStreetMap contributors. Data available under the Open Database License (ODbL) v1.0.

MTA data: Provided by the Metropolitan Transportation Authority via New York State Open Data portal.

NYC DOT data: Provided by the NYC Department of Transportation via NYC Open Data.

Coordinate Reference System

All coordinates are in WGS84 (EPSG:4326) — standard latitude/longitude.
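
Because coordinates are plain latitude/longitude, distances in meters (such as mta_dist_m or ped_dist_m) have to be computed on the sphere or after projecting to a metric CRS. A minimal sketch using the haversine formula follows. The function name, the sample coordinates, and the mean Earth radius are illustrative assumptions, and the dataset's own distance columns may have been computed differently.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; an assumption, not taken from the dataset

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Hypothetical pair of midtown Manhattan points (illustrative only)
d = haversine_m(40.7580, -73.9855, 40.7527, -73.9772)
```

For distances of a few hundred meters within Manhattan, the result should differ only slightly from a distance measured in a projected CRS.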

Column descriptions

manhattan_starbucks_osm.csv (171 rows x 17 columns)

Column	Type	Description
osm_id	int	OpenStreetMap element ID
osm_element	str	OSM element type (node/way)
name	str	Store name
brand	str	Brand name (Starbucks)
addr_street	str	Street name (18% missing)
addr_housenumber	str	House number (18% missing)
addr_postcode	str	ZIP code (18% missing)
addr_city	str	City name
phone	str	Phone number (24% missing)
opening_hours	str	Opening hours in OSM format (46% missing)
wheelchair	str	Wheelchair accessibility
outdoor_seating	str	Outdoor seating available
indoor_seating	str	Indoor seating available
drive_through	str	Drive-through available
takeaway	str	Takeaway available
lat	float	Latitude (WGS84)
lon	float	Longitude (WGS84)

manhattan_cafes_osm.csv (1,335 rows x 11 columns)

Column	Type	Description
osm_id	int	OpenStreetMap element ID
osm_element	str	OSM element type
name	str	Cafe name ('Unknown Cafe' if missing)
brand	str	Brand name (NaN for independents)
amenity	str	OSM amenity tag
cuisine	str	Cuisine type tag
addr_street	str	Street name
addr_housenumber	str	House number
lat	float	Latitude (WGS84)
lon	float	Longitude (WGS84)
brand_category	str	One of: starbucks, dunkin, branded, independent

manhattan_mta_ridership_summary.csv (123 rows x 8 columns)

Column	Type	Description
station_complex_id	str	MTA station complex identifier
station_name	str	Station name with subway lines
lat	float	Latitude (WGS84)
lon	float	Longitude (WGS84)
avg_daily_ridership	int	Average daily ridership (all fare types combined)
total_ridership_q4_2024	int	Total ridership Oct-Dec 2024
data_period	str	Data collection period
data_days	int	Number of days in the period (92)
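
The ridership columns are related by simple arithmetic: the Q4 total divided by the number of days (92 for October through December) gives the daily average. The check below works under that assumption. The station total is made up and the rounding rule is a guess, not something the description states.

```python
import calendar

# Days in October, November, and December 2024
data_days = sum(calendar.monthrange(2024, m)[1] for m in (10, 11, 12))

def avg_daily(total_ridership, days):
    """Average daily ridership, rounded to the nearest integer (rounding rule assumed)."""
    return round(total_ridership / days)

total_q4 = 4_600_000  # hypothetical total_ridership_q4_2024 value
avg = avg_daily(total_q4, data_days)
```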

stores_enriched_v4.csv (171 rows x 63 columns)

Column	Type	Description
osm_id – lon	various	Same as manhattan_starbucks_osm.csv (all 17 columns, including lat and lon)
pluto_bbl	float	NYC Borough-Block-Lot identifier
pluto_landuse	float	PLUTO land use code
pluto_bldgclass	str	Building class code
pluto_numfloors	float	Number of floors (9 missing)
pluto_yearbuilt	float	Year built
pluto_unitstotal	float	Total residential units
pluto_retailarea	float	Retail floor area in sq ft (8 missing)
pluto_assesstot	float	Total assessed value
pluto_comarea	float	Commercial floor area in sq ft (8 missing)
pluto_lotarea	float	Lot area in sq ft
pluto_zonedist1	str	Primary zoning district
pluto_dist_m	float	Distance to matched PLUTO lot (meters)
mta_station_id	str	Nearest MTA station complex ID
mta_station_name	str	Nearest station name
mta_dist_m	float	Distance to nearest station (meters)
mta_avg_daily_ridership	int	Nearest station's average daily ridership
n_starbucks_250m	int	Starbucks count within 250m
n_dunkin_250m	int	Dunkin' count within 250m
n_other_cafe_250m	int	Other cafe count within 250m
n_starbucks_500m	int	Starbucks count within 500m
n_dunkin_500m	int	Dunkin' count within 500m
n_other_cafe_500m	int	Other cafe count within 500m
n_starbucks_1000m	int	Starbucks count within 1km
n_dunkin_1000m	int	Dunkin' count within 1km
n_other_cafe_1000m	int	Other cafe count within 1km
nearest_competitor_dist_m	float	Distance to nearest non-Starbucks cafe (meters)
nearest_starbucks_dist_m	float	Distance to nearest other Starbucks (meters)
census_tract_id	int	Census Tract GEOID
tract_population	int	Tract total population (ACS 2022)
tract_median_income	float	Tract median household income in dollars (3 missing)
tract_pct_walk_commute	float	Percent of workers who walk to work
tract_pct_bachelors_plus	float	Percent of adults with bachelor's degree or higher
station_cluster	int	MTA station ridership-pattern cluster (0-3)
station_cluster_name	str	Cluster label: Morning Peak / Balanced / Midday-Heavy / Evening Peak
ped_count_nearest	float	Pedestrian count at the nearest NYC DOT counter location
ped_dist_m	float	Distance to the nearest pedestrian count location (meters)
tract_starbucks_count	int	Number of Starbucks in the same Census Tract
tract_total_cafes	int	Total number of cafes in the same Census Tract
tract_competitor_cafes	int	Number of non-Starbucks cafes in the same Census Tract
tract_mta_ridership	float	Total MTA ridership aggregated at the Census Tract level
tract_avg_ped_count	float	Average pedestrian count at tract level
tract_pop_density	float	Population density of the Census Tract (people per sq km)
demand_proxy_index	float	Composite demand proxy index (ridership + pedestrian + population)
supply_index	float	Supply concentration index (cafe density relative to demand)
location_fitness_score	float	Location Fitness Score: demand-supply balance metric (higher = more underserved demand)
tract_lisa_cluster	str	LISA spatial autocorrelation cluster for the tract (High-High, Low-Low, etc.)
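
census_tract_id is stored as an integer here, while GEOID in tract_demand_supply.csv is a string, so joining the two files needs the key types aligned first. The sketch below assumes census_tract_id holds the full 11-digit tract GEOID. The helper name and the sample rows are hypothetical.

```python
def to_geoid(census_tract_id):
    """Convert an integer tract ID to an 11-character GEOID string."""
    return str(int(census_tract_id)).zfill(11)

# Hypothetical rows standing in for the two CSV files
stores = [{"osm_id": 1, "census_tract_id": 36061011900}]
tracts = {"36061011900": {"dpi_ridership": 0.4}}

# Attach tract-level attributes to each store via the normalized key
joined = [
    {**s, **tracts.get(to_geoid(s["census_tract_id"]), {})}
    for s in stores
]
```

The zero-padding is a no-op for Manhattan tracts (state FIPS 36) but keeps the helper usable for states whose FIPS code starts with 0.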

tract_demand_supply.csv (309 rows x 24 columns)

Column	Type	Description
GEOID	str	Census Tract FIPS code
ALAND	int	Land area in square meters
AWATER	int	Water area in square meters
NAMELSAD	str	Tract name
starbucks_count	int	Number of Starbucks in this tract
lisa_cluster	str	LISA classification: High-High, Low-Low, High-Low, Low-High, or Not Significant
tract_starbucks_count	int	Starbucks count (same as starbucks_count)
tract_total_cafes	int	Total number of cafes in the tract
tract_competitor_cafes	int	Number of non-Starbucks cafes
tract_mta_ridership	float	Total MTA daily ridership attributed to the tract
tract_avg_ped_count	float	Average pedestrian count in the tract
tract_population	int	Tract total population (ACS 2022)
tract_median_income	float	Tract median household income
tract_pct_walk_commute	float	Percent of workers who walk to work
tract_pct_bachelors_plus	float	Percent with bachelor's degree or higher
tract_pop_density	float	Population density (people per sq km)
tract_avg_ped_count_filled	float	Pedestrian count with missing values filled
dpi_ridership	float	Demand proxy sub-index: MTA ridership component
dpi_pedestrian	float	Demand proxy sub-index: pedestrian flow component
dpi_population	float	Demand proxy sub-index: population component
demand_proxy_index	float	Composite Demand Proxy Index (0-1 scale)
supply_index	float	Supply concentration index (cafe count relative to area)
supply_normalized	float	Normalized supply index (0-1 scale)
location_fitness_score	float	Location Fitness Score: demand minus supply (higher = underserved)
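
The last three columns fit together as a normalize-then-subtract step: supply_index is rescaled to 0-1 and subtracted from demand_proxy_index. The sketch below assumes min-max scaling for supply_normalized, which the description does not confirm, and the tract values are invented.

```python
def min_max(values):
    """Rescale a list of numbers to the 0-1 range (constant input maps to 0.0)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical tracts: demand_proxy_index (already 0-1) and raw supply_index
demand = [0.80, 0.30, 0.55]
supply = [2.0, 10.0, 6.0]

supply_normalized = min_max(supply)

# Location Fitness Score: demand minus normalized supply (higher = underserved)
lfs = [d - s for d, s in zip(demand, supply_normalized)]
```

Under these assumptions the first tract, with high demand and the lowest supply, gets the highest score.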

manhattan_pedestrian_counts.csv (36 rows x 113 columns)

Column	Type	Description
the_geom	str	Point geometry (WKT)
OBJECTID	int	Unique object identifier
Loc	int	Location identifier
Borough	str	Borough name (Manhattan)
Street_Nam	str	Street name of the count location
From_Stree	str	Cross street (from)
To_Street	str	Cross street (to)
Iex	str	Index/intersection identifier
{Period}_AM	int	AM peak pedestrian count for the survey period (e.g., May07_AM, Sept07_AM, ... May25_AM)
{Period}_PM	int	PM peak pedestrian count for the survey period
{Period}_MD	int	Midday pedestrian count for the survey period

Surveys run twice a year (May and September/October) from May 2007 to May 2025, each with three time slots (AM, PM, Midday).
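
Column names encode the survey period and time slot, so reshaping the wide table usually starts with parsing them. The sketch below assumes the pattern is a month abbreviation, a two-digit year, an underscore, and AM, PM, or MD, as in May07_AM and Sept07_AM. Other spellings in the actual file would need the pattern adjusted.

```python
import re

# Month abbreviation + two-digit year + slot, e.g. "May07_AM" (pattern is an assumption)
COLUMN_RE = re.compile(r"^([A-Za-z]+)(\d{2})_(AM|PM|MD)$")

def parse_count_column(name):
    """Return (month, year, slot) for a count column, or None for other columns."""
    m = COLUMN_RE.match(name)
    if m is None:
        return None
    month, yy, slot = m.groups()
    return month, 2000 + int(yy), slot

columns = ["Loc", "Street_Nam", "May07_AM", "Sept07_PM", "May25_MD"]
parsed = {c: parse_count_column(c) for c in columns}
```

Columns that return None (identifiers, street names) can be kept as ID variables when melting the table into long format.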

manhattan_tracts_lisa.geojson (289 tracts)

Census Tract polygons with LISA (Local Indicators of Spatial Association) results:

Property	Type	Description
GEOID	str	Census Tract FIPS code
NAMELSAD	str	Tract name
geometry	Polygon	Tract boundary (WGS84)
starbucks_count	int	Number of Starbucks in this tract
lisa_cluster	str	LISA classification: High-High, Low-Low, High-Low, Low-High, or Not Significant
lisa_q	int	LISA quadrant (1-4)
lisa_p	float	LISA p-value
lisa_I	float	Local Moran's I statistic
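
lisa_q and lisa_p together determine lisa_cluster. The sketch below assumes the PySAL quadrant numbering (1 = High-High, 2 = Low-High, 3 = Low-Low, 4 = High-Low) and a 0.05 significance cutoff. Neither is stated in this description, so both should be checked against the file before relying on them.

```python
# Quadrant numbering follows the PySAL convention (assumed for this dataset)
QUADRANT_LABELS = {1: "High-High", 2: "Low-High", 3: "Low-Low", 4: "High-Low"}

def lisa_label(q, p, alpha=0.05):
    """Map a LISA quadrant and p-value to a cluster label."""
    if p >= alpha:
        return "Not Significant"
    return QUADRANT_LABELS[q]

# Hypothetical (lisa_q, lisa_p) pairs
labels = [lisa_label(1, 0.01), lisa_label(3, 0.20), lisa_label(4, 0.03)]
```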

mta_station_clusters.csv (123 rows x 3 columns)

Column	Type	Description
station_complex_id	str	MTA station complex identifier (join key)
cluster	int	Cluster assignment (0-3)
cluster_name	str	Morning Peak (Residential) / Balanced (Transit Hub) / Midday-Heavy (Tourism) / Evening Peak (Office)

Missing data policy

lat/lon: Never missing (required fields)
name: Never missing (filled with 'Unknown Cafe' if absent in OSM)
brand: NaN for independent cafes (by design, not an error)
Address fields: NaN where OSM contributors haven't added the information
Ridership: All 123 stations have ridership > 0
PLUTO fields: numfloors (9 missing), retailarea/comarea (8 missing) — buildings where PLUTO lacks data
tract_median_income: 3 missing — Census tracts with suppressed income data
ped_count_nearest: Available for all 171 stores (interpolated from nearest DOT counter)
tract_avg_ped_count: Some tracts have no nearby pedestrian counter; filled via tract_avg_ped_count_filled in tract_demand_supply.csv
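
The missing-value figures quoted in this description can be re-derived per column. A minimal sketch with the standard csv module follows. It treats empty strings and the literal NaN as missing, which is an assumption about how the CSVs encode gaps, and the inline sample stands in for a real file.

```python
import csv
import io

MISSING = {"", "NaN", "nan"}  # assumed encodings of a missing value

def missing_rates(rows):
    """Return {column: fraction of rows whose value is missing}."""
    rows = list(rows)
    if not rows:
        return {}
    return {
        col: sum(1 for r in rows if (r[col] or "").strip() in MISSING) / len(rows)
        for col in rows[0]
    }

# Hypothetical four-row sample standing in for manhattan_starbucks_osm.csv
sample = io.StringIO(
    "osm_id,addr_street,phone\n"
    "1,Broadway,\n"
    "2,,212-555-0100\n"
    "3,5th Avenue,NaN\n"
    "4,Park Avenue,212-555-0101\n"
)
rates = missing_rates(csv.DictReader(sample))
```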

Related notebooks (15-notebook series)

#	Notebook	Theme
0	Manhattan Cafe Wars	EDA & competitor mapping
1	Starbucks 10-K NLP	Keyword trends, LDA topics
1F	LDA Topic Explorer	Interactive pyLDAvis
2A	Spatial Clustering	Moran's I, LISA, Ripley's K
2B	Location Fitness	Demand-supply scoring & backtest
2C	Walk-Distance Analysis	OSMnx network analysis
5A	LFS Predictive Model	XGBoost/Random Forest model
5B	Strategic Location Insights	So What: actionable recommendations
5C	Optimal Store Placement	Greedy maximal coverage algorithm
5D	Hourly Demand Patterns	Time-of-day station demand profiles
5E	Interactive Strategy Map	5-layer Folium visualization
T	Generalization Template	LFS framework for any city
C	Data Pipeline	EDGAR & OSM to CSV
D	US Expansion Choropleth	30-year animated map
G	NLP × Spatial Integration	Cross-theme synthesis

Related dataset: Starbucks 30-Year 10-K NLP Corpus — NLP data for Theme 1

Updates

v1.0 (2026-03-12): Initial release — 3 CSV files with Q4 2024 ridership
v2.0 (2026-03-13): Added stores_enriched_v3.csv (PLUTO + MTA + competitors + Census joined), manhattan_tracts_lisa.geojson (LISA clusters), mta_station_clusters.csv (ridership pattern clusters)
v3.0 (2026-03-13): Replaced stores_enriched_v3 with stores_enriched_v4 (51 -> 63 columns: tract-level aggregates, demand/supply indices, Location Fitness Score, pedestrian counts). Added tract_demand_supply.csv (309 tracts with full LFS scoring). Added manhattan_pedestrian_counts.csv (NYC DOT bi-annual pedestrian counts at 36 Manhattan locations, 2007-2025).
