171 Starbucks enriched with PLUTO, MTA, Census, LFS & pedestrian counts
Dataset Description
What's inside
Nine analysis-ready files covering Manhattan's cafe landscape, transit infrastructure, building attributes, demographics, and location fitness scoring:
Core files (v1)
- manhattan_starbucks_osm.csv — 171 Starbucks locations from OpenStreetMap with addresses and amenity attributes
- manhattan_cafes_osm.csv — 1,335 cafes (Starbucks, Dunkin', branded chains, and independents) with brand classification
- manhattan_mta_ridership_summary.csv — 123 subway station complexes with average daily ridership (Q4 2024)
Added in v2
- manhattan_tracts_lisa.geojson — Census Tract polygons with LISA spatial autocorrelation cluster labels (High-High, Low-Low, etc.)
- mta_station_clusters.csv — 123 stations classified into 4 ridership-pattern clusters (Morning Peak, Balanced, Midday-Heavy, Evening Peak)
New in v3
- stores_enriched_v4.csv — 171 Starbucks x 63 columns: the 51 columns of stores_enriched_v3.csv plus 12 new columns, including tract-level cafe counts, MTA ridership, pedestrian counts, population density, demand/supply indices, Location Fitness Score, LISA cluster, and nearest pedestrian counter (replaces stores_enriched_v3.csv)
- tract_demand_supply.csv — 309 Census Tracts x 24 columns: comprehensive demand-supply scoring for all Manhattan tracts, including ridership-based and pedestrian-based demand proxy indices, supply index, and Location Fitness Scores (LFS)
- manhattan_pedestrian_counts.csv — NYC DOT Bi-Annual Pedestrian Counts at 36 Manhattan locations (2007-2025), with AM/PM/Midday counts per survey period
Use cases
- Spatial competition analysis (Starbucks vs. competitors)
- Spatial autocorrelation (Moran's I, LISA hotspot detection)
- Point pattern analysis (Ripley's K / Besag's L function)
- Transit-oriented location scoring
- Walk-distance network analysis with OSMnx
- Demand proxy modeling using subway ridership and pedestrian flow
- Location Fitness Score (LFS) analysis and site selection
- Demand-supply gap identification for new store candidates
- Urban retail geography research
- Pedestrian flow temporal analysis (seasonal and multi-year trends)
Data sources & licenses
| File | Source | License |
|---|---|---|
| Starbucks & Cafes | OpenStreetMap via Overpass API | ODbL 1.0 |
| MTA Ridership | data.ny.gov (MTA Subway Hourly Ridership) | OPEN NY Terms |
| Building attributes | MapPLUTO (NYC Dept. of City Planning) | NYC Open Data Terms |
| Demographics | US Census ACS 2022 5-Year Estimates | Public domain |
| Tract boundaries | TIGER/Line Shapefiles (Census Bureau) | Public domain |
| Pedestrian Counts | NYC DOT Bi-Annual Pedestrian Counts (NYC Open Data) | NYC Open Data Terms |
OpenStreetMap attribution: (c) OpenStreetMap contributors. Data available under the Open Database License (ODbL) v1.0.
MTA data: Provided by the Metropolitan Transportation Authority via New York State Open Data portal.
NYC DOT data: Provided by the NYC Department of Transportation via NYC Open Data.
Coordinate Reference System
All coordinates are in WGS84 (EPSG:4326) — standard latitude/longitude.
Column descriptions
manhattan_starbucks_osm.csv (171 rows x 17 columns)
| Column | Type | Description |
|---|---|---|
| osm_id | int | OpenStreetMap element ID |
| osm_element | str | OSM element type (node/way) |
| name | str | Store name |
| brand | str | Brand name (Starbucks) |
| addr_street | str | Street name (18% missing) |
| addr_housenumber | str | House number (18% missing) |
| addr_postcode | str | ZIP code (18% missing) |
| addr_city | str | City name |
| phone | str | Phone number (24% missing) |
| opening_hours | str | Opening hours in OSM format (46% missing) |
| wheelchair | str | Wheelchair accessibility |
| outdoor_seating | str | Outdoor seating available |
| indoor_seating | str | Indoor seating available |
| drive_through | str | Drive-through available |
| takeaway | str | Takeaway available |
| lat | float | Latitude (WGS84) |
| lon | float | Longitude (WGS84) |
manhattan_cafes_osm.csv (1,335 rows x 11 columns)
| Column | Type | Description |
|---|---|---|
| osm_id | int | OpenStreetMap element ID |
| osm_element | str | OSM element type |
| name | str | Cafe name ('Unknown Cafe' if missing) |
| brand | str | Brand name (NaN for independents) |
| amenity | str | OSM amenity tag |
| cuisine | str | Cuisine type tag |
| addr_street | str | Street name |
| addr_housenumber | str | House number |
| lat | float | Latitude (WGS84) |
| lon | float | Longitude (WGS84) |
| brand_category | str | One of: starbucks, dunkin, branded, independent |
manhattan_mta_ridership_summary.csv (123 rows x 8 columns)
| Column | Type | Description |
|---|---|---|
| station_complex_id | str | MTA station complex identifier |
| station_name | str | Station name with subway lines |
| lat | float | Latitude (WGS84) |
| lon | float | Longitude (WGS84) |
| avg_daily_ridership | int | Average daily ridership (all fare types combined) |
| total_ridership_q4_2024 | int | Total ridership Oct-Dec 2024 |
| data_period | str | Data collection period |
| data_days | int | Number of days in the period (92) |
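
The relationship between the three ridership columns can be sketched as follows. This is a minimal example that assumes avg_daily_ridership is the quarterly total divided by data_days and rounded; the source does not state the rounding rule, and the total used here is invented for illustration.

```python
from datetime import date

# Q4 2024 runs from 1 October to 31 December inclusive.
days_in_q4 = (date(2024, 12, 31) - date(2024, 10, 1)).days + 1

def average_daily(total: int, days: int) -> int:
    """Average riders per day, rounded to the nearest integer (assumed rule)."""
    if days <= 0:
        raise ValueError("days must be positive")
    return round(total / days)

# Hypothetical station total, not a value from the dataset.
example_total = 4_600_000
example_avg = average_daily(example_total, days_in_q4)
print(days_in_q4, example_avg)
```

The day count matches the 92 given for data_days.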
stores_enriched_v4.csv (171 rows x 63 columns)
| Column | Type | Description |
|---|---|---|
| osm_id – takeaway, lat, lon | various | Same 17 columns as manhattan_starbucks_osm.csv |
| pluto_bbl | float | NYC Borough-Block-Lot identifier |
| pluto_landuse | float | PLUTO land use code |
| pluto_bldgclass | str | Building class code |
| pluto_numfloors | float | Number of floors (9 missing) |
| pluto_yearbuilt | float | Year built |
| pluto_unitstotal | float | Total residential units |
| pluto_retailarea | float | Retail floor area in sq ft (8 missing) |
| pluto_assesstot | float | Total assessed value |
| pluto_comarea | float | Commercial floor area in sq ft (8 missing) |
| pluto_lotarea | float | Lot area in sq ft |
| pluto_zonedist1 | str | Primary zoning district |
| pluto_dist_m | float | Distance to matched PLUTO lot (meters) |
| mta_station_id | str | Nearest MTA station complex ID |
| mta_station_name | str | Nearest station name |
| mta_dist_m | float | Distance to nearest station (meters) |
| mta_avg_daily_ridership | int | Nearest station's average daily ridership |
| n_starbucks_250m | int | Starbucks count within 250m |
| n_dunkin_250m | int | Dunkin' count within 250m |
| n_other_cafe_250m | int | Other cafe count within 250m |
| n_starbucks_500m | int | Starbucks count within 500m |
| n_dunkin_500m | int | Dunkin' count within 500m |
| n_other_cafe_500m | int | Other cafe count within 500m |
| n_starbucks_1000m | int | Starbucks count within 1km |
| n_dunkin_1000m | int | Dunkin' count within 1km |
| n_other_cafe_1000m | int | Other cafe count within 1km |
| nearest_competitor_dist_m | float | Distance to nearest non-Starbucks cafe (meters) |
| nearest_starbucks_dist_m | float | Distance to nearest other Starbucks (meters) |
| census_tract_id | int | Census Tract GEOID |
| tract_population | int | Tract total population (ACS 2022) |
| tract_median_income | float | Tract median household income in dollars (3 missing) |
| tract_pct_walk_commute | float | Percent of workers who walk to work |
| tract_pct_bachelors_plus | float | Percent of adults with bachelor's degree or higher |
| station_cluster | int | MTA station ridership-pattern cluster (0-3) |
| station_cluster_name | str | Cluster label: Morning Peak / Balanced / Midday-Heavy / Evening Peak |
| ped_count_nearest | float | Pedestrian count at the nearest NYC DOT counter location |
| ped_dist_m | float | Distance to the nearest pedestrian count location (meters) |
| tract_starbucks_count | int | Number of Starbucks in the same Census Tract |
| tract_total_cafes | int | Total number of cafes in the same Census Tract |
| tract_competitor_cafes | int | Number of non-Starbucks cafes in the same Census Tract |
| tract_mta_ridership | float | Total MTA ridership aggregated at the Census Tract level |
| tract_avg_ped_count | float | Average pedestrian count at tract level |
| tract_pop_density | float | Population density of the Census Tract (people per sq km) |
| demand_proxy_index | float | Composite demand proxy index (ridership + pedestrian + population) |
| supply_index | float | Supply concentration index (cafe density relative to demand) |
| location_fitness_score | float | Location Fitness Score: demand-supply balance metric (higher = more underserved demand) |
| tract_lisa_cluster | str | LISA spatial autocorrelation cluster for the tract (High-High, Low-Low, etc.) |
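
The radius-based competitor counts (n_starbucks_250m, n_other_cafe_500m, and so on) can be approximated from the WGS84 coordinates alone. The sketch below assumes straight-line (great-circle) distance computed with the haversine formula; the source does not say whether the published columns use straight-line or walking distance. The function names and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def count_within(store, others, radius_m):
    """Number of points in `others` no farther than radius_m from `store`."""
    return sum(
        1
        for o in others
        if haversine_m(store["lat"], store["lon"], o["lat"], o["lon"]) <= radius_m
    )

# Invented coordinates: three cafes roughly 111 m, 334 m and 2.2 km north of the store.
store = {"lat": 40.7580, "lon": -73.9855}
cafes = [
    {"lat": 40.7590, "lon": -73.9855},
    {"lat": 40.7610, "lon": -73.9855},
    {"lat": 40.7780, "lon": -73.9855},
]

counts = {r: count_within(store, cafes, r) for r in (250, 500, 1000)}
print(counts)
```

When counting other Starbucks around a Starbucks, the store itself should be excluded from `others` first, otherwise every count is inflated by one.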
tract_demand_supply.csv (309 rows x 24 columns)
| Column | Type | Description |
|---|---|---|
| GEOID | str | Census Tract FIPS code |
| ALAND | int | Land area in square meters |
| AWATER | int | Water area in square meters |
| NAMELSAD | str | Tract name |
| starbucks_count | int | Number of Starbucks in this tract |
| lisa_cluster | str | LISA classification: High-High, Low-Low, High-Low, Low-High, or Not Significant |
| tract_starbucks_count | int | Starbucks count (same as starbucks_count) |
| tract_total_cafes | int | Total number of cafes in the tract |
| tract_competitor_cafes | int | Number of non-Starbucks cafes |
| tract_mta_ridership | float | Total MTA daily ridership attributed to the tract |
| tract_avg_ped_count | float | Average pedestrian count in the tract |
| tract_population | int | Tract total population (ACS 2022) |
| tract_median_income | float | Tract median household income |
| tract_pct_walk_commute | float | Percent of workers who walk to work |
| tract_pct_bachelors_plus | float | Percent with bachelor's degree or higher |
| tract_pop_density | float | Population density (people per sq km) |
| tract_avg_ped_count_filled | float | Pedestrian count with missing values filled |
| dpi_ridership | float | Demand proxy sub-index: MTA ridership component |
| dpi_pedestrian | float | Demand proxy sub-index: pedestrian flow component |
| dpi_population | float | Demand proxy sub-index: population component |
| demand_proxy_index | float | Composite Demand Proxy Index (0-1 scale) |
| supply_index | float | Supply concentration index (cafe count relative to area) |
| supply_normalized | float | Normalized supply index (0-1 scale) |
| location_fitness_score | float | Location Fitness Score: demand minus supply (higher = underserved) |
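
The description of location_fitness_score as "demand minus supply" can be illustrated with a small sketch. It assumes that supply_normalized is a min-max rescaling of supply_index and that the score is demand_proxy_index minus supply_normalized; neither detail is confirmed by the source, and the tract identifiers and values are placeholders.

```python
def min_max(values):
    """Rescale values to the 0-1 range; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Placeholder tracts with invented numbers, not rows from tract_demand_supply.csv.
tracts = [
    {"GEOID": "A", "demand_proxy_index": 0.9, "supply_index": 2.0},
    {"GEOID": "B", "demand_proxy_index": 0.5, "supply_index": 10.0},
    {"GEOID": "C", "demand_proxy_index": 0.2, "supply_index": 6.0},
]

supply_normalized = min_max([t["supply_index"] for t in tracts])
for tract, supply in zip(tracts, supply_normalized):
    tract["supply_normalized"] = supply
    # Assumed formula: higher score means more demand relative to supply.
    tract["location_fitness_score"] = tract["demand_proxy_index"] - supply

# Rank tracts from most to least underserved.
ranked = sorted(tracts, key=lambda t: t["location_fitness_score"], reverse=True)
print([t["GEOID"] for t in ranked])
```

Sorting by the score in descending order puts the most underserved tracts first, which is the natural starting point for candidate-site screening.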
manhattan_pedestrian_counts.csv (36 rows x 113 columns)
| Column | Type | Description |
|---|---|---|
| the_geom | str | Point geometry (WKT) |
| OBJECTID | int | Unique object identifier |
| Loc | int | Location identifier |
| Borough | str | Borough name (Manhattan) |
| Street_Nam | str | Street name of the count location |
| From_Stree | str | Cross street (from) |
| To_Street | str | Cross street (to) |
| Iex | str | Index/intersection identifier |
| {Period}_AM | int | AM peak pedestrian count for the survey period (e.g., May07_AM, Sept07_AM, ... May25_AM) |
| {Period}_PM | int | PM peak pedestrian count for the survey period |
| {Period}_MD | int | Midday pedestrian count for the survey period |
Survey periods run from May 2007 to May 2025. Counts are taken twice a year (May and September/October), with three time slots per survey (AM, PM, Midday).
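
Because each survey period is stored as three wide columns, trend analysis is usually easier after reshaping to one record per location, period, and time slot. The sketch below assumes column names follow the pattern shown in the table (a month abbreviation, a two-digit year, then _AM, _PM, or _MD) and that blank cells mean no count was taken. The helper name and the sample row are hypothetical.

```python
import re

# Matches names such as May07_AM or Sept07_PM; other columns are ignored.
PERIOD_COLUMN = re.compile(r"^(?P<month>[A-Za-z]+)(?P<yy>\d{2})_(?P<slot>AM|PM|MD)$")

def wide_to_long(row):
    """Turn one wide row into a list of {Loc, month, year, slot, count} records."""
    records = []
    for column, value in row.items():
        match = PERIOD_COLUMN.match(column)
        if match is None or value in ("", None):
            continue  # not a count column, or no count recorded
        records.append({
            "Loc": row["Loc"],
            "month": match["month"],
            "year": 2000 + int(match["yy"]),  # valid for the 2007-2025 range
            "slot": match["slot"],
            "count": int(value),
        })
    return records

# Invented row for illustration only.
sample_row = {
    "Loc": 1,
    "Street_Nam": "Broadway",
    "May07_AM": 1200,
    "May07_PM": 3400,
    "May07_MD": 2500,
    "Sept07_AM": "",
}

long_records = wide_to_long(sample_row)
print(len(long_records), long_records[0])
```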
manhattan_tracts_lisa.geojson (289 tracts)
Census Tract polygons with LISA (Local Indicators of Spatial Association) results:
| Property | Type | Description |
|---|---|---|
| GEOID | str | Census Tract FIPS code |
| NAMELSAD | str | Tract name |
| geometry | Polygon | Tract boundary (WGS84) |
| starbucks_count | int | Number of Starbucks in this tract |
| lisa_cluster | str | LISA classification: High-High, Low-Low, High-Low, Low-High, or Not Significant |
| lisa_q | int | LISA quadrant (1-4) |
| lisa_p | float | LISA p-value |
| lisa_I | float | Local Moran's I statistic |
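
The relationship between lisa_q, lisa_p, and lisa_cluster can be sketched as below. This assumes the quadrant numbering used by PySAL's esda library (1 = High-High, 2 = Low-High, 3 = Low-Low, 4 = High-Low) and a significance threshold of 0.05; the source confirms neither, so check a few rows against the published lisa_cluster values before relying on it.

```python
# Assumed quadrant numbering (PySAL esda convention).
QUADRANT_LABELS = {1: "High-High", 2: "Low-High", 3: "Low-Low", 4: "High-Low"}

def lisa_label(lisa_q: int, lisa_p: float, alpha: float = 0.05) -> str:
    """Return the cluster label, or 'Not Significant' when p is at or above alpha."""
    if lisa_p >= alpha:
        return "Not Significant"
    return QUADRANT_LABELS[lisa_q]

# Invented (quadrant, p-value) pairs for illustration.
examples = [(1, 0.001), (3, 0.04), (4, 0.20)]
labels = [lisa_label(q, p) for q, p in examples]
print(labels)
```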
mta_station_clusters.csv (123 rows x 3 columns)
| Column | Type | Description |
|---|---|---|
| station_complex_id | str | MTA station complex identifier (join key) |
| cluster | int | Cluster assignment (0-3) |
| cluster_name | str | Morning Peak (Residential) / Balanced (Transit Hub) / Midday-Heavy (Tourism) / Evening Peak (Office) |
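
Since station_complex_id is the join key, the cluster labels can be attached to the ridership summary with a simple lookup. The sketch below uses two in-memory stand-ins for the CSV files; the station IDs, names, ridership figures, and cluster numbers are invented.

```python
import csv
import io

# Stand-in for manhattan_mta_ridership_summary.csv (invented rows, subset of columns).
ridership_file = io.StringIO(
    "station_complex_id,station_name,avg_daily_ridership\n"
    "001,Example Sq (1/2/3),60000\n"
    "002,Sample Av (A/C),20000\n"
)
# Stand-in for mta_station_clusters.csv (invented rows).
clusters_file = io.StringIO(
    "station_complex_id,cluster,cluster_name\n"
    "001,1,Balanced (Transit Hub)\n"
    "002,0,Morning Peak (Residential)\n"
)

# Index the cluster rows by the join key, kept as a string.
clusters_by_id = {row["station_complex_id"]: row for row in csv.DictReader(clusters_file)}

joined = []
for row in csv.DictReader(ridership_file):
    match = clusters_by_id.get(row["station_complex_id"])
    joined.append({
        "station_complex_id": row["station_complex_id"],
        "station_name": row["station_name"],
        "avg_daily_ridership": int(row["avg_daily_ridership"]),
        "cluster_name": match["cluster_name"] if match else None,
    })

print(joined[0])
```

Keeping the key as a string on both sides avoids mismatches if identifiers carry leading zeros or non-numeric characters. In pandas the equivalent precaution is to read the column with a string dtype before merging.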
Missing data policy
- lat/lon: Never missing (required fields)
- name: Never missing (filled with 'Unknown Cafe' if absent in OSM)
- brand: NaN for independent cafes (by design, not an error)
- Address fields: NaN where OSM contributors haven't added the information
- Ridership: All 123 stations have ridership > 0
- PLUTO fields: numfloors (9 missing), retailarea/comarea (8 missing) — buildings where PLUTO lacks data
- tract_median_income: 3 missing — Census tracts with suppressed income data
- ped_count_nearest: Available for all 171 stores (taken from the nearest DOT counter; ped_dist_m gives the distance to that counter)
- tract_avg_ped_count: Some tracts have no nearby pedestrian counter; filled via tract_avg_ped_count_filled in tract_demand_supply.csv
Related notebooks (15-notebook series)
| # | Notebook | Theme | Link |
|---|---|---|---|
| 0 | Manhattan Cafe Wars | EDA & competitor mapping | Open |
| 1 | Starbucks 10-K NLP | Keyword trends, LDA topics | Open |
| 1F | LDA Topic Explorer | Interactive pyLDAvis | Open |
| 2A | Spatial Clustering | Moran's I, LISA, Ripley's K | Open |
| 2B | Location Fitness | Demand-supply scoring & backtest | Open |
| 2C | Walk-Distance Analysis | OSMnx network analysis | Open |
| 5A | LFS Predictive Model | XGBoost/Random Forest model | Open |
| 5B | Strategic Location Insights | So What — actionable recommendations | Open |
| 5C | Optimal Store Placement | Greedy maximal coverage algorithm | Open |
| 5D | Hourly Demand Patterns | Time-of-day station demand profiles | Open |
| 5E | Interactive Strategy Map | 5-layer Folium visualization | Open |
| T | Generalization Template | LFS framework for any city | Open |
| C | Data Pipeline | EDGAR & OSM to CSV | Open |
| D | US Expansion Choropleth | 30-year animated map | Open |
| G | NLP × Spatial Integration | Cross-theme synthesis | Open |
Related dataset: Starbucks 30-Year 10-K NLP Corpus — NLP data for Theme 1
Updates
- v1.0 (2026-03-12): Initial release — 3 CSV files with Q4 2024 ridership
- v2.0 (2026-03-13): Added stores_enriched_v3.csv (PLUTO + MTA + competitors + Census joined), manhattan_tracts_lisa.geojson (LISA clusters), mta_station_clusters.csv (ridership pattern clusters)
- v3.0 (2026-03-13): Replaced stores_enriched_v3 with stores_enriched_v4 (51 -> 63 columns: tract-level aggregates, demand/supply indices, Location Fitness Score, pedestrian counts). Added tract_demand_supply.csv (309 tracts with full LFS scoring). Added manhattan_pedestrian_counts.csv (NYC DOT bi-annual pedestrian counts at 36 Manhattan locations, 2007-2025).