Learning Objectives: - Apply spatial operations to answer policy-relevant research questions - Integrate census demographic data with spatial analysis - Create publication-quality visualizations and maps - Work with spatial data from multiple sources - Communicate findings effectively for policy audiences
Part 1: Healthcare Access for Vulnerable Populations
Research Question
Which Pennsylvania counties have the highest proportion of vulnerable populations (elderly + low-income) living far from hospitals?
Your analysis should identify counties that should be priorities for healthcare investment and policy intervention.
Required Analysis Steps
Complete the following analysis, documenting each step with code and brief explanations:
Step 1: Data Collection (5 points)
Load the required spatial data: - Pennsylvania county boundaries - Pennsylvania hospitals (from lecture data) - Pennsylvania census tracts
Questions to answer: - How many hospitals are in your dataset? There are 223 hospitals in my dataset.
How many census tracts? There are 3445 census tracts in my dataset.
What coordinate reference system is each dataset in? All of them is now in WGS 1984
Step 2: Get Demographic Data
Use tidycensus to download tract-level demographic data for Pennsylvania.
Required variables: - Total population - Median household income - Population 65 years and over (you may need to sum multiple age categories)
Your Task:
Questions to answer: - What year of ACS data are you using? 2022 5-year ACS data - How many tracts have missing income data? 62 - What is the median income across all PA census tracts? 70188
Step 3: Define Vulnerable Populations
Identify census tracts with vulnerable populations based on TWO criteria: 1. Low median household income (choose an appropriate threshold) 2. Significant elderly population (choose an appropriate threshold)
Your Task:
Questions to answer: - What income threshold did you choose and why? I chose the 25th percentile of median household income across PA tracts — meaning that tracts with income in the bottom quarter of the state are flagged
What elderly population threshold did you choose and why? I chose the 75th percentile of elderly share across PA tracts — meaning that tracts where the proportion of 65+ residents falls in the top quarter of the state are flagged.
How many tracts meet your vulnerability criteria? 165 out of 3,383 valid PA census tracts meet the vulnerability criteria.
What percentage of PA census tracts are considered vulnerable by your definition? Approximately 4.9% of valid PA census tracts are considered vulnerable by this definition.
Step 4: Calculate Distance to Hospitals
For each vulnerable tract, calculate the distance to the nearest hospital.
Your Task:
[1] 4.745881
[1] 19.15872
[1] 9
Requirements: - Use an appropriate projected coordinate system for Pennsylvania - Calculate distances in miles - Explain why you chose your projection
Questions to answer: - What is the average distance to the nearest hospital for vulnerable tracts? On average, vulnerable tracts are 4.75 miles from the nearest hospital.
What is the maximum distance? The maximum distance to the nearest hospital among vulnerable tracts is 19.16 miles.
How many vulnerable tracts are more than 15 miles from the nearest hospital? NINE vulnerable tracts are more than 15 miles from the nearest hospital.
Step 5: Identify Underserved Areas
Define “underserved” as vulnerable tracts that are more than 15 miles from the nearest hospital.
Your Task:
[1] 9
[1] 5.454545
Questions to answer: - How many tracts are underserved? Nine.
What percentage of vulnerable tracts are underserved? Approximately 5.5% vulnerable tracts are defined as underserved.
Does this surprise you? Why or why not? This is not entirely surprising. Pennsylvania has a relatively dense hospital network, particularly in urban areas like Philadelphia and Pittsburgh, which keeps most tracts within reasonable distance. At 5.5%, the underserved share is reasonably small — but the 9 tracts that do exceed 15 miles are concerning, as low-income elderly residents in these areas are least likely to have reliable transportation to bridge that gap.
Step 6: Aggregate to County Level
Use spatial joins and aggregation to calculate county-level statistics about vulnerable populations and hospital access.
Required county-level statistics: - Number of vulnerable tracts - Number of underserved tracts
- Percentage of vulnerable tracts that are underserved - Average distance to nearest hospital for vulnerable tracts - Total vulnerable population
Questions to answer: - Which 5 counties have the highest percentage of underserved vulnerable tracts? Perry, Bradford, Columbia, Juniata, and Monroe all have 100% of their vulnerable tracts classified as underserved, with average distances ranging from 15.9 to 17.7 miles from the nearest hospital.
Which counties have the most vulnerable people living far from hospitals? Luzerne leads with 6,125 vulnerable elderly residents and 1 underserved tract, followed by Warren (5,963), Crawford (4,683), Clearfield (4,368), and Northumberland (3,395).
Are there any patterns in where underserved counties are located? The underserved counties are predominantly rural, located in central and northern Pennsylvania (Perry, Juniata, Columbia, Bradford, Clearfield, Warren, Crawford). This aligns with the broader pattern of rural healthcare deserts, where low population density makes hospital infrastructure harder to sustain.
Step 7: Create Summary Table
Create a professional table showing the top 10 priority counties for healthcare investment.
Your Task:
# Create and format priority counties tablelibrary(knitr)library(kableExtra)county_stats %>%arrange(desc(pct_underserved), desc(total_vulnerable_pop)) %>%head(10) %>%mutate(avg_dist =round(avg_dist, 2),pct_underserved =round(pct_underserved, 1),total_vulnerable_pop =comma(total_vulnerable_pop) ) %>%rename("County"= COUNTY_NAM,"Vulnerable Tracts"= n_vulnerable,"Underserved Tracts"= n_underserved,"% Underserved"= pct_underserved,"Avg. Distance to Hospital (miles)"= avg_dist,"Total Vulnerable Population"= total_vulnerable_pop ) %>%kable(caption ="Top 10 Priority Counties for Healthcare Investment in Pennsylvania: Counties with the highest share of vulnerable tracts lacking adequate hospital access",align =c("l", "c", "c", "c", "c", "c") ) %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed"),full_width =FALSE )
Top 10 Priority Counties for Healthcare Investment in Pennsylvania: Counties with the highest share of vulnerable tracts lacking adequate hospital access
County
Vulnerable Tracts
Underserved Tracts
% Underserved
Avg. Distance to Hospital (miles)
Total Vulnerable Population
PERRY
2
2
100
17.53
1,558
BRADFORD
1
1
100
16.66
1,545
JUNIATA
1
1
100
15.91
492
MONROE
1
1
100
17.68
314
COLUMBIA
1
1
100
16.91
282
SULLIVAN
1
1
100
16.91
282
DAUPHIN
2
1
50
9.95
2,122
TIOGA
2
1
50
8.65
1,416
CLINTON
2
1
50
11.03
1,402
FOREST
2
1
50
15.25
1,311
Requirements: - Use knitr::kable() or similar for formatting - Include descriptive column names - Format numbers appropriately (commas for population, percentages, etc.) - Add an informative caption - Sort by priority (you decide the metric)
Part 2: Comprehensive Visualization
Using the skills from Week 3 (Data Visualization), create publication-quality maps and charts.
Map 1: County-Level Choropleth
Create a choropleth map showing healthcare access challenges at the county level.
Your Task:
# Create county-level access maphospitals <-st_transform(hospitals, st_crs(pa_counties))# Join county_stats to pa_counties spatial datacounty_map <- pa_counties %>%left_join(county_stats, by ="COUNTY_NAM")# Create county-level access mapggplot() +geom_sf(data = county_map,aes(fill = pct_underserved),color ="white", linewidth =0.3) +geom_sf(data = hospitals,color ="red", size =0.8, alpha =0.6) +scale_fill_distiller(palette ="YlOrRd",direction =1,na.value ="gray90",name ="% Underserved\nVulnerable Tracts",labels = scales::label_percent(scale =1) ) +labs(title ="County-Level Hospital Access for Vulnerable Populations in Pennsylvania",subtitle ="Share of vulnerable tracts more than 15 miles from the nearest hospital",caption ="Source: U.S. Census Bureau ACS 2022, PA County Boundaries\nVulnerable tracts defined as low-income (bottom 25%) with high elderly share (top 25%)" ) +theme_void()
Requirements: - Fill counties by percentage of vulnerable tracts that are underserved - Include hospital locations as points - Use an appropriate color scheme - Include clear title, subtitle, and caption - Use theme_void() or similar clean theme - Add a legend with formatted labels
Map 2: Detailed Vulnerability Map
Create a map highlighting underserved vulnerable tracts.
Your Task:
# Create detailed tract-level mapggplot() +geom_sf(data = pa_counties, fill ="gray95", color ="gray60", linewidth =0.3) +geom_sf(data = vulnerable_tracts %>%filter(!underserved),fill ="lightyellow", color =NA, alpha =0.7) +geom_sf(data = vulnerable_tracts %>%filter(underserved),fill ="red", color =NA, alpha =0.9) +geom_sf(data = hospitals,color ="blue", size =0.8, alpha =0.6) +labs(title ="Underserved Vulnerable Tracts in Pennsylvania",subtitle ="Red tracts are low-income, elderly, and more than 15 miles from the nearest hospital",caption ="Source: U.S. Census Bureau ACS 2022\nBlue points = Hospitals" ) +theme_void()
Requirements: - Show underserved vulnerable tracts in a contrasting color - Include county boundaries for context - Show hospital locations - Use appropriate visual hierarchy (what should stand out?) - Include informative title and subtitle
Chart: Distribution Analysis
Create a visualization showing the distribution of distances to hospitals for vulnerable populations.
Your Task:
# Create distribution visualization# Plot 1: Histogram weighted by elderly populationp1 <-ggplot(vulnerable_tracts, aes(x = nearest_hosp_dist, weight = pop_65_plus)) +geom_histogram(aes(fill =after_stat(x) >15), bins =30, alpha =0.8) +scale_fill_manual(values =c("FALSE"="steelblue", "TRUE"="red"),labels =c("FALSE"="Served", "TRUE"="Underserved"),name ="" ) +labs(title ="Elderly Population by Distance to Nearest Hospital",subtitle ="Among vulnerable tracts in Pennsylvania",x ="Distance to Nearest Hospital (miles)",y ="Total Elderly Population (65+)",caption ="Red bars = beyond 15-mile underserved threshold" ) +theme_minimal()# Plot 2: Scatter plot of distance vs elderly populationp2 <-ggplot(vulnerable_tracts, aes(x = nearest_hosp_dist, y = pop_65_plus, color = underserved)) +geom_point(alpha =0.7) +scale_color_manual(values =c("FALSE"="steelblue", "TRUE"="red"),labels =c("FALSE"="Served", "TRUE"="Underserved"),name ="" ) +labs(title ="Elderly Population vs. Distance to Nearest Hospital by Tract",subtitle ="Among vulnerable tracts in Pennsylvania",x ="Distance to Nearest Hospital (miles)",y ="Total Elderly Population (65+)",caption ="Source: U.S. Census Bureau ACS 2022" ) +theme_minimal()# Comparisonp1 / p2
#The histogram shows the total elderly population living in the census tracts at certain distance range from the nearest hospital. #The majority of bars are concentrated within 5 miles and decline sharply beyond that#The dashline marks the threshold of underserved population.
Suggested chart types: - Histogram or density plot of distances - Box plot comparing distances across regions - Bar chart of underserved tracts by county - Scatter plot of distance vs. vulnerable population size
Requirements: - Clear axes labels with units - Appropriate title - Professional formatting - Brief interpretation (1-2 sentences as a caption or in text)
Part 3: Bring Your Own Data Analysis
Choose your own additional spatial dataset and conduct a supplementary analysis.
Challenge Options
Choose ONE of the following challenge exercises, or propose your own research question using OpenDataPhilly data (https://opendataphilly.org/datasets/).
School Safety Zones - Data: Schools, Crime Incidents, Collision - Question: “Are school zones safe for commuting, or are they danger hotspots?” - Operations: Buffer schools (1000ft safety zone), spatial join with crime incidents and collision count - Policy relevance: Safe Routes to School programs, crossing guard placement
Your Analysis
Your Task:
Find and load additional data
Document your data source
Check and standardize the CRS
Provide basic summary statistics
Questions to answer: - What dataset did you choose and why? I chose three datasets from OpenDataPhilly: Philadelphia crime incidents (2020–2024 aggregate), traffic collision data (2020–2024), and school parcels. Crime and collision data were chosen because both directly affect student safety around schools. Violent crime threatens personal safety while traffic collisions are a major concern for students walking or biking to school.
What is the data source and date? All datasets are from OpenDataPhilly, maintained by the City of Philadelphia. Crime and collision data span 2020–2024 and school parcels reflect current school locations.
How many features does it contain? After filtering for relevant crime categories, the crime dataset contains 234,858 incidents across five years. The collision dataset contains 36,303 records over the same period.
What CRS is it in? Did you need to transform it? All datasets are in EPSG 3857 (WGS 84 / Pseudo-Mercator), the same CRS as “pa_counties”. No additional transformation was needed since they already matched.
Pose a research question
Are school safety zones in Philadelphia’s high-violation tracts disproportionately exposed to crime incidents and traffic hazards?
Examples: - “Do vulnerable tracts have adequate public transit access to hospitals?” - “Are EMS stations appropriately located near vulnerable populations?” - “Do areas with low vehicle access have worse hospital access?”
Conduct spatial analysis
Use at least TWO spatial operations to answer your research question.
## buffer school safety zone and calculate number of crime and collision for each school# 1000ft = 304.8 meters in EPSG 3857safety_zone <- schools %>%st_buffer(dist =304.8)# Spatial joincrime_in_buffer <- safety_zone %>%st_join(crime) %>%st_drop_geometry() %>%group_by(objectid.x) %>%summarise(crime_count =n()) %>%rename(objectid = objectid.x)collision_in_buffer <- safety_zone %>%st_join(collision) %>%st_drop_geometry() %>%group_by(objectid.x) %>%summarise(collision_count =n()) %>%rename(objectid = objectid.x)# Join countssafety_zone <- schools %>%st_buffer(dist =304.8) %>%left_join(crime_in_buffer, by ="objectid") %>%left_join(collision_in_buffer, by ="objectid") %>%mutate(crime_count =replace_na(crime_count, 0),collision_count =replace_na(collision_count, 0),danger_score = crime_count + collision_count )##identify the top 5 dangerous school parcels by crime/collision/crime + collision (= danger)/##top 20% crime and top 20% collision# Flag high-violation schools (top 20% in BOTH crime and collision)crime_cut <-quantile(safety_zone$crime_count, 0.80, na.rm =TRUE)collision_cut <-quantile(safety_zone$collision_count, 0.80, na.rm =TRUE)safety_zone <- safety_zone %>%mutate(high_violation = crime_count >= crime_cut & collision_count >= collision_cut)sum(safety_zone$high_violation)
[1] 52
ggplot() +geom_sf(data = schools, fill ="gray90", color ="gray60", linewidth =0.3) +geom_sf(data = safety_zone %>%filter(high_violation),fill ="red", alpha =0.6, color ="darkred", linewidth =0.3) +labs(title ="High-Violation School Safety Zones in Philadelphia",subtitle ="Schools in top 20% of both crime and collision counts",caption ="Source: OpenDataPhilly 2020-2024" ) +theme_void()
##calculate the average distance from danger point in a school buffer to the schoolschools <-rename(schools, OBJECTID = objectid)# Crime distancescrime_dist_matrix <-st_distance(crime, schools)crime_dist <- crime %>%mutate(dist_to_school =apply(crime_dist_matrix, 1, min)) %>%st_join(safety_zone %>%select(objectid)) %>%st_drop_geometry() %>%group_by(objectid.y) %>%summarise(avg_crime_dist =mean(dist_to_school, na.rm =TRUE)) %>%rename(objectid = objectid.y)safety_zone <- safety_zone %>%left_join(crime_dist, by ="objectid")# Collision distancescollision_dist_matrix <-st_distance(collision, schools)collision_dist <- collision %>%mutate(dist_to_school =apply(collision_dist_matrix, 1, min)) %>%st_join(safety_zone %>%select(objectid)) %>%st_drop_geometry() %>%group_by(objectid.y) %>%summarise(avg_collision_dist =mean(dist_to_school, na.rm =TRUE)) %>%rename(objectid = objectid.y)safety_zone <- safety_zone %>%left_join(collision_dist, by ="objectid")# Top 5 by average crime distancetop5_crime_dist <- safety_zone %>%st_drop_geometry() %>%arrange(avg_crime_dist) %>%head(5) %>%select(objectid, school_nam, avg_crime_dist) %>%kable(caption ="Top 5 Schools with Least Average Crime Distance (meters)",col.names =c("ID", "School Name", "Avg Crime Distance (m)")) %>%kable_styling(bootstrap_options =c("striped", "hover"), full_width =FALSE)# Top 5 by average collision distancetop5_collision_dist <- safety_zone %>%st_drop_geometry() %>%arrange(avg_collision_dist) %>%head(5) %>%select(objectid, school_nam, avg_collision_dist) %>%kable(caption ="Top 5 Schools by Average Collision Distance (meters)",col.names =c("ID", "School Name", "Avg Collision Distance (m)")) %>%kable_styling(bootstrap_options =c("striped", "hover"), full_width =FALSE)top5_crime_dist
Top 5 Schools with Least Average Crime Distance (meters)
ID
School Name
Avg Crime Distance (m)
73
NAZARETH ACADEMY HIGH SCHOOL
39.56615
384
MATH, SCIENCE AND TECHNOLOGY
55.24363
473
NA
100.19317
304
ROXBOROUGH HIGH SCHOOL
100.85450
130
FELTONVILLE ARTS & SCIENCES
114.11754
top5_collision_dist
Top 5 Schools by Average Collision Distance (meters)
ID
School Name
Avg Collision Distance (m)
186
MIFFLIN, THOMAS SCHOOL
95.84298
153
PENNSYLVANIA SCHOOL FOR THE DEAF
104.24374
173
CARNELL, LAURA H. SCHOOL
110.58027
467
SAINTS TABERNACLE DAY SCHOOL
114.22460
254
MASTERY CHARTER @ PICKETT
116.87737
# Flag schools with least 20% crime distance and least 20% collision distancecrime_dist_cut <-quantile(safety_zone$avg_crime_dist, 0.20, na.rm =TRUE)collision_dist_cut <-quantile(safety_zone$avg_collision_dist, 0.20, na.rm =TRUE)safety_zone <- safety_zone %>%mutate(high_violation_dist = avg_crime_dist <= crime_dist_cut & avg_collision_dist <= collision_dist_cut)cat("Number of high-violation distance schools:", sum(safety_zone$high_violation_dist, na.rm =TRUE), "\n")
Number of high-violation distance schools: 46
##aggregate by census tract and county# Step 5: Aggregate by census tract and county# First transform census_tracts to match CRSpa_tract <-st_transform(pa_tract, st_crs(safety_zone))# Spatial join safety_zone to census tractssafety_zone_tracts <- safety_zone %>%st_join(pa_tract %>%select(GEOID, NAME)) %>%st_drop_geometry() %>%group_by(GEOID, NAME) %>%summarise(n_schools =n(),n_high_violation =sum(high_violation, na.rm =TRUE),n_high_violation_dist =sum(high_violation_dist, na.rm =TRUE),avg_crime_count =mean(crime_count, na.rm =TRUE),avg_collision_count =mean(collision_count, na.rm =TRUE),avg_crime_dist =mean(avg_crime_dist, na.rm =TRUE),avg_collision_dist =mean(avg_collision_dist, na.rm =TRUE) ) %>%mutate(high_violation_rate = n_high_violation / n_schools *100,high_violation_dist_rate = n_high_violation_dist / n_schools *100,tract_num =str_extract(NAME, "(?<=Census Tract )\\d+\\.?\\d*") ) %>%arrange(desc(n_high_violation))
##visualization# Visualization 1: Tract-level choropleth maptract_map <- pa_tract %>%left_join(safety_zone_tracts %>%st_drop_geometry(), by ="GEOID") %>%filter(!is.na(high_violation_rate))ggplot() +geom_sf(data = pa_counties %>%filter(COUNTY_NAM =="PHILADELPHIA"),fill ="gray95", color ="gray60", linewidth =0.3) +geom_sf(data = tract_map,aes(fill = high_violation_rate),color ="white", linewidth =0.2) +scale_fill_distiller(palette ="YlOrRd",direction =1,na.value ="gray90",name ="% High Violation\nSchools",labels = scales::label_percent(scale =1) ) +labs(title ="High-Violation School Safety Zones by Census Tract",subtitle ="Share of schools in top 20% of both crime and collision counts",caption ="Source: OpenDataPhilly 2020-2024" ) +theme_void()
# Visualization 2: Compare high_violation vs high_violation_distggplot() +geom_sf(data = pa_counties %>%filter(COUNTY_NAM =="PHILADELPHIA"),fill ="gray95", color ="gray60", linewidth =0.3) +geom_sf(data = safety_zone,fill ="gray80", color =NA, alpha =0.5) +geom_sf(data = safety_zone %>%filter(high_violation),fill ="orange", color =NA, alpha =0.7) +geom_sf(data = safety_zone %>%filter(high_violation_dist),fill ="red", color =NA, alpha =0.7) +labs(title ="High-Violation School Safety Zones in Philadelphia",subtitle ="Orange = high crime/collision count | Red = close crime/collision violations",caption ="Source: OpenDataPhilly 2020-2024" ) +theme_void()
# Visualization 3: Crime and Collision Distribution Across Philadelphia School Safety Zonesp3 <-ggplot(safety_zone, aes(x = crime_count, fill = high_violation)) +geom_histogram(bins =30, alpha =0.8) +scale_fill_manual(values =c("FALSE"="steelblue", "TRUE"="red"),labels =c("FALSE"="Normal", "TRUE"="High Violation"),name ="" ) +labs(title ="Distribution of Crime Counts Across School Safety Zones",x ="Crime Count (2020-2024)",y ="Number of Schools",caption ="Red = top 20% in both crime and collision counts" ) +theme_minimal()# p2: Scatter plot of crime_count vs collision_count colored by high_violationp4 <-ggplot(safety_zone, aes(x = crime_count, y = collision_count, color = high_violation)) +geom_point(alpha =0.7) +scale_color_manual(values =c("FALSE"="steelblue", "TRUE"="red"),labels =c("FALSE"="Normal", "TRUE"="High Violation"),name ="" ) +labs(title ="Crime Count vs. Collision Count by School",x ="Crime Count (2020-2024)",y ="Collision Count (2020-2024)",caption ="Source: OpenDataPhilly 2020-2024" ) +theme_minimal()p3 / p4
Analysis requirements: - Clear code comments explaining each step - Appropriate CRS transformations - Summary statistics or counts - At least one map showing your findings - Brief interpretation of results (3-5 sentences)
Your interpretation: This analysis examined 490 school safety zones across Philadelphia using crime incident and traffic collision data from 2020 to 2024. Of the 490 schools, 52 (10.6%) were classified as high-violation by count — falling in the top 20% of both crime and collision counts — while 46 schools (9.4%) were flagged as high-violation by distance, meaning violations tend to cluster very close to the school itself. The histogram shows that most schools have fewer than 500 crime incidents over the five-year period, with high-violation schools appearing at the higher end of the distribution, confirming that the top 20% threshold effectively separates genuinely dangerous zones from the rest. The scatter plot further reveals a positive relationship between crime count and collision count — schools with more crime also tend to have more collisions — and high-violation schools cluster distinctly in the upper right, indicating that danger tends to be compounded rather than isolated to a single incident type. At the census tract level, dangerous school zones are mostly concentrated in tracts along North Broad Street, Kensington, and West Philadelphia, underscoring the need for targeted Safe Routes to School investments in these specific corridors where students face both high volumes of crime and violations occurring in close proximity to school grounds.
Finally - A few comments about your incorporation of feedback!
I try to avoid unnecessary code like print() and head(). Besides, I used “#| echo: false, #| message: false, and #| results: hide” to hide setup code.
Submission Requirements
What to submit:
Rendered HTML document posted to your course portfolio with all code, outputs, maps, and text
Use embed-resources: true in YAML so it’s a single file
All code should run without errors
All maps and charts should display correctly
Submit the correct and working links of your assignment on Canvas