Considering the importance of education in an increasingly knowledge-based economy, I performed an exploratory data analysis of school performance in relation to attributes that might influence it.
The analysis was restricted to NYC public schools (comprising 32 school districts) and considered the following attributes:
Attendance rate
School safety
Class size
School district size
Poverty ratio
Ethnic background
Gender ratio
English language learners ratio
The SAT score, covering Math, Reading and Writing, was used as the indicator of school performance.
The following datasets were used for the analysis.
Features with near-zero variance were checked with the nearZeroVar function so that they could be dropped from the feature set; there were none.
Collinear features were checked with the vif function so that they could be dropped from the feature set; there were none.
regsubsets was used for feature selection. Of the 8 features in total, regsubsets picked the following 3 as having some influence on SAT scores: class size, poverty and gender ratio.
To cross-validate, feature selection was repeated with the steps function; the same 3 features were picked.
## (Intercept) poverty.ratio size female.ratio
## -4.4799063 -0.4308900 0.1759099 0.1954207
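To make the near-zero-variance screen mentioned above more concrete, here is a minimal Python sketch of the kind of rule such a check applies. It is not the code used in this analysis (which was done in R). The function name `near_zero_var`, the default thresholds (a frequency ratio of 95/5 and a 10% unique-value cutoff, which I believe mirror the defaults of caret's nearZeroVar) and the sample columns are all assumptions made for illustration.

```python
from collections import Counter


def near_zero_var(values, freq_cut=95 / 5, unique_cut=10.0):
    """Return True when a column is constant or dominated by one value.

    Hypothetical helper: flags a column when the most common value is far
    more frequent than the runner-up AND few distinct values exist.
    """
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values).most_common()
    if len(counts) == 1:
        return True  # zero variance: only one distinct value
    # Ratio of the most frequent value's count to the second most frequent.
    freq_ratio = counts[0][1] / counts[1][1]
    # Distinct values as a percentage of all observations.
    percent_unique = 100.0 * len(counts) / len(values)
    return freq_ratio > freq_cut and percent_unique <= unique_cut


# Made-up columns, not taken from the NYC datasets.
almost_constant = [1] * 98 + [0] * 2
varied = list(range(100))
print(near_zero_var(almost_constant))
print(near_zero_var(varied))
```

A feature flagged by a rule like this carries almost no information for a regression, which is why it would be dropped before feature selection.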
A linear regression of School Performance with these three variables as the predictors was performed.
##
## Call:
## lm(formula = total.percent ~ poverty.ratio + size + female.ratio,
## data = scaled.district.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0172 -0.4736 -0.1017 0.3043 1.8254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.47991 1.55994 -2.872 0.00769 **
## poverty.ratio -0.43089 0.14734 -2.925 0.00676 **
## size 0.17591 0.06108 2.880 0.00754 **
## female.ratio 0.19542 0.12841 1.522 0.13925
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6657 on 28 degrees of freedom
## Multiple R-squared: 0.5997, Adjusted R-squared: 0.5568
## F-statistic: 13.98 on 3 and 28 DF, p-value: 9.303e-06
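The summary statistics in this output are related by simple formulas, and it can help to see them worked through. The Python sketch below recomputes the t values, the adjusted R-squared and the F-statistic from the rounded numbers printed above. It assumes 32 observations (28 residual degrees of freedom plus 3 predictors plus the intercept), so small differences in the last digit are due to rounding of the inputs.

```python
# Values copied from the regression summary above (already rounded).
n_obs = 32      # assumed: 28 residual df + 3 predictors + 1 intercept
n_pred = 3
r_squared = 0.5997

# (estimate, standard error) for each predictor.
coefs = {
    "poverty.ratio": (-0.43089, 0.14734),
    "size": (0.17591, 0.06108),
    "female.ratio": (0.19542, 0.12841),
}


def t_value(estimate, std_error):
    """t statistic: the estimate measured in units of its standard error."""
    return estimate / std_error


def adjusted_r_squared(r2, n, p):
    """R-squared penalized for the number of predictors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def f_statistic(r2, n, p):
    """Overall F statistic of the regression, derived from R-squared."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))


for name, (est, se) in coefs.items():
    print(f"{name}: t = {t_value(est, se):.3f}")
print(f"adjusted R^2 = {adjusted_r_squared(r_squared, n_obs, n_pred):.4f}")
print(f"F = {f_statistic(r_squared, n_obs, n_pred):.2f}")
```

The t value for female.ratio comes out near 1.52, well below the magnitude of the other two predictors, which is what the p-value of 0.139 reflects.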
Only the class size and poverty features displayed a statistically significant influence, and hence gender ratio was dropped from further analysis.
I decided to take a closer look at the impact of these two key attributes on overall performance.
When school performance was plotted against class size and poverty with a regression line, the result was a surprise.
While the influence of poverty on SAT scores was in line with the expectation (increased poverty rates result in decreased scores), the impact of class size was totally unexpected.
The trend line shows performance declining with smaller class sizes.
Taking a second look at these plots, the impact of class size on school performance looks almost like a mirror image of the poverty plot. I wanted to understand the relation between these two factors. What I found was really interesting.
Clearly, most of the schools in poorer neighborhoods have smaller class sizes than those in better-off school districts.
I decided to look for an explanation and came across the “Contracts for Excellence” legislation.
This legislation funded a set of initiatives over a five-year period starting in 2007, including reduction of class size, focused on poor neighborhoods and schools performing below state standards.
Our findings suggest that some action has been taken under the legislation since 2007, as reflected in the 2010 class size data: most of the poor-neighborhood, low-performance schools have comparatively smaller class sizes.
Have smaller class sizes really helped to improve the performance of the school districts compared to those with bigger class sizes?
Let us check by comparing the performance of the 2010 class with that of the 2014 class from these schools with smaller class sizes.
To do so, we look at the change in total scores (2014 against 2010) by class size group.
Looking at the overlap of the notches in the box plot across size groups, there seems to be no significant improvement in scores for the classes with smaller sizes.
Since the city is making significant investments in reducing class sizes with the objective of improving school performance, it is worth checking for changes in scores at the subject level (Math, Reading and Writing) before rejecting the influence of class size reduction on performance.
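The notch comparison can be made explicit. In R's box plots the notch extends to roughly the median plus or minus 1.58 times the interquartile range divided by the square root of the group size, and two groups whose notches overlap show no strong evidence of differing medians. The Python sketch below applies that rule to made-up score changes. The function names and data are hypothetical, and the quartiles come from `statistics.quantiles`, which can differ slightly from the hinges R uses.

```python
import math
import statistics


def notch_interval(values):
    """Approximate box-plot notch: median +/- 1.58 * IQR / sqrt(n)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    half_width = 1.58 * (q3 - q1) / math.sqrt(len(values))
    return median - half_width, median + half_width


def notches_overlap(group_a, group_b):
    """True when the two notch intervals share at least one point."""
    lo_a, hi_a = notch_interval(group_a)
    lo_b, hi_b = notch_interval(group_b)
    return lo_a <= hi_b and lo_b <= hi_a


# Made-up score changes (2014 minus 2010) for two class size groups.
small_classes = [-0.4, -0.2, -0.1, 0.0, 0.0, 0.1, 0.2, 0.3, 0.5]
large_classes = [-0.5, -0.3, -0.1, 0.0, 0.1, 0.1, 0.2, 0.4, 0.4]

print(notch_interval(small_classes))
print(notch_interval(large_classes))
print(notches_overlap(small_classes, large_classes))
```

Overlapping notches are an informal visual check rather than a formal test, which is one reason to follow up with the per-subject regressions.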
The linear model and plot comparing the change in math score (2014 vs 2010) are shown below. The slope for class size is not statistically significant (p = 0.932), so there is no evidence that schools with smaller classes improved their math performance over this period.
##
## Call:
## lm(formula = chMath ~ math.size, data = scaled.school.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.13160 -0.20311 -0.00902 0.20622 0.95957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0093060 0.1108796 0.084 0.933
## math.size -0.0003822 0.0044963 -0.085 0.932
##
## Residual standard error: 0.3276 on 347 degrees of freedom
## Multiple R-squared: 2.082e-05, Adjusted R-squared: -0.002861
## F-statistic: 0.007225 on 1 and 347 DF, p-value: 0.9323
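For readers who want to see what sits behind a one-predictor summary like the one above, here is a minimal Python sketch of an ordinary least squares fit of score change on class size, returning the slope, its standard error and the t statistic. The function name and the data are hypothetical and are not drawn from the school dataset. The sketch also assumes the points do not lie exactly on a line, since the standard error would then be zero.

```python
import math


def slope_t_statistic(x, y):
    """Fit y = a + b*x by least squares; return (b, se(b), t)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Residual variance uses n - 2 degrees of freedom (slope and intercept).
    std_error = math.sqrt(rss / (n - 2) / sxx)
    return slope, std_error, slope / std_error


# Made-up class sizes and score changes for eight schools.
class_size = [18, 20, 22, 24, 26, 28, 30, 32]
score_change = [0.10, -0.20, 0.15, -0.05, 0.05, -0.15, 0.20, -0.10]

slope, std_error, t = slope_t_statistic(class_size, score_change)
print(f"slope = {slope:.4f}, se = {std_error:.4f}, t = {t:.3f}")
```

With several hundred schools, a slope is conventionally called significant at the 5% level when the magnitude of t exceeds roughly 2, which is far from the values reported in these summaries.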
The linear model and plot comparing the change in reading score (2014 vs 2010) are shown below. The slope for class size is not statistically significant (p = 0.313), so there is no evidence that schools with smaller classes improved their reading performance over this period.
##
## Call:
## lm(formula = chReading ~ english.size, data = scaled.school.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.21348 -0.20204 0.00022 0.22107 1.04587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.112576 0.112919 -0.997 0.319
## english.size 0.004623 0.004572 1.011 0.313
##
## Residual standard error: 0.3517 on 347 degrees of freedom
## Multiple R-squared: 0.002938, Adjusted R-squared: 6.421e-05
## F-statistic: 1.022 on 1 and 347 DF, p-value: 0.3127
The linear model and plot comparing the change in writing score (2014 vs 2010) are shown below. The slope for class size is not statistically significant (p = 0.165), so there is no evidence that schools with smaller classes improved their writing performance over this period.
##
## Call:
## lm(formula = chWriting ~ english.size, data = scaled.school.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.04316 -0.19708 0.01551 0.22388 1.05277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.145804 0.106370 -1.371 0.171
## english.size 0.005988 0.004307 1.390 0.165
##
## Residual standard error: 0.3313 on 347 degrees of freedom
## Multiple R-squared: 0.005539, Adjusted R-squared: 0.002673
## F-statistic: 1.933 on 1 and 347 DF, p-value: 0.1654
A detailed analysis of the change in score trends by class size provides no confirmation that smaller class sizes have resulted in a statistically significant improvement in student performance over a three-year period.
Github Source Code