Statistics
Sample Size
- Descriptive Statistics – discussed in Module 1, characterize data, dispersion, central tendency (baseball analytics)
- Inferential Statistics – use a sample of the population and draw inferences about the whole population (projections)
- Law of Large Numbers – as sample size goes up, the sample estimate gets closer to the true population value
- Calculating sample size is more of a statistical topic than this course covers. The factors are the z score, the standard deviation of the population, and the margin of error
- If z score goes up, sample size goes up
- If SD goes up, sample size goes up
- If margin of error goes up, sample size goes down
- Russell Carleton (pizzacutter) study in 2007 to determine how long baseball stats take to normalize (stay consistent with each other)
- Plotted Batting Average in odd at bats vs. Batting Average in even at bats
- Wanted R^2 of at least 0.50
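The three sample-size relationships listed above (z score, SD, margin of error) can be checked with the standard textbook formula for estimating a mean, n = (z * SD / E)^2. This is a minimal sketch, assuming that formula is the one the course has in mind; the function name and the example numbers are made up for illustration.

```python
import math


def required_sample_size(z_score, population_sd, margin_of_error):
    """Sample size needed to estimate a mean: n = (z * SD / E)^2, rounded up."""
    if margin_of_error <= 0:
        raise ValueError("margin_of_error must be positive")
    return math.ceil((z_score * population_sd / margin_of_error) ** 2)


# Illustrative inputs only (not taken from any real study).
baseline = required_sample_size(z_score=1.96, population_sd=10, margin_of_error=2)
higher_z = required_sample_size(z_score=2.576, population_sd=10, margin_of_error=2)
higher_sd = required_sample_size(z_score=1.96, population_sd=15, margin_of_error=2)
wider_margin = required_sample_size(z_score=1.96, population_sd=10, margin_of_error=4)

print(baseline, higher_z, higher_sd, wider_margin)
```

Raising the z score or the SD increases the result, while widening the margin of error decreases it, which matches the three bullets above.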
Sabermetrics
WAR
- Wins Above Replacement
- Statistical framework for total offense and defense contributions that reflects performance and playing time
- Variations
- WARP (Baseball Prospectus)
- fWAR (FanGraphs)
- bWAR / rWAR (Baseball Reference)
- oWAR (Open source WAR)
- Components:
- Replacement level
- Runs to Wins
- Runs Estimators
- Morris Greenberg has studied the three main variations of WAR
- Cannot predict any WAR measure just by knowing another alternative WAR calculation
- The three variations do not have a relationship to each other
- The three methods are probably assigning different values of wins for hitter production
- Increased playing time leads to a greater variance between the three measures (amplifies the differences)
- Hitters who play 2B, SS, CF show more variation across the three measures than other positions, suggesting the systems differ most in their fielding calculations at these positions
- There is also more variation for pitchers at the extremes (the top and the bottom pitchers), probably due to how the systems attribute outcomes to pitchers versus the defense.
- The basic framework for calculating WAR uses the Pythagorean calculation, which can be written as: W/L = (R/RA)^2
- Instead of squaring the ratio, you can try to determine the best exponent to use (instead of 2)
- For 1901 to the present, the best-fit exponent is 1.863
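The Pythagorean relationship above can be sketched in code. Rearranging W/L = (R/RA)^x gives a winning percentage of R^x / (R^x + RA^x). The exponent fit shown here is one possible approach (least squares through the origin on the log of both sides); the notes do not say how the 1.863 figure was derived, so treat the method and the function names as assumptions.

```python
import math


def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage implied by W/L = (R/RA)^exponent."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def fit_exponent(team_seasons):
    """Fit x in log(W/L) = x * log(R/RA) by least squares through the origin.

    team_seasons is an iterable of (wins, losses, runs_scored, runs_allowed).
    """
    numerator = 0.0
    denominator = 0.0
    for wins, losses, runs_scored, runs_allowed in team_seasons:
        log_run_ratio = math.log(runs_scored / runs_allowed)
        log_win_ratio = math.log(wins / losses)
        numerator += log_run_ratio * log_win_ratio
        denominator += log_run_ratio ** 2
    return numerator / denominator


# Synthetic check: build fake seasons that follow exponent 1.863 exactly,
# then confirm the fit recovers it. These are not real team records.
synthetic = [((r / ra) ** 1.863, 1.0, r, ra) for r, ra in [(800, 700), (650, 720), (900, 600)]]
recovered = fit_exponent(synthetic)
print(round(recovered, 3))
```

With real data, team_seasons would come from season totals of wins, losses, runs scored, and runs allowed.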
Technology
- Park Factor = 100 * ((homeRS+homeRA)/homeG) / ((roadRS+roadRA)/roadG)
- Anything over 100 is above average, below 100 is below average
- Can use this same formula for any batting statistic
- A SQL query on the Retrosheet game logs can calculate this for teams (for example, an HR park factor)
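The park factor formula above translates directly into a small function. This is a minimal sketch; the function name, argument names, and the example totals are made up for illustration, and the same function works for any counting statistic (such as HR) in place of runs.

```python
def park_factor(home_rs, home_ra, home_g, road_rs, road_ra, road_g):
    """Park Factor = 100 * ((homeRS + homeRA) / homeG) / ((roadRS + roadRA) / roadG)."""
    home_rate = (home_rs + home_ra) / home_g  # combined runs per home game
    road_rate = (road_rs + road_ra) / road_g  # combined runs per road game
    return 100 * (home_rate / road_rate)


# Illustrative totals only (not a real team): more scoring at home than on the road.
example = park_factor(home_rs=450, home_ra=430, home_g=81,
                      road_rs=380, road_ra=370, road_g=81)
print(round(example, 1))
```

A result above 100 means more combined scoring in home games than in road games; a result below 100 means less.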
R Studio
- lm() – fitting linear models
- function(arglist) expr
return(value)
- Build functions to automate calculations you expect to repeat often
SQL
- CASE
WHEN (condition) THEN (result)
ELSE (default result)
END
History
George Lindsey
- Canadian, University of Toronto, WWII, PhD from Cambridge, Nuclear Scientist
- First work, in 1959, looked at batting average and whether it could be used to predict future performance.
- Concludes there were too many variables at play (differing pitchers, parks, situations, leagues, etc.).
- Conducted a study of same-sided batter-pitcher matchups and found a statistically significant advantage for opposite-handed hitters
- First published statistical study of L/R batting splits
- Also first published look at run scoring expectancy in bases loaded situations
- Was trying to figure out if you should try for a force out at home in bases loaded situations or if you should try for a double play
- 2nd paper, Progress of the Score During a Baseball Game
- Is the probability of scoring in any half inning the same? Is run scoring by inning homogeneous?
- 1st (high), 2nd (low), 3rd (high) are different from mean
- Likely due to structure of batting order
- Does the history of scoring affect subsequent innings? Is scoring independent?
- Large run scoring seems to not be independent
- Lead changes follow a model of independence, meaning there is no tendency to overcome leads
- Studied home field advantage (54%)
- 3rd paper, An Investigation of Strategies in Baseball
- Tries to address steal a base or not to, sacrifice bunt with less than 2 outs, IBB when 1st base is open, etc.
- The most important runs (in terms of affecting win probability) are those that tie the score or put you 1 run ahead
- Analyzed all base states and the probability of scoring 0, 1, 2 runs. Also looked at the expected runs for each situation.
- He uses these expectations to determine optimal strategy (if you bunt in this situation, what does it do to your expected runs?).
- If you are in a specific situation and then you hit a triple, what happens? What happens to your expected runs in the new situation? How much does it increase from before?
- The beginning of linear weights
- A value added approach of measuring a player
- Concludes IBB is not useful except when leading by one run with runners on 2nd and 3rd (loading the bases actually increases the likelihood of scoring 0 runs, but the overall expected runs go up too)
- Concludes teams should always play at double play depth with a runner on 1st, except when a runner on 3rd would tie the game or take the lead
- Concludes the sacrifice play does not improve scoring chances.
- Concludes stolen base value depends on an individual’s success rate
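The value-added idea in the strategy paper (compare expected runs before and after an event) can be sketched as follows. The run expectancy numbers here are made-up placeholders, not Lindsey's published figures, and the state encoding and function name are assumptions for illustration only.

```python
def run_value(run_expectancy, before, after, runs_scored=0):
    """Value added by an event: change in expected runs plus runs that scored."""
    return run_expectancy[after] - run_expectancy[before] + runs_scored


# Placeholder table keyed by (bases occupied, outs). Values are NOT real data.
placeholder_re = {
    ("---", 0): 0.5,  # bases empty, 0 outs
    ("--3", 0): 1.4,  # runner on 3rd, 0 outs
    ("1--", 0): 0.9,  # runner on 1st, 0 outs
    ("-2-", 1): 0.7,  # runner on 2nd, 1 out
}

# A triple with the bases empty and nobody out.
triple_value = run_value(placeholder_re, ("---", 0), ("--3", 0))

# A successful sacrifice bunt: runner moves from 1st to 2nd at the cost of an out.
bunt_value = run_value(placeholder_re, ("1--", 0), ("-2-", 1))

print(round(triple_value, 2), round(bunt_value, 2))
```

Averaging these changes over every occurrence of an event type is the step that leads to linear weights; with a real run expectancy table, a negative value for the bunt would match the conclusion that the sacrifice does not improve scoring chances.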