Scientific Computing & Data Science

[Data Science / Baseball] Baseball Data Analysis with Lahman Data, Part 3.

cinema4dr12 2017. 3. 9. 22:03

QUESTIONS

Q1. Did the American League's adoption of the designated hitter (DH) rule create a difference in scoring between the two leagues (National League and American League)?

Q2. How has the complete-game rate of pitchers changed over the entire history of MLB?

Continuing from the previous post, this analysis again proceeds by posing questions and answering them with the data.

Q1. Did the American League's adoption of the designated hitter (DH) rule create a difference in scoring between the two leagues (National League and American League)?

To answer this question, we need to compare the scoring trends of the two leagues before and after 1973, the year the American League first adopted the designated hitter rule.

The data needed for this is "Teams", so we load it first.

It can be read with the read.csv() function:

R CODE:

teams <- read.csv("../data/baseballdatabank-master/core/Teams.csv")

Alternatively, following the analysis environment set up in the previous post, it can be loaded from MongoDB:

R CODE:

base::source('./ImportCollection.R', echo=FALSE)
teams <- ImportCollection("Teams")

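Initialize a Data Frame variable named dataset to hold the number of games and the number of runs of each league for every year since 1901, when the two-league system of the National League and the American League began: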

R CODE:

# initialize dataset: total games & runs for each league per year
dataset <- base::data.frame(matrix(ncol = 8, nrow = 1))
base::names(dataset) <- base::c("year", "NL_GAMES", "NL_RUNS", "NL_AVG_RUNS", "AL_GAMES", "AL_RUNS", "AL_AVG_RUNS", "DIFF")

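For every year in Teams that has rows for both lgID values, NL and AL, compute each league's total games and total runs, the average runs per game derived from them, and the difference in average runs between the leagues (AL average minus NL average), and record them in dataset. Every game is counted in the G column of both teams that played it, so the summed G is halved to obtain the number of games; the averages are therefore the combined runs of both teams per game: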

R CODE:

rowIndex <- 0
for (year in base::min(teams$yearID):base::max(teams$yearID)) {
  # teams of each league in this season
  nl <- base::subset(x = teams, subset = ((yearID == year) & (lgID == "NL")))
  al <- base::subset(x = teams, subset = ((yearID == year) & (lgID == "AL")))

  # keep only seasons in which both leagues have teams
  if ((base::nrow(nl) > 0) && (base::nrow(al) > 0)) {
    # each game is counted once per team, so halve the summed G
    nl_games <- base::sum(nl$G) / 2
    al_games <- base::sum(al$G) / 2

    nl_runs <- base::sum(nl$R)
    al_runs <- base::sum(al$R)

    nl_avg_runs <- nl_runs / nl_games
    al_avg_runs <- al_runs / al_games

    # positive when the AL out-scores the NL
    diff <- al_avg_runs - nl_avg_runs

    rowIndex <- rowIndex + 1
    dataset[rowIndex,] <- base::c(year, nl_games, nl_runs, nl_avg_runs, al_games, al_runs, al_avg_runs, diff)
  }
}
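The same per-league averages can also be computed without an explicit loop. The following is only a minimal sketch, not the code used in this post: teams_toy and all of its values are made up for illustration, and the column names AVG_RUNS_NL and AVG_RUNS_AL are assumptions produced by merge(). It relies on base R's aggregate() and merge() only.

```r
# Toy stand-in for the Teams table; the values are made up for illustration.
teams_toy <- data.frame(
  yearID = c(2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001),
  lgID   = c("NL", "NL", "AL", "AL", "NL", "NL", "AL", "AL"),
  G      = c(162, 162, 162, 162, 162, 162, 162, 162),
  R      = c(700, 750, 800, 820, 690, 710, 760, 780)
)

# Sum team-games and runs for every (year, league) pair.
totals <- aggregate(cbind(G, R) ~ yearID + lgID, data = teams_toy, FUN = sum)

# Each game appears once per team, so halve the summed G to count games.
totals$GAMES    <- totals$G / 2
totals$AVG_RUNS <- totals$R / totals$GAMES

# Put the two leagues side by side and take the AL - NL difference.
# merge() keeps only years present in both leagues, like the if() in the loop.
nl <- totals[totals$lgID == "NL", c("yearID", "AVG_RUNS")]
al <- totals[totals$lgID == "AL", c("yearID", "AVG_RUNS")]
wide <- merge(nl, al, by = "yearID", suffixes = c("_NL", "_AL"))
wide$DIFF <- wide$AVG_RUNS_AL - wide$AVG_RUNS_NL
print(wide)
```

With the real data, teams_toy would be replaced by teams; the loop version above stays closer to the step-by-step explanation, while this form is shorter.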

Visualize each league's average runs with the plotly package:

R CODE:

# load plotly: provides add_trace(), layout(), and the %>% pipe used below
library(plotly)

# plotting with Plotly
p <- plotly::plot_ly(data = dataset,
                     x = ~year,
                     y = ~NL_AVG_RUNS,
                     name = "NL_AVG_RUNS",
                     type = 'scatter',
                     mode = 'lines+markers',
                     line = list(color = 'rgb(205, 12, 24)', width = 3)) %>%
  add_trace(y = ~AL_AVG_RUNS,
            name = "AL_AVG_RUNS",
            line = list(color = 'rgb(22, 96, 167)',
                        width = 4)) %>%
  layout(title = "Average Runs per Year",
         xaxis = list(title = "Year"),
         yaxis = list (title = "Average Runs for each League"))

# print results
print(p)

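The plot shows that before 1973, the year the American League adopted the designated hitter rule, the AL and NL averages traded the lead back and forth, whereas from 1973 on the AL average stays above the NL average almost continuously.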

Next, visualize the difference in average runs between the two leagues, again with plotly:

R CODE:

# plotting Avg. Score Difference with Plotly
p <- plotly::plot_ly(data = dataset,
                     x = ~year,
                     y = ~DIFF,
                     name = "DIFF",
                     type = 'scatter',
                     mode = 'lines+markers',
                     line = list(color = 'rgb(205, 12, 24)', width = 3)) %>%
  layout(title = "Difference of Average Runs between Leagues per Year",
         xaxis = list(title = "Year"),
         yaxis = list (title = "Difference of Average Runs"))

# print results
print(p)

The plot above shows the AL average minus the NL average. Since the AL adopted the DH, the AL average has been higher than the NL average in every season except 1974. Even in 1974, the two leagues' averages differed by only about 0.1 runs.
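The sign of DIFF can also be tallied directly instead of being read off the plot. The sketch below is an illustration only: count_al_lead is a hypothetical helper that is not part of the original code, and the toy data frame contains invented values.

```r
# Hypothetical helper: count seasons in which the AL out-scored the NL,
# split at the year the DH rule was adopted.
count_al_lead <- function(df, dh_year = 1973) {
  era <- ifelse(df$year < dh_year, "before_DH", "after_DH")
  al_ahead <- df$DIFF > 0
  table(era = factor(era, levels = c("before_DH", "after_DH")),
        al_ahead = factor(al_ahead, levels = c(FALSE, TRUE)))
}

# Invented values, only to show the shape of the result.
toy <- data.frame(
  year = c(1970, 1971, 1972, 1973, 1974, 1975),
  DIFF = c(0.3, -0.2, -0.4, 0.3, -0.1, 0.2)
)
counts <- count_al_lead(toy)
print(counts)
```

On the real data this would be called as count_al_lead(dataset), since dataset already has the year and DIFF columns.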

Now compute each league's overall average runs per game for 1901-1972 and for 1973-2016:

> before_DH <- base::subset(dataset, year < 1973)
> after_DH <- base::subset(dataset, year >= 1973)
> sum(before_DH$NL_RUNS) / sum(before_DH$NL_GAMES)
[1] 8.522631
> sum(before_DH$AL_RUNS) / sum(before_DH$AL_GAMES)
[1] 8.732953
> sum(after_DH$NL_RUNS) / sum(after_DH$NL_GAMES)
[1] 8.691861
> sum(after_DH$AL_RUNS) / sum(after_DH$AL_GAMES)
[1] 9.243191

Over all games in 1901-1972, the average runs per game are 8.52 for the NL and 8.73 for the AL; over all games in 1973-2016, they are 8.69 for the NL and 9.24 for the AL. These figures also point to the effect on scoring of the pitcher not coming to bat.
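As a small worked check, the AL's lead in each era can be computed from the four pooled averages printed above. The variable names here are introduced only for this illustration.

```r
# Pooled runs per game copied from the console output above.
nl_before <- 8.522631
al_before <- 8.732953
nl_after  <- 8.691861
al_after  <- 9.243191

gap_before <- al_before - nl_before   # AL lead, 1901-1972
gap_after  <- al_after  - nl_after    # AL lead, 1973-2016
print(round(c(before = gap_before, after = gap_after), 2))
```

The AL's lead goes from roughly 0.21 runs per game before the DH to roughly 0.55 after it.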

The full code for Q1 is listed below:

R Code for Q1

base::source('./ImportCollection.R', echo=FALSE)
if (! ("plotly" %in% rownames(installed.packages()))) { install.packages("plotly") }
library(plotly)

#######################################################################
# Question 1
#######################################################################
# import "Teams" from database
teams <- ImportCollection("Teams")

# initialize dataset: total games & runs for each league per year
dataset <- base::data.frame(matrix(ncol = 8, nrow = 1))
base::names(dataset) <- base::c("year", "NL_GAMES", "NL_RUNS", "NL_AVG_RUNS", "AL_GAMES", "AL_RUNS", "AL_AVG_RUNS", "DIFF")

rowIndex <- 0
for (year in base::min(teams$yearID):base::max(teams$yearID)) {
  # teams of each league in this season
  nl <- base::subset(x = teams, subset = ((yearID == year) & (lgID == "NL")))
  al <- base::subset(x = teams, subset = ((yearID == year) & (lgID == "AL")))

  # keep only seasons in which both leagues have teams
  if ((base::nrow(nl) > 0) && (base::nrow(al) > 0)) {
    # each game is counted once per team, so halve the summed G
    nl_games <- base::sum(nl$G) / 2
    al_games <- base::sum(al$G) / 2

    nl_runs <- base::sum(nl$R)
    al_runs <- base::sum(al$R)

    nl_avg_runs <- nl_runs / nl_games
    al_avg_runs <- al_runs / al_games

    # positive when the AL out-scores the NL
    diff <- al_avg_runs - nl_avg_runs

    rowIndex <- rowIndex + 1
    dataset[rowIndex,] <- base::c(year, nl_games, nl_runs, nl_avg_runs, al_games, al_runs, al_avg_runs, diff)
  }
}

# plotting with Plotly
p <- plotly::plot_ly(data = dataset,
                     x = ~year,
                     y = ~NL_AVG_RUNS,
                     name = "NL_AVG_RUNS",
                     type = 'scatter',
                     mode = 'lines+markers',
                     line = list(color = 'rgb(205, 12, 24)', width = 3)) %>%
  add_trace(y = ~AL_AVG_RUNS,
            name = "AL_AVG_RUNS",
            line = list(color = 'rgb(22, 96, 167)',
                        width = 4)) %>%
  layout(title = "Average Runs per Year",
         xaxis = list(title = "Year"),
         yaxis = list (title = "Average Runs for each League"))

# print results
print(p)
print(dataset)

# plotting Avg. Score Difference with Plotly
p <- plotly::plot_ly(data = dataset,
                     x = ~year,
                     y = ~DIFF,
                     name = "DIFF",
                     type = 'scatter',
                     mode = 'lines+markers',
                     line = list(color = 'rgb(205, 12, 24)', width = 3)) %>%
  layout(title = "Difference of Average Runs between Leagues per Year",
         xaxis = list(title = "Year"),
         yaxis = list (title = "Difference of Average Runs"))

# print results
print(p)

# comparison between before & after DH adopted
before_DH <- base::subset(dataset, year < 1973)
after_DH <- base::subset(dataset, year >= 1973)

sum(before_DH$NL_RUNS) / sum(before_DH$NL_GAMES)
sum(before_DH$AL_RUNS) / sum(before_DH$AL_GAMES)

sum(after_DH$NL_RUNS) / sum(after_DH$NL_GAMES)
sum(after_DH$AL_RUNS) / sum(after_DH$AL_GAMES)

Q2. How has the complete-game rate of pitchers changed over the entire history of MLB?

This question can also be answered from Teams. Load the Teams data as in Q1.

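Define a Data Frame variable named dataset that stores the year the games were played (year), the total number of games in that year (GAMES), the number of complete games (CG), and the complete-game rate per game in percent (RATE_CG):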

R CODE:

# initialize dataset: games, complete games, and complete-game rate per year
dataset <- base::data.frame(matrix(ncol = 4, nrow = 1))
base::names(dataset) <- base::c("year", "GAMES", "CG", "RATE_CG")

Define a sequence covering the range of yearID from its minimum (1871) to its maximum (2016):

R CODE:

# year sequence
seq_range <- base::seq(from = base::min(teams$yearID), to = base::max(teams$yearID), by = 1)

For each year in seq_range, compute the total number of games and the total number of complete games, use them to compute the complete-game rate, and store all three in dataset:

R CODE:

rowIndex <- 0
for(i in seq_range) {
  sub_teams <- base::subset(teams, yearID == i)
  num_games <- base::sum(sub_teams$G) / 2
  num_cgs <- base::sum(sub_teams$CG) / 2
  rate_cgs <- 100.0 * num_cgs / num_games
  
  rowIndex <- rowIndex + 1
  dataset[rowIndex,] <- base::c(i, num_games, num_cgs, rate_cgs)
}

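Here the summed G is divided by 2 because every game is recorded once for each of the two teams that played it. The code halves the summed CG as well; note that a complete game is credited only to the team whose pitcher threw it, so the CG column holds half of the season's complete-game count. Because both sums are halved, the rate is unaffected: RATE_CG equals 100 times the summed CG divided by the summed G, that is, the percentage of team-games that ended as a complete game.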
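That halving both sums leaves the rate unchanged can be checked on a small example. This is a sketch on invented numbers, not output from the Lahman data; teams_toy, rate_halved, and rate_direct are names introduced here for illustration.

```r
# Toy Teams-like rows; the numbers are invented for illustration only.
teams_toy <- data.frame(
  yearID = c(1900, 1900, 2000, 2000),
  G      = c(140, 140, 162, 162),
  CG     = c(120, 116, 6, 4)
)

# Per-year sums of team-games and complete games.
by_year <- aggregate(cbind(G, CG) ~ yearID, data = teams_toy, FUN = sum)

# Rate as computed in the loop: both sums halved before dividing.
rate_halved <- 100 * (by_year$CG / 2) / (by_year$G / 2)
# Same rate without halving: complete games per team-game.
rate_direct <- 100 * by_year$CG / by_year$G

print(data.frame(year = by_year$yearID, rate_halved, rate_direct))
```

The two columns agree, so the plotted rate does not depend on whether the sums are halved.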

Now plot the complete-game rate with the plotly package:

R CODE:

# plotting with Plotly
p <- plotly::plot_ly(data = dataset,
                     x = ~year,
                     y = ~RATE_CG,
                     type = 'scatter',
                     mode = 'lines+markers',
                     line = list(color = 'rgb(205, 12, 24)',width = 3)) %>%
  layout(title = "Rate of Complete Games of Each Year",
         xaxis = list(title = "Year"),
         yaxis = list (title = "Rate of Complete Games(%)"))

# print results
print(p)

This produces a graph of the complete-game rate by year.

Up to 1900, the early period of American baseball history, the complete-game rate stayed at 80% or higher, and it has declined steadily on the way to the modern game.

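A plausible interpretation is that, as baseball modernized, pitching staffs became deeper, players came to be managed (protected) more carefully, and pitching roles (starter, reliever, closer, and so on) became clearly divided.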

The full code for Q2 is listed below:

R Code for Q2

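base::source('./ImportCollection.R', echo=FALSE)
if (! ("plotly" %in% rownames(installed.packages()))) { install.packages("plotly") }
library(plotly)

#######################################################################
# Question 2
#######################################################################
# import "Teams" from database
teams <- ImportCollection("Teams")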

# initialize dataset: games, complete games, and complete-game rate per year
dataset <- base::data.frame(matrix(ncol = 4, nrow = 1))
base::names(dataset) <- base::c("year", "GAMES", "CG", "RATE_CG")

# year sequence
seq_range <- base::seq(from = base::min(teams$yearID), to = base::max(teams$yearID), by = 1)

rowIndex <- 0
for(i in seq_range) {
  sub_teams <- base::subset(teams, yearID == i)
  num_games <- base::sum(sub_teams$G) / 2
  num_cgs <- base::sum(sub_teams$CG) / 2
  rate_cgs <- 100.0 * num_cgs / num_games
  
  rowIndex <- rowIndex + 1
  dataset[rowIndex,] <- base::c(i, num_games, num_cgs, rate_cgs)
}

# plotting with Plotly
p <- plotly::plot_ly(data = dataset,
                     x = ~year,
                     y = ~RATE_CG,
                     type = 'scatter',
                     mode = 'lines+markers',
                     line = list(color = 'rgb(205, 12, 24)',width = 3)) %>%
  layout(title = "Rate of Complete Games of Each Year",
         xaxis = list(title = "Year"),
         yaxis = list (title = "Rate of Complete Games(%)"))

# print results
print(p)
print(dataset)

This concludes the third post on baseball data analysis with Lahman data. Please look forward to the fourth post.

by Geol Choi | March 9, 2017
