[Data Science / Baseball] Fetching Baseball Data from a Web Page with the rvest Package

Scientific Computing & Data Science

cinema4dr12 2017. 5. 9. 12:55

by Geol Choi | May 9, 2017

In this post we look at how to use R's rvest package to pull data from baseball-reference.com, a well-known baseball statistics site. The post covers only how to fetch the data; it does not cover data analysis.

rvest is an R package for web scraping that offers tag selection, CSS selection, and more. Since this post is not meant as an rvest tutorial, see the rvest page on CRAN or its PDF reference manual if you want to explore its other scraping features.

With that, let's get started.

Fetching the web page

The target page is http://www.baseball-reference.com/leagues/MLB/2017.shtml.

This page holds four tables for the 2017 MLB season: Team Standard Batting, Team Standard Pitching, MLB Wins Above Avg By Position, and Team Fielding. In the HTML source, each of them is stored in a <table> tag.

Set the url of the page, then fetch its content with xml2::read_html():

url <- "http://www.baseball-reference.com/leagues/MLB/2017.shtml"
webpage <- xml2::read_html(url)

The fetched HTML content looks like this (only part of the page is shown):

> print(webpage)
{xml_document}
<html data-version="klecko-" data-root="/home/br/build" itemscope="" itemtype="http://schema.org/WebSite" lang="en" class="no-js">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta name="viewport" content="width=device-w ...
[2] <body class="br">\n\n<div id="wrap">\n  <div id="header" role="banner">\n  <ul id="subnav">\n<li><a href="http://www.sport ...

Fetching the <table> content

Let's fetch the Team Standard Batting content from this page. Following the DOM structure in the Elements tab of the browser's developer tools, you will find a <table> tag with id="teams_standard_batting".

This tag holds the Team Standard Batting content. To extract it, we use rvest::html_nodes(). Because the page contains several <table> tags (Team Standard Batting, Team Standard Pitching, and so on), we pass the xpath argument to pick out the one we want. First, here is the signature of rvest::html_nodes():

[Usage]
html_nodes(x, css, xpath)

[Arguments]
x: Either a document, a node set or a single node.
css, xpath: Nodes to select. Supply one of css or xpath depending on whether you want to use a css or xpath 1.0 selector.

So how do we obtain the xpath? In the browser's developer tools, right-click the tag and choose Copy > Copy XPath. The menu wording differs slightly between browsers, but the procedure is the same.

The copied xpath of the Team Standard Batting <table> tag should look like this:

//*[@id="teams_standard_batting"]

Pass this xpath unchanged to the xpath argument of rvest::html_nodes():

sb_table <- rvest::html_nodes(x=webpage, xpath='//*[@id="teams_standard_batting"]')
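As a minimal sketch of how selection by id works, the snippet below parses a tiny hand-written page (hypothetical markup, not the real site) and selects one of its two tables. It also shows the CSS form of the same selector, which html_nodes() accepts through its css argument.

```r
library(rvest)

# Hypothetical miniature page with two tables, each identified by its id
html <- '<html><body>
  <table id="teams_standard_batting"><tr><th>Tm</th><th>HR</th></tr><tr><td>ARI</td><td>39</td></tr></table>
  <table id="teams_standard_fielding"><tr><th>Tm</th><th>E</th></tr><tr><td>ARI</td><td>22</td></tr></table>
</body></html>'
page <- xml2::read_html(html)

# XPath selector in the form copied from the developer tools
by_xpath <- rvest::html_nodes(page, xpath = '//*[@id="teams_standard_batting"]')

# Equivalent CSS selector: "#" followed by the id
by_css <- rvest::html_nodes(page, css = "#teams_standard_batting")

length(by_xpath)               # number of matched nodes
xml2::xml_attr(by_css, "id")   # id attribute of the matched node
```

Either selector should match exactly one node here, since an id is expected to be unique within a page.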

The variable sb_table now holds the node set (the HTML tags) matching the selector:

> sb_table
{xml_nodeset (1)}
[1] <table class="sortable stats_table" id="teams_standard_batting" data-cols-to-freeze="1">\n<caption>Team Standard Batting T ...

Call rvest::html_table() to parse the HTML table stored in sb_table into a data frame:

sb <- rvest::html_table(sb_table)[[1]]
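The [[1]] is needed because html_table() applied to a node set returns a list with one data frame per matched table. A small sketch on hand-written markup (the table and its id "demo" are made up for illustration):

```r
library(rvest)

# Hypothetical three-column table; the first row of <th> cells becomes the header
html <- '<table id="demo">
  <tr><th>Tm</th><th>R</th><th>HR</th></tr>
  <tr><td>ARI</td><td>159</td><td>39</td></tr>
  <tr><td>ATL</td><td>130</td><td>31</td></tr>
</table>'
nodes <- rvest::html_nodes(xml2::read_html(html), xpath = '//*[@id="demo"]')

# html_table() on a node set returns a list, one element per table
tables <- rvest::html_table(nodes)
demo <- tables[[1]]

dim(demo)     # rows and columns of the parsed table
names(demo)   # column names taken from the header row
```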

Inspecting the contents of sb:

> head(sb)
   Tm #Bat BatAge  R/G  G   PA   AB   R   H 2B 3B HR RBI SB CS  BB  SO   BA  OBP  SLG  OPS OPS+  TB GDP HBP SH SF IBB LOB
1 ARI   30   28.2 4.82 33 1255 1131 159 290 54  9 39 150 37 10 106 304 .256 .324 .424 .748   91 479  17   9  6  3   6 227
2 ATL   31   29.9 4.48 29 1128 1011 130 263 46  5 31 125 21  5  87 223 .260 .326 .408 .733   92 412  35  14 11  5  12 198
3 BAL   34   29.5 4.40 30 1143 1042 132 260 49  0 36 122 10  3  82 256 .250 .308 .400 .708   95 417  17  10  1  8   1 206
4 BOS   34   27.9 4.45 31 1175 1057 138 290 56  4 27 132 14  7 103 187 .274 .342 .412 .754  104 435  35   9  0  6   7 220
5 CHC   28   27.3 4.97 31 1279 1120 154 275 56  8 36 144 15  6 131 283 .246 .333 .406 .739   97 455  23  18  6  4  15 252
6 CHW   31   27.6 4.03 30 1112 1011 121 241 41  6 28 118  9 10  75 242 .238 .299 .374 .673   91 378  23  15  4  7   2 190

As shown, the Team Standard Batting data has been stored. However, sb has 33 rows:

> nrow(sb)
[1] 33

MLB has 30 teams, so there should be 30 rows. Printing sb$Tm to list the teams recorded in sb shows why:

> sb$Tm
 [1] "ARI"   "ATL"   "BAL"   "BOS"   "CHC"   "CHW"   "CIN"   "CLE"   "COL"   "DET"   "HOU"   "KCR"   "LAA"   "LAD"   "MIA"  
[16] "MIL"   "MIN"   "NYM"   "NYY"   "OAK"   "PHI"   "PIT"   "SDP"   "SEA"   "SFG"   "STL"   "TBR"   "TEX"   "TOR"   "WSN"  
[31] "LgAvg" ""      "Tm"

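The last three entries ("LgAvg", an empty string, and "Tm") are not teams. They come from the summary rows and the repeated header row at the bottom of the table on the web page (outlined in red in the screenshot in the original post), which were parsed along with the team rows.

So we delete these last three rows: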

sb <- sb[1:(base::nrow(sb)-3),]

Checking the team list again now gives exactly 30 teams:

> sb$Tm
 [1] "ARI" "ATL" "BAL" "BOS" "CHC" "CHW" "CIN" "CLE" "COL" "DET" "HOU" "KCR" "LAA" "LAD" "MIA" "MIL" "MIN" "NYM" "NYY" "OAK"
[21] "PHI" "PIT" "SDP" "SEA" "SFG" "STL" "TBR" "TEX" "TOR" "WSN"
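Dropping a fixed number of trailing rows relies on the page layout staying the same. One possible alternative, sketched here on a made-up miniature data frame (sb_demo and non_team are hypothetical names), is to filter on the Tm labels that are known not to be teams:

```r
# Hypothetical stand-in for the scraped table: team rows followed by the
# league average, a blank row and a repeated header row
sb_demo <- data.frame(
  Tm = c("ARI", "ATL", "BAL", "LgAvg", "", "Tm"),
  HR = c("39", "31", "36", "35", "", "HR"),
  stringsAsFactors = FALSE
)

# Keep only rows whose Tm value is not one of the known non-team labels
non_team <- c("LgAvg", "", "Tm")
sb_clean <- sb_demo[!(sb_demo$Tm %in% non_team), , drop = FALSE]
rownames(sb_clean) <- NULL

sb_clean$Tm
```

This keeps working if the number of footer rows changes, as long as their labels stay the same.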

Next, let's fetch the Team Standard Pitching data the same way.

sp_table <- rvest::html_nodes(x=webpage, xpath='//*[@id="teams_standard_pitching"]')

Printing sp_table in the R console gives:

> sp_table
{xml_nodeset (0)}

Surprisingly, the node set is empty. Checking the Elements tab of the browser's developer tools shows why:

The <table> tag with id="teams_standard_pitching" is commented out. That is, the Team Standard Pitching data sits between <!-- and -->.
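The effect can be reproduced on a small hand-written page. In this sketch (hypothetical markup; the wrapper div id is made up) the table is wrapped in a comment, so the parser sees a comment node rather than a table element:

```r
library(rvest)

# Hypothetical page whose only table is wrapped in an HTML comment
html <- '<html><body>
<div id="wrapper">
<!--
<table id="teams_standard_pitching"><tr><th>Tm</th><th>ERA</th></tr><tr><td>ARI</td><td>3.87</td></tr></table>
-->
</div>
</body></html>'
page <- xml2::read_html(html)

# The table is inside a comment, so it is not an element of the parsed document
hidden <- rvest::html_nodes(page, xpath = '//*[@id="teams_standard_pitching"]')
length(hidden)

# The comment node itself is reachable, and its text holds the table markup
comments <- rvest::html_nodes(page, xpath = "//comment()")
comment_text <- xml2::xml_text(comments)
grepl('id="teams_standard_pitching"', comment_text, fixed = TRUE)
```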

To get around this, that is, to remove the <!-- and --> markers from the HTML content stored in webpage, I used the workaround described in the next section.

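Removing comment markers from the HTML content

I removed the comment markers from the HTML content with the following steps:

1. Save the fetched HTML content to a temporary HTML file (temp.html).
2. Read the saved HTML file back as a text object.
3. Delete the temporary file temp.html.
4. Remove the HTML comment markers (<!-- and -->) from the text object.
5. Save the text object to an HTML file (output.html).
6. Call xml2::read_html() on output.html to load it as an HTML object.
7. Delete output.html.
8. Parse the content and extract the data we want.

This may look a little involved, but it is quite simple. Let's go through the steps one at a time.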

1. Save the fetched HTML content to a temporary HTML file (temp.html)

Call xml2::write_html() to save the content as temp.html in the current working directory:

url <- "http://www.baseball-reference.com/leagues/MLB/2017.shtml"
webpage <- xml2::read_html(url)

## export webpage to temporary html file
xml2::write_html(webpage, "./temp.html")

2. Read the saved HTML file back as a text object

Read temp.html with base::readLines() and store the resulting character vector in webpage:

## read from html file
conn <- base::file(description="./temp.html", open = "r")
webpage <- base::readLines(con=conn)
close(conn)

3. Delete the temporary HTML file temp.html

temp.html is no longer needed, so remove it from the working directory:

## remove the temporary html file
if (base::file.exists("./temp.html")) base::file.remove("./temp.html")
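Writing temp.html into the working directory can overwrite an existing file of the same name. A possible variation, shown here as a sketch with made-up content (html_lines is a hypothetical stand-in for the page), uses base R's tempfile() and unlink() for the same write, read, and delete cycle:

```r
# Hypothetical stand-in for the lines of the saved page
html_lines <- c("<html><body>", "<p>demo</p>", "</body></html>")

# tempfile() returns a unique path inside the session's temporary directory
tmp <- tempfile(fileext = ".html")
writeLines(html_lines, tmp)

# Read the file back as a character vector, one element per line
read_back <- readLines(tmp)

# unlink() deletes the file and does not fail if it is already gone
unlink(tmp)

identical(read_back, html_lines)
file.exists(tmp)
```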

4. Remove the HTML comment markers (<!-- and -->) from the text object

Remove the comment markers from webpage with base::gsub():

## remove html comment markers (fixed=TRUE matches them literally)
webpage <- base::gsub(pattern="<!--", replacement="", x=webpage, fixed=TRUE)
webpage <- base::gsub(pattern="-->", replacement="", x=webpage, fixed=TRUE)
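gsub() is vectorized, so it processes every line returned by readLines() in one call. By default its pattern is a regular expression; passing fixed = TRUE makes it a literal string, which is all that is needed for these markers. A small sketch on made-up lines (the variable name lines_demo is hypothetical):

```r
# Hypothetical lines of a page whose table is wrapped in an HTML comment
lines_demo <- c(
  "<div>",
  "<!--",
  '<table id="t"><tr><td>ARI</td></tr></table>',
  "-->",
  "</div>"
)

# fixed = TRUE treats the pattern as a literal string rather than a regex
cleaned <- gsub("<!--", "", lines_demo, fixed = TRUE)
cleaned <- gsub("-->", "", cleaned, fixed = TRUE)

# Lines that held only a marker become empty strings; other lines are unchanged
cleaned
```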

5. Save the text object to an HTML file (output.html)

The comment markers are now gone from webpage, so write it back to the working directory as output.html:

## write string to html file
base::write(x=webpage, file="./output.html")

6. Call xml2::read_html() on output.html to load it as an HTML object

This time the HTML object is loaded from output.html in the working directory rather than from the website:

## read from the local file instead of the website
url <- "./output.html"
webpage <- xml2::read_html(x=url)

7. Delete output.html

output.html has done its job, so remove it from the working directory:

## remove output.html
if (base::file.exists("./output.html")) base::file.remove("./output.html")
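Steps 1 through 7 can also be done without touching the disk. The following is a sketch of that variation, not the method used in this post: it assumes as.character() on the parsed document returns the serialized HTML as a string, and it runs on a small hand-written page instead of the real site.

```r
library(rvest)

# Hypothetical page whose table is wrapped in an HTML comment
html <- '<html><body><!--
<table id="teams_standard_pitching"><tr><th>Tm</th><th>ERA</th></tr><tr><td>ARI</td><td>3.87</td></tr></table>
--></body></html>'
page <- xml2::read_html(html)

# Serialize the parsed document back to one string (stands in for steps 1-3)
txt <- as.character(page)

# Strip the comment markers (step 4)
txt <- gsub("<!--", "", txt, fixed = TRUE)
txt <- gsub("-->", "", txt, fixed = TRUE)

# Re-parse the cleaned string (stands in for steps 5-7)
page2 <- xml2::read_html(txt)

# The table is now a regular element and can be selected by id
sp_nodes <- rvest::html_nodes(page2, xpath = '//*[@id="teams_standard_pitching"]')
sp_demo <- rvest::html_table(sp_nodes)[[1]]
sp_demo$Tm
```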

8. Parse the content and extract the data we want

Starting again with Team Standard Batting, load the data stored in each <table> tag in turn:

## import Team Standard Batting data
sb_table <- rvest::html_nodes(x=webpage, xpath='//*[@id="teams_standard_batting"]')
sb <- rvest::html_table(sb_table)[[1]]
sb <- sb[1:(base::nrow(sb)-3),]

Load the Team Standard Pitching data in the same way:

## import Team Standard Pitching data
sp_table <- rvest::html_nodes(x=webpage, xpath='//*[@id="teams_standard_pitching"]')
sp <- rvest::html_table(sp_table)[[1]]
sp <- sp[1:(base::nrow(sp)-3),]

Note that the last three rows of the pitching data are likewise not team rows and have been removed (sp <- sp[1:(base::nrow(sp)-3),]). The row count must come from sp itself: by this point sb has already been trimmed to 30 rows, so using nrow(sb) here would keep only 27 teams.

> head(sp)
   Tm #P PAge RA/G  W  L W-L%  ERA  G GS GF CG tSho cSho SV    IP   H   R  ER HR  BB IBB  SO HBP BK WP   BF ERA+  FIP  WHIP  H9 HR9 BB9 SO9 SO/W LOB
1 ARI 16 29.1 4.18 18 15 .545 3.87 33 33 32  1    1    0  8 291.0 277 138 125 34 105   3 303   6  2 15 1247  118 3.58 1.313 8.6 1.1 3.2 9.4 2.89 236
2 ATL 16 31.6 5.45 11 18 .379 4.82 29 29 29  0    0    0  5 261.2 265 158 140 39 103   6 202  11  0 12 1133   90 4.70 1.406 9.1 1.3 3.5 6.9 1.96 190
3 BAL 19 28.3 4.03 20 10 .667 3.88 30 30 30  0    3    0 14 271.1 264 121 117 32 117   2 215  12  3  8 1173  108 4.37 1.404 8.8 1.1 3.9 7.1 1.84 238
4 BOS 17 28.1 3.90 17 14 .548 3.51 31 31 31  0    1    0 10 276.2 250 121 108 37  91   4 297   8  0  8 1162  124 3.66 1.233 8.1 1.2 3.0 9.7 3.26 211
5 CHC 16 31.4 4.65 16 15 .516 3.93 31 31 31  0    2    0  7 291.0 264 144 127 40 115   5 290   8  0 18 1229  104 4.06 1.302 8.2 1.2 3.6 9.0 2.52 212
6 CHW 16 29.8 3.73 15 15 .500 3.35 30 30 30  0    1    0  5 263.2 229 112  98 31  98   2 226  10  1 13 1104  117 4.04 1.240 7.8 1.1 3.3 7.7 2.31 201

Let's load the Wins Above Avg By Position data as well:

## import MLB Wins Above Avg By Position data
waa_table <- rvest::html_nodes(x=webpage, xpath = '//*[@id="team_output"]')
waa <- rvest::html_table(waa_table)[[1]]
waa <- waa[1:(base::nrow(waa)-1),]

The xpath of the Wins Above Avg By Position <table> tag can be obtained from the browser's developer tools in the same way as before. Note that only the last row is removed in this case; working out why is left to the reader.

> head(waa)
  Rk  Total  All P     SP     RP  Non-P      C     1B     2B     3B     SS     LF     CF     RF OF (All)     DH     PH
1  1 NYY6.2 ARI3.8 ARI3.3 CLE2.0 CIN4.4 HOU1.1 WSN1.9 PHI1.2 CIN2.0 CIN1.1 DET1.1 LAA1.6 NYY1.9   NYY3.7 SEA0.8 NYY0.3
2  2 WSN5.6 BOS3.1 WSN3.1 BOS1.8 NYY4.1 MIL0.6 ATL1.5 NYY1.0 LAD1.1 HOU0.7 NYY1.1 KCR1.0 SEA1.5   SEA1.5 TBR0.6 STL0.1
3  3 HOU5.0 COL2.4 STL3.0 CHW1.7 WSN3.4 TEX0.5 ARI1.3 HOU0.6 CHC1.1 SEA0.7 NYM0.7 BAL0.7 WSN1.3   DET1.2 NYY0.3 TBR0.1
4  4 LAD3.5 STL2.3 TEX2.3 NYY1.2 HOU3.4 LAD0.4 CIN0.9 OAK0.5 MIN1.0 CHC0.7 CIN0.6 NYY0.7 TBR0.8   HOU1.2 MIN0.2 CHC0.1
5  5 ARI3.0 PIT2.3 KCR2.0 COL1.0 TBR3.0 MIA0.3 OAK0.8 DET0.5 BAL1.0 CLE0.7 MIA0.6 HOU0.6 CHW0.8   WSN1.1 MIA0.1 CIN0.1
6  6 CIN2.9 WSN2.2 COL1.7 BAL0.8 DET2.2 COL0.3 MIL0.6 SEA0.4 STL0.9 TBR0.6 SEA0.5 PHI0.5 HOU0.4   TBR1.0 SFG0.1 HOU0.1
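Each cell in this table fuses a team code with a value (for example "NYY6.2"). If the two parts are needed separately, one way to split them is with base R regular expressions. This is a sketch that assumes every cell is a run of capital letters followed by a number; the three values are copied from the Total column above.

```r
# Values copied from the Total column of the output above
total <- c("NYY6.2", "WSN5.6", "HOU5.0")

# The leading capital letters form the team code
team <- sub("^([A-Z]+).*$", "\\1", total)

# Whatever follows the team code is the numeric value
value <- as.numeric(sub("^[A-Z]+", "", total))

data.frame(team = team, value = value, stringsAsFactors = FALSE)
```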

The last table, Team Fielding, is loaded the same way:

## import Team Fielding data
tf_table <- rvest::html_nodes(x=webpage, xpath = '//*[@id="teams_standard_fielding"]')
tf <- rvest::html_table(tf_table)[[1]]
tf <- tf[1:(base::nrow(tf)-3),]

> head(tf)
   Tm #Fld RA/G DefEff  G  GS  CG    Inn   Ch  PO   A  E DP Fld% Rtot Rtot/yr Rdrs Rdrs/yr
1 ARI   30 4.18   .677 33 297 231 2619.0 1203 873 308 22 23 .982  -19      -9  -18      -8
2 ATL   30 5.45   .694 29 261 211 2355.0 1099 785 295 19 30 .983    3       2    1       1
3 BAL   34 4.03   .696 30 270 200 2442.0 1127 814 297 16 33 .986    4       2   -5      -0
4 BOS   33 3.90   .685 31 279 231 2490.0 1113 830 257 26 23 .977    6       3    7       1
5 CHC   27 4.65   .690 31 279 182 2619.0 1240 873 343 24 36 .981    1       0   12       1
6 CHW   31 3.73   .713 30 270 223 2373.0 1082 791 269 22 28 .980   -3      -2   -4      -2
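All four tables follow the same select, parse, and trim pattern, so it could be wrapped in a small function. The helper below is a sketch: fetch_table and its drop_last argument are hypothetical names, not part of rvest, and the demo runs on hand-written markup. Taking the row count from the table being trimmed avoids accidentally reusing another table's row count.

```r
library(rvest)

# Hypothetical helper: select a table by id, parse it, and drop the last
# `drop_last` rows, counted from that table's own number of rows
fetch_table <- function(page, id, drop_last = 0) {
  xp <- sprintf('//*[@id="%s"]', id)
  nodes <- rvest::html_nodes(page, xpath = xp)
  if (length(nodes) == 0) stop("no element with id '", id, "' found")
  df <- rvest::html_table(nodes)[[1]]
  keep <- seq_len(max(nrow(df) - drop_last, 0))
  df[keep, , drop = FALSE]
}

# Miniature stand-in page for demonstration
page <- xml2::read_html('<table id="demo">
  <tr><th>Tm</th><th>W</th></tr>
  <tr><td>ARI</td><td>18</td></tr>
  <tr><td>ATL</td><td>11</td></tr>
  <tr><td>LgAvg</td><td>15</td></tr>
</table>')

demo <- fetch_table(page, "demo", drop_last = 1)
demo$Tm
```

With the real page, a call such as fetch_table(webpage, "teams_standard_pitching", drop_last = 3) would play the role of the three-line blocks above. The explicit stop() also turns the silent empty node set seen earlier into an immediate error.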

Inserting the data into MongoDB (optional)

Let's insert the loaded data into MongoDB. First load the required library and source file:

## load library & source for connecting mongodb
if (! ("mongolite" %in% rownames(installed.packages()))) { install.packages("mongolite") }
base::require("mongolite")
source('./Connect.R')

This assumes a MongoDB server has already been started with the mongod command from a command-line tool. Connect.R looks like this (download: Connect.R):

##############################################################################
# Connect.R
##############################################################################
# @Author: Geol Choi, ph.D / cinema4dr12@gmail.com
# @Date: Mar.4,2017
# @Description: To insert Lahman data into MongoDB
##############################################################################
 
## install packages & load them
if (! ("mongolite" %in% rownames(installed.packages()))) { install.packages("mongolite") }
base::library(mongolite)
 
################################################################################################
## MongoDB connection
Connect <- function(colName) {
  con <- mongolite::mongo(collection = colName,
                          db = "baseball",
                          url = "mongodb://localhost",
                          verbose = TRUE,
                          options = ssl_options())
  
  return(con)
}
 
################################################################################################
## MongoDB insert
InsertDB <- function(colName, basePath) {
  con <- Connect(colName);
  
  ## drop the collection if it already contains documents
  if(con$count() > 0) con$drop();
  
  ## load data
  fileName <- base::sprintf("%s/%s.csv", basePath, colName);
  df <- utils::read.csv(fileName);
  
  ## insert document as data frame
  con$insert(df);
  
  ## disconnect
  base::rm(con);
}
 
################################################################################################
## MongoDB insert
InsertDataFileToDB <- function(colName, fileName) {
  con <- Connect(colName);
  
  ## drop the collection if it already contains documents
  if(con$count() > 0) con$drop();
  
  ## load data
  df <- utils::read.csv(fileName);
  
  ## insert document as data frame
  con$insert(df);
  
  ## disconnect
  base::rm(con);
}
 
################################################################################################
## MongoDB insert
InsertDataFrameToDB <- function(colName, df) {
  con <- Connect(colName);
  
  ## drop the collection if it already contains documents
  if(con$count() > 0) con$drop();
  
  ## insert document as data frame
  con$insert(df);
  
  ## disconnect
  base::rm(con);
}

Now store the MLB 2017 season Team Standard Batting, Team Standard Pitching, Wins Above Avg By Position, and Team Fielding data loaded from baseball-reference.com into MongoDB, one after another:

## insert data into mongodb
InsertDataFrameToDB(colName="team_standard_batting", df=sb)
InsertDataFrameToDB(colName="team_standard_pitching", df=sp)
InsertDataFrameToDB(colName="team_standard_wins_above_avg_by_pos", df=waa)
InsertDataFrameToDB(colName="team_fielding", df=tf)

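Running the inserts prints the following to the console. The second line reports 27 rows rather than 30 because the run that produced this log trimmed the pitching table with nrow(sb), which was already 30 at that point, instead of nrow(sp):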

Complete! Processed total of 30 rows.
Complete! Processed total of 27 rows.
Complete! Processed total of 30 rows.
Complete! Processed total of 30 rows.

You now have MLB data to analyze. Relating these team statistics to each team's current record would make for an interesting analysis.
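Before analysis it is worth checking column types. Because the scraped tables contained repeated header rows, some columns may have been parsed as text (the batting averages print as .256 rather than 0.256 above, which suggests this). A sketch of one way to convert them, using base R's type.convert() on a made-up slice of the data (sb_demo is a hypothetical stand-in):

```r
# Hypothetical slice of a scraped table whose columns arrived as text
sb_demo <- data.frame(
  Tm = c("ARI", "ATL"),
  HR = c("39", "31"),
  BA = c(".256", ".260"),
  stringsAsFactors = FALSE
)

# type.convert() guesses a type per column; as.is = TRUE keeps text as character
sb_typed <- sb_demo
sb_typed[] <- lapply(sb_demo, utils::type.convert, as.is = TRUE)

sapply(sb_typed, class)
```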

Full code

To wrap up, here is the complete R code for everything described above
(download: WebScraping_BaseballReference.R).

#######################################################################
# @Author: Geol Choi, ph.D / cinema4dr12@gmail.com
# @Date: Mar.9,2017
# @Description: Fetching MLB Data from Baseball-Reference.com
#######################################################################
base::rm(list = ls())
base::gc()
 
##############################################################################################
## load libraries & source
if (! ("rvest" %in% rownames(installed.packages()))) { install.packages("rvest") }
if (! ("stringr" %in% rownames(installed.packages()))) { install.packages("stringr") }
if (! ("tidyr" %in% rownames(installed.packages()))) { install.packages("tidyr") }
 
base::require("rvest")
base::require("stringr")
base::require("tidyr")
 
##############################################################################################
url <- "http://www.baseball-reference.com/leagues/MLB/2017.shtml"
webpage <- xml2::read_html(url)
 
## export webpage to temporary html file
xml2::write_html(webpage, "./temp.html")
 
## read from html file
conn <- base::file(description="./temp.html", open = "r")
webpage <- base::readLines(con=conn)
close(conn)
 
## remove the temporary html file
if (base::file.exists("./temp.html")) base::file.remove("./temp.html")
 
## remove html comment markers (fixed=TRUE matches them literally)
webpage <- base::gsub(pattern="<!--", replacement="", x=webpage, fixed=TRUE)
webpage <- base::gsub(pattern="-->", replacement="", x=webpage, fixed=TRUE)
 
## write string to html file
base::write(x=webpage, file="./output.html")
 
## read from the local file instead of the website
url <- "./output.html"
webpage <- xml2::read_html(x=url)
 
## remove output.html
if (base::file.exists("./output.html")) base::file.remove("./output.html")
 
## import Team Standard Batting data
sb_table <- rvest::html_nodes(x=webpage, xpath='//*[@id="teams_standard_batting"]')
sb <- rvest::html_table(sb_table)[[1]]
sb <- sb[1:(base::nrow(sb)-3),]
 
## import Team Standard Pitching data
sp_table <- rvest::html_nodes(x=webpage, xpath='//*[@id="teams_standard_pitching"]')
sp <- rvest::html_table(sp_table)[[1]]
sp <- sp[1:(base::nrow(sp)-3),]
 
## import MLB Wins Above Avg By Position data
waa_table <- rvest::html_nodes(x=webpage, xpath = '//*[@id="team_output"]')
waa <- rvest::html_table(waa_table)[[1]]
waa <- waa[1:(base::nrow(waa)-1),]
 
## import Team Fielding data
tf_table <- rvest::html_nodes(x=webpage, xpath = '//*[@id="teams_standard_fielding"]')
tf <- rvest::html_table(tf_table)[[1]]
tf <- tf[1:(base::nrow(tf)-3),]
 
## remove unnecessary variables
rm(list = c("webpage", "sb_table", "tf_table", "waa_table", "sp_table", "conn", "url"))
 
##############################################################################################
## load library & source for connecting mongodb
if (! ("mongolite" %in% rownames(installed.packages()))) { install.packages("mongolite") }
base::require("mongolite")
source('./Connect.R')
 
## insert data into mongodb
InsertDataFrameToDB(colName="team_standard_batting", df=sb)
InsertDataFrameToDB(colName="team_standard_pitching", df=sp)
InsertDataFrameToDB(colName="team_standard_wins_above_avg_by_pos", df=waa)
InsertDataFrameToDB(colName="team_fielding", df=tf)