
Sunday, January 10, 2016

Web Scraping using R

Web Scraping (crawling without a web crawler) using R:

I am going to demonstrate scraping the Cricbuzz website (fetching the live scores and venues of live matches) using the rvest package in R. I will also describe the problems I faced while doing this and how I found answers to them.
"rvest" is a very useful package for harvesting (scraping) web content using R.

If you don't have the rvest library, you can install it with the following command in RStudio.


install.packages("rvest")

Load the library using:


library(rvest)

## reading the Cricbuzz live scores page, available at: (Cricbuzz Live Scores)
crickbuzz <- read_html(httr::GET("http://www.cricbuzz.com/cricket-match/live-scores"))

## you can find the CSS selector for a particular HTML node on the page with SelectorGadget.

##finding matches
matches <- crickbuzz %>%
html_nodes(".text-hvr-underline.text-black") %>%
html_text()
matches
## %>% is the pipe operator in R: it passes the result on its left as the first argument of the function on its right. Find more about it at: (Pipeline in R - Learn More)
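
To make the pipe a bit more concrete, here is a minimal sketch comparing a nested call with the same call written using %>%. The sample strings are made up for illustration and are not taken from Cricbuzz; the magrittr package is loaded directly here because rvest re-exports its %>% operator.

```r
library(magrittr)  # provides %>%; rvest re-exports the same operator

# Made-up sample strings, only for illustration
raw_scores <- c(" ind 250/7 ", " aus 180/3 ")

# Nested form: has to be read from the inside out
nested <- toupper(trimws(raw_scores))

# Piped form: reads left to right, one step at a time
piped <- raw_scores %>%
  trimws() %>%
  toupper()

# Both forms give the same result
identical(nested, piped)
```

The piped form is the one used throughout this post: the page is passed to html_nodes(), and its result is passed to html_text().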

##finding scores :
matches_scores <- crickbuzz %>%
html_nodes(".bg-black1 , .cb-font-12.text-black , .cb-text-preview") %>%
html_text()
matches_scores

## removing the first two entries, which are not useful:
matches_scores <- matches_scores[-(1:2)]

##current status of matches
matches_Curr <- crickbuzz %>%
html_nodes(".cb-text-live , .cb-text-preview , .cb-text-complete") %>%
html_text()
matches_Curr

##venue of the match 
matches_venue <- crickbuzz %>%
html_nodes(".text-gray:nth-child(3)") %>%
html_text()
matches_venue
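
Since the live page changes all the time, it can help to try the selectors on a small offline snippet first. The sketch below assumes a hypothetical piece of markup that only loosely imitates the class names used above; the team names and venue are invented, and the real Cricbuzz page may be structured differently.

```r
library(rvest)

# Hypothetical markup (not the real Cricbuzz HTML), used to try selectors offline
page <- read_html('
<div class="match">
  <a class="text-hvr-underline text-black">Team A vs Team B</a>
  <div class="text-gray">1st Test</div>
  <div class="text-gray">Sample Ground, Sample City</div>
</div>')

# ".a.b" matches elements that carry both classes
titles <- page %>%
  html_nodes(".text-hvr-underline.text-black") %>%
  html_text()

# ":nth-child(3)" keeps only elements that are the third child of their parent
venues <- page %>%
  html_nodes(".text-gray:nth-child(3)") %>%
  html_text()

titles
venues
```

Here the first .text-gray element is the second child of its parent, so only the venue line matches .text-gray:nth-child(3).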

## Fetching the date and time from the Cricbuzz live scores page was a very interesting task when I did it for the first time. I was not able to do it at first, because I was simply fetching it like the earlier fields. I then posted a question on Stack Overflow; you can have a look at it for the problem I faced: (Stackoverflow Question)

The solution I got is: first fetch the timestamp from the HTML attribute "timestamp" using the function html_attr(), then convert it to a date.

##fetching match dates
matches_timestamps <- crickbuzz %>%
    html_nodes(".schedule-date:nth-child(1)") %>%
    html_attr("timestamp")
## the timestamps are in milliseconds, so divide by 1000 before converting
matches_dates <- lapply(X = matches_timestamps, function(timestamp_match) {
    as.POSIXct(as.numeric(timestamp_match) / 1000, origin = "1970-01-01")
})
matches_dates
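
The timestamp step can also be tried offline. This is only a sketch: the two span elements are hypothetical markup, the first value is the one that appears in the warning message in this post (1452391200 seconds, written here in milliseconds), and tz = "UTC" is my own choice so that the printed result does not depend on the local time zone.

```r
library(rvest)

# Hypothetical markup carrying millisecond timestamps in a "timestamp" attribute
page <- read_html('
<div>
  <span class="schedule-date" timestamp="1452391200000"></span>
  <span class="schedule-date" timestamp="1452477600000"></span>
</div>')

# html_attr() returns the attribute values as character strings
stamps <- page %>%
  html_nodes(".schedule-date") %>%
  html_attr("timestamp")

# as.POSIXct() is vectorised, so the whole vector can be converted in one call
when <- as.POSIXct(as.numeric(stamps) / 1000, origin = "1970-01-01", tz = "UTC")

format(when, "%Y-%m-%d %H:%M:%S")
# "2016-01-10 02:00:00" "2016-01-11 02:00:00"
```

Converting the whole vector at once gives a POSIXct vector directly instead of a list, which is one way to avoid the problem described next.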
## constructing a data frame of all the info, to have a look at the data we collected
matches_info <- data.frame(matches = matches, stringsAsFactors = FALSE)
matches_info[,"scores"] <- matches_scores ## appending scores
matches_info[,"venue"] <- matches_venue ## appending venue
matches_info[,"current_status"] <- matches_Curr ## appending current status

## another problem was here: if I simply do this,

matches_info[,"Date And Time"] <- matches_dates ## appending date and time
## R gives the following warning and produces a wrong result, because matches_dates is a list and not a vector:

Warning message:
In `[<-.data.frame`(`*tmp*`, , "Date And Time", value = list(1452391200,  : 
  provided 18 variables to replace 1 variable
## again I posted a question to the Stack Overflow community and found the answer: (Question)

The solution is very simple: combine the list into a single vector with the do.call() function:


matches_info[,"Date And Time"] <- do.call(c,matches_dates)
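
Here is a small self-contained sketch of why do.call() helps, using base R only. The match names are made up and tz = "UTC" is my own choice; the first timestamp is the one from the warning message above.

```r
# A list of single POSIXct values, like the one lapply() returns above
dates_list <- lapply(c(1452391200, 1452477600), function(secs) {
  as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
})

# Made-up match names, only for illustration
info <- data.frame(matches = c("Match 1", "Match 2"), stringsAsFactors = FALSE)

# do.call(c, dates_list) is the same as c(dates_list[[1]], dates_list[[2]]):
# it combines the list elements into one POSIXct vector
dates_vec <- do.call(c, dates_list)

is.list(dates_list)   # TRUE: a list, which a data frame treats as several columns
class(dates_vec)      # a POSIXct vector, which fits into a single column

info[, "Date And Time"] <- dates_vec
info
```

Unlike unlist(), which would return plain numbers, c() keeps the POSIXct class, so the column still prints as dates.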
## The following were the scores as of 10-01-2016



Thank you for reading this. Keep visiting for new posts, and stay tuned to the DexterEdu YouTube channel for the upcoming R lecture series.
For any help, email me at: krunalparmar@iitkgp.ac.in