In this post, I try my hand at simple web scraping in R.
This is part of my series documenting my small experiments with R or Python for solving data analysis / data science problems. These experiments may be redundant, since plenty of people have already written and blogged about them, but this is more of a personal diary of my learning process. Along the way, I hope to engage and inspire anyone else who is going through the same process. And if someone more knowledgeable stumbles upon this blog and sees a better way of doing things, or a place where I have erred, please feel free to share your feedback and help not just me but the whole community grow.

I recently moved to Vancouver, Canada, and I wanted to go on as many hikes as possible before summer gives way to fall and then to winter. I Googled trails around Vancouver and found a neat website, www.vancouvertrails.com. Then I thought: why not make a Google Sheet to track the trails as I finish them and note down my experience of each one? I could have simply copy-pasted the trail data into a Google Sheet and moved on, but that wouldn't have been any fun. So I wrote some R code to scrape the website's data and export it as a CSV, which I will then upload to my Google Sheet. (Yes, I like to make things as much fun as possible.)
Goal: Scrape the website's data and export it as a CSV file, and while I'm at it, use ggplot to do some data analysis for fun. (Simple enough.)
Libraries: I started looking through R library documentation to figure out which libraries and functions to use. After some 15 minutes of Googling, I found the rvest library, which I liked.
Step 1: Reading the URL
There is a function, read_html, which we will use to read the HTML of a given webpage.
library(rvest)

url <- "https://www.vancouvertrails.com/trails/?details=&sort=#list"
trails_webpage <- read_html(url)
If you visit the above URL, you will see that it lists all the trails (167, to be precise) in and around Vancouver.
Step 2: Scraping the Data which is required
Now, the best part of the rvest library is that you can extract data from HTML nodes: you can directly select nodes by their IDs or CSS classes and extract the text inside the HTML tags. Open the above URL in Firefox and choose Page Source from the Tools menu to view the page's source code. I soon figured out that the hike names are wrapped in the .trailname CSS class, so we can use this class to extract all the trail names on the page.
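To see this selector-based extraction in action before touching the live site, here is a small standalone sketch using rvest's minimal_html helper to build a toy page with the same .trailname class:

```r
library(rvest)

# A tiny stand-in page carrying the same .trailname class as the real site
page <- minimal_html('
  <div class="trail"><span class="trailname">Abby Grind</span></div>
  <div class="trail"><span class="trailname">Alice Lake</span></div>
')

# Select every node with the .trailname CSS class, then pull out its text
html_text(html_nodes(page, ".trailname"))
#> [1] "Abby Grind" "Alice Lake"
```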
There are 2 functions that we will use here:
- html_nodes: use this function to extract the nodes we want (in this case, nodes with the .trailname CSS class)
- html_text: use this function to extract the text between the HTML tags (in this case, our trail names)
Extracting: Trail Names
trail_names_html <- html_nodes(trails_webpage, '.trailname')
trail_names <- html_text(trail_names_html)
head(trail_names)
Output:
[1] "Abby Grind"               "Admiralty Point"
[3] "Al's Habrich Ridge Trail" "Aldergrove Regional Park"
[5] "Alice Lake"               "Ancient Cedars Trail"
Similarly, let's now do this for the other attributes of each trail: region, difficulty, time, distance, and season. Each of these attributes has its own CSS class: .i-name, .i-time, .i-difficulty, .i-distance, and .i-schedule.
Extracting: Trail Region
trail_region_html <-html_nodes(trails_webpage, '.i-name')
trail_region <- html_text(trail_region_html)
head(trail_region)
Output:
[1] "Fraser Valley East" "Tri Cities" "Howe Sound"
[4] "Surrey and Langley" "Howe Sound" "Whistler"
Extracting: Trail Difficulty
trail_diff_html <-html_nodes(trails_webpage, '.i-difficulty')
trail_diff <- html_text(trail_diff_html)
head(trail_diff)
Output:
[1] "Intermediate" "Easy" "Intermediate" "Easy" "Easy"
[6] "Intermediate"
Extracting: Trail Season
trail_season_html <-html_nodes(trails_webpage, '.i-schedule')
trail_season <- html_text(trail_season_html)
head(trail_season)
Output:
[1] "year-round" "year-round" "July - October" "year-round"
[5] "April - November" "June - October"
One thing to note: when we extract the time, it comes back as a string, e.g. "1.5 Hours" or "3 Hours", but we want it in numeric form. To convert it, I'm using the stringr library and its str_extract function.
We can use a regular expression to match the pattern and extract the time in numeric form. For help with regular expressions, you can refer to the stringr cheatsheet here.
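To see what that pattern does before applying it to the scraped vector, here is a small standalone sketch on hand-written strings (the pattern matches an optional leading minus, some digits, an optional decimal point, and more digits):

```r
library(stringr)

times <- c("1.5 Hours", "3 Hours", "30 Minutes")

# str_extract returns the first substring matching the pattern in each element
str_extract(times, pattern = "\\-*\\d+\\.*\\d*")
#> [1] "1.5" "3"   "30"

# as.numeric then converts the matched strings into numbers
as.numeric(str_extract(times, pattern = "\\-*\\d+\\.*\\d*"))
#> [1]  1.5  3.0 30.0
```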
So this is what I did to convert our time to numeric form:
Extracting: Trail Times:
library(stringr)

trail_time_html <- html_nodes(trails_webpage, '.i-time')
trail_time <- html_text(trail_time_html)
head(trail_time)
trail_time <- as.numeric(str_extract(trail_time,pattern = "\\-*\\d+\\.*\\d*"))
head(trail_time,25)
Output:
[1] 1.50 1.50 5.00 2.00 2.00 2.00 3.50 5.00
[9] 5.00 1.50 1.00 5.00 11.00 3.00 1.00 2.00
[17] 1.50 1.00 0.50 3.50 0.25 5.00 4.00 2.00
[25] 3.50
Similarly, we do the same thing for trail distance, since that information is also a string, e.g. "4km":
Extracting: Trail Distance
trail_dist_html <-html_nodes(trails_webpage, '.i-distance')
trail_dist <- html_text(trail_dist_html)
head(trail_dist)
trail_dist <- as.numeric(str_extract(trail_dist,pattern = "\\-*\\d+\\.*\\d*"))
head(trail_dist,25)
Output:
[1] 4.0 5.0 7.0 5.0 6.0 5.0 6.1 12.0 10.0 3.0 2.6 10.0 29.0 8.0 2.5
[16] 5.0 4.0 4.2 1.0 6.0 0.8 7.5 7.0 8.0 10.0
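One extra precaution before Step 3: if a selector is mistyped or a trail is missing an attribute on the page, the extracted vectors end up with different lengths, and data.frame() will either error or silently recycle values. Here is a small hypothetical helper to guard against that (check_lengths is my own name, not part of rvest):

```r
# Hypothetical helper: verify all extracted attribute vectors line up one-to-one
check_lengths <- function(...) {
  n <- lengths(list(...))  # length of each vector passed in
  if (length(unique(n)) != 1) {
    stop("Extracted vectors differ in length: ", paste(n, collapse = ", "))
  }
  invisible(n[[1]])  # all equal, so return the common length
}

# Usage with the vectors scraped above:
# check_lengths(trail_names, trail_region, trail_diff,
#               trail_time, trail_dist, trail_season)
```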
Step 3: Now that we have all our information, let's collate it into a data frame and export it to a .csv file. To do this, we use the write_csv function from the readr library.
library(readr)
# Combining all the extracted features of the trails
trails_df <- data.frame(
  Name = trail_names,
  Region = trail_region,
  Difficulty = trail_diff,
  Distance = trail_dist,
  HikeTime = trail_time,
  Season = trail_season
)
str(trails_df)
write_csv(trails_df, "vancouver_trails.csv")
Step 4: Visualising the data
You can refer to the GitHub repo for the code.
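As a taste of what that visualisation step involves, here is a minimal ggplot2 sketch of the kind of plot you could draw. So that it runs standalone, it rebuilds a few rows in the same shape as trails_df, with values taken from the outputs above; with the real scraped data you would pass trails_df directly:

```r
library(ggplot2)

# A few rows in the same shape as trails_df from Step 3 (values from the
# outputs above), so this sketch runs standalone
sample_trails <- data.frame(
  Name = c("Abby Grind", "Admiralty Point", "Alice Lake"),
  Difficulty = c("Intermediate", "Easy", "Easy"),
  Distance = c(4.0, 5.0, 6.0),
  HikeTime = c(1.5, 1.5, 2.0)
)

# Longer trails should broadly take longer; colour the points by difficulty
p <- ggplot(sample_trails, aes(x = Distance, y = HikeTime, colour = Difficulty)) +
  geom_point(size = 3) +
  labs(title = "Vancouver trails: hike time vs. distance",
       x = "Distance (km)", y = "Hike time (hours)")
p
```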


