After that, go to the folder shown in the command output (copy and paste the path that you got): “C:\\Users\\User\\AppData\\Local\\binman\\binman_chromedriver\\win32\\”
In my case, I will look for: “C:/Users/User/AppData/Local/binman/binman_chromedriver/win32”
3. ChromeDriver:
Download the ChromeDriver matching your Chrome version from Chrome for Testing.
If you are using a Mac, download the Mac build that matches your Google Chrome version. If you are using Windows, download the matching Windows build.
After unzipping the downloaded ChromeDriver, move the extracted file to the directory selected in step 2. In my case: “C:/Users/User/AppData/Local/binman/binman_chromedriver/win32”
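Before moving on, it can help to confirm that the driver file actually landed in that folder. The following is a minimal sketch in base R. `has_chromedriver` is a hypothetical helper written for this check (it is not part of binman or RSelenium), and the path is the one from my machine, so replace it with yours.

```r
# Hypothetical helper: returns TRUE if a file whose name starts with
# "chromedriver" exists anywhere under `dir` (subfolders included).
has_chromedriver <- function(dir) {
  files <- list.files(dir, recursive = TRUE)
  any(grepl("^chromedriver", basename(files)))
}

# Path used in this tutorial; replace it with the folder you got in step 2.
driver_dir <- "C:/Users/User/AppData/Local/binman/binman_chromedriver/win32"
has_chromedriver(driver_dir)
```

If this returns FALSE, the file is probably still in your downloads folder or was moved somewhere else.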
4. Verify Installation:
Check if the ChromeDriver is correctly set up by running:
# Check and install wdman if not already installed
if (!requireNamespace("wdman", quietly = TRUE)) {
  install.packages("wdman")
}
wdman::selenium()

library(RSelenium)
# Assuming Selenium is running on the specified port
# (the port should be the same one that you have in 5.b.)
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4567L, browserName = "chrome")
# Attempt to open the browser
remDr$open()
6. If everything works, a new Chrome window will open.
What is Selenium? Why use Selenium?
Selenium is an open-source framework for automating web browsers. It allows us to interact with elements on web pages as if a human user were operating the browser. For example, we will see the browser move through the page as if we were controlling the mouse ourselves.
Use Selenium to interact with dynamic web content, perform complex navigation, automate testing, or scrape data from interactive elements that rely on JavaScript.
View the page's source syntax (right-click on the page and choose Inspect):
HTML Syntax
HTML Syntax for today
CSS (1)
CSS (2)
XPATH
XPath (XML Path Language) is a query language used for selecting nodes* from an XML document, which can also be applied to HTML documents.
XPath can be combined with HTML, CSS, and Selenium to interact with elements on a webpage and extract information more efficiently.
* nodes: “individual parts or components of a document’s structure”. A node can represent different things: an element (an HTML tag), an attribute (such as class or id), or a piece of text.
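To make the idea of selecting nodes concrete, here is a small sketch that runs XPath queries against a made-up HTML fragment with rvest. The markup and the class names "lot" and "artist" are assumptions for illustration only and are not taken from any real page. The second lot has no artist on purpose, to show what a missing node looks like.

```r
library(rvest)

# Made-up HTML that mimics a short list of auction lots.
page <- read_html(paste0(
  '<div class="lot">',
  '<div class="artist">1. Diego Rivera</div>',
  '<div class="description">Mujer con dalias</div>',
  '</div>',
  '<div class="lot">',
  '<div class="description">Saquarema</div>',
  '</div>'
))

# "//" searches the whole document for every div with class "lot".
lots <- html_nodes(page, xpath = '//div[@class="lot"]')

# ".//" searches only inside each lot, so results stay aligned lot by lot.
artist <- lots %>%
  html_node(xpath = './/div[@class="artist"]') %>%
  html_text(trim = TRUE)
title <- lots %>%
  html_node(xpath = './/div[@class="description"]') %>%
  html_text(trim = TRUE)

artist  # "1. Diego Rivera" NA
title   # "Mujer con dalias" "Saquarema"
```

Because `html_node()` returns exactly one result per lot, a lot without a matching element yields NA instead of being silently dropped, which keeps the vectors the same length.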
# Selenium is running on the port that we specified:
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4567L, browserName = "chrome")
# Open Chrome:
remDr$open()

# Navigate to the webpage that we want to scrape
remDr$navigate("https://www.sothebys.com/en/auctions/2012/latin-american-art-n08862.html")
Scraping the author’s name:
Get the source page:
library(rvest)  # provides read_html(), html_node(), and html_text()
# Raw HTML source code
page_source <- remDr$getPageSource()[[1]]
# Parse the raw HTML into a format that allows HTML element selection and manipulation
page_html <- read_html(page_source)
page_html
[1] "1. Diego Rivera" "2. Fernando Botero"
[3] "3. José Pancetti" "4. Héctor Poleo"
[5] "5. Emiliano di Cavalcanti" "6. Matta"
[7] "7. Leonora Carrington" NA
[9] "9. Carlos Mérida" "10. Francisco Toledo"
[11] "11. Armando Reverón" "12. Fernando Botero"
Scraping the title of each piece of art:
Extract the title of the art piece:
title <- artworks %>%
  html_node(xpath = './/div[@class="description"]') %>%
  html_text(trim = TRUE)
title
[1] "Mujer con dalias"
[2] "Naturaleza muerta con frutas"
[3] "Saquarema"
[4] "Maternidad"
[5] "Natureza morta"
[6] "Mourir pour le pain (Composition 50F)"
[7] "Santa Teresa en la cocina"
[8] ""
[9] "Fecundidad"
[10] "Autorretrato"
[11] "Desnudo detrás de la mantilla"
[12] "Donna Seduta"
Split the vector into two separate vectors: one containing the values before the dash (“ – ”) and another containing the values after it:
library(tidyverse)  # provides tibble(), mutate(), separate(), and str_remove_all()
result <- tibble(estimate_text) %>%
  mutate(values = str_remove_all(estimate_text, "Estimate: | USD"),  # Remove unnecessary text
         values = str_remove_all(values, ",")) %>%                   # Remove commas
  separate(values, into = c("before", "after"), sep = " – ") %>%     # Split into before and after
  mutate(across(c(before, after), as.numeric))                       # Convert to numeric
# Extracting the minimum and maximum vectors
min_estimate <- result$before
max_estimate <- result$after
print(min_estimate)
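To see what each cleaning step does, it may help to trace a single string through the same sequence in base R. This is only an illustrative sketch: the input string is made up to match the "Estimate: … USD" pattern the code above assumes, and `gsub()` and `strsplit()` stand in for the stringr and tidyr functions.

```r
# Made-up example string in the assumed format.
x <- "Estimate: 4,000,000 – 6,000,000 USD"

# Step 1: drop the "Estimate: " prefix and the " USD" suffix.
step1 <- gsub("Estimate: | USD", "", x)   # "4,000,000 – 6,000,000"

# Step 2: drop the thousands separators.
step2 <- gsub(",", "", step1)             # "4000000 – 6000000"

# Step 3: split on the dash (with its surrounding spaces) and convert.
parts <- strsplit(step2, " – ", fixed = TRUE)[[1]]
bounds <- as.numeric(parts)

bounds[1]  # lower estimate: 4 million
bounds[2]  # upper estimate: 6 million
```

Note that the separator is an en dash surrounded by spaces, not a plain hyphen. If the page uses a different dash character, the split will not find it and the conversion will produce NA.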