class: left title-slide
background-image: url('unsplash-yUvZYHV2Zbw.png')
background-size: cover
background-position: top center

# Web scraping<br>with rvest

.side-text[
. steven.metodosquantitativos.com
]

.title-where[
TARC3 <br>
]

<style type="text/css">
@keyframes title-text{
  0% { opacity: 0; text-shadow: -20px 30px 5px rgba(0,0,0,0.25); transform: translate(15px, -15px); }
  10% { opacity: 0; text-shadow: -20px 30px 5px rgba(0,0,0,0.25); transform: translate(15px, -15px); }
  80% { opacity: 1; text-shadow: -5px 5px 10px rgba(0,0,0,0.25); transform: translate(0, 0); }
  100% { opacity: 1; text-shadow: -5px 5px 10px rgba(0,0,0,0.25); transform: translate(0, 0); }
}
@keyframes enter-right {
  0% { opacity: 0; transform: rotate(90deg) translateY(-50px) }
  20% { opacity: 0; transform: rotate(90deg) translateY(-50px) }
  80% { opacity: 1; transform: rotate(90deg) translateY(0) }
  100% { opacity: 1; transform: rotate(90deg) translateY(0) }
}
@keyframes enter-left {
  0% { opacity: 0; transform: translateY(100px) }
  20% { opacity: 0; transform: translateY(100px) }
  60% { opacity: 1; transform: translateX(0) }
  100% { opacity: 1; transform: translateX(0) }
}
.remark-visible .title-slide h1,
.remark-visible .title-slide .side-text,
.remark-visible .title-slide .title-where {
  animation-duration: 13s;
}
.title-slide h1 {
  font-size: 100px;
  font-family: Jost, sans;
  animation-name: title-text;
  animation-direction: alternate;
  animation-iteration-count: infinite;
}
.side-text {
  color: white;
  opacity: 0.66;
  transform: rotate(90deg);
  position: absolute;
  font-size: 20px;
  top: 130px;
  right: -130px;
  transition: opacity 0.5s ease-in-out;
  animation-name: enter-right;
  animation-direction: alternate;
  animation-iteration-count: infinite;
}
.side-text:hover { opacity: 1; }
.side-text a { color: white; }
.title-where {
  color: white;
  font-family: 'Amatic SC', sans;
  font-size: 40px;
  position: absolute;
  bottom: 10px;
  animation-name: enter-left;
  animation-direction: alternate;
  animation-iteration-count: infinite;
  animation-timing-function: ease-in-out;
}
</style>

---
class: center, middle

## [My first web scraping project](http://www.estatisticacomr.uff.br/?p=869)

<div class="ribbon-parent" style="position:absolute;top:0px;overflow:hidden;width:150px;height:150px;z-index:5;pointer-events:none;right:0px;top:00px;z-index:100;">
<div class="ribbon" style="background-color:white;overflow:hidden;white-space:nowrap;position:absolute;top:45px;box-shadow:0 0 10px #888;pointer-events:auto;right:-50px;transform:rotate(45deg);width:250px;top:28px;"><a href="https://aulas.metodosquantitativos.com/" style="border:1px solid white;color:black;display:block;font:bold 95% 'Collegiate', Arial, sans-serif;margin:1px 0;padding:6px 50px;text-align:center;text-decoration:none;letter-spacing:-0.3px;" target="_blank"> Classes</a></div></div>
<style>
.ribbon:hover {opacity:1;}
.ribbon {opacity:0.6;transition:opacity 0s ease 0s;}
</style>

--

# My first web scraping failure

---

# rvest

### How to install rvest

```r
install.packages("rvest")
```

---
class: inverse, center, middle

# Comparing <br>rvest with <br>BeautifulSoup

---

.pull-left[
In **BeautifulSoup**, the initial setup looks like this:

```python
# load packages
from bs4 import BeautifulSoup
import requests

# connect to webpage
resp = requests.get("https://bsi.uniriotec.br/")

# get BeautifulSoup object (naming the parser avoids a "guessed parser" warning)
soup = BeautifulSoup(resp.content, "html.parser")
```
]

--

.pull-right[
By comparison, in **rvest**:

```r
# load rvest package
library(rvest)

# get HTML object
dados_html <- read_html("https://bsi.uniriotec.br/")
```
]

---
class: inverse, center, middle

# How to search for specific HTML tags

---

.pull-left[
# In BeautifulSoup:

```python
# all <a> tags, returned as tag objects
links = soup.find_all("a")
```
]

--

.pull-right[
# In rvest:

```r
# all <a> nodes, reduced to their href attribute
links <- dados_html %>%
  html_nodes("a") %>%
  html_attr("href")
```

In **rvest**, we chain the steps with the pipe operator **%>%**: each call receives the result of the previous one as its first argument.
]

---
class: inverse, center, middle

# How to extract a list of all matching elements

---

.pull-left[
In **BeautifulSoup**, we use the **find_all** method to extract a list of all the elements with a given *tag* from a *web* page.

```python
# get all div tags
soup.find_all("div")

# get all h1 tags
soup.find_all("h1")
```
]

--

.pull-right[
In **rvest**, we can get specific HTML tags with the *html_nodes* function (also available as *html_elements* since rvest 1.0.0).

```r
# scrape all div tags
dados_html %>% html_nodes("div")

# scrape header h1 tags
dados_html %>% html_nodes("h1")
```
]

---
class: inverse, center, middle

# How do we scrape data from the BSI website with rvest?

---

### It will probably look something like this.

```r
library(rvest)

# download and parse the course page
link <- "https://bsi.uniriotec.br/bancos-de-dados-i-tin0120/"
pagina <- read_html(link)

# text of every h1, h4 and p element on the page
r1 <- pagina %>% html_nodes("h1") %>% html_text()
r4 <- pagina %>% html_nodes("h4") %>% html_text()
p <- pagina %>% html_nodes("p") %>% html_text()
```

---

Other interesting R packages:

1. [RSelenium: R Bindings for 'Selenium WebDriver'](https://cran.r-project.org/web/packages/RSelenium/index.html)
2. [XML: Tools for Parsing and Generating XML Within R and S-Plus](https://cran.r-project.org/web/packages/XML/index.html)
3. [xml2: Parse XML](https://cran.r-project.org/web/packages/xml2/index.html)
4. [htmltidy: Tidy Up and Test XPath Queries on HTML and XML Content](https://cran.r-project.org/web/packages/htmltidy/index.html)
5. [seleniumPipes: R Client Implementing the W3C WebDriver Specification](https://cran.r-project.org/web/packages/seleniumPipes/index.html)

References used to create these slides <br>
[BeautifulSoup vs. Rvest](http://theautomatic.net/2019/07/23/beautifulsoup-vs-rvest/)
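
---

### A sketch to try offline

The examples on the earlier slides all fetch live pages, so this is a minimal sketch of the same pipeline run against an inline HTML string, which needs no network access. The page content, the `example.org` URLs and the variable names are made up for illustration and do not come from the BSI site. `html_nodes()` is used to match the earlier slides.

```r
library(rvest)

# Hypothetical page, inlined so the example runs without network access
html <- paste0(
  "<html><body>",
  "<h1>Databases I</h1>",
  "<p>Course page</p>",
  "<a href='https://example.org/syllabus'>Syllabus</a>",
  "<a href='https://example.org/schedule'>Schedule</a>",
  "</body></html>"
)
pagina <- read_html(html)

# Text of the first-level heading
titulo <- pagina %>% html_nodes("h1") %>% html_text()

# Select the <a> nodes once, then read their text and href attribute
ancoras <- pagina %>% html_nodes("a")
links <- data.frame(
  texto = ancoras %>% html_text(),
  href = ancoras %>% html_attr("href"),
  stringsAsFactors = FALSE
)

print(titulo)
print(links)
```

If the selectors behave as expected here, the same pipeline should carry over to a real page by passing a URL to `read_html()` instead of the string, although a live page may need more specific CSS selectors than a bare tag name.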