Webscraping

Robert McDonnell

4 minute read

In an earlier post, I described some ways in which you can interact with a web browser using R and RSelenium. This is ideal when you need to access data through drop-down menus and search bars. However, working with RSelenium can be tricky. There are, of course, easier ways to get information from the internet using R. Perhaps the most straightforward way is to use rvest, in tandem with other packages of the Hadleyverse, such as dplyr and tidyr for data preparation and cleaning after the webscrape.

9 minute read

It goes almost without saying that the internet itself is the richest database available to us. According to a 2014 blog post, every minute: Facebook users share nearly 2.5 million pieces of content; Twitter users tweet nearly 300,000 times; Instagram users post nearly 220,000 new photos; YouTube users upload 72 hours of new video content; Apple users download nearly 50,000 apps; and email users send over 200 million messages.