First off, what is a Nokogiri? According to Wikipedia,
“The Japanese saw or nokogiri (鋸) is a type of saw used in woodworking and Japanese carpentry that cuts on the pull stroke, unlike the European saw that cuts on the push stroke. This allows it to have thinner blades that cut more efficiently and leave a narrower cut width (kerf).”
OHHHH a saw. Now let’s cut through HTML data!
It’s good to familiarize yourself with the Nokogiri gem because it makes it easier to extract data from sources, such as HTML documents, that were written for browsers rather than for programs to read.
First, install Nokogiri. On your command line, type the following:
gem install nokogiri
If you have any issues, reference the Nokogiri website.
Next, make an HTTP request using the open-uri module from Ruby's standard library. open-uri gives you the ability to open an http, https or ftp URL as if it were a file. Ultimately, this will allow us to run the ‘open’ method, which returns the HTML content of the URL so we can store it in an html variable we create. On Ruby 3.0 and later, call it as URI.open, since the bare ‘open’ no longer accepts URLs.
Input in IRB:
require 'nokogiri'
require 'open-uri'
Both commands should return “true” (or “false” if the library was already loaded in your IRB session).
Input in IRB:
html = URI.open('http://www.WEBSITE.com') # uses the open-uri module required above (URI.open is needed on Ruby 3.0+)
nokogiri_doc = Nokogiri::HTML(html)
Nokogiri rocks! This command returns a parsed document that is far easier to search and manipulate than the raw HTML. So what data are we looking to obtain from the HTML? Take a look at the elements of the website through the “Inspect Element” option in Google Chrome. Open Google Chrome to the website you want to parse. As an example, I’ll use WordPress (how “meta”).
After clicking on Inspect Element, another panel will emerge from the right side of your screen (or the bottom, depending on your settings). This panel shows all of the HTML elements within the page you are checking out, along with their CSS styles. As you move your mouse over the elements in this panel, you should see the matching portions of the website light up.
When you locate the portion of the website you are trying to pull, take a look at the tag and class of the element highlighted on your screen; together they make up its CSS selector. Note the parent element (in this case “article”, which wraps each post in the post list) and the element we are seeking (the title link, “h1.entry-title a”).
This CSS selector will be the key to parsing the HTML elements you are seeking. Next, we will iterate through the articles on the page.
def nokogiri_example
  html = URI.open('https://casielevine.wordpress.com') # URI.open is needed on Ruby 3.0+
  wordpress = Nokogiri::HTML(html)
  all_posts = []

  # Iterate through the articles and pull the titles
  wordpress.css("article").each do |article|
    all_posts << article.css("h1.entry-title a").text
  end

  # return the all_posts results
  all_posts
end

nokogiri_example
Now we have an array of only the elements we were seeking from the WordPress site!