Scraping Data With Cheerio

Waltir | 2019-04-28

What is Cheerio?

Cheerio is a Node module that lets you easily parse markup and extract the information you need using a jQuery-like API in plain JavaScript. Cheerio provides an API for traversing and manipulating the resulting data structure. Unlike a web browser, it does not interpret the result: Cheerio simply provides access to the markup. It does not do any visual rendering, apply CSS, load external resources, or execute JavaScript. Basically, if the data is present when you 'View Source' or 'Inspect', it will also be available to Cheerio.


What's the point?

Cheerio comes in handy when you need to scrape or verify a large amount of data quickly. Sure, we could use other popular QA tools to obtain this data; however, Cheerio will accomplish the task in a fraction of the time it would take our normal automation framework to execute.


How have my projects benefited from it?

  • I’ve used it to verify placements of widgets and ad units on 300+ websites
  • Verified data in our Rest API for 300+ websites
  • Obtained Google Play / App Store versions for 300+ mobile apps
  • Obtained Google Play / App Store / Alexa reviews for 300+ mobile apps and skills
  • Verified the accuracy of legal and contact information on 300+ websites
  • Verified menu items and button links in 300+ apps using our Rest API
  • … The list goes on and on.

As you can see, it has been our go-to tool for the past two years whenever QA / Support needs to provide the business owners with the information they are looking for.


Execution Time

While this section is probably not necessary, I feel it's worth mentioning: Cheerio is fast! Typically these audits take less than 5 minutes to run against our 300+ sites.



Okay, let's dig in!

Install Cheerio

In a new project directory run npm init followed by npm install cheerio. This will initialize npm and install the Cheerio module.


Setting up our Cheerio script

First, we're going to create a new file in the root of our directory called test.js. Next, we need to add our two dependencies to the top of our new test file, like so:

const cheerio = require('cheerio');
const request = require('request');

We will use the Request module to fetch the data from our site and the Cheerio module to parse through the data and extract the information we want.


Running a basic Cheerio script

The following example shows a basic request using Cheerio to log the page title and the URL of each post found on NPR's National section ( https://www.npr.org/sections/national/ ) to the terminal. Run it by navigating to the root of your project directory in the terminal and running node test.js.

require('events').EventEmitter.defaultMaxListeners = 100;
const cheerio = require('cheerio');
const request = require('request');

request({
  method: 'GET',
  url: 'https://www.npr.org/sections/national/'
}, (err, res, body) => {
  if (err) return console.error(err);
  let $ = cheerio.load(body);
  let title = $('title').text();
  console.log(title); // Log the page title to the terminal
  // Iterate over each of the posts on the page and log its url to the terminal
  $('div.imagewrap > a').each(function(index, elem) {
    let link = $(elem).attr('href');
    console.log(link);
  });
});



Getting data from each post

Okay, so we have logged the title of the page and the URL of each post on the page, but what if we want to access data inside each post? Easy: we just need to send an extra request to each post and extract the data we want. The following script shows how to save data from a dozen NPR posts to a JSON file.

require('events').EventEmitter.defaultMaxListeners = 100;
const cheerio = require('cheerio');
const request = require('request');
const fs = require('fs');

const interval = 500; // Stagger the requests so we don't hammer the server
let data = [];

request({
  method: 'GET',
  url: 'https://www.npr.org/sections/national/'
}, (err, res, body) => {
  if (err) return console.error(err);
  let $ = cheerio.load(body);
  $('div.imagewrap > a').each(function(index, elem) {
    setTimeout(function () {
      let link = $(elem).attr('href');
      request({
        method: 'GET',
        url: link
      }, (err, res, body) => {
        if (err) return console.error(err);
        let $ = cheerio.load(body);
        let post = {
          title: $('h1').text(),
          url: link,
          date: $('.date').text(),
          slug: $('h3.slug > a').text(),
          author: $('.byline__name > a:nth-child(1)').text(),
          img: $('#storytext > div > div > img').attr('src'),
          body: $('#storytext > p').map(function(i, el) { return $(this).text(); }).get().join(' ')
        };
        data.push(post);
        // Write inside the callback so the post is in `data` before we save
        fs.writeFile('log.json', JSON.stringify(data), function(err) {
          if (err) return console.error(err);
          console.log('Another post saved!');
        });
      });
    }, index * interval);
  });
});


Iterating over each nav link and fetching all available posts

Finally, what if we want to gather posts from each page in the nav? Again, we simply adjust our script by adding one more request: first iterate over each link in the navbar, fetch that section page, and then iterate over each post on it. The following working example gathers over 400 of the most recent posts on NPR.

require('events').EventEmitter.defaultMaxListeners = 100;
const cheerio = require('cheerio');
const request = require('request');
const fs = require('fs');

const interval = 500; // Stagger the requests so we don't hammer the server
let data = [];

request({
  method: 'GET',
  url: 'https://www.npr.org/'
}, (err, res, body) => {
  if (err) return console.error(err);
  let $ = cheerio.load(body);
  // Iterate over the nav links and fetch each section page
  $('li.submenu__item > a').each(function(navIndex, navElem) {
    setTimeout(function () {
      request({
        method: 'GET',
        url: $(navElem).attr('href')
      }, (err, res, body) => {
        if (err) return console.error(err);
        let $section = cheerio.load(body);
        // Iterate over each post on the section page
        $section('div.imagewrap > a').each(function(index, elem) {
          setTimeout(function () {
            let link = $section(elem).attr('href');
            request({
              method: 'GET',
              url: link
            }, (err, res, body) => {
              if (err) return console.error(err);
              let $post = cheerio.load(body);
              let post = {
                title: $post('h1').text(),
                url: link,
                date: $post('.date').text(),
                slug: $post('h3.slug > a').text(),
                author: $post('.byline__name > a:nth-child(1)').text(),
                img: $post('#storytext > div > div > img').attr('src'),
                body: $post('#storytext > p').map(function(i, el) { return $post(this).text(); }).get().join(' ')
              };
              data.push(post);
              // Write inside the callback so the post is in `data` before we save
              fs.writeFile('log.json', JSON.stringify(data), function(err) {
                if (err) return console.error(err);
                console.log('Another post saved!');
              });
            });
          }, index * interval);
        });
      });
    }, navIndex * interval);
  });
});


As you can see, Cheerio is a very powerful tool that can assist with a wide array of tasks. Personally, it has become one of my favorite tools for automating many of my QA and Support tasks. It's worth mentioning that these simple examples are not entirely perfect and could be rewritten a dozen different ways. If you have any further questions, I strongly recommend that you check out the Cheerio documentation.