Extracting data from meta tags with Cheerio

Waltir | 2019-05-01

As a QA engineer, you get asked to test all sorts of things on both the front end and the back end. One task that has come up time and time again is verifying that our meta tags are working correctly. Meta tags can be very important, especially for media companies that rely heavily on sharing their content across social platforms.


Getting some meta values

The following example shows how you can quickly extract some meta values from NPR.

const cheerio = require('cheerio');
const request = require('request');
const fs = require('fs');

// Collected results; written to log.json once the page has been parsed
const data = [];

request({
  method: 'GET',
  url: 'https://www.npr.org/sections/national/'
}, (err, res, body) => {
  if (err) return console.error(err);

  // Load the response HTML so it can be queried with CSS selectors
  const $ = cheerio.load(body);

  const post = {
    title: $('h1').text(),
    canonical: $('link[rel="canonical"]').attr('href'),
    description: $('meta[name="description"]').attr('content'),
    // Get OG Values
    og_title: $('meta[property="og:title"]').attr('content'),
    og_url: $('meta[property="og:url"]').attr('content'),
    og_img: $('meta[property="og:image"]').attr('content'),
    og_type: $('meta[property="og:type"]').attr('content'),
    // Get Twitter Values
    twitter_site: $('meta[name="twitter:site"]').attr('content'),
    twitter_domain: $('meta[name="twitter:domain"]').attr('content'),
    twitter_img_src: $('meta[name="twitter:image:src"]').attr('content'),
    // Get Facebook Values
    fb_appid: $('meta[property="fb:app_id"]').attr('content'),
    fb_pages: $('meta[property="fb:pages"]').attr('content'),
  };
  data.push(post);

  fs.writeFile('log.json', JSON.stringify(data), function (err) {
    if (err) return console.error(err);
    console.log('Another post saved!');
  });
});


Response

After running the script above, we receive the following JSON output. It doesn't seem like much yet, but let's see how we can expand on it.

[{
  "title": "National",
  "canonical": "https://www.npr.org/sections/national/",
  "description": "NPR coverage of national news, U.S. politics, elections, business, arts, culture, health and science, and technology. Subscribe to the NPR Nation RSS feed.",
  "og_title": "National",
  "og_url": "https://www.npr.org/sections/national/",
  "og_img": "https://media.npr.org/include/images/facebook-default-wide.jpg?s=1400",
  "og_type": "article",
  "twitter_site": "@NPR",
  "twitter_domain": "npr.org",
  "fb_appid": "138837436154588",
  "fb_pages": "10643211755"
}]
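
One thing worth noticing in this output: there is no twitter_img_src key. Cheerio's .attr() returns undefined when no element matches the selector, and JSON.stringify silently omits properties whose value is undefined, so a missing tag simply disappears from the log. Here is a minimal sketch of how such gaps could be made explicit. The findMissing helper and the REQUIRED_KEYS list are hypothetical names introduced for illustration, and the choice of which keys count as required is an assumption you would adjust for your own site.

```javascript
// Keys we expect on every page; this list is an assumption for illustration
const REQUIRED_KEYS = [
  'title', 'canonical', 'description',
  'og_title', 'og_url', 'og_img', 'twitter_img_src',
];

// Hypothetical helper: returns the required keys that are missing or empty
function findMissing(post, required = REQUIRED_KEYS) {
  return required.filter((key) => {
    const value = post[key];
    return value === undefined || value === null || value === '';
  });
}

// Sample record mirroring the logged output for the section page
const sample = {
  title: 'National',
  canonical: 'https://www.npr.org/sections/national/',
  description: 'NPR coverage of national news, U.S. politics, elections, business, arts, culture, health and science, and technology. Subscribe to the NPR Nation RSS feed.',
  og_title: 'National',
  og_url: 'https://www.npr.org/sections/national/',
  og_img: 'https://media.npr.org/include/images/facebook-default-wide.jpg?s=1400',
};

// Lists the keys that never made it into the record
console.log(findMissing(sample));
```

For the sample record, the only key reported is twitter_img_src, which matches what is absent from the JSON above.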



Getting values from multiple posts

Obviously, we could manually check the meta values on a single page quite easily. Where Cheerio shines is in verifying dozens of posts at the same time. The following script iterates over all of the posts linked from the section page, requests each one (staggered by a short interval), and logs their meta values to our JSON file.

const cheerio = require('cheerio');
const request = require('request');
const fs = require('fs');

// Delay between article requests, in milliseconds
const interval = 500;
// Collected results; log.json is rewritten each time a post is added
const data = [];

request({
  method: 'GET',
  url: 'https://www.npr.org/sections/national/'
}, (err, res, body) => {
  if (err) return console.error(err);
  const $ = cheerio.load(body);

  // Each matching anchor links to one article on the section page
  $('div.imagewrap > a').each(function (index, elem) {
    // Stagger the requests so they are not all fired at once
    setTimeout(function () {
      const link = $(elem).attr('href');
      request({
        method: 'GET',
        url: link
      }, (err, res, body) => {
        if (err) return console.error(err);
        // Parse the article page separately from the section page
        const $post = cheerio.load(body);
        const post = {
          title: $post('h1').text(),
          url: link,
          canonical: $post('link[rel="canonical"]').attr('href'),
          description: $post('meta[name="description"]').attr('content'),
          // Get OG Values
          og_title: $post('meta[property="og:title"]').attr('content'),
          og_url: $post('meta[property="og:url"]').attr('content'),
          og_img: $post('meta[property="og:image"]').attr('content'),
          og_type: $post('meta[property="og:type"]').attr('content'),
          // Get Twitter Values
          twitter_site: $post('meta[name="twitter:site"]').attr('content'),
          twitter_domain: $post('meta[name="twitter:domain"]').attr('content'),
          twitter_img_src: $post('meta[name="twitter:image:src"]').attr('content'),
          // Get Facebook Values
          fb_appid: $post('meta[property="fb:app_id"]').attr('content'),
          fb_pages: $post('meta[property="fb:pages"]').attr('content'),
        };
        data.push(post);

        // Write inside the callback so the file includes the post just scraped
        fs.writeFile('log.json', JSON.stringify(data), function (err) {
          if (err) return console.error(err);
          console.log('Another post saved!');
        });
      });
    }, index * interval);
  });
});

[{
  "title": "UNC Charlotte Shooting Victim Is Honored As A Hero For Tackling Shooter",
  "url": "https://www.npr.org/2019/05/01/719222196/unc-charlotte-shooting-victim-is-honored-as-a-hero-for-tackling-shooter",
  "canonical": "https://www.npr.org/2019/05/01/719222196/unc-charlotte-shooting-victim-is-honored-as-a-hero-for-tackling-shooter",
  "description": "Riley Howell is credited with disrupting the campus shooting, dying in the incident but saving others' lives. Police say they have not determined the shooter's motive.",
  "og_title": "UNC Charlotte Shooting Victim Is Honored As A Hero For Tackling Shooter",
  "og_url": "https://www.npr.org/2019/05/01/719222196/unc-charlotte-shooting-victim-is-honored-as-a-hero-for-tackling-shooter",
  "og_img": "https://media.npr.org/assets/img/2019/05/01/ap_19121763139817_wide-c4a4fb41a7434242650ffd548f0539a110c51b9c.jpg?s=1400",
  "og_type": "article",
  "twitter_site": "@NPR",
  "twitter_domain": "npr.org",
  "twitter_img_src": "https://media.npr.org/assets/img/2019/05/01/ap_19121763139817_wide-c4a4fb41a7434242650ffd548f0539a110c51b9c.jpg?s=1400",
  "fb_appid": "138837436154588",
  "fb_pages": "10643211755"
}, {
  "title": "Alabama Lawmakers Move To Outlaw Abortion In Challenge To Roe V. Wade",
  "url": "https://www.npr.org/2019/05/01/719096129/alabama-lawmakers-move-to-outlaw-abortion-in-challenge-to-roe-v-wade",
  "canonical": "https://www.npr.org/2019/05/01/719096129/alabama-lawmakers-move-to-outlaw-abortion-in-challenge-to-roe-v-wade",
  "description": "The House overwhelmingly passed a bill Tuesday that could become the country's most restrictive abortion ban. It would make it a crime for doctors to perform abortions at any stage of a pregnancy. ",
  "og_title": "Alabama Lawmakers Move To Outlaw Abortion In Challenge To Roe V. Wade",
  "og_url": "https://www.npr.org/2019/05/01/719096129/alabama-lawmakers-move-to-outlaw-abortion-in-challenge-to-roe-v-wade",
  "og_img": "https://media.npr.org/assets/img/2019/05/01/gettyimages-465405620_wide-4c683599c9632b335771cfa7674ffaad98cb029e.jpg?s=1400",
  "og_type": "article",
  "twitter_site": "@NPR",
  "twitter_domain": "npr.org",
  "twitter_img_src": "https://media.npr.org/assets/img/2019/05/01/gettyimages-465405620_wide-4c683599c9632b335771cfa7674ffaad98cb029e.jpg?s=1400",
  "fb_appid": "138837436154588",
  "fb_pages": "10643211755"
}]

The script above outputs to a simple JSON file. Typically, my next step is to visually inspect the scraped data in a Google Sheet. Using Cheerio, we can quickly verify the accuracy of the meta values on dozens of posts in the time it would take to open and review just a handful of articles manually.
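
Google Sheets can import CSV files, so one way to get from log.json to a spreadsheet is a small conversion step. The sketch below is not part of the original scripts: toCsv and csvField are hypothetical helpers, and log.csv is an assumed output filename. It builds the column list from the union of keys across all posts, so a post with a missing tag still lines up with the others and shows an empty cell.

```javascript
const fs = require('fs');

// Hypothetical helper: quote one CSV field, doubling any embedded quotes
function csvField(value) {
  const text = value === undefined || value === null ? '' : String(value);
  return '"' + text.replace(/"/g, '""') + '"';
}

// Hypothetical helper: turn an array of post objects into CSV text
function toCsv(posts) {
  // Union of keys across all posts, in first-seen order
  const columns = [];
  for (const post of posts) {
    for (const key of Object.keys(post)) {
      if (!columns.includes(key)) columns.push(key);
    }
  }
  const header = columns.map(csvField).join(',');
  const rows = posts.map((post) =>
    columns.map((key) => csvField(post[key])).join(',')
  );
  return [header, ...rows].join('\r\n');
}

// Convert the scraper's output if it exists (log.csv is an assumed filename)
if (fs.existsSync('log.json')) {
  const posts = JSON.parse(fs.readFileSync('log.json', 'utf8'));
  fs.writeFileSync('log.csv', toCsv(posts));
}
```

Quoting every field keeps descriptions that contain commas or quotation marks from spilling into neighboring columns when the file is imported.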
