In this blog post, I will share a Google Sheets Apps Script that helps you flag which URLs are blocked from crawling by robots.txt and which ones are not.
Use Case:
You are handling a huge enterprise website comprising thousands or even millions of pages. A new robots.txt has been implemented and you want to check which URLs are now blocked from crawling, or you want to check, before implementing the new robots.txt, how many URLs would no longer be crawled.
This assumes you do not want to run a full crawl in your favourite crawling tool, such as Screaming Frog, which can take a lot of time.
Without further ado, here is the Apps Script you will need to paste into the Script Editor (Extensions > Apps Script):
/**
 * Checks if a given URL is allowed or disallowed based on robots.txt rules.
 *
 * @param {string} url - The URL to check.
 * @return {string} "Crawling Allowed" or "Crawling Blocked".
 * @customfunction
 */
function CHECKROBOTS(url) {
  var robotsTxtRules = [
    "/*?categoryId",
    "/*?*price*",
    "/*?*sort*",
    "/*/collections/*sort_by*"
  ];
  var path = getPathFromUrl(url);
  for (var i = 0; i < robotsTxtRules.length; i++) {
    if (isUrlDisallowed(path, robotsTxtRules[i])) {
      return "Crawling Blocked";
    }
  }
  return "Crawling Allowed";
}

/**
 * Extracts the path and query string from a URL.
 *
 * @param {string} url - The full URL.
 * @return {string} The path and query string.
 */
function getPathFromUrl(url) {
  var match = url.match(/^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)(.*)$/);
  return match ? match[2] : url;
}

/**
 * Checks if a URL path is disallowed based on a robots.txt rule.
 *
 * @param {string} path - The URL path to check.
 * @param {string} rule - The robots.txt rule to check against.
 * @return {boolean} True if the path is disallowed, false otherwise.
 */
function isUrlDisallowed(path, rule) {
  var pattern = rule
    .replace(/\*/g, '.*')   // Replace * with .*
    .replace(/\//g, '\\/')  // Escape forward slashes
    .replace(/\?/g, '\\?')  // Escape question marks
    .replace(/\&/g, '\\&')  // Escape ampersands
    .replace(/\=/g, '\\='); // Escape equal signs
  var regex = new RegExp('^' + pattern);
  return regex.test(path);
}
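Before wiring this up to a sheet, you can sanity-check the rules directly from the Apps Script editor. The small test function below is a sketch I have added for illustration; the example.com URLs are hypothetical and only exist to exercise the example rules above.

/**
 * Quick sanity check you can run from the Apps Script editor.
 * The URLs below are hypothetical and only illustrate the example rules.
 */
function TEST_CHECKROBOTS() {
  // Matches "/*?categoryId" -> expected "Crawling Blocked"
  Logger.log(CHECKROBOTS('https://example.com/shoes?categoryId=42'));
  // Matches "/*?*sort*" -> expected "Crawling Blocked"
  Logger.log(CHECKROBOTS('https://example.com/sale?sort=newest'));
  // No rule matches -> expected "Crawling Allowed"
  Logger.log(CHECKROBOTS('https://example.com/men/shoes'));
}

Run it and check the output in the execution log to confirm the verdicts match what you expect from your robots.txt.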
The rules above are just an example; based on your own robots.txt you can specify the Disallow directives that apply to your website. After this, you will need a formula like this:
=CHECKROBOTS(A2)
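Dragging this formula down a column of thousands of URLs works, but every cell fires a separate call to the script, which gets slow at scale. A range-aware wrapper lets a single formula cover the whole column. This is a minimal sketch: the CHECKROBOTSRANGE name is mine, not part of the original script, and it assumes the CHECKROBOTS function above lives in the same project.

/**
 * Range-aware version: accepts a single cell or a whole range.
 *
 * @param {string|string[][]} urls - A URL or a range of URLs.
 * @return {string|string[][]} The crawl status for each input cell.
 * @customfunction
 */
function CHECKROBOTSRANGE(urls) {
  if (!Array.isArray(urls)) {
    return CHECKROBOTS(urls); // single cell, e.g. =CHECKROBOTSRANGE(A2)
  }
  return urls.map(function (row) {
    return row.map(function (url) {
      return url ? CHECKROBOTS(url) : "";
    });
  });
}

With that in place, one formula such as =CHECKROBOTSRANGE(A2:A10000) fills the status for the entire range at once.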
Once you add the formula next to the URL cell, it will flag the crawling status of each URL, as shown in the example below.
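Finally, if you would rather not copy the Disallow lines into the script by hand, you could pull them from the live file with UrlFetchApp, Apps Script's built-in HTTP client. The sketch below is an assumption-heavy starting point rather than a full robots.txt parser: it only reads Disallow lines under a single "User-agent: *" block, ignores Allow directives, and the robots.txt URL you pass in is a placeholder for your own domain.

/**
 * Fetches a live robots.txt and returns its Disallow paths for "User-agent: *".
 * Simplified sketch: ignores Allow lines, crawl-delay, and per-bot groups.
 */
function getDisallowRules(robotsTxtUrl) {
  var text = UrlFetchApp.fetch(robotsTxtUrl).getContentText();
  var lines = text.split("\n");
  var rules = [];
  var inStarGroup = false;
  for (var i = 0; i < lines.length; i++) {
    var line = lines[i].split("#")[0].trim(); // strip comments
    if (/^user-agent:/i.test(line)) {
      inStarGroup = /^user-agent:\s*\*$/i.test(line);
    } else if (inStarGroup && /^disallow:/i.test(line)) {
      var path = line.replace(/^disallow:\s*/i, "");
      if (path) rules.push(path);
    }
  }
  return rules;
}

You could then have CHECKROBOTS build its rule list from getDisallowRules('https://www.yourdomain.com/robots.txt') instead of the hard-coded array, ideally caching the result with CacheService so the file is not re-fetched for every formula call.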