Last updated on September 22nd, 2024 at 06:21 am
Cannibalization is a topic in SEO that can be seen as proactive or reactive. Proactive is when you do an existing analysis before launching any new page to ensure that you are not going to cannibalize the existing page with a new page.
But if you encounter a website that already has hundreds of thousands of pages of which a lot of pages are similar then this post is for you.
In this post, I share a Python Script that does N-Gram Analysis on Meta titles to identify titles that are very similar to each other and are likely cannibalizing each other.
What are N-Grams though?
N-grams is a concept in NLP (Natural Language Processing) it is a continuous sequence of words in a document. It can be utilized for text classification, clustering, and identifying trends among other things.
Simply put, N-Grams bifurcate text into different grams, where grams mean a single word or letter (letters when they appear individually in a sentence example, a)
This diagram explains N-Grams really well.
This is not to say that N-grams are limited to trigrams it can extend beyond that, it can be 4-grams, 5-grams & so on. Hence it is called N-grams. N represents the number.
Usually, Bigram & Trigram are more popular when it comes to analysis.
Python Script to Run N-Gram Analysis on Titles to identify Similar Titles
import csv
from collections import Counter
from itertools import combinations
from typing import List
def n_gram_similarity(title1: str, title2: str, n: int) -> float:
"""Calculate the n-gram similarity between two strings."""
ngrams1 = Counter(zip(*[title1[i:] for i in range(n)]))
ngrams2 = Counter(zip(*[title2[i:] for i in range(n)]))
return sum((ngrams1 & ngrams2).values()) / sum((ngrams1 | ngrams2).values())
def find_similar_titles(titles: List[str], threshold: float, n: int) -> List[List[str]]:
"""Find similar titles based on the n-gram similarity score."""
similar_titles = []
for title1, title2 in combinations(titles, 2):
score = n_gram_similarity(title1, title2, n)
if score >= threshold:
similar_titles.append([title1, title2, score])
return similar_titles
def main(input_file: str, output_file: str):
"""Read titles from input file and write similar titles to output file."""
with open(input_file) as f:
titles = [line.strip() for line in f]
similar_titles = find_similar_titles(titles, threshold=0.5, n=3)
with open(output_file, 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow(['Title 1', 'Title 2', 'N-gram Similarity'])
writer.writerows(similar_titles)
if __name__ == '__main__':
input_file = 'titles.txt'
output_file = 'similar_titles.csv'
main(input_file, output_file)
Let’s see the Script in Action
I ran the script & this is the CSV output that I received as you can see in the screenshot below.
As you can see in the results, how an amazing job it has done in the identification of similar titles. You can sort it for by Z-A to find the most similar title pages.
There is a page titled What is Retrieval Augmented Generation then there is another page titled What is Retrieval Augmented Generation? For sure these are way too similar with a distinction of just a question mark
Then there is another example, Understanding Retrieval-Augmented Generation (RAG) in AI with its similar one as Understanding Retrieval-Augmented Generation (RAG)
This is how N-Grams comes into the picture to identify similar titles on the same domain to identify the cannibalization.
Kunjal Chawhan founder of Decode Digital Market, a Digital Marketer by profession, and a Digital Marketing Niche Blogger by passion, here to share my knowledge