Python Script to Spot Similar Meta Titles & Prevent Cannibalization

Cannibalization in SEO can be handled proactively or reactively. The proactive approach is to analyze your existing pages before launching a new one, to make sure the new page will not cannibalize a page you already have.

The reactive approach applies when you take on a website that already has hundreds of thousands of pages, many of them similar. If that is your situation, this post is for you.

In this post, I share a Python script that runs an N-gram analysis on meta titles to flag titles that are very similar to each other and are therefore likely to be cannibalizing each other.

What are N-Grams though?

N-grams are a concept from NLP (Natural Language Processing): an N-gram is a contiguous sequence of N items, usually words, taken from a document. N-grams can be used for text classification, clustering, and identifying trends, among other things.

Simply put, N-grams break text into grams, where a gram is a single token: usually a word (including one-letter words such as "a"), though the same idea also works on individual characters.

This diagram explains N-Grams really well.

n-gram-example-1
Source: AIML

N-grams are not limited to trigrams. They can extend to 4-grams, 5-grams, and so on, which is why they are called N-grams: N is the number of grams in each sequence.

In practice, bigrams and trigrams are the most commonly used in analysis.
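To make the idea concrete, here is a minimal sketch of how word-level unigrams, bigrams, and trigrams can be generated. The function name `word_ngrams` and the sample sentence are illustrative assumptions, not part of the script shared below.

```python
from typing import List, Tuple

def word_ngrams(text: str, n: int) -> List[Tuple[str, ...]]:
    """Return the contiguous word n-grams of text, in order of appearance."""
    words = text.split()
    # Slide a window of n words across the text; too-short texts yield an empty list
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Hypothetical sample sentence, used only for illustration
sentence = "python script for seo"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)

print(bigrams)   # [('python', 'script'), ('script', 'for'), ('for', 'seo')]
print(trigrams)  # [('python', 'script', 'for'), ('script', 'for', 'seo')]
```

The same sliding-window idea applies to characters instead of words: replace the list of words with the string itself.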

Python Script to Run N-Gram Analysis on Titles to identify Similar Titles

				
import csv
from collections import Counter
from itertools import combinations
from typing import List

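def n_gram_similarity(title1: str, title2: str, n: int) -> float:
    """Calculate the character n-gram similarity between two strings."""
    # Zipping n shifted copies of a string yields its character n-grams
    ngrams1 = Counter(zip(*[title1[i:] for i in range(n)]))
    ngrams2 = Counter(zip(*[title2[i:] for i in range(n)]))
    union = sum((ngrams1 | ngrams2).values())
    # Titles shorter than n characters have no n-grams; avoid dividing by zero
    if union == 0:
        return 0.0
    # Shared n-grams divided by all n-grams (Jaccard similarity on counts)
    return sum((ngrams1 & ngrams2).values()) / union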

def find_similar_titles(titles: List[str], threshold: float, n: int) -> List[List[str]]:
    """Find similar titles based on the n-gram similarity score."""
    similar_titles = []
    for title1, title2 in combinations(titles, 2):
        score = n_gram_similarity(title1, title2, n)
        if score >= threshold:
            similar_titles.append([title1, title2, score])
    return similar_titles

def main(input_file: str, output_file: str):
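    """Read titles (one per line) from input file and write similar titles to output file."""
    with open(input_file, encoding='utf-8') as f:
        # Skip blank lines so they are not compared as empty titles
        titles = [line.strip() for line in f if line.strip()]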
    similar_titles = find_similar_titles(titles, threshold=0.5, n=3)
    with open(output_file, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Title 1', 'Title 2', 'N-gram Similarity'])
        writer.writerows(similar_titles)

if __name__ == '__main__':
    input_file = 'titles.txt'
    output_file = 'similar_titles.csv'
    main(input_file, output_file)

				
			

Let’s see the Script in Action

I ran the script, and the screenshot below shows the CSV output I received.

ngram similar titles

As the results show, the script does a good job of identifying similar titles. Sort the N-gram Similarity column in descending order (Z-A) to bring the most similar title pairs to the top.
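If you would rather sort in Python than in a spreadsheet, a rough sketch could look like this. The helper names `sort_rows_by_score` and `sort_csv` are hypothetical, the sample rows are placeholders rather than real results, and the code assumes the score sits in the third column, as in the script's output.

```python
import csv
from typing import List

def sort_rows_by_score(rows: List[List[str]]) -> List[List[str]]:
    """Sort data rows by the similarity score in the third column, highest first."""
    return sorted(rows, key=lambda row: float(row[2]), reverse=True)

def sort_csv(input_file: str, output_file: str) -> None:
    """Rewrite a similar-titles CSV with its rows sorted by descending score."""
    with open(input_file, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        header = next(reader)               # keep the header row on top
        rows = [row for row in reader if row]  # ignore empty lines
    with open(output_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(sort_rows_by_score(rows))

# Placeholder rows with made-up scores, only to show the ordering
sample_rows = [
    ['Title A', 'Title B', '0.62'],
    ['Title C', 'Title D', '0.97'],
    ['Title E', 'Title F', '0.81'],
]
sorted_rows = sort_rows_by_score(sample_rows)
print([row[2] for row in sorted_rows])  # ['0.97', '0.81', '0.62']
```

Calling `sort_csv('similar_titles.csv', 'similar_titles_sorted.csv')` would then write a sorted copy of the script's output; the second file name is just an example.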

For example, there is a page titled "What is Retrieval Augmented Generation" and another titled "What is Retrieval Augmented Generation?". These are far too similar: the only difference is a question mark.

Another example is "Understanding Retrieval-Augmented Generation (RAG) in AI" and its near-duplicate, "Understanding Retrieval-Augmented Generation (RAG)".

This is how N-grams help you identify similar titles on the same domain and, in turn, spot likely cannibalization.
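To see why pairs like these score so high, here is a small self-contained sketch that applies the same character-trigram calculation to both examples. The function name `char_ngram_similarity` is an assumption for illustration; it is meant to mirror the script's scoring, not replace it, and the scores it prints are not taken from the screenshot.

```python
from collections import Counter

def char_ngram_similarity(title1: str, title2: str, n: int = 3) -> float:
    """Shared character n-grams divided by all character n-grams (Jaccard on counts)."""
    ngrams1 = Counter(title1[i:i + n] for i in range(len(title1) - n + 1))
    ngrams2 = Counter(title2[i:i + n] for i in range(len(title2) - n + 1))
    union = sum((ngrams1 | ngrams2).values())
    # Strings shorter than n characters have no n-grams, so report no similarity
    return sum((ngrams1 & ngrams2).values()) / union if union else 0.0

a = "What is Retrieval Augmented Generation"
b = "What is Retrieval Augmented Generation?"
score_question_mark = char_ngram_similarity(a, b)
# a has 36 trigrams; b has the same 36 plus "on?", so the score is 36/37
print(round(score_question_mark, 3))  # 0.973

c = "Understanding Retrieval-Augmented Generation (RAG) in AI"
d = "Understanding Retrieval-Augmented Generation (RAG)"
score_suffix = char_ngram_similarity(c, d)
# d is a prefix of c, so every trigram of d is shared and only the " in AI" tail differs
print(round(score_suffix, 3))
```

Both scores sit well above the script's 0.5 threshold, which is why these pairs show up in the output.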
