Compare Your Pages vs Competitors Using BERT Embeddings

Vector embeddings are all the rage in SEO right now. Since LLMs like ChatGPT and Google AI Overviews came into the picture, vector embeddings have seen overwhelming adoption across a variety of use cases.

What are Vector Embeddings?

Simply put, vector embeddings are a concept from NLP (Natural Language Processing); here we are talking about word embeddings or sentence embeddings. Words are converted into vectors, which are numerical representations of those words, and the embedding describes where each word sits in the vector space model. If a sentence consists of several words, each word is represented as a vector in that space, and the relationship of each word to the others can be visualized and understood.

This again depends on which model you are using to extract the embeddings and vectors. Some examples of models/techniques are:

  • TF-IDF
  • bert-base-uncased
  • xlm-mlm-en-2048
  • distilbert-base-uncased
  • distilroberta-base
  • albert-base-v1

[Image: word embeddings in vector space]

This illustration from Medium explains how words are converted to vector embeddings and how they relate to each other in the vector space model.
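
To make the idea concrete, here is a minimal sketch using TF-IDF (the first technique in the list above) from scikit-learn; the sentences are made up purely for illustration. Sentences about the same topic end up close together in the vector space, while the unrelated one does not.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy sentences, made up for illustration
sentences = [
    "data analytics helps businesses make decisions",
    "business decisions improve with data analytics",
    "the cat sat on the mat",
]

# Each sentence becomes a vector; each dimension corresponds to a term
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(sentences)

# Pairwise similarity: the two analytics sentences score higher with each other
print(cosine_similarity(vectors))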

Now that we have understood what vector embeddings are, let's understand how they relate to SEO.

In SEO, we deal a lot with content: we try to rank webpages by optimizing their textual content. That same textual content is converted into vector embeddings, which helps Google understand the content. In fact, the same concept was mentioned in the Google algorithm leaks.

Another way this relates to SEO is Google AIO. Google AI Overviews are essentially RAG-based (retrieval-augmented generation) information retrieval, which relies on vector embeddings: the information is stored in vector databases, and AIO answers are generated through RAG.
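
For intuition only, here is a heavily simplified sketch of what RAG-style retrieval looks like: a couple of made-up "documents" stand in for a vector database, the query and documents are embedded with bert-base-uncased, and the closest document is retrieved by cosine similarity. Production systems use proper vector databases and a generation step on top; this is not meant to represent how Google actually implements AIO.

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    # Mean-pool BERT's last hidden state into one vector per text
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Made-up mini "knowledge base" standing in for a vector database
documents = [
    "Data analytics is the process of examining data to draw conclusions.",
    "PageRank measures the importance of web pages based on links.",
]
doc_vectors = np.vstack([embed(d) for d in documents])

# Embed the query and retrieve the most similar document
query_vector = embed("what is data analytics").reshape(1, -1)
scores = cosine_similarity(query_vector, doc_vectors)[0]
print(documents[int(scores.argmax())])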

Moreover, with GEO (Generative Engine Optimization), which is where the industry is gravitating, it is all the more important to understand and apply vector embeddings. Various studies have shown that top-ranking pages tend to have strong vector similarity to one another.

When it comes to Google Search rankings or Google AIO, it is also important to note that vector embeddings are just one factor. Other factors, such as PageRank and the authoritativeness of the domain (judged by the quality of the domains linking to the website), are also taken into account.

What will we cover in this post?

We will understand how to scrape the body text of a list of URLs that rank in the SERP (we will use Advertools for this), how to convert the body text into vector embeddings (using the bert-base-uncased model), and how to compare the vector embeddings of one page with those of another.
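
The comparison step relies on cosine similarity, which measures how closely two vectors point in the same direction: a score near 1 means very similar, and lower scores mean more distant. A tiny example with made-up three-dimensional vectors:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up vectors standing in for page embeddings
a = np.array([[0.2, 0.8, 0.1]])
b = np.array([[0.3, 0.7, 0.2]])

print(cosine_similarity(a, b))  # close to 1: the "pages" point in a similar direction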

I will be sharing the entire Python script that will help you achieve this.

Step 1 – Installations

				
!pip install advertools transformers torch pandas scikit-learn plotly

Step 2 – Import the necessary libraries

				
import advertools as adv
import pandas as pd
from transformers import AutoTokenizer, AutoModel
import torch
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import plotly.graph_objects as go

Step 3 – Define the functions

				
def extract_body_text(urls):
    # Crawl the given URLs with Advertools (without following links) and read the results back
    output_file = 'scraped_data.jl'
    adv.crawl(urls, output_file, follow_links=False)
    df = pd.read_json(output_file, lines=True)
    return df[['url', 'body_text']]

def get_bert_embedding(text, model, tokenizer):
    # Tokenize the text (truncated to BERT's 512-token limit) and mean-pool the
    # last hidden state into a single embedding vector for the whole page
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

def plot_similarity_matrix(similarity_df):
    # Render the pairwise similarities as an interactive Plotly heatmap
    fig = go.Figure(data=go.Heatmap(
        z=similarity_df.values,
        x=similarity_df.columns,
        y=similarity_df.index,
        hoverongaps=False,
        hovertemplate='URL1: %{y}<br>URL2: %{x}<br>Similarity: %{z:.4f}<extra></extra>',
        colorscale='Blues',
        zmin=0.99,  # colour scale floor; lower this if your pages are less similar
        zmax=1
    ))

    fig.update_layout(
        title='Cosine Similarity between URLs',
        xaxis_title='URLs',
        yaxis_title='URLs',
        width=1000,
        height=800
    )

    fig.show()

def main(urls):
    # Extract body text from URLs
    print("Extracting body text from URLs...")
    df = extract_body_text(urls)

    # Load BERT model and tokenizer
    print("Loading BERT model and tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    # Generate embeddings
    print("Generating embeddings...")
    embeddings = df['body_text'].apply(lambda x: get_bert_embedding(x, model, tokenizer))

    # Calculate cosine similarity
    print("Calculating cosine similarity...")
    similarity_matrix = cosine_similarity(np.vstack(embeddings))

    # Create a DataFrame for the similarity matrix
    similarity_df = pd.DataFrame(similarity_matrix, index=df['url'], columns=df['url'])

    # Plot the similarity matrix
    plot_similarity_matrix(similarity_df)

    return similarity_df
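
Before running the full pipeline on real URLs, a quick optional sanity check of get_bert_embedding on two short strings can confirm everything is wired up: each text should come back as a 768-dimensional vector, and near-identical texts should score close to 1. The test sentences below are made up for the check.

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Two near-identical test sentences
v1 = get_bert_embedding("Data analytics turns raw data into insights.", model, tokenizer)
v2 = get_bert_embedding("Data analytics converts raw data into insights.", model, tokenizer)

print(v1.shape)                       # (768,) for bert-base-uncased
print(cosine_similarity([v1], [v2]))  # should be close to 1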
				
			

Step 4 – Specify URLs & Run the main script

				
# Example usage
urls = [
    "https://aws.amazon.com/what-is/data-analytics/",
    "https://www.investopedia.com/terms/d/data-analytics.asp",
    "https://www.coursera.org/articles/data-analytics",
    "https://www.techtarget.com/searchdatamanagement/definition/data-analytics",
    "https://www.comptia.org/content/guides/what-is-data-analytics",
    "https://www.alteryx.com/glossary/data-analytics",
    "https://www.sciencedirect.com/topics/social-sciences/data-analytics"
]

similarity_matrix = main(urls)
print("Cosine Similarity Matrix:")
print(similarity_matrix)

Cosine Similarity Matrix

[Image: cosine similarity matrix of the top competitors]

This is how the cosine similarity matrix is visualized across the URLs.

On hover, we can see the similarity between the web pages. Just to note, the top rankers here are AWS, Investopedia, and Coursera.

Between AWS and Investopedia, you will notice that the similarity score is 0.942, which is a very close match.

ScienceDirect doesn't rank on the first page of Google, and rightly so: when you compare the cosine similarity between ScienceDirect and the top competitors, the score indicates the wide distance between them.

Here we can see that the similarity score is 0.66, whereas in the first example it was 0.94 between different URLs.

Takeaway

We can use this script to analyze how close or distant we are from the top-ranking documents and try to bridge the gaps we see, which should be rewarded with better rankings and a higher likelihood of appearing as an answer in LLMs.
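
As a small follow-up (a sketch, not part of the script above), if your own page is included in the urls list, you can pull its column out of the returned matrix to see which competitors you are closest to, and export the matrix for reporting. The URL below is a placeholder; replace it with your page.

# Placeholder: replace with your own page's URL (it must be in the urls list)
my_url = "https://www.example.com/data-analytics/"

# Competitors sorted from most to least similar to your page
gaps = similarity_matrix[my_url].drop(my_url).sort_values(ascending=False)
print(gaps)

# Save the full matrix for reporting
similarity_matrix.to_csv("similarity_matrix.csv")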
