Vector embeddings are all the rage in SEO right now. Since LLMs like ChatGPT & features like Google AI Overviews (AIO) came into the picture, vector embeddings have seen overwhelming adoption across a variety of use cases.
What are Vector Embeddings?
Simply put, vector embeddings are a concept from NLP (Natural Language Processing); in this context we usually mean word embeddings or sentence embeddings. Words are converted into vectors, i.e. numerical representations of those words, and the embedding describes where each word sits in the vector space model. If a sentence consists of several words, each word is represented as a vector in that space, and the relationship of each word to the others can be visualized or understood from how close their vectors are to one another.
How the vectors are produced depends on which model you use to extract the embeddings. Some examples of models/techniques are:
- TF-IDF
- BERT base uncased
- xlm-mlm-en-2048
- distilbert-base-uncased
- distilroberta-base
- albert-base-v1
This illustration from Medium explains how words are converted to vector embeddings & how they relate to each other in the vector space model.
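To make this concrete before we touch a real model, here is a minimal sketch with made-up 3-dimensional vectors (real embeddings, such as those from BERT base, have 768 dimensions): words with related meanings end up with a high cosine similarity, while unrelated words score low.

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy vectors for illustration only - real embeddings come from a trained model
word_vectors = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "queen": np.array([0.78, 0.70, 0.12]),
    "apple": np.array([0.05, 0.10, 0.90]),
}

# Pairwise cosine similarity between the word vectors
similarity = cosine_similarity(np.vstack(list(word_vectors.values())))
print(pd.DataFrame(similarity, index=word_vectors.keys(), columns=word_vectors.keys()))

In this toy example, "king" & "queen" score close to 1 while "apple" sits much further away, which is exactly the kind of relationship the vector space model captures.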
Now that we have understood what vector embeddings are, let's look at how they relate to SEO.
In SEO, we deal a lot with content: we try to rank webpages by optimizing their textual content. That same textual content is converted into vector embeddings, which helps Google understand the content. In fact, the same concept was mentioned in the Google algorithm leaks.
Another way this relates to SEO is Google AIO. Google AI Overviews are essentially RAG-based (Retrieval-Augmented Generation) information retrieval, which relies on vector embeddings: the information is stored in vector databases, and AIO answers are generated through RAG. A simplified sketch of that retrieval step is shown below.
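This sketch is purely illustrative (it is not Google's actual implementation). The document embeddings could come from any embedding model, including the get_bert_embedding function defined later in this post; the query is embedded with the same model, the closest documents by cosine similarity are retrieved, and those passages are handed to the LLM as context for the generated answer.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query_embedding, doc_embeddings, doc_texts, top_k=3):
    # Score the query against every stored document embedding
    scores = cosine_similarity(query_embedding.reshape(1, -1), doc_embeddings)[0]
    # Keep the highest-scoring documents; these become the context passed to the LLM
    best = np.argsort(scores)[::-1][:top_k]
    return [(doc_texts[i], float(scores[i])) for i in best]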
Moreover, with GEO (Generative Engine Optimization), which is where the industry is gravitating, it's all the more important to understand vector embeddings & put that understanding to work. Various studies have shown strong vector similarity among top-ranking pages.
When it comes to Google Search rankings or Google AIO, it's also important to note that vector embeddings are only one of the factors. Google also takes into account signals like PageRank & the authoritativeness of the domain, judged by the quality of the domains that link to the website.
What will we cover in this post?
We will look at how to scrape the body text of a list of URLs that are ranking in the SERP (we will use Advertools for this), how to convert the body text into vector embeddings (using the BERT base uncased model), and how to compare the vector embeddings of one page with another.
I will be sharing the entire Python Script that will help you achieve this.
Step 1 – Installations
!pip install advertools transformers torch pandas scikit-learn plotly
Step 2 – Import necessary Libraries
import advertools as adv
import pandas as pd
from transformers import AutoTokenizer, AutoModel
import torch
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import plotly.graph_objects as go
Step 3 – Define the functions
def extract_body_text(urls):
    output_file = 'scraped_data.jl'
    # Crawl only the listed URLs (no link following) and store the results as JSON Lines
    adv.crawl(urls, output_file, follow_links=False)
    df = pd.read_json(output_file, lines=True)
    return df[['url', 'body_text']]
def get_bert_embedding(text, model, tokenizer):
    # Tokenize and truncate to BERT's 512-token limit
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into a single vector for the page
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
def plot_similarity_matrix(similarity_df):
    fig = go.Figure(data=go.Heatmap(
        z=similarity_df.values,
        x=similarity_df.columns,
        y=similarity_df.index,
        hoverongaps=False,
        hovertemplate='URL1: %{y}<br>URL2: %{x}<br>Similarity: %{z:.4f}<extra></extra>',
        colorscale='Blues',
        zmin=0.99,
        zmax=1
    ))
    fig.update_layout(
        title='Cosine Similarity between URLs',
        xaxis_title='URLs',
        yaxis_title='URLs',
        width=1000,
        height=800
    )
    fig.show()
def main(urls):
    # Extract body text from URLs
    print("Extracting body text from URLs...")
    df = extract_body_text(urls)

    # Load BERT model and tokenizer
    print("Loading BERT model and tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    # Generate embeddings
    print("Generating embeddings...")
    embeddings = df['body_text'].apply(lambda x: get_bert_embedding(x, model, tokenizer))

    # Calculate cosine similarity
    print("Calculating cosine similarity...")
    similarity_matrix = cosine_similarity(np.vstack(embeddings))

    # Create a DataFrame for the similarity matrix
    similarity_df = pd.DataFrame(similarity_matrix, index=df['url'], columns=df['url'])

    # Plot the similarity matrix
    plot_similarity_matrix(similarity_df)

    return similarity_df
Step 4 – Specify URLs & Run the main script
# Example usage
urls = [
    "https://aws.amazon.com/what-is/data-analytics/",
    "https://www.investopedia.com/terms/d/data-analytics.asp",
    "https://www.coursera.org/articles/data-analytics",
    "https://www.techtarget.com/searchdatamanagement/definition/data-analytics",
    "https://www.comptia.org/content/guides/what-is-data-analytics",
    "https://www.alteryx.com/glossary/data-analytics",
    "https://www.sciencedirect.com/topics/social-sciences/data-analytics"
]
similarity_matrix = main(urls)
print("Cosine Similarity Matrix:")
print(similarity_matrix)
Cosine Similarity Matrix
This is how the cosine similarity matrix across the URLs is visualized.
On hover, we can see the similarity between the web pages. Just to note: the top rankers here are AWS, Investopedia & Coursera.
Between AWS & Investopedia, you will notice that the similarity score is 0.942, meaning the two pages are very closely matched.
ScienceDirect doesn't rank on the first page of Google. And, fittingly, when you compare the cosine similarity between ScienceDirect & the top competitors, the score reflects how far apart they are.
Here we can see that the similarity score is 0.66, whereas in the first example it was 0.94.
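If you prefer reading the numbers directly instead of hovering over the heatmap, the matrix returned by main() is a regular pandas DataFrame indexed by URL, so any pair can be looked up like this (assuming the crawled URLs match the input URLs; redirects could change the index labels):

aws = "https://aws.amazon.com/what-is/data-analytics/"
investopedia = "https://www.investopedia.com/terms/d/data-analytics.asp"
sciencedirect = "https://www.sciencedirect.com/topics/social-sciences/data-analytics"

print(similarity_matrix.loc[aws, investopedia])   # the closely matched top rankers
print(similarity_matrix.loc[aws, sciencedirect])  # the more distant page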
Takeaway
We can use this script to analyze how close or distant our pages are from the top-ranking documents & try to bridge the gap we see. That should be rewarded with better rankings & a higher likelihood of appearing as an answer in LLMs.
Kunjal Chawhan, founder of Decode Digital Market, a digital marketer by profession and a digital marketing niche blogger by passion, here to share my knowledge.
Can we adjust this script so it runs evaluating a site's sitemap?
Yes, it can be adjusted for that. We would need to make some changes to the script to handle it, & if we take this route, how efficiently we can scrape the content to vectorize becomes a concern, since sitemaps can contain a very large number of URLs. A rough sketch of the adjustment is shown below.
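As a sketch of that adjustment: Advertools can read an XML sitemap into a DataFrame with adv.sitemap_to_df(), and its loc column gives the list of URLs to feed into main(). The sitemap URL below is a placeholder, and you would almost certainly want to cap the number of URLs, both to keep crawl time reasonable & because the pairwise similarity matrix grows quadratically with the number of pages.

import advertools as adv

# Read the sitemap (replace with the sitemap you want to analyze)
sitemap_df = adv.sitemap_to_df("https://www.example.com/sitemap.xml")

# Cap the number of URLs to keep crawling & the similarity matrix manageable
urls = sitemap_df["loc"].dropna().head(20).tolist()

similarity_matrix = main(urls)
print(similarity_matrix)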