Last updated on September 22nd, 2024 at 06:19 am
Wikipedia has covered pretty much every topic your mind can conceive. Another thing Wikipedia is good at is internally linking from one page to another.
Wouldn’t it be great to have this corpus of knowledge as a reference point?
While doing some research, I found that there is a Wikipedia API Python library (wikipedia-api) whose functions let you pull Wikipedia's data and visualize it as a reference point. 🤯
There is a lot that can be achieved with it. In this post, I will cover how you can use the Wikipedia API Python library to visualize the topical map for a given Wikipedia page: how it branches out from that page and connects to various topics.
For the visualization, this Python script uses a Plotly Sankey chart.
Here is the script without further ado!
Step 1 – Install the necessary libraries
!pip install wikipedia-api plotly nltk
This line of code installs three libraries: wikipedia-api, Plotly and NLTK.
NLTK, an NLP Python library, handles the data cleaning. It lets us identify and remove stopwords, which improves the output quality.
Step 2 – Imports & Functions
import wikipediaapi
import plotly.graph_objects as go
from urllib.parse import unquote
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk

# Download necessary NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # tokenizer data that newer NLTK releases look for
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return set(tokens)

def calculate_relevance(source_title, target_title):
    source_tokens = preprocess_text(source_title)
    target_tokens = preprocess_text(target_title)
    common_tokens = source_tokens.intersection(target_tokens)
    largest = max(len(source_tokens), len(target_tokens))
    # Guard against titles with no usable tokens (e.g. "2008"), which would divide by zero
    return len(common_tokens) / largest if largest else 0.0

def is_valid_page(title):
    excluded_prefixes = [
        "Wikipedia talk:",
        "Talk:",
        "User:",
        "User talk:",
        "Category:",
        "Template:",
        "Help:",
        "File:"
    ]
    return not any(title.startswith(prefix) for prefix in excluded_prefixes)

def get_page_links(page_url, depth=5, max_links=2, relevance_threshold=0.1):
    title = unquote(page_url.split("/")[-1].replace("_", " "))
    wiki = wikipediaapi.Wikipedia('MyProjectName (merlin@example.com)', 'en')

    def fetch_links(page_title, current_depth):
        page = wiki.page(page_title)
        if not page.exists():
            return {}
        links = list(page.links.keys())
        relevant_links = [
            link for link in links
            if calculate_relevance(page_title, link) >= relevance_threshold
            and is_valid_page(link)
        ]
        relevant_links = relevant_links[:max_links]
        result = {page_title: relevant_links}
        if current_depth < depth:
            for link in relevant_links:
                result.update(fetch_links(link, current_depth + 1))
        return result

    return fetch_links(title, 0)

def create_sankey_data(links_dict):
    nodes = list(links_dict.keys())
    for sublinks in links_dict.values():
        nodes.extend(sublinks)
    nodes = list(dict.fromkeys(nodes))  # Remove duplicates while preserving order
    node_indices = {node: i for i, node in enumerate(nodes)}
    source = []
    target = []
    value = []
    for page, sublinks in links_dict.items():
        for sublink in sublinks:
            source.append(node_indices[page])
            target.append(node_indices[sublink])
            value.append(1)
    return nodes, source, target, value

def create_sankey_chart(nodes, source, target, value):
    fig = go.Figure(data=[go.Sankey(
        node=dict(
            pad=15,
            thickness=20,
            line=dict(color="black", width=0.5),
            label=nodes,
            color="blue"
        ),
        link=dict(
            source=source,
            target=target,
            value=value
        ))])
    fig.update_layout(title_text="Semantically Filtered Wikipedia Page Links Sankey Diagram (Excluding Talk Pages)", font_size=10)
    return fig

def main(page_url):
    links_dict = get_page_links(page_url)
    nodes, source, target, value = create_sankey_data(links_dict)
    fig = create_sankey_chart(nodes, source, target, value)
    fig.show()

if __name__ == "__main__":
    page_url = "https://en.wikipedia.org/wiki/Insurance_fraud"
    main(page_url)
In the above code block, we import the libraries and use NLTK to remove stopwords and lemmatize page titles. Relevance is then scored as the number of tokens two titles share, divided by the size of the larger token set.
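To make the relevance score concrete, here is a minimal, dependency-free sketch of the same token-overlap idea. It is an illustration only: the tiny `STOP_WORDS` set and the `simple_tokens` and `overlap_score` helpers are hypothetical stand-ins for NLTK's stopword list, tokenizer and lemmatizer, so scores from the real script may differ.

```python
import re

# Hypothetical, tiny stand-in for NLTK's English stopword list.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def simple_tokens(title):
    """Lowercase a title, keep alphabetic words, and drop stopwords."""
    words = re.findall(r"[a-z]+", title.lower())
    return {word for word in words if word not in STOP_WORDS}

def overlap_score(source_title, target_title):
    """Shared tokens divided by the size of the larger token set."""
    source = simple_tokens(source_title)
    target = simple_tokens(target_title)
    largest = max(len(source), len(target))
    if largest == 0:
        return 0.0  # neither title has usable tokens
    return len(source & target) / largest

print(overlap_score("Insurance fraud", "Insurance"))
print(overlap_score("Insurance fraud", "Fraud in the United States"))
print(overlap_score("Insurance fraud", "Money laundering"))
```

Under this simplified scoring, the first two titles clear the script's default `relevance_threshold=0.1` and the third does not.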
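However, NLTK wasn't enough, which is why the is_valid_page function also excludes pages whose titles start with any of the following prefixes: Wikipedia talk:, Talk:, User:, User talk:, Category:, Template:, Help: and File:.
The crawl itself is controlled by the parameters in this function signature:
def get_page_links(page_url, depth=5, max_links=2, relevance_threshold=0.1):
Here depth sets how many levels of links are followed, max_links caps how many links are kept per page, and relevance_threshold is the minimum relevance score a link needs to be kept.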
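As a rough guide to how these parameters bound the crawl, the sketch below computes the worst-case number of pages fetched and nodes drawn for a given `depth` and `max_links`. The `crawl_upper_bound` helper is hypothetical (it is not part of the script), and it assumes every page yields the full `max_links` relevant links with no repeats, so real runs will usually be smaller.

```python
def crawl_upper_bound(depth=5, max_links=2):
    """Worst-case page fetches and chart nodes for the recursive crawl."""
    # Pages are fetched at recursion levels 0..depth; level k holds
    # at most max_links ** k pages.
    pages_fetched = sum(max_links ** level for level in range(depth + 1))
    # Pages at the deepest level still list up to max_links links each;
    # those links appear as leaf nodes but are never fetched themselves.
    leaf_nodes = max_links ** (depth + 1)
    return pages_fetched, pages_fetched + leaf_nodes

pages, nodes = crawl_upper_bound()
print(pages, nodes)
```

With the defaults, that is at most 63 page fetches and 127 nodes. Raising `max_links` to 3 lifts the ceiling to 364 fetches, which is why small values keep the run time manageable.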
In the resulting chart, you can drag the nodes and hover over them to see how they relate to other nodes. Hovering also displays each node's incoming and outgoing link flow.
The entire script takes a few minutes to run, after which you can easily observe the topical map from a bird's-eye view.
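The flows shown on hover come from three parallel lists that Plotly's Sankey trace takes: `source`, `target` and `value`, where the first two hold indices into the list of node labels. The sketch below repeats the index-mapping logic of `create_sankey_data` under a different name, `build_sankey_lists`, on a made-up, hand-written links dictionary (the page titles are illustrative, not real crawl output), so you can see what gets passed to the chart.

```python
def build_sankey_lists(links_dict):
    """Turn {page: [linked pages]} into node labels and index-based links."""
    nodes = list(links_dict.keys())
    for sublinks in links_dict.values():
        nodes.extend(sublinks)
    nodes = list(dict.fromkeys(nodes))  # de-duplicate, keep first-seen order
    index = {node: i for i, node in enumerate(nodes)}
    source, target, value = [], [], []
    for page, sublinks in links_dict.items():
        for sublink in sublinks:
            source.append(index[page])
            target.append(index[sublink])
            value.append(1)  # every link carries the same weight
    return nodes, source, target, value

# Made-up sample data for illustration only.
sample = {
    "Insurance fraud": ["Insurance", "Fraud"],
    "Insurance": ["Insurance policy"],
    "Fraud": [],
    "Insurance policy": [],
}
nodes, source, target, value = build_sankey_lists(sample)
print(nodes)
print(source, target, value)
```

Because every link has a value of 1, the flow reported for a node is simply the count of links entering or leaving it.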
SEO Use Case:
Take a core topic page that you have been struggling to rank, find its Wikipedia page, and see how the topical map emerges from that page. This will help you spot any gaps in your own topical map that you need to address.
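One simple way to turn the map into a gap list is a set difference between the node titles from the chart and the topics your site already covers. The sketch below is an assumption about how you might do this, not part of the original script: `find_topic_gaps` and both sample lists are hypothetical, and the comparison is a plain case-insensitive title match.

```python
def find_topic_gaps(wikipedia_nodes, covered_topics):
    """Return Wikipedia node titles that have no matching covered topic."""
    covered = {topic.strip().lower() for topic in covered_topics}
    return [node for node in wikipedia_nodes if node.strip().lower() not in covered]

# Hypothetical inputs: node labels from the chart and topics already on your site.
wikipedia_nodes = ["Insurance fraud", "Insurance", "Fraud", "Insurance policy"]
covered_topics = ["insurance fraud", "Insurance Policy"]

print(find_topic_gaps(wikipedia_nodes, covered_topics))
```

An exact title match is deliberately strict; in practice you may want to map several of your pages to one Wikipedia topic by hand before comparing.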
Kunjal Chawhan founder of Decode Digital Market, a Digital Marketer by profession, and a Digital Marketing Niche Blogger by passion, here to share my knowledge