An Introduction To Natural Language Processing With Python For SEOs
Natural language processing (NLP) is becoming more important than ever for SEO professionals.
It is crucial to start building the skills that will prepare you for all the amazing changes happening around us.
Hopefully, this column will motivate you to get started!
We are going to learn practical NLP while building a simple knowledge graph from scratch.
As Google, Bing, and other search engines use Knowledge Graphs to encode knowledge and enrich search results, what better way to learn about them than to build one?
Specifically, we are going to extract useful facts automatically from Search Engine Journal XML sitemaps.
In order to do this and keep things simple and fast, we will pull article headlines from the URLs in the XML sitemaps.
We will extract named entities and their relationships from the headlines.
Finally, we will build a powerful knowledge graph and visualize the most popular relationships.
In the example below the relationship is "launches."
The way to read the graph is to follow the direction of the arrows: subject "launches" object.
These facts, and over a thousand more, were extracted and grouped automatically!
Let's get in on the fun.
Here is the technical plan:
I recently had an enlightening conversation with Elias Dabbas from The Media Supermarket and learned about his wonderful Python library for marketers: advertools.
Some of my old Search Engine Journal articles no longer work with newer library versions. He gave me a good idea: if I print the versions of third-party libraries now, it will be easy to get the code working in the future.
I would just need to install the versions that worked when the code fails. 🤓
```python
%%capture
!pip install advertools

import advertools as adv
print(adv.__version__)
# 0.10.6
```

We are going to download all Search Engine Journal sitemaps to a pandas data frame with two lines of code.
```python
sitemap_url = "https://www.searchenginejournal.com/sitemap_index.xml"
df = adv.sitemap_to_df(sitemap_url)
```

One cool feature in the package is that it downloaded all the linked sitemaps in the index, and we get a nice data frame.
Look how simple it is to filter articles/pages from this year. We have 1,550 articles.
df[df["lastmod"] > '2020-01-01']The advertools library has a function to break URLs within the data frame, but let's do it manually to get familiar with the process.
```python
from urllib.parse import urlparse
import re

example_url = "https://www.searchenginejournal.com/google-be-careful-relying-on-3rd-parties-to-render-website-content/376547/"
u = urlparse(example_url)
print(u)
# output -> ParseResult(scheme='https', netloc='www.searchenginejournal.com', path='/google-be-careful-relying-on-3rd-parties-to-render-website-content/376547/', params='', query='', fragment='')
```

Here we get a named tuple, ParseResult, with a breakdown of the URL components.
We are interested in the path.
We are going to use a simple regex to split it by the / and - characters.
```python
slug = re.split("[/-]", u.path)
print(slug)
# output ['', 'google', 'be', 'careful', 'relying', 'on', '3rd', 'parties', 'to', 'render', 'website', 'content', '376547', '']
```

Next, we can convert it back to a string.
headline = " ".Join(slug) print(headline)#output
' google be careful relying on 3rd parties to render website content 376547 'The slugs contain a page identifier that is useless for us. We will remove with a regex.
```python
headline = re.sub(r"\d{6}", "", headline)
print(headline)
# output ' google be careful relying on 3rd parties to render website content '

# strip whitespace at the borders
headline = headline.strip()
print(headline)
# output 'google be careful relying on 3rd parties to render website content'
```

Now that we have tested this, we can convert the code to a function and create a new column in our data frame.
```python
def get_headline(url):
    u = urlparse(url)
    if len(u.path) > 1:
        slug = re.split("[/-]", u.path)
        new_headline = re.sub(r"\d{6}", "", " ".join(slug)).strip()
        # skip author and category pages
        if not re.match("author|category", new_headline):
            return new_headline
    return ""
```

Let's create a new column named headline.
new_df["headline"] = new_df["url"].Apply(lambda x: get_headline(x))Let's explore and visualize the entities in our headlines corpus.
First, we combine them into a single text document.
```python
import spacy
from spacy import displacy

text = "\n".join([x for x in new_df["headline"].tolist() if len(x) > 0])
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.render(doc, style="ent", jupyter=True)
```

We can see some entities correctly labeled and some incorrectly labeled, like Hulu as a person.
There are also several missed entities, like Facebook and Google Display Network.
spaCy's out-of-the-box NER is not perfect and generally needs training with custom data to improve detection, but it is good enough to illustrate the concepts in this tutorial.
Building a Knowledge Graph

Now, we get to the exciting part.
Let's start by evaluating the grammatical relationships between the words in each sentence.
We do this by printing the syntactic dependency of the entities.
```python
for tok in doc[:100]:
    print(tok.text, "...", tok.dep_)
```

We are looking for subjects and objects connected by a relationship.
We will use spaCy's rule-based parser to extract subjects and objects from the headlines.
The rule can be something like this:
Extract the subject/object along with its modifiers and compound words, and also extract the punctuation marks between them.
Let's first import the libraries that we will need.
```python
from spacy.matcher import Matcher
from spacy.tokens import Span
import networkx as nx
import matplotlib.pyplot as plt
from tqdm import tqdm
```

To build a knowledge graph, the most important things are the nodes and the edges between them.
The main idea is to go through each sentence and build two lists. One with the entity pairs and another with the corresponding relationships.
We are going to borrow a couple of functions, get_entities() and get_relation(), created by data scientist Prateek Joshi.
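The functions themselves are not reproduced here, so below is a minimal sketch of what they can look like, loosely adapted from Joshi's approach; treat it as an illustrative approximation rather than his exact code. get_entities() walks a sentence's dependency parse to collect a subject and an object (plus compound words and modifiers), and get_relation() uses spaCy's Matcher to grab the ROOT verb as the predicate.

```python
def get_entities(sent):
    """Extract a [subject, object] pair from a sentence using the dependency parse."""
    ent1, ent2 = "", ""
    prefix, modifier = "", ""
    prv_tok_dep, prv_tok_text = "", ""

    for tok in nlp(sent):
        if tok.dep_ == "punct":
            continue
        # accumulate compound words (e.g., "google" + "assistant")
        if tok.dep_ == "compound":
            prefix = tok.text if prv_tok_dep != "compound" else prv_tok_text + " " + tok.text
        # accumulate modifiers (dependency tags ending in "mod")
        if tok.dep_.endswith("mod"):
            modifier = tok.text if prv_tok_dep != "compound" else prv_tok_text + " " + tok.text
        # a subject token closes out the first entity
        if "subj" in tok.dep_:
            ent1 = (modifier + " " + prefix + " " + tok.text).strip()
            prefix, modifier = "", ""
        # an object token closes out the second entity
        if "obj" in tok.dep_:
            ent2 = (modifier + " " + prefix + " " + tok.text).strip()
        prv_tok_dep, prv_tok_text = tok.dep_, tok.text

    return [ent1, ent2]


def get_relation(sent):
    """Use the ROOT verb (plus optional preposition/agent/adjective) as the predicate."""
    doc = nlp(sent)
    matcher = Matcher(nlp.vocab)
    pattern = [{"DEP": "ROOT"},
               {"DEP": "prep", "OP": "?"},
               {"DEP": "agent", "OP": "?"},
               {"POS": "ADJ", "OP": "?"}]
    # spaCy 3.x signature; on spaCy 2.x use matcher.add("relation", None, pattern)
    matcher.add("relation", [pattern])
    matches = matcher(doc)
    if not matches:
        return ""
    start, end = matches[-1][1], matches[-1][2]
    return doc[start:end].text
```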
Let's test them on 100 sentences and see what the output looks like. I added len(x) > 0 to skip empty lines.
for t in [x for x in new_df["headline"].Tolist() if len(x) > 0][:100]: print(get_entities(t))Many extractions are missing elements or are not great, but as we have so many headlines, we should be able to extract useful facts anyways.
Now, let's build the graph.
```python
entity_pairs = []

for i in tqdm([x for x in new_df["headline"].tolist() if len(x) > 0]):
    entity_pairs.append(get_entities(i))
```

Here are some example pairs.
```python
entity_pairs[10:20]
# output
# [['chrome', ''], ['google assistant', '500 million 500 users'], ['', ''],
#  ['seo metrics', 'how them'], ['google optimization', ''], ['twitter', 'new explore tab'],
#  ['b2b', 'greg finn podcast'], ['instagram user growth', 'lower levels'], ['', ''], ['', 'advertiser']]
```

Next, let's build the corresponding relationships. Our hypothesis is that the predicate is actually the main verb in a sentence.
```python
relations = [get_relation(i) for i in tqdm([x for x in new_df["headline"].tolist() if len(x) > 0])]
print(relations[10:20])
# output ['blocker', 'has', 'conversions', 'reports', 'ppc', 'rolls', 'paid', 'drops to lower', 'marketers', 'facebook']
```

Next, let's rank the relationships.
```python
import pandas as pd  # already installed as a dependency of advertools

pd.Series(relations).value_counts()[4:50]
```

Finally, let's build the knowledge graph.
```python
# extract subjects
source = [i[0] for i in entity_pairs]

# extract objects
target = [i[1] for i in entity_pairs]

kg_df = pd.DataFrame({'source': source, 'target': target, 'edge': relations})

# create a directed graph from the data frame
G = nx.from_pandas_edgelist(kg_df, "source", "target",
                            edge_attr=True, create_using=nx.MultiDiGraph())

plt.figure(figsize=(12, 12))
pos = nx.spring_layout(G)
nx.draw(G, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos=pos)
plt.show()
```

This plots a monster graph, which, while impressive, is not particularly useful.
Let's try again, but take only one relationship at a time.
In order to do this, we will create a function that takes the relationship text as input.
```python
def display_graph(relation):
    G = nx.from_pandas_edgelist(kg_df[kg_df['edge'] == relation], "source", "target",
                                edge_attr=True, create_using=nx.MultiDiGraph())
    plt.figure(figsize=(12, 12))
    pos = nx.spring_layout(G, k=0.5)  # k regulates the distance between nodes
    nx.draw(G, with_labels=True, node_color='skyblue', node_size=1500,
            edge_cmap=plt.cm.Blues, pos=pos)
    plt.show()
```

Now, when I run display_graph("launches"), I get the graph at the beginning of the article.
Here are a few more relationships that I plotted.
I created a Colab notebook with all the steps in this article and at the end, you will find a nice form with many more relationships to check out.
Just run all the code, click on the pull-down selector, and click the play button to see the graph.
Here are some resources that I found useful while putting this tutorial together.
I asked my followers to share their Python projects, and I'm excited to see how many creative ideas are coming to life from the community! 🐍🔥
More Resources:
Image Credits
All screenshots taken by author, August 2020
3 Open Source NLP Tools For Data Extraction
Unstructured text and data are like gold for business applications and the company bottom line, but where to start? Here are three tools worth a look.
Developers and data scientists use generative AI and large language models (LLMs) to query volumes of documents and unstructured data. Open source LLMs, including Dolly 2.0, EleutherAI Pythia, Meta AI LLaMa, StabilityLM, and others, are all starting points for experimenting with artificial intelligence that accepts natural language prompts and generates summarized responses.
"Text as a source of knowledge and information is fundamental, yet there aren't any end-to-end solutions that tame the complexity in handling text," says Brian Platz, CEO and co-founder of Fluree. "While most organizations have wrangled structured or semi-structured data into a centralized data platform, unstructured data remains forgotten and underleveraged."
If your organization and team aren't experimenting with natural language processing (NLP) capabilities, you're probably lagging behind competitors in your industry. In the 2023 Expert NLP Survey Report, 77% of organizations said they planned to increase spending on NLP, and 54% said their time-to-production was a top return-on-investment (ROI) metric for successful NLP projects.
Use cases for NLP

If you have a corpus of unstructured data and text, some of the most common business needs include search, question answering, summarization, and information extraction.
Sometimes, having NLP capabilities bundled into a platform or application is desirable. For example, LLMs support asking questions; AI search engines enable searches and recommendations; and chatbots support interactions. Other times, it's optimal to use NLP tools to extract information and enrich unstructured documents and text.
Let's look at three popular open source NLP tools that developers and data scientists are using to perform discovery on unstructured documents and develop production-ready NLP processing engines.
Natural Language Toolkit (NLTK)

The Natural Language Toolkit (NLTK), released in 2001, is one of the older and more popular NLP Python libraries. NLTK boasts more than 11,800 stars on GitHub and lists over 100 trained models.
"I think the most important tool for NLP is by far Natural Language Toolkit, which is licensed under Apache 2.0," says Steven Devoe, director of data and analytics at SPR. "In all data science projects, the processing and cleaning of the data to be used by algorithms is a huge proportion of the time and effort, which is particularly true with natural language processing. NLTK accelerates a lot of that work, such as stemming, lemmatization, tagging, removing stop words, and embedding word vectors across multiple written languages to make the text more easily interpreted by the algorithms."
NLTK's benefits stem from its endurance, with many examples for developers new to NLP, such as this beginner's hands-on guide and this more comprehensive overview. Anyone learning NLP techniques may want to try this library first, as it provides simple ways to experiment with basic techniques such as tokenization, stemming, and chunking.
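As a quick, hedged taste of those basics, here is what tokenization, stemming, and part-of-speech tagging can look like in NLTK; the sample sentence is invented for illustration, and the downloads are one-time setup:

```python
import nltk
from nltk.stem import PorterStemmer

# one-time downloads of the models these steps rely on
# (newer NLTK releases may ask for "punkt_tab" instead of "punkt")
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "Developers are experimenting with natural language processing tools."

tokens = nltk.word_tokenize(text)          # tokenization
print(tokens)

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])   # stemming reduces words to root forms

print(nltk.pos_tag(tokens))                # part-of-speech tagging
```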
spaCy

spaCy is a newer library, with its version 1.0 released in 2016. spaCy supports over 72 languages and publishes its performance benchmarks, and it has amassed more than 25,000 stars on GitHub.
"spaCy is a free, open-source Python library providing advanced capabilities to conduct natural language processing on large volumes of text at high speed," says Nikolay Manchev, head of data science, EMEA, at Domino Data Lab. "With spaCy, a user can build models and production applications that underpin document analysis, chatbot capabilities, and all other forms of text analysis. Today, the spaCy framework is one of Python's most popular natural language libraries for industry use cases such as extracting keywords, entities, and knowledge from text."
Tutorials for spaCy show similar capabilities to NLTK, including named entity recognition and part-of-speech (POS) tagging. One advantage is that spaCy returns document objects and supports word vectors, which can give developers more flexibility for performing additional post-NLP data processing and text analytics.
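For comparison, here is a hedged sketch of the same kind of exploration in spaCy, assuming the small English model has been installed with python -m spacy download en_core_web_sm; the example sentence is invented:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in London in 2025.")

# named entity recognition
for ent in doc.ents:
    print(ent.text, ent.label_)

# part-of-speech and dependency tags, available per token on the document object
for token in doc:
    print(token.text, token.pos_, token.dep_)
```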
Spark NLP

If you already use Apache Spark and have its infrastructure configured, then Spark NLP may be one of the faster paths to begin experimenting with natural language processing. Spark NLP has several installation options, including AWS, Azure Databricks, and Docker.
"Spark NLP is a widely used open-source natural language processing library that enables businesses to extract information and answers from free-text documents with state-of-the-art accuracy," says David Talby, CTO of John Snow Labs. "This enables everything from extracting relevant health information that only exists in clinical notes, to identifying hate speech or fake news on social media, to summarizing legal agreements and financial news.
Spark NLP's differentiators may be its healthcare, finance, and legal domain language models. These commercial products come with pre-trained models to identify drug names and dosages in healthcare, financial entity recognition such as stock tickers, and legal knowledge graphs of company names and officers.
Talby says Spark NLP can help organizations minimize the upfront training in developing models. "The free and open source library comes with more than 11,000 pre-trained models plus the ability to reuse, train, tune, and scale them easily," he says.
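As a hedged sketch of what reusing one of those pre-trained models looks like (assuming a working Spark setup; "explain_document_dl" is one of the published English pipelines):

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# start a Spark session with Spark NLP on the classpath
spark = sparknlp.start()

# load a pre-trained English pipeline
pipeline = PretrainedPipeline("explain_document_dl", lang="en")

result = pipeline.annotate("Google rolled out a search update in August.")
print(result["entities"])  # named entities found in the text
```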
Best practices for experimenting with NLP

Earlier in my career, I had the opportunity to oversee the development of several SaaS products built using NLP capabilities. My first was a SaaS platform for searching newspaper classified advertisements, covering cars, jobs, and real estate. I then led the development of NLP capabilities for extracting information from commercial construction documents, including building specifications and blueprints.
When starting NLP in a new area, I advise the following:
You may find that the NLP tools used to discover and experiment with new document types will aid in defining requirements. Then, expand the review of NLP technologies to include open source and commercial options, as building and supporting production-ready NLP data pipelines can get expensive. With LLMs in the news and gaining interest, underinvesting in NLP capabilities is one way to fall behind competitors. Fortunately, you can start with one of the open source tools introduced here and build your NLP data pipeline to fit your budget and requirements.
Top 10 Best Python Libraries For Natural Language Processing In 2025
Python is a widely used programming language, often favored in data science, and its applications extend to natural language processing (NLP). NLP is concerned with analyzing and understanding human language, a task made much easier with the support of Python libraries. This piece will explore some of the Python libraries that are particularly useful for natural language processing.
One of the most popular libraries for NLP is the Natural Language Toolkit (NLTK). It is widely considered the best Python library for NLP and is an essential tool for tasks like classification, tagging, stemming, parsing, and semantic reasoning. NLTK is often chosen by beginners looking to get involved in the fields of NLP and machine learning. Another popular library is spaCy, which is recognized as a professional-grade Python library for advanced NLP. It excels at large-scale information extraction tasks.
Understanding Natural Language Processing

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that focuses on enabling computers to understand, interpret, and generate human language. It involves the use of algorithms and statistical models to analyze and extract meaning from natural language data, including text and speech.
NLP is a rapidly growing field with numerous applications in various industries, including healthcare, finance, customer service, and marketing. Some of the common tasks in NLP include sentiment analysis, language translation, speech recognition, and text summarization.
To perform these tasks, NLP relies on a combination of rule-based and statistical approaches. Rule-based methods involve the use of predefined rules and patterns to process and analyze language data. Statistical methods, on the other hand, use machine learning algorithms to learn patterns and relationships from large datasets.
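To make the contrast concrete, here is a small, hedged sketch: a rule-based approach flags labels with a hand-written pattern, while a statistical approach learns word/label associations from labeled examples. The tiny dataset and keyword rule are invented purely for illustration:

```python
import re
from collections import Counter

# --- rule-based: a predefined pattern decides the label ---
NEGATIVE_RULE = re.compile(r"\b(terrible|awful|broke|refund)\b")

def rule_based_label(text):
    return "negative" if NEGATIVE_RULE.search(text.lower()) else "positive"

# --- statistical: learn word/label associations from labeled examples ---
train = [
    ("great product works perfectly", "positive"),
    ("terrible quality broke in a day", "negative"),
    ("absolutely love it", "positive"),
    ("awful experience would not recommend", "negative"),
]

word_counts = {"positive": Counter(), "negative": Counter()}
for text, label in train:
    word_counts[label].update(text.lower().split())

def statistical_label(text):
    # score each label by how often its training words appear in the text
    scores = {label: sum(counts[w] for w in text.lower().split())
              for label, counts in word_counts.items()}
    return max(scores, key=scores.get)

print(rule_based_label("awful battery life"))          # negative (a rule matched)
print(statistical_label("quality broke after a day"))  # negative (learned from data)
```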
Python is a popular language for NLP due to its simplicity, flexibility, and the availability of numerous libraries and frameworks. Some of the popular Python libraries for NLP include Natural Language Toolkit (NLTK), spaCy, TextBlob, Gensim, and CoreNLP.
Overall, understanding NLP is essential for anyone interested in working with natural language data. With the right tools and techniques, it is possible to extract valuable insights and knowledge from language data that can be used to improve decision-making and drive business growth.
Python and Natural Language Processing

Python is a popular programming language that has become a go-to tool for natural language processing (NLP). NLP is a field of study that focuses on the interactions between computers and humans in natural language. It involves analyzing, understanding, and generating human language with the help of algorithms and computational methods.
Python has a wide range of libraries that can be used for NLP tasks. These libraries provide many capabilities, including text processing, sentiment analysis, machine translation, and more. Some of the most popular Python libraries for NLP are NLTK, spaCy, TextBlob, Gensim, and CoreNLP.
Python's ease of use and the availability of powerful libraries make it an ideal choice for NLP tasks. With the right tools and techniques, developers can build powerful applications that can analyze and understand human language.
Best Python Libraries for Natural Language Processing

Python is one of the most popular programming languages for Natural Language Processing (NLP) tasks. With its vast collection of libraries, Python offers a wide range of tools for NLP. In this section, we will discuss the top 10 Python libraries for NLP.
1. Natural Language Toolkit (NLTK)

NLTK is widely considered the best Python library for NLP. It is an essential library that supports tasks like classification, tagging, stemming, parsing, and semantic reasoning. NLTK is suitable for all kinds of programmers, including students, educators, engineers, researchers, and industry professionals.
2. spaCy

spaCy is a free and open-source library that offers a lot of built-in capabilities for NLP. It is becoming increasingly popular for processing and analyzing data in the field of NLP. spaCy is suitable for both beginners and advanced users.
3. Gensim

Gensim is a Python library that specializes in topic modeling and similarity detection. It is easy to use and offers a wide range of functionalities for NLP tasks.
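As a hedged sketch of what topic modeling with Gensim can look like (the toy corpus below is invented for the example):

```python
from gensim import corpora, models

# a tiny toy corpus, already tokenized
texts = [["search", "engine", "ranking", "update"],
         ["google", "search", "algorithm", "update"],
         ["python", "code", "tutorial"],
         ["python", "library", "tutorial", "code"]]

dictionary = corpora.Dictionary(texts)            # map tokens to ids
corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words vectors
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=42)

for topic_id, words in lda.print_topics():
    print(topic_id, words)
```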
4. CoreNLP

CoreNLP is a library developed by Stanford University that offers a suite of natural language processing tools. It is written in Java but can be used from Python through wrappers such as Stanford's official Stanza package.
5. Pattern

Pattern is a Python library that offers a wide range of functionalities for NLP tasks, including sentiment analysis, part-of-speech tagging, and word inflection. It is suitable for both beginners and advanced users.
6. TextBlob

TextBlob is a Python library that offers a simple API for common NLP tasks, including sentiment analysis, part-of-speech tagging, and noun phrase extraction. It is suitable for beginners who want to get started with NLP.
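A hedged sketch of TextBlob's API for exactly those tasks; the sentence is invented, and noun phrase extraction may require a one-time corpora download (python -m textblob.download_corpora):

```python
from textblob import TextBlob

blob = TextBlob("Python makes natural language processing surprisingly approachable.")

print(blob.sentiment)      # Sentiment(polarity=..., subjectivity=...)
print(blob.noun_phrases)   # noun phrase extraction
print(blob.tags)           # part-of-speech tags as (word, tag) pairs
```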
7. PyNLPI

PyNLPI is a Python library that offers a wide range of functionalities for NLP tasks, including named entity recognition, sentiment analysis, and text classification. It is suitable for both beginners and advanced users.
8. scikit-learn

scikit-learn is a Python library that offers a wide range of functionalities for machine learning tasks, including NLP tasks. It is suitable for advanced users who want to build custom models for NLP tasks.
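A typical scikit-learn NLP workflow turns text into TF-IDF features and feeds them to a classifier. Here is a hedged sketch with a four-line invented dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["cheap pills, click this link now", "meeting notes attached for review",
         "win a free prize today", "quarterly report draft enclosed"]
labels = ["spam", "ham", "spam", "ham"]

# vectorizer + classifier chained in one pipeline
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["click now to win a prize"]))  # likely ['spam']
```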
9. Polyglot

Polyglot is a Python library that offers support for over 130 languages. It offers a wide range of functionalities for NLP tasks, including named entity recognition, sentiment analysis, and part-of-speech tagging.
10. PyTorch

PyTorch is a Python library that offers a wide range of functionalities for deep learning tasks, including NLP tasks. It is suitable for advanced users who want to build custom deep learning models for NLP tasks.
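As a hedged sketch of what a minimal deep-learning text model looks like in PyTorch, here is a tiny bag-of-embeddings classifier; the vocabulary size, dimensions, and token ids are arbitrary placeholders:

```python
import torch
import torch.nn as nn

class TinyTextClassifier(nn.Module):
    """Average word embeddings, then classify with a linear layer."""
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)  # mean-pools by default
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids, offsets):
        return self.fc(self.embedding(token_ids, offsets))

model = TinyTextClassifier(vocab_size=10_000, embed_dim=64, num_classes=2)

# two "documents" packed into one flat tensor; offsets mark where each starts
token_ids = torch.tensor([1, 4, 7, 2, 9])
offsets = torch.tensor([0, 3])  # doc 1 = ids[0:3], doc 2 = ids[3:]

logits = model(token_ids, offsets)
print(logits.shape)  # torch.Size([2, 2]) -> one score pair per document
```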
In conclusion, Python offers a rich ecosystem of NLP libraries, and the ten discussed in this section are among the best, covering most common NLP tasks between them.
Comparing Python NLP Libraries

When it comes to Natural Language Processing (NLP) in Python, there are several libraries available to choose from. In this section, we will compare some of the most popular NLP libraries in terms of ease of use, functionality, community support, and performance.
Ease of Use

One of the most important factors to consider when choosing an NLP library is its ease of use. Libraries that are easy to use can help developers save time and effort.
NLTK is a popular library for beginners, as it provides a lot of documentation and tutorials. spaCy is also a user-friendly library that offers pre-trained models and easy-to-use APIs. TextBlob is another library that is known for its simplicity and ease of use.
Functionality

The functionality of an NLP library is another key factor to consider. Libraries that offer a wide range of functionalities can help developers solve complex NLP problems.

spaCy is known for its high performance and advanced features, such as named entity recognition and dependency parsing. NLTK also offers a wide range of functionalities, including sentiment analysis, part-of-speech tagging, and text classification. Gensim is a library that is specifically designed for topic modeling and document similarity analysis.
Community Support

Community support is crucial when it comes to NLP libraries. Developers need to know that they can rely on the community for help and support.
NLTK has a large and active community, which provides support through forums, mailing lists, and social media. spaCy also has a growing community, with active contributors and support forums. TextBlob is a smaller library, but it has an active community that provides support through GitHub and Stack Overflow.
Performance

The performance of an NLP library can have a significant impact on the speed and accuracy of NLP applications.
spaCy is known for its high performance and speed, making it a popular choice for large-scale NLP applications. NLTK performs well too, but it can be slower than spaCy for some tasks. Gensim is designed for scalability and high performance, making it a popular choice for large-scale topic modeling.
In summary, when choosing an NLP library, developers should consider factors such as ease of use, functionality, community support, and performance. Each library has its own strengths and weaknesses, and the choice ultimately depends on the specific needs of the project.
Choosing the Right Python Library for NLP

When it comes to Natural Language Processing, choosing the right Python library can be a daunting task. With so many options available, it's essential to consider your specific needs and requirements before selecting a library.
One of the most popular libraries for NLP is the Natural Language Toolkit (NLTK). It is widely considered to be the best Python library for NLP and is an essential tool for beginners looking to get involved in the field of NLP and machine learning. NLTK supports a variety of tasks, including classification, tagging, stemming, parsing, and semantic reasoning.
Another popular library is spaCy, which is known for its speed and efficiency. It is an excellent choice for large-scale NLP projects and is particularly useful for tasks such as named entity recognition and dependency parsing.
Gensim is another library worth considering, especially if your project involves topic modeling or word embeddings. It is a robust and efficient library that supports a wide range of NLP tasks, including document similarity and text summarization.
In addition to these libraries, there are several other options available, including TextBlob and CoreNLP. TextBlob is a simple and easy-to-use library that is ideal for beginners, while CoreNLP is a more advanced library that supports a wide range of NLP tasks, including sentiment analysis and part-of-speech tagging.
Ultimately, the right Python library for your NLP project will depend on your specific needs and requirements. It's essential to consider factors such as the size and complexity of your project, your level of experience with NLP, and the specific tasks you need to perform. By carefully evaluating your options and selecting the right library, you can ensure that your NLP project is a success.
Conclusion

Natural Language Processing is a vast field that requires the use of specialized tools to process and analyze text data. Python has emerged as the go-to language for NLP due to its simplicity, versatility, and the availability of several powerful libraries.
In this article, we have explored some of the best Python libraries for Natural Language Processing. These libraries provide a wide range of functionalities, including tokenization, stemming, part-of-speech tagging, parsing, and semantic reasoning.
NLTK is widely considered the best Python library for NLP and is often chosen by beginners looking to get involved in the field. spaCy is another popular library that excels at working with large-scale information extraction tasks. Other libraries like TextBlob, Gensim, and Pattern offer unique functionalities and can be used for specific NLP tasks.
It is important to note that the selection of a library depends on the specific requirements of the project. Therefore, it is recommended to explore the features of each library and choose the one that best suits the project's needs.
Overall, Python has a vibrant NLP community, and these libraries are a testament to the language's power and flexibility. With the help of these libraries, developers can build sophisticated NLP applications that can understand human language and provide valuable insights.
Frequently Asked Questions

What are some popular open-source NLP libraries in Python?

Python has a wide range of open-source NLP libraries, including Natural Language Toolkit (NLTK), spaCy, TextBlob, Gensim, Pattern, and Stanford NLP. These libraries provide a range of functionalities, from tokenization and parsing to sentiment analysis and topic modeling.
Which Python library is widely considered the most comprehensive for NLP?

NLTK is widely considered the most comprehensive Python library for NLP. It is an essential library that supports tasks like classification, tagging, stemming, parsing, and semantic reasoning. It also provides a range of datasets and resources that can be used for training and testing NLP models.
Are there any free Python libraries for NLP?

Yes, there are several free and open-source Python libraries for NLP, including NLTK, spaCy, TextBlob, and Gensim. These libraries can be easily installed using pip and provide a range of functionalities for NLP tasks.
What are some advantages of using NLTK for NLP?

NLTK has several advantages for NLP, including its comprehensive set of tools and resources, its user-friendly interface, and its active community of developers and users.
Can Python be used for advanced NLP tasks?

Yes, Python can be used for advanced NLP tasks, including sentiment analysis, named entity recognition, and topic modeling. Python libraries like NLTK, spaCy, and Gensim provide a range of functionalities for these tasks and can be easily integrated into NLP pipelines.
What are some examples of NLP applications that can be developed using Python libraries?

Python libraries can be used to develop a range of NLP applications, including chatbots, sentiment analysis tools, text summarization tools, and recommendation systems. These applications can be used in a range of industries, from e-commerce to healthcare to finance.
