Types of AI Algorithms and How They Work




An Introduction To Natural Language Processing With Python For SEOs

Natural language processing (NLP) is becoming more important than ever for SEO professionals.

It is crucial to start building the skills that will prepare you for all the amazing changes happening around us.

Hopefully, this column will motivate you to get started!

We are going to learn practical NLP while building a simple knowledge graph from scratch.

As Google, Bing, and other search engines use Knowledge Graphs to encode knowledge and enrich search results, what better way to learn about them than to build one?

Specifically, we are going to extract useful facts automatically from Search Engine Journal XML sitemaps.

In order to do this and keep things simple and fast, we will pull article headlines from the URLs in the XML sitemaps.

We will extract named entities and their relationships from the headlines.

Finally, we will build a powerful knowledge graph and visualize the most popular relationships.

In the example below the relationship is "launches."

The way to read the graph is to follow the direction of the arrows: subject "launches" object.

For example:

  • "Bing launches 19 tracking map", which is likely "Bing launches covid-19 tracking map."
  • Another is "Snapchat launches ads certification program."
    These facts and over a thousand more were extracted and grouped automatically!

    Let's get in on the fun.

    Here is the technical plan:

  • We will fetch all Search Engine Journal XML sitemaps.
  • We will parse the URLs to extract the headlines from the slugs.
  • We will extract entity pairs from the headlines.
  • We will extract the corresponding relationships.
  • We will build a knowledge graph and create a simple form in Colab to visualize the relationships we are interested in.
    Fetching All Search Engine Journal XML Sitemaps

    I recently had an enlightening conversation with Elias Dabbas from The Media Supermarket and learned about his wonderful Python library for marketers: advertools.

    Some of my old Search Engine Journal articles are not working with the newer library versions. He gave me a good idea.

    If I print the versions of third-party libraries now, it would be easy to get the code to work in the future.

    I would just need to install the versions that worked when they fail. 🤓

    %%capture
    !pip install advertools

    import advertools as adv
    print(adv.__version__)
    # 0.10.6

    We are going to download all Search Engine Journal sitemaps to a pandas data frame with two lines of code.

    sitemap_url = "https://www.searchenginejournal.com/sitemap_index.xml"
    df = adv.sitemap_to_df(sitemap_url)

    One cool feature of the package is that it downloads all the linked sitemaps in the index, and we get a nice data frame.

    Look how simple it is to filter articles/pages from this year. We have 1,550 articles.

    df[df["lastmod"] > '2020-01-01']

    Extract Headlines From the URLs

    The advertools library has a function to split the URLs in the data frame into their components, but let's do it manually to get familiar with the process.

    from urllib.parse import urlparse
    import re

    example_url = "https://www.searchenginejournal.com/google-be-careful-relying-on-3rd-parties-to-render-website-content/376547/"
    u = urlparse(example_url)
    print(u)
    # output -> ParseResult(scheme='https', netloc='www.searchenginejournal.com', path='/google-be-careful-relying-on-3rd-parties-to-render-website-content/376547/', params='', query='', fragment='')

    Here we get a named tuple, ParseResult, with a breakdown of the URL components.

    We are interested in the path.

    We are going to use a simple regex to split it by the / and - characters.

    slug = re.split("[/-]", u.path)
    print(slug)
    # output
    # ['', 'google', 'be', 'careful', 'relying', 'on', '3rd', 'parties', 'to', 'render', 'website', 'content', '376547', '']

    Next, we can convert it back to a string.

    headline = " ".Join(slug) print(headline)

    #output

    ' google be careful relying on 3rd parties to render website content 376547 '

    The slugs contain a page identifier that is useless for us. We will remove it with a regex.

    headline = re.sub(r"\d{6}", "", headline)
    print(headline)
    # output
    # ' google be careful relying on 3rd parties to render website content '

    # Strip whitespace at the borders
    headline = headline.strip()
    print(headline)
    # output
    # 'google be careful relying on 3rd parties to render website content'

    Now that we tested this, we can convert this code to a function and create a new column in our data frame.

    def get_headline(url):
        u = urlparse(url)
        if len(u.path) > 1:
            slug = re.split("[/-]", u.path)
            new_headline = re.sub(r"\d{6}", "", " ".join(slug)).strip()
            # skip author and category pages
            if not re.match("author|category", new_headline):
                return new_headline
        return ""

    Let's create a new column named headline.

    new_df["headline"] = new_df["url"].Apply(lambda x: get_headline(x))

    Extracting Named Entities

    Let's explore and visualize the entities in our headlines corpus.

    First, we combine them into a single text document.

    import spacy
    from spacy import displacy

    text = "\n".join([x for x in new_df["headline"].tolist() if len(x) > 0])
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    displacy.render(doc, style="ent", jupyter=True)

    We can see some entities correctly labeled and some labeled incorrectly, like Hulu as a person.

    There are also several missed entities, like Facebook and Google Display Network.

    spaCy's out-of-the-box NER is not perfect and generally needs training with custom data to improve detection, but it is good enough to illustrate the concepts in this tutorial.

    Building a Knowledge Graph

    Now, we get to the exciting part.

    Let's start by evaluating the grammatical relationships between the words in each sentence.

    We do this by printing the syntactic dependency of the entities.

    for tok in doc[:100]:
        print(tok.text, "...", tok.dep_)

    We are looking for subjects and objects connected by a relationship.

    We will use spaCy's rule-based parser to extract subjects and objects from the headlines.

    The rule can be something like this:

    Extract the subject/object along with its modifiers, compound words and also extract the punctuation marks between them.

    Let's first import the libraries that we will need.

    from spacy.matcher import Matcher
    from spacy.tokens import Span
    import pandas as pd  # needed for the ranking and graph-building steps below
    import networkx as nx
    import matplotlib.pyplot as plt
    from tqdm import tqdm

    To build a knowledge graph, the most important things are the nodes and the edges between them.

    The main idea is to go through each sentence and build two lists. One with the entity pairs and another with the corresponding relationships.

    We are going to borrow a couple of functions created by data scientist Prateek Joshi (a minimal sketch of both helpers appears below).

  • The first one, get_entities, extracts the main entities and their associated attributes.
  • The second one, get_relation, extracts the corresponding relationship between the entities.

    Let's test them on 100 sentences and see what the output looks like. I added len(x) > 0 to skip empty lines.
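    Joshi's exact code is not reproduced here, so what follows is a minimal, hedged sketch of both helpers. It assumes the spaCy 3 Matcher API and the nlp object and imports defined above; the outputs shown later in this article come from Joshi's original versions, so treat this as illustrative rather than definitive.

    def get_entities(sent):
        """Return a [subject, object] pair for a sentence, attaching compound
        words and modifiers to each entity. Either slot may come back empty."""
        ent1, ent2 = "", ""
        prefix, modifier = "", ""
        for tok in nlp(sent):
            if tok.dep_ == "punct":
                continue
            if tok.dep_ == "compound":
                prefix = (prefix + " " + tok.text).strip()
            elif tok.dep_.endswith("mod"):
                modifier = (modifier + " " + tok.text).strip()
            elif "subj" in tok.dep_:
                ent1 = " ".join(filter(None, [modifier, prefix, tok.text]))
                prefix, modifier = "", ""
            elif "obj" in tok.dep_:
                ent2 = " ".join(filter(None, [modifier, prefix, tok.text]))
                prefix, modifier = "", ""
        return [ent1, ent2]

    def get_relation(sent):
        """Return the main verb (the syntactic ROOT) of a sentence, optionally
        followed by a preposition or agent, as the relationship/predicate."""
        doc = nlp(sent)
        matcher = Matcher(nlp.vocab)
        pattern = [
            {"DEP": "ROOT"},
            {"DEP": "prep", "OP": "?"},
            {"DEP": "agent", "OP": "?"},
            {"POS": "ADJ", "OP": "?"},
        ]
        matcher.add("relation", [pattern])  # spaCy 3 signature (assumption)
        matches = matcher(doc)
        if not matches:
            return ""
        _, start, end = matches[-1]
        return doc[start:end].text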

    for t in [x for x in new_df["headline"].tolist() if len(x) > 0][:100]:
        print(get_entities(t))

    Many extractions are missing elements or are not great, but as we have so many headlines, we should still be able to extract useful facts.

    Now, let's build the graph.

    entity_pairs = []

    for i in tqdm([x for x in new_df["headline"].tolist() if len(x) > 0]):
        entity_pairs.append(get_entities(i))

    Here are some example pairs.

    entity_pairs[10:20]
    # output
    # [['chrome', ''], ['google assistant', '500 million 500 users'], ['', ''], ['seo metrics', 'how them'], ['google optimization', ''], ['twitter', 'new explore tab'], ['b2b', 'greg finn podcast'], ['instagram user growth', 'lower levels'], ['', ''], ['', 'advertiser']]

    Next, let's build the corresponding relationships. Our hypothesis is that the predicate is actually the main verb in a sentence.

    relations = [get_relation(i) for i in tqdm([x for x in new_df["headline"].tolist() if len(x) > 0])]
    print(relations[10:20])
    # output
    # ['blocker', 'has', 'conversions', 'reports', 'ppc', 'rolls', 'paid', 'drops to lower', 'marketers', 'facebook']

    Next, let's rank the relationships.

    pd.Series(relations).value_counts()[4:50]

    Finally, let's build the knowledge graph.

    # extract subject
    source = [i[0] for i in entity_pairs]

    # extract object
    target = [i[1] for i in entity_pairs]

    kg_df = pd.DataFrame({'source': source, 'target': target, 'edge': relations})

    # create a directed graph from the dataframe
    G = nx.from_pandas_edgelist(kg_df, "source", "target",
                                edge_attr=True, create_using=nx.MultiDiGraph())

    plt.figure(figsize=(12, 12))
    pos = nx.spring_layout(G)
    nx.draw(G, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos=pos)
    plt.show()

    This plots a monster graph, which, while impressive, is not particularly useful.

    Let's try again, but take only one relationship at a time.

    In order to do this, we will create a function that takes the relationship text as input.

    def display_graph(relation):
        G = nx.from_pandas_edgelist(kg_df[kg_df['edge'] == relation], "source", "target",
                                    edge_attr=True, create_using=nx.MultiDiGraph())
        plt.figure(figsize=(12, 12))
        pos = nx.spring_layout(G, k=0.5)  # k regulates the distance between nodes
        nx.draw(G, with_labels=True, node_color='skyblue', node_size=1500,
                edge_cmap=plt.cm.Blues, pos=pos)
        plt.show()

    Now, when I run display_graph("launches"), I get the graph at the beginning of the article.

    Here are a few more relationships that I plotted.

    I created a Colab notebook with all the steps in this article and at the end, you will find a nice form with many more relationships to check out.

    Just run all the code, click on the pulldown selector and click on the play button to see the graph.
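    The notebook itself is not reproduced here, but wiring up such a form takes only a couple of lines of Colab's form syntax. The relationship values listed below are illustrative examples, not the full list from the notebook.

    #@title Explore the knowledge graph
    relation = "launches"  #@param ["launches", "announces", "acquires", "adds"]
    display_graph(relation)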

    Resources to Learn More & Community Projects

    Here are some resources that I found useful while putting this tutorial together.

    I asked my followers to share their Python projects, and I am excited to see how many creative ideas are coming to life from the community! 🐍🔥


    Image Credits

    All screenshots taken by author, August 2020


    3 Open Source NLP Tools For Data Extraction

    Unstructured text and data are like gold for business applications and the company bottom line, but where to start? Here are three tools worth a look.

    Developers and data scientists use generative AI and large language models (LLMs) to query volumes of documents and unstructured data. Open source LLMs, including Dolly 2.0, EleutherAI Pythia, Meta AI LLaMa, StabilityLM, and others, are all starting points for experimenting with artificial intelligence that accepts natural language prompts and generates summarized responses. 

    "Text as a source of knowledge and information is fundamental, yet there aren't any end-to-end solutions that tame the complexity in handling text," says Brian Platz, CEO and co-founder of Fluree. "While most organizations have wrangled structured or semi-structured data into a centralized data platform, unstructured data remains forgotten and underleveraged."

    If your organization and team aren't experimenting with natural language processing (NLP) capabilities, you're probably lagging behind competitors in your industry. In the 2023 Expert NLP Survey Report, 77% of organizations said they planned to increase spending on NLP, and 54% said their time-to-production was a top return-on-investment (ROI) metric for successful NLP projects.

    Use cases for NLP

    If you have a corpus of unstructured data and text, some of the most common business needs include

  • Entity extraction by identifying names, dates, places, and products
  • Pattern recognition to discover currency and other quantities
  • Categorization into business terms, topics, and taxonomies
  • Sentiment analysis, including positivity, negation, and sarcasm
  • Summarizing the document's key points
  • Machine translation into other languages
  • Dependency graphs that translate text into machine-readable semi-structured representations
    Sometimes, having NLP capabilities bundled into a platform or application is desirable. For example, LLMs support asking questions; AI search engines enable searches and recommendations; and chatbots support interactions. Other times, it's optimal to use NLP tools to extract information and enrich unstructured documents and text.

    Let's look at three popular open source NLP tools that developers and data scientists are using to perform discovery on unstructured documents and develop production-ready NLP processing engines.

    Natural Language Toolkit (NLTK)

    The Natural Language Toolkit (NLTK), released in 2001, is one of the older and more popular NLP Python libraries. NLTK boasts more than 11.8 thousand stars on GitHub and lists over 100 trained models.

    "I think the most important tool for NLP is by far Natural Language Toolkit, which is licensed under Apache 2.0," says Steven Devoe, director of data and analytics at SPR. "In all data science projects, the processing and cleaning of the data to be used by algorithms is a huge proportion of the time and effort, which is particularly true with natural language processing. NLTK accelerates a lot of that work, such as stemming, lemmatization, tagging, removing stop words, and embedding word vectors across multiple written languages to make the text more easily interpreted by the algorithms."

    NLTK's benefits stem from its endurance, with many examples for developers new to NLP, such as this beginner's hands-on guide and this more comprehensive overview. Anyone learning NLP techniques may want to try this library first, as it provides simple ways to experiment with basic techniques such as tokenization, stemming, and chunking. 
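    As a quick, hedged illustration of those basics (this snippet is not from either guide, and the sample sentence is made up), a few lines of NLTK are enough to tokenize a sentence, drop stop words, and stem what remains:

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    # one-time downloads of the tokenizer model and the stop word list
    nltk.download("punkt")
    nltk.download("stopwords")

    text = "NLTK accelerates stemming, tagging, and stop word removal for NLP projects."
    tokens = word_tokenize(text.lower())

    # keep alphabetic tokens, drop stop words, then stem the rest
    stops = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    print([stemmer.stem(t) for t in tokens if t.isalpha() and t not in stops])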

    spaCy

    spaCy is a newer library, with its version 1.0 released in 2016. spaCy supports over 72 languages and publishes its performance benchmarks, and it has amassed more than 25,000 stars on GitHub.

    "spaCy is a free, open-source Python library providing advanced capabilities to conduct natural language processing on large volumes of text at high speed," says Nikolay Manchev, head of data science, EMEA, at Domino Data Lab. "With spaCy, a user can build models and production applications that underpin document analysis, chatbot capabilities, and all other forms of text analysis. Today, the spaCy framework is one of Python's most popular natural language libraries for industry use cases such as extracting keywords, entities, and knowledge from text."

    Tutorials for spaCy show similar capabilities to NLTK, including named entity recognition and part-of-speech (POS) tagging. One advantage is that spaCy returns document objects and supports word vectors, which can give developers more flexibility for performing additional post-NLP data processing and text analytics.
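    For instance, a minimal sketch (assuming the small English model has been installed with python -m spacy download en_core_web_sm; the sample sentence is made up) shows the document object, POS tags, and named entities:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Apple is opening a new office in Austin next year.")

    # part-of-speech tag and syntactic dependency for each token
    for token in doc:
        print(token.text, token.pos_, token.dep_)

    # named entities with their labels
    for ent in doc.ents:
        print(ent.text, ent.label_)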

    Spark NLP

    If you already use Apache Spark and have its infrastructure configured, then Spark NLP may be one of the faster paths to begin experimenting with natural language processing. Spark NLP has several installation options, including AWS, Azure Databricks, and Docker.

    "Spark NLP is a widely used open-source natural language processing library that enables businesses to extract information and answers from free-text documents with state-of-the-art accuracy," says David Talby, CTO of John Snow Labs. "This enables everything from extracting relevant health information that only exists in clinical notes, to identifying hate speech or fake news on social media, to summarizing legal agreements and financial news.

    Spark NLP's differentiators may be its healthcare, finance, and legal domain language models. These commercial products come with pre-trained models to identify drug names and dosages in healthcare, financial entity recognition such as stock tickers, and legal knowledge graphs of company names and officers.

    Talby says Spark NLP can help organizations minimize the upfront training in developing models. "The free and open source library comes with more than 11,000 pre-trained models plus the ability to reuse, train, tune, and scale them easily," he says.
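    As a rough, hedged sketch of what that looks like in practice (explain_document_dl is one of the library's general-purpose pretrained English pipelines; the sample sentence is made up):

    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    # start a local Spark session configured for Spark NLP
    spark = sparknlp.start()

    # download a general-purpose pretrained English pipeline
    pipeline = PretrainedPipeline("explain_document_dl", lang="en")

    result = pipeline.annotate("Spark NLP extracts entities and answers from free-text documents.")
    print(result["entities"])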

    Best practices for experimenting with NLP

    Earlier in my career, I had the opportunity to oversee the development of several SaaS products built on NLP capabilities. My first was a SaaS platform for searching newspaper classified advertisements, including cars, jobs, and real estate. I then led the development of NLP capabilities for extracting information from commercial construction documents, including building specifications and blueprints.

    When starting NLP in a new area, I advise the following:

  • Begin with a small but representative sample of the documents or text.
  • Identify the target end-user personas and how extracted information improves their workflows.
  • Specify the required information extractions and target accuracy metrics.
  • Test several approaches and use speed and accuracy metrics to benchmark.
  • Improve accuracy iteratively, especially when increasing the scale and breadth of documents.
  • Expect to deliver data stewardship tools for addressing data quality and handling exceptions.
    You may find that the NLP tools used to discover and experiment with new document types will aid in defining requirements. Then, expand the review of NLP technologies to include open source and commercial options, as building and supporting production-ready NLP data pipelines can get expensive. With LLMs in the news and gaining interest, underinvesting in NLP capabilities is one way to fall behind competitors. Fortunately, you can start with one of the open source tools introduced here and build your NLP data pipeline to fit your budget and requirements.


    What Is Natural Language Processing? AI For Speech And Text

    Deep learning has improved machine translation and other natural language processing tasks by leaps and bounds

    From a friend on Facebook:

    Me: Alexa please remind me my morning yoga sculpt class is at 5:30am.

    Alexa: I have added Tequila to your shopping list.

    We talk to our devices, and sometimes they recognize what we are saying correctly. We use free services to translate foreign language phrases encountered online into English, and sometimes they give us an accurate translation. Although natural language processing has been improving by leaps and bounds, it still has considerable room for improvement.

    My friend's accidental Tequila order may be more appropriate than she thought. ¡Arriba!

    What is natural language processing?

    Natural language processing, or NLP, is currently one of the major successful application areas for deep learning, despite stories about its failures. The overall goal of natural language processing is to allow computers to make sense of and act on human language. We'll break that down further in the next section.

    Historically, natural language processing was handled by rule-based systems, initially by writing rules for, e.g., grammars and stemming. Aside from the sheer amount of work it took to write those rules by hand, they tended not to work very well.

    Why not? Let's consider what should be a simple example, spelling. In some languages, such as Spanish, spelling really is easy and has regular rules. Anyone learning English as a second language, however, knows how irregular English spelling and pronunciation can be. Imagine having to program rules that are riddled with exceptions, such as the grade-school spelling rule "I before E except after C, or when sounding like A as in neighbor or weigh." As it turns out, the "I before E" rule is hardly a rule. Accurate perhaps 3/4 of the time, it has numerous classes of exceptions.

    After pretty much giving up on hand-written rules in the late 1980s and early 1990s, the NLP community started using statistical inference and machine learning models. Many models and techniques were tried; few survived when they were generalized beyond their initial usage. A few of the more successful methods were used in multiple fields. For example, Hidden Markov Models were used for speech recognition in the 1970s and were adopted for use in bioinformatics—specifically, analysis of protein and DNA sequences—in the 1980s and 1990s.

    Phrase-based statistical machine translation models still needed to be tweaked for each language pair, and the accuracy and precision depended mostly on the quality and size of the textual corpora available for supervised learning training. For French and English, the Canadian Hansard (proceedings of Parliament, by law bilingual since 1867) was and is invaluable for supervised learning. The proceedings of the European Union offer more languages, but for fewer years.

    In the fall of 2016, Google Translate suddenly went from producing, on the average, "word salad" with a vague connection to the meaning in the original language, to emitting polished, coherent sentences more often than not, at least for supported language pairs such as English-French, English-Chinese, and English-Japanese. Many more language pairs have been added since then.

    That dramatic improvement was the result of a nine-month concerted effort by the Google Brain and Google Translate teams to revamp Google Translate from using its old phrase-based statistical machine translation algorithms to using a neural network trained with deep learning and word embeddings using Google's TensorFlow framework. Within a year neural machine translation (NMT) had replaced statistical machine translation (SMT) as the state of the art.

    Was that magic? No, not at all. It wasn't even easy. The researchers working on the conversion had access to a huge corpus of translations from which to train their networks, but they soon discovered that they needed thousands of GPUs for training, and that they would need to create a new kind of chip, a Tensor Processing Unit (TPU), to run Google Translate on their trained neural networks at scale. They also had to refine their networks hundreds of times as they tried to train a model that would be nearly as good as human translators.

    Natural language processing tasks

    In addition to the machine translation problem addressed by Google Translate, major NLP tasks include automatic summarization, co-reference resolution (determine which words refer to the same objects, especially for pronouns), named entity recognition (identify people, places, and organizations), natural language generation (convert information into readable language), natural language understanding (convert chunks of text into more formal representations such as first-order logic structures), part-of-speech tagging, sentiment analysis (classify text as favorable or unfavorable toward specific objects), and speech recognition (convert audio to text).

    Major NLP tasks are often broken down into subtasks, although the latest-generation neural-network-based NLP systems can sometimes dispense with intermediate steps. For example, an experimental Google speech-to-speech translator called Translatotron can translate Spanish speech to English speech directly by operating on spectrograms without the intermediate steps of speech to text, language translation, and text to speech. Translatotron isn't all that accurate yet, but it's good enough to be a proof of concept.

    Natural language processing methods

    Like any other machine learning problem, NLP problems are usually addressed with a pipeline of procedures, most of which are intended to prepare the data for modeling. In his excellent tutorial on NLP using Python, DJ Sarkar lays out the standard workflow: Text pre-processing -> Text parsing and exploratory data analysis -> Text representation and feature engineering -> Modeling and/or pattern mining -> Evaluation and deployment.  

    Sarkar uses Beautiful Soup to extract text from scraped websites, and then the Natural Language Toolkit (NLTK) and spaCy to preprocess the text by tokenizing, stemming, and lemmatizing it, as well as removing stopwords and expanding contractions. Then he continues to use NLTK and spaCy to tag parts of speech, perform shallow parsing, and extract Ngram chunks for tagging: unigrams, bigrams, and trigrams. He uses NLTK and the Stanford Parser to generate parse trees, and spaCy to generate dependency trees and perform named entity recognition.
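    A minimal, hedged sketch of those early steps (this is not Sarkar's code; the URL is a placeholder and the pipeline is trimmed to scraping, tokenizing, lemmatizing, and stop word removal):

    import requests
    from bs4 import BeautifulSoup
    import spacy

    # fetch a page and strip the markup down to plain text
    html = requests.get("https://example.com/article").text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")

    # tokenize, lemmatize, and drop stop words, punctuation, and whitespace
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    clean_tokens = [tok.lemma_.lower() for tok in doc
                    if not tok.is_stop and not tok.is_punct and not tok.is_space]
    print(clean_tokens[:20])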

    Sarkar goes on to perform sentiment analysis using several unsupervised methods, since his example data set hasn't been tagged for supervised machine learning or deep learning training. In a later article, Sarkar discusses using TensorFlow to access Google's Universal Sentence Embedding model and perform transfer learning to analyze a movie review data set for sentiment analysis.
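    A minimal sketch of loading that model from TensorFlow Hub (the URL points to version 4 of the Universal Sentence Encoder; the review strings are made up):

    import tensorflow_hub as hub

    # load the pre-trained Universal Sentence Encoder
    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

    reviews = ["A thoughtful, beautifully shot film.",
               "Two hours of my life I will never get back."]
    embeddings = embed(reviews)  # one 512-dimensional vector per sentence
    print(embeddings.shape)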

    As you'll see if you read these articles and work through the Jupyter notebooks that accompany them, there isn't one universal best model or algorithm for text analysis. Sarkar constantly tries multiple models and algorithms to see which work best on his data.

    For a review of recent deep-learning-based models and methods for NLP, I can recommend this article by an AI educator who calls himself Elvis.

    Natural language processing services

    You would expect Amazon Web Services, Microsoft Azure, and Google Cloud to offer natural language processing services of one kind or another, in addition to their well-known speech recognition and language translation services. And of course they do—not only generic NLP models, but also customized NLP.

    Amazon Comprehend is a natural language processing service that extracts key phrases, places, people's names, brands, events, and sentiment from unstructured text. Amazon Comprehend uses pre-trained deep learning models and identifies rather generic places and things. If you want to extend this capability to identify more specific language, you can customize Amazon Comprehend to identify domain-specific entities and to categorize documents into your own categories.
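    For example, a minimal boto3 sketch (assuming AWS credentials are configured; the region and sample sentence are placeholders):

    import boto3

    comprehend = boto3.client("comprehend", region_name="us-east-1")
    text = "Amazon Comprehend flagged the product launch in Seattle as positive news."

    # generic entity extraction and sentiment analysis on a block of text
    entities = comprehend.detect_entities(Text=text, LanguageCode="en")
    sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")

    print([(e["Text"], e["Type"]) for e in entities["Entities"]])
    print(sentiment["Sentiment"])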

    Microsoft Azure has multiple NLP services. Text Analytics identifies the language, sentiment, key phrases, and entities of a block of text. The capabilities supported depend on the language.

    Language Understanding (LUIS) is a customizable natural-language interface for social media apps, chat bots, and speech-enabled desktop applications. You can use a pre-built LUIS model, a pre-built domain-specific model, or a customized model with machine-trained or literal entities. You can build a custom LUIS model with the authoring APIs or with the LUIS portal.

    For the more technically minded, Microsoft has released a paper and code showing you how to fine-tune a BERT NLP model for custom applications using the Azure Machine Learning Service.

    Google Cloud offers both a pre-trained natural language API and customizable AutoML Natural Language. The Natural Language API discovers syntax, entities, and sentiment in text, and classifies text into a predefined set of categories. AutoML Natural Language allows you to train a custom classifier for your own set of categories using deep transfer learning.





