What Is Natural Language Processing? AI For Speech And Text
Deep learning has improved machine translation and other natural language processing tasks by leaps and bounds
From a friend on Facebook:
Me: Alexa please remind me my morning yoga sculpt class is at 5:30am.
Alexa: I have added Tequila to your shopping list.
We talk to our devices, and sometimes they correctly recognize what we are saying. We use free services to translate foreign-language phrases we encounter online into English, and sometimes they give us an accurate translation. Although natural language processing has been improving by leaps and bounds, it still has considerable room for improvement.
My friend's accidental Tequila order may be more appropriate than she thought. ¡Arriba!
What is natural language processing?
Natural language processing, or NLP, is currently one of the major successful application areas for deep learning, despite stories about its failures. The overall goal of natural language processing is to allow computers to make sense of and act on human language. We'll break that down further in the next section.
Historically, natural language processing was handled by rule-based systems, initially by writing rules for, e.g., grammars and stemming. Aside from the sheer amount of work it took to write those rules by hand, they tended not to work very well.
Why not? Let's consider what should be a simple example, spelling. In some languages, such as Spanish, spelling really is easy and has regular rules. Anyone learning English as a second language, however, knows how irregular English spelling and pronunciation can be. Imagine having to program rules that are riddled with exceptions, such as the grade-school spelling rule "I before E except after C, or when sounding like A as in neighbor or weigh." As it turns out, the "I before E" rule is hardly a rule. Accurate perhaps 3/4 of the time, it has numerous classes of exceptions.
After pretty much giving up on hand-written rules in the late 1980s and early 1990s, the NLP community started using statistical inference and machine learning models. Many models and techniques were tried; few survived when they were generalized beyond their initial usage. A few of the more successful methods were used in multiple fields. For example, Hidden Markov Models were used for speech recognition in the 1970s and were adopted for use in bioinformatics—specifically, analysis of protein and DNA sequences—in the 1980s and 1990s.
Phrase-based statistical machine translation models still needed to be tweaked for each language pair, and the accuracy and precision depended mostly on the quality and size of the textual corpora available for supervised learning. For French and English, the Canadian Hansard (proceedings of Parliament, by law bilingual since 1867) was and is invaluable for supervised learning. The proceedings of the European Union offer more languages, but for fewer years.
In the fall of 2016, Google Translate suddenly went from producing, on average, "word salad" with a vague connection to the meaning in the original language, to emitting polished, coherent sentences more often than not, at least for supported language pairs such as English-French, English-Chinese, and English-Japanese. Many more language pairs have been added since then.
That dramatic improvement was the result of a nine-month concerted effort by the Google Brain and Google Translate teams to revamp Google Translate from using its old phrase-based statistical machine translation algorithms to using a neural network trained with deep learning and word embeddings using Google's TensorFlow framework. Within a year neural machine translation (NMT) had replaced statistical machine translation (SMT) as the state of the art.
Was that magic? No, not at all. It wasn't even easy. The researchers working on the conversion had access to a huge corpus of translations from which to train their networks, but they soon discovered that they needed thousands of GPUs for training, and that they would need to create a new kind of chip, a Tensor Processing Unit (TPU), to run Google Translate on their trained neural networks at scale. They also had to refine their networks hundreds of times as they tried to train a model that would be nearly as good as human translators.
Natural language processing tasks
In addition to the machine translation problem addressed by Google Translate, major NLP tasks include automatic summarization, co-reference resolution (determine which words refer to the same objects, especially for pronouns), named entity recognition (identify people, places, and organizations), natural language generation (convert information into readable language), natural language understanding (convert chunks of text into more formal representations such as first-order logic structures), part-of-speech tagging, sentiment analysis (classify text as favorable or unfavorable toward specific objects), and speech recognition (convert audio to text).
Major NLP tasks are often broken down into subtasks, although the latest-generation neural-network-based NLP systems can sometimes dispense with intermediate steps. For example, an experimental Google speech-to-speech translator called Translatotron can translate Spanish speech to English speech directly by operating on spectrograms without the intermediate steps of speech to text, language translation, and text to speech. Translatotron isn't all that accurate yet, but it's good enough to be a proof of concept.
Natural language processing methods
Like any other machine learning problem, NLP problems are usually addressed with a pipeline of procedures, most of which are intended to prepare the data for modeling. In his excellent tutorial on NLP using Python, DJ Sarkar lays out the standard workflow: Text pre-processing -> Text parsing and exploratory data analysis -> Text representation and feature engineering -> Modeling and/or pattern mining -> Evaluation and deployment.
Sarkar uses Beautiful Soup to extract text from scraped websites, and then the Natural Language Toolkit (NLTK) and spaCy to preprocess the text by tokenizing, stemming, and lemmatizing it, as well as removing stopwords and expanding contractions. Then he continues to use NLTK and spaCy to tag parts of speech, perform shallow parsing, and extract n-gram chunks for tagging: unigrams, bigrams, and trigrams. He uses NLTK and the Stanford Parser to generate parse trees, and spaCy to generate dependency trees and perform named entity recognition.
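To give a feel for that preprocessing stage, here is a minimal sketch using NLTK and spaCy. It assumes the relevant NLTK data packages (punkt and stopwords) and the spaCy en_core_web_sm model are available; the sample sentence is just a placeholder.

```python
import nltk
import spacy

nltk.download("punkt", quiet=True)        # tokenizer data
nltk.download("stopwords", quiet=True)    # stopword lists
nlp = spacy.load("en_core_web_sm")        # small English pipeline

text = "The researchers were refining their networks hundreds of times."

# Tokenize and drop stopwords with NLTK
tokens = nltk.word_tokenize(text.lower())
stop_words = set(nltk.corpus.stopwords.words("english"))
content_words = [t for t in tokens if t.isalpha() and t not in stop_words]

# Lemmatize and tag parts of speech with spaCy
doc = nlp(text)
lemmas = [(token.text, token.lemma_, token.pos_) for token in doc]

print(content_words)
print(lemmas)
```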
Sarkar goes on to perform sentiment analysis using several unsupervised methods, since his example data set hasn't been tagged for supervised machine learning or deep learning training. In a later article, Sarkar discusses using TensorFlow to access Google's Universal Sentence Embedding model and perform transfer learning to analyze a movie review data set for sentiment analysis.
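For a sense of what an unsupervised approach looks like, here is a minimal sketch using NLTK's VADER analyzer, one common lexicon-based method that needs no labeled training data (the example reviews are made up for illustration):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # the lexicon VADER relies on

analyzer = SentimentIntensityAnalyzer()
for review in ["A delightful, beautifully shot film.",
               "Two hours of my life I will never get back."]:
    # polarity_scores returns negative/neutral/positive and a compound score
    print(review, "->", analyzer.polarity_scores(review))
```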
As you'll see if you read these articles and work through the Jupyter notebooks that accompany them, there isn't one universal best model or algorithm for text analysis. Sarkar constantly tries multiple models and algorithms to see which work best on his data.
For a review of recent deep-learning-based models and methods for NLP, I can recommend this article by an AI educator who calls himself Elvis.
Natural language processing services
You would expect Amazon Web Services, Microsoft Azure, and Google Cloud to offer natural language processing services of one kind or another, in addition to their well-known speech recognition and language translation services. And of course they do—not only generic NLP models, but also customized NLP.
Amazon Comprehend is a natural language processing service that extracts key phrases, places, people's names, brands, events, and sentiment from unstructured text. Amazon Comprehend uses pre-trained deep learning models and identifies rather generic places and things. If you want to extend this capability to identify more specific language, you can customize Amazon Comprehend to identify domain-specific entities and to categorize documents into your own categories.
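As a rough sketch of what calling Amazon Comprehend looks like from Python with boto3 (it assumes AWS credentials are already configured; the region and sample text are arbitrary):

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")
text = "Google Translate switched to neural machine translation in 2016."

# Pre-trained entity and sentiment models, no training required
entities = comprehend.detect_entities(Text=text, LanguageCode="en")
sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")

for entity in entities["Entities"]:
    print(entity["Type"], entity["Text"], round(entity["Score"], 2))
print(sentiment["Sentiment"])
```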
Microsoft Azure has multiple NLP services. Text Analytics identifies the language, sentiment, key phrases, and entities of a block of text. The capabilities supported depend on the language.
Language Understanding (LUIS) is a customizable natural-language interface for social media apps, chat bots, and speech-enabled desktop applications. You can use a pre-built LUIS model, a pre-built domain-specific model, or a customized model with machine-trained or literal entities. You can build a custom LUIS model with the authoring APIs or with the LUIS portal.
For the more technically minded, Microsoft has released a paper and code showing you how to fine-tune a BERT NLP model for custom applications using the Azure Machine Learning Service.
Google Cloud offers both a pre-trained natural language API and customizable AutoML Natural Language. The Natural Language API discovers syntax, entities, and sentiment in text, and classifies text into a predefined set of categories. AutoML Natural Language allows you to train a custom classifier for your own set of categories using deep transfer learning.
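A comparable sketch for the pre-trained Natural Language API, using the google-cloud-language client library (it assumes application default credentials are set up for a Google Cloud project; the sample sentence is arbitrary):

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="The new translation model produces remarkably fluent output.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

# Sentiment and entity analysis with the pre-trained models
sentiment = client.analyze_sentiment(request={"document": document})
entities = client.analyze_entities(request={"document": document})

print(sentiment.document_sentiment.score)
for entity in entities.entities:
    print(entity.name, entity.type_)
```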
What Is TensorFlow?
TensorFlow is an open-source collection of tools and libraries that helps developers build and train deep learning models.
It has become one of the most widely used software frameworks since it can help build complex Artificial Intelligence (AI) models relatively quickly and easily.
Jad Khalife, Director of Sales Engineering, Middle East & Turkey, at Dataiku, says one of the features that make TensorFlow suitable for machine learning is that it's an end-to-end framework that offers everything from data preprocessing to model deployment.
TensorFlow uses a dataflow graph to represent computations. It shares this space with another open-source machine-learning framework called PyTorch.
Developed and released by the Google Brain Team in November 2015, the framework received a major update in 2019 in the form of TensorFlow 2.0.
TensorFlow applications can run on either conventional CPUs or GPUs. Furthermore, Google Cloud users can run TensorFlow on Google's own Tensor Processing Unit (TPU) chips, which are designed to speed up TensorFlow tasks.
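You can check which devices a TensorFlow installation can see, and pin a computation to one of them, in a couple of lines. A minimal sketch (the GPU line simply prints an empty list if no GPU is present):

```python
import tensorflow as tf

# List the processors TensorFlow can see on this machine
print(tf.config.list_physical_devices("CPU"))
print(tf.config.list_physical_devices("GPU"))

# Explicitly place a computation on a named device
with tf.device("/CPU:0"):
    result = tf.reduce_sum(tf.random.normal([1000, 1000]))
print(result)
```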
Uses for TensorFlow
TensorFlow has many applications in different industries. It has been used by Airbnb to improve guest experience, by Airbus to detect anomalies in ISS telemetry data, by NASA to hunt for new planets, and to fight illegal deforestation.
Among its most important uses are:
Image recognition: This is one of the most popular uses of TensorFlow. Developers can leverage TensorFlow's pre-trained models or build their own to identify and classify objects within digital images and videos. This technology has applications in fields like medical image analysis and autonomous driving.
Natural Language Processing (NLP): Developers can use TensorFlow to process and analyze large volumes of textual data. This helps automate language understanding and generation, enabling developers to create chatbots, language translation systems, sentiment analysis tools, and other such NLP-based systems. Not surprisingly, many digital assistants are based on models trained using TensorFlow.
Reinforcement learning: Reinforcement learning (RL) involves an agent that learns to make decisions by interacting with an environment, through trial and error. TensorFlow can be used for this task through its library called TensorFlow Agents (TF-Agents), which provides a framework for building and training RL agents. This is particularly useful in fields such as robotics where TensorFlow can help develop models that enable robots to perceive and interact with their environment, improving tasks like navigation.
Generative Adversarial Networks (GANs): TensorFlow bundles a library called TF-GAN that allows developers to easily implement GANs. This comprehensive library simplifies the setup and training of GAN models. These models can then be used for tasks like generating all kinds of realistic media.
Time Series analysis: TensorFlow provides several methods and models for time series analysis and forecasting. This comes in handy for forecasting outcomes, detecting anomalies, and building financial models, and it's widely used in applications such as predicting stock prices and weather forecasting. Recommendation engines, such as those used by Netflix, are one of the most common use cases for time series.
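To make the image-recognition use case above concrete, here is a minimal Keras sketch that trains a small classifier on the MNIST handwritten digits bundled with tf.keras (one epoch only, purely to illustrate the workflow, not a production model):

```python
import tensorflow as tf

# Load and scale the MNIST digit images
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A small fully connected classifier
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=1, validation_data=(x_test, y_test))
```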
Advantages of TensorFlow
TensorFlow offers several advantages that make it a popular choice for developing and deploying machine learning models. Here's why it's the preferred choice for many AI developers:
Scalability: TensorFlow is designed to be scalable, which allows it to work efficiently across various devices, from mobile phones to high-end servers. It can also easily handle large datasets and computations, whether on a local machine, distributed across multiple machines, or in a cloud environment.
Support for multiple devices: TensorFlow supports multiple devices, such as CPUs, GPUs, and TPUs. This capability allows models created with TensorFlow to be deployed easily across different platforms without rewriting code.
Parallelism: By distributing its workload across multiple processors or machines, TensorFlow can significantly reduce the time required to train models. This is particularly useful when working with large datasets and complex models that would otherwise take a long time to train on a single device.
Open Source: TensorFlow is open source, which means it's accessible to AI developers all over the world. Being open source also helps foster trust and transparency. Backed by Google, TensorFlow also has a very active and vibrant community of developers, data scientists, and engineers who work together to modify and extend the framework and provide support.
Greater developer control: Although TensorFlow uses Python as a front-end API for building applications with the framework, it offers wrappers in several other programming languages including C++ and Java. This means developers can train and deploy machine learning models regardless of the programming language or platform.
Extensive ecosystem: TensorFlow boasts a rich ecosystem of libraries and tools to help make development faster and easier. This includes TensorFlow Lite for mobile and embedded devices, TensorFlow.js for web-based applications, the TensorFlow Hub repository of pre-trained models, and a lot more.
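As one concrete example of the parallelism point above, here is a minimal sketch of TensorFlow's MirroredStrategy, which replicates a Keras model across whatever GPUs are visible (falling back to the CPU) and splits each training batch among them:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Build and compile the model inside the strategy scope so its variables
# are mirrored across the available devices
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
```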
TensorFlow components
There are a few key components in TensorFlow that help facilitate its functionality as one of the leading machine-learning libraries.
Tensors: As the name suggests, tensors are a crucial aspect of TensorFlow. Think of a tensor as a multi-dimensional array. In TensorFlow, all data is represented as tensors; they are the primary data structures used to represent and manipulate data.
Flows: This is the other critical aspect of TensorFlow. As we know, TensorFlow accepts input in the form of tensors. This input passes through a series of steps. The term "flow" refers to this movement of data through the various stages of model training or inference.
Graphs: One of the reasons for TensorFlow's popularity is its graph-based architecture. All operations in TensorFlow are depicted and executed inside a graph, which helps define how data is processed in the model.
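Here is a minimal sketch of how those three ideas fit together: data lives in tensors, and a function decorated with @tf.function is traced into a dataflow graph that the tensors then flow through.

```python
import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # a 2-D tensor (a matrix)
w = tf.constant([[1.0], [0.5]])             # another tensor

@tf.function  # traced into a dataflow graph on first call
def affine(x, weights):
    return tf.matmul(x, weights) + 1.0

print(affine(a, w))  # the tensors "flow" through the graph's operations
```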
TensorBoard: TensorBoard is a visualization tool that helps developers track and understand the training of machine learning models in TensorFlow. It is primarily used for monitoring and debugging machine learning models and provides insights into how the models are learning and performing.
What is TensorFlow Lite?
While TensorFlow is a wonderful library for training machine learning models and running inference with them, it requires powerful CPUs, GPUs, or TPUs to work its magic. In 2017, Google released TensorFlow Lite to enable developers to bring machine learning-powered experiences to mobile and embedded devices.
Now called LiteRT, TensorFlow Lite allows developers to deploy machine learning models on devices with limited computational resources, such as smartphones, tablets, and other IoT devices.
"[TensorFlow Lite] enables efficient inference with minimal computational resources, making it ideal for real-time and low-latency machine learning applications," says Khalife.
It is tuned for speed and optimizes power consumption to run efficiently on devices with limited hardware resources. Models created with TensorFlow Lite are lightweight enough to be deployed on embedded devices, like the Raspberry Pi, and at the edge. Like TensorFlow, LiteRT is also open source.
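Converting an existing Keras model for on-device use is a short exercise with the TFLiteConverter. A minimal sketch (the tiny untrained model here is just a stand-in for a real one):

```python
import tensorflow as tf

# A tiny Keras model standing in for a real, trained one
model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])

# Convert to the compact flat-buffer format used for on-device inference
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```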
5 Natural Language Processing Libraries To Use
Natural language processing (NLP) is important because it enables machines to understand, interpret and generate human language, which is the primary means of communication between people. By using NLP, machines can analyze and make sense of large amounts of unstructured textual data, improving their ability to assist humans in various tasks, such as customer service, content creation and decision-making.
Additionally, NLP can help bridge language barriers, improve accessibility for individuals with disabilities, and support research in various fields, such as linguistics, psychology and social sciences.
Here are five NLP libraries that can be used for various purposes, as discussed below.
NLTK (Natural Language Toolkit)
One of the most widely used programming languages for NLP is Python, which has a rich ecosystem of libraries and tools for NLP, including the NLTK. Python's popularity in the data science and machine learning communities, combined with the ease of use and extensive documentation of NLTK, has made it a go-to choice for many NLP projects.
NLTK is a widely used NLP library in Python. It offers NLP machine-learning capabilities for tokenization, stemming, tagging and parsing. NLTK is great for beginners and is used in many academic courses on NLP.
Tokenization is the process of dividing a text into more manageable pieces, like specific words, phrases or sentences. Tokenization aims to give the text a structure that makes programmatic analysis and manipulation easier. A frequent pre-processing step in NLP applications, such as text categorization or sentiment analysis, is tokenization.
Stemming is the process of reducing words to their base or root form. For instance, "run" is the root of the terms "running," "runner," and "run." Tagging involves identifying each word's part of speech (POS) within a document, such as a noun, verb, adjective, etc. In many NLP applications, such as text analysis or machine translation, where knowing the grammatical structure of a phrase is critical, POS tagging is a crucial step.
Parsing is the process of analyzing the grammatical structure of a sentence to identify the relationships between the words. Parsing involves breaking down a sentence into constituent parts, such as subject, object, verb, etc. Parsing is a crucial step in many NLP tasks, such as machine translation or text-to-speech conversion, where understanding the syntax of a sentence is important.
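A minimal NLTK sketch of the first three steps described above, tokenization, stemming, and POS tagging (it assumes the punkt and tagger data packages are available; the sentence is just an example):

```python
import nltk
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The runners were running a longer race."

tokens = nltk.word_tokenize(sentence)               # tokenization
stems = [PorterStemmer().stem(t) for t in tokens]   # stemming
tags = nltk.pos_tag(tokens)                         # part-of-speech tagging

print(tokens)
print(stems)
print(tags)
```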
SpaCy
SpaCy is a fast and efficient NLP library for Python. It is designed to be easy to use and provides tools for entity recognition, part-of-speech tagging, dependency parsing and more. SpaCy is widely used in the industry for its speed and accuracy.
Dependency parsing is a natural language processing technique that examines the grammatical structure of a phrase by determining the relationships between words in terms of their syntactic and semantic dependencies, and then building a parse tree that captures these relationships.
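A minimal spaCy sketch of dependency parsing and named entity recognition (it assumes the en_core_web_sm model has been installed; the sentences are arbitrary examples):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The translator produced a surprisingly fluent sentence.")

# Each token points to its syntactic head with a dependency label
for token in doc:
    print(f"{token.text:>12} --{token.dep_}--> {token.head.text}")

# Named entity recognition comes from the same pipeline
for ent in nlp("Google released TensorFlow in November 2015.").ents:
    print(ent.text, ent.label_)
```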
Stanford CoreNLP
Stanford CoreNLP is a Java-based NLP library that provides tools for a variety of NLP tasks, such as sentiment analysis, named entity recognition, dependency parsing and more. It is known for its accuracy and is used by many organizations.
Sentiment analysis is the process of analyzing and determining the subjective tone or attitude of a text, while named entity recognition is the process of identifying and extracting named entities, such as names, locations and organizations, from a text.
Gensim
Gensim is an open-source library for topic modeling, document similarity analysis and other NLP tasks. It provides tools for algorithms such as latent Dirichlet allocation (LDA) and word2vec for generating word embeddings.
LDA is a probabilistic model used for topic modeling, where it identifies the underlying topics in a set of documents. Word2vec is a neural network-based model that learns to map words to vectors, enabling semantic analysis and similarity comparisons between words.
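A minimal Gensim sketch of both techniques, LDA topic modeling and word2vec embeddings, on a toy corpus (real use would need far more text; parameter names follow Gensim 4.x):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

docs = [
    ["machine", "translation", "uses", "neural", "networks"],
    ["word", "embeddings", "map", "words", "to", "vectors"],
    ["neural", "networks", "learn", "word", "vectors"],
]

# Topic modeling with latent Dirichlet allocation
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
print(lda.print_topics())

# Word embeddings with word2vec
w2v = Word2Vec(sentences=docs, vector_size=16, window=2, min_count=1, epochs=50)
print(w2v.wv.most_similar("neural", topn=3))
```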
TensorFlow
TensorFlow is a popular machine-learning library that can also be used for NLP tasks. It provides tools for building neural networks for tasks such as text classification, sentiment analysis and machine translation. TensorFlow is widely used in industry and has a large support community.
Classifying text into predetermined groups or classes is known as text classification. Sentiment analysis examines a text's subjective tone to ascertain the author's attitude or feelings. Machine translation converts text from one language into another. While all three use natural language processing techniques, their objectives are distinct.
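A minimal TensorFlow sketch of the text-classification case: a TextVectorization layer turns raw strings into token IDs, and a tiny embedding model learns to separate a handful of made-up positive and negative reviews (far too little data for real use, purely to show the moving parts):

```python
import tensorflow as tf

texts = tf.constant([
    "great service and friendly staff",
    "terrible experience, would not recommend",
    "loved every minute of it",
    "awful, slow and rude",
])
labels = tf.constant([1, 0, 1, 0])  # 1 = positive, 0 = negative

# Map raw strings to integer token IDs
vectorize = tf.keras.layers.TextVectorization(max_tokens=1000,
                                              output_sequence_length=8)
vectorize.adapt(texts)
x = vectorize(texts)

# A tiny embedding-based classifier
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(1000, 16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(x, labels, epochs=30, verbose=0)

print(model.predict(vectorize(tf.constant(["friendly and great"]))))
```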
Can NLP libraries and blockchain be used together?
NLP libraries and blockchain are two distinct technologies, but they can be used together in various ways. For instance, text-based content on blockchain platforms, such as smart contracts and transaction records, can be analyzed and understood using NLP approaches.
NLP can also be applied to creating natural language interfaces for blockchain applications, allowing users to communicate with the system using everyday language. The integrity and privacy of user data can be guaranteed by using blockchain to protect and validate NLP-based apps, such as chatbots or sentiment analysis tools.
