An executive primer on artificial general intelligence



Natural language is structured data

How Do Computers And Humans Process Language?

The performance of natural language models such as ChatGPT is deeply impressive. The performance of another of these models, the one called "human," is at least as impressive.

The human model stores only about 1.5 megabytes of data for a native language [1]. With relatively little training data, the human model is able to capture sentence structures, bootstrap the meaning of words, and generate words and language it has never seen or heard before. The human model does not really need to be told what is correct language use and what is not; it learns this automatically, identifying for itself what is correct in terms of grammar, meaning, and context. It excels at language processing.

For the computational model (take ChatGPT as the example), the performance is human-like whether you ask it a knowledge question, have it summarize a text, or have it create an original limerick, poem, or haiku. One can argue that it perhaps even surpasses human performance. Whereas it might take you, as a reader, half an hour to come up with an original limerick for this blog post, it would take ChatGPT just a matter of seconds.

GPT stands for Generative Pre-trained Transformer, an artificial neural network trained on massive amounts of data. Around 570 GB of data were used to train a model consisting of about 1.8 trillion parameters across 120 layers. This training is extremely expensive, both in terms of computing power and the resulting carbon footprint [2].

The human natural language model and the computational model called "ChatGPT" show remarkable similarities in their output: generated language. The similarities in their input are equally obvious: both are trained on language data in order to generate language output.

And yet the computational and human language processing models also show remarkable differences. The amount of language data the computational model needs for training differs vastly from the amount the human model needs, as does the computational power each requires. How can we explain these similarities and differences?

Natural language models


It is often argued that artificial intelligence models like ChatGPT are trained on unstructured data. When an AI model recognizes an image, the pixels in that image are "unstructured," and the language data models like ChatGPT are trained on could be considered equally unstructured. This is true to some extent: pixels in images, just like words in sentences, are not tabularized in a structured spreadsheet. But is the input really unstructured?

Think about it. If we were to take a 10x10 image (100 pixels) with 256 grayscale levels, we could create 6.668E+240 different images. That's a lot of images! So many, in fact, that if an average person lives to 90 years old, they would need to see 2.3493E+231 pictures per second throughout their lifetime to have seen them all. Computers might process faster, depending on computational resources, but 6.668E+240 different images (and that is counting only the 100-pixel ones) is a lot even for computers.
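The image arithmetic above is easy to verify yourself; here is a minimal back-of-the-envelope check in Python (the 90-year lifetime ignores leap days for simplicity):

```python
# Number of distinct 10x10 grayscale images:
# 100 pixels, each taking one of 256 gray levels.
num_images = 256 ** 100

# Seconds in a 90-year lifetime (365-day years, no leap days).
lifetime_seconds = 90 * 365 * 24 * 60 * 60

# How many images per second you would need to see to view them all.
images_per_second = num_images // lifetime_seconds

print(f"{num_images:.4e}")         # ≈ 6.668e+240 distinct images
print(f"{images_per_second:.4e}")  # ≈ 2.349e+231 images per second
```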

For language, things are not much different. Take a single word. If words were truly unstructured, all letter combinations could exist. Counting only words of at most 10 letters, we could create 1.46814E+14 different words. (Mind you, the word "unstructured" itself would not exist in this set, because it has 12 letters, not 10.) If we assume 171,476 words in the Oxford English Dictionary, there are roughly 856 million times more possible letter strings of up to 10 letters than there are words that actually exist in the dictionary. I have argued elsewhere that we have no problem recognizing 10 sextillion sentences, as many sentences as there are grains of sand in the Sahara desert, which makes it rather difficult to argue that language models are trained on unstructured data [3].
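The word-count arithmetic can be checked the same way; the sketch below sums the possible strings of 1 to 10 lowercase letters and compares the total against the 171,476 headword count cited in the text:

```python
# All possible "words" of 1 to 10 lowercase letters from a 26-letter alphabet.
ALPHABET_SIZE = 26
DICTIONARY_WORDS = 171_476  # Oxford English Dictionary headword count cited above

possible_words = sum(ALPHABET_SIZE ** length for length in range(1, 11))
ratio = possible_words / DICTIONARY_WORDS

print(f"{possible_words:.5e}")  # ≈ 1.46814e+14 candidate letter strings
print(f"{ratio:.3e}")           # ≈ 8.56e+08 times more strings than real words
```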

Perhaps the magic of the computational and human natural language models does not lie in training data and computational power but elsewhere: in the language system itself. There is increasing evidence that language is not arbitrary: the sound of words gives pieces of meaning away, as does word order, as does the context in which words occur. What human and computational models are actually doing is picking up on these patterns in language.

That would be an exciting conclusion, as it would explain how humans are so good at language and why computers can become so good at language (and how the latter might actually decrease their carbon footprint by taking advantage of these patterns in language).

In closing, I guess I owe you the limerick I mentioned earlier. Here is the one created by ChatGPT:

In the land where language takes flight,
Both human and AI shine bright.
From pixels and word,
Meaning's melody's heard,
In patterns, they find their delight.


Decoding The Data Ocean: Security Threats & Natural Language Processing

InfoSec Insider



Google's Gemini Comes To Databases

Google wants Gemini, its family of generative AI models, to power your app's databases — in a sense.

At its annual Cloud Next conference in Las Vegas, Google announced the public preview of Gemini in Databases, a collection of features underpinned by Gemini to — as the company pitched it — "simplify all aspects of the database journey." In less jargony language, Gemini in Databases is a bundle of AI-powered, developer-focused tools for Google Cloud customers who are creating, monitoring and migrating app databases.

One piece of Gemini in Databases is Database Studio, an editor for Structured Query Language (SQL), the standard language for querying and managing data in relational databases. Built into the Google Cloud console, Database Studio can generate and summarize SQL code and fix certain errors in it, Google says, in addition to offering general SQL coding suggestions through a chatbot-like interface.

Joining Database Studio under the Gemini in Databases brand umbrella are AI-assisted migrations via Google's existing Database Migration Service. Gemini models can convert database code and deliver explanations of the changes along with recommendations, according to Google.

Elsewhere, in Google's new Database Center — yet another Gemini in Databases component — users can interact with databases using natural language and can manage a fleet of databases with tools to assess their availability, security and privacy compliance. And should something go wrong, those users can ask a Gemini-powered bot to offer troubleshooting tips.

"Gemini in Databases enables customers to easily generate SQL; additionally, they can now manage, optimize and govern entire fleets of databases from a single pane of glass; and finally, accelerate database migrations with AI-assisted code conversions," Andi Gutmans, GM of databases at Google Cloud, wrote in a blog post shared with TechCrunch. "Imagine being able to ask questions like 'Which of my production databases in east Asia had missing backups in the last 24 hours?' or 'How many PostgreSQL resources have a version higher than 11?' and getting instant insights about your entire database fleet."

That assumes, of course, that the Gemini models don't make mistakes from time to time — which is no guarantee.

Regardless, Google's forging ahead, bringing Gemini to Looker, its business intelligence tool, as well.

Launching in private preview, Gemini in Looker lets users "chat with their business data," as Google describes it in a blog post. Integrated with Workspace, Google's suite of enterprise productivity tools, Gemini in Looker spans features such as conversational analytics; report, visualization and formula generation; and automated Google Slide presentation generation.

I'm curious to see whether Gemini in Looker's report and presentation generation work reliably. Generative AI models don't exactly have a reputation for accuracy, after all, which could lead to embarrassing, or even mission-critical, mistakes. With any luck, we'll find out as Cloud Next continues through the week.

Gemini in Databases could be perceived as a response of sorts to top rival Microsoft's recently launched Copilot in Azure SQL Database, which brought generative AI to Microsoft's existing fully managed cloud database service. Microsoft is looking to stay a step ahead in the budding AI-driven database race and has also worked to build generative AI into Azure Data Studio, the company's set of enterprise data management and development tools.
