Industrial applications of large language models



natural language indexing :: Article Creator



Snowflake's AI Integration: Vector Indexing And Language Models ...

With artificial intelligence taking the world by storm, unconventional approaches are expected to push AI adoption beyond the evaluation phase.

While large language models seem to work magic with consumer-ready summary blogs, data authenticity carries much higher stakes in generative application development. To this end, Snowflake Inc. relies on a technique called Retrieval Augmented Generation (RAG) to produce believable outcomes, according to Sridhar Ramaswamy, senior vice president at Snowflake and co-founder of Neeva Inc., which Snowflake acquired in May.

"We pioneered a technique called RAG, Retrieval Augmented Generation, or you can just think of that as smart search, that you need to use in conjunction with the language model in order to produce believable output," Ramaswamy explained. "We want to have that index that can set the context for a language model to always produce believable, authentic information."

Ramaswamy spoke with theCUBE industry analyst Dave Vellante at Supercloud 4, during an exclusive broadcast on theCUBE, SiliconANGLE Media's livestreaming studio. They discussed how Snowflake is using RAG to make AI outputs believable, as well as the importance of data and language models.
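The retrieval step Ramaswamy describes can be illustrated with a minimal sketch. This is not Snowflake's implementation: the sample documents, the bag-of-words similarity (a stand-in for a real embedding model) and the function names `retrieve` and `build_prompt` are all assumptions made for illustration. The idea is simply that a search step selects context, and that context is placed in front of the question before a language model sees it.

```python
import math
from collections import Counter

# Hypothetical mini-corpus standing in for indexed enterprise documents.
DOCUMENTS = [
    "Snowflake acquired Neeva in May.",
    "Streamlit is a Python framework for building data apps.",
    "Vector indexing stores embeddings for similarity search.",
]

def tokenize(text):
    """Lowercase words with surrounding punctuation stripped."""
    return [w.strip(".,?!").lower() for w in text.split() if w.strip(".,?!")]

def bow(text):
    """Bag-of-words term counts; an assumed stand-in for an embedding model."""
    return Counter(tokenize(text))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = bow(query)
    return sorted(docs, key=lambda d: cosine(q, bow(d)), reverse=True)[:k]

def build_prompt(query, docs, k=1):
    """Prepend retrieved context so the model answers from indexed data."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("Who acquired Neeva?", DOCUMENTS)
```

The resulting `prompt` would then be sent to whichever language model is in use; that call is left out because the interview does not specify one.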

Language models are changing the stakes in the enterprise world

Building on language models' proficiency with language, Snowflake plans to broaden their adoption by exposing them through structured query language (SQL), according to Ramaswamy. The aim is to make data practitioners' work easier as language models reshape the business landscape.

"A really important part of language models is literally what the two words say, which is their proficiency with language … in terms of efficiency, they're pretty amazing," he said. "What we are going to do as part of the Snowflake platform is, not just host a set of models, but we are going to make it easily accessible in SQL, which means that the thousands of data analysts … have access to language models in the SQL that they write, without needing to deal with GPUs, deal with APIs."

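To make the idea of calling a model from SQL concrete, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for a data warehouse. The `SUMMARIZE` function name, the `tickets` table and the stub "model" (which just returns the first sentence) are assumptions for illustration; they are not Snowflake's actual SQL functions.

```python
import sqlite3

def summarize(text):
    """Stub 'model' that returns the first sentence.
    A real deployment would call a hosted language model here."""
    return text.split(".")[0].strip() + "."

conn = sqlite3.connect(":memory:")

# Register the Python function so SQL queries can call it by name.
conn.create_function("SUMMARIZE", 1, summarize)

conn.execute("CREATE TABLE tickets (id INTEGER, body TEXT)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?)",
    [
        (1, "Login fails after password reset. User tried twice."),
        (2, "Invoice total is wrong. Tax was applied twice."),
    ],
)

# The analyst writes plain SQL; no GPU or API handling appears in the query.
rows = conn.execute(
    "SELECT id, SUMMARIZE(body) FROM tickets ORDER BY id"
).fetchall()
```

The point of the design is that the model call looks like any other SQL function, so existing analysts can use it inside the queries they already write.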
Open-source language models are changing the dynamics in the AI field. This is a notable transformation because it has provided a lot of impetus to the ecosystem, which looked like it would only have a handful of players, such as OpenAI and Anthropic, according to Ramaswamy.

"The rise of open-source language models and the role that Meta is increasingly playing in the language model space," he pointed out. "We have a great partnership with Meta, but of course, we are also pre-training our own models. I think this sort of rise in open source and the democratization of AI models is an important development."

Snowflake is also integrating a set of AI capabilities into its platform, including vector indexing and running language models, Ramaswamy pointed out.

"The core technology of AI is that it's a natural language, you can speak to it, you can write things and it can extract the structured information that is underneath and surface it to you," he said. "We are also integrating the Neeva Search technology, which is a combination of old-school information retrieval plus vector indexing natively into Snowflake so that you're able to just, with a single command, index a table."

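A rough sketch of combining "old-school" keyword retrieval with vector similarity follows. It is an illustration under stated assumptions, not the Neeva or Snowflake implementation: character-trigram counts stand in for learned embeddings, and `build_index`, `hybrid_search` and the `alpha` weighting are hypothetical names and choices.

```python
import math
from collections import Counter

def terms(text):
    """Exact-match keyword set (the classic information retrieval side)."""
    return set(text.lower().split())

def trigrams(text):
    """Character-trigram counts; an assumed stand-in for an embedding vector."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def keyword_score(query, row):
    """Fraction of query terms that appear verbatim in the row."""
    q = terms(query)
    return len(q & terms(row)) / len(q) if q else 0.0

def build_index(rows):
    """'Index a table': precompute one vector per row."""
    return [(row, trigrams(row)) for row in rows]

def hybrid_search(query, index, alpha=0.5, k=1):
    """Blend keyword and vector scores; alpha weights the keyword side."""
    q_vec = trigrams(query)
    scored = [
        (alpha * keyword_score(query, row) + (1 - alpha) * cosine(q_vec, vec), row)
        for row, vec in index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:k]]

table = [
    "quarterly revenue report",
    "employee onboarding guide",
    "revenue forecasts by region",
]
index = build_index(table)
top = hybrid_search("revenue reporting", index)
```

In this example both revenue rows match the keyword "revenue" equally, and the vector side breaks the tie in favor of the row whose wording is closest to "reporting".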
Combining SQL and Streamlit for better data access

To democratize access to the structured data sitting in Snowflake, innovations such as combining SQL with Streamlit apps are essential, according to Ramaswamy. This gives users fluid access to the data they need for better decision-making.

"Our take is that combining SQL and Streamlit to write interesting applications, people are going to come up with some crazy stuff that will be incredibly valuable … I think the first wave is going to come early next year," he said. "We are also creating a Copilot experience that builds on top of the inferencing platform and the vector indexing so that when you can talk to it in English, it writes the SQL queries and then you can click a button and have that run."

When it comes to data, creating value and making outcomes believable are essential. Snowflake is therefore building governance and security into these features from the start, Ramaswamy pointed out.

"How exactly is access control going to work?" he asked. "This is why, with the things that we are implementing, we make sure that governance and security are right in there from the very beginning. We actually have applications where people can feel like, 'OK, only the people that are supposed to look at some data are actually allowed to look at that data.'"

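One way to picture governance being "right in there from the very beginning" is to filter records by the caller's role before any text reaches a language model. The sketch below is an assumption-laden illustration: the `Record` type, the role names and the `build_context` helper are hypothetical and do not describe Snowflake's access-control mechanism.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    text: str
    allowed_roles: frozenset  # roles permitted to read this record

def visible_records(records, role):
    """Keep only the records the given role is allowed to see."""
    return [r for r in records if role in r.allowed_roles]

def build_context(records, role):
    """Assemble model context from permitted records only, so restricted
    data never enters the prompt in the first place."""
    return "\n".join(r.text for r in visible_records(records, role))

records = [
    Record("Q3 revenue was up.", frozenset({"finance", "exec"})),
    Record("Office closed Friday.", frozenset({"finance", "exec", "staff"})),
]

staff_context = build_context(records, "staff")
finance_context = build_context(records, "finance")
```

Filtering before retrieval, rather than after generation, means the model cannot leak data the user was never allowed to read.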







