Could artificial intelligence assist you in writing your next paper?

Large language models can write abstracts or suggest research directions, but these AI tools are still works in progress

artificial intelligence, writing assistant, AI tools

You know that text autocomplete feature on your smartphone that makes it so convenient — and occasionally frustrating — to use? Now, tools based on the same concept are helping researchers to analyze and write scientific papers, generate code and brainstorm ideas.

Natural language processing (NLP) is a branch of artificial intelligence that aims to help computers "understand" and even produce human-readable text. Large language models (LLMs) have evolved from objects of study in this field into research assistants in their own right.

LLMs are neural networks that have been trained on massive bodies of text to process and, in particular, generate language. OpenAI, a research laboratory in San Francisco, California, created the best-known LLM, GPT-3, in 2020 by training a network to predict the next piece of text based on what came before. On Twitter and elsewhere, researchers have expressed amazement at its eerily human-like writing. And anyone can now use it, through the OpenAI programming interface, to generate text from a prompt. (Prices begin at around US$0.001 per 750 words processed, a metric that combines reading the prompt and writing the response.)
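
Calling GPT-3 through that interface takes only a few lines of code. Here is a minimal sketch using OpenAI's Python library as it existed in 2022; the model name, prompt and settings are illustrative rather than taken from the article.

```python
# Minimal sketch: prompting GPT-3 via the OpenAI API (2022-era library).
# Assumes an API key is set in the OPENAI_API_KEY environment variable.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.Completion.create(
    model="text-davinci-002",  # one of the GPT-3 models offered at the time
    prompt="Give three concrete suggestions for improving this abstract:\n\n<paste abstract here>",
    max_tokens=200,
    temperature=0.7,  # higher values give more varied suggestions
)
print(response["choices"][0]["text"])
```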

"I think I use GPT-3 almost every day," says Hafsteinn Einarsson, a computer scientist at the University of Iceland in Reykjavik. He uses it to generate feedback on his paper abstracts. Some of the algorithm's suggestions were useless in one example, which Einarsson shared at a conference in June, advising him to add information that was already included in his text. Others, however, were more helpful, such as "explicitly state the research question at the beginning of the abstract." According to Einarsson, it can be difficult to see flaws in your own manuscript. "You can either sleep on it for two weeks or have someone else look at it." And that "someone else" could be GPT-3.

Thought organization

Some researchers use LLMs to generate paper titles or to make text more readable. Mina Lee, a doctoral student in computer science at Stanford University in California, gives GPT-3 prompts such as "generate the title of a paper using these keywords." To rewrite difficult sections, she employs Wordtune, an AI-powered writing assistant developed by AI21 Labs in Tel Aviv, Israel. "I write a paragraph that's basically a brain dump," she explains. "I simply click 'Rewrite' until I find a cleaner version that I prefer." 

Domenic Rosati, a computer scientist at the technology start-up Scite in Brooklyn, New York, uses an LLM called Generate to organize his thoughts. Generate, created by Cohere, a Toronto-based NLP firm, behaves similarly to GPT-3. "I put in notes or just scribbles and thoughts and say, 'summarize this' or 'turn this into an abstract,'" Rosati explains. "It's a fantastic synthesis tool for me."

Language models can even aid in the design of experiments. For one project, Einarsson used the game Pictionary to collect language data from participants; given a description of the game, GPT-3 suggested variations he could try. In theory, researchers could also ask for fresh perspectives on experimental protocols. Lee, for her part, asked GPT-3 to come up with ideas for introducing her boyfriend to her parents. It advised going to a restaurant near the beach.

Coding encoding

OpenAI researchers trained GPT-3 on a wide range of text, including books, news stories, Wikipedia entries and software code. Later, the team noticed that GPT-3 could complete pieces of code just as it completes other text. The researchers developed Codex, a fine-tuned version of the algorithm trained on more than 150 gigabytes of text from the code-sharing platform GitHub [1]. GitHub has now integrated Codex into Copilot, a service that suggests code as users type.
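
Codex was made available through the same completions interface as GPT-3. Below is a hedged sketch of what that looked like, reusing the setup from the earlier snippet; the model identifier matches the Codex models OpenAI offered at the time, and the prompt is illustrative.

```python
# Sketch: asking a Codex model to complete code from a comment prompt.
# Assumes openai.api_key has been set as in the previous example.
import openai

completion = openai.Completion.create(
    model="code-davinci-002",  # a Codex model from that era
    prompt="# Python function that loads a CSV file and returns the mean of each numeric column\n",
    max_tokens=150,
    temperature=0,  # deterministic output suits code generation
)
print(completion["choices"][0]["text"])
```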

At least half of their colleagues use Copilot, says Luca Soldaini, a computer scientist at the Allen Institute for AI (also known as AI2) in Seattle, Washington. It works best for repetitive programming, Soldaini says, such as writing boilerplate code to process PDFs. "It just blurts something out and says, 'I hope this is what you want.'" Sometimes it isn't. For that reason, Soldaini is careful to use Copilot only for languages and libraries they are familiar with, so that they can spot problems.

Literature searches

The best-known application of language models is in searching and summarizing the literature. AI2's Semantic Scholar search engine, which covers approximately 200 million papers, primarily in biomedicine and computer science, provides tweet-length descriptions of papers using a language model known as TLDR (short for too long; didn't read). TLDR is based on an earlier model called BART, developed by researchers at the social-media platform Facebook and refined using human-written summaries. (TLDR is not a large language model by today's standards: it has only about 400 million parameters, whereas GPT-3's largest version contains 175 billion.)
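
TLDR's BART lineage makes the general approach easy to illustrate. The sketch below uses a general-purpose BART summarizer that is publicly available on Hugging Face, not AI2's actual TLDR model, to produce a tweet-length description.

```python
# Illustrative only: tweet-length summarization with an off-the-shelf BART
# model, standing in for AI2's TLDR model (which was tuned on papers).
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

abstract = (
    "Large language models are neural networks trained on massive text "
    "corpora to predict the next token. They can draft summaries, answer "
    "questions and generate code, but they sometimes fabricate facts."
)
result = summarizer(abstract, max_length=30, min_length=10, do_sample=False)
print(result[0]["summary_text"])  # a one-line description of the input
```
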
AI2's Semantic Reader, an application that augments scientific papers, also includes TLDR. When a user clicks on an in-text citation in Semantic Reader, a box with information, including a TLDR summary, appears. "The idea is to bring artificial intelligence right into the reading experience," says Dan Weld, chief scientist at Semantic Scholar.

When language models generate text summaries, "there's often a problem with what people charitably call hallucination," Weld says. "But is the language model just completely making stuff up, or lying?" On truthfulness tests [2], TLDR performs reasonably well: authors of papers that TLDR was asked to describe rated its accuracy at 2.5 out of 3. Weld attributes this to the fact that the summaries are only about 20 words long, and to the algorithm's habit of rejecting summaries that include uncommon words that do not appear in the full text.
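
That rejection step can be illustrated with a toy filter. The function below is an assumption-laden sketch of the idea, not AI2's implementation; in particular, the notion of a 'common words' list is something this example introduces.

```python
def looks_hallucinated(summary, full_text, common_words):
    """Flag a candidate summary that uses uncommon words absent from the paper.

    Toy sketch of the filtering idea described above. `common_words` is
    assumed to be a set of frequent English words (not part of AI2's
    published method); anything outside it must appear in the source text.
    """
    source_vocab = set(full_text.lower().split())
    for word in summary.lower().split():
        if word not in common_words and word not in source_vocab:
            return True  # a rare word the paper never uses: suspicious
    return False
```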

Elicit, a search tool launched in 2021 by the machine-learning non-profit organization Ought in San Francisco, California, takes a question such as "What are the effects of mindfulness on decision making?" and returns a table of ten papers. Users can instruct the software to fill in columns with content such as abstract summaries and metadata, as well as information about study participants, methodology and results. Elicit uses tools including GPT-3 to extract or generate this information from papers.

Joel Chan, a human-computer interaction researcher at the University of Maryland in College Park, uses Elicit whenever he begins a new project. "It works great when I don't know what language to search in," he says. Gustav Nilsonne, a neuroscientist at the Karolinska Institute in Stockholm, uses Elicit to find papers with data he can add to pooled analyses. He says the tool has suggested papers that he had not found in earlier searches.

Evolving models

The AI2 prototypes provide a glimpse of the future of LLMs. Researchers who have questions after reading a scientific abstract might not have time to read the full paper. A team at AI2 created a tool that can answer such questions, at least in the domain of NLP. It began by asking researchers to read the abstracts of NLP papers and then pose questions about them (for example, "what five dialogue attributes were examined?"). The team then asked other researchers to answer those questions after reading the full papers [3]. Using the resulting data set, AI2 trained a version of its Longformer language model, which can ingest an entire paper rather than the few hundred words that other models handle, to generate answers to questions about other papers.
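
Longformer's appeal here is its input length: it accepts sequences of up to 4,096 tokens, roughly an order of magnitude more than BERT-style models. The sketch below loads the publicly released base model, not AI2's question-answering variant, simply to show the long-input setup.

```python
# Sketch: feeding an entire paper to Longformer, which handles inputs of
# up to 4,096 tokens (this is the public base model, not AI2's QA variant).
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

paper_text = "..."  # the full paper text, not just the abstract
inputs = tokenizer(paper_text, truncation=True, max_length=4096, return_tensors="pt")
outputs = model(**inputs)  # contextual representations for the whole paper
```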

Other AI2 models include ACCoRD, which can generate definitions and analogies for 150 scientific concepts related to NLP, and a fine-tuned version of BART that lets researchers take a question and a set of documents and generate a brief meta-analytical summary.

Then there are applications that go beyond text generation. In 2019, AI2 used Semantic Scholar papers to fine-tune BERT, a language model created by Google in 2018, producing the science-focused SciBERT, which has 110 million parameters. Scite, which used AI to create a scientific search engine, fine-tuned SciBERT further so that, when its search engine lists papers citing a target paper, it categorizes them as supporting, contrasting or otherwise mentioning that paper. That nuance helps people to identify limitations or gaps in the literature, Rosati says.
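
A hedged sketch of the citation-classification idea follows: a SciBERT encoder with a three-way classification head. Scite's actual fine-tuned model is not public, so the label names and the untrained head here are illustrative assumptions.

```python
# Sketch: SciBERT with a 3-label head for citation intent. The head is
# randomly initialized, so outputs are arbitrary until fine-tuned on
# labelled citation sentences (Scite's training data is not public).
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=3
)

sentence = "Our results contradict the findings of Smith et al. (2020)."
inputs = tokenizer(sentence, return_tensors="pt")
logits = model(**inputs).logits
labels = ["supporting", "contrasting", "mentioning"]  # illustrative labels
print(labels[logits.argmax(-1).item()])
```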

The AI2 SPECTER model, which is also based on SciBERT, condenses papers into compact mathematical representations. According to Weld, SPECTER is used by conference organizers to match submitted papers to peer reviewers, and Semantic Scholar uses it to recommend papers based on a user's library.
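
SPECTER's representations can be computed with the checkpoint AI2 released publicly on Hugging Face. The sketch below follows the documented recipe of embedding a paper's title and abstract and taking the [CLS] vector; the example papers and the similarity comparison are illustrative.

```python
# Sketch: SPECTER-style paper embeddings, compared by cosine similarity.
# Papers are encoded as "title [SEP] abstract"; the [CLS] vector is the
# paper representation. Example inputs below are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

papers = [
    "BERT: Pre-training of Deep Bidirectional Transformers"
    + tokenizer.sep_token + "We introduce a new language representation model...",
    "Attention Is All You Need"
    + tokenizer.sep_token + "We propose the Transformer, a model architecture...",
]
inputs = tokenizer(papers, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state[:, 0, :]  # [CLS] vectors

print(float(torch.cosine_similarity(embeddings[0], embeddings[1], dim=0)))
```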

Tom Hope, a computer scientist at the Hebrew University of Jerusalem and AI2, says that other AI2 research projects have fine-tuned language models to find good drug combinations, links between genes and disease, and scientific challenges and directions in COVID-19 research.

Can language models, however, provide deeper insight or even discovery? In May, Hope and Weld co-authored a review with Eric Horvitz, Microsoft's chief scientific officer, and others that outlines the challenges to accomplishing this, such as teaching models to "[infer] the result of recombining two concepts". "Generating a picture of a cat flying into space is one thing," Hope says, referring to OpenAI's DALL·E 2 image-generation model. "How will we get from there to combining abstract, highly complex scientific concepts?"

That remains an open question. However, LLMs are already having an impact on research. At some point, Einarsson says, "people will miss out if these large language models are not used."
