“Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers” investigates whether large language models (LLMs) can propose innovative research concepts. The study involved over 100 NLP experts to assess the originality and potential of ideas generated by LLMs in the field of natural language processing.
It’s intriguing that LLMs can generate more novel ideas, but these ideas often lack practicality. Human ideas, on the other hand, tend to be less novel but more grounded in prior research and feasibility.
The next two steps appear to be:

- Explore whether the seemingly impossible ideas from LLMs could be achieved with a significant breakthrough in methods.
- Consider how human creativity stays grounded: can LLMs, retrieval-augmented generation, or similar tools adopt strategies to reflect that? (A sketch of one such grounding loop follows this list.)
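To make the second step concrete, here is a minimal sketch of what such a grounding loop might look like: retrieve the closest prior work for a candidate idea, then ask the model to reposition or reject the idea against it. The `search_papers` helper, the model name, and the prompt wording are all illustrative assumptions, not anything from the paper.

```python
from openai import OpenAI

client = OpenAI()

def ground_idea(idea: str, search_papers) -> str:
    """Revise an idea so it explicitly accounts for related prior work."""
    # search_papers is a hypothetical retrieval helper, e.g. a thin wrapper
    # around the Semantic Scholar API returning dicts with title/abstract.
    papers = search_papers(idea, top_k=5)
    context = "\n".join(f"- {p['title']}: {p['abstract'][:300]}" for p in papers)
    prompt = (
        "Here is a research idea:\n"
        f"{idea}\n\n"
        "Here are the most closely related prior papers:\n"
        f"{context}\n\n"
        "Revise the idea so it is clearly positioned against this prior work, "
        "or flag it as already done."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```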
It’s fascinating that the evaluation criteria focus on perceived novelty and practicality rather than rigorous scientific testing. Since no one can predict the future, it might be more valuable to assess whether LLMs can recognize a genuinely novel idea as novel (and a derivative one as derivative), rather than relying solely on these perceptions.
The real challenge will be if the authors’ next paper is based on an idea proposed by an LLM!
On a more serious note, I’m eager to see the results of the next phase, where researchers will implement ideas proposed by both AI and humans to see how they perform in practice.
Evaluating feasibility or working within available resources seems much simpler than inventing a novel idea from scratch. This makes the creativity of LLMs promising, even if they sometimes lean toward impractical ideas. If an interesting concept isn’t feasible, the LLM can easily adjust it or discard it and generate a new one, keeping the process flexible and efficient.
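As a minimal sketch of that flexible process, assuming three hypothetical helpers that stand in for LLM calls (`generate_idea`, `judge_feasibility`, `refine_idea`) and arbitrary thresholds:

```python
def ideation_loop(generate_idea, judge_feasibility, refine_idea,
                  n_candidates=20, threshold=0.6, max_refinements=2):
    """Over-generate ideas, score feasibility, refine or discard."""
    keepers = []
    for _ in range(n_candidates):
        idea = generate_idea()
        for _ in range(max_refinements + 1):
            score = judge_feasibility(idea)  # e.g. an LLM judge in [0, 1]
            if score >= threshold:
                keepers.append((score, idea))
                break
            idea = refine_idea(idea)  # nudge the idea toward practicality
        # ideas still infeasible after max_refinements are simply dropped
    return [idea for score, idea in sorted(keepers, reverse=True)]
```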
How often does an idea turn out to be concurrent work, i.e., such a clear next step from previous research that multiple groups pursued it independently? Even the most innovative ideas can usually be described as “X but with Y,” where Y might come from another field.
Thinking outside the box is essentially fuzzy pattern recognition, something future machine learning models could definitely excel at.
The study seems solid from a quick glance (it’s 94 pages long). They use retrieval-augmented generation (RAG) to find the top related papers on a given topic, generate 4,000 seed ideas per topic, and then have the LLM rank those into paper proposals using a template. Both human and machine proposals are rewritten by the LLM for style, and the first human author verified that the rewriting doesn’t reduce proposal quality.
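As I read it, the pipeline reduces to a few stages. The sketch below is a compressed paraphrase under that reading; `retrieve`, `llm`, `embed`, and `rank` are hypothetical callables, and the paper’s actual prompts, ranker, and deduplication threshold will differ.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def deduplicate(ideas, embed, sim_threshold=0.8):
    """Greedy dedup: keep an idea only if nothing kept so far is too similar."""
    kept, kept_vecs = [], []
    for idea in ideas:
        v = embed(idea)
        if all(cosine(v, u) < sim_threshold for u in kept_vecs):
            kept.append(idea)
            kept_vecs.append(v)
    return kept

def generate_proposals(topic, retrieve, llm, embed, rank, n_seeds=4000):
    """RAG retrieval -> mass seed generation -> dedup -> LLM ranking."""
    papers = retrieve(topic)  # top related papers for this topic
    prompt = f"Propose a research idea on '{topic}'.\nRelated work:\n{papers}"
    seeds = [llm(prompt) for _ in range(n_seeds)]  # over-generate seed ideas
    return rank(deduplicate(seeds, embed))  # LLM ranker orders the survivors
```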
I’m surprised by the finding that “LLMs produce more novel proposals.” It suggests the common belief that LLMs can’t be more creative than their prompt might be too quick a judgment, possibly an artifact of testing with a single, monolithic prompt. This study instead employs a kind of “chain of thought” process: literature search, brainstorming, filtering, and proposing.
But after reading the most common points of criticism, the filtering step might not be rigorous enough. Scientists usually won’t “blurt out” ideas that are completely intractable or just very far-fetched. Also, comparing the agreement scores among the reviewers used in the paper with those among reviewers at, e.g., major conferences, I’m not sure how “expert” the experts really were.