‘Prompt engineering’ is a hot topic, but what if every generative AI tool has a different vision of what amounts to a good prompt?

Boolean mastery was once the coin of the legal writing realm. Marrying keen legal thinking, a sense of how courts write, and the ability to appreciate the difference between a well-placed “/p” and a “w/10” cemented a young lawyer’s value in the early days of electronic legal research. Primitive eDiscovery also rewarded attorneys who could predict the right searches to get the right results, giving rise to a whole industry of outside discovery vendors. Alas, increasingly robust “natural language” models leveled the playing field for everyone else competing with these research ninjas.

As legal enters the generative AI era, the prompt engineer is again ascendant. Once more, the community whispers of the mythic figure of the true engineer who can coax large language models to produce quality content — or at least not get firms sanctioned — and ponders how law schools will train the next generation to write the prompts that will make the whole world spin.

But everyone might be getting a bit ahead of themselves — and not just because legal-specific generative AI is not quite ready for prime time — because you can’t engineer a prompt without understanding what the AI is even looking for.

During Legalweek, I had a chat with Jeremy Pickens, Head of Applied Science at Redgrave Data (known around Above the Law parts as the A-Team of data problems), about overcoming the token problem. Essentially, how does legal get the results it wants from generative AI without swallowing up any efficiencies by breaking the bank on token charges? He explained that scientists evaluating larger context windows have found, “some language models will pay attention to the beginning and the end of the prompt and sort of just ignore the middle, others tend to pay attention to the middle of the prompt and ignore the beginning and the end, others pay attention to the beginning and ignore the rest, others pay attention to the end and ignore the rest.”

Until that point, I’d always considered the prompt engineering issue a redux of building Boolean Barbarians who just knew how to craft the right input. But Boolean had the advantage of being, more or less, a fixed language across tools. Large language models keep all that stuff behind the curtain of their natural language, chatbot-inspired interfaces. Depending on the precise situation, the model might be teaching itself how to react to prompts and coming up with idiosyncrasies that no one expected.

That’s what machine learning does! On that note, remember when we called this stuff machine learning? We endured an overblown AI hype cycle that settled into a nice, comfy “machine learning” phase. It made everyone feel better about it. Now we’re back to artificial intelligence again. As it turns out, the legal industry’s wild oscillation between these terms mirrors what happened in the computing world. Zachary Lipton wrote a piece a few years ago titled From AI to ML to AI: On Swirling Nomenclature & Slurried Thought that considered the methodological damage caused by playing fast and loose with these terms:

Because the technology itself is discussed so shallowly, there’s little opportunity to convey any sense of technological progress by describing the precise technical innovations. Instead, the easiest way to indicate novelty in the popular discourse is by changing the name of the field itself!

What was Google working on 6 years ago? Big data. Four years ago? Machine learning. Two years ago? Artificial intelligence. Two years from now? Artificial general intelligence!

Whether or not the technological progress provides any intellectually sensible justification for relabeling a field of research, readers respond to periodic rebranding. Researchers in turn have an incentive to brand their work under the new name in order to tap in to the press coverage.

In any event, with these rebranded “AI-ML-AI” tools feasting on data, the real action for the hardcore scientists is in building the tests to figure out how LLMs react to prompts. Is it using everything the user gives it? Is it actually doing the work the user expects it to? What, exactly, is it ignoring and not ignoring?
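Those tests can be surprisingly simple in shape. A common approach (sometimes called a “needle in a haystack” probe) buries a known fact at different positions in a long prompt and checks whether the model can retrieve it. Here is a minimal sketch of the idea; the `ask_model` function is a hypothetical stand-in that simulates a model ignoring the middle of its context window, not any real vendor’s API:

```python
def ask_model(prompt: str, question: str) -> str:
    """Hypothetical model: only 'reads' the first and last quarter of the prompt.

    A real probe would replace this with a call to an actual LLM API.
    """
    n = len(prompt)
    visible = prompt[: n // 4] + prompt[3 * n // 4 :]
    # The simulated model answers correctly only if the buried fact
    # landed in the portion of the context it actually attends to.
    return "FOUND" if "NEEDLE-FACT" in visible else "NOT FOUND"


def probe(position: float, filler_words: int = 2000) -> bool:
    """Bury a marker fact at a relative position (0.0-1.0) in filler text,
    then test whether the model recalls it."""
    filler = ["lorem"] * filler_words
    filler.insert(int(position * filler_words), "NEEDLE-FACT")
    prompt = " ".join(filler)
    return ask_model(prompt, "Where is the needle?") == "FOUND"


# Sweep the needle across the context window and record recall.
results = {pos: probe(pos) for pos in (0.0, 0.5, 0.9)}
print(results)
```

Run against a model with the “ignore the middle” habit Pickens described, a sweep like this would show recall succeeding at the edges of the prompt and failing in the middle, which is exactly the kind of idiosyncrasy a prompt engineer would need to know before deciding where to place the instructions that matter.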

This is pretty significant when you consider the popular use case of “summarization.” If RoboLaw has decided to care less about the tail end of the 500 documents fed into its maw, that’s going to matter.

The lawyer of the future may be the lawyer who understands how to use AI, but training the AI-savvy attorney might have to wait until the scientists figure out what that would even look like.