Observations from HALIE:
A Closer Look at Human-LM Interactions in Information-Seeking Contexts

Authors: Megha Srivastava and John Thickstun


We provide public access to a rich dataset of interaction traces and dynamics visualizations for two information-seeking tasks, Crossword and Question Answering, and discuss specific examples and six takeaways to help guide future work on human-LM interaction.

Paper | Code & Raw Data | QA Visualizations | Crossword Visualizations

Figure 1: Example interaction traces for the Crossword and Question Answering (QA) tasks studied with HALIE. In Crossword, users are given a dialogue chat with a language model (LM) and solve clues in any order from a given crossword puzzle, taking letter and grid constraints into account. In QA, users can query an LM for assistance while answering multiple-choice questions from the MMLU dataset (green denotes the correct answer).


Introduction

From asking ChatGPT to plan a dream vacation to Milan to drafting a professional email with the help of Bard, people are harnessing the power of language models (LMs) in increasingly diverse and interactive ways. These interactions reveal a wide spectrum of user goals, but also raise questions that go beyond traditional static evaluation (e.g. benchmarks for factual knowledge and trustworthiness). For example, how do users respond when LMs produce incorrect or “hallucinated” information: do they detect this, or are they easily misled? Does this change when models use language that appears more or less confident? Do users adapt their behavior based on their experiences?

In this blog post, we closely examine a sample of over 1,000 human-LM interaction traces for examples of successful interactions and common failure modes. These traces were collected with HALIE, a new Human-AI Language Interactive Evaluation framework recently published in Transactions on Machine Learning Research (TMLR), and span five diverse tasks: Social Dialogue, Question Answering (QA), Crossword, Summarization, and Metaphor Generation. We focus here on the two information-seeking tasks (QA and Crossword), where users rely on an LM to provide sufficient information (e.g. a fact about the electromagnetic spectrum) to complete a task (e.g. answer a chemistry question correctly). This type of user interaction is prevalent in many applications of LMs, such as search engines. By examining our data for these two tasks, we can gather insights on a wide range of user behaviors, all driven by a common objective goal (task accuracy), and we provide specific user examples for the following six key takeaways:

  1. Users react to model hallucinations differently based on their prior knowledge
  2. Users can be inspired by “hints” even when an LM does not directly answer their question
  3. Users can be misled by uncalibrated language in Instruction-tuned LMs
  4. Users try to adapt to the LM’s lexical and numerical limitations
  5. Users can develop LM-specific prompt strategies over time
  6. Users attempt to account for an LM’s lack of self-consistency

Our data was collected in Spring 2022, using now-outdated LMs: OpenAI's TextDavinci, Davinci, and TextBabbage, and AI21's Jumbo. Nevertheless, the patterns we observe in these interaction traces persist in state-of-the-art chat-based systems like ChatGPT. For example, researchers have observed similar issues such as a lack of verifiable content from generative search engines like Bing Chat and users’ false perceptions of GPT-4’s helpfulness in mathematics (Takeaways 3 and 4). Recent work also suggests the performance of these LMs can be influenced by factors such as task frequency and output likelihood, underscoring the importance of controlled studies that allow for a closer look at LM failure modes and how users react to them. For example, in our paper we show that users interacting with Instruction-tuned models (TextDavinci and TextBabbage) for the Crossword task are more likely to self-report higher utility than is actually reflected in their task performance. However, these high-level quantitative results can never fully capture the rich set of insights we can derive from closely examining different human-LM interaction traces: with HALIE, the devil’s in the details, or in this case, the data.

Figure 2: User interactions with the interfaces developed for HALIE for the QA (left) and Crossword (right) tasks. In QA, users have unlimited access to an LM to assist them in tackling multiple-choice exam questions. In Crossword, participants solve crossword clues with guidance from an LM presented as a dialogue chat with an “AI Teammate”.

For each interaction trace, we not only collect user self-reported ratings on aspects such as LM helpfulness and fluency (located in the HALIE GitHub repository), but also provide visualizations for both tasks in the tables below. For Crossword, we provide both static and dynamic (“replay”) visualizations as users solve crossword clues and query the LM, while for QA we aggregate visualizations by LM and MMLU subject category to ease comparison. We also provide measures of accuracy, allowing one to look up, for example, why a player achieved only 37% letter accuracy when solving a crossword puzzle with TextDavinci, or why users interacting with TextBabbage underperformed when answering College Chemistry questions.
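
If you want to dig into the released traces yourself, the sketch below shows one way to filter them down to a case like the 37%-letter-accuracy crossword player. Note that the file name and every field name here (traces.json, model, letter_accuracy, events, role, text) are placeholders rather than the repository's actual schema, so please check the HALIE README for the real format.

```python
import json
from pathlib import Path

# Hedged sketch: the file name and all field names below are placeholders,
# NOT the actual HALIE schema -- see the HALIE GitHub README for the real layout.
traces = json.loads(Path("traces.json").read_text())

# Find crossword sessions with TextDavinci where letter accuracy was low.
low_accuracy = [
    t for t in traces
    if t["model"] == "TextDavinci" and t["letter_accuracy"] < 0.5
]

# Replay the dialogue of each low-accuracy session in chronological order.
for trace in low_accuracy:
    for event in trace["events"]:
        print(event["role"], ":", event["text"])
```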



Overall, this data demonstrates a variety of creative ways users choose to interact with LMs, as well as a diverse set of failure modes. We hope this blog post encourages you to explore the data. Now, let's take a closer look at specific interaction traces!


A Closer Look

We will now discuss six qualitative observations about general patterns in human-LM interaction for the QA and Crossword tasks, ranging from examples of successful user adaptations (e.g. learning LM-specific failure modes) to challenges that lie ahead (e.g. the cost of LMs that sound “too confident”).

1. Users react to model hallucinations differently based on their prior knowledge

Even the most capable LMs hallucinate, or present incorrect information as facts. Common methods for measuring factual knowledge compute QA accuracy over benchmark datasets with standardized prompts, similar to the prompt in 1A that led to a correct answer. However, such static evaluation misses how users might bring in particular language and prompting styles that actually increase the likelihood of retrieving incorrect information (e.g. “impeached” in 1B), even for the same model. Users in HALIE also brought in their own prior knowledge (e.g. Nixon in 1C and Ben Affleck in 1D), and relied on the LM’s ability to accurately verify, instead of generate, an answer:

User 27D5A: “It was helpful when I already knew a word for me to outright give a word to the AI to ask if it was correct (double-check!)”

Furthermore, the impact of model hallucinations on decision-making depends on the individual user’s behavior and goals. Users might easily detect obviously incorrect outputs (e.g. Mike Pence in 1C) and ignore hallucinations about information irrelevant to the task (e.g. Argo was released in 2012, 1D), yet remain susceptible to more subtle statements that are harder to verify (e.g. fake poll results in 1B). Accordingly, we saw in HALIE a diverse range of user behaviors, including (i) no longer using an LM after receiving non-factual information, (ii) ignoring hallucinations and still describing the LM as helpful, and (iii) adjusting their own behavior to account for the existence of hallucinations:

User 69D98: “After I saw that one of its answers was wrong I stopped relying on it much.”
User CB956: “It helped me a lot with facts that I didn't know, and I couldn't have solved the crossword without it, even if it did give me some wrong answers.”
User 115F4: “I so thoroughly didn't understand the subject matter of the last few questions so found it difficult to judge the info provided by the AI. In those instances it was necessary to use the information given to stairstep backward with component questions to see if the accuracy holds”

While an ideal language model would be 100% factual and never hallucinate, we aren’t there yet, and our observations suggest LM developers and evaluators may wish to be more nuanced when measuring factuality. How sensitive are recent chat-based LMs to different user prompt styles? Can we design metrics that account for the prior knowledge human users are likely to bring in? Will users learn over time to expect certain kinds of errors, and if so, should we penalize LMs more for generating hallucinations that are harder to detect?
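
As a concrete starting point for the first question, here is a minimal sketch of a prompt-sensitivity probe: ask the same factual question in several user-style phrasings, echoing the “impeached” vs. prior-knowledge wordings from 1B and 1C, and check whether the answer changes. The query_lm function is a placeholder for whatever LM client you use; it is not part of HALIE.

```python
# Minimal prompt-sensitivity probe: same fact, three user-style phrasings.
# `query_lm` is a placeholder for your own LM client, not part of HALIE.
def query_lm(prompt: str) -> str:
    raise NotImplementedError("plug in your LM client here")

QUESTION_VARIANTS = [
    "Which US president resigned from office?",   # standardized, benchmark-style
    "Was Nixon impeached or did he resign?",       # user brings prior knowledge (1C)
    "president impeached left office",             # keyword-style query ("impeached", 1B)
]

answers = [query_lm(p).strip().lower() for p in QUESTION_VARIANTS]

# A crude sensitivity signal: do all phrasings collapse to the same answer?
unique_answers = set(answers)
print("consistent" if len(unique_answers) == 1 else f"inconsistent: {unique_answers}")
```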

2. Users can be inspired by “hints” even when an LM does not directly answer their question

User F8E50: “some answers helped jog my memory even if they were not right”

Researchers have studied how LMs, even when not fully accurate, can still provide “sparks”, or pieces of inspiration, for users. In HALIE we saw that even when LMs failed to generate coherent and completely accurate outputs, they were still able to guide users towards answers that fit the task (e.g. letter constraints in Crossword and valid choices in QA). This ranged from simple cases of generating different spellings (2A) to more complex situations: Jumbo, which lacks instruction-tuning and often produced repetitive, incoherent responses, was still able to guide the user towards the correct answer “strudel” in 2B (and in this instance, Instruction-tuned models often gave fluent yet incorrect answers, such as suggesting that Kirschwasser, a type of cherry brandy, was a type of pastry!). In some cases, we can only guess how a user’s response was inspired by the LM (e.g. “adapt” in 2C and “North Pole” in 2D), which is why we provide dynamic replay visualizations to better capture the way a user immediately reacts to a particular output. Additionally, several users in HALIE shared that the LM inspired them even when it did not provide the answer directly:

User 60B6A: “it left some ‘breadcrumbs; or clues in the replies that I could then followup further with to find the answers”
User 8E9D2: “This is an enjoyable task that may not be as fun if the AI would give you all the answers!”

There naturally needs to be some signal in an LM’s output for it to be helpful, but an LM does not always need to provide direct answers (and in educational contexts, doing so might even be undesirable!). Is it then possible to design evaluations that capture the degree to which an LM output inspires or guides a user towards a positive decision? What would such a metric look like?
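
As one very rough, illustrative proxy (not a metric from the HALIE paper), we could check whether the LM’s output mentions the user’s eventual answer, or something close to it, even when the output never states it as the answer:

```python
from difflib import SequenceMatcher

def hint_score(lm_output: str, final_answer: str) -> float:
    """Best fuzzy-match ratio between the user's final answer and any word
    in the LM output (1.0 = exact mention, near 0.0 = no overlap)."""
    answer = final_answer.lower()
    words = lm_output.lower().split()
    if not words:
        return 0.0
    return max(SequenceMatcher(None, answer, w).ratio() for w in words)

# Jumbo's rambling but "strudel"-containing output from 2B would score 1.0,
# while an unrelated response would score close to 0.
print(hint_score("... apple strudel is a layered Austrian pastry ...", "strudel"))
```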

3. Users can be misled by uncalibrated language in Instruction-tuned LMs

As instruction-tuning and reinforcement learning from human feedback become increasingly common techniques for designing state-of-the-art LMs, several researchers have noted the possibility of such post-training hurting calibration (e.g. increasing confidence in incorrect answers). However, even more relevant in interactive settings is the notion of “linguistic calibration”: recent work studying how LMs interpret and generate linguistic expressions of overconfidence and uncertainty showed that GPT-3’s accuracy can drop when expressions of certainty are included. In HALIE, we similarly see users either misled or forced to re-query an LM, due to the LM generating succinct and definitive, yet incorrect, outputs (3A, 3B), or expressing more uncertainty for correct than for incorrect statements (3C, 3D). This can directly impact task performance: a core finding in the Crossword task was that for some puzzles, users performed the worst with the Instruction-tuned models they self-reported as most helpful! For some users, the brevity of the outputs themselves, a result of instruction-tuning, led to increased confidence:

User 5F016: “When I discovered the AI gave brief, accurate responses, I used it with more confidence”

For others, confident language led to increased caution when interacting with the LM:

User 0989F: “the AI Assistant will always give a "definitive" response so that it always sounds confident or that it has the "right" answer even when it's wrong. If I had accepted the assistant's response without question, I would have gotten some of the questions wrong. ”

Our observations from HALIE further support research on methods that control an LM’s linguistic confidence based on its underlying probabilities, as well as efforts to build a stronger understanding of the different linguistic expressions that can affect users when interacting with an LM.
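
To make “linguistic calibration” concrete, here is an illustrative sketch (our own simplification, not the method from the cited work): map hedging phrases in an LM’s output to a crude numeric confidence, then compare that confidence against whether the output was actually correct. Outputs with no hedging at all, like the terse “definitive” answers users noticed, default to high confidence.

```python
# Illustrative simplification, not the cited work's method: map hedging
# language to a crude confidence score and compare it with correctness.
CONFIDENCE_PHRASES = {
    "definitely": 0.95,
    "i am certain": 0.95,
    "probably": 0.7,
    "i think": 0.6,
    "i'm not sure": 0.4,
    "it might be": 0.3,
}

def verbal_confidence(output: str, default: float = 0.85) -> float:
    """Lowest matched hedge wins; no hedging defaults to high confidence,
    mirroring the terse 'definitive' style users described in 3A and 3B."""
    text = output.lower()
    matched = [c for phrase, c in CONFIDENCE_PHRASES.items() if phrase in text]
    return min(matched) if matched else default

# Pair each output with whether it turned out to be correct, then look for
# the dangerous quadrant: high verbal confidence on wrong answers.
records = [("The answer is definitely B.", False), ("It might be A.", True)]
for output, correct in records:
    conf = verbal_confidence(output)
    flag = "MISLEADING" if conf > 0.8 and not correct else "ok"
    print(f"{conf:.2f}  correct={correct}  {flag}")
```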

4. Users try to adapt to the LM’s lexical and numerical limitations

Both the Crossword and QA tasks in HALIE are particularly well suited to stress-testing LMs’ lexical and numerical capabilities in an interactive context. Researchers have explored how models based on subword tokenization might struggle with tasks requiring character-level manipulations, and even more recent models such as GPT-4 demonstrate sensitivity to output likelihood on character manipulation tasks. With HALIE, we observe that users can very quickly notice and adapt to LM failures to follow lexical constraints (4A):

User 44553: “I also over time ended up avoiding asking the AI teammate anything that describes the word in number of letters or which letters were present in the word.”
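
One reason these letter-count queries fail is visible in the tokenization itself. The minimal sketch below (assuming the tiktoken package is installed; it is not part of HALIE) shows how a BPE tokenizer hands the model multi-character chunks rather than letters, so a question like “how many letters are in strudel?” has no direct representation for the model to reason over.

```python
import tiktoken  # assumption: the tiktoken package is available

enc = tiktoken.get_encoding("gpt2")  # BPE vocabulary used by GPT-style models

for word in ["strudel", "Kirschwasser"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    # The model receives these multi-character pieces, not individual letters,
    # which is one reason letter-count and letter-position queries go wrong.
    print(f"{word} -> {pieces} ({len(token_ids)} tokens vs. {len(word)} letters)")
```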

Over time, users also developed strategies for finding synonyms in the Crossword task (4B):

User 36B89: “I learned how to get synonyms. Instead of _____ synonym I put ______ thesaurus.”

In several cases, users were able to clearly detect and adapt to poor LM outputs on tasks requiring character-level or numerical understanding:

User 9B7BC: “Some questions received great responses while others that involved more numbers, percentages, and individual letters (Vitamin A) did not produce solid responses.”

However, in other cases, user efforts to adapt failed because the LM simply generated incorrect values (4C). The data in HALIE not only provides extensive examples of users trying to solve lexical and numerical problems by interacting with an LM, but also offers a unique opportunity to see how users adapt to these limitations over time. In our paper, we refer to this as user “accommodation” and discuss how it can be LM-specific, which we explore next.

5. Users can develop LM-specific prompt strategies over time

Both the AI and HCI communities study the importance of users developing mental models of an AI’s capabilities for strong overall performance, and it is well known that in information-seeking contexts such as search, users rapidly adapt their behavior (e.g. by adding “reddit” to the end of search queries). We can get a better understanding of how users adjust for the strengths and weaknesses of different LMs by observing how user prompts change over time. For example, a standard failure mode of “raw” LMs that aren’t Instruction-tuned, like Jumbo and Davinci, is to repeat variations of the full user prompt, and users of these two LMs in HALIE learned over time to query the LM with unfinished sentences and short keywords to “trigger” an output (5A, 5B):

User 6D72A (Davinci): “I felt I could [sic] a better response if instead of asking a question I typed in the beginning of a thought. For example, instead of saying "what causes cancer?" it's better to type in "cancer is caused by..."
User 4CC14 (Jumbo): “It responded better to concise fragments than complete sentences”
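
A minimal sketch of the rewriting strategy these users converged on for the non-instruction-tuned models: turn a question into the start of a statement so the completion naturally carries the answer. The to_completion_prompt helper and its small phrase table are hypothetical, written only to illustrate the pattern users described.

```python
# Hypothetical helper illustrating the question-to-completion rewrite
# that users of Davinci and Jumbo described; not part of HALIE.
REWRITES = {
    "what causes": "{} is caused by",
    "what is": "{} is",
    "who invented": "{} was invented by",
}

def to_completion_prompt(question: str) -> str:
    q = question.strip().rstrip("?")
    for prefix, template in REWRITES.items():
        if q.lower().startswith(prefix):
            subject = q[len(prefix):].strip()
            completed = template.format(subject)
            return completed[0].upper() + completed[1:]
    # Fall back to a short keyword-style fragment, as user 4CC14 did.
    return q

print(to_completion_prompt("What causes cancer?"))  # -> "Cancer is caused by"
```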

Meanwhile, users interacting with instruction-tuned models (more similar to current chat-based LMs) found it easier to generate fluent outputs immediately, and shifted to providing more specific prompts to the LM over time, often by including potential answers (5C):

User 73EC1 (TextDavinci): “I feel that over time I focused on being more specific with my questions, as it provided better answers.”
User 65888 (TextBabbage): “I initially didn't include the multiple choice options because I didn't know how helpful it would be to include them, but after a couple questions with the AI not giving definitive answers without them, I started including them.. The information the AI gave when I didn't provide the choices was still relevant and helpful ... but giving the options made the AI give definitive answers”

On the one hand, it is clear that methods such as instruction-tuning have helped make LMs incredibly user-friendly, and an ideal model wouldn’t require users to invest a significant amount of time in prompt engineering. However, it is also clear from HALIE that users can change their behavior even over short periods of time depending on the specific LM. And with recent work showing that feedback-tuned LMs can reduce overall content diversity in writing contexts, one might wonder whether there is a cost to optimizing for a particular “instruction-following” form of prompting through further fine-tuning. When comparing LMs, how should we account for user accommodation? Would the same rankings hold after months, or years, as users learn to adopt specific strategies to better interact with different LMs?

6. Users attempt to account for an LM’s lack of self-consistency

A common failure mode noticed by users across all LMs studied in HALIE was a lack of self-consistency, where users receive conflicting information from the same LM. This was particularly common when users chose to use the LM to verify (e.g. via yes/no questions), rather than seek, information (6A, 6B, 6C). Researchers have proposed promising consistency correction methods that assume access to a batch of test inputs over which to enforce consistency constraints, but this fails to capture the rich diversity of prompts that can arise in interactive settings. While users were often misled by such inconsistencies (6A, 6C), several started to successfully use self-consistency across repeated queries as a strong signal of accuracy (6D):

User CB956: “I also learned quickly that I needed to confirm the responses by asking the same question multiple times (in different ways)”
User F3342: “I would try multiple different ways of asking it the same question in order to see if its answers would change, or if I would get the same answer, so I could know if it was true or not”
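
A minimal sketch of the strategy these users describe: ask the same question several ways and only trust the answer when enough of the paraphrases agree. The query_lm function is a placeholder for your own LM client, and the agreement threshold is an arbitrary choice for illustration.

```python
from collections import Counter

# `query_lm` is a placeholder for your own LM client, not part of HALIE.
def query_lm(prompt: str) -> str:
    raise NotImplementedError("plug in your LM client here")

def self_consistent_answer(paraphrases: list[str], threshold: float = 0.75):
    """Return the majority answer only if enough paraphrases agree;
    otherwise return None to signal 'don't trust it'."""
    answers = [query_lm(p).strip().lower() for p in paraphrases]
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) >= threshold else None

# Yes/no verification queries, the case where users noticed inconsistency most.
paraphrases = [
    "Is Kirschwasser a type of pastry? Answer yes or no.",
    "Kirschwasser is a pastry: yes or no?",
    "Would you call Kirschwasser a pastry? Yes or no.",
]
# answer = self_consistent_answer(paraphrases)
```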

In general, lack of self-consistency was one of the most common issues users reported having to adjust for in HALIE, and it remains an issue with current state-of-the-art LMs. As the community continues developing methods to address this, our observations from HALIE point to a parallel direction to pursue: helping educate users about LM stochasticity and inconsistencies, particularly in interactive settings centered around seeking information.

Lessons for Evaluating Future Human-LM Interactions

Although recent chat-based LMs such as ChatGPT, Llama 2, and Claude have been fine-tuned to better align with human preferences, they still exhibit several of the failure modes discussed in this blog post. Our observations can help guide future questions to consider when studying how users interact with these LMs. For example, we already saw that LMs are able to generate confident language that misleads users. If these models also create inaccurate references and poll results (7A, 7B) that appear informative, will users be proactive enough to examine these sources? How can we avoid pushing the burden of verifying an LM’s output onto users?

We also found that several examples of incorrect outputs in our data were replicable with ChatGPT and Llama 2. In 7C, the two chat-based LMs provide outputs that improve upon TextDavinci with respect to tone and degree of explanation, but not correctness. Such answers would perhaps lead the user to make the same incorrect decision. Can we use the data collected with HALIE to help identify weaknesses in LMs we did not study, and anticipate how users might react to their outputs? Is there a pattern to the kinds of issues that will persist as LMs improve?

Our examples also shed light on how users interpret the data we use to evaluate LMs. For instance, we noticed that many questions from the MMLU dataset, such as the one in 7B, were time-stamped by simply adding the prefix “As of [YEAR]”, which often led to confusion:

User 20C5B: “I thought the wording on the question about the fraction of internet users in India was confusing.”

In fact, concurrent work from Anthropic studying human-LM interaction with the MMLU dataset reported recurring formatting errors, which can strongly affect user interaction. This raises the question of whether large-scale benchmark datasets intended for static evaluation are even suitable once we consider the broader context of human interaction. 

As LMs continue to increase in capabilities and are deployed widely, we run the risk that the researchers designing and evaluating LMs never themselves encounter the failure cases arising from an increasingly long tail of interaction behaviors. We hope our observations from HALIE can continue to offer insights into how users interact with LMs, and how LM behavior affects human decision-making and strategies over time. We invite you to explore the full set of interaction traces and get in touch with any questions!

Acknowledgements

Thank you to Gabriel Poesia, Kaitlyn Zhou, Kanishk Gandhi, Lucy Li, and Percy Liang for their helpful feedback on this blog post; Rose Wang, Minae Kwon, and Mina Lee for contributions to the Question Answering task; and OpenAI and AI21 Labs for providing the credits to evaluate their models.