ChatGPT’s Answers
First answer: Summarize the story
[ChatGPT in “code font” and my comments in bold normal font.]
John Henry is a legendary figure in American folklore,
[while it is correct that he is part of “folklore,” John Henry is also a classic example of an American “tall tale,” so I would have preferred to see that in the answer] known for his incredible strength and his battle against a steam-powered hammer during the construction of the railroads. According to the story, John Henry was a steel-driving man, meaning he used a large hammer to drill holes into rocks for explosives. He was said to be exceptionally strong and fast, able to outperform the newly introduced steam-powered hammer.
[John Henry isn’t really “said to be…able to outperform” the steam hammer; he thought he could and he proved that he could by winning before dying. Even though this is a tall tale, the use of weasel words is still not apt, because the whole point of the story is that John Henry believed he could prevail while others didn’t and it is not left ambiguous at the end of the story.]
The story goes
[I like this bit of colloquial flair] that a railroad company proposed a competition between John Henry and the steam-powered hammer to see who could drive the most steel in a single day. John Henry, determined to prove that human strength could prevail over machines, accepted the challenge. The contest took place with a crowd of spectators, and John Henry tirelessly swung his hammer, driving steel into the rock with remarkable speed.
Despite his fierce efforts, John Henry emerged victorious, having driven more steel than the machine.
[This is obviously an incorrect use of “despite.” It’s not clear why ChatGPT misused “despite.” Perhaps because the story takes a negative turn later on (John Henry’s death), ChatGPT tipped its hand early[1] by using a preposition of contrast at the start of the paragraph in which the twist occurs. I am not sure if this misuse of language technically counts as a “hallucination,” since it is grammatically rather than factually incorrect. Someone should coin a term for this if there isn’t one already.] However, the intense exertion took a toll on him, and he collapsed with his hammer in his hand, never to rise again. John Henry's sacrifice
[questionable if this counts as a “sacrifice” since it wasn’t actually on behalf of anyone else, but I suppose that is debatable] in defeating the machine demonstrated the strength, determination, and resilience of human labor in the face of advancing technology.
The story of John Henry has been passed down through generations and has become a symbol of the struggles faced by workers during the industrialization of the United States. It celebrates the virtues of hard work, perseverance, and the indomitable spirit of individuals.
[This is…fine, but has a very “concluding-a-five-paragraph-middle-school-essay”[2] vibe.]
Second answer: Modernize the story
I asked ChatGPT, “How might a modern John Henry compete with large language models like ChatGPT?” I think it just completely failed at this prompt. While the insights are good, they describe generic attributes of a contest, rather than an actual contest.[3]
In the context of large language models like ChatGPT, a modern John Henry figure might symbolize a human attempting to outperform or challenge the capabilities of AI-driven language models. While the task may not involve physical strength like the original story, there are still areas where human input can be valuable.
[This is technically correct but also so vague as to be useless. Saying that “the task may not involve physical strength” makes no sense: it absolutely cannot involve physical strength! There is no meaningful way in which a human could compete against ChatGPT in a feat of physical strength.]
A modern John Henry could compete with large language models by showcasing their unique qualities and strengths that AI cannot replicate. For example, they could emphasize their creativity, emotional intelligence, critical thinking, and the ability to interpret context in nuanced ways. These are areas where human intuition and experience often excel.
[Again, this answer is verbally fluent nonsense. It ignores the source material, showing that ChatGPT does not understand[4] the story. If the hypothesized large language models (LLMs) in this answer cannot replicate modern John Henry’s “unique qualities” at all, there is no contest. The point is to find a competition in an area in which AI and humans are both formidable, with some humans still able to outperform the AI. Recall that John Henry does not defeat the steam hammer by writing poems or tending wounds, but by driving steel faster than a steel-driving machine.]
The modern John Henry might focus on engaging in meaningful conversations that require empathy, understanding complex emotions, and providing personalized advice or guidance. They could also emphasize the value of human judgment, moral reasoning, and ethical decision-making, which can be more nuanced and subjective compared to the objective algorithms employed by AI.
[Again, this makes no sense (see above). However, this part of the answer reminded me of an interesting academic paper on ChatGPT’s answers to various forms of the trolley problem, which were inconsistent with one another yet persuasive to humans.[5]]
Furthermore, the modern John Henry could highlight the importance of human connection, empathy, and the ability to build rapport and trust with others. They could emphasize the genuine human interaction and the ability to understand and respond to the unique needs and nuances of individuals.
[Once again, this isn’t really a “contest” and does not fit with having a human-AI near-parity competition.]
In essence, a modern John Henry could compete with large language models by showcasing the distinct qualities and strengths that make humans valuable in various domains, demonstrating that while AI has its merits, human capabilities remain essential for certain tasks.
[Generic. Booooooo!]
Third answer: Try to modernize the story again
Me: What one competition could modern John Henry do to compete against an LLM? Be specific and be concise. [“Be concise” is a useful prompt engineering tip I read on either Twitter or Substack (or perhaps multiple places).]
ChatGPT: A specific competition that a modern John Henry could undertake to compete against a large language model (LLM) is a live debate on a complex and controversial topic, judged by a panel of experts.
This one is interesting, because it seems like a correct answer, but I am not sure I like it. I think the idea of a debate is interesting, but it is unclear why specifically having the debate live would be an improvement. In terms of raw verbal fluency, an LLM would mop the floor with a human. On the other hand, a human able to verbally spar with an LLM in real time would be an impressive competition similar to the original John Henry.
However, I think the problem comes in suggesting a combination of “live” and “judged by a panel of experts.” If it were a live Oxford-style debate with audience voting, then the LLM may win by being more persuasive even if it were hallucinating endlessly. Expert judges, on the other hand, would check the assertions and sources; they would undoubtedly find numerous hallucinations that would disqualify the LLM.
I suppose the judges could be fact-checking throughout the competition, so I may have actually talked myself into liking this idea. Imagine two rounds, the debate and the judgement. It would be an asymmetric competition, because the LLM debater would almost certainly lose in the judgement round due to hallucinations, but the human debater may simply fail to keep pace with the LLM debater and lose before the judgement. The human debater would have to focus on quickly churning out answers and not being fazed by red-herring hallucinated arguments made by the LLM debater. The LLM debater would have to out-produce and outlast the human debater, since it would most likely lose in the judgement round.
However, this would presumably also rely on effective voice-to-text for the LLM to take the human debater’s speech as an input, which with current technology would probably result in the LLM failing.
Conclusion
ChatGPT is good at producing fluent nonsense. It is convincing at first, and seems like the answer you’d want, but it isn’t right because it doesn’t really understand the structure of the stories it has seemingly memorized. Also, sometimes the fluent nonsense is not very fluent, such as when it misuses a preposition. It’s like the SNL sketch Almost Pizza.
Two things I plan to follow up on in[6] future posts:
Suggestions of sketch comedy, books, movies, etc. that help us think about LLMs, but—importantly—are not explicitly about AI. (Almost Pizza not Ex Machina).
Exploring further which famous stories ChatGPT may have memorized (e.g., classic fairy tales, urban legends, etc.).
[1] Anthropomorphization alert! P.S. In footnote 9 of the “GPT-4 System Card” paper, OpenAI says “We use the term ‘hallucinations,’ though we recognize ways this framing may suggest anthropomorphization, which in turn can lead to harms or incorrect mental models of how the model learns.”
[2] I asked ChatGPT to do something similar. Prompt 1: Create a hyphenated adjective that conveys the sense of a work of writing seeming like it is structured to satisfy a middle school writing assignment. Answer 1: Middle-school-pleasing. Prompt 2: Create five more. Answer 2: 1. Formulaic-middle-school; 2. Conventional-assignment-driven; 3. Simplistic-prompt-compliant; 4. Cliché-ridden-exercise-like; 5. Pedestrian-schoolwork-esque.
[3] Anthropomorphization alert!
[4] Anthropomorphization alert!
[5] From the abstract: “ChatGPT turns out highly inconsistent as a moral advisor. Nonetheless, it influences users’ moral judgment, we find in an experiment, even if they know they are advised by a chatting bot, and they underestimate how much they are influenced. Thus, ChatGPT threatens to corrupt rather than improves users’ judgment.”
[6] How often will LLMs string together three prepositions like I did there?
This is great analysis. I agree thoroughly that it can make pleasing-sounding nonsense because it is very smooth grammatically.