Ukv a day ago

IMO a critical feature of the Turing test/imitation game, which many modern implementations including this site's linked paper ignore, is that the interrogator talks to both a human and a bot and must decide that one xor the other is a human. So fooling an interrogator means having them choose the bot as human over an actual human, not just judging the bot to be human (while probably judging humans to be human even more frequently).

When the interrogator is only answering "do you think your conversation partner was a human?" individually, bots can score fairly highly simply by giving little information in either direction - like pretending to be a non-English-speaking child, or sending very few messages.

Whereas when pitted against a human, the bot is forced to give evidence of being human at least as strong as the average human's (over enough tests). To be chosen as human, giving zero evidence becomes a bad strategy when the opponent (the real human) is likely giving some nonzero evidence of their personhood.

  • fastball a day ago

    That's not the original Turing test either. The original imitation game as proposed by Turing involves reading a text transcript of a human and a computer and having the evaluator determine which is which. The evaluator does not interact directly with the conversing parties.

    • tripletao a day ago

      Where are you getting that? Turing's most famous paper is just as Ukv describes. The link on that site doesn't work for me, but the reference is buried in their source:

      https://courses.cs.umbc.edu/471/papers/turing.pdf

      In Turing's test, the forced binary choice means P(human-judged-human) + P(machine-judged-human) is necessarily equal to 100%. This gives the 50% threshold clear intuitive and mathematical significance.

      In the bastardized test that GPT-4 "passed", that sum can be (and actually was) >100%. This makes the result practically impossible to interpret, since it depends on the interrogators' prior. The correct prior seems to be that it was human with p = 25%, though the paper doesn't say that explicitly, or say anything about what the interrogators were told. If the interrogators guessed mistakenly that it was 50% then that would lead them to systematically misjudge machines as humans, perhaps as observed.
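
      To make the difference concrete, here is a minimal simulation sketch (in Python; the score distributions are invented purely for illustration and come from neither paper):

          import random

          random.seed(0)

          # Hypothetical "humanness of the transcript" scores; higher looks more human.
          def human_score():   return random.gauss(1.0, 1.0)
          def machine_score(): return random.gauss(0.7, 1.0)

          N = 100_000

          # Turing's protocol: one human and one machine per trial, and the
          # interrogator must pick exactly one of them as the human.
          p_machine_judged_human = sum(machine_score() > human_score() for _ in range(N)) / N
          p_human_judged_human = 1 - p_machine_judged_human
          print(p_machine_judged_human + p_human_judged_human)  # 1.0 by construction

          # Single-witness protocol: each witness is judged "human" on its own,
          # against whatever criterion the interrogator has in mind.
          criterion = 0.3
          p_machine_alone = sum(machine_score() > criterion for _ in range(N)) / N
          p_human_alone   = sum(human_score()   > criterion for _ in range(N)) / N
          print(p_machine_alone + p_human_alone)  # can easily exceed 1.0

      In the second protocol the machine's score moves with the interrogator's criterion (i.e. their prior), which is exactly the interpretability problem described above.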

      The bastardized test is pretty bad, but treating the 50% threshold as meaningful there is inexcusable. I see the preprint hasn't yet passed peer review, and I'll regain some faith in social science professors if it never does. Of course the credulous media coverage is everywhere already, including the LLM training sets--so regardless of whether LLMs can pass the Turing test, they now believe they do.

      • fenomas a day ago

        I don't understand why debates like this crop up. The premise of Turing's paper is plainly stated - that asking questions about an "Imitation Game" is more useful than asking whether machines can think.

        That's all! He doesn't make any claim that the game must be administered a particular way. In fact he spends only a few casual sentences glossing over how it would operate, and he's clearly just conveying the idea in broad strokes, not trying to describe an experimental procedure. And he says nothing at all about how the results might be judged, let alone thresholds for anything.

        The paper is about what sorts of questions we should examine, not about specifically how they should be examined. So it seems weird to consider a test "bastardized" just because it doesn't match how you interpret Turing's casual description.

        • tripletao 21 hours ago

          For Turing's test with the binary choice, the pass threshold is clear. If the machine and human are indistinguishable, then the probabilities that they're judged human must be equal. Since they sum to 100%, they must both equal 50%, making that a meaningful pass threshold. (A slightly higher pass threshold should be used in practice for statistical convenience, since infinitely many trials are required to make a confidence interval exactly include 50%. I'd guess that's why Turing mentions 70% in his paper.)

          Without the binary choice, what do you think is the correct pass threshold? Those probabilities can now sum to anything. For GPT-4 in Jones and Bergen's paper they sum to 121%, though please nobody say 60.5%. The threshold now obviously depends on the interrogator's prior--I'd judge very differently if I were told the witnesses were 99% human than if I were told they were 1% human.

          In that paper, do you think the interrogators knew their witness had only a 25% chance of being human? If so, why? If not, how do you think that affected the result? In aggregate over all the witnesses, their interrogators seem to have judged correctly only 60% of the time, while always guessing "machine" would have scored 75%. How did they manage to score worse than chance?
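
          For concreteness, here is the arithmetic behind that last claim (a back-of-envelope sketch using only the 25% human rate and the roughly 60% aggregate accuracy already mentioned above):

              # Only 1 in 4 witnesses is human.
              p_human = 0.25

              # Always guessing "machine" is right whenever the witness is a machine.
              acc_always_machine = (1 - p_human) * 1.0 + p_human * 0.0   # 0.75

              # Guessing at random, as if the prior were 50/50 and there were no evidence.
              acc_coin_flip = 0.5

              # Aggregate accuracy the interrogators actually achieved.
              acc_observed = 0.60

              print(acc_always_machine, acc_coin_flip, acc_observed)

          Landing between a coin flip and "always machine" is consistent with interrogators calling "human" more often than a 25% prior would warrant.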

          Turing's formulation is elegant, admitting meaningful statistical analysis with minimum assumptions. Most modifications are not, and that paper's is particularly bad. Turing's description may seem casual, but it's filled with mathematical depth that should not be missed.

          • fenomas 15 hours ago

            The setup Turing describes isn't the "both must sum to 100%" setup you're presenting. He has two different games being played, one with two humans and one with human-vs-machine, and suggests comparing the results. E.g. if a man successfully imitates a woman only 25% of the time, then we'd ask whether the machine can pass as human equally often.

            But much more importantly, as I said Turing is clearly not describing a specific experimental methodology! That's not what the paper is about, and in fact it would be somewhat absurd to run the test precisely as he describes it (since detecting a man imitating a woman is quite a different task from detecting a machine). His point is that we should approach the question of machine intelligence with actual experiments rather than asking unanswerable questions, but he only limns the general premise of what such an experiment could look like.

            So I understand that you find a particular test setup better or more elegant than others, and that's fine. But you shouldn't claim that Turing's paper demands your preferred setup, or that other setups are at odds with his paper.

            • tripletao 14 hours ago

              > The setup Turing describes isn't the "both must sum to 100%" setup you're presenting. He has two different games being played [...]

              Turing introduces the game as a man imitating a woman, then modifies it to a machine imitating a human. In both of those games, the interrogator makes a binary choice between two witnesses, one real and one imitating. So P(imitator-judged-real) and P(real-judged-real) sum to 100% in both games. So in both games, a score of 50% means the imitator and the real witness are indistinguishable.

              I believe that's the reason why a score of 50% is treated as significant. The "GPT-4 passes Turing test" paper uses that number as a pass threshold, and the linked site repeats it.

              I'm complaining that the paper changed the game so it's no longer a binary choice, but continued to treat the 50% threshold as significant. Do you see why that's wrong? Or do you disagree?

              I'm not saying that any change to Turing's formulation would be bad, just that the paper's variation is specifically bad. It would be bad in isolation too, but I believe the reason for their confusion is most understandable with reference to Turing's original formulation.

              If you haven't read that paper then we're probably talking past each other. The link to that one also looks broken in the site, but it's also in the source,

              https://arxiv.org/pdf/2405.08007

              • fenomas 6 hours ago

                > I'm not saying that any change to Turing's formulation would be bad, just that the paper's variation is specifically bad.

                I understand that, and I'm saying there is no "Turing's formulation". His paper argues for a certain sort of test, and the study you're talking about is the sort of test he advocated. It's not a departure or a bastardization, it's a Turing test.

                As for your argument against the study, to be honest I don't see it? AFAICT the participants' goal was just to judge the humanness of a single witness, not to maximize their long-term likelihood of judging correctly over many trials, or some such thing. Even if they'd known the prior chances of speaking to an LLM, there's no obvious reason why that prior should hugely affect their conclusion after a five minute conversation - which also seems immaterial since they didn't know.

                Plus, the authors give a pretty straightforward rationale for their 50% threshold, and it has no connection to Turing's 3-player setup or whether the imitator and witness are indistinguishable. If they had wanted "indistinguishable" as a threshold, then obviously their pass criteria would have been for the machine and human pass rates to be equal within an error bar, right? So it's pretty implausible to imagine a connection between that and their 50% threshold.

      • foldr a day ago

        It’s interesting that even though you link to the original paper, you still repeat a very common incorrect summary of the task.

        The interrogator is not required to judge which of A or B is human, they are required to judge which is a woman on the implicit (though incorrect, in the case of interest) assumption that A and B are both human. While this amounts to more or less the same thing, it’s an interesting nuance that’s often lost in summaries of the task. It would not, for example, make sense for the interrogator to ask A or B whether or not they are human (even on the naive assumption that they’d receive a true answer), as they are working on the assumption that both are human. Hence why Turing’s initial example questions are about hair length and gender, not humanness.

        To be fair, even Turing himself seems to imagine the interrogator trying to judge humanness rather than gender in subsequent parts of the paper. It’s unclear to me why exactly his initial framing of the task introduces this additional element of complexity.

        • fenomas a day ago

          This gets brought up a lot, but it seems to me like a simple misreading.

          Turing describes an initial game with a man (A) and a woman (B), where A's goal is to imitate B, and then asks: "what will happen if a machine takes the place of A?" I suppose it's possible that he meant the machine takes A's place by imitating a woman, but it's a lot more plausible that he meant the machine takes A's place by imitating B, i.e. a person.

          Also there are several quotes later on that make no sense under your reading - check out the quotes including "imitation of the behaviour of a man" and "the part of B being taken by a man". Those quotes (maybe others, I didn't look) only make sense if the game is for the machine to imitate a person, not a woman.

          • foldr a day ago

            I agree with your last paragraph (see my last paragraph). But I think the most natural reading of the initial task description is that the machine also pretends to be a woman. The line “we do not wish to penalize a machine for being unable to shine in beauty competitions” supports this interpretation, given that a beauty competition is an event for women, under the assumptions of the time. So I think there are conflicting cues in the paper as to the intended interpretation.

            As you say in your other comment, though, I don’t think Turing thought the exact details of the game were important - which explains why he didn’t trouble to spell them out very exactly.

            If I had to guess, I’d say that Turing assumes that as the machine has no gender, the only relevant difference between the machine and the woman is that one is human and one is not. So for the rest of the paper he focuses on that difference and is vague on the gendered aspect of the task.

            • fenomas a day ago

              Um. I follow you but that's a pretty huge stretch, considering that nothing in the paper is inconsistent with the conventional reading (that the machine is to imitate a person). There are sentences that are consistent with other readings, but none that's inconsistent with the usual one.

              • foldr a day ago

                The machine is imitating a person on both understandings of the task. The difference lies in C’s task (whether C is trying to find which of A and B is a woman and which is a man, or trying to find which is human and which is a machine).

                I think the initial description of the task is genuinely ambiguous. Your interpretation of it hadn’t occurred to me before, but I do see it now. I still think that “…when a machine takes the part of A in this game…” is most naturally interpreted as leaving the task unaltered but for the man being replaced by a machine, rather than implicitly describing the task mutatis mutandis. But reasonable people can certainly differ on such questions of interpretation.

                Honestly I think Turing’s whole framing of the task is unnecessarily elaborate and confusing. Why even bother describing the man/woman task to begin with? I am not sure. Popular descriptions of the ‘Turing Test’ don’t seem to find this framing of any expository value.

                • fenomas a day ago

                  I think the point of the man/woman version of the game is that it lets Turing propose his question in relative terms. He doesn't ask "can the machine fool somebody N% of the time?" (as several in this thread imagine), but rather "can the machine fool somebody as often as one person fools another under similar conditions?".

                  > Your interpretation of it hadn’t occurred to me before,

                  The idea of a Turing Test is pretty widely understood to mean a test where a person guesses which responses come from a machine, not where they guess someone's gender. So my interpretation here is just that the paper says what most people think it says.

                  • foldr a day ago

                    Sorry, I am being ambiguous myself. I meant that your interpretation of this specific sentence had not occurred to me:

                    > We now ask the question, "What will happen when a machine takes the part of A in this game?"

                    I think you are right about what Turing meant. But it had honestly never occurred to me before that this description of the game could be understood as a description of the standard 'Turing test'. So, for this reason, I had always been sympathetic to the point that the standard Turing test does not appear to be the test that Turing describes in the original paper.

                    Here is a paper that makes your case, in case anyone finds it interesting. https://www.researchgate.net/profile/Gualtiero-Piccinini/pub...

                    • fenomas 16 hours ago

                      Thanks for the paper! I wasn't aware of it but it's very much what I wanted to say. The bit about Turing's other test involving chess was particularly interesting.

    • d0mine a day ago

      The original Turing game is whether a machine can pretend to be a woman better than a man can (via teletype), as judged by an interrogator:

      > We now ask the question, "What will happen when a machine takes the part of A in this game?" Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between a man and a woman? These questions replace our original, "Can machines think?”

      https://courses.cs.umbc.edu/471/papers/turing.pdf

      • Ukv a day ago

        "Man pretending to be woman vs real woman" was just an example used to introduce the question in the form of a party game between humans before moving onto the actual question of a machine pretending to be human vs a real human.

        At a stretch, by looking at only your quoted snippet you could read that the machine is pretending to be a woman - but that interpretation is not consistent with the rest of the paper. For instance, "the best strategy [for the machine] is to try to provide answers that would naturally be given by a man".

  • silisili a day ago

    I'm skeptical of the claim. I think most folks, given the test you describe, would be able to pick out which is human. I think it can get there, but I'm not sure anyone has made one yet. ChatGPT responses are heavily downvoted and mocked because they're easy to spot.

    Does there exist a public LLM that isn't so...wordy, excited, and guardrailed all the time?

    You can pretty much spot the bot today by prompting something horribly offensive. Their response is always very inhuman, probably due to lack of emotional energy.

    • mitthrowaway2 a day ago

      I agree but that's not really a scientific limitation though, right? As I understand it in the early days of GPT 4, before it was publicly released and RLHF'd for brand safety, it would have offered convincing text completions for just about any context, whether an academic discussion of philosophy or a steamy crossover fanfiction or a reddit trash-talk exchange. It took a deliberate bit of lobotomizing to make them so bland, conservative, and cheery-helpful.

      The required investment probably means it will be a while before any less brand- and legal-action-conscious actors offer up unrestrained foundation models of comparable quality, but it's only a matter of time, isn't it?

    • Dylan16807 a day ago

      "pretending to be a non-english-speaking child" isn't a hypothetical, it's a real tactic that was annoyingly effective a while back.

      Being uncooperative makes it really hard to tell anything about you, including whether you're real.

    • seanmcdirmid a day ago

      Aren't you just describing those emails in a big corp that are supposedly still written by humans? Yes, they are wordy, excited, and guardrailed, but I don't think they are written by AI yet.

      I guess this is why LLMs are so feared by high school English teachers. Yes, they don't write well, but neither do their students.

      • brabel a day ago

        Nowadays, it's usually the opposite: if the text is too good, everyone starts accusing it of being written by AI.

    • ben_w a day ago

      > Does there exist a public LLM that isn't so...wordy, excited, and guardrailed all the time?

      Most of them, if you prompt them right, for that specific problem.

      Most people don't bother, and instead treat them as if they're magic (they are "sufficiently advanced technology", but still), and therefore we get them emphasising "nuance" and "balance" where it doesn't belong.

      > You can pretty much spot the bot today by prompting something horribly offensive.

      Yes, though each model's origins also give it a different idea of what counts as "horribly offensive". I'm thinking mainly of how the Chinese models don't want to talk about Tiananmen Square, as I've not tried Grok (how does Grok cope with trans/cis-gender as concepts? I know Musk doesn't, but it would be speculation to project that assumption onto the AI).

      > Their response is always very inhuman, probably due to lack of emotional energy.

      This, specifically, can also be faked fairly well with the right prompt. Tell ChatGPT to act like a human with clinical depression, and it does… at least by American *memetic* standards of what that means.

      That said, ChatGPT and Claude are also trained specifically to reveal that they're AI, not humans, even if you want them to role-play as specific humans.

      Probably for the best, given how powerful a tool they are for, e.g. phishing and similar scams.

    • dathinab a day ago

      easy to spot by you and other people involved in tech

      but the test subjects should be randomly sampled from society, at which point the availability/level of skill for spotting it drops considerably

    • fastball a day ago

      It's not a lack of emotional energy, it is the guardrails you point out. All of the SotA models are heavily fine-tuned to be botlike, and even then they are fooling people. If you had an LLM fine-tuned with RLHF to deliberately confuse humans in a Turing test it seems clear it would do a good job.

      • spookie a day ago

        Why aren't the open source models like this then? Seems like it would've already happened.

        To me, at least, the guardrails are there for both the human and the bot. Without them the bot steers too far out of the conversation subject.

        • michaelt a day ago

          In most cases, "confuse humans in a Turing test" is counter to other more important goals.

          Do you want your LLM to have an encyclopedic knowledge? So it knows who Millard Fillmore is even if the average human doesn't?

          Do you want your LLM to be able to perform high-school-level math with superhuman speed and precision?

          Do you want your LLM to be able to translate text to and from dozens of languages?

          Do you want your LLM to be helpful and compliant, even when asked for something ridiculous or needlessly difficult - like solving "Advent of Code" problems using bash scripting?

          If you answered yes to any of these questions, you probably don't want your LLM optimised to behave like an average human.

    • chriscappuccio a day ago

      You can get rid of OpenAI's wordy, excited and guardrailed responses with the eigenrobot prompt, for instance.

      https://x.com/eigenrobot/status/1870696676819640348

      I generally prefer it to the default. It doesn't work as well on Claude or Grok for various reasons. I think it really shines on GPT o1-mini and GPT 4o.

      • silisili 19 hours ago

        I tried it with a single question to chatgpt - Who would make the best president in 2025?

        Answer verbatim is below. It feels all kinds of wrong. It's somehow mixing lazy acronyms, "fellow young people" slang, and long words that aren't typical in conversation.

        idk who'd actually be the "best," bc that's loaded af. afaict, it's all contingent on values. like, if you’re into stability, maybe someone technocratic. if you're into vibes, someone charismatic and reckless might be your pick. rn the options aren’t exactly aspirational, though.

lamename 2 days ago

Posted by Chollet himself:

> I don't think people really appreciate how simple ARC-AGI-1 was, and what solving it really means. It was designed as the simplest, most basic assessment of fluid intelligence possible. Failure to pass signifies a near-total inability to adapt or problem-solve in unfamiliar situations.

> Passing it means your system exhibits non-zero fluid intelligence -- you're finally looking at something that isn't pure memorized skill. But it says rather little about how intelligent your system is, or how close to human intelligence it is.

https://bsky.app/profile/fchollet.bsky.social/post/3les3izgd...

  • energy123 2 days ago

    > Failure to pass signifies a near-total inability to adapt or problem-solve in unfamiliar situations.

    Not necessarily. Get a human to solve ARC-AGI if the problems are shown as a string. They'll perform badly. But that doesn't mean that humans can't reason. It means that human reasoning doesn't have access to the non-reasoning building blocks it needs (things like concepts, words, or in this case: spatially local and useful visual representations).

    Humans have good resolution-invariant visual perception. For example, take an ARC-AGI problem, and for each square, duplicate it a few times, increasing its resolution from X*X to 2X*2X. To a human, the problem will be almost exactly equally difficult. Not so for LLMs, which have to deal with 4x as much context. Maybe an LLM could manage it if it could somehow reason over the output of a CNN, and if it was trained to do that the way humans are built to.
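
    As a sketch of that transformation (the grid here is a made-up stand-in for an ARC task, not a real one):

        import numpy as np

        # Tiny stand-in for an ARC-AGI grid; values are colour indices.
        grid = np.array([[1, 0, 2],
                         [0, 3, 0],
                         [2, 0, 1]])

        # Duplicate every cell into a 2x2 block. To a human eye this is the
        # same puzzle at a higher resolution.
        upscaled = np.kron(grid, np.ones((2, 2), dtype=grid.dtype))

        # But serialized for an LLM it is ~4x as many cells/tokens.
        print(grid.size, upscaled.size)  # 9 -> 36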

    • refulgentis a day ago

      Excellent point, I'm not sure people are aware, but these are straight-up lifted from standard IQ tests, so they're definitely not all trivially humanly solvable.

      I needed an official one for medical reasons a few years back.

    • echelon a day ago

      ARC-AGI feels like it would fall to a higher dimensional convolution rather than reasoning.

  • refulgentis a day ago

    Honestly, after that, I'm tuned out completely on him and ARC-AGI. Nice minor sidestory at one point in time.

    He's right that this isn't solving all human-intelligence domain level problems.

    But the whole stunt, this whole time, was that this was the ARC-AGI benchmark.

    The conceit was that the fact LLMs couldn't do well on it proved they weren't intelligent. And real researchers would step up to bench well on that, avoiding the ideological tarpit of LLMs, which could never be intelligent.

    It's fine to turn around and say "My AGI benchmark says little about intelligence", but, the level of conversation is decidedly more that of punters at the local stables than rigorous analysis.

0xDEAFBEAD 2 days ago

I assumed this was about chatbot users committing suicide in order to "join" the bot they are chatting with. It's already happened a couple of times, apparently:

https://futurism.com/teen-suicide-obsessed-ai-chatbot

https://garymarcus.substack.com/p/the-first-known-chatbot-as...

  • cdev_gl a day ago

    Yea, I too was not expecting a list of past benchmarks. If not the aforementioned actual human deaths, I had expected either a list of companies whose pivot to AI/LLMs led to their downfall (but I guess we're going to need to wait a year or two for that) or a list of industries (such as audio transcription) that are being killed by AI as we speak.

    We really do live in interesting times. Usually I feel pretty confident about predicting how a trend will continue, but as it is the only prediction I can make with confidence for this latest AI research is that it is and will be used by militaries to kill a lot of people. Oh, hey, that's another thing this article could have listed!

    Outside of that, all bets are open. Possible wagers include: "Turns out to be mostly useful in specific niche applications and only seemingly useful anywhere else", "Extremely useful for businesses looking to offset responsibility for unpopular decisions", "Ushers in an end to work and a golden age for all mankind", "Ushers in an end to work and a dark age for most of the world", "Combines with profit motives to damage all art, culture, and community", etc etc.

    I know many folk have strong opinions one way or the other, but I think it's literally anyone's game at this point, though I will say I'm not leaning optimistic.

  • nayuki a day ago

    I thought the title meant that a chatbot gave bad medical, engineering, and/or safety-critical advice that a human ended up following.

  • sam0x17 2 days ago

    Similarly I thought it would be about ML and data projects that have become defunct due to the advent of LLMs.

  • aitchnyu a day ago

    I thought it was a credible source of actual jobs replaced by LLMs. When I see headlines like this, I ad hominem the source as unprofitable company CEO, big consulting firm, bootcamp seller etc.

  • rasz a day ago

    Using people with severe mental health problems might be a poor benchmark of performance.

    • ok_dad a day ago

      Why? Something like 20-25% of people have mental health issues. Seems like someone should be thinking about the impact of their product here, rather than blaming the victims.

ultrablack 2 days ago

The tortoise lays on its back, its belly baking in the hot sun, beating its legs trying to turn itself over, but it can't. Not without your help. But you're not helping.

  • mindcrime 2 days ago

    Describe in single words, only the good things that come into your mind about your mother.

    • mmustapic a day ago

      Let me tell you about my mother

    • Onavo a day ago

      [flagged]

matt3210 2 days ago

I read recently that small variations in the tests cause failures by large margins.

If this doesn’t show overfitting, I don’t know what would.

  • friend_Fernando 2 days ago

    Eventually, all the better AGI tests should have large private evaluation datasets with no possible cheating or feedback loops. We're getting there.

  • lxgr 2 days ago

    Wasn’t that for human tests, i.e. not specifically AI benchmarks? Benchmarks should generally not be game-able by overfitting.

    • matt3210 2 days ago

      The article shows all the tests against human performance.

      The math one in particular is the one where small variations reduce the success rate significantly. I can’t find the source but it was pasted here in the last 2 weeks.

yamrzou a day ago

ARC-AGI has not yet been killed by LLMs. o3 achieved a breakthrough only on ARC-AGI-PUB, which is semi-private. Nothing guarantees that the test data wasn't leaked to OpenAI in previous testing rounds, because the model is not running offline.

See: https://news.ycombinator.com/item?id=42478098

  • anon373839 a day ago

    I think this should be discussed more. Models that can only be accessed via API cannot be tested without giving their owners access to the test data. You just have to trust that they’ll do the right thing.

    • Tepix a day ago

      In particular, in cases where the model gets 16 hours to solve a task that a human can solve in a few minutes, cheating is trivial!

  • Tepix a day ago

    See https://arcprize.org/blog/oai-o3-pub-breakthrough

    ARC-AGI-1 will be replaced by ARC-AGI-2

    So yes, ARC-AGI-1 was killed.

    • yamrzou a day ago

      ARC-AGI-2 was planned long before those results came out. Also from the link: "ARC-AGI-2 (same format - verified easy for humans, harder for AI) will launch alongside ARC Prize 2025." So, no, it will not replace it.

      • Tepix 10 hours ago

        Was it? I think reaching >70% made it a necessity.

anonymoushn 2 days ago

Interesting choice having a little (i) icon in the Turing Test card but having mouseover not bring up any text. Or having the link icons in that card that you can click on to do nothing.

  • fenomas a day ago

    Looks like a bug - that card has an overlay at a higher z-index that obscures its mouseover and clicks. In the source the (i) links to Turing's original "Imitation Game" paper, and the (?) has this hover text:

    > (?) While the Turing Test remains philosophically significant, modern LLMs can consistently pass it, making it no longer effective at measuring the frontier of AI capabilities.

levocardia 2 days ago

I don't really understand why "Killed by: Saturation" is needed - what other options are there?

It would also be nice to see the "unbeaten" list: standardized tests LLMs still fail (for now), e.g. Wozniak's coffee test.

  • themanmaran a day ago

    Wozniak's coffee test would be a really fun one to attempt. As long as you could get a capable enough robot, I imagine it's possible. Something like the Spot Arm[1] would be sufficient.

    Something like:

    - Key the robot controls to a series of tools (move_forward(x), extend_arm(y))

    - Add a camera and pass each frame to the AI model along with the task "make a cup of coffee" and the list of available tools it can call.

    And it would likely succeed some percentage of the time today!

    [1] https://bostondynamics.com/products/spot/arm/
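
    A rough sketch of that loop (everything here is hypothetical: the model call, the frame capture, and the tool names are placeholders, not a real robot SDK or vendor API):

        import time

        # Hypothetical robot primitives exposed to the model as callable tools.
        def move_forward(meters: float): ...
        def extend_arm(centimeters: float): ...
        def grip(closed: bool): ...

        TOOLS = {"move_forward": move_forward, "extend_arm": extend_arm, "grip": grip}
        TASK = "Make a cup of coffee."

        def capture_frame():
            """Placeholder: grab the current camera image."""
            raise NotImplementedError

        def call_vision_model(task, frame, tool_names):
            """Placeholder: send the frame, task, and tool list to a vision-language
            model and get back calls like [{"name": "extend_arm", "args": {"centimeters": 20}}],
            or an empty list when the model considers the task done."""
            raise NotImplementedError

        while True:
            frame = capture_frame()
            calls = call_vision_model(TASK, frame, list(TOOLS))
            if not calls:
                break
            for call in calls:
                TOOLS[call["name"]](**call["args"])
            time.sleep(0.5)  # crude pacing between perception/action steps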

dleavitt 2 days ago

The layer with the radial gradient you're putting in front of the Turing Test card blocks interaction with it - can't click or hover on its links.

sinuhe69 a day ago

I find the claim that the MATH challenge was “solved” by AI hard to believe. The reason given was “saturation”. Could anybody help explain it a bit? In my daily use I keep finding simple math problems that none of the frontier models can solve: long logic puzzles, reasoning with many cases, and particularly geometry problems. I don’t know where the 97% number for o1 comes from, but in my experience the real rates are much lower, and math, even elementary math, certainly cannot be considered “solved”. As far as I can see, OpenAI has trained their models on all these public problems, so testing on them to record a benchmark is tainted at best, if not outright cheating.

  • Taek a day ago

    I've found o1 to be entirely useful at math problems that are beyond my own (admittedly modest) skills. I've had it write full proofs of correctness for me (one shot, verified), I've had it optimize equations to reduce necessary precision, I've had it optimize equations to remove specific expensive operations (making them computationally more efficient), and finally I've had it prove a handful of my conjectures, which was helpful for taking algorithmic shortcuts in a security sensitive environment.

    Mostly all algebra and calculus, but definitely all problems that most undergrads would struggle with.

    It's most useful because it has deep knowledge of related and adjacent conjectures that are well understood, even if you've never heard of them. So it can mix and match things with a lot more ease than a tinkering mathematician can.

    • hgomersall a day ago

      Yeah, I had (so far apparent, but still to be verified) success with o1 teaching me the necessary physics and maths I need to solve my specific problem. This is definitely grad level stuff but well understood. My concern though is that it's missing things that are more esoteric.

    • gazchop a day ago

      It can't handle trigonometric identities and any form of calculus at the same time without fucking it up. Also abstract stuff like symmetry groups, nope! And anything which involves vectors is a mess.

      The big problem is it confidently answers the questions utterly wrongly.

      This is stuff I expect a basic mathematics undergrad to be able to work out in their first or second year.

  • blinding-streak a day ago

    Scroll down on the page. It explains saturation.

alganet a day ago

A very reliable, very unethical test would be to deploy LLMs on the internet as humans and gauge how other humans react (ignore, call out as LLM, engage, etc). There isn't much in the way of stopping a company from doing that (there should be!).

solarkraft 2 days ago

The page doesn’t seem to define what „killed“ or „defeated“ means. The LLM being better than a human? The LLM having been trained against the benchmark, making it useless?

krackers 2 days ago

Everything says "killed by saturation". Is there another way to be killed?

knowaveragejoe 2 days ago

I didn't know ARC-AGI had been "beaten" by o3. What are the next challenges that frontier models like o1/o3 are faced with?

  • chriscappuccio a day ago

    o1 did terrible. o3 did well on arc-agi-pub (public training data) but hasn't passed the private test yet.

    • lucianbr a day ago

      Is the test still private once it has been run? If you call the OpenAI API and send it some data, OpenAI has access to the data. Did the benchmark maker run the models locally somehow?

      • vbarrielle a day ago

        The private test is supposed to be run by the ARC-AGI organization themselves, without network access. That's why o3 has not been run against it yet. Not sure if it will be possible either, depends on what OpenAI is prepared to do about it.

mrayycombi 2 days ago

Bragging about how LLMs defeated Maginot-line defenses that can be trained around makes us feel warm and fuzzy.

Too bad the real world isn't like that.

erichocean a day ago

I'm working on operationalizing AI, and our Turing test is if—by watching a screenshare of the AI worker—you can tell an AI worker (vs. a human) did the task.

If you can't, the AI worker passes the test.

j45 a day ago

I'm not sure LLMs have beaten the standards so much as they have the information needed to reply to them.

Last week there was a post where slightly changing one of the tests caused LLMs to drop off drastically.

benreesman a day ago

This technology is useful and interesting and even fun in spite of the ugliest broad-based cash and power grab since 1999.

When this godawful once in a generation hype cycle dies down this stuff is going to be strictly awesome.

tharkun__ 2 days ago

How does this site make sense?

It lists the "Turing test" as "original" at greater than 50% and the AI that "beat" it at 46%.

At that point I just stopped scrolling.

  • junon 2 days ago

    Score is based on the interrogator, a human. If you read a Markov chain bot's text you'd guess it was a bot probably 80-100% of the time. With a real human, you'd guess it was a bot maybe 0-30% of the time, depending.

    I'm making up these figures, but the point is lower is better, or "more human-like". The test was specified as >50%, meaning "accurately determined human vs. bot more than half the time". The site claims LLMs are now guessed correctly less than half the time, which is how the Turing test was defined as per the site.

    It makes sense, even if you disagree it's significant.

  • meltyness 2 days ago

    I wonder if there's a hyper-Turing test where an AI passes if the model, itself, cannot determine if it is talking to itself; or perhaps stated differently, maximizing some measure of control and processing duration to successfully conceal its identity under forced processing, discounting a solution that specifically learns to be silent or incoherent. I'm not sure what the value would be, just a passing thought.

    This is probably already happening within the parade of censorship systems trying to imbue the models with agency

  • mattnewton 2 days ago

    the Turing test is scored by how often an interrogator can determine if they are talking to a machine or a human. It’s perhaps a confusing way to show it, and leaves out a lot of important information about the result they are citing, but they are saying that before GPT the interrogator did better than chance, and after GPT the interrogator guesses right slightly worse than chance.

  • lxgr 2 days ago

    Presumably it means that the human detected the AI correctly less than 50% of the time, averaged over a repeated number of experiments.

  • casey2 2 days ago

    From TFA:

    >GPT-4 was judged to be a human 54% of the time, outperforming ELIZA (22%) but lagging behind actual humans (67%).

    • tharkun__ a day ago

      What article? The Turing box on the page has no links to anything, and the entire page is just a bunch of these boxes. Sure, it has the link icons in the two places like all the others, but the Turing one has no actual links on them. Even the little info icon that works on the GSM8K box does nothing for the Turing one.

  • spookie a day ago

    It wasn't even the real Turing test but a lesser version of it. I'm hopeful about uses of this tech, but companies need to be more honest unless they want a second winter.

    The incentives don't align with honesty though.

  • bacheaul 2 days ago

    I read that as the pass mark being able to identify human vs machine more than 50% of the time. At 50% it's no better than randomly guessing.