Metaphor: finding a path from an initial point to a destination in a graph. As the number of parameters increases, one can expect the LLM to remember how to go from one place to another, and in the end it should be able to find a long path. This can be an emergent property, since with fewer parameters the LLM would not be able to find the correct path. Now one has to find what kind of problems this metaphor is a good model of.
What do you think about this analogy? A simple process produces the Mandelbrot set. A simple process (loss minimization through gradient descent) produces LLMs. So what plays the role of the 2D plane, or the dense grid of points, in the case of LLMs? It is the embeddings (or ordered combinations of embeddings) which are generated after pre-training. In the case of the 2D plane, the closeness between two points is determined by our numerical representation scheme. But in the case of embeddings, we learn the 2D grid of words (playing the role of points) by looking at how the words are used in the corpus.
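To make the "simple process" half of the analogy concrete, here is a minimal sketch (plain Python, my own illustration rather than anything from the thread) of the iteration that decides membership in the Mandelbrot set over a dense grid of points:

```python
# Minimal Mandelbrot membership test: iterate z -> z^2 + c for each point c
# on a dense grid of complex numbers and see whether the orbit stays bounded.

def in_mandelbrot(c: complex, max_iter: int = 100) -> bool:
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:   # orbit escaped: c is not in the set
            return False
    return True            # orbit stayed bounded (up to max_iter)

# Render a coarse ASCII view of the set over a grid in the complex plane.
for im in [y / 10.0 for y in range(10, -11, -1)]:
    row = ""
    for re in [x / 20.0 for x in range(-40, 11)]:
        row += "#" if in_mandelbrot(complex(re, im)) else "."
    print(row)
```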
The following is a quote from Yuri Manin, an eminent mathematician.
https://www.youtube.com/watch?v=BNzZt0QHj9U
Of the properties of mathematics, as a language, the most peculiar one is that by playing formal games with an input mathematical text, one can get an output text which seemingly carries new knowledge. The basic examples are furnished by scientific or technological calculations: general laws plus initial conditions produce predictions, often only after time-consuming and computer-aided work. One can say that the input contains an implicit knowledge which is thereby made explicit.
I have a related idea, which I picked up from somewhere, that mirrors the above observation: when we see beautiful fractals generated by simple equations and iterative processes, we give importance only to the equations, not to the Cartesian grid on which that process operates.
> the most peculiar one is that by playing formal games with an input mathematical text, one can get an output text which seemingly carries new knowledge.
Or biologically, DNA/RNA behaves in a similar manner.
I've always understood it more to mean, "phenomena that happen due to the interactions of a system's parts without being explicitly encoded into their individual behavior." Fractal patterns in nature are a great example of emergent phenomena. A single water molecule contains no explicit plan for how to get together with its buddies and make spiky hexagon shapes when they get cold.
And I've always understood talking about emergence as if it were some sort of quasi-magical and unprecedented new feature of LLMs to mean, "I don't have a deep understanding of how machine learning works." Emergent behavior is the entire point of artificial neural networks, from the latest SOTA foundation model all the way back to the very first tiny little multilayer perceptron.
Emergence in the context of LLMs is really just us learning that "hey, you don't actually need intelligence to do <task>; turns out it can be done using a good enough next-token predictor." We're basically learning what intelligence isn't as we see some of the things these models can do.
I always understood this to be the initial framing, e.g. in the Language Models are Few Shot Learners paper but then it got flipped around.
Or maybe you need intelligence to be a good enough next token predictor. Maybe the thing that “just” predicts the next token can be called “intelligence”.
The challenge there would be showing that humans have this thing called intelligence. You yourself are just outputting ephemeral actions that rise out of your subconscious. We have no idea what that system feeding our output looks like (except it's some kind of organic neural net) and hence there isn't really a basis for discriminating what is and isn't intelligent besides "if it solves problems, it has some degree of intelligence"
If you want to understand how birds fly, the fact that planes also fly is near useless. While a few common aerodynamic principles apply, both types of flight are so different from each other that you do not learn very much about one from the other.
On the other hand, if your goal is just "humans moving through the air for extended distances", it doesn't matter at all that airplanes do not fly the way birds do.
And then, on the generated third hand, if you need the kind of tight quarters maneuverability that birds can do in forests and other tangled spaces, then the way our current airplanes fly is of little to no use at all, and you're going to need a very different sort of technology than the one used in current aircraft.
And on the accidentally generated fourth hand, if your goal is "moving very large mass over very long distance", then the mechanisms of bird flight are likely to be of little utility.
The fact that two different systems can be described in a similar way (e.g. "flying") doesn't by itself tell you that they are working in remotely the same way or capable of the same sorts of things.
> doesn't by itself tell you that they are working in remotely the same way or capable of the same sorts of things.
I believe any intelligence that reaches 'human level' should be capable of nearly the same things with tool use; the fact that it accomplishes the goal in a different way doesn't matter, because the system's behavior is generalized. Hence the term (artificial) general intelligence. Two different general intelligences built on different architectures should be able to converge on similar solutions (for example, solutions based on lowest energy states) because they are operating in the same physical realm.
An AGI and an HGI should be able to have convergent solutions for fast air travel, ornithopters, and drones.
There is no "human level" because we don't even understand what we mean by "human level". We don't know what metrics to use, we don't even know what to measure.
> Two different general intelligences built on different architectures should be able to converge on similar solutions (for example solutions based on lowest energy states) because they are operating in the same physical realm.
Lots of things connected to intelligence do not operate (much) in any physical realm.
Also, you've really missed the point of the analogy. It's not a question of whether AGI would pick the same solution for fast air travel as HGI. It is that there are at least two solutions to the challenge of moving things through the air in a controlled way, and they don't really work in the same way at all. Consequently, we should be ready for the possibility that there is more than one way to do the things LLMs (and, to some degree, humans) do with text/language, and that they may not be related to each other very much. This is a counter to the claim that "since LLMs get so close to human language behavior, it seems quite likely human language behavior arises from a system like an LLM".
A better bird analogy would be if we didn't understand at all how flight worked, and then started throwing rocks and had pseudo-intellectuals saying "how do we know that isn't all that flight is, we've clearly invented artificial flight".
Not quite. Complex systems can exhibit macroscopic properties not evident at microscopic scales. For example, birds self organize into flocks, an emergent phenomenon, visible to the untrained eye. Our understanding of how it happens does not change the fact that it does.
There is a field of study for this called statistical mechanics.
I understood it to mean properties of large-scale systems that are not properties of its components. Like in thermodynamics: zooming in to a molecular level, you can reverse time without anything seeming off. Suddenly you get a trillion molecules and things like entropy appear, and time is not reversible at all.
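As a toy illustration of that thermodynamic point (my own sketch, with an assumed particle count): count the microstates of N particles split between the two halves of a box. The micro-dynamics are reversible, but states near a 50/50 split so overwhelmingly outnumber "all on one side" that the macroscopic arrow of time falls out of pure counting:

```python
# Toy illustration: multiplicity (number of microstates) of N particles
# split between the left and right halves of a box.
from math import comb

N = 100                       # assumed number of particles
total = 2 ** N                # total number of microstates

all_left = comb(N, 0)         # every particle in the left half: 1 microstate
half_half = comb(N, N // 2)   # an exact 50/50 split

print(f"P(all on one side)    = {all_left / total:.3e}")
print(f"P(exact 50/50 split)  = {half_half / total:.3e}")

# Fraction of microstates within +/-10 particles of an even split:
near_half = sum(comb(N, k) for k in range(N // 2 - 10, N // 2 + 11))
print(f"P(within 10 of 50/50) = {near_half / total:.3f}")
```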
Not at all. Here is an analogy: A car is a system which brings you from point A to B. No part of the car can bring you from point A to B. Not the seats, the wheels, not the frame, not even the motor. If you put the motor on a table, it won’t move one bit. The car, as a system, however does. The emergent property of a car, seen as a system, is that it brings you from one location to another.
A system is the product of the interaction of its parts. It is not the sum of the behaviour of its parts. If a system does not exhibit some form of emergent behaviour, it is not a system, but something else. Maybe an assembly.
If putting together a bunch of X's in a jar always makes the jar go Y, then is Y an emergent property?
Or we need to better understand why a bunch of X's in a jar do that, and then the property isn't emergent anymore, but rather the natural outcome of well-understood X's in a well-understood jar.
Ah. Not semantics, that is cybernetics and systems theory.
As in your example: if a bunch of X's in a jar leads to the jar tipping over, it is not emergent. That's just cause and effect. The problem to start with is that the jar containing X's is not even a system in the first place; emergence as a concept is not applicable here.
There may be a misunderstanding on your side of the term emergence. Emergence does not equal non-understanding or some spooky-hooky force coming from the unknown. We understand the functions of the elements of a car quite well. The emergent behaviour of a car was intentionally brought about by massive engineering.
Reductionism does not lead to an explaining-away of emergence.
It's more specific than that. Most complex systems just produce noise. A few complex systems produce behavior that we perceive as simple. This is surprising, and gets the name "emergent".
It just means they haven't modeled the externalities. A plane on the ground isn't emergent. In the air it is, at least until you perfectly model weather, which you can't do, so its behavior is emergent. But I think a plane is also a good comparison because it shows that you can manage it; we don't have to perfectly model weather to still have fairly predictable air travel.
Sure but also see the benchmark creation to benchmark breaking race. The benchmark creation researchers have been doing their best to create difficult, lasting benchmarks that won't be broken by next week. They are not illiterate idiots. And yet the LLMs (or LLM-based systems really) have been consistently so far breaking the benchmarks in absurd time.
It's hard to say that nothing significant is going on.
Also the fine article's entire point is that these are not points on a continuum.
> The benchmark creation researchers have been doing their best to create difficult, lasting benchmarks that won't be broken by next week.
That is really easy: say, make it play Pokemon as well as a 10-year-old. That would take a very long time; I watched Gemini play Pokemon, and it's nowhere close to that even with all that help.
The hard part isn't making a benchmark that won't be broken; it's making a benchmark easy enough that LLMs can solve it but that is still hard for them. Essentially what this means is that we have run out of easy progress, and are now stumbling in the dark since we have no effective benchmarks to chase.
The authors haven’t demonstrated emergence in LLMs. If I write a piece of code and it does what I programmed it to do, that’s not emergence. LLMs aren’t doing anything unexpected yet. I think that’s the smell test, because emergence is still subjective.
There are eerie similarities in radiographs of LLM inference output and mammalian EEGs. I would be surprised not to see latent and surprisingly complicated characteristics become apparent as context and recursive algorithms grow larger.
I'm not a techie, so perhaps someone can help me understand this: AFAIK, no theoretical computer scientist predicted emergence in AI models. Doesn't that suggest that the field of theoretical computer science (or theoretical AI, if you will) is suspect? It's like Lord Kelvin saying that heavier-than-air flying machines are impossible a decade before the Wright brothers' first flight.
I’m not even clear on the AI definition of “emergent behavior”. The AI crowd mixes in terms and concepts from biology to describe things that are dramatically simpler. For example, using “neuron” to really mean a formula, calculation, or function. Neurons are a lot more than that, and not even completely understood to begin with; however, developers use the term as if they have neurons implemented in software.
Maybe it’s a variation of the “assume a frictionless spherical horse” problem but it’s very confusing.
I believe it's been predicted in traffic planning and highway design and tested via simulation and in field experiments. Use of self-driving cars to modify traffic behaviors and decrease traffic jams is a field of study these days.
> Doesn't that suggest that the field of theoretical computer science (or theoretical AI, if you will) is suspect?
Consider the story of Charles Darwin, who knew evolution existed, but who was so afraid of public criticism that he delayed publishing his findings so long that he nearly lost his priority to Wallace.
For contrast, consider the story of Alfred Wegener, who aggressively promoted his idea of (what was later called) plate tectonics, but who was roundly criticized for his radical idea. By the time plate tectonics was tested and proven, Wegener was long gone.
These examples suggest that, in science, it's not the claims you make, it's the claims you prove with evidence.
I remain skeptical of emergent properties in LLMs in the way that people have used that term. There was a belief 3-4 years ago that if you just make the models big enough, they magically acquire intelligence. But since then, we’ve seen that the models are actually still pretty limited by the training data: like other ML models, they interpolate well between the data they’ve been trained on, but they don’t generalize well beyond it. Also, we have seen models that are 50-100x smaller now exhibit the same “emergent” capabilities that were once thought to require hundreds of billions of parameters. I personally think the emergent properties really belong to the data instead.
Yes, deep learning models only interpolate, and essentially represent an effective way of storing data labeling effort. Doesn't mean they're not useful, just not what tech adjacent promoters want people to think.
> Yes, deep learning models only interpolate
What do you mean by this? I don’t think the understanding of LLMs is sufficient to make this claim
An LLM is a classifier; there is lots of research into how deep learning classifiers work, and I haven't seen it contradicted when applied to LLMs.
I still think it's unclear what you mean by “interpolate” in this context. If your NN takes in several numbers and assigns logits to each class based on those numbers, then if you consider the n-dimensional space of possible inputs, and if the new input is in the convex hull of the inputs that appear in training samples, then the meaning of “interpolate” is fairly clear.
But when the inputs are sequences of tokens…
Granted, each token gets embedded as some vector, and you can concatenate those vectors to represent the sequence of tokens as one big vector, but, are these vectors for novel strings in the convex hull of such vectors for the strings in the training set?
The answer is kind of right there in the start of your last sentence. From the transformer model's perspective, the input is just a time series of vectors. It ultimately isn't any different from any other time series of vectors.
Way back in the day when I was working with latent Dirichlet allocation models, I had a minor enlightenment moment when I realized that the models really weren't capturing any semantically meaningful relationships. They were only capturing meaningless statistical correlations to which I would then assign semantic value so effortlessly and automatically that I didn't even realize it was always me doing it, never the model.
I'm pretty sure LLMs exist on that same continuum. And if you travel down it in the other direction, you get to simple truisms such as "correlation does not equal causation."
The part about “is it in the convex hull?” was an important part of the question.
It seems to me that if it isn’t in the convex hull, it could be more fitting to describe it as extrapolation, rather than interpolation?
In general, my question does apply to the task of predicting how a time series of vectors continues: Given a dataset of time series, where the dimension of each vector in the series is such and such, the length of each series is yea long, and there are N series in the training set, should we expect series in the test set or validation set to be in the convex hull of the ones in the training set?
I would think that the number of series in the training set, N, while large, might not be all that large compared to the dimensionality of a whole series?
Hm, are there efficient techniques for evaluating whether a high dimensional vector is in the convex hull of a large number of other high dimensional vectors?
Just shooting from the hip, LLMs operate out on a frontier where the curse of dimensionality removes a large chunk of the practical value from the concept of a convex hull. Especially in a case like this where the vector embedding process places hard limits on the range of possible magnitudes and directions for any single vector.
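On the "efficient techniques" question above: convex-hull membership can be posed as a linear-programming feasibility problem, so you never have to construct the hull explicitly (which is hopeless in high dimensions). A rough sketch with scipy, as an illustration rather than a recipe:

```python
# Convex-hull membership as a linear-programming feasibility problem:
# x is in conv{v_1, ..., v_N} iff there exist lambda_i >= 0 with
# sum(lambda_i) = 1 and sum(lambda_i * v_i) = x.
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(x: np.ndarray, V: np.ndarray) -> bool:
    """V has shape (N, d), one candidate hull vertex per row; x has shape (d,)."""
    N, _ = V.shape
    A_eq = np.vstack([V.T, np.ones((1, N))])   # V^T @ lam = x  and  sum(lam) = 1
    b_eq = np.concatenate([x, [1.0]])
    res = linprog(np.zeros(N), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * N)
    return bool(res.success)

rng = np.random.default_rng(0)
V = rng.normal(size=(1000, 50))                 # 1000 "training" vectors in 50-d
print(in_convex_hull(V.mean(axis=0), V))        # centroid: inside  -> True
print(in_convex_hull(V.mean(axis=0) + 100, V))  # far away: outside -> False
```

As I understand the literature, unless N grows roughly exponentially with the dimension, a new high-dimensional point is almost never inside the hull of the training points, which is exactly the curse-of-dimensionality point made above.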
My hot take is that what some people are labeling as "emergent" is actually just "incidental encoding" or "implicit signal" -- latent properties that get embedded just by nature of what's being looked at.
For instance, if you have a massive tome of English text, a rather high percentage of it will be grammatically correct (or close), syntactically well-formed, and understandable, because humans who speak good English took the time to write it and wrote it how other humans would expect to read or hear it. This, by its very nature, embeds "English language" knowledge due to sequence, word choice, normally-hard-to-quantify expressions (colloquial or otherwise), etc.
When you consider source data from many modes, there's all kinds of implicit stuff that gets incidentally written. For instance, real photographs of outer space or the deep sea would only show humans in protective gear, not swimming next to the Titanic. Conversely, you won't see polar bears eating at Chipotle, or giant humans standing on top of mountains.
There's a statistical probability of "this showed up enough in the training data to loosely confirm its existence" / "can't say I ever saw that, so let's just synthesize it" aspect of the embeddings that one person could interpret as "emergent intelligence", while another could just-as-convincingly say it's probabilistic output that is mostly in line with what we expect to receive. Train the LLM on absolute nonsense instead and you'll receive exactly that back.
Emergent, as I have known and used the term before, is when more complex behavior emerges from simple rules.
My go-to example for this was the Game of Life, where from very simple rules a very organically behaving (Turing-complete) system emerges. Now, the Game of Life is a deterministic system, meaning that the same rules and the same start configuration will play out in exactly the same way each time — but given the simplicity of the logic and the rules, the resulting complexity is what I'd call emergent.
So maybe this is more about the definition of what we'd call emergent and what not.
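For reference, the "very simple rules" fit comfortably in a few lines; everything organic about the Game of Life emerges from this one update step (a minimal sketch):

```python
# One Game of Life update step on a set of live cells (conceptually infinite grid).
# Rules: a live cell survives with 2 or 3 live neighbours; a dead cell
# becomes alive with exactly 3 live neighbours.
from collections import Counter

def step(live: set[tuple[int, int]]) -> set[tuple[int, int]]:
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {
        cell
        for cell, n in neighbour_counts.items()
        if n == 3 or (n == 2 and cell in live)
    }

# A glider: after 4 steps the pattern has translated itself across the grid.
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
for _ in range(4):
    glider = step(glider)
print(sorted(glider))  # same shape, shifted by (1, 1)
```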
As someone who has programmed Markov chains, where the stochastic interpolation really shines through, transformer-based LLMs definitely show some emergent behavior one wouldn't have immediately suspected just from the rules. Emergent does not mean "conscious" or "self-reflective" or anything like that. But the things an LLM can infer from its training data are already quite impressive.
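For contrast with the transformer case, this is roughly the kind of word-level Markov chain the comment above alludes to, where the stochastic interpolation is plainly visible: every generated bigram must already exist verbatim in the training text (a toy sketch on a made-up corpus):

```python
# Toy first-order word-level Markov chain: it can only ever emit word pairs
# it has literally seen in the training text.
import random
from collections import defaultdict

text = "the cat sat on the mat and the dog sat on the rug".split()

transitions = defaultdict(list)
for prev, nxt in zip(text, text[1:]):
    transitions[prev].append(nxt)

random.seed(0)
word = "the"
out = [word]
for _ in range(8):
    successors = transitions.get(word)
    if not successors:          # dead end: a word with no observed successor
        break
    word = random.choice(successors)
    out.append(word)
print(" ".join(out))            # a recombination of observed bigrams, nothing more
```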
Interesting. Is there a quantitative threshold to emergence anyone could point at with these smaller models? Tracing the thoughts of a large language model is probably the only way to be sure, or is it?
Disregarding the downvotes, I mean this as a serious question.
From the linked article: “We don’t know an “algorithm” for this, and we can’t even begin to guess the required parameter budget or the training data needed.”
Why not, at least the external ones? The computational resources and the size of the training dataset are quantifiable from an input point of view. What gets used is not, but the input size should be.
This seems superficial and doesn't really get to the heart of the question. To me it's not so much about bits and parameters but a more interesting fundamental question of whether pure language itself is enough to encompass and encode higher level thinking.
Empirically we observe that an LLM trained purely to predict a next token can do things like solve complex logic puzzles that it has never seen before. Skeptics claim that actually the network has seen at least analogous puzzles before and all it is doing is translating between them. However the novelty of what can be solved is very surprising.
Intuitively it makes sense that, at some level, intelligence itself becomes a compression algorithm. For example, you could learn separately how to solve every puzzle ever presented to mankind, but that would take a lot of space. At some point it's more efficient to just learn "intelligence" itself and then apply that to the problem of predicting the next token. Once you do that, you can stop trying to store an infinite database of parallel heuristics and just focus the parameter space on learning "common heuristics" that apply broadly across the problem space, and then apply those to every problem.
The question is: at what parameter count and volume of training data does the situation flip to favoring "learning intelligence" rather than storing redundant domain-specialised heuristics? And is it really happening? I would have thought just looking at the activation patterns could tell you a lot, because if common activations happen for entirely different problem spaces then you can argue that the network has to be learning common abstractions. If not, maybe it's just doing really large-scale redundant storage of heuristics.
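One crude way to poke at the "common activations across problem spaces" idea is to compare hidden states for prompts from unrelated domains. A rough sketch, assuming the Hugging Face transformers library and a small GPT-2 checkpoint; this illustrates the idea only, it is not a rigorous interpretability method:

```python
# Compare mean-pooled hidden states of a small LM on prompts from two
# unrelated domains, as a crude probe for shared internal representations.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

def pooled_hidden(prompt: str, layer: int = 6) -> torch.Tensor:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[layer]: (1, seq_len, hidden_dim) -> mean over tokens
    return out.hidden_states[layer].mean(dim=1).squeeze(0)

a = pooled_hidden("If all bloops are razzies and all razzies are luppies, then all bloops are luppies.")
b = pooled_hidden("def sort(xs): return xs if len(xs) < 2 else merge(sort(left), sort(right))")
print(torch.cosine_similarity(a, b, dim=0).item())
```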
> However the novelty of what can be solved is very surprising.
I've read that the 'surprise' factor is much reduced when you actually see just how much data these things are trained on - far more than a human mind can possibly hold and (almost) endlessly varied. I.e. there is 'probably' something in the training set close to what 'surprised' you.
Good take, but while we're invoking intuition, something is clearly missing in the fundamental design, given that real brains don't need to consume all the world's literature before demonstrating intelligence. There's some missing piece w.r.t. self-learning and sense-making. The path to emergent reasoning you lay out is interesting and might happen anyway as we scale up, but the original idea was to model these algorithms in our own image in the first place - I wonder if we won't discover that missing piece first.
What seems a bit miraculous to me is, how did the researchers who put us on this path come to suspect that you could just throw more data and more parameters at the problem? If the emergent behavior doesn't appear for moderate sized models, how do you convince management to let you build a huge model?
This is perhaps why it took us this long to get to LLMs: the underlying math and ideas were (mostly) there, and even if the Transformer as an architecture wasn't ready yet, it wouldn't surprise me if throwing sufficient data/compute at a worse architecture would also produce comparable emergent behavior.
There needed to be someone willing to try going big at an organization with sufficient idle compute/data just sitting there; it's not a surprise it first happened at Google.
But we got here step by step, as other interesting use cases came up using somewhat less compute: image recognition, early forms of image generation, AlphaGo, AlphaZero for chess. All earlier forms of deep neural networks that are much more reasonable to train than a top-of-the-line LLM today, but seemed expensive at the time. And ultimately a lot of this also comes from the hardware advancements and the math advancements. If you took classes on neural networks in the 1990s, you'd notice that they mostly talked about 1 or 2 hidden layers, with not all that much focus on the math needed to train large networks, precisely because of how daunting the compute costs were for anything that wasn't a toy. But then came video card hardware, and improvements in using it for gradient descent, making it somewhat reasonable to go past silly 3-layer networks.
Every bet makes perfect sense after you consider how promising the previous one looked, and how much cheaper the compute was getting. Imagine being tasked with training an LLM in 1995: all the architectural knowledge we have today plus a state-level mandate would not have gotten all that far. Just the amount of fast memory that we now bring to bear wouldn't have been viable until relatively recently.
> and how much cheaper the compute was getting.
I remember back in the 90s how scientists/data analysts were saying that we'd need exaflop-scale systems to tackle certain problems. I remember thinking how foreign that number was when small systems were running maybe tens of megaFLOPS. Now we have systems approaching zettaFLOPS (FP8, so not an exact comparison).
You might appreciate this article: https://www.quantamagazine.org/when-chatgpt-broke-an-entire-...
They didn't. Not LLM people specifically. Google a long time ago figured out that you get far better results on a very wide range of problems just by going bigger. (Which then must have become frustrating for some people because most of the effort seems to have gone to scaling? See for example as-opposed-to-symbolic.)
While GPT-2 didn't show emergent abilities, it did show improved accuracy on various tasks with respect to GPT-1. At that point, it was clear that scaling made sense.
In other words, no one expected GPT-3 to suddenly start solving tasks without training as it did, but it was expected to be useful as an incremental improvement to what GPT-2 did. At the time, GPT-2 was seeing practical use, mainly in text generation from some initial words - at that point the big scare was about massive generation of fake news - and also as a model that one could fine-tune for specific tasks. It made sense to train a larger model that would do all that better. The rest is history.
I don't think model sizes increased suddenly. There might not have been emergent properties for certain tasks at smaller scales, but there was certainly improvement, at a slower rate. Competition to improve those metrics, albeit at a slower pace, led to a gradual increase in model sizes and, by chance, to emergence the way it's defined in the paper?
There’s that Sinclair quote:
"It is difficult to get a man to understand something when his salary depends upon his not understanding it."
The reasoning in the article is interesting, but this struck me as a weird example to choose:
> “The real question is how can we predict when a new LLM will achieve some new capability X. For example, X = “Write a short story that resonates with the social mood of the present time and is a runaway hit”
Framing a capability as something that is objectively measurable (“able to perform math on the 12th grade level”, “able to write a coherent, novel text without spelling/grammar mistakes”) makes sense within the context of what the author is trying to demonstrate.
But the social proof aspect (“is a runaway hit”) feels orthogonal to it? Things can be runaway hits for social factors independently of the capability they actually represent.
It’s not about being “a runaway hit” as an objective measurement; it’s about the things an LLM would need to achieve before that was possible. At first, AI scores on existing tests seemed like a useful metric. However, tests designed for humans make specific assumptions that don’t apply to these systems, making such tests useless.
AI is very good at gaming metrics so it’s difficult to list some criteria where achieving it is meaningful. A hypothetical coherent novel without spelling/grammar mistakes could in effect be a copy of some existing work that shows up in its corpus, however a hit requires more than a reskinned story.
> however a hit requires more than a reskinned story.
Demonstrably false: a lot of hits in the past are reskinned versions of existing stories!
While not a technical term of art, copyright applies to a reskinned story. “the series is not available in English translation, because of the first book having been judged a breach of copyright.” https://en.wikipedia.org/wiki/Tanya_Grotter
There’s plenty of room to take inspiration and go in another direction aka Pride and Prejudice and Zombies.
That it seems hard (impossible), or at least not intuitively clear how to go about it, to us humans is what makes the question interesting. In a way. The other questions are interesting, but a different class of interesting. At any rate, both are good for this question. Either way this becomes "what would we need to estimate this emergence threshold?"
I often find that people using the word emergent to describe properties of a system tend to ascribe quasi magical properties to the system. Things tend to get vague and hand wavy when that term comes up.
Just call them properties with unknown provenance.
> Just call them properties with unknown provenance.
They would if that were the correct designation; however, it is not.
Emergence does not equal non-understanding or some spooky-hooky force coming from the unknown. Reductionism does not lead to an explaining-away of emergence.
I always wondered if the specific dimensionality of the layers and tensors has a specific effect on the model.
It's hard to explain, but higher-dimensional spaces have weird topological properties; not all of them behave the same way, and some things are perfectly doable in one set of dimensions while in others they just plain don't work (e.g. applying surgery to turn one shape into another).
How is topology specifically related to emergent capabilities in AI?
The bag of heuristics thing is interesting to me, is it not conceivable that a NN of a certain size trained only on math problems would be able to wire up what amounts to a calculator? And if so, could that form part of a wider network, or is I/O from completely different modalities not really possible in this way?
I didn't follow entirely on a fast read, but this confused me especially:
> The parameter count of an LLM defines a certain bit budget. This bit budget must be spread across many, many tasks

I'm pretty sure that LLMs, like all big neural networks, are massively under-specified, as in there are way more parameters than data points to fit (understanding that the training data set is bigger than the size of the model, but the point is that the same loss can be achieved with many different combinations of parameters). And I think of this under-specification as the reason neural networks extrapolate cleanly and thus generalize.
This doesn't seem right and most people recognize that 'neurons' encode for multiple activations. https://transformer-circuits.pub/2022/toy_model/index.html
They’re 1000% right on the idea that most models are hilariously undertrained
Pretty sure that since the Chinchilla paper this probably isn't the case. https://arxiv.org/pdf/2203.15556
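The linked Chinchilla paper (Hoffmann et al., 2022) is often summarized as a rule of thumb of roughly 20 training tokens per parameter for compute-optimal training, which makes the over/under-trained question easy to eyeball. A back-of-the-envelope sketch, using approximate public figures and treating the 20:1 ratio as a rough approximation of the paper's fit:

```python
# Back-of-the-envelope check against the Chinchilla rule of thumb
# (~20 training tokens per parameter for compute-optimal training).
TOKENS_PER_PARAM = 20  # approximate ratio from Hoffmann et al., 2022

models = {
    # name: (parameters, training tokens) -- approximate public figures
    "GPT-3 175B": (175e9, 300e9),
    "Chinchilla 70B": (70e9, 1.4e12),
}

for name, (params, tokens) in models.items():
    optimal = params * TOKENS_PER_PARAM
    print(f"{name}: trained on {tokens / optimal:.2f}x the compute-optimal token count")
```

On those rough numbers GPT-3 sits near 0.09x of the Chinchilla-optimal token count, which is what the "hilariously undertrained" remark is getting at.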
> the same loss can be achieved with many different combinations of parameters
Perhaps it can be, but it isn't. The loss in a given model was achieved with that particular combination of parameters that the model has, and there exists no other combination of parameters to which that model can appeal for more information.
To have the other combinations, we would need to train more models and then somehow combine them so that the combinations are available; but that's just conceptually the same as making one larger model with more parameters, and lower loss.
Since gradient descent converges on a local minimum, would we expect different emergent properties with different initializations of the weights?
Not significantly, as I understand it. There's certainly variation in LLM abilities with different initializations but the volume and content of the data is a far bigger determinant of what an LLM will learn.
So there is an "attractor" that different initializations end up converging on?
Different initializations converge to different places, e.g. https://arxiv.org/abs/1912.02757
For LLMs (as with other models), many local optima appear to support roughly the same behavior. This is the idea of the problem being under-specified, i.e. many more unknowns than equations, so there are many ways to get the same result.
You end up with different weights when using different random initialization, but with modern techniques the behavior of the resulting model is not really distinct. Back in the image-recognition days it was like +/- 0.5% accuracy. If you imagine you're descending in a billion-parameter space, you will always have a negative gradient to follow in some dimension: local minima frequency goes down rapidly with (independent) dimension count.
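The "+/- 0.5%" flavour of claim is easy to check at toy scale: train the same architecture from several random initializations and compare test accuracy. The weights come out different, the behaviour barely does (a sketch with scikit-learn, obviously nowhere near LLM scale):

```python
# Train the same small network from different random initializations and
# compare test accuracy: the weights differ, the behaviour barely does.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

for seed in range(5):
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=seed)
    clf.fit(X_train, y_train)
    print(f"seed={seed}  test accuracy={clf.score(X_test, y_test):.3f}")
```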
How could they not?
Emergent properties are unavoidable for any complex system and probably exponentially scale with complexity or something (I'm sure there's an entire literature about this somewhere).
One good instance are spandrels in evolutionary biology. The wikipedia article is a good explanation of the subject: https://en.m.wikipedia.org/wiki/Spandrel_(biology)
This is also my impression "how could they not" but it goes a bit further: Can we predict it? Can we estimate the size of a system that will achieve X? Can we build systems that are pre-disposed to emergent behaviors? Can we build systems that preclude some class of emergent behavior (relevant to AGI safety perhaps)? And then of course many systems will not achieve anything because even when "large", they are "uselessly large" - as in, you can define more points on a line and it's still a dumb line.
To me the "how could they not" comes from the idea that if LLMs somehow encapsulate/ contain/ exploit all human writings, then they most likely cover a large part of human intelligence. For damn sure much more than the "basic human". The question is more of how we can get this behavior back out of it - than whether it's there.
Alternate view: Are Emergent Abilities of Large Language Models a Mirage? https://arxiv.org/abs/2304.15004
"Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous predictable changes in model performance."
A decent thought-proxy for this: powered flight.
An aircraft can approach powered flight without achieving it. With a given amount of thrust and given aerodynamic characteristics, the aircraft's effective weight is dynamic_weight = (static_weight - x), where x is the combined contribution of the aerodynamic characteristics and the amount of thrust applied.
In no case where dynamic_weight > 0 will the aircraft fly, even though it exhibits characteristics of flight, i.e. the transfer of aerodynamic forces to counteract gravity.
So while it progressively exhibits characteristics of flight, it is not capable of any kind of flight at all until the critical point where dynamic_weight < 0. The enabling characteristics are not “emergent”, but the behavior is.
I think this boils down to a matter of semantics.
“Thought-proxy”?
I think the word you’re looking for is “analogy”.
Analogy is a great word proxy for thought proxy.
Yeah, I’m pretty sure analogy would have been fine there, I think maybe it fell off the edge of my vocabulary for a moment? Not really sure, but I really can’t think of any reason why “thought proxy” would have been more descriptive, informative, or accurate ¯\_(ツ)_/¯
Yes, this paper is under-appreciated. The point is that we as humans decide what constitutes a given task we're going to set as a bar, and it turns out that statistical pattern matching can solve many of those tasks to a reasonable level (we also get to define "reasonable") when there's sufficient scale of parameters and data; but that tip-over point is entirely arbitrary.
The continuous metrics the paper uses are largely irrelevant in practice, though. The sudden changes appear when you use metrics people actually care about.
To me the paper is overhyped. Knowing how neural networks work, it's clear that there are going to be underlying properties that vary smoothly. This doesn't preclude the existence of emergent abilities.
The author himself explicitly acknowledges the paper but then incomprehensibly ignores it ("Even so, many would like to understand, predict, and even facilitate the emergence of these capabilities."). It's like saying "some say [foo] doesn't exist, but even so many would like to understand [foo]". It's incoherent.
No point in letting facts get in the way of an entire article I guess.
That has been a problem with most LLM benchmarks. Any test that's scored in percentages tends to behave logarithmically: getting from, say, 90% to 95% is not a linear 5% improvement but more like a 2x (or, further up the scale, 10x) improvement in practical terms, because the metric is already nearly maxed out and only the extreme edge cases remain, which are much harder to master.
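Quick arithmetic on that, nothing benchmark-specific:

    # Moving a few points of accuracy near the top of the scale corresponds
    # to a large reduction in the remaining error rate.
    for before, after in [(0.90, 0.95), (0.95, 0.99), (0.99, 0.999)]:
        shrink = (1 - before) / (1 - after)
        print(f"{before:.1%} -> {after:.1%}: error rate shrinks {shrink:.0f}x")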
Metaphor: finding a path from an initial point to a destination in a graph. As the number of parameters increases, one can expect the LLM to remember how to go from one place to another, and in the end it should be able to find a long path. This can be an emergent property, since with fewer parameters the LLM would not be able to find the correct path. Now one has to find what kind of problems this metaphor is a good model of.
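One way to play with that metaphor (a made-up toy, not a claim about how LLMs actually store routes): let "capacity" be the fraction of edges the model manages to remember, and count how often a complete route from node 0 to node 199 survives.

    import random
    from collections import deque

    def random_graph(n, m, seed=0):
        # n nodes, m distinct random edges
        rng = random.Random(seed)
        edges = set()
        while len(edges) < m:
            a, b = rng.randrange(n), rng.randrange(n)
            if a != b:
                edges.add((min(a, b), max(a, b)))
        return list(edges)

    def path_exists(edges, src, dst, n):
        # plain BFS over the remembered edges only
        adj = [[] for _ in range(n)]
        for a, b in edges:
            adj[a].append(b)
            adj[b].append(a)
        seen, queue = {src}, deque([src])
        while queue:
            u = queue.popleft()
            if u == dst:
                return True
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        return False

    n, edges = 200, random_graph(200, 500)
    rng = random.Random(1)
    for recall in (0.2, 0.4, 0.6, 0.8, 1.0):   # stand-in for "more parameters"
        wins = sum(path_exists([e for e in edges if rng.random() < recall],
                               0, n - 1, n)
                   for _ in range(200))
        print(f"edge recall {recall:.1f}:  route found in {wins / 2:.0f}% of trials")

The success rate rises sharply around a percolation-style threshold: each individual edge is remembered gradually, but "can complete the whole route" looks like it switches on.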
What do you think about this analogy?
A simple process produces the Mandelbrot set. A simple process (loss minimization through gradient descent) produces LLMs. So what plays the role of the 2D plane, or dense grid of points, in the case of LLMs? It is the embeddings (or ordered combinations of embeddings) which are generated after pre-training. In the case of a 2D plane, the closeness between two points is determined by our numerical representation scheme. But in the case of embeddings, we learn the "grid" of words (playing the role of points) by looking at how the words are used in the corpus.
The following is a quote from Yuri Manin, an eminent mathematician.
https://www.youtube.com/watch?v=BNzZt0QHj9U
Of the properties of mathematics, as a language, the most peculiar one is that by playing formal games with an input mathematical text, one can get an output text which seemingly carries new knowledge. The basic examples are furnished by scientific or technological calculations: general laws plus initial conditions produce predictions, often only after time-consuming and computer-aided work. One can say that the input contains an implicit knowledge which is thereby made explicit.
I have a related idea, picked up from somewhere, which mirrors the above observation.
When we see beautiful fractals generated by simple equations and iterative processes, we give importance to only the equations, not to the cartesian grid on which that process operates.
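For what it's worth, the "simple rule applied over a dense grid" half of the analogy fits in a few lines (a standard Mandelbrot membership test; nothing LLM-specific is claimed here):

    # Iterate z -> z*z + c for each grid point c and print which points stay
    # bounded.  A trivially simple rule, applied over a grid, yields the
    # familiar intricate boundary.
    def in_mandelbrot(c, max_iter=50):
        z = 0j
        for _ in range(max_iter):
            z = z * z + c
            if abs(z) > 2:
                return False
        return True

    for im in range(10, -11, -2):
        print("".join("#" if in_mandelbrot(complex(re / 30, im / 15)) else "."
                      for re in range(-60, 21, 2)))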
> the most peculiar one is that by playing formal games with an input mathematical text, one can get an output text which seemingly carries new knowledge.
Or biologically, DNA/RNA behaves in a similar manner.
*Do
It feels like this can be tracked with addition. Humans expect “can do addition” to be a binary skill, because humans either can or cannot add.
LLMs approximate addition. For a long time they would produce hot garbage. Then after a lot of training, they could sum 2 digit numbers correctly.
At this point we’d say “they can do addition”, and the property has emerged. They have passed a binary skill threshold.
Or you could cobble together a small electronic circuit or a mechanical apparatus and have something that can add numbers.
Sure, but then what you're doing would be irrelevant to this discussion.
Isn't "emergent properties" another way to say "we're not very good at understanding the capabilities of complex systems"?
I've always understood it more to mean, "phenomena that happen due to the interactions of a system's parts without being explicitly encoded into their individual behavior." Fractal patterns in nature are a great example of emergent phenomena. A single water molecule contains no explicit plan for how to get together with its buddies and make spiky hexagon shapes when they get cold.
And I've always understood talking about emergence as if it were some sort of quasi-magical and unprecedented new feature of LLMs to mean, "I don't have a deep understanding of how machine learning works." Emergent behavior is the entire point of artificial neural networks, from the latest SOTA foundation model all the way back to the very first tiny little multilayer perceptron.
Emergence in the context of LLMs is really just us learning that "hey, you don't actually need intelligence to do <task>, turns out it can be done using a good enough next token predictor." We're basically learning what intelligence isn't as we see some of the things these models can do.
I always understood this to be the initial framing, e.g. in the "Language Models are Few-Shot Learners" paper, but then it got flipped around.
Or maybe you need intelligence to be a good enough next token predictor. Maybe the thing that “just” predicts the next token can be called “intelligence”.
Maybe?
Mostly I just think that "Intelligence" and "AI" go together like "life, the universe and everything" and "42".
The challenge there would be showing that humans have this thing called intelligence. You yourself are just outputting ephemeral actions that arise out of your subconscious. We have no idea what that system feeding our output looks like (except that it's some kind of organic neural net), and hence there isn't really a basis for discriminating what is and isn't intelligent besides "if it solves problems, it has some degree of intelligence".
To return to an old but still good analogy ...
If you want to understand how birds fly, the fact that planes also fly is near useless. While a few common aerodynamic principles apply, both types of flight are so different from each other that you do not learn very much about one from the other.
On the other hand, if your goal is just "humans moving through the air for extended distances", it doesn't matter at all that airplanes do not fly the way birds do.
And then, on the generated third hand, if you need the kind of tight quarters maneuverability that birds can do in forests and other tangled spaces, then the way our current airplanes fly is of little to no use at all, and you're going to need a very different sort of technology than the one used in current aircraft.
And on the accidentally generated fourth hand, if your goal is "moving very large mass over very long distance", then the mechanisms of bird flight are likely to be of little utility.
The fact that two different systems can be described in a similar way (e.g. "flying") doesn't by itself tell you that they are working in remotely the same way or capable of the same sorts of things.
> doesn't by itself tell you that they are working in remotely the same way or capable of the same sorts of things.
I believe any intelligence that reaches 'human level' should be capable of nearly the same things with tool use; the fact that it accomplishes the goal in a different way doesn't matter, because the system's behavior is generalized. Hence the term (artificial) general intelligence. Two different general intelligences built on different architectures should be able to converge on similar solutions (for example, solutions based on lowest energy states) because they are operating in the same physical realm.
An AGI and an HGI should be able to have convergent solutions for fast air travel, ornithopters, and drones.
There is no "human level" because we don't even understand what we mean by "human level". We don't know what metrics to use, we don't even know what to measure.
> Two different general intelligences built on different architectures should be able to converge on similar solutions (for example solutions based on lowest energy states) because they are operating in the same physical realm.
Lots of things connected to intelligence do not operate (much) in any physical realm.
Also, you've really missed the point of the analogy. It's not a question of whether AGI would pick the same solution for fast air travel as HGI. It is that there are at least two solutions to the challenge of moving things through the air in a controlled way, and they don't really work in the same way at all. Consequently, we should be ready for the possibility that there is more than one way to do the things LLMs (and, to some degree, humans) do with text/language, and that they may not be related to each other very much. This is a counter to the claim that "since LLMs get so close to human language behavior, it seems quite likely human language behavior arises from a system like an LLM".
I think that many birds get too sensitive when discussing what "flight" means, heh
A better bird analogy would be if we didn't understand at all how flight worked, and then started throwing rocks and had pseudo-intellectuals saying "how do we know that isn't all that flight is, we've clearly invented artificial flight".
> "we've clearly invented artificial flight"
Scaling laws show that the harder we throw a rock the further it flies; we just have to throw them hard enough and we'll have invented flying rocks!
And for the naysayers out there, lemme throw this rock at your head and then tell me it isn't real!
If your goal is to get to stable orbit, even never having learned to fly, then the brute force approach works too
If we use some metric as a proxy for intelligence, does emergence simply mean a non-linear, sudden change in that metric?
Or more generally "fitting a model to data".
Not quite. Complex systems can exhibit macroscopic properties not evident at microscopic scales. For example, birds self organize into flocks, an emergent phenomenon, visible to the untrained eye. Our understanding of how it happens does not change the fact that it does.
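The flocking example can be reproduced from purely local rules; here is a minimal Vicsek-style sketch (a standard statistical-physics toy model with parameters I picked arbitrarily, not a claim about real birds):

    import numpy as np

    # Each "bird" only copies the average heading of nearby birds, plus noise.
    # No bird knows about "the flock", yet a global common direction emerges.
    rng = np.random.default_rng(0)
    n, box, radius, noise, speed = 300, 10.0, 1.0, 0.3, 0.1

    pos = rng.uniform(0, box, (n, 2))
    theta = rng.uniform(-np.pi, np.pi, n)

    for step in range(201):
        diff = pos[:, None, :] - pos[None, :, :]
        diff -= box * np.round(diff / box)             # periodic boundaries
        near = (diff ** 2).sum(-1) < radius ** 2       # who counts as a neighbour
        mean_sin = (near * np.sin(theta)).sum(axis=1)
        mean_cos = (near * np.cos(theta)).sum(axis=1)
        theta = np.arctan2(mean_sin, mean_cos) + noise * rng.uniform(-np.pi, np.pi, n)
        pos = (pos + speed * np.stack([np.cos(theta), np.sin(theta)], axis=1)) % box
        if step % 50 == 0:
            alignment = np.hypot(np.cos(theta).mean(), np.sin(theta).mean())
            print(f"step {step:3d}  alignment {alignment:.2f}")

The alignment score is just the length of the average heading vector; it starts near 0 for random headings and climbs toward 1 as the flock forms, even though no individual rule mentions a flock.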
There is a field of study for this called statistical mechanics.
https://ganguli-gang.stanford.edu/pdf/20.StatMechDeep.pdf
Very interesting crossover!
See also: stigmergy
I understood it to mean properties of large-scale systems that are not properties of its components. Like in thermodynamics: zooming in to a molecular level, you can reverse time without anything seeming off. Suddenly you get a trillion molecules and things like entropy appear, and time is not reversible at all.
Yes, it’s a cop-out and smells mostly of dualism: https://plato.stanford.edu/entries/properties-emergent/
Not at all. Here is an analogy: A car is a system which brings you from point A to B. No part of the car can bring you from point A to B. Not the seats, the wheels, not the frame, not even the motor. If you put the motor on a table, it won’t move one bit. The car, as a system, however does. The emergent property of a car, seen as a system, is that it brings you from one location to another.
A system is the product of the interaction of its parts. It is not the sum of the behaviour of its parts. If a system does not exhibit some form of emergent behaviour, it is not a system, but something else. Maybe an assembly.
That sounds like semantics.
If putting together a bunch of X's in a jar always makes the jar go Y, then is Y an emergent property?
Or we need to better understand why a bunch of X's in a jar do that, and then the property isn't emergent anymore, but rather the natural outcome of well-understood X's in a well-understood jar.
Ah. Not semantics, that is cybernetics and systems theory.
As in your example: if a bunch of X's in a jar leads to the jar tipping over, it is not emergent. That’s just cause and effect. The problem to start with is that the jar containing X's is not even a system in the first place; emergence as a concept is not applicable here.
There may be a misunderstanding on your side of the term emergence. Emergence does not equal non-understanding or some spooky-hooky force coming from the unknown. We understand the functions of the elements of a car quite well. The emergent behaviour of a car was intentionally brought about by massive engineering.
Reductionism does not lead to an explaining-away of emergence.
https://en.wikipedia.org/wiki/Emile_Leray
haha cool!
He turned the car into a motorcycle.
here's an article with a photo for anyone who's interested: https://archive.is/y96xb
It's more specific than that. Most complex systems just produce noise. A few complex systems produce behavior that we perceive as simple. This is surprising, and gets the name "emergent".
It just means they haven't modeled the externalities. A plane on the ground isn't emergent. In the air it is, at least until you perfectly model weather, which you can't do, so its behavior is emergent. But I think a plane is also a good comparison because it shows that you can manage it; we don't have to perfectly model weather to still have fairly predictable air travel.
Perhaps we should ask: Why do humans pick arbitrary points on a continuum beyond which things are labeled “emergent”?
Sure, but also see the benchmark-creation-to-benchmark-breaking race. The benchmark creators have been doing their best to create difficult, lasting benchmarks that won't be broken by next week. They are not illiterate idiots. And yet LLMs (or LLM-based systems, really) have so far been consistently breaking the benchmarks in absurdly short time.
It's hard to say that nothing significant is going on.
Also the fine article's entire point is that these are not points on a continuum.
> The benchmark creation researchers have been doing their best to create difficult, lasting benchmarks that won't be broken by next week.
That is really easy: say, make it play Pokémon as well as a 10-year-old. That would take a very long time to break; I watched Gemini play Pokémon and it's nowhere close to that, even with all that help.
The hard part isn't making a benchmark that won't be broken; it's making a benchmark that is easy enough for current LLMs to make progress on while still being hard for them to solve. Essentially what this means is that we have run out of easy progress, and are now stumbling in the dark, since we have no effective benchmarks to chase.
The authors haven’t demonstrated emergence in LLMs. If I write a piece of code and it does what I programmed it to do, that’s not emergence. LLMs aren’t doing anything unexpected yet. I think that’s the smell test, because emergence is still subjective.
They weren't trying to demonstrate it. They were explaining why it might not be surprising.
Are you writing the neural networks for LLMs?
There are eerie similarities between radiographs of LLM inference output and mammalian EEGs. I would be surprised not to see latent and surprisingly complicated characteristics become apparent as contexts and recursive algorithms grow larger.
What graphs are you talking about? I've never heard of LLM radiographs, and my searches are coming up empty.
I'm not a techie, so perhaps someone can help me understand this: AFAIK, no theoretical computer scientist predicted emergence in AI models. Doesn't that suggest that the field of theoretical computer science (or theoretical AI, if you will) is suspect? It's like Lord Kelvin saying that heavier-than-air flying machines are impossible a decade before the Wright brothers' first flight.
I’m not even clear on the AI definition of “emergent behavior”. The AI crowd mixes in terms and concepts from biology to describe things that are dramatically simpler. For example, using “neuron” to really mean a formula, calculation, or function. Neurons are a lot more than that, and not even completely understood to begin with, yet developers use the term as if they have neurons implemented in software.
Maybe it’s a variation of the “assume a frictionless spherical horse” problem but it’s very confusing.
https://hai.stanford.edu/news/ais-ostensible-emergent-abilit...
Has emergent behavior in other theoretical fields ever been predicted prior to being observed?
I believe it's been predicted in traffic planning and highway design, and tested via simulation and in field experiments. The use of self-driving cars to modify traffic behavior and decrease traffic jams is a field of study these days.
emergent behavior is common in all large systems.
it doesn't seem that surprising to me.
> Doesn't that suggest that the field of theoretical computer science (or theoretical AI, if you will) is suspect?
Consider the story of Charles Darwin, who knew evolution existed, but who was so afraid of public criticism that he delayed publishing his findings so long that he nearly lost his priority to Wallace.
For contrast, consider the story of Alfred Wegener, who aggressively promoted his idea of (what was later called) plate tectonics, but who was roundly criticized for his radical idea. By the time plate tectonics was tested and proven, Wegener was long gone.
These examples suggest that, in science, it's not the claims you make, it's the claims you prove with evidence.