brookst 4 hours ago

Macrumors is such a trash site. This article is Macrumors adding approximately nothing to this TechCrunch article: https://techcrunch.com/2024/09/30/meta-wont-say-whether-it-t...

Note that the tiny bit Macrumors added is converting TechCrunch’s accurate “Meta declined to say” into a claim of probability.

TechCrunch has updated the story [1] with concrete answers: Meta trains when the user asks for recognition of an image, but not passively in the background.

1. https://techcrunch.com/2024/10/02/meta-confirms-it-may-train...

  • cscurmudgeon 2 hours ago

    The sad thing is the high-brow tech crowd on HN accepting this with minimal questioning.

simonw 4 hours ago

> TechCrunch doesn't come out and say it, but if the answer is not a clear and definitive "no," it's likely that Meta does indeed plan to use images captured by the Meta Glasses to train Meta AI. If that wasn't the case, it doesn't seem like there would be a reason for Meta to be ambiguous about answering, especially with all of the public commentary on the methods and data that companies use for training.

I have a slightly different interpretation of this: I think Meta want to keep their options open.

I think that’s true of many of these “will they train on your data?” stories.

People tend to over-estimate the value of their data for training. AI labs are constantly looking for new sources of high quality data - but quality really matters to them. Random junk people feed into the models is right at the bottom of that quality list.

But what happens if Meta say “we will never train on this data”… and then next week a researcher comes up with some new training technique that makes that data 10x more valuable than when they made that decision not to use it?

Safer for them to not make concrete promises that they can’t back out of later if it turns out the data was more valuable than they initially expected.

  • visarga 3 hours ago

    > AI labs are constantly looking for new sources of high quality data

    OpenAI has 200M users who interactively solve over 1B tasks per month. At roughly 1-2K tokens per task, that amounts to 1-2 trillion mixed human/AI tokens. The fact is that every user has their own unique life experience and a reservoir of tacit knowledge they didn't communicate or write down anywhere else. The LLM can elicit that tacit knowledge, which would otherwise be lost; it can crawl our minds for ideas and problem-solving choices.

    LLMs are in a situation of indirect agency. When they propose a solution, the human usually takes it out and implements it in the real world, then comes back for more help, reporting the outcomes. Across many sessions it becomes possible to check which AI ideas worked out and which were bad. This is a huge resource; it collects experience from every user. The LLM becomes an experience flywheel: people are attracted to the best models, and those models will get the lion's share of this experience.

    And yes, you can do it with privacy in mind. You can train just a preference model instead of doing supervised training on chat logs: a model that picks the right answer out of a lineup. This way PII and user specifics don't leak.
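
    A minimal sketch of what that pairwise setup could look like (the names and shapes here are illustrative, not any lab's actual pipeline):

      # Train a scorer to pick the better of two candidate answers.
      # The raw chat text is never a training target, only the
      # comparison signal - that is the privacy point above.
      import torch
      import torch.nn as nn

      class PreferenceModel(nn.Module):
          def __init__(self, encoder, hidden_size):
              super().__init__()
              self.encoder = encoder            # any text encoder
              self.score = nn.Linear(hidden_size, 1)

          def forward(self, prompt_and_answer):
              h = self.encoder(prompt_and_answer)   # (batch, hidden)
              return self.score(h).squeeze(-1)      # scalar score

      def preference_loss(model, chosen, rejected):
          # Bradley-Terry style: maximize P(chosen beats rejected).
          return -nn.functional.logsigmoid(
              model(chosen) - model(rejected)
          ).mean()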

    • Tier3r 2 hours ago

      Such a service is the opposite of a flywheel (a brake?) in practice. Those tokens are extremely low quality data.

      1) I strongly doubt even one in a thousand people actually bothers to report the outcome of their result. I and every programmer I know certainly don't (whether we give up or it works, we just stop asking). I'd guess it's one in a hundred thousand who gives detailed feedback useful for training, and one in ten million who is an expert sitting down to talk through their deep knowledge with it (and then there's the problem of deducing from the chats that they aren't some crazy person).

      2) AIs are extremely confident in their answers, which fools many people, especially those in a void of knowledge. Even if people did tell OpenAI whether every solution worked or not, I would heavily discount the accuracy of such data.

      3) AI output autocannibalism does not lead to better outcomes; AI companies avoid using AI data for training like the plague. I doubt mixed tokens would be much better.

      The situation in reality is that something like one in some huge number of mixed tokens - maybe one in hundreds of thousands - can be useful. Of those, the ones that repeat high-quality sources like textbooks, dictionaries, and man pages have no value. For the remainder, there is a huge problem of how you extract these needles from the haystack with high confidence. Given the incredibly lopsided confusion matrix (a massive number of actual negatives to actual positives) and this incredibly unstructured data set, I doubt it's even remotely possible to avoid a totally unacceptable ratio of false positives to true positives. Letting this kind of garbage data in is how you get Gemini's gasoline spaghetti.
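
      To put rough numbers on the confusion matrix point (the figures below are made up for illustration):

        # Base-rate sketch: even a very good filter drowns in false
        # positives when useful tokens are ~1 in 100,000.
        base_rate = 1e-5       # fraction of tokens that are useful
        sensitivity = 0.99     # P(flagged | useful)
        specificity = 0.999    # P(rejected | junk)

        tp = base_rate * sensitivity
        fp = (1 - base_rate) * (1 - specificity)
        print(f"precision = {tp / (tp + fp):.2%}")
        # ~0.98% - roughly 99 of every 100 flagged tokens are junk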

      • simonw 35 minutes ago

        “AI companies avoid using AI data for training like the plague”

        That’s not accurate. All of the big LLM training labs are leaning increasingly into deliberately AI-created training data these days. I’m confident that’s part of the story behind the big improvements for tasks like coding in models such as Claude 3.5 Sonnet.

        The idea of “model collapse” from recursively training on AI-created data only seems to occur in lab experiments that deliberately engineer those conditions, from what I’ve seen. It doesn’t appear to be a major concern in real-world model training.

  • whimsicalism an hour ago

    > People tend to over-estimate the value of their data for training. AI labs are constantly looking for new sources of high quality data - but quality really matters to them. Random junk people feed into the models is right at the bottom of that quality list.

    I do not think this is at all accurate. Sure, quality is increasingly important, but that is just the pendulum shifting only slightly back from the fact that quantity is the primary thing you need.

    • simonw a few seconds ago

      Here's one of the signals that makes me think quality matters more than quantity:

      https://twitter.com/karpathy/status/1797313173449764933

      > Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information. The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all.

  • meroes 3 hours ago

    I don’t think it’s about junk vs. not junk here. It’s about whether this is novel enough data compared to the billions of images and hours they already possess. As others have said, these captures are fundamentally different from pointing and shooting - downtime in the home, or moments where a traditional photo might be inappropriate - so the data is novel.

  • fraboniface 4 hours ago

    Video streams from everyday life sound like extremely high-quality data for training AI. Videos available on the internet are a very biased sample by comparison.

  • nbardy 4 hours ago

    Ray-Bans won’t produce random junk. It will be incredible end-task data: seeing how humans perform tasks in arbitrary homes.

  • freedomben 3 hours ago

    I agree, we shouldn't be jumping to conclusions. Especially with flawed logic/reasoning like this:

    > it doesn't seem like there would be a reason for Meta to be ambiguous about answering, especially with all of the public commentary on the methods and data that companies use for training.

    There is another possible reason, and it's not that hard to think of: they want to keep their options open. It's also possible that they don't want to play the game of "anything not explicitly denied is an admission."

    My guess is that they are currently (or will in the future be) training on it, but I don't think we should take these statements as evidence of that.

    In general I think big tech is atrocious with private information, and if the average person knew the depth of the data they have, they would not stand for it. Certainly Meta doesn't have a great track record on the data front, so there's no reason to think they'd be different here. Unfortunately, average people think people like me are paranoid and/or crazy when I try to tell them about it. At best they just feel powerless, shrug, and keep using the product anyway.

  • unshavedyak 4 hours ago

    I have a feeling that random junk will get valuable again once AI can augment it at scale. Which I’m sure already happens; I’m just assuming it’ll get more extreme.

Shank 4 hours ago

The glasses actually have two modes for taking pictures. First, you can take a picture with the capture button or just by asking for one. Second, you can ask Meta AI to look at something and tell you about it: you say “Hey Meta, look at this and tell me what it is” and it goes off to the AI cloud to get an answer.

I would assume that since you can enable AI and opt in to “improve the AI,” that data is fair game for training.

But when you take photos using the capture button or by saying “Hey Meta, take a photo,” the photos don’t even use cloud storage by default. You have to specifically turn on Meta’s temporary cloud storage feature, which syncs data while charging, to work around iOS rules on your iPhone. If you don’t, those photos are local-only.

There are cases where questions can be answered just by using the product and reading the legal documents. I think this is clearly lackluster research from TechCrunch, carried over to MacRumors.

rchaud 4 hours ago

Proof positive that paying for a product doesn't stop the vendor from collecting your data anyway to monetize something else.

The real question is: who's paying $500 for the privilege of being a willing mule for Zuck's surveillance empire-building dreams?

  • dylan604 3 hours ago

    At this point, if it has the Meta name on it, you'd be pretty safe in assuming that collecting data is part of its purpose.

    • timeon 3 hours ago

      I still remember the time when spyware was considered to be on the same level as a virus.

      Now it is just: 'we care about your privacy, allow us to sell it'.

      • jajko 2 hours ago

        "allow", unless I see the code and DB and all data flows on it for all branches, its extremely safe to presume with companies like Meta that with private data they do morally the worst thing possible, and breaking laws is just part of business. Why the naivety in 2024. Company that stores users password in plaintext (recent EU fine) ain't the forefront of user security in any meaningful way.

        That's the core of their market value. Just think for a second about what is valuable to them and what they couldn't care less about.

mgh2 4 hours ago

Not surprised. The question is: will the average buyer demographic care?

As with their social media products targeting less savvy consumers, will this product cross into the mainstream after the early tech-savvy adopters (innovators)?

Mark is betting they won't care, as with other tech that uses their data (e.g. Android).

  • ilrwbwrkhv 4 hours ago

    No. It's the boiling frog situation. The world will get shittier and shittier and everyone will complain but no one will stop it.

  • karaterobot 3 hours ago

    People will complain when the data is inevitably misused (I'm shocked! Shocked to find irresponsible data collection going on at Facebook!) but they won't do anything to prevent or avoid it in advance, if doing so gets in the way of them buying something that is sufficiently hyped.

cs702 3 hours ago

"Probably?"

In what universe would Meta not use the data it collects to improve its AI models?

Most consumers don't care about privacy implications in the abstract, so they won't even think of asking Meta to stop.

Most tech people working with AI want Meta to continue to improve its open-source, open-weight Llama models, so they will be reluctant to ask Meta to stop.

Simon_ORourke 4 hours ago

Of course they are; anything for a quick buck. But it must be hinted at somewhere in the terms and conditions.

  • rchaud 4 hours ago

    One yearns for simpler times when the buck could be made from the sale of $350-$500 glasses alone.

BadHumans 4 hours ago

If you didn't expect this then I don't know what to say at this point. Meta must have hired a new PR agency because the amount of leeway and charitable interpretations I've seen given to this company recently is absurd given their track record.

zombiwoof 3 hours ago

This crosses a red line. So Meta can use passerby images of people like me to train their models, when I want nothing to do with them?

  • SoftTalker 2 hours ago

    You’ve never really had a reasonable expectation of privacy walking around in public.

    • jajko 2 hours ago

      One actor is the state; that's usually not the hill to die on. The other actors are individuals and private companies, who will never act in your interest and mostly act directly against it.

      I'd say it's a worthy battle to wage, even if e.g. via the EU, but they need to step up the fines to be really punitive and demotivating for clearly long-term amoral/illegal practices, i.e. tens of percent of global revenue (income can be, and is, trivially gamed by companies of that size).

      • SoftTalker an hour ago

        Yes, I see the difference. If a reason existed, it has always been possible for an individual (a private investigator, the police, a random infatuated person) to follow you around and track your movements in public.

        It has not been possible for a corporation to collect images and videos from thousands of sources, use facial recognition and AI to track the movements of many people in public, and associate all of that with their online activities, credit reports, and other records - information that will then be subject to subpoena or worse, depending on the desires of whatever governments are in power in the future.

mattlutze 3 hours ago

Unique datasets are the moat for AI businesses. The data from these glasses is quite novel in its perspective, context, and continuity, among other attributes.

Versus mobile phones, Meta are making a better go at, and seem to be the leader on, the bet that fundamentally ML/AI-driven glasses are going to be the next default UX modality for the Internet.

rkahga 3 hours ago

Given that Facebook has been creating shadow profiles of non-users for a long time, it is not far-fetched to assume that they will track the physical contacts of the Ray-Ban spy-device wearer and record every interaction.

The current status regarding electronic devices is:

- If you have a pager or walkie-talkie, assume that it might blow up.

- If you have a smart phone, assume that it records your conversations.

- If another person has these Ray-Bans, run, don't walk.

More and more people know this. Even non-technical people are beginning to wake up.

Melatonic 2 hours ago

Don't worry - they're only collecting (Meta)-Data

theptip 3 hours ago

ChatGPT does this too by default, right?

sub7 3 hours ago

Despite all the money they've recently spent trying to rehab the founder's image, Meta's core DNA is built around invasion of privacy, dark silent opt-in patterns, abuse of user data, etc.

I judge anyone who made their money there, because they have made the world a much shittier place just by existing, like some low-quality 21st-century nicotine dealership.

These glasses will 100% be used to ID, track, and ad-bucket-tag people without their consent. I'll be slapping them off the face of anyone who looks in my direction, and you should too.

isodev 3 hours ago

Is this Apple PR trying to stop people from buying Meta VR/MR hardware?

I mean, it’s Facebook, we get it. But they make WhatsApp, they make the affordable and actually-working Quest, and now glasses… I’m not going to get triggered by opinion posts just because the Vision Pro was a flop.

This would sound exactly the same if we said Apple were training their AI on everyone’s Photos and on content from notifications (because Apple Intelligence).

spacecadet 2 hours ago

Any wearable that includes audio or image capture is spyware, including the AirPods with their new hearing-aid firmware... I'm actively building an app that uses them to record and transcribe conversations from across a room. Where is the consent?! It's madness to me, and it goes to show how people put the glamor of new technology over the privacy and rights of others...
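
A rough sketch of how little it takes (not my actual app; it assumes the AirPods show up as the system's default input device, and the model name and file path are illustrative):

  # Record from the default input device, then transcribe offline.
  import sounddevice as sd
  import soundfile as sf
  import whisper  # openai-whisper

  SAMPLE_RATE = 16_000
  SECONDS = 30

  audio = sd.rec(int(SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
  sd.wait()                                  # block until recording finishes
  sf.write("clip.wav", audio, SAMPLE_RATE)   # save for transcription

  model = whisper.load_model("base")
  print(model.transcribe("clip.wav")["text"])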

DesiLurker 3 hours ago

Duh! There is almost zero chance they are not. This is probably how they are selling such a big expense line item internally.

nonrandomstring 4 hours ago

TFA has some entertaining descriptions of the userbase: low-IQ knuckle-draggers who need help choosing their clothes and can't remember where they parked their car. I'd prefer a more honest take on how most "glassholes" actually use this tech: finding the names and addresses of "hot" passers-by they've recorded for their wank-bank. Wait till all this gets leaked from Meta (which is inevitable) and we see where the average user's attention really dwells.