What? I’m not doubting what he said. Just surprised. Look at this. I really hope Sam IPOs his company so I can short it.
LLM inference can be batched, reducing the cost per request. If you have too few customers, you can’t fill the optimal batch size.
That said, the optimal batch size on today’s hardware is not big (<100). I would be very very surprised if they couldn’t fill it for any few-seconds window.
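For anyone who hasn’t seen why batching changes the per-request economics, here’s a rough back-of-the-envelope sketch in Python. Every number in it (GPU price, per-token step times, completion length) is a made-up assumption for illustration, not a measurement of any real provider:

```python
# Toy cost model for batched LLM decoding. All constants below are
# illustrative assumptions, not measurements of any real deployment.

GPU_COST_PER_HOUR = 2.0      # assumed $/hour for one accelerator
STEP_TIME_BATCH_1 = 0.020    # assumed seconds per decode step at batch size 1
STEP_TIME_BATCH_64 = 0.030   # assumed seconds per decode step at batch size 64
TOKENS_PER_REQUEST = 500     # assumed average completion length

def cost_per_request(step_time: float, batch_size: int) -> float:
    """GPU cost attributed to one request when `batch_size` requests share each decode step."""
    total_gpu_seconds = step_time * TOKENS_PER_REQUEST
    gpu_seconds_per_request = total_gpu_seconds / batch_size
    return gpu_seconds_per_request * GPU_COST_PER_HOUR / 3600

print(f"batch  1: ${cost_per_request(STEP_TIME_BATCH_1, 1):.5f} per request")
print(f"batch 64: ${cost_per_request(STEP_TIME_BATCH_64, 64):.5f} per request")
# The step time barely grows while 64 requests share it, so the per-request
# cost drops by roughly 40x -- but only if there are enough concurrent users
# to actually fill the batch in each few-seconds window.
```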
Okay, that sounds like the best one could get without self-hosting. Shame they don’t have the latest open-weight models, but I’ll try it out nonetheless.
Interesting. So they mix the requests between all DDG users before sending them to “underlying model providers”. The providers like OAI and Anthropic will likely log the requests, but mixing is still a big step forward. My question is what do they do with the open-weight models? Do they also use some external inference provider that may log the requests? Or does DDG control the inference process?
Stop depending on these proprietary LLMs. Go to !localllama@sh.itjust.works.
There are open-source LLMs you can run on your own computer if you have a powerful GPU. Models like OLMo and Falcon are made by true non-profits and universities, and they reach GPT-3.5 level of capability.
There are also open-weight models that you can run locally and fine-tune to your liking (although these don’t have open-source training data or code). The best of these (Alibaba’s Qwen, Meta’s Llama, Mistral, DeepSeek, etc.) match and sometimes exceed GPT-4o’s capabilities.
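If you want a feel for how little glue code self-hosting takes, here’s a minimal sketch that queries a locally hosted open-weight model through an OpenAI-compatible endpoint (llama.cpp’s server, vLLM, and Ollama can all expose one). The URL, port, and model name are placeholders for whatever you actually run:

```python
# Minimal sketch: send a chat request to a local OpenAI-compatible server.
# The address and model name are placeholders -- adjust to your own setup.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",   # assumed local server address
    json={
        "model": "qwen2.5-7b-instruct",            # placeholder model name
        "messages": [
            {"role": "user", "content": "Summarize the trade-offs of self-hosting LLMs."}
        ],
        "max_tokens": 256,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```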
Can someone explain why I am being downvoted and attacked in this thread? I swear I am not sealioning. Genuinely confused.
@sc_griffith@awful.systems asked how request frequency might impact cost per request. Batch inference is a reason (ask anyone in the self-hosted LLM community). I noted that this reason only applies at very small scale, probably much smaller than what OpenAI is operating at. @dgerard@awful.systems, why did you say I am demanding someone disprove the assertion? Are you misunderstanding “I would be very very surprised if they couldn’t fill [the optimal batch size] for any few-seconds window” to mean “I would be very very surprised if they are not profitable”?
The tweet I linked shows that good LLMs can be much cheaper to run. I am saying that OpenAI is very inefficient and thus economically “cooked”, as the post title would have it. How does this make me FYGM? @froztbyte@awful.systems