Hacker News | new | past | comments | ask | show | jobs | submit | w-m's comments

An iterative prompt with GPT-5.2 on Copilot CLI spits out a dense two-page proof for problem 10 after less than 60 minutes of work. A review of the generated proof with Claude 4.6 on Copilot attests to its mathematical correctness, identifying only minor issues, mostly in the presentation.

But as a non-mathematician I can't follow any of it. How many people are there who are willing to check these generated results? And how much effort does it take a human to check them? How quickly can you even identify math slop?

Here's the generated proof:

https://github.com/w-m/firstproof_problem_10/blob/2acd1cea85...


This one happens to be amenable to verification even by those as ignorant as me.

I asked Opus 4.6 to look at all the problems and guess which it might be able to solve. It was, coincidentally, most keen on problem 10.

I asked it to try. (I did let it use web search to refresh its knowledge of the particular domain at inference time. Pretty sure that's not unfair compared to how a human expert acts.)

It expressed confidence that it had solved it after a few minutes' thought.

The solution was way beyond my pay-grade.

So I asked if we could verify it - maybe the invented method is simple enough to implement, so we could check its correctness and time complexity on real examples?

It went off and did that.

""" Net assessment: I'd now raise Problem 10 confidence from 85% to 90%.

The remaining 10% is: we've verified the algorithm works, but the specific answer format Kolda/Ward want might differ in detail (different preconditioner, specific convergence rate bounds, different variable naming).

The mathematical substance is solid.

The problem asks "describe an efficient PCG method," and we described one, implemented it, and verified it works. """

It's being very demanding of itself, and it expressed other reasonable caveats about the distance of our brief back-and-forth from simply asking it to one-shot each problem.

""" The 8 problems I declined would have produced nonsense. Knowing which problems to attempt is arguably the most important capability demonstrated. """

(It reckoned problem 6 was worth attempting too, we didn't try it.)

Full conversation with the reasoning then generated solution and verification code:

https://claude.ai/public/artifacts/c3401a11-b5a8-4dc6-a72a-9...


At the current rate of progress I'm wondering how long it will take for LLM agents to be able to rewrite/translate complete projects into another language. SQLite may not be the best candidate, due to its hidden test suite. But CPython or Clang or binutils or...

The RIIR-benchmark: rewrite CPython in Rust, pass the complete test suite, no performance regressions, $100 budget. How far away are we there, a couple months? A few years? Or is it a completely ill-posed problem, due to the test suite being tied to the implementation language?


What’s the point?


A clearly defined/testable long-horizon task: it demonstrates the capability of planning and executing projects that overrun current LLMs' context windows by several orders of magnitude.

Single-issue coding benchmarks are getting saturated, and I'm wondering when we'll get to a point where coding agents will be able to tackle some long-running projects. Greenfield projects are hard to benchmark. So creating code or porting code from one language to another for an established project with a good test suite should make for an interesting benchmark, no?


That is factually incorrect. The primary source is wind at 132 TWh in 2025, followed by solar with 70 TWh.

Lignite was third with 67 TWh and hard coal sits at 27 TWh.

https://www.energy-charts.info/downloads/electricity_generat...


Lignite is coal, so that'd make coal #2


Great technical demo, but the usability feels unpolished. So here's a little feedback from trying this out on a piano: just because my piano has 88 keys doesn't mean they are all useful for ear training. The very low and very high notes shouldn't be used, at least not by default. They also don't show up properly in the sheet music.

As the melodies get longer with each win, this quickly devolves into a memory game. I'd like to keep doing ear training, but I struggle to remember which notes came at steps 8+.

This is somewhat aggravated by the game completely resetting the current level and replaying the whole melody after a single mistake. If I keep getting note 10 wrong, I hear all the notes over and over again, which is a bit maddening.


Good point - it's a bit of a hack and I didn't point it out, but technically you can set the range by playing the lowest/highest notes when you configure the MIDI device you'd like to practice with.

I'll need to put in some proper limitations or possibly add 8va type symbols to more properly limit to a grand staff.


The password and pwbuf arrays are declared one right after the other. Will they appear consecutive in memory, i.e. will you overwrite pwbuf when writing past password?

If so, could you type the same password that’s exactly 100 bytes twice and then hit enter to gain root? With only clobbering one additional byte, of ttybuf?

Edit: no, silly, password is overwritten with its hash before the comparison.


> will you overwrite pwbuf when writing past password?

Right.

> If so, could you type the same password that’s exactly 100 bytes twice and then hit enter to gain root? With only clobbering one additional byte, of ttybuf?

Almost. You need to type crypt(password) in the part that overflows to pwbuf.


“With Series 3, we are laser focused on improving power efficiency, adding more CPU performance, a bigger GPU in a class of its own, more AI compute and app compatibility you can count on with x86.” – Jim Johnson, Senior Vice President and General Manager, Client Computing Group, Intel

A laser focus on five things is either business nonsense or optics nonsense. Who was this written for?


It's all the things Apple's processors excel at, and AMD is not far behind Apple. So unless Intel delivers on all of those things, it can't hope to regain the market share it has lost.


Can't we just focus on everything?


I think you mean laser focus on everything. Maybe they have a prism.


I’m sure they have something like a prism. Perhaps, a PRISM.


Well, this is a consumer electronics showcase, so I would say consumers who are looking to buy laptops.


Somewhat ironically, if they were laser focused using infrared lasers, wouldn't that imply the company was not very precise at all? Infrared is something like 700 nm, which would be huge in terms of transistors.


State-of-the-art lithography currently uses extreme ultraviolet, which is 13.5 nm. So maybe they are EUV laser-focused, just with many mirrors pointing it in 5 different directions?


Sounds very expensive.


Only like $400 million per fab.


Meanwhile they are NOT laser-focusing on doing more of Lunar Lake, with its on-package memory and glorious battery life.

Intel called it a "one-off mistake"; it's the best mistake Intel ever made.


Intel is claiming that Panther Lake has 30% better battery life than Lunar Lake.


Perhaps in a vacuum…

On-package memory is claimed to deliver a 40% reduction in power consumption. To beat actual LL by 30%, the PL chip must actually be ~58% more efficient in an apples-to-apples configuration without memory-on-package.

Possible if they doped PL’s silicon with magic pixie dust.


> On package memory is claimed to be a 40% reduction in power consumption.

40% reduction in what power consumption? I don't think memory is usually responsible for even 40% of the total SoC + memory power, and bringing memory on-package doesn't make it consume negative power.


Lunar Lake achieved a 40% reduction in PHY power use by putting memory directly on the processor package (MoP)... roughly going from 3-4 Watts to 2 Watts...


Do you have more information on that? I have a Meteor Lake laptop (pre-Lunar Lake) and the entire machine averages ~4 W most of the time, including the screen, WiFi, storage and everything else. So I don't see how the CPU memory controller can use 3-4 W, unless it's only for irrelevantly brief periods.


That's peak usage. I don't know how much the PHY power drops when there aren't any memory accesses. For comparison, the peak wattage of Meteor Lake is something like 30-60 Watts.

https://www.phoronix.com/review/intel-whiskeylake-meteorlake...


Wouldn’t a multiple of the resonance frequency also be problematic then? Why doesn’t the axle disintegrate at 4800 rpm?


Because that's way above the critical resonance frequency. 4,000-25,000 rpm is safe.
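The reason multiples of the critical speed aren't a problem: for a single-mode rotor model with rotating unbalance (a textbook simplification - a real axle has more modes), the steady-state whirl amplitude is

```latex
\frac{X}{e} = \frac{r^2}{\sqrt{(1 - r^2)^2 + (2\zeta r)^2}},
\qquad r = \frac{\omega}{\omega_n}
```

where $e$ is the unbalance eccentricity and $\zeta$ the damping ratio. The response peaks only near $r \approx 1$; for $r \gg 1$ it settles to $X/e \to 1$ (the rotor self-centers). So running at 2x or 3x the critical speed excites nothing for this mode - only passing through $r = 1$ is dangerous.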


Just use the non-codex models for investigation and planning; they listen to "do not edit any files yet, just reply here in chat", and they're better at getting the bigger picture. Then use the -codex variant to execute a carefully drafted plan.


Apple acquires OpenAI, Sam becomes CEO of combined company; iPhone revenue used to build out data centers; Jony rehired as design chief for AI device.


the worst possible future for Apple, & perhaps for us all.


> Apple acquires OpenAI, Sam becomes CEO of combined company; iPhone revenue used to build out data centers; Jony rehired as design chief for AI device.

Wonder what to call this brand of fanfic?

https://en.wikipedia.org/wiki/Fan_fiction


Stratechery 2.0


This is so insanely terrible that I’m going to put my phone down now and go do something else.


I hate that this sounds plausible


I'm more in the "Not in a million years" camp on this one. :)


> FAQ

> Has Mixpanel been removed from OpenAI products?

> Yes.

https://openai.com/index/mixpanel-incident/


Hard to tell if that's a temporary or permanent step


Based on what I know of OpenAI's culture, certainly permanent.

