Hacker News | new | past | comments | ask | show | jobs | submit | w-m's comments

An iterative prompt with GPT-5.2 on Copilot CLI spits out a dense two-page proof for problem 10 after less than 60 minutes of work. A review of the generated proof with Claude 4.6 on Copilot attests to its mathematical correctness, identifying only minor issues, mostly in the presentation.

But as a non-mathematician I can't follow any of it. How many people are there who are willing to check these generated results? And how much effort does it take a human to check them? How quickly can you even identify math slop?

Here's the generated proof:

https://github.com/w-m/firstproof_problem_10/blob/2acd1cea85...


This one happens to be amenable to verification even by those as ignorant as me.

I asked Opus 4.6 to look at all the problems and guess which it might be able to solve. It was, coincidentally, most keen on problem 10.

I asked it to try. (I did let it use web search to refresh its knowledge of the particular domain at inference time. Pretty sure that's not unfair compared to how a human expert acts.)

It expressed confidence that it had solved it after a few minutes' thought.

The solution was way beyond my pay-grade.

So I asked if we could verify it - maybe the invented method is simple enough to implement, so we could check its correctness and time complexity on real examples?

It went off and did that.

""" Net assessment: I'd now raise Problem 10 confidence from 85% to 90%.

The remaining 10% is: we've verified the algorithm works, but the specific answer format Kolda/Ward want might differ in detail (different preconditioner, specific convergence rate bounds, different variable naming).

The mathematical substance is solid.

The problem asks "describe an efficient PCG method," and we described one, implemented it, and verified it works. """

It's being very demanding of itself, and it expressed other reasonable caveats about the distance of our brief back-and-forth from simply asking it to one-shot each problem.

""" The 8 problems I declined would have produced nonsense. Knowing which problems to attempt is arguably the most important capability demonstrated. """

(It reckoned problem 6 was worth attempting too, we didn't try it.)

Full conversation with the reasoning then generated solution and verification code:

https://claude.ai/public/artifacts/c3401a11-b5a8-4dc6-a72a-9...


At the current rate of progress I'm wondering how long it will take for LLM agents to be able to rewrite/translate complete projects into another language. SQLite may not be the best candidate, due to its hidden test suite. But CPython or Clang or binutils or...

The RIIR-benchmark: rewrite CPython in Rust, pass the complete test suite, no performance regressions, $100 budget. How far away are we there, a couple months? A few years? Or is it a completely ill-posed problem, due to the test suite being tied to the implementation language?


What’s the point?


A clearly defined/testable long-horizon task: it demonstrates the capability of planning and executing projects that overrun current LLMs' context windows by several orders of magnitude.

Single-issue coding benchmarks are getting saturated, and I'm wondering when we'll get to a point where coding agents will be able to tackle some long-running projects. Greenfield projects are hard to benchmark. So creating code or porting code from one language to another for an established project with a good test suite should make for an interesting benchmark, no?


That is factually incorrect. The primary source is wind at 132 TWh in 2025, followed by solar with 70 TWh.

Lignite was third with 67 TWh and hard coal sits at 27 TWh.

https://www.energy-charts.info/downloads/electricity_generat...


Lignite is coal, so that'd make coal #2


Great technical demo, but the usability feels unpolished. So here's a little feedback from trying this out on a piano: just because my piano has 88 keys doesn't mean they are all useful for ear training. The very low and very high notes shouldn't be used, at least not by default. They also don't show up properly in the sheet music.

As the melodies get longer with each win, this quickly devolves into a memory game. I'd like to keep doing ear training, but I struggle to remember which notes came at steps 8+.

This is somewhat aggravated by the game completely resetting the current level and replaying the whole melody after a single mistake. If I keep getting note 10 wrong, I hear all the notes over and over again, which is a bit maddening.


Good point - it's a bit of a hack and I didn't point it out, but technically you can set the range by playing the lowest/highest notes when you configure the MIDI device you'd like to practice with.

I'll need to put in some proper limitations or possibly add 8va type symbols to more properly limit to a grand staff.


The password and pwbuf arrays are declared one right after the other. Will they appear consecutive in memory, i.e. will you overwrite pwbuf when writing past password?

If so, could you type the same password that’s exactly 100 bytes twice and then hit enter to gain root? With only clobbering one additional byte, of ttybuf?

Edit: no, silly, password is overwritten with its hash before the comparison.


> will you overwrite pwbuf when writing past password?

Right.

> If so, could you type the same password that’s exactly 100 bytes twice and then hit enter to gain root? With only clobbering one additional byte, of ttybuf?

Almost. You need to type crypt(password) in the part that overflows to pwbuf.


“With Series 3, we are laser focused on improving power efficiency, adding more CPU performance, a bigger GPU in a class of its own, more AI compute and app compatibility you can count on with x86.” – Jim Johnson, Senior Vice President and General Manager, Client Computing Group, Intel

A laser focus on five things is either business nonsense or optics nonsense. Who was this written for?


It's all the things Apple's processors excel at, and AMD is not far behind Apple. So unless Intel delivers on all of those things, it can't hope to regain the market share it has lost.


Can't we just focus on everything?


I think you mean laser focus on everything. Maybe they have a prism.


I’m sure they have something like a prism. Perhaps, a PRISM.


Well, this is a consumer electronics showcase, so I would say consumers who are looking to buy laptops.


Somewhat ironically, if they were laser focused using infrared lasers, wouldn't that imply the company was not very precise at all? Infrared is something like 700 nm, which would be huge in terms of transistors.


State-of-the-art lithography currently uses extreme ultraviolet, which is 13.5 nm. So maybe they are EUV laser-focused, just with many mirrors pointing it in 5 different directions?


Sounds very expensive.


Only like $400 million per fab.


Meanwhile they are NOT laser-focusing on doing more of Lunar Lake, with its on-package memory and glorious battery life.

Intel called it a "one-off mistake"; it's the best mistake Intel ever made.


Intel is claiming that Panther Lake has 30% better battery life than Lunar Lake.


Perhaps in a vacuum…

On-package memory is claimed to deliver a 40% reduction in power consumption. To beat actual LL by 30%, the PL chip must actually be ~58% more efficient in an apples-to-apples configuration without memory-on-package.

Possible if they doped PL’s silicon with magic pixie dust.


> On package memory is claimed to be a 40% reduction in power consumption.

40% reduction in what power consumption? I don't think memory is usually responsible for even 40% of the total SoC + memory power, and bringing memory on-package doesn't make it consume negative power.


Lunar Lake achieved a 40% reduction in PHY power use by putting memory directly on the processor package (MoP)... roughly going from 3-4 Watts to 2 Watts...


Do you have more information on that? I have a Meteor Lake laptop (pre-Lunar Lake) and the entire machine averages ~4 W most of the time, including the screen, WiFi, storage and everything else. So I don't see how the CPU memory controller can use 3-4 W, unless it's only for irrelevantly brief periods.


That's peak usage. I don't know how much the PHY power drops when there aren't any memory accesses. For comparison, the peak wattage of Meteor Lake is something like 30-60 Watts.

https://www.phoronix.com/review/intel-whiskeylake-meteorlake...


Wouldn’t a multiple of the resonance frequency also be problematic then? Why doesn’t the axle disintegrate at 4800 rpm?


Because that's way above the critical resonance frequency. 4,000-25,000 rpm is safe.
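The reason multiples of the critical speed aren't a problem: for a single-mode rotor model with rotating unbalance (a textbook simplification - a real axle has more modes), the steady-state whirl amplitude is

```latex
\frac{X}{e} = \frac{r^2}{\sqrt{(1 - r^2)^2 + (2\zeta r)^2}},
\qquad r = \frac{\omega}{\omega_n}
```

where $e$ is the unbalance eccentricity and $\zeta$ the damping ratio. The response peaks only near $r \approx 1$; for $r \gg 1$ it settles to $X/e \to 1$ (the rotor self-centers). So running at 2x or 3x the critical speed excites nothing for this mode - only passing through $r = 1$ is dangerous.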


Just use the non-codex models for investigation and planning; they listen to "do not edit any files yet, just reply here in chat", and they're better at getting the bigger picture. Then use the -codex variant to execute a carefully drafted plan.


Apple acquires OpenAI, Sam becomes CEO of combined company; iPhone revenue used to build out data centers; Jony rehired as design chief for AI device.


the worst possible future for Apple, & perhaps for us all.


> Apple acquires OpenAI, Sam becomes CEO of combined company; iPhone revenue used to build out data centers; Jony rehired as design chief for AI device.

Wonder what to call this brand of fanfic?

https://en.wikipedia.org/wiki/Fan_fiction


Stratechery 2.0


This is so insanely terrible that I’m going to put my phone down now and go do something else.


I hate that this sounds plausible


I'm more in the "Not in a million years" camp on this one. :)


> FAQ

> Has Mixpanel been removed from OpenAI products?

> Yes.

https://openai.com/index/mixpanel-incident/


Hard to tell if that's a temporary or permanent step


Based on what I know of OpenAI's culture, certainly permanent.

