
>So my verdict is that it's great for code analysis, and it's fantastic for injecting some book knowledge on complex topics into your programming, but it can't tackle those complex problems by itself.

I don't think you've seen the full potential. I'm currently #1 on 5 different very complex computer engineering problems, and I can't even write a "hello world" in Rust or C++. You no longer need to know how to write code; you just need to understand the task at a high level and nudge the agents in the right direction. The game has changed.

- https://highload.fun/tasks/3/leaderboard

- https://highload.fun/tasks/12/leaderboard

- https://highload.fun/tasks/15/leaderboard

- https://highload.fun/tasks/18/leaderboard

- https://highload.fun/tasks/24/leaderboard


All the naysayers here clearly have no idea. Your large matrix multiplication implementation is quite impressive! I have set up a benchmark loop and let GPT-5.1-Codex-Max experiment for a bit (not 5.2/Opus/Gemini, because they are broken in Copilot), but it seems to be missing something crucial. With a bit of encouragement, it has implemented:

    - padding from 2000 to 2048 for easier power-of-two splitting
    - two-level Winograd matrix multiplication with tiled matmul for last level
    - unrolled AVX2 kernel for 64x64 submatrices
    - 64 byte aligned memory
    - restrict keyword for pointers
    - better compiler flags (clang -Ofast -march=native -funroll-loops -std=c++17)
But yours is still easily 25% faster. Would you be willing to write a bit about how you set up your evaluation and which tricks Claude used to solve it?
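For reference, here is a rough sketch of what the alignment, restrict, and tiling items from that list can look like in C++. This is illustrative only, not the model's actual submission; it omits the Winograd splitting and the hand-unrolled AVX2 kernel, and leaves the inner loop for the compiler to vectorize under the flags above:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Tile edge matching the 64x64 submatrix kernel mentioned above.
constexpr int TILE = 64;

// 64-byte aligned, zero-initialized buffer. Note std::aligned_alloc
// requires the size to be a multiple of the alignment, which holds for
// the padded power-of-two dimensions (e.g. 2048 x 2048 floats).
float* alloc_aligned(std::size_t n) {
    void* p = std::aligned_alloc(64, n * sizeof(float));
    std::memset(p, 0, n * sizeof(float));
    return static_cast<float*>(p);
}

// C += A * B for n x n row-major matrices, n assumed a multiple of TILE
// (hence the padding from 2000 to 2048). restrict-qualified pointers
// tell the compiler the buffers don't alias, enabling vectorization.
void matmul_tiled(const float* __restrict a, const float* __restrict b,
                  float* __restrict c, int n) {
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                for (int i = ii; i < ii + TILE; ++i)
                    for (int k = kk; k < kk + TILE; ++k) {
                        float aik = a[i * n + k];
                        for (int j = jj; j < jj + TILE; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```

With `clang -Ofast -march=native -funroll-loops`, the innermost `j` loop typically auto-vectorizes; a hand-written AVX2 micro-kernel for the 64x64 tiles would replace it.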

Thank you. Yeah, I'm doing all those things, which do get you close to the top. The rest of what I'm doing is mostly micro-optimizations, such as finding a way to avoid the AVX→SSE transition penalty (a 1-2% improvement).

But I don't want to spoil the fun. The agents are really good at searching the web now, so posting the tricks here is basically breaking the challenge.

For example, ChatGPT was able to find Matt's blog post regarding Task 1, and that's what gave me the largest jump: https://blog.mattstuchlik.com/2024/07/12/summing-integers-fa...

Interestingly, it seems that Matt's post is not in the training data of any of the major LLMs.


How are you qualified to judge its performance on real code if you don't know how to write a hello world?

Yes, LLMs are very good at writing code; so good, in fact, that they often generate reams of unmaintainable spaghetti.

When you submit to an informatics contest you don't have paying customers who depend on your code working every day. You can just throw away yesterday's code and start afresh.

Claude is very useful but it's not yet anywhere near as good as a human software developer. Like an excitable puppy it needs to be kept on a short leash.


I know what it's like to run a business and build complex systems. That's not the point.

I used highload as an example because it seems like an objective rebuttal to the claim that "but it can't tackle those complex problems by itself."

And regarding this:

"Claude is very useful but it's not yet anywhere near as good as a human software developer. Like an excitable puppy it needs to be kept on a short leash"

Again, a combination of LLMs/agents with some guidance (from someone with no prior experience in this type of high-performance architecture) was able to beat all the human software developers who have taken these challenges.


> Claude is very useful but it's not yet anywhere near as good as a human software developer. Like an excitable puppy it needs to be kept on a short leash.

The skill of "a human software developer" in fact covers a very wide distribution, and your statement is true for an ever-shrinking tail end of it.


> How are you qualified to judge its performance on real code if you don't know how to write a hello world?

The ultimate test of all software is "run it and see if it's useful for you." You do not need to be a programmer at all to be qualified to test this.


What I think people get wrong (especially non-coders) is that they believe the limitation of LLMs is building a complex algorithm. In reality, that issue was fixed a long time ago. The real issue is building a product. Think about microservices across different projects, using APIs that are not perfectly documented or whose documentation is massive, etc.

Honestly, I don't know what commenters on Hacker News are building, but a few months back I was hoping to use AI to build the interaction layer with Stripe to handle multiple products and delayed cancellations via subscription schedules. Everything is documented; the documentation is a bit scattered across pages, but the information is out there. At the time there was Opus 4.1, so I used that. After several prompts, it had written 1,000 lines of non-functional code with zero reusability. I then asked ChatGPT whether it was possible without using schedules; it told me yes (even though it isn't), and when I told Claude to recode it, it started coding random stuff that doesn't exist. I built everything myself, functional and reusable, in approximately 300 lines of code.

The above is a software engineering problem. Reimplementing a JSON parser using Opus is neither fun nor useful, so that should not be used as a metric.


> The above is a software engineering problem. Reimplementing a JSON parser using Opus is not fun nor useful, so that should not be used as a metric.

I've also built a BitTorrent implementation from the specs in Rust, where I'm keeping the binary under 1 MB. It supports all active and accepted BEPs: https://www.bittorrent.org/beps/bep_0000.html

Again, I literally don't know how to write a hello world in Rust.

I also vibe-coded a trading system that is connected to 6 trading venues. This was a fun weekend project, but it ended up making +20k of pure arbitrage with just 10k of working capital. I'm not sure this proves my point, because while I don't consider myself a programmer, I did use Python, a language that I'm somewhat familiar with.

So yeah, I get what you are saying, but I don't agree. I used highload as an example because it is an objective way of showing that a combination of LLMs/agents with some guidance (from someone with no prior experience in this type of high-performance architecture) was able to beat all the human software developers who have taken these challenges.


This hits the nail on the head. There's a marked difference between a JSON parser and a real world feature in a product. Real world features are complex because they have opaque dependencies, or ones that are unknown altogether. Creating a good solution requires building a mental model of the actual complex system you're working with, which an LLM can't do. A JSON parser is effectively a book problem with no dependencies.

You are looking at this wrong. Creating a JSON parser is trivial. The thing is that my one-shot attempt was 10x slower than my final solution.

Creating a parser for this challenge that is 10x more efficient than the simple approach does require a deep understanding of what you are doing. It requires optimizing the hot loop (among other things) in ways that 90-95% of software developers wouldn't know how to do. It requires a deep understanding of AVX2.

Here you can read more about these challenges: https://blog.mattstuchlik.com/2024/07/12/summing-integers-fa...
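To make "optimizing the hot loop" concrete, here is a hedged sketch of the general kind of trick involved: a SWAR ("SIMD within a register") byte scan. This is a generic technique, not anything from the actual leaderboard solutions, which use AVX2 to process 32 bytes per iteration rather than the 8 shown here:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Returns a mask with the high bit set in every byte of x that is zero.
// This variant is exact (no cross-byte carry propagation), unlike the
// shorter (x - 0x01...) & ~x & 0x80... trick, which can produce a false
// positive in the byte above a genuine zero.
inline std::uint64_t zero_bytes(std::uint64_t x) {
    const std::uint64_t low7 = 0x7F7F7F7F7F7F7F7FULL;
    return ~(((x & low7) + low7) | x | low7);
}

// Count occurrences of byte `b` in `data`, 8 bytes per iteration.
std::size_t count_byte(const char* data, std::size_t len, unsigned char b) {
    const std::uint64_t pat = 0x0101010101010101ULL * b;  // broadcast b
    std::size_t count = 0, i = 0;
    for (; i + 8 <= len; i += 8) {
        std::uint64_t w;
        std::memcpy(&w, data + i, 8);  // safe unaligned load
        // XOR turns matching bytes into zero bytes, then count them.
        // __builtin_popcountll is a GCC/Clang builtin (std::popcount in C++20).
        count += (std::size_t)__builtin_popcountll(zero_bytes(w ^ pat));
    }
    for (; i < len; ++i)  // scalar tail
        count += ((unsigned char)data[i] == b);
    return count;
}
```

For instance, `count_byte(buf, len, '\n')` counts newlines with roughly 8x fewer loop iterations than a byte-at-a-time scan.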


You need to give it search and tool calls and the ability to test its own code and iterate. I too could not one-shot an interaction layer with Stripe without tools. It also helps to have it research a plan beforehand.

If that is true, then all the commentary around software people keeping their jobs due to "taste" and other nice words is just that: commentary. In the end, the higher-level stuff still needs someone to learn it (e.g. learning the AVX2 architecture, knowing what tech to work with), but IMO it requires significantly less practice than coding, which in itself was a gate. The skill morphs more into being a tech expert rather than a coding expert.

I'm not sure what this means for the future of SWEs yet, though. I don't see higher levels of staff in big businesses bothering to do this, and at some scale I don't see founders still wanting to manage all of these agents and processes (they have better things to do at higher levels). But I do see the barrier of learning to code gone, meaning it probably becomes just like any other job.


None of the problems you've shown there are anything close to "very complex computer engineering problems", they're more like "toy problems with widely-known solutions given to students to help them practice for when they encounter actually complex problems".

I think you misunderstood: it's not about solving the problem, it's about finding the most efficient solution. Give it a shot, and see if you can get into the top 10 on any task.

The point is these problems are well understood and solved; solving them well with AI doesn't mean anything, as you're just reciting something that's already been done.

>I'm currently #1 on 5 different very complex computer engineering problems

Ah yes, well known very complex computer engineering problems such as:

* Parsing JSON objects, summing a single field

* Matrix multiplication

* Parsing and evaluating integer basic arithmetic expressions

And you're telling me all you needed to do to get the best solution in the world to these problems was talk to an LLM?


Lol, the problem is not finding a solution, the problem is solving it in the most efficient way.

If you think you can beat an LLM, the leaderboard is right there.


I was able to lead these two competitions using LLM agents, with no prior Rust or C++ knowledge. They both have real-world applications.

- https://highload.fun/tasks/15/leaderboard

- https://highload.fun/tasks/24/leaderboard

In both cases my score showed other players that there were better solutions and pushed them to improve their scores as well.


This reads like a hit piece. Cars crash; that's obvious. Are Robotaxis crashing above or below human-driver rates?


I've been asking for independent analysis for years now. The data is there. Yet all the headlines are from people who have an obvious bias; this is the first headline I've seen where there is no evidence that the data has even been looked at.

There are many ways to "lie with statistics": comparing against all drivers, including those driving in weather that self-driving cars are not allowed to operate in, for example. There are many others, and I want some deep analysis to know how they are doing; so far I've not seen it.


> I've been asking for independent analysis for years now. The data is there

Independent analysis would be great, but Tesla has withheld, and even been deceptive about, its data.

Compare to https://waymo.com/safety/impact where anyone can download the data.


The biggest clue is that Tesla still needs to have a human supervisor in the car. They aren't doing that for show, it's an active admission that the tech isn't there yet.


The rates stated are about 10x higher than humans, and also far higher than Waymo.


Well, we're comparing Waymo to Tesla here, and Tesla is crashing a lot more than Waymo: 4 crashes in 250k miles versus like 10 in nearly 50 million?


From this article, Tesla crashes 50% more often. But hard to compare when one has a human safety driver and the other does not.

> the report finds that Tesla Robotaxis crash approximately once every 62,500 miles. Waymo vehicles, which have been involved in 1,267 crashes since the service went live, crash approximately every 98,600 miles. And, again, Waymo does not have human safety monitors inside its vehicles, unlike Tesla's Robotaxis.

https://mashable.com/article/tesla-robotaxis-with-human-safe...


How is it hard to compare? The company with a safety driver has more crashes. The comparison is easily made, and it makes things look even worse.


I think you also need to consider blame and severity, rather than raw crash numbers.


Tesla response:

Everything is caused by human safety drivers making mistakes. It's never the AI.


Hmm, it seems like the 1,267 crashes include general incidents (like people hitting the car).


Above human rates, for sure. In the 90s in my country, the accident rate was 5 per million kilometers (about 621,371 miles), and the rate has come down since.

Basically, they are crashing at the same rate as 18-25-year-olds in 90s France, when you could still drink like 3 glasses and drive.


You just made me think of our PSA.

Was it "un verre, ça va, deux verres, ça va, trois verres, bonjour les dégâts" ("one glass, fine; two glasses, fine; three glasses, hello damage")? Something like that.

Edit: looks like we didn't have "deux verres", maybe.


Driverless Teslas are the hit pieces. Hitting people. Ayooo.

Seriously though, Tesla has an extensive history of irresponsibly selling "Autopilot", which has killed a ton of people, because they don't take safety seriously. Waymo hasn't.


Can you expand on a "ton" of people? The best source I've found is this Wikipedia article:

https://en.wikipedia.org/wiki/List_of_Tesla_Autopilot_crashe...

But I suspect it isn't comprehensive. It's hard to get good data on this for a variety of reasons.


And are the crashes leading to fatalities at a higher or lower rate than human drivers?


This is the key. The known instances I've seen are very minor taps / fender benders. Not great, but not fatal accidents


Accountability is a pretty big issue, I think. We've accepted, for better or worse, a certain level of human-caused crashes for 100 years or so. If machines take the wheel they have to be an order of magnitude (or more) better.


I know what 99% of the people in HN are thinking while reading this post, and I agree, so I would like to hear the contrarian views:

Who is this useful for? What is the strategy behind this?


Their investors and investments. Artificially increasing the demand for AI by ramming it down users’ throats is great for the bottom line.

Edit: To give an actual contrarian view, this could be useful for people who are completely computer illiterate, but need to use a computer for work/school etc and have no desire to learn the basics of using a desktop OS.


>people who are completely computer illiterate, but need to use a computer for work/school etc and have no desire to learn the basics of using a desktop OS.

basically gen zalpha.


>Who is this useful for?

Middle manager careers

>What is the strategy behind this?

We must do something, this is something, so we do it.


People want to give you snarky responses without having read the article.

This is a push for a new way to interact with computers: using natural language, at every part of the interface, to give the computer instructions, and allowing those instructions to be a dialogue.

Also, on the topic of keyboard vs voice, I think differently when my thoughts are expressed through typing and speech. A computer that can follow my babble around (which would also need eye tracking or some brain connection to fully understand the context of each thought) is radically different than using a mouse or keyboard commands to select focus.

If anything, this marketing copy appears to be a push towards a first preview of the future. What you can actually do with it is rudimentary, but the paradigm of _how_ you interact is a major behavior change that is required before full capability can be unlocked. We are probably closer than people believe to a full holodeck experience of interacting in a fully 3D-rendered environment and having it manifest changes in real time. At some point the interface of the shell fades away completely, or at least optionally.


This is useful for corporations that want to reduce their workforce's mean skill level in order to lower wages without impacting revenue growth year-over-year. Such firms are the majority of Microsoft's revenue, I expect.


It amazes me that corporations think that employees making a decent wage and revenue are not correlated. What happens when everyone is making subsistence wages? Who is buying consumer goods at that point?


That’s why payment of wages in company scrip was made illegal! Previously, the next step was to pay people in a made-up currency that had no market value outside of the company store, so that workers could be effectively held prisoner at the destitution line by disabling their ability to save up money to leave. The corporation can buy the goods at a heavy discount in bulk from another corporation and no profits need be paid out to workers, nor to retention efforts.

This cycle, gig work and part-time hours are exploiting the health care gap to ensure that people are paid just barely enough money to show up to work the next day, but not enough money or hours so that they can afford to voluntarily quit being exploited. Universal healthcare and basic income at a national level would crater involuntary employment numbers and likely cause a banking crash when consumers default en masse on their credit debt balances.

“Chattel slavery is dead, but industrial slavery remains.” (1886)


Imagine someone who doesn't know much about computers, and doesn't really want to learn. Then imagine a PC with a built in AI assistant a few generations more advanced than today's SOTA.

-

User: hey, why is the game running so slow?

AI Assistant: starts checking things.

Telemetry, apps: GTA 6 was launched 1 hour ago, then stopped 4 minutes ago. Likely the source of the performance complaints.

Telemetry, frames: FPS was at 60 when the game started, started dropping after 30 minutes of gameplay, deteriorated sharply. Reached the low of 18 FPS with unstable frame timings 12 minutes ago and remained there. Confirms GTA 6 as the source of the performance complaints.

Game settings: "medium" preset, locked to 60 FPS, lines up with recommendations for this system.

Expected performance of this game on this hardware with those settings: should stay at 60, with dips down to 50 in heavy areas. Does not match the observed performance.

Telemetry, resource utilization: normal until 30 minutes into gameplay, then increasingly GPU bound, causing poor performance.

Telemetry, power: the laptop was at 100% of charge and plugged in throughout. Temps were at 40C when the game was launched, then started ramping up, fan reached its RPM limit at 20 minutes, temperature kept climbing. Progressively more throttling was applied to GPU and CPU. 80C on GPU die (thermal limit) was hit 34 minutes in. GPU performance was choked by the thermal throttling.

Local temperature: no dedicated air intake sensor to query the air temperature from. 19C outside according to the weather report, but the laptop is plugged in, so it's probably not outside. Inside would be warmer. No smart thermostat to query the air temperature from. Anything else to get the air temperature from? Paired smartphone: is in proximity, was not used or moved for the past hour, reports that its core temperature is 24C. That's the room temperature.

Cooling system analysis: cross-referencing power draw logs against estimated room temperature and internal temperatures. Not enough heat is being rejected by the cooling system. It underperforms the factory specification by at least 60%.

Past performance data: this laptop was used 7 times in the past 10 days. Seems like the cooling system was underperforming every time.

AI Assistant: your computer was overheating while you were playing GTA 6, which caused the performance issues. Check the air exhaust - the port at the left side of the laptop, with hot air coming out of it. Is there anything blocking it?

User: uh, no? The laptop was on the table the entire time. There's nothing to the left of it.

AI Assistant: it's likely that the laptop's cooling system needs cleaning. Should I check the prices at the nearest repair shop?


More likely:

User: hey, why is the game running so slow?

AI Assistant: It's because of me. I'm in the process of creating a detailed psychological profile based on what you play and how you play it.


Read the article, the author is definitely in favor of using git.


The monetary base expanded (aka "they printed a ton of money") between 2020-2022, so there were 2 options: the monetary base had to contract again, or the prices needed to adjust.


That is too simplistic.


Simple is good.


SPF is a Sun Protection Factor, meaning it multiplies the time it takes for your skin to burn. For example, if very light skin normally burns in about 10 minutes, SPF 20 stretches that to ~200 minutes, which is already over 3 hours. Since dermatologists recommend reapplying every 2 hours regardless, going beyond SPF 30–50 (which blocks ~97–98% of UVB) doesn’t add much practical benefit. Even for very fair skin, correct application and reapplication are far more important than chasing SPF 100.
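The arithmetic behind that, as a minimal sketch (standard SPF definitions, nothing specific to any product):

```cpp
#include <cassert>

// SPF multiplies the time to burn: protected time = baseline time * SPF.
double protected_minutes(double baseline_minutes, double spf) {
    return baseline_minutes * spf;
}

// Fraction of the UVB dose blocked: 1 - 1/SPF.
double uvb_blocked_fraction(double spf) {
    return 1.0 - 1.0 / spf;
}
```

So SPF 30 blocks about 96.7% and SPF 50 about 98%; doubling to SPF 100 only reaches about 99%, which is the diminishing return described above.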


Where I live in summer I regularly get days with UV index above 15.

If you burn in 15 minutes under UV index 6, then on the worst days I've seen here you'd burn in 5 minutes. So SPF 60 is about as useful here as SPF 20 is wherever you live.


Jesus H Christ, UV index of 15? I thought the 12 we see in the middle of Texas summers was bad. I've burnt in 10 minutes through a windshield with that.


The UV index in the southern hemisphere goes a lot higher than anything you experience up in the northern hemisphere. Do yourself a favour and go have a look at the UV index on a hot summer's day in Sydney in January.


For example, today in SW Australia, in late winter/spring, the UV index is 5.

Summer time it sits at 13+ at noon on a clear day.

https://www.bom.gov.au/climate/maps/averages/uv-index/?perio...


At the risk of not picking up on your hyperbole: I thought windshields block UV, and thus you cannot get sunburned through them.


In new vehicles, yes.


The protection factor from that degrades over time / with exposure, too.


This kind of SPF fatalism doesn't really make sense to me. There's absolutely no reason to quantize sun damage into "below burn time" and "above burn time." Damage is dose-dependent. Even burns come in different classes at different exposure durations; and maybe you'd prefer to get, you know, 30 seconds unprotected equivalent of sun damage instead of 3 minutes equivalent, at the same re-application interval.

If someone can make a true SPF 200 economically, it's valid for consumers to prefer that to a true SPF 100 or true SPF 50.


Use Claudia to onboard the first users, gain some traction, and when the inevitable C&D letter comes, change the name.


He may have burned the keys, not the coins. Coins are burned by sending them to an address such as: 1111111111111111111114oLvT2


Airplanes. Some airlines offer 30 minutes of free wifi or something.


I can't remember the last time I encountered this as an obstacle while flying. Don't you usually have to pay then log in from any device you choose?

The bigger problem is how utterly useless and unstable the connection is to begin with.


In the US, Delta and JetBlue offer everyone free WiFi.

