Hacker News: panosv's comments

Everyone knows ping. But over the decades, the networking community has quietly built an entire family of specialized variants — each solving a problem that standard ICMP couldn't. A few examples of why you'd reach for something else:

tcping — when firewalls eat your ICMP and you need to test port availability

arping — L2 diagnostics and duplicate IP detection, no IP stack needed

fping — scan a /24 in seconds, all hosts in parallel

OWAMP — when you actually need one-way latency, not just RTT

dnsping — when the slowness lives in your resolver, not the network

I put together a comparison table of the most useful ones, across protocol, OSI layer, platform, multi-host support, and root requirements. The OSI layer column alone tells you a lot — if you're reaching for ping to debug something that lives at L4 or L7, you're probably using the wrong tool.
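To make the tcping idea concrete, here is a minimal sketch in Python that times a single TCP handshake with the standard library. It is not the tcping tool itself; the function name `tcp_ping` and its defaults are made up for illustration.

```python
import socket
import time


def tcp_ping(host, port, timeout=2.0):
    """Return the TCP connect time in milliseconds, or None on failure.

    Illustrative only: this times one three-way handshake via the
    standard library; it is not how any particular tcping tool works.
    """
    start = time.perf_counter()
    try:
        # create_connection resolves the name and completes the handshake,
        # so for hostnames the measured time includes DNS resolution.
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return None
```

Calling `tcp_ping("example.com", 443)` (a placeholder target) returns a float when the port accepts connections and `None` when it does not, which is the L4 answer that an ICMP echo cannot give you.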


What about Google? Does anyone have any insight into their unit economics, given that they own both the models and the infrastructure (including custom TPUs)? Are they doing better, or are they in the same money-losing business?


It must be hard for them to figure out how much of their revenue is down to AI and how much to other stuff like search. They certainly make a lot of revenue and it would be foolish for them to ignore AI and have OpenAI and Perplexity eat their lunch.


It feels like Google should be able to come up with a revenue figure for search AI results, right? How many people do a search but don't click on any links because they just read the AI blurb, while advertisers are still charged for being visible on the page?


Can someone recommend a dongle that actually works? I’ve tried a few and they are highly unreliable or stop working after a few months.


I've used a Sunweyer dongle from Amazon. If you can't get the "new shopper" discount on Aliexpress, it's cheaper. Seems to work fine. Doesn't like pairing to multiple phones, and the phone doesn't like being plugged into one of the car's USB outlets (it drops the audio because AFAICT the phone thinks it's plugged into the car for audio, but the car is still expecting audio over the Bluetooth "CarPlay" connection), even if it's one of the outlets that's not supposed to do anything but power.

I'd use it a little differently, but it's my wife's car, not mine. Who would have thought a 2022 Mercedes would have wired-only CarPlay?

Anyway, I find it excellent for podcast control. If maps are off (in my case, because location services are turned off) it doesn't really use more power than plain Bluetooth audio, and when I approach my destination on a trip I'll turn on location and plug it in to juice up the last bit.


I have been using a CarlinKit 5.0 from AliExpress for the last year. No issues so far.



fping is most commonly seen as the backend tool under the classic (24 years old now, and still maintained) smokeping

https://www.google.com/search?client=firefox-b-e&q=smokeping...


Smokeping is an amazing and underrated resource as a network health metric and diagnostics tool.

If you have a network monitoring or asset system you can export IP addresses from, you should use a small glue script to automatically build a smokeping configuration. I've got one for our LAN and one for the WAN at each of our sites so I can track down issues at either level.
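A glue script of that kind can be quite small. The sketch below is not my actual script: it assumes the asset system can hand you (name, address) pairs, and the function name and inputs are hypothetical. It renders a fragment in the `+` / `++` section style used under SmokePing's Targets section.

```python
import re


def smokeping_targets(section, hosts):
    """Render a SmokePing Targets fragment for (name, address) pairs.

    `section` and the pairs are hypothetical inputs standing in for
    whatever the monitoring or asset system exports.
    """
    def safe(name):
        # Assumption: SmokePing section names may only contain letters,
        # digits, '-' and '_', so anything else is replaced with '_'.
        return re.sub(r"[^A-Za-z0-9_-]", "_", name)

    lines = [f"+ {safe(section)}", f"menu = {section}", f"title = {section}", ""]
    for name, address in sorted(hosts):
        lines += [
            f"++ {safe(name)}",
            f"menu = {name}",
            f"title = {name} ({address})",
            f"host = {address}",
            "",
        ]
    return "\n".join(lines)


# Documentation-range addresses used as stand-ins for exported data.
print(smokeping_targets("LAN", [("core-sw1.example", "192.0.2.10"),
                                ("ap-lobby.example", "192.0.2.20")]))
```

Writing the output to a file that the main configuration pulls in keeps the hand-maintained part of the configuration separate from the generated part.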

The LAN connection charts are a great daily sanity check, and the WAN connections (I have every-to-every for each site so any and all inter-site issues can be seen) can help keep your ISP honest with the service they're delivering.


mtr is also pretty great!


Lemurian Labs looks like it's doing something similar: https://www.lemurianlabs.com/technology They use the Logarithmic Number System (LNS)


macOS now has a built-in dedicated tool called networkQuality that tries to capture these variables: https://netbeez.net/blog/measure-network-quality-on-macos/

Also take a look at Measurement Swiss Army-Knife (MSAK) https://netbeez.net/blog/msak/


Big fan of flent.org, and this tool, written in Rust, is coming along smartly.

https://github.com/Zoxc/crusader


Not completely relevant, but another long standing bug: An 11 Year Old Bug in the macOS Popen(): https://news.ycombinator.com/item?id=37238433


If this is your bug, consider also sending in feedback with your patch. The open source projects don’t usually take PRs.


We had a discussion with one of Apple's moderators here: https://developer.apple.com/forums/thread/726713

If you have a better way to reach them, lmk.


That's more than "one of Apple's moderators", that's the legendary Quinn (and the _only_ person within Apple I've seen to respond on the dev forums).

Also if you do need to reach someone at apple, I think filing a DTS incident would do that.


There are a handful of Apple engineers who will reply in the forums for their products.


That's about the best thing short of knowing someone on that team, I think.


From the forum discussion, this has been reported as FB12144217.


If you don't mind, which one is your start up?


Self-plug for our Linux for Network Engineers series: https://netbeez.net/blog/category/linux/


this is the better link, no book sale, and on-topic information


How about if you did the same on an 8- or 16-core CPU that can have much more than 16 GB of memory and is not as expensive to move data around its own memory?


Roughly 1000x slower? GPUs nowadays have 5000+ "cores" inside.


That's the point. On the GPU side they use all the 5000+ cores to parallelize the algorithm (they use the hardware to its full potential). On the CPU side they use just one core (at least there is no mention of how many CPU cores were used). It's like saying a Camry beat a Ferrari in maximum speed without mentioning that the Ferrari stayed in first gear for that specific race.


> they use the hardware to its full potential

If only! In fact it's a struggle to utilize a GPU to its full potential because the communication bottleneck makes it infeasible. Compute is fast but data can't get there fast enough.

The authors of this paper were saying the same thing in the promo video; in fact, they were working on making GPUs more efficient. Why would they do that if GPUs are using their "full potential" already?


> Roughly 1000x slower?

Not really. A modern Coffee Lake i7 has several distinct advantages over GPUs. (AMD Ryzen also has similar advantages, but I'm gonna focus on Coffee Lake)

1. AVX2 (256-bit SIMD): for 32-bit ints / floats that's 8 operations per cycle. AVX512 exists (16 operations per cycle) but it is only on server architectures. Also, AVX512 has... issues... with the superscalar execution in point #2 below. So I'm assuming AVX2 / 256-bit SIMD.

2. Superscalar execution: Every Skylake i7 (and Coffee Lake by extension) has THREE AVX ports (Port0, Port1, and Port5). We're now up to 24-operations per cycle in fully optimized code... although Skylake AVX2 can only do 16 Fused-multiply-adds at a time per core.

3. Intel machines run at 4GHz or so, maybe 3GHz for some of the really high core-count models. GPUs only run at 1.6GHz or so. This effectively gives a 2x to 2.5x multiplier.

So realistically, an Intel Coffee Lake core at full speed is roughly equivalent to 32 GPU "cores" (8x from AVX2 SIMD, x2 or x3 from superscalar execution, and x2 from clock speed). If we compare like-with-like, a $1000 Nvidia Titan X (Pascal) has 3584 cores, while a $1000 Intel i9-7900 Skylake has 10 CPU cores (each of which can perform as well as 32 Nvidia cores in fused multiply-add FLOPs).

i9-7900 Skylake is maybe 10x slower than an Nvidia Titan X when both are pushed to their limits. At least, on paper.
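Spelled out, the back-of-the-envelope arithmetic looks like this. All inputs are the rough figures quoted above, taking the conservative x2 for superscalar FMA issue; none of it is a measurement.

```python
# Rough figures from the comment above; estimates, not measurements.
SIMD_LANES = 8        # AVX2: 256 bits / 32-bit floats
FMA_ISSUES = 2        # 16 FMAs per cycle per core / 8 lanes
CLOCK_RATIO = 2       # ~4 GHz CPU vs ~1.6 GHz GPU, rounded down to 2x

gpu_cores_per_cpu_core = SIMD_LANES * FMA_ISSUES * CLOCK_RATIO

CPU_CORES = 10        # i9-7900
GPU_CORES = 3584      # Titan X (Pascal)

cpu_in_gpu_cores = CPU_CORES * gpu_cores_per_cpu_core
gpu_advantage = GPU_CORES / cpu_in_gpu_cores

print(f"1 CPU core ~ {gpu_cores_per_cpu_core} GPU cores")      # 32
print(f"whole CPU ~ {cpu_in_gpu_cores} GPU cores")             # 320
print(f"GPU advantage on paper: {gpu_advantage:.1f}x")         # 11.2x
```

So the paper gap is about one order of magnitude, nowhere near three.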

And remember: CPUs can "act" like a GPU by using SIMD instructions such as AVX2. GPUs cannot act like a CPU with regards to latency-bound tasks. So the CPU / GPU split is way closer than what most people would expect.

-------------

A major advantage GPUs have is their "Shared" memory (in CUDA) or "LDS" memory (in OpenCL). CPUs have a rough equivalent in L1 Cache, but GPUs also have L1 cache to work with. Based on what I've seen, GPU "cores" can all access Shared / LDS memory every clock (if optimized perfectly: perfectly coalesced accesses across memory-channels and whatever. Not easy to do, but it's possible).

But Intel Cores can only do ~2 accesses per clock to their L1 cache.

GPUs can execute atomic operations on the Shared / LDS memory extremely efficiently. So coordination and synchronization of "threads", as well as memory-movements to-and-from this shared region is significantly faster than anything the CPU can hope to accomplish.

A second major advantage is that GPUs often use GDDR5 or GDDR5x (or even HBM), which is superior main-memory. The Titan X has 480 GB/s (that's "big" B, bytes) of main memory bandwidth.

A quad-channel i9-7900 Skylake will only get ~82 GB/second when equipped with 4x DDR4-3200MHz ram.
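For reference, theoretical peak DRAM bandwidth is channels x transfer rate x bus width. The sketch below assumes the standard 64-bit (8-byte) data bus per DDR4 channel. It puts 4x DDR4-3200 at 102.4 GB/s, so the ~82 GB/s above is closer to a lower memory speed or a sustained figure than to the theoretical peak. The GPU keeps a large lead either way.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes).

    Assumes a 64-bit (8-byte) data bus per channel, as on DDR4 DIMMs.
    """
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


quad_3200 = peak_bandwidth_gb_s(4, 3200)
quad_2666 = peak_bandwidth_gb_s(4, 2666)
TITAN_X_GB_S = 480.0  # figure quoted in the comment above

print(f"4x DDR4-3200: {quad_3200:.1f} GB/s")
print(f"4x DDR4-2666: {quad_2666:.1f} GB/s")
print(f"Titan X lead over 4x DDR4-3200: {TITAN_X_GB_S / quad_3200:.1f}x")
```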

GPUs have a memory-advantage that CPUs cannot hope to beat. And IMO, that's where their major practicality lies. The GPU architecture has a way harder memory model to program for, but it's way more efficient to execute.


Very good analysis, and a correct conclusion that memory bandwidth is the bottleneck (at least for matrix fused multiply-add intensive workloads, like feedforward NNs and convnets). We have done experiments on the 1080Ti (484 GB/s) and for 32-bit FP training (convnets on TensorFlow), it is close in performance to the P100 (717 GB/s).

The other point to add is that SIMD operation for GPUs is what gives them efficient batched reads from GPU memory for each operation.


Thanks.

I can't say I'm an expert yet. But the more and more I read about highly optimized code on any platform, the more and more I realize that 90% of the problem is dealing with memory.

Virtually every optimization guide or highly-optimized code tutorial spends an enormous amount of time discussing memory problems. It seems like memory bandwidth is the singular thing that HPC coders think about the most.


It's worth noting that this GPU RAM advantage is usually coupled with a PCIe bus disadvantage, which means that you need to be able to hold a complete working set of data in the GPU long enough to really benefit from the extra bandwidth and horsepower.

If you don't have enough computations-per-byte to perform on the GPU, you will find your total job time starts to be dominated by the time it takes to stage data in and out of the GPU, without being able to keep the GPU cores busy. Even if the CPU is 5-10x slower according to issue rates and RAM bandwidth, it can keep calculating steadily with a higher duty cycle since system RAM can be much larger.
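A toy model makes the "computations-per-byte" threshold concrete. It assumes transfer and compute do not overlap and that the CPU already has the data in system RAM. The numbers in the usage lines (16 GB/s for the bus, roughly PCIe 3.0 x16, and a 10x gap in compute rate) are illustrative assumptions, not measurements.

```python
def gpu_job_time(bytes_moved, flops, pcie_gb_s, gpu_gflops):
    """Seconds for a job whose data is staged over PCIe, then computed.

    Simplification: transfer and compute do not overlap.
    """
    return bytes_moved / (pcie_gb_s * 1e9) + flops / (gpu_gflops * 1e9)


def cpu_job_time(flops, cpu_gflops):
    """Seconds for the same job when the data already sits in system RAM."""
    return flops / (cpu_gflops * 1e9)


def break_even_flops_per_byte(pcie_gb_s, gpu_gflops, cpu_gflops):
    """FLOPs per transferred byte above which the GPU wins in this model.

    Solves B/P + I*B/G = I*B/C for the intensity I; needs G > C.
    """
    return (1.0 / pcie_gb_s) / (1.0 / cpu_gflops - 1.0 / gpu_gflops)


# Assumed figures: 16 GB/s bus, 10 TFLOP/s GPU, 1 TFLOP/s CPU.
intensity = break_even_flops_per_byte(16, 10_000, 1_000)
print(f"break-even: {intensity:.1f} FLOPs per byte moved")
```

Under these assumptions the GPU only pulls ahead once each byte shipped across the bus is reused for roughly 70 floating-point operations, which is why keeping the working set resident on the GPU matters so much.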

However, the CPU also benefits from locality, so you should still prefer to structure your work into block-decomposed work units if possible. A decomposition which allows you to work through a large problem as a series of sub-problems sized for a modest GPU RAM area will also let the sub-problem rise higher in the CPU cache hierarchy to get more effective throughput. However, if the decomposition adds too much sequential overhead for marshalling or final reduction of results, it may not help versus a monolithic algorithm with reasonably good vectorization/streaming access to the full data.
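As a structural sketch only (pure Python will not show the cache effect), a block decomposition with a final reduction looks like the following. The function names are hypothetical, and the block size stands in for whatever fits the GPU RAM area or CPU cache level being targeted.

```python
def blocks(n_items, block_items):
    """Yield (start, stop) index ranges that cover n_items in order."""
    for start in range(0, n_items, block_items):
        yield start, min(start + block_items, n_items)


def blocked_sum_of_squares(data, block_items):
    """Process data one block at a time, then reduce the partial results."""
    partials = []
    for start, stop in blocks(len(data), block_items):
        # Each block is an independent sub-problem with a small working set.
        partials.append(sum(x * x for x in data[start:stop]))
    # Final reduction: this is the sequential overhead to keep small.
    return sum(partials)


print(blocked_sum_of_squares(list(range(10)), block_items=4))  # 285
```

The same skeleton applies whether each block is shipped to a GPU or simply sized to stay in cache; only the block size and the per-block kernel change.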


The i9-7900 seems like a rather strange CPU to compare to a video card. Why not an Intel Xeon with 50% more memory bandwidth? Or an AMD Epyc with 100% more bandwidth? Not hard to get 2-3x the cores (in a single socket) and double the bandwidth/cores with a dual socket.

That way you get pretty good memory bandwidth, can directly access much more ram (1TB easy), and you can run a wide variety of codes (not just GPU codes).

Sure the Titan X is great if your code A) doesn't communicate B) fits entirely in system memory and C) runs on CUDA. Of course the real world often intrudes with PCI-e latency and memory limitations.

Not saying GPUs don't have their place, but it's easy to overstate their usefulness.


I picked two $1000 components from memory. I recognize that there are other choices out there, but $1000 is a nice round number and I honestly don't know the market any more to pick another price point.

If you know the name of a Xeon Skylake-server, and its memory capacity, that is roughly $1000 (and therefore comparable to a Titan X in MSRP cost), you are welcome to rerun the analysis yourself.

I can't do that because I don't know the capabilities of the Xeon Skylake servers from memory, nor their prices. And I'm certainly not going to spend 30 minutes googling this information for other people's sake.

What I will say is that the i9-7900x is a Skylake-server part with AVX512 support and Quad-channel memory. That's way stronger than a typical desktop. And I think assuming Quad-Channel 4xDDR4-3200MHz is pretty fair, all else considered.


Both chips have similar (within an order of magnitude) die areas, frequencies, power dissipations, and external pin bandwidth.

If the GPU were truly 1000x more efficient than the CPU, then the CPU vendor could just take 1/1000th of a GPU and squeeze it onto their own chip to double their performance.

(In a sense the trend since the late 90's has been to do exactly this via vector extensions.)


That's wrong by orders of magnitude. The actual speedup of GPUs is about 8x. Those GPU cores are much weaker than CPU cores.

The paper in discussion here reports 10x speedup for GPU vs CPU.

