I love OneDrive. I don't know why everyone hates it. I save things in OneDrive, and they are backed up to the cloud and synced with the other computers I use.
Some computers are shared by six family members, each with their own MS account and 1TB of storage. OneDrive makes it so the computers can have a 1TB hard drive but still give everyone ready access to their files by not storing them all on the hard drive.
If I don't want something synced to OneDrive, I usually save it in my Downloads folder or another separate folder I create outside OneDrive. I've never had any problems.
The comments by people who hate it make me wonder why my experience is so different from theirs.
> If I don't want something synced to OneDrive then I usually save it in my downloads folder or another separate folder
My complaint is that MS has intentionally made it SO DIFFICULT to do that. There's a significant amount of extra clicking and thinking necessary to go through the steps to do it.
There is? Not at all. Whatever I save from my browser (Edge/Firefox/Chromium) is automatically saved to Downloads, and Downloads is not automatically synced with OneDrive Backup. Documents, Music, Videos, Desktop, etc. are backed up, but Downloads is not, AFAIK.
That's true. I wasn't clear - I was thinking mainly of the Office products: Word, Excel, etc. They've got a whole different Save As dialog that goes out of its way to funnel you into OneDrive. So it interrupts my flow precisely at the moments when I need my brainpower for the work itself.
Additionally, if you don't save to OneDrive, AutoSave is disabled.
Autosave worked decades ago before we even had cloud storage, but apparently in 2026 we just can't have feature parity with Word 97 without cloud storage.
MS actually made it harder to save stuff in OneDrive by basically hardcoding the folders it will back up, unlike most other similar services that let you sync arbitrary folders on any drive.
Windows Explorer and the save dialog make it pretty easy to save a file somewhere else. Also, there is just one folder that syncs with OneDrive, but many more that don't.
I haven't had too many issues with it that I couldn't resolve myself, but I understood what it was and what it was doing on my system. I opted in. I wonder if most of the people who don't like OneDrive didn't know it was enabled.
> OneDrive makes it so the computers can have a 1TB hard drive
No, it's a 1TB storage account accessible via the Internet, and wholly dependent on a good Internet connection, especially if you actually use most of that 1TB. Non-tech people will take misrepresentations like this at face value, which further makes tech a disempowering force.
The "Files-On-Demand" feature of OneDrive makes it possible for everyone in the family to login into any computer with OneDrive and quickly access all their files (I have 500/500 fiber). Because some/most of the files are stored on the cloud and only downloaded when they are accessed, the hard drive of the computer can be smaller (1TB) than the total sum of all the files stored by all the users (6TB). It's very nice. I can also turn off Files-on-Demand and everything will be stored locally on my computer. It sounds like many people are bothered that Files-on-Demand is enabled by default.
The two main frustrations that I've seen encountered are:
1) Microsoft aggressively attempting to convince you to use OneDrive, no matter how many times you turn it off.
I've had a Dropbox subscription since before OneDrive became a common name, and it works for me, so I have no real use case for OneDrive. That doesn't stop Microsoft forcibly "helpfully" re-enabling the OneDrive app and embedded link in the quick access bar regularly...which leads to
2) Microsoft attempting to sync your user profile in OneDrive, and bugs that arise from how it implements that.
I've never enabled this, so I haven't dug into how it works, but the first time I ever encountered OneDrive discussion in tech or adjacent circles was people complaining about OneDrive syncing of user profile folders breaking some games.
I assume it's something like the comedy of errors that could come out of folder redirection and software not expecting multiple people touching it at once, or the comedy of conflict resolution at a filesystem layer whose semantics weren't designed for that, but I have heard more complaints about OneDrive in this context than I've heard anything else about it.
So I suspect that it works fine, if you use it as a Dropbox-alike.
Using it as a Folder Redirection/Roaming Profiles replacement, or trying to say "no" to Microsoft, is where the problems ensue.
The problem is Microsoft shoving it down our throats and making saving to local FS as cumbersome as possible.
I wouldn't have minded OneDrive, in fact I would have used it a lot more, if it just showed up as an external mounted drive in Explorer where I could just paste files and folders I want shared or backed up. But nope, Microsoft in their infinite wisdom just have to sync up my entire home directory by default and have Office/365 only save docs and sheets to the cloud by default. No thank you.
Agreed. I'd even be more inclined to use the web versions (which are inherently tethered to OneDrive) on Linux if the boundaries were better respected... I want an explicit spot backed up, maybe with the option to sync other directories. I don't want to be tricked or forced into it... or see a Windows update change the configuration on me.
My wife got into trouble whilst unknowingly using OneDrive. She had a lot of photos on her laptop and they were being uploaded to OneDrive until it hit a limit (1TB?) and then started nagging her to upgrade to more storage. I had to disable OneDrive and move the photos back to her computer to resolve it.
Wholeheartedly agreed. Also the OneDrive Backup feature is great - previously people had to rely on other services (Box, Dropbox) and remember to save stuff into those folders. Now your most important Documents folder is saved in the cloud. Great. Backup! I don't think that OneDrive pestering you about buying more storage after using up your free storage is a bad thing, somebody needs to pay for stuff.
I understand the point that everything is a bit convoluted and badly explained and may even lead to bad stuff happening. When you disable OneDrive Backup (a good feature) and OneDrive deletes all your files locally, leaving a little shortcut to OneDrive in the cloud with all your files? Yeah... that is bad practice, but an easy fix for MS. Besides that hiccup I currently don't understand what the fuss is about.
I prefer the Dropbox solution, and tend to configure OneDrive, Google Drive, etc the same... I like to be explicit about it... I don't want all my general docs sync'd, for that matter, most of my projects are in a git repo anyway.
I know my Documents/Pictures directory isn't sync'd, I don't want it to be... to me my workspace is far different than what I want backed up to the cloud... I also have a local nas that is also setup for cloud sync for my service accounts. I emphatically do NOT want system default workspace directories sync'd.
How does this work with multiple PCs? Does it just merge all files into the same Documents folder? What if apps are saving app data to these folders, and you have the same app on multiple computers?
All your multiple PCs have the same Documents folder. Files created on one PC are synced to the cloud and appear in all the other PCs' Documents folders, and will be downloaded to the local storage if you try to access them. You get a small icon next to each file or folder to try and tell you if the files are local or in the cloud or whatever status. If apps are saving data to synced folders (eg. all those many many games happily polluting my Documents folder), then that same data is available to the same app on different computers. Could be good, or bad if the apps are being used on different computers at the same time with no real way of determining which PC's changes win for which particular file.
> All your multiple PCs have the same Documents folder. Files created on one PC are synced to the cloud and appear in all the other PCs' Documents folders
That sounds horrible.
> Could be good, or bad
It sounds bad either way. If the app exists on both computers there's good chance of conflicting overwrites of whatever is saved there. If the app is not used on a device then it's just a waste of downloading and syncing useless data.
They should just give each computer its own subfolder in the shared Documents folder AND prompt you about this before starting, so you can turn off this or that folder before anything gets uploaded or downloaded. It's user-hostile design currently.
It doesn't matter what I think, and you don't have to understand. Just don't turn it on by default, and if you do, make it safe and easy to turn it off.
Tooling and workflows that make sense on a centrally-administered domain do not belong on my home computer.
Considering I've had multiple family members and friends tricked into syncing to OneDrive without meaning to, I'd say that's a big reason. My Dad's old neighbor lost the ability to send and receive email after a Windows update did that to her: it filled up her free Microsoft account, then Hotmail stopped working, so she couldn't use email anymore and had no idea why.
You clearly understand what’s happening better than most people. As someone who came back to Windows recently after 15 years away, the lengths to which the UI goes to hide the actual location of files and prevent you from directly addressing the filesystem is incredible. Thankfully I don’t have to use Windows for anything important. I would never recommend it to anyone else. (Not that the alternatives are much better.)
What’s mainly wrong with OneDrive is that it doesn’t work how most people expect, it’s on by default, and it deletes files from your local PC without asking. No matter how nice it is if you understand what’s going on, those details are enough to make it hate-worthy IMO.
It's about consent and respect for the user. If you build something awesome, you don't have to shove it down peoples' throats.
When more people Google "How to disable xyz" than "How to enable xyz," that would be a strong hint to most of us, but it doesn't mean anything to Microsoft's developers. ("Hey, at least they're engaging with the product," they tell each other.)
When I see comments saying that OneDrive is confusing and that people are having problems with it, I always wonder if the "hacker" crowd is really the "hacker" type they want to portray themselves as, or just inexperienced computer users bitching about Microsoft for internet points.
OneDrive is an easy and well-integrated cloud drive. The principle of having a cloud drive is not new anymore, and I believe people should get over the fact that, indeed, the files on the cloud drive are... in the cloud; colour me surprised!
I'm not a developer, but I consider myself relatively tech literate. That said, I really like Windows 11 and many other MS products such as M365, Edge, the new Outlook, and OneDrive; even Teams is tolerable (maybe not as polished as Zoom). Copilot is still struggling to find relevance, but I just chalk that up to MS pivoting rapidly to figure out how to make it useful (I currently use the enterprise version solely to draft emails, which is a huge time saver even if it is a bit pricey at $30/mo).
The complaints I see on here so frequently just don't register with me. For example, I never see any ads. Maybe there is a setting I turned off long ago, but I cannot recall ever seeing an ad. The setting must be persistent with my Windows backups because no ads show when I reinstall Windows and restore the settings and configuration from a backup.
Things are not always perfect, but the issues I have encountered are relatively minor all things considered.
> Additionally, every colleague, counterparty, outside-counsel, and client a lawyer ever works with uses docx. To introduce a new format into this ecosystem would introduce friction into every single interaction.
As an attorney, this is what kept me from switching to LibreOffice or Google Docs. I gave it a shot, but since the other attorneys I work with (both in and outside the US) and my clients all use Word, I ended up wasting a lot of time fixing files after converting between formats. In the end, it just wasn’t worth it.
I’m fairly tech-savvy, but many of my coworkers struggle with the mental effort required to switch to new software. Two colleagues I greatly respect still use WordPerfect and Word 2003 because they dislike change so much. It's too much of a lift for these people to wholesale switch word processors.
Assistive access mode for an iPhone is fantastic for the elderly. It's the only way my 85-year-old father can even use a phone. One of the best features is that it can be set to allow incoming calls only from people in his contacts. It's such a lifesaver.
What if it's police or the hospital trying to call?
I suppose it'd be "simple" for the device to answer the call and then prompt for a password before ringing for the user, but then the random caller needs to know the password. But then again, it could be as simple as "This is Siri, please say the name of the person you're trying to reach.", and since spammers usually don't have a name associated with the number they just randomly dialled, they'll be stumped.
I strongly suspect the people who throw all these roadblocks if an unknown number calls them don't have kids, elderly relatives, various medical visits, etc. Personally, I'm not going to make myself hard to reach because of the cost of a spam call now and then.
If it works for you fine. But understand that different people have different needs.
I used to feel the same, but nowadays I have very few normal calls, mostly from close relatives only, but a dozen spam calls a week. And that wasn't the case before 2024. I know robocalls are/were a terrible affliction in the US for like the last 10 years, but there were very few of these in Hungary.
I wish I knew what caused this change...
It's hard to say why people's experiences differ so much. In the US I get maybe 3 a week or so. I would probably be more aggressive about adding businesses and people to my contact list if I were more aggressive about blocking possible SPAM. For me, SPAM is one of life's lesser annoyances, but I should probably check out some of the prevention features Apple has built into iOS.
iOS 26 can sort-of do what you suggest in your second paragraph. The device answers the call, asks who is calling and the purpose of the call, then (if the caller doesn’t hang up) presents the answers textually in the call-answering UI to allow you to decide whether to answer or not.
At the moment, most spam callers just hang up so I never even get to the choice to answer, although I do wonder if that’ll change if the feature gets popular.
The solution proposed by Kagi—separate the search index from the rest of Google—seems to make the most sense. Kagi explains it more here: https://blog.kagi.com/dawn-new-era-search
Google has two interlocked monopolies: one is the search index and the other is their advertising service. We often joked that if Google priced access to their index on reasonable and non-discriminatory terms, both to themselves and to others, AND allowed anyone to put whatever ads they wanted on those results, that would change the landscape dramatically.
Google would carve out their crawler/indexer/ranker business and sell access to themselves and others which would allow that business an income that did NOT go back to the parent company (had to be disbursed inside as capex or opex for the business).
Then front ends would have a good shot, DDG for example could front the index with the value proposition of privacy. Someone else could front the index with a value proposition of no-ads ever. A third party might front that index attuned to specific use cases like literature search.
I.e. knowing which users clicked which search results.
Without the click stream, one cannot build or even maintain a good ranker. With a larger click stream from more users, one can make a better ranker, which in turn makes the service better so more users use it.
End result: monopoly.
The only solution is to force all players to share click stream data with all others.
Click stream is useful, without a doubt. It isn't essential. We had already started the process at Blekko of moving to alternate ways for ranking the index.
That said, if you run the frontend as proposed, you get to collect the clicks. That gives you the click stream you want. If the index returns you a SERP with unwrapped links (which it should if it was unbundled from a given search front end), then you could develop analytics around what your particular customers "like" in their links and have a different ranking than perhaps some other front end. One thing that Blekko made really clear for me was that, contrary to the Google idea that there is always a "best" result for a query (aka the I'm Feeling Lucky link), there are often different shades of intent behind the query that aren't part of the query itself. Google felt they could get it in the first 10 links (back before the first 10 links were sponsored content :-)) and often on the page you could see the two or three inferred "intents" (shopping, information, entertainment were common).
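To make the re-ranking idea concrete, here's a purely illustrative Python sketch (not Blekko's or anyone's real system): the shared index hands back a SERP, and the front end nudges the order using the click-through rates its own users have generated, so two front ends on the same index would diverge as their users' clicks diverge.

```python
from collections import defaultdict

clicks = defaultdict(int)       # (query, url) -> clicks seen by this front end
impressions = defaultdict(int)  # (query, url) -> times the url was shown

def record_impression(query, url):
    impressions[(query, url)] += 1

def record_click(query, url):
    clicks[(query, url)] += 1

def rerank(query, serp, prior=0.1):
    """Blend the index's original order with this front end's own CTR signal."""
    def score(rank_url):
        rank, url = rank_url
        ctr = (clicks[(query, url)] + prior) / (impressions[(query, url)] + 1.0)
        return ctr / (rank + 1)  # smoothed CTR, damped by the original rank
    return [url for _, url in sorted(enumerate(serp), key=score, reverse=True)]

# Example: a single observed click on the second result is enough to promote it.
serp = ["https://a.example", "https://b.example", "https://c.example"]
record_impression("best laptop", "https://b.example")
record_click("best laptop", "https://b.example")
print(rerank("best laptop", serp))  # b.example moves to the top
```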
I don't think that's quite true, as competitors like Kagi have been able to compete well with effectively zero clickstream (by comparison). It'll help, but it's not the make-or-break that the index is.
I think a click stream isn't necessary, but Kagi is not a good basis for the argument in my opinion.
Kagi is primarily a meta search engine. The click stream exists on their sources (Bing, Google, Yandex, Marginalia, not sure if they use Brave). They do have Teclis, which is their own index that they use, and their own systems for reordering the page of results, such as downranking ad-heavy pages and adjusting based upon user preferences (which I love).
Kagi sends searches to other providers (Bing?) and then simply re-ranks the results, so they're effectively inheriting the click stream data of those other providers.
> Google has two interlocked monopolies, one is the search index
The index is the farthest thing from a monopoly Google has - anyone can recreate it. Heck, you can even just download Commoncrawl to get a massive head start.
I see it a bit differently: many (most?) web sites explicitly deny scraping except for Google. Further, Google has the infrastructure to crawl several trillion web pages and create a relevant index out of the most authoritative 1.5 trillion. To re-create that on your own, you would need both the web to allow it and the infrastructure to do it. I would agree that this isn't an insurmountable moat, but it is a good one.
Most websites only explicitly deny scraping by bad bots (robots.txt). Things like Cloudflare are a completely different matter, and I have a whole batch of opinions about how they are destroying the web.
I'd love to compete directly with OpenAI, but the cost of a half million GPUs is a me problem - not a them problem. Google can't be faulted for figuring out how to crawl the web in an economically viable way.
Then why do we see all of these alt search engines and SEO services building out independent indexes? Why don't the competitors cooperate in this fashion already?
Because everyone worships Thiel's "competition is for losers" and dreams of being a monopoly. Monopolies being the logical outcome of a deregulated environment, for which these companies lobby.
Throughout history there are very few monopolies, and they don't normally last that long; that is, unless they are granted special privileges by the government.
Concentration is the default in an unregulated environment. Sure, pure monopolies with 100% market control are rare, but concentration is rampant: a handful of companies dominating tech, airlines, banks, media.
Concentration seems much more prevalent in heavily regulated markets, e.g. utilities/airlines. In many cases regulators have even encouraged this, e.g. finance.
There is no default for unregulated markets. It's a question of whether the economies of scale outweigh the added costs from the complexity that scale requires. It costs close to 100x as much to build 100 houses, run 100 restaurants, or operate 100 trucks as it does to do 1. That's why these industries are not very concentrated. Whereas it costs nowhere close to 100x for a software or financial services company to serve 100x the customers, so software and finance are very concentrated.
The effect of regulation is typically to increase concentration because the cost of compliance actually tends to scale very well. So businesses that grow face a decreasing regulatory compliance cost as a percent of revenue.
You are comparing apples and oranges. You just can't compare the barrier of entry for a software business and an airline, even without any regulations. It's just orders of magnitude more expensive to buy an airplane than a laptop, and most utilities are natural monopolies, so they behave fundamentally differently.
I can't and I didn't. I never said anything about barriers to entry. I'm talking about concentration here and why the market is dominated by airlines with hundreds of planes instead of airlines with 10 planes. Barriers to entry are inevitable in capital intensive industries.
Home building is interesting because I think a major blocker to monopoly-forming is the vastly heterogenous and complicated regulatory landscape, with building codes varying wildly from place to place. So you get a bunch of locally-specialized builders.
Regulation can increase concentration in a high corruption/cronyism environment: regulatory capture and regulatory moats. There is plenty of that happening.
In building, I think we have local-concentration, due to both regulatory heterogeneity and then local cronyism - Bob has decades of connections to the city and gets permits easily, whereas Bob’s competitor Steve is stuck in a loop of rejection due to a never ending list of pesky reasons.
Concentration is not monopoly, and furthermore your comment does not begin to address the critical part of the parent's comment: “does not last very long”.
Inequality at a point in time, and over time, is not nearly as bad if the winners keep rotating.
> unless they get are granted special privileges by the government
That's what all the lobbyists are for.
None of the people or organisations that advocate for "free markets" or competition actually want free markets and competition. It's a smoke screen so they can keep buying politicians to get their special privileges.
They always inevitably end up being given special privileges.
Because, contrary to what we would all like to believe, once a company becomes large we don't want them to go under, even if they're not optimal.
There's a huge amount of jobs, institutional knowledge, processes, capital, etc in these big monopolies. Like if Boeing just went under today, how long would it take for another company to re-figure out how to make airplanes? I mean, take a look at NASA. We went to the moon, but can we do it again? It would be very difficult. Because so many engineers retired and IP was allowed to just... rot.
It's a balancing act. Obviously we want to keep the market as free as possible and yadda yadda invisible hand. But we also have national security to consider, and practicality.
This sounds like a solution contrived to advantage companies that want access to this data rather than an actual economically valid business model. If building an index and selling access to it is a viable business, then why isn't someone doing it already? There's minimal barrier to entry. Blekko has an index. Are you selling access to it for profit?
This just in: small search engine company thinks it's a great idea for small search engine companies to have the same search index as Google.
Also, I love this bit: "[Google's] search results are of the best quality among its advertising-driven peers." I can just feel the breath of the guy who jumped in to say "wait, you can't just admit that Google's results are better than Kagi's! You need to add some sorta qualifier there that doesn't apply to us."
On every Kagi comment, there is “Have you used Kagi recently? It’s improved a lot!” — to the level that I suspect they have bots to upgrade the brand image, or at least to search for which comments to respond to.
I’m saying that because yes, I’ve used Kagi recently, and I switch back to Google every single time because Kagi can’t find anything. Kagi is to Google what Siri is to ChatGPT. Siri can’t even answer “What time is it?”
Maybe you see different comments than I do, but I don't see many comments saying it's improved a lot lately.
As a Kagi user, I would not say it's improved a lot lately. It's a consistent, specific product for what I need. I like the privacy aspects of it, and the control to block, raise or lower sites in my search results. If that's not something you care about then don't use it.
Is it better than Google at finding things? I don't think so, but then, Google is trash these days too
The GP of your comment is literally saying that Kagi is better than Google as of late. You're not helping the "Kagi doesn't use bots" case by ignoring the context 2 comments up.
They said Kagi works "way better" than Google, not that Kagi is better as of late (although they do ask if they've tried Kagi lately). Which is consistent with my statement that Kagi is a consistent product and not really improving. They keep adding AI features, but I disable those and don't care about them.
You're welcome to check my post history, I'm certainly not a bot. Or if I am, I'm a very convincing one that runs an astrophotography blog.
> I suspect they have bots to upgrade the brand image
I disagree with the conclusion but I agree with the premise. Man is a rationalizing animal, and one way to validate one’s choice in paying for a search engine (whether it is better or not) is to get others to use it as well. Kagi is also good at PR, they were able to spin a hostile metering plan as a lenient subscription plan.
Word of mouth is often more prevalent than we think, and certainly more powerful than botting. I would not be shocked if the author of that “AirBnBs are blackhats” article was interacting with real users of Craigslist spurred on by some referral scheme.
> one way to validate one’s choice in paying for a search engine (whether it is better or not) is to get others to use it as well.
It's not so much validating, but I'm hoping they grow so I can keep using their service. It would suck for them to close shop because they never got popular enough to be sustainable.
> On every Kagi comment, there is “Have you used Kagi recently? It’s improved a lot!” — to the level that I suspect they have bots to upgrade the brand image
Odd to dismiss a point purely because it's consistently made, especially without much apparent disagreement. Perhaps more likely: there are just _many_ happy Kagi customers in the HN community.
As one data point: I use Kagi, and agree with GP, and I am not a bot (activity of this HN account predates existence of Kagi by many years).
That doesn't dismiss your experience of course, lots of people use search engines in different ways! Personally, I found the ads & other crap of Google drowned out results, and I frequently hit SEO spam etc where site reranking was helpful. I'm sure there's scenarios where that doesn't make sense though, it's not for everybody (not everybody can justify paying for search, just for starters).
It's taken as a given that Siri is inferior to ChatGPT. Both are natural language call-and-response models, but one of them is constantly in the news for diagnosing patients more accurately than actual medical doctors [1] and identifying a picture's location by the species of grass shown in a fifty-pixel-wide clump in the corner, and the other one can turn off your lights and order you a pizza when you ask it what tomorrow's weather forecast is.
Ergo, a person of average scholastic aptitude who is neither trying to ape late night talk show hosts by taking half of each single-colon-pair of an analogy, severing the other pair ends and any remaining context, and repeating the result with a well-rehearsed look of confusion; nor defensive about being called out for doing just that, can readily infer that the message being transmitted is that Kagi is fundamentally a tool very similar to Google, but which delivers inferior results.
Crawling the web is costly. I assume it's cheaper to use the results from someone else's crawling. I don't know what Kagi is using to argue that they should have access to Google's indexes, but I'd guess it's some form of anti trust.
Let me add more: crawling the web is costly for EVERYONE.
The more crawlers out there, the more useless traffic is being served by every single website on the internet.
In an ideal world, there would be a single authoritative index, just as we have with web domains, and all players would cooperate into building, maintaining and improving it, so websites would not need to be constantly hammered by thousands of crawlers everyday.
Yeah not that cheap. There's a few articles on HN now about small, independent websites being essentially DDOS'd by crawlers. Although, to be fair, mostly AI crawlers.
Kagi is just a meta search engine. They are already using Google's search index. They just find it too expensive. Guess they need to show ads to pay for the searches.
Crawling the internet is a natural monopoly. Nobody wants an endless stream of bots crawling their site, so googlebot wins because they’re the dominant search engine.
It makes sense to break that out so everyone has access to the same dataset at FRAND pricing.
My heart just wants Google to burn to the ground, but my brain says this is the more reasonable approach.
This is similar to the natural monopoly of root DNS servers (managed as a public good). There is no reason more money couldn't go into either Common Crawl, or something like it. The Internet Archive can persist the data for ~$2/GB in perpetuity (although storing it elsewhere is also fine imho) as the storage system of last resort. How you provide access to this data is, I argue, similar to how access to science datasets is provided by custodian institutions (examples would be NOAA, CERN, etc).
Build foundations on public goods, very broadly speaking (think OSI model, but for entire systems). This helps society avoid the grasp of Big Tech and their endless desire to build moats for value capture.
The problem with this is in the vein of `Requires immediate total cooperation from everybody at once` if it's going to replace googlebot. Everyone who only allows googlebot would need to change and allow ccbot instead.
It's already the case that googlebot is the common denominator bot that's allowed everywhere, ccbot not so much.
Wouldn’t a decent solution, if some action happened where Google was divesting the crawler stuff, be to just do like browser user agents have always done (in that case multiple times to comical degrees)? Something like ‘Googlebot/3.1 (successor, CommonCrawl 1.0)’
Lots of good replies to your comment already. I'd also offer up Cloudflare offering the option to crawl customer origins, with them shipping the compressed archives off to Common Crawl for storage. This gives site admins and owners control over the crawling, and reduces unnecessary load as someone like Cloudflare can manage the crawler worker queue and network shipping internally.
Wait, is the suggestion here just about crawling and storing the data? That's a very different thing than "Google's search index"... And yeah, I would agree that it is undifferentiated.
Hosting costs are so minimal today that I don't think crawling is a natural monopoly. How much would it really cost a site to be crawled by 100 search engines?
A potentially shocking amount depending on the desired freshness if the bot isn’t custom tailored per site. I worked at a job posting site and Googlebot would nearly take down our search infrastructure because it crawled jobs via searching rather than the index.
Bots are typically tuned to work with generic sites over crawling efficiently.
No, in our case they were indexing job posts by sending search requests. Ie instead of pulling down the JSON files of jobs, they would search for them by sending stuff like “New York City, New York software engineer” to our search. Generally not cached because the searches weren’t something humans would search for (they’d use the location drop down).
I didn’t work on search, but yeah, something like Elasticsearch. Googlebot was a majority of our search traffic at times.
> Vault offers a low-cost pricing model based on a one-time price per-gigabyte/terabyte for data deposited in the system, with no additional annual storage fees or data egress costs.
What's the read throughput to get the data back out, and does it scale to what you'd need to have N search indexes building on top of this shared crawl?
Of all the bad ideas I've heard of where to slice Google to break it up, this... is actually the best idea.
The indexer, without direct Google influence, is primarily incentivized to play nice with site administrators. This gives them reasons to improve consideration of both network integrity and privacy concerns (though Google has generally been good about these things, I think the damage is done regarding privacy that the brand name is toxic, regardless of the behaviors).
A caching proxy costs you almost nothing and will serve thousands of requests per second on ancient hardware. Actually there's never been a better time in the history of the Internet to have competing search engines since there's never been so much abundance of performance, bandwidth, and software available at historic low prices or for free.
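To illustrate just how little machinery that takes, here's a toy in-memory caching reverse proxy in Python. It's only a sketch: the upstream address and TTL are made up, there's no cache-control handling, and a real deployment would reach for nginx or Varnish instead.

```python
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

UPSTREAM = "http://127.0.0.1:8080"  # hypothetical origin server
TTL = 300                           # keep cached responses for 5 minutes
_cache = {}                         # path -> (expires_at, status, body)

class CachingProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        entry = _cache.get(self.path)
        if entry and entry[0] > time.time():
            _, status, body = entry             # served from RAM
        else:
            try:
                with urllib.request.urlopen(UPSTREAM + self.path) as resp:
                    status, body = resp.status, resp.read()
            except urllib.error.HTTPError as e:
                status, body = e.code, e.read()
            _cache[self.path] = (time.time() + TTL, status, body)
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    ThreadingHTTPServer(("", 8000), CachingProxy).serve_forever()
```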
There are so many other bots/scrapers out there that literally return zero that I don’t blame site owners for blocking all bots except googlebot.
Would it be nice if they also allowed altruist-bot or common-crawler-bot? Maybe, but that’s their call and a lot of them have made it on a rational basis.
> that I don’t blame site owners for blocking all bots except googlebot.
I doubt this is happening outside of a few small hobbyist websites where crawler traffic looks significant relative to human traffic. Even among those, it’s so common to move to static hosting with essentially zero cost and/or sign up for free tiers of CDNs that it’s just not worth it outside of edge cases like trying to host public-facing Gitlab instances with large projects.
Even then, the ROI on setting up proper caching and rate limiting far outweighs the ROI on trying to play whack-a-mole with non-Google bots.
Even if someone did go to all the lengths to try to block the majority of bots, I have a really hard time believing they wouldn’t take the extra 10 minutes to look up the other major crawlers and put those on the allow list, too.
This whole argument about sites going to great lengths to block search indexers but then stopping just short of allowing a couple more of the well-known ones feels like mental gymnastics for a situation that doesn’t occur.
> sites going to great lengths to block search indexers
That's not it. They're going to great lengths to block all bot traffic because of abusive and generally incompetent actors chewing through their resources. I'll cite that anubis has made the front page of HN several times within the past couple months. It is far from the first or only solution in that space, merely one of many alternatives to the solutions provided by centralized services such as cloudflare.
Regarding allowlisting the other major crawlers: I've never seen any significant amount of traffic coming from anything but Google or Bing. There's the occasional click from one of the resellers (ecosia, brave search, duckduckgo etc), but that's about it. Yahoo? haven't seen them in ages, except in Japan. Baidu or Yandex? might be relevant if you're in their primary markets, but I've never seen them. Huawei's Petal Search? Apple Search? Nothing. Ahrefs & friends? No need to crawl _my_ website, even if I wanted to use them for competitor analysis.
So practically, there's very little value in allowing those. I usually don't bother blocking them, but if my content wasn't easy to cache, I probably would.
In the past month there were dozens of posts about using proof of work and other methods to defeat crawlers. I don't think most websites tolerate heavy crawling in the era of Vercel/AWS's serverless "per request" and bandwidth billing.
You don't get to tell site owners what to do. The actual facts on the ground are that they're trying to block your bot. It would be nice if they didn't block your bot, but the other, completely unnatural and advertising-driven, monopoly of hosting providers with insane per-request costs makes that impossible until they switch away.
You wouldn't have to make them micropayments, you can pay out once some threshold is reached.
Of course, it would incentivize the sites to make you want to crawl them more, but that might be a good thing. There would be pressure on you to focus on quality over quantity, which would probably be a good thing for your product.
Google search is a monopoly not because of crawling. It's because of all the data it knows about website stats and user behavior. The original Google idea of ranking based on links doesn't work because it's too easily gamed. You have to know which websites are good based on user preferences, and that's where you need to have data. It's impossible to build anything similar to Google without access to large amounts of user data.
Sounds like you're implying that they are using Google Analytics to feed their ranking, but that's much easier to game than links are. User-signals on SERP clicks? There's a niche industry supplying those to SEOs (I've seen it a few times, I haven't seen it have any reliable impact).
> so googlebot wins because they’re the dominant search engine.
I think it's also important to highlight that sites explicitly choose which bots to allow in their robots.txt files, prioritizing Google which reinforces its position as the de-facto monopoly. Even when other bots are technically able to crawl them.
> Crawling the internet is a natural monopoly. Nobody wants an endless stream of bots crawling their site,
Companies want traffic from any source they can get. They welcome every search engine crawler that comes along because every little exposure translates to incremental chances at revenue or growing audience.
I doubt many people are doing things to allow Googlebot but also ban other search crawlers.
> My heart just wants Google to burn to the ground
I think there’s a lot of that in this thread and it’s opening the door to some mental gymnastics like the above claim about Google being the only crawler allowed to index the internet.
> I doubt many people are doing things to allow Googlebot but also ban other search crawlers.
Sadly this is just not the case.[1][2] Google knows this too so they explicitly crawl from a specific IP range that they publish.[3]
I also know this, because I had a website that blocked any bots outside of that IP range. We had honeypot links (hidden to humans via CSS) that insta-banned any user or bot that clicked/fetched them. User-Agent from curl, wget, or any HTTP lib = insta-ban. Crawling links sequentially across multiple IPs = all banned. Any signal we found that indicated you were not a human using a web browser = ban.
We were listed on Google and never had traffic issues.
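For anyone curious what that kind of filtering looks like, here's a simplified Python sketch of the same ideas. The reverse-DNS-plus-forward-DNS check is Google's documented way of verifying genuine Googlebot traffic (an alternative to matching the published IP ranges); the honeypot path, the user-agent list, and the in-memory ban set are illustrative stand-ins, not the actual system described above.

```python
import socket

BANNED = set()

def is_real_googlebot(ip: str) -> bool:
    """Reverse-resolve the IP, check the domain, then confirm with a forward lookup."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not (host.endswith(".googlebot.com") or host.endswith(".google.com")):
        return False
    try:
        return socket.gethostbyname(host) == ip
    except OSError:
        return False

def handle_hit(ip: str, path: str, user_agent: str) -> bool:
    """Return True to serve the request, False if the caller gets banned/blocked."""
    if ip in BANNED:
        return False
    # Honeypot: a link hidden from humans via CSS; only scrapers ever fetch it.
    if path == "/do-not-follow" and not is_real_googlebot(ip):
        BANNED.add(ip)
        return False
    # HTTP-library user agents get banned outright, as described above.
    if any(tok in user_agent.lower() for tok in ("curl", "wget", "python-requests")):
        BANNED.add(ip)
        return False
    return True
```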
Are sites really that averse to having a few more crawlers than they already do? It would seem that it’s only a monopoly insofar as it’s really expensive to do and almost nobody else thinks they can recoup the cost.
We routinely are fighting off hundreds of bots at any moment. Thousands and Thousands per day, easily. US, China, Brazil from hundreds of different IPs, dozens of different (and falsified!) user agents all ignoring robots.txt and pushing over services that are needed by human beings trying to get work done.
EDIT: Just checked our anubis stats for the last 24h
CHALLENGE: 829,586
DENY: 621,462
ALLOW: 96,810
This is with a pretty aggressive "DENY" rule for a lot of the AI related bots and on 2 pretty small sites at $JOB. We have hundreds, if not thousands of different sites that aren't protected by Anubis (yet).
Anubis and efforts like it are a godsend for companies that don't want to pay off Cloudflare or some other "security" company peddling a WAF.
One is, suppose there are a thousand search engine bots. Then what you want is some standard facility to say "please give me a list of every resource on this site that has changed since <timestamp>" so they can each get a diff from the last time they crawled your site. Uploading each resource on the site to each of a thousand bots once is going to be irrelevant to a site serving millions of users (because it's a trivial percentage) and to a site with a small amount of content (because it's a small absolute number), which together constitute the vast majority of all sites.
The other is, there are aggressive bots that will try to scrape your entire site five times a day even if nothing has changed and ignore robots.txt. But then you set traps like disallowing something in robots.txt and then ban anything that tries to access it, which doesn't affect legitimate search engine crawlers because they respect robots.txt.
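Going back to the first point, here's a hypothetical Python sketch of that "changed since <timestamp>" facility. The endpoint name and the in-memory data store are made up for illustration; in practice a sitemap with accurate <lastmod> values gets you most of the way there.

```python
import json
from datetime import datetime
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# url -> last modification time (UTC), however the site actually tracks it
LAST_MODIFIED = {
    "/articles/1": datetime(2024, 5, 1),
    "/articles/2": datetime(2024, 6, 15),
}

class ChangedSince(BaseHTTPRequestHandler):
    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path != "/changed-since":
            self.send_error(404)
            return
        # e.g. GET /changed-since?since=2024-06-01T00:00:00
        since = datetime.fromisoformat(parse_qs(parsed.query)["since"][0])
        changed = [url for url, t in LAST_MODIFIED.items() if t > since]
        body = json.dumps(changed).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8001), ChangedSince).serve_forever()
```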
> then you set traps like disallowing something in robots.txt and then ban anything that tries to access it
That doesn't work at all when the scraper rapidly rotates IPs from different ASNs because you can't differentiate the legitimate from the abusive traffic on a per-request basis. All you can be certain of is that a significant portion of your traffic is abusive.
That results in aggressive filtering schemes which in turn means permitted bots must be whitelisted on a case by case basis.
> That doesn't work at all when the scraper rapidly rotates IPs from different ASNs because you can't differentiate the legitimate from the abusive traffic on a per-request basis.
Well sure you can. If it's requesting something which is allowed in robots.txt, it's a legitimate request. It's only if it's requesting something that isn't that you have to start trying to decide whether to filter it or not.
What does it matter if they use multiple IP addresses to request only things you would have allowed them to request from a single one?
> If it's requesting something which is allowed in robots.txt, it's a legitimate request.
An abusive scraper is pushing over your boxes. It is intentionally circumventing rate limits and (more generally) accurate attribution of the traffic source. In this example you have deemed such behavior to be abusive and would like to put a stop to it.
Any given request looks pretty much normal. The vast majority are coming from residential IPs (in this example your site serves mostly residential customers to begin with).
So what if 0.001% of requests hit a disallowed resource and you ban those IPs? That's approximately 0.001% of the traffic that you're currently experiencing. It does not solve your problem at all - the excessive traffic that is disrespecting ratelimits and gumming up your service for other well behaved users.
Why would it be only 0.001% of requests? You can fill your actual pages with links to pages disallowed in robots.txt which are hidden from a human user but visible to a bot scraping the site. Adversarial bots ignoring robots.txt would be following those links everywhere. It could just as easily be 50% of requests and each time it happens, they lose that IP address.
I mean sure, but if there were 3 search engines instead of one, would you disallow two of them? The spam problem is one thing, but I don't think having ten search engines rather than two is going to destroy websites.
The claim that search is a natural monopoly because of the impact on websites of having a few more search competitors scanning them seems silly. I don’t think it’s a natural monopoly at all.
A "few" more would be fine - but the sheer scale of the malicious AI training bot crawling that's happening now is enough to cause real availability problems (and expense) for numerous sites.
One web forum I regularly read went through a patch a few months ago where it was unavailable for about 90% of the time due to being hammered by crawlers. It's only up again now because the owner managed to find a way to block them that hasn't yet been circumvented.
So it's easy to see why people would allow googlebot and little else.
Assuming the simplified diagram of Google’s architecture, sure, it looks like you’re just splitting off a well-isolated part, but it would be a significant hardship to do it in reality.
Why not also require Apple to split off only the phone and messaging part of its iPhone, Meta to split off only the user feed data, and for the U.S. federal government to run only out of Washington D.C.?
This isn’t the breakup of AT&T in the early 1980s where you could say all the equipment and wiring just now belongs to separate entities. (It wasn’t that simple, but it wasn’t like trying to extract an organ.)
I think people have to understand that and know that what they’re doing is killing Google, and it was already on its way into mind-numbed enterprise territory.
> Apple to split off only the phone and messaging part of its iPhone
Ooh, can we? My wife is super jealous of my ability to install custom apps for phone calls and messaging on Android, it'd be great if Apple would open theirs up to competition. Competition in the SMS app space would also likely help break up the usage of iMessage as a tool to pressure people into getting an iPhone so they get the blue bubble.
If the dream of a Star Trek future reputation-based government run by AI which secretly manipulates the vote comes true, yes we can!
Either that or we could organize competitors to lobby the US or EU for more lawsuits in exchange for billions in kickbacks! (Not implying anything by this.)
You jest, but splitting out just certain Internet Explorer features was part of the Microsoft antitrust resolution. It's what made Chrome's ascendancy possible.
I mean it's just data. You can just copy it and hand it over to a newly formed competing entity.
You're not even really dealing with any of these shared infrastructure public property private property merged infrastructure issues.
Yeah sure. There's mountains of racks of servers, but those aren't that hard to get, tariffs TBD.
I think it'll be interesting just to try and find some collection of ex-Google execs who would actually like to go back to the "do no evil" days, and just hand them a copy of all the data.
I simply don't think we have the proper elected officials to implement antitrust of any scale. The DOJ is now permanently politicized and corrupt, and Citizens United means corps can outspend "the people" lavishly.
Antitrust would mean a more diverse and resilient supply chain, creativity, more employment, more local manufacturing, a reversal of the "awful customer service" as a default, better prices, a less corrupt government, better products, more economic mobility, and, dare I say it, more freedom.
Actually, let me expound upon the somewhat nebulous idea of more freedom. I think we all hear about shadow banning or outright banning, with utter silence and no appeals process, by large internet companies that have a complete monopoly on some critical aspect of Internet usage.
If these companies, enabled by their cartel control, decide they don't like you or are told by a government not to like you, it approaches as big a burden as being denied the ability to drive.
Not a single one of those is something oligarchs or a corporatocracy has the slightest interest in
This strikes me like "two easy steps to draw an owl. First draw the head, then draw the body". I generally support some sort of breakup, but hand waving the complexities away is not going to do anybody any good
This solution would also yield search engines that will actually be useful and powerful like old Google search was. They have crippled it drastically over the years. Used to be I could find exact quotes of forum posts from memory verbatim. I can't do that on Google or YouTube anymore. It's really dumbed down and watered down.
I feel like there's some conceptual drift going on in Kagi's blog post wrt their proposed remedy.
They argue that the search index is an essential facility, and per their link "The essential facilities doctrine attacks a form of exclusionary conduct by which an undertaking controls the conditions of access to an asset forming a ‘bottleneck’ for rivals to compete".
But unlike physical locations where bridges/ports can be built, the ability to crawl the internet is not excludable by Google.
They do argue that the web is not friendly to new crawlers, but what Kagi wants is not just the raw index itself, but also all the serving/ranking built on top of it so that they do not have to re-engineer it themselves.
It's also worth noting that Bing exists and presumably has its own index of the web, and no evidence has been presented that the raw index content itself is the reason that Bing is not competitive.
That's like asking the foxes how the farmer should manage his chickens. Kagi is a (wannabe) competitor. Likewise, YC's interest here is in making money by having viable startups and having them acquired.
I also don't think crawling the Web is the hard part. It's extraordinarily easy to do it badly [1] but what's the solution here? To have a bunch of wannabe search engines crawl Google's index instead?
I've thought about this and I wonder if trying to replicate a general purpose search engine here is the right approach or not. Might it not be easier to target a particular vertical, at least to start with? I refuse to believe Google cannot be bested in every single vertical or that the scale of the job can't be segmented to some degree.
Google's C-suite is clearly not thinking ahead here. They could have helped to slow down the antitrust lawsuits by opening up their search index to whichever AI company wants to pay for it. Web crawling is expensive, and lots of companies are spending wild amounts of money on it. There is a very clear market arbitrage opportunity between the cost of crawling the web and Google's cost of serving up their existing data.
Would the search index contain only raw data about the websites? Or would some sort of ranking be there?
If it's the latter, it's a neat way to ask a company to sell their users' data to a third party, because any kind of ranking comes via aggregation of users' actions. Without involving any user consent at all.
Then you'd just end up with all the ads being scams, and people not wanting to search on Google, because all the top results are scams instead of things they might actually be interested in that are not scams.
Separating the index creates a commodity data layer that preserves Google's crawling investment while enabling innovation at the ranking/interface layer, similar to how telecom unbundling worked for ISPs.
It's such a ridiculous proposal that would completely destroy Google's business. If that's the goal fine, but let's not pretend that any of those remedies are anything beyond a death sentence.
If they're dominating or one of only two or three important options in multiple other areas and the index is the only reason... I mean, that's a strong argument both that they're monopolists and that they're terrible at allocating the enormous amount of capital they have. That's really the only thing keeping them around? All their other lines of business collectively aren't enough to keep them alive? Yikes, scathing indictment.
> It's such a ridiculous proposal that would completely destroy Google's business.
It won't. My bet is that Bing and some other indexes are 95% OK for the average Joe. But relevance ranking is a much tougher problem, and "google.com" is a household brand with many other functions (maps, news, stocks, weather, knowledge graph, shopping, videos), and that's the foundation of Google's monopoly.
I think this shared index thing will actually kill competition even more, since every player will use only the index owned by Google.
I mean, they're still going to be the number 1 name in adtech and analytics. And they're still gonna have pretty decent personalized ads because of analytics.
Plus, that's just one part of their business. There's also Android, which is a money-printing machine with the Google Store (although that's under attack too).
Sorry, but corporations are not people despite what some people will tell you.
They would definitely NOT survive in any recognizable form with "only a few billion dollars", because the stock price is a function of profits. Take away most of the profits, and most of the company's value gets wiped out, most of the employees would leave or get laid off, and anything of value that remains would quickly become worthless. Users would all move to the government-sanctioned replacement monopoly, likely X. To say nothing about the thousands of ordinary people who have large Alphabet holdings in their retirement portfolios and would be wiped out.
Google is practically the definition of a "too big to fail" company. They need to be reined in to allow more competition, but straight up destroying the company would be a move so colossally stupid I could just see the Trump regime doing it.
Paint.net isn't free software, it's proprietary freeware. It used to be free a long time ago, but the author is a tool and made it closed source because people were creating other versions of it.