
https://www.shopify.com/robots.txt lists a lot of sitemap files, which tend to be a good starting point.


Did this suddenly get changed? Nothing but "# ,: # ,' | # / : # --' / # \/ />/ # /" is shown now.


It's just your browser's HTML parser. Line 6:

  #                         / <//_\
This is interpreted as a malformed closing tag, which (per the WHATWG HTML parsing algorithm) becomes a "bogus comment" that runs until the next >. The file doesn't contain any > past this point, so everything after it is swallowed. That leaves the visible contents of lines 1–6:

  #                               ,:
  #                             ,' |
  #                            /   :
  #                         --'   /
  #                         \/ />/
  #                         /
Or, with whitespace collapsed:

  # ,: # ,' | # / : # --' / # \/ />/ # /
Which should be exactly what you observe.

Refs:
https://html.spec.whatwg.org/multipage/parsing.html
https://developer.mozilla.org/en-US/docs/Web/CSS/white-space...
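A quick way to see the bogus-comment rule in action: Python's html.parser follows the same tokenizer rule, so feeding it a `</` followed by a non-letter shows the swallowed span arriving as a comment rather than as text (the input string here is made up for illustration):

```python
from html.parser import HTMLParser

class BogusCommentCollector(HTMLParser):
    """Records comments and visible text separately."""
    def __init__(self):
        super().__init__()
        self.comments = []
        self.text = []

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_data(self, data):
        self.text.append(data)

p = BogusCommentCollector()
# '</' followed by a non-letter cannot start a real end tag, so the
# tokenizer switches to the "bogus comment" state and consumes
# everything up to the next '>' as a comment.
p.feed("visible </_ swallowed by the parser > visible again")
p.close()
print(p.comments)
print("".join(p.text))
```

In the Shopify file there is no later >, so the bogus comment runs all the way to the end of the file, hiding everything after line 6.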


Weird. I think it did change. Google cache shows a 2229-line file: https://webcache.googleusercontent.com/search?q=cache%3Ahttp...


Seems it might be looking at the Referer header. Loading https://www.shopify.com/robots.txt by clicking the link shows the garbled version, while opening it in a private browsing window shows the right one.


For some reason, "view source" gets the right list. Maybe a referer issue like someone else said.


Looks like it's just Shopify's own pages and not anything related to actual stores.


It seems sort of questionable to use the list of things to not scrape as a starting point for scraping.... I mean, I get it's not actually enforced.


Not really sure why all the answers here are flagged, but you may be mistaken.

The robots.txt does not exclusively list what not to scrape.

It states which parts are allowed and which are not (disallowed).

It also points crawlers at sitemaps as a starting point with more information (e.g. which pages exist and how often they are updated).
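For illustration, a minimal robots.txt combining all three kinds of directives (hypothetical domain, not Shopify's actual file):

```
User-agent: *
Disallow: /admin/
Allow: /admin/help/
Sitemap: https://www.example.com/sitemap.xml
```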


Since ~2009 many crawlers recognize "Sitemap:" directives in robots.txt to link to sitemaps: https://en.wikipedia.org/wiki/Robots.txt#Sitemap
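Python's standard library exposes those Sitemap directives directly: urllib.robotparser's RobotFileParser grew a site_maps() method in Python 3.8. A small sketch, using a made-up robots.txt rather than Shopify's real one:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# parse() accepts the file's lines directly, so no network fetch is
# needed for this demonstration.
rp.parse("""\
User-agent: *
Disallow: /admin
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap_products.xml
""".splitlines())

# site_maps() returns the listed Sitemap URLs (Python 3.8+).
print(rp.site_maps())
# can_fetch() applies the Allow/Disallow rules for a given user agent.
print(rp.can_fetch("*", "https://www.example.com/admin"))  # False
```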



