Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

How is it a misconception? My overall point was that shell oneliners are often much faster to quickly bang out for a one-off use case than writing a full program from the ground up to accomplish the same thing. This is demonstrated to a very exaggerated degree in the Knuth vs. McIlroy example, but it also holds true for non-exaggerated real-world use cases. (I had a coworker who was totally shell illiterate and would write a Python script every time they had to do a simple task like count the number of unique words in a file. This took at least 10 times longer than someone proficient at the shell, which one could argue is itself a highly optimized domain-specific language for text processing.)

If your point is that the shell script isn't really the same thing as Knuth's program: sure, the approaches weren't algorithmically identical (assuming constant time insertions on average, Knuth's custom trie yields an O(N) solution, which is faster than McIlroy's O(N*log N) sort, though this point is moot if you use awk's hashtables to tally words rather than `sort | uniq -c`), but both approaches accomplish the exact same end result, and both fail to handle the exact same edge cases (e.g. accent marks).



The task Knuth was given was to illustrate his literate programming system (WEB) on the task given to him by Bentley, which meant writing a Pascal program "from the ground up", and (ideally) the program containing something of interest to read.

If instead of writing a full program as asked, he had given some cop-out like “actually, instead of writing this in WEB as you asked, I propose you just go to Bell Labs or some place where Unix is available, where it so happens that other people have written some programs like 'tr' and 'sort', then you can combine them in the following way”, that would have been an inappropriate reply, hardly worth publishing in the CACM column. (McIlroy, as reviewer, had the freedom to spend a section of his review advertising Unix and his invention of Unix pipelines, then not yet as well-known to the general CACM reader.)

So while of course shell one-liners are faster to bang out for a one-off use-case, they obviously cannot accomplish the task that was given (of demonstrating WEB). (BTW, I don't want to too much repeat the earlier discussion, but see https://codegolf.stackexchange.com/questions/188133/bentleys... — on that input, the best trie-based approach is 8x faster than awk and about 200x faster than the tr-sort script.)


Seems like you should just go with what you you know best.

Taking 10x longer doesn't seem like a language problem. If you don't know bash well you're going to take even longer to do it in bash than in python.

In any case the task you described is pretty much the same in python as in bash. At worst the python is going to be more more verbose.

   python -c "print(len(set(w for l in list(open('test.txt')) for w in l.split())))"

vs

   tr ' ' '\n' < file_name | sort | uniq -c | wc -l


The shell's advantage is that of the pipeline components don't need to suck the whole file in so it can potentially operate on much larger files without running out of memory. I think only "sort" is problematic and at least it's a merge sort.

In Python you could use a generator but it would get a little more complicated and you'd still have to add all the words to set() but hopefully the number of different words is not that great.

The trie approach is quite memory efficient and that can matter.


I'm fairly sure `open` is a generator and doesn't load the whole file into memory. So you wouldn't hit a memory error unless like you said the amount of unique words is high enough.


I think you're right but I believe that wrapping it in List(...) is where that would force the whole file into memory.


Yeah, you're right, that's my mistake.

I think you can just omit it but yeah...


> one-off use case

And fortunately nobody is foolish enough to think that shell scripting is robust enough to use for more than one-off uses! /s


Hah, I have a friend who spent a large chunk of an undergraduate summer internship at Google porting a >50k line bash script (that was used in production!) to Python. It was not their most favorite summer, to say the least.


How does one do that? I mean you just can type 50k lines in 2.5 month if you type 1000 lines per day. Which sound a lot to me.


It's only possible if you can identify large portions of the 50k original lines as having been previously implemented by other components (python modules, microservices, etc.), or that large portions are dealing with cases that are guaranteed to no longer arise (so you either produce different results or error out if you detect them).




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: