Principles and Techniques of Data Science (ds100.org)
347 points by charlysl on April 26, 2019 | 20 comments


Homeworks, labs, projects: https://github.com/DS-100/sp19

Course design: https://youtu.be/HITIm3KoU2U

Course website: http://www.ds100.org/sp19/


This looks great! Thanks for sharing. Interestingly, judging from the table of contents, this book seems to start with a more pragmatic (and welcome) approach: you write some Python code first, look at data visualisation techniques, etc., before delving into stats.

Is there any chapter that stands out to you?


You're welcome!

I haven't done the course yet; I've only just found it. But, from the rationale video, the course seems to be more about weaving recurring fundamental data science concepts throughout, emphasizing one particular concept or technique in each chapter, so I guess it would make more sense to take it as a whole.

It is intended as a "glue" course, taken after CS fundamentals and before core data science courses such as statistics, machine learning and databases. It gives students context for what lies ahead, and just enough to be dangerous and start doing data science stuff.

If this is what you are after, you may also want to consider CMU's "Practical Data Science", which seems to take a similar approach. It has videos, much more machine learning and big data, and is also very current, but it has much less statistics and no comparably nice companion online book (though the notes look great): http://datasciencecourse.org

Both look like great DS intro courses from top universities, we are spoilt.

And then, also from Berkeley, there is "Data 8", which is intended for those who want an intro to data science, but don't have any programming or college math knowledge yet; it also has a similar online book with working links to Jupyter notebooks: http://data8.org/sp19/ (and videos: https://www.youtube.com/playlist?list=PLXbeRfilLvMoC3QZKxRrp...)


Is this related to the course Berkeley has on edX? https://www.edx.org/professional-certificate/berkeleyx-found...


Yes, that edX course is a prerequisite for this one, and is based on Data 8.


Hey, thanks for sharing this!


Shouldn't joins be explained as a Cartesian product instead of with Venn diagrams [1] when relating them to sets?

[1] https://www.textbook.ds100.org/ch/05/cleaning_structure.html...


As I understand things, the Cartesian product (AKA the cross join) cannot be nicely depicted using Venn diagrams, you're right. However, Venn diagrams are a great way to depict the set logic that applies to the join keys of left, right, inner, and outer joins.
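
To make the distinction concrete, here is a minimal sketch in plain Python. The `users` and `emails` tables are made up for illustration and are not taken from the textbook. It builds the cross join with `itertools.product`, derives the inner join by filtering that product on equal keys, and then computes the key sets that a Venn diagram would depict.

```python
from itertools import product

# Hypothetical toy tables (assumed for illustration): (name, age) and (name, email).
users = [("ann", 31), ("bob", 25), ("cy", 40)]
emails = [
    ("ann", "ann@example.com"),
    ("bob", "bob@example.com"),
    ("dee", "dee@example.com"),
]

# Cross join (Cartesian product): every users row paired with every emails row.
cross = list(product(users, emails))

# Inner join: the rows of the cross join whose join keys are equal.
inner = [(u[0], u[1], e[1]) for u, e in cross if u[0] == e[0]]

# Venn-diagram view: set logic applied to the join keys only, not to the rows.
user_keys = {name for name, _ in users}
email_keys = {name for name, _ in emails}
inner_keys = user_keys & email_keys   # keys that survive an inner join
outer_keys = user_keys | email_keys   # keys that survive a full outer join

print(len(cross))
print(inner)
print(sorted(inner_keys), sorted(outer_keys))
```

The Venn diagram only tells you which keys appear in the result. The rows themselves come from pairing, which is what the Cartesian product view describes.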


See here for an example of an argument against using Venn diagrams to depict joins: https://dzone.com/articles/say-no-to-venn-diagrams-when-expl...


Thanks! That link sent me down a rabbit hole in which I learned valuable things about SQL that I didn't even realize I lacked.


I thought a Cartesian (cross) product produced ordered tuples, with one element from each set?

I don't have any experience with data science, but my brain wants to apply linear algebra and set theory...

So, in the above linked example, to clean we would first do an intersect operation on user names to remove people who don't appear in each set.

Then, to put the tables together (to append emails to appropriate names) we do a cross product between the filtered sets (assuming the sets have been ordered).

Is my intuition correct? I also have zero experience with DBs.
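
One way to check this intuition is the sketch below, in plain Python with made-up rows (the data and variable names are assumptions, not from the linked chapter). It follows the two proposed steps literally and then adds the step that appears to be missing: a cross product pairs every remaining row with every other, so the pairs still have to be filtered on matching names. Ordering the tables is not needed for that.

```python
from itertools import product

# Hypothetical tables (assumed for illustration); "bob" has two email rows.
names = [("ann", 31), ("bob", 25), ("cy", 40)]
emails = [
    ("ann", "ann@example.com"),
    ("bob", "bob@example.com"),
    ("bob", "bob@work.example"),
    ("dee", "dee@example.com"),
]

# Proposed step 1: intersect the user names and drop rows outside the intersection.
common = {n for n, _ in names} & {n for n, _ in emails}
names_f = [row for row in names if row[0] in common]
emails_f = [row for row in emails if row[0] in common]

# Proposed step 2: cross product of the filtered tables.
# This yields every pairing, including mismatches such as ("ann", ...) with ("bob", ...).
crossed = list(product(names_f, emails_f))

# Missing step: keep only the pairs whose names match. This is the inner join.
joined = [(n, age, email) for (n, age), (m, email) in crossed if n == m]

print(len(crossed))
for row in joined:
    print(row)
```

In this sketch the key-equality filter alone would already drop "cy" and "dee", so the intersect step is an optimisation rather than a requirement. The duplicate "bob" rows also show why set intersection on its own does not predict the number of output rows.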


Any way to print this manual or buy a hard copy?


It should be possible to build a PDF from source. The setup guide is here: https://github.com/DS-100/textbook/blob/master/SETUP.md

I tried the following in a Python 3.7 virtual environment, but it didn't quite work:

  sudo apt-get update
  sudo apt-get install -y --no-install-recommends npm calibre jekyll ca-certificates
  git clone https://github.com/DS-100/textbook
  cd textbook
  pip install -r requirements.txt
  pip install datascience # due to version conflict
  pip install --upgrade folium # due to version conflict
  pip install beautifulsoup4
  pip install lxml py-mathjax # not sure if these are needed
  sudo npm install -g gitbook-cli
  sudo gitbook fetch
  sudo gitbook install
  make build


I'd assume it's because you have the --no-install-recommends flag on your apt-get call. Maybe something you're doing needs the recommended (but not strictly required) packages. I haven't tried it myself, so that's just my assumption at first glance; take it with a grain of salt.


Sorry, I misled you a bit there. I didn't actually use that flag when I did it, as I already had the packages installed.


I haven’t looked at the material yet, but I did try to read Deborah Nolan’s book Data Science in R, and it was a confounding experience. I remember thinking “the material in this book is so far from anything that I’ve ever heard described as ‘Data Science’ that it renders the phrase useless.”


I've never looked at Data Science in R, but Hadley Wickham's R for Data Science is great in my opinion. Really applicable, down to earth, and it focuses much more on the meat of data science (data manipulation and munging, visualization, relational data, and efficient programming) than on the typical "fit a neural network to this idealized toy data set!"

It's also available for free online at https://r4ds.had.co.nz/


The data-8 videos are online. Are the data-100 videos online too? Thanks.


No, I searched and searched, but to no avail.


Me too. Thanks for replying.



