Principles and Techniques of Data Science

charlysl · on April 26, 2019

Homeworks, labs, projects: https://github.com/DS-100/sp19

Course design: https://youtu.be/HITIm3KoU2U

Course website: http://www.ds100.org/sp19/

edshiro · on April 26, 2019

This looks great! Thanks for sharing. Interestingly, judging from the table of contents, this book seems to start with a more pragmatic (and welcome) approach, where you write some Python code and look at data visualisation techniques, etc., before delving into stats.

Is there any chapter that stands out to you?

charlysl · on April 26, 2019

You're welcome!

I haven't done the course yet; I've only just found it. But, from the rationale video, the course seems to be more about weaving recurring fundamental data science concepts throughout, emphasizing one particular concept or technique in each chapter, so I guess it would make more sense to take it as a whole.

It is intended as a "glue" course, taken after CS fundamentals and before core data science courses like statistics, machine learning and databases. It gives students a context for what lies ahead, and just enough to be dangerous and start doing data science stuff.

If this is what you are after, you may also want to consider CMU's "Practical Data Science", which seems to take a similar approach and is also very current. It has videos and much more machine learning and big data, but much less statistics, and it doesn't have such a nice companion online book (though the notes look great): http://datasciencecourse.org

Both look like great DS intro courses from top universities; we are spoilt.

And then, also from Berkeley, there is "Data 8", which is intended for those who want an intro to data science, but don't have any programming or college math knowledge yet; it also has a similar online book with working links to Jupyter notebooks: http://data8.org/sp19/ (and videos: https://www.youtube.com/playlist?list=PLXbeRfilLvMoC3QZKxRrp...)

touristtam · on April 27, 2019

Is this related to the course Berkeley have on EdX? https://www.edx.org/professional-certificate/berkeleyx-found...

charlysl · on April 28, 2019

Yes, that edX course is a prerequisite for this one, and is based on Data 8.

tallon · on April 26, 2019

Hey, thanks for sharing this!

shubh2336 · on April 26, 2019

Shouldn't joins be explained as Cartesian products instead of Venn diagrams [1] when relating them to sets?

[1] https://www.textbook.ds100.org/ch/05/cleaning_structure.html...

EForEndeavour · on April 26, 2019

As I understand things, the Cartesian product (AKA the cross join) cannot be nicely depicted using Venn diagrams, you're right. However, Venn diagrams are a great way to depict the set logic that applies to the join keys of left, right, inner, and outer joins.
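To make the "set logic on the join keys" point concrete, here is a minimal Python sketch. The key values are made up for illustration, and it only models which keys survive each join type, not the rows themselves (duplicate keys multiply rows, which a Venn diagram does not show).

```python
# Hypothetical join keys from two tables (made-up data for illustration).
left_keys = {"alice", "bob", "carol"}
right_keys = {"bob", "carol", "dave"}

# Keys present in the result of each join type.
inner_keys = left_keys & right_keys       # inner join: keys found in both tables
left_join_keys = set(left_keys)           # left join: every key from the left table
right_join_keys = set(right_keys)         # right join: every key from the right table
outer_keys = left_keys | right_keys       # full outer join: keys from either table

# Keys that a left join keeps but pads with NULLs on the right-hand side.
left_only = left_keys - right_keys

print(sorted(inner_keys))
print(sorted(outer_keys))
print(sorted(left_only))
```

This is only the part of the picture that the Venn diagram captures; it says nothing about how many rows each key produces.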

robgt · on April 26, 2019

See here for an example of an argument against using Venn diagrams to depict joins: https://dzone.com/articles/say-no-to-venn-diagrams-when-expl...

EForEndeavour · on April 26, 2019

Thanks! That link sent me down a rabbit hole in which I learned valuable things about SQL that I didn't even realize I lacked.

all2 · on April 26, 2019

I thought a Cartesian (cross) product produced ordered tuples, each with one element from each set?

I don't have any experience with data science, but my brain wants to apply linear algebra and set theory...

So, in the above linked example, to clean we would first do an intersect operation on user names to remove people who don't appear in each set.

Then, to put the tables together (to append emails to appropriate names) we do a cross product between the filtered sets (assuming the sets have been ordered).

Is my intuition correct? I also have zero experience with DBs.
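One way to check this intuition is a small Python sketch. The table contents and the `inner_join` helper below are hypothetical, made up for illustration. It treats an inner join as a Cartesian product filtered down to the pairs whose keys match, in which case no separate intersect step or prior ordering is needed.

```python
from itertools import product

# Hypothetical tables as lists of (user, value) rows; made-up data.
names = [("alice", "Alice A."), ("bob", "Bob B."), ("carol", "Carol C.")]
emails = [
    ("bob", "bob@example.com"),
    ("carol", "carol@example.com"),
    ("dave", "dave@example.com"),
]


def inner_join(left, right):
    """Inner join on the first column (illustrative helper, not a library API).

    Builds every (left row, right row) pair via the Cartesian product and
    keeps only the pairs whose keys are equal.
    """
    return [
        (lkey, lval, rval)
        for (lkey, lval), (rkey, rval) in product(left, right)
        if lkey == rkey
    ]


cross = list(product(names, emails))  # every pairing: 3 x 3 rows
joined = inner_join(names, emails)    # only the pairings with matching user names

print(len(cross))
for row in joined:
    print(row)
```

Users that appear in only one table ("alice" and "dave" here) drop out because no pair containing them passes the key-equality filter. Real databases avoid materialising the full product, so this is a model of the semantics rather than of how a join is executed.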

tronko · on April 26, 2019

Any way to print this manual or buy a hard copy?

rahimnathwani · on April 26, 2019

It should be possible to build a PDF from source. The setup guide is here: https://github.com/DS-100/textbook/blob/master/SETUP.md

I tried the following in a Python 3.7 virtual environment, but it didn't quite work:

  sudo apt-get update
  sudo apt-get install -y --no-install-recommends npm calibre jekyll ca-certificates
  git clone https://github.com/DS-100/textbook
  cd textbook
  pip install -r requirements.txt
  pip install datascience # due to version conflict
  pip install --upgrade folium # due to version conflict
  pip install beautifulsoup4
  pip install lxml py-mathjax # not sure if these are needed
  sudo npm install -g gitbook-cli
  sudo gitbook fetch
  sudo gitbook install
  make build

halfeatenpie · on April 26, 2019

I'd assume it's because you have the --no-install-recommends flag on your apt-get call. Maybe something you're doing requires the recommended (but not strictly required) packages. I haven't tried it myself, so that's just my assumption at first glance; take it with a grain of salt.

rahimnathwani · on April 26, 2019

Sorry, I misled you a bit there. I didn't actually use that flag when I did it, as I already had the packages installed.

iron0013 · on April 26, 2019

I haven’t looked at the material yet, but I did try to read Deborah Nolan’s book Data Science in R, and it was a confounding experience. I remember thinking, “the material in this book is so far from anything that I’ve ever heard described as ‘Data Science’ that it renders the phrase useless.”

thousandautumns · on April 26, 2019

I've never looked at Data Science in R, but Hadley Wickham's R for Data Science is great in my opinion. It's really applicable, down to earth, and focuses much more on the meat of data science (data manipulation and munging, visualization, relational data, and efficient programming) than the typical "fit a neural network to this idealized toy data set!"

It's also available for free online at https://r4ds.had.co.nz/

mleonard · on April 26, 2019

The Data 8 videos are online. Are the Data 100 videos online too? Thanks.

charlysl · on April 28, 2019

No, I searched and searched, but to no avail.

mleonard · on April 28, 2019

Me too. Thanks for replying.