Agile Data Science 2.0: Building Full-Stack Data Analytics Applications with Spark
J**E
clear, useful, motivational
As an R language programmer, I'd like a book as clear, concise, to the task and motivational as Russell's, but for R. Worth every dollar.
T**R
Only Buy for the Architecture Explanation and not to Follow Along
I saw the good and bad reviews on this book, but considering my day-to-day job has me setting up similar infrastructure, I wanted to see how a pro does it. The good thing about this book is that Russell Jurney does seem to have a great end-to-end product solution that is similar to what we push into production: Customer Front End -> Message Broker -> Big Data ETL -> Front End for Product Managers with Insights. The only issue is that the code used to follow along is so patchy and broken (not due to the fault of the author, but due to different technologies upgrading) that actually following along with the code becomes impossible.
Russell gives two options to follow along in the book: AWS (which costs at MINIMUM $0.50/hour) or a virtual machine (which becomes difficult to manage and takes up MINIMUM 50GB on your laptop). AWS seems to be the most stable so far, but comes at a cost, literally. I don't really want to be paying to have a machine running when I am trying to figure out what went wrong or debugging. You can shut the machine off and save the state, but you are still charged for the full hour, so you might as well use it. After two days I have a $5 bill and I am barely through chapter 3, and there are still out-of-date package issues in Python (e.g. the `pymongo_spark` library doesn't work because the last time it was updated was 2015). I estimate that by the time I am done with the book the charges will be over $100.
The VM option seems to be more stable as long as you change some of the versions in the VagrantFile bash file provided on the book's GitHub (e.g. libsvm needs to be changed from 1.0.0 to 1.1), but this then causes compatibility issues which need to be resolved by installing Vagrant plugins. Once you get past all of that, the NAT from your VM to the host machine is improperly configured, so you can't forward any of the ports to do the exercises in the book.
The third option (not in the book) was a Dockerfile I found on the book's GitHub. I was like "Thank you! A modern free solution," but the file doesn't work and gets stuck when installing one of the many libraries, and it won't tell you which one or why, so that option is out.
I was debating between giving 1 star or 3 stars, but decided on 1. I am assuming I am like most people and read the 5-star reviews to see what people loved about it and the 1-star reviews to see what people didn't like about it. I also thought that people who may not have access to some of the production versions of these technologies will be very disappointed if they never get an opportunity to use them.
All in all, the architecture and explanations are good, not great (get "Designing Data-Intensive Applications" for a better overview), but you can use this outline to set up a pipeline if you are willing to spend a lot of money on AWS.
E**Y
Good book for thinking about innovating data science in a more agile fashion
As an erstwhile data scientist, this reviewer doesn't agree entirely with the premise as suggested in the book that "the most effective output of the data science process suitable for effecting change in an organization is the web application. It asserts that application development is a fundamental skill of a data scientist." From my perspective, being a good data scientist means really understanding the data, and figuring out what to highlight. Sometimes an application (web-based or other) may be the most effective method, but not always. And occasionally, there may be very important data security issues that need to be addressed if using the web as a medium for data transfer, display or analysis. So, this reader is not sold on the exact idea, but it provides an interesting way to consider data science, and whether there are more and better approaches that can be used. Keep in mind that people think of data as a medium without bias, and machine learning in a similar vein; however, if data are collected in a biased fashion, or questions are slanted in such a way as to capture answers in a desired fashion, the data itself is not bias free, and machine learning turns out those flaws as "facts," so there are many facets that need to be considered when using data science with only an empiric framework for exploratory data analysis.
Agile methods are becoming the gold standard for approaching technological solutions, but, as Jurney points out, may be difficult when approaching data analytics and big data. Understanding the nuances and caveats of each dataset makes the "agile" approach sometimes a little dangerous, and difficult to establish as a gold standard for data analytics vs. some other methods (such as scrum) that currently exist in a software development environment. In some cases, it's also difficult to weigh the impact of "failed" experiments (data was wrong), as the onus is usually on the analyst to present the "right" answer.
There's a lot of good stuff in here, though, as analysts and managers of analysts think about how to transform the work they do. With more and more data available, in structured and more unstructured formats, how do you iterate and improve upon the data understanding? Consider the importance of the visuals. Move out of the framework of only relational tables. This book is relatively up-to-date with the most commonly used tools (Spark instead of MapReduce), with examples of how to use Spark in Python (load PySpark), MongoDB, Elasticsearch, etc. There are good code samples and a very well indexed appendix.
Overall, some interesting concepts that a data team will enjoy discussing, even if they don't agree with all of Jurney's points of view.
J**R
Seriously moving in the right direction
I am reviewing a copy of "Agile Data Science 2.0" by Russell Jurney that I received at no cost through the Amazon Vine program.
Working in a data science group in I.T., we've had a lot of conversations about how I.T. operating approaches (agile, DevOps, PMO) apply to data science. Data science tasks are different in that not all work is intended to lead to functioning software, and in that a strongly iterative approach is necessary to deliver results to stakeholders in a way that discrete units of software might not otherwise be reviewed.
Russell Jurney's "Agile Data Science 2.0" goes a long way in moving that conversation in the right direction. I had three target audiences in mind when I acquired this book. The first was our PM, who had worked in I.T. for years as a director and project manager but continued to try to wrap his head around the data science process. The second was a director who was new to the data science process and wanted a better grasp of how to communicate expectations to the team. The third was myself: having spent time in both I.T. and research, I had seen the two worlds and wanted a way to help explain how the two mesh.
Jurney has offered, as have many data science books, a suggested stack and how to implement it, but I thought the most valuable part of the book was the first two chapters, for their emphasis on the agile manifesto for data science, a description of the many roles that go into a team, and highlights of how agile can make for better data science both in terms of research and in terms of products.
This is not a text for learning Spark from a developer's perspective but rather for understanding how Spark can fit in. Spark isn't the only platform, so those using Dask or other tools will still find value here.
If the book has a weakness, it's the focus on developing a web portal to expose the data science product; this isn't a bad way to do things, not at all, but it's not where our work is going at the moment, so this limits the applicability of some of the chapters. But there's nothing that keeps the book from being useful, so much so that I honestly don't know whose desk it's sitting on at the moment, since as soon as our PM finished it he gave it to a BA, who gave it to another PM...
L**N
Good technical discussion, but frequently glosses over fundamental concepts
Overall, this book is pretty good.
The first part of the book is basically the author trying to convince you of what literally everyone who's ever been in a dev environment knows is false. Agile environments aren't the great savior of development, Agile is really nothing more than today's buzzword-filled tech management style, and please don't let this book fill your head with illusions.
But when the book is focused more on technical matters than the author's opinion, it does a pretty decent job of covering important data science concepts.
The introductory/getting set up sections are a bit light, but honestly, I struggle to envision someone who's tech-savvy and mathematically sophisticated enough to attempt setting up something like a machine learning system with Spark, but who still needs to be walked through a Jupyter Notebook.
The reason I took off a star, though, is that this book suffers from the same problem that a lot of data science books do: it casts a wide net, and covers concepts in a mile-wide, inch-deep fashion. Data science is a huge field, and trying to condense it into a handbook like this certainly isn't easy, but the end result of trying to cover it all in under 350 pages is that you get things like "Introduction to Predictive Analytics," where the author spends less than three pages introducing the idea (or rather, just regression and classification) and then jumps straight into some functional examples. The examples are good, but I can't imagine someone feeling competent (or even intro-level capable) if that were their only introduction to the concept.