The academic reference for mining massive datasets: extracting gold out of data
Want to know the secrets of mining massive datasets? The following is a review of Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeffrey David Ullman, a book that will teach you how to do just that.
Review of Mining of Massive Datasets
With the rise of the web and internet technologies, the amount of data created per day, and thus available to exploit, has grown significantly over the years. This data is incredibly rich and, if treated correctly, can provide immense value. Doing so, however, is not easy, as the volume and structure of this data make it difficult to exploit efficiently.
This is why we need to know how to mine massive datasets, in order to extract gold out of this vast ocean of heterogeneous data.
Mining of Massive Datasets by Leskovec et al. is the go-to book for teaching Data Mining at many of the world’s top universities. It focuses on practical algorithms that have consistently proven to work well on very large data sets, such as MapReduce, the programming model behind Hadoop that also influenced Spark, and teaches readers how to parallelize their data processing automatically.
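To make the MapReduce idea a bit more concrete, here is a minimal single-machine sketch of the classic word-count example in plain Python. It is only an illustration of the map, shuffle and reduce steps, not how Hadoop or Spark are actually implemented, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are our own choices for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single count."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster the map calls would run in parallel on different
    # machines; here everything runs sequentially for illustration.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["data is gold", "mining data", "Gold mining"])
print(counts)
```

Because each map call only looks at one document and each reduce call only looks at one key, a framework can spread both steps over many machines without the programmer managing the parallelism by hand.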
The book also covers algorithms designed for streaming data (data that has to be analysed and processed in real time), as well as the inner workings of search engines like Google, through algorithms like PageRank. If you want to play a little with PageRank, a close relative of the Hubs and Authorities (HITS) algorithm, you can go to the following PageRank calculator.
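If you prefer code to a calculator, the following is a small sketch of PageRank computed by repeated iteration on a toy graph. It assumes a damping factor of 0.85 (the name `beta` and its value are our assumptions, not taken from this review), that every linked page also appears as a key in the graph, and that pages without outgoing links spread their rank evenly over all pages. It is meant for intuition, not for web-scale data.

```python
def pagerank(graph, beta=0.85, iterations=50):
    """Approximate PageRank for a dict mapping each page to the pages it links to."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from a uniform distribution

    for _ in range(iterations):
        # Every page receives a small "teleport" share of rank.
        new_rank = {node: (1.0 - beta) / n for node in nodes}
        for node, out_links in graph.items():
            if out_links:
                # Split this page's rank evenly among the pages it links to.
                share = beta * rank[node] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
            else:
                # Dead end: spread its rank evenly over all pages.
                share = beta * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
print(ranks)
```

In this toy web, page C collects links from both A and B, so it ends up with a higher rank than B, which is only linked from A.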
Mining of Massive Datasets also covers finding frequent itemsets and clustering, along with new content on decision trees, deep learning and mining social network graphs. Social network mining is a very active area, so if you are interested in Social Network Analysis using Machine Learning, check out these posts on Medium.
This book is based on this Stanford Computer Science course and, like the course, it is designed at the undergraduate computer science level with no formal prerequisites. To support deeper exploration, most chapters are supplemented with further reading references.
Awesome, let’s see what it contains!
Contents of the book
The contents of Mining Massive Datasets are:
- Chapter 1: Data Mining – The essence of Data Mining: what it is, in which fields it is used, the most common concepts, and topics such as TF-IDF that are not data mining per se but are used in the field.
- Chapter 2: MapReduce and the New Software Stack – How to manage immense amounts of data quickly using the most well-known software frameworks.
- Chapter 3: Finding Similar Items – One of the fundamental data-mining problems, finding ‘similar’ items, like near duplicate web pages.
- Chapter 4: Mining Data Streams – how to move from mining static databases to processing data in real time.
- Chapter 5: Link Analysis – an introduction to PageRank and how it is computed.
- Chapter 6: Frequent Itemsets – an explanation of one of the major families of techniques for characterising data, the discovery of frequent itemsets.
- Chapter 7: Clustering – an introduction to one of the most widely used families of Unsupervised Machine Learning models.
- Chapter 8: Advertising on the web – this chapter is devoted to the most effective algorithms used to match queries to advertisements.
- Chapter 9: Recommendation Systems – this chapter explains the technology behind Netflix, Amazon, and all kinds of recommender systems.
- Chapter 10: Mining social network graphs – social network analysis, an area that is growing a lot right now and from which incredibly insightful information can be extracted.
- Chapter 11: Dimensionality Reduction – how to reduce the dimensions of your data so that it can be processed or stored efficiently, and used in your Machine Learning models. If you don’t know why dimensionality reduction is so important, check out this article; it is amazing.
- Chapter 12: Large-Scale Machine Learning – this chapter discusses and explains the main supervised machine learning models, how to split your data into training and test sets, feature selection, and more.
- Chapter 13: Neural Nets and Deep Learning – the book ends with one of the most exciting families of machine learning models – Artificial Neural Networks.
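To give a taste of the kind of problem Chapter 3 deals with, here is a tiny sketch of comparing two texts by the Jaccard similarity of their character shingles. This is a brute-force illustration under our own assumptions (a shingle length of `k=5` and made-up example sentences), not the scalable techniques the book builds on top of this idea.

```python
def shingles(text, k=5):
    """Return the set of all k-character substrings (shingles) of a text."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


# Hypothetical example pages.
page_a = "Mining of Massive Datasets is a data mining textbook"
page_b = "Mining of Massive Datasets is a data-mining textbook"
page_c = "A recipe for slow-cooked tomato sauce"

near = jaccard(shingles(page_a), shingles(page_b))
far = jaccard(shingles(page_a), shingles(page_c))
print(near, far)
```

The two near-duplicate sentences share most of their shingles and therefore score much higher than the unrelated pair. Comparing every pair of documents this way quickly becomes too slow on large collections, which is the problem the chapter sets out to solve.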
Each section of the book also points to plenty of further resources in case you want to go deeper into any of the covered topics.
Summary of Mining Massive Datasets:
This book is a go-to reference on Data Mining methods. It covers the theory and practical aspects of most of the well-known techniques, laying the theoretical foundations as well as providing insight into their limitations and possible failures.
We think it is a great book for introducing those who are keen on the topic to the amazing world of Big Data, and we highly recommend it. Find Mining Massive Datasets on Amazon here:
- Hardcover Book
- Leskovec, Jure (Author)
- English (Publication Language)
- 565 Pages - 02/13/2020 (Publication Date) - Cambridge University Press (Publisher)
While this book covers a lot on how to pre-process data efficiently, and touches on Machine Learning at the end, if you want to go deeper into Machine Learning models and techniques, check out the following Machine Learning books: ‘The Elements of Statistical Learning‘ or, even simpler, ‘The Hundred Page Machine Learning book‘. The previous links will take you to reviews of these two books, where you can decide if they are right for you.
As the guy on the cover illustrates, data is the new gold: become rich with data.
Thank you very much for reading How to Learn Machine Learning. We hope you enjoyed the review and that we cleared up any doubts you might have about the book. Leave us a comment whether you liked it or not; we are really looking forward to engaging with you and building a community!
Also, if you want to keep up with the content we produce, the latest news in the world of Artificial Intelligence, and more, follow us on Twitter. Have a great day!