Data Algorithms: Recipes for Scaling Up with Hadoop and Spark
- Length: 778 pages
- Edition: 1
- Language: English
- Publisher: O'Reilly Media
- Publication Date: 2015-05-07
- ISBN-10: 1491906189
- ISBN-13: 9781491906187
- Sales Rank: #765207
Learn the algorithms and tools you need to build MapReduce applications with Hadoop and Spark for processing gigabyte-, terabyte-, or petabyte-sized datasets on clusters of commodity hardware. With this practical book, author Mahmoud Parsian, head of the big data team at Illumina, takes you step by step through the design of machine-learning algorithms, such as Naive Bayes and Markov Chain, and shows you how to apply them to clinical and biological datasets, using MapReduce design patterns.
- Apply MapReduce algorithms to clinical and biological data, such as DNA-Seq and RNA-Seq
- Use the most relevant regression and analytical algorithms for different biological data types
- Apply t-test, joins, top-10, and correlation algorithms using MapReduce/Hadoop and Spark
Table of Contents
Chapter 1 Secondary Sort: Introduction
Chapter 2 Secondary Sorting: Detailed Example
Chapter 3 Top 10 List
Chapter 4 Left Outer Join in MapReduce
Chapter 5 Order Inversion Pattern
Chapter 6 Moving Average
Chapter 7 Market Basket Analysis
Chapter 8 Common Friends
Chapter 9 Recommendation Engines using MapReduce
Chapter 10 Content-Based Recommendation: Movies
Chapter 11 Smarter Email Marketing with Markov Model
Chapter 12 K-Means Clustering
Chapter 13 kNN: k-Nearest-Neighbors
Chapter 14 Naive Bayes
Chapter 15 Sentiment Analysis
Chapter 16 Finding, Counting and Listing all Triangles in Large Graphs
Chapter 17 K-mer Counting
Chapter 18 DNA-Sequencing
Chapter 19 Cox Regression
Chapter 20 Cochran-Armitage Test for Trend
Chapter 21 Allelic Frequency
Chapter 22 The T-Test
Chapter 23 Computing Pearson Correlation
Chapter 24 DNA Base Count
Chapter 25 RNA-Sequencing
Chapter 26 Gene Aggregation
Chapter 27 Linear Regression
Chapter 28 MapReduce and Monoids
Chapter 29 The Small Files Problem
Chapter 30 Huge Cache for MapReduce
Chapter 31 Bloom Filter