Hadoop noSQL

LevSelector.com

New York

Hadoop and other noSQL technologies

On This Page

- intro -
- list_of_technologies -
- good_websites -
- Microsoft and BigData -
- cloud -
- misc -
- seminal_articles -
- video_tutorials -

Intro ------------------------------

In the 15 years from 1998 to 2013, the number of servers at Google grew to approximately 3 million.
Microsoft, Amazon, and others are not far behind. How do you manage millions of servers?
How do you distribute data between them?
How do you query this data?
How does Google return results from millions of servers in less than a second?
What technologies power those huge distributed databases?

Google - how it works

http://en.wikipedia.org/wiki/Google - Google
http://en.wikipedia.org/wiki/PageRank - PageRank
http://infolab.stanford.edu/~backrub/google.html - original paper (1998)
http://infolab.stanford.edu/pub/papers/google.pdf - original paper as pdf (1998)
http://research.google.com/archive/mapreduce.html - (2004) - intro to map-reduce
http://en.wikipedia.org/wiki/BigTable - BigTable
BigTable - 2006 paper - pdf
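
The map-reduce idea referenced in the 2004 paper above can be sketched in a few lines of plain Python. This is a single-process illustration only, not Google's or Hadoop's implementation; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data big servers", "data on many servers"]
    print(word_count(docs))
```

In a real cluster the map calls run on the servers that hold each document, and the shuffle step moves data over the network so that every key ends up on one reducer; here all three steps run in one process.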

List of some well known BigData technologies (as of Summer 2014):

The list below contains both open source and commercial solutions.
Which ones to choose?

"Open source or open wallet.
Vibrant innovative engineering community or marketing.
It's not about a great product.
It is about unbridled thinking and freedom from single company control.
The choice is easy."
-- David Garrison, Hortonworks

Hadoop	Open source system to store and process data distributed across many (up to thousands of) servers. Created in 2005 by Doug Cutting and Mike Cafarella for the "Nutch" search engine project. The name "Hadoop" comes from a toy elephant that belonged to Cutting's young son. Cutting joined Yahoo! in 2006, where Hadoop was developed further. Hadoop is written in Java.
Hadoop distributions	Main distributions (vendors that package, develop, and support Hadoop): Cloudera (Palo Alto, CA); MapR (San Jose, CA); Hortonworks (Palo Alto, CA; funded by Yahoo, Teradata, etc.).
HDFS	Hadoop Distributed File System
YARN	Yet Another Resource Negotiator - the cluster resource manager introduced in Hadoop 2.x. MapReduce and other processing frameworks run on top of it.
Storm	Distributed processing and streaming of data. Open source. Written in Clojure. Uses "spouts" and "bolts" to define topologies of moving data. Integrates well with many common messaging systems (RabbitMQ, Kestrel, Kafka, etc).
Samza	Open Source distributed stream processing framework. Uses Apache Kafka for messaging. Written in Java and Scala.
Hive	SQL-like language called HiveQL to query Hadoop HDFS data. DataWarehouse infrastructure. Can run on top of Spark for faster performance.
HBase	Open Source distributed database on top of HDFS (Hadoop Distributed File System). Non-relational. Modeled after Google's BigTable - storing large quantities of sparse data. Written in Java. Good for analyzing huge 2-dim data (billions of rows, millions of columns - searches, log processing, etc.). HTTP or Thrift interface.
Cassandra	Open source distributed high-performance database to store and query huge amounts of data. Originally from Facebook. Has its own data model (not Hadoop). Many users load data from Hadoop into Cassandra to do analytics. Uses SQL-like query language (CQL3). Written in Java. Great for web analytics, transaction logs, etc.
Spark	Open source cluster computing engine written in Scala. Claims to run some workloads up to 100x faster than Hadoop MapReduce by keeping data in memory. Can read data from the Hadoop Distributed File System (HDFS). Originally developed in the AMPLab at UC Berkeley. Company - Databricks.
SparkSQL	Open source. Lets you query data in Spark using SQL or HiveQL, including from Scala programs. It is recommended to move from Hive to SparkSQL.
Spark RDD	Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.
Tachyon	Open source memory-caching for distributed file systems (HDFS, S3, GlusterFS, etc.). Hadoop compatible. Existing Spark and MapReduce programs can run on top of it without any code change. Used in Berkeley Data Analytics Stack (BDAS).
DataPad	Data analysis tools (Wes McKinney et al.). Talks: PyData, June 2014 (40 min); GraphLab Conf, July 2014 (22 min).
Pig	Provides a high-level language (Pig Latin) for writing MapReduce programs over data in HDFS. Originally from Yahoo, 2006.
Mahout	Machine learning algorithms on the Hadoop platform. Also provides math and statistics libraries.
Oozie	A workflow scheduler system to manage Hadoop jobs (web app, Java servlets)
Flume	distributed service for collecting, aggregating, and moving large amounts of log data, good for analytic applications
Sqoop	Tool for moving bulk data between Apache Hadoop and other data stores (for example, relational databases)
Cloudera Hue	Cloudera web interface for using Hadoop and analyzing data.
Cloudera CDH	Cloudera Hadoop Distribution (free download)
Cloudera Express	CDH + Cloudera Manager (free download) - best way to start with Hadoop
Cloudera Impala	Fast SQL querying of HDFS data (massively parallel processing). Good for analytics; eliminates the need to move data into another datastore for analysis. Impala = medium-sized African antelope known for fast running and jumping.
MongoDB	Open Source NoSQL database for JSON-like documents. Horizontally scalable. Written in C++. Queries (dynamic) in Javascript. You can define indexes. Fairly good performance. (name "mongo" comes from "humongous", not from "Blazing Saddles" Mongo).
CouchDB	Open source NoSQL database, uses JSON to store data, JavaScript as query language, HTTP for an API. Good for mobile devices.
Redis	Open source, VERY FAST (written in C) in-memory (disk-backed) key-value database. REDIS = REmote DIctionary Server. The DB should fit in RAM (or span the RAM of several computers using clustering). Supports bit operations, sets, lists, and hashes. Supports transactions. Great for real-time stock prices, analytics, etc.
Kyoto Tycoon - Kyoto Cabinet	Open Source fast and lightweight network DBM over HTTP, can do more than 1 million insert/select per second. Multiple storage options (hash, tree, dir, etc.). Lua on server side. Can be used with C, Java, Python, Ruby, Perl, Lua, etc. Great for real-time data (cache).
Riak	Open source distributed NoSQL key-value data store. Main benefit - high availability & fault tolerance. Simple to operate and to scale by adding more servers. Written in Erlang, C, and some Javascript. HTTP or custom binary interface. Map/reduce in JavaScript or Erlang.
Couchbase (Membase)	Open Source distributed NoSQL document-oriented database to serve many concurrent users. Easy-to-scale key-value or document access with low latency and high sustained throughput. Designed to be clustered to very large scale deployments.
Neo4j	Open source graph database for connected data (nodes and relationships). Transactional. Advanced path finding. Good for road maps, topologies, etc. (graphs). Written in Java.
Hypertable	Open source database - basically a faster, smaller HBase implemented in C++, using ideas from Google's BigTable. Runs on HDFS (or GlusterFS or KFS (Kosmos File System)). Uses its own "SQL-like" language, HQL.
ElasticSearch	Open source distributed full-text search engine with HTTP interface and JSON documents (parent/children docs). Based on Java Lucene library. Great when you need advanced flexible fuzzy search.
Solr	Popular, fast open source search platform based on the Java Lucene library. Used by SalesForce.com; runs in the pure-Java Jetty web server/servlet container.
Accumulo	Open source - similar to HBase, but provides cell-level security (access labels). Sorted, distributed key/value storage. Built on top of Hadoop, ZooKeeper, and Thrift. Written in Java and C++.
VoltDB	In-memory relational database, marketed as the world's fastest. Fast ingestion and export, massive scalability, real-time analytics. SQL access from within pre-compiled Java stored procedures. Open source and commercial versions.
Memcached	Open source distributed memory caching system. Can be used to cache results of DB queries.
Scalaris	Open source scalable, distributed, transactional key-value store. Written in Erlang, accessible from Python, Ruby, Java, etc.
BigTable	Google proprietary data storage system. Compressed, high performance. Uses Google File System, Chubby Lock Service, SSTable (log-structured storage like LevelDB).
LevelDB	open source on-disk key-value store (by Google).
GoogleFS	Google File System - proprietary
QFS	Quantcast File System (QFS) is an open-source distributed file system software package for large-scale MapReduce or other batch-processing workloads. It was designed as an alternative to Apache Hadoop’s HDFS, intended to deliver better performance and cost-efficiency for large-scale processing clusters.
Oracle noSQL database	Oracle NoSQL Database (ONDB) provides network-accessible multi-terabyte distributed key/value pair storage with predictable latency.
Oracle BigData SQL	Oracle Big Data SQL extends Oracle SQL to Hadoop and NoSQL
Oracle Data Integrator	Oracle Data Integrator (ODI) is an ETL (Extract-Transform-Load) platform for high-volume, high-performance batch loads, event-driven and trickle-feed integration, SOA-enabled data services, etc. BigData support, parallelism, monitoring.
Talend	Open source ETL / integration tools, supports BigData. Look at Talend Jumpstart Sandbox.
Ab Initio	Applications and custom solutions for very high-volume data processing and data integration. Record-breaking.
kiji	Open Source framework for collecting, analyzing, and serving entity data in real time.
HPCC	HPCC (High Performance Computing Cluster) - open source platform for massive parallel-processing to solve Big Data problems
Pivotal	Big Data solutions, Hadoop distribution
IBM Big Data Platform	IBM's enterprise class big data and analytics platform – Watson Foundations.
HP Haven	HP HAVEn Big Data platform = (Hadoop, Autonomy Corporation, Vertica, HP Enterprise Security Products).
Amazon	Cloud platform. Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3), Amazon Web Services (AWS), Amazon Redshift (data warehousing), Amazon Elastic MapReduce.
Intel	Partners with Cloudera to provide server platforms for Big Data.
Signiant	System to move large data sets into and out of the cloud (across datacenters, parallel, encrypted). Signiant Media Shuttle, Signiant Media Exchange and Signiant Manager+Agents, Signiant SkyDrop.
BitYota	Datawarehouse as a service
DataStax	Delivers certified version of Apache Cassandra that is ready for heavy-duty production environments
Greenplum	Bigdata analytics, in 2012 became a part of Pivotal.
Teradata	Expandable relational datawarehouse system (since 1979), "shared nothing" architecture.
Splunk	Captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualizations
Sumo Logic	Cloud-based log management and analytics service. Uses advanced machine learning algorithms to whittle down mountains of log file data into common groupings.
qubole	Creators of Facebook’s Big Data infrastructure and Apache Hive have leveraged their experience to deliver Qubole Data Service (QDS) – a cloud Big Data service offering the same advanced capabilities used by Big Data savvy organizations.
altiscale	Altiscale Data Cloud is described by the company as the first cloud service purpose-built to run Hadoop: an on-demand, elastic solution on a pay-as-you-go basis.
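
The Spark RDD entry in the list above describes an immutable, partitioned collection whose elements can be operated on in parallel. A minimal Python sketch of that idea follows. It is a toy model, not the Spark API: MiniRDD and its methods are hypothetical names, and threads on one machine stand in for the servers of a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor


class MiniRDD:
    """Toy stand-in for an RDD: immutable, partitioned, processed per partition."""

    def __init__(self, partitions):
        # Stored as tuples of tuples so the data cannot be modified in place.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def from_list(cls, items, num_partitions):
        """Split items into contiguous chunks, one chunk per partition."""
        items = list(items)
        size = max(1, -(-len(items) // num_partitions))  # ceiling division
        return cls(items[i:i + size] for i in range(0, len(items), size))

    @property
    def num_partitions(self):
        return len(self._partitions)

    def map(self, fn):
        """Apply fn to every element; each partition is handled independently."""
        def apply(partition):
            return tuple(fn(x) for x in partition)

        with ThreadPoolExecutor() as pool:
            # A new MiniRDD is returned; the original is left untouched.
            return MiniRDD(pool.map(apply, self._partitions))

    def collect(self):
        """Gather all elements from all partitions into one list."""
        return [x for partition in self._partitions for x in partition]


if __name__ == "__main__":
    numbers = MiniRDD.from_list(range(1, 7), num_partitions=3)
    squares = numbers.map(lambda x: x * x)
    print(squares.collect())
    print(numbers.collect())
```

Because map returns a new collection instead of changing the old one, a lost partition can in principle be rebuilt by re-running the same function on its source partition, which is the "resilient" part of the name.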

Good WebSites

- http://www.kdnuggets.com - Data Mining Community. Analytics, Data Mining, and Data Science Software, etc.
- http://www.cyc.com/platform/opencyc ( also http://en.wikipedia.org/wiki/Cyc ) - codify human common sense
- http://www.datasciencecentral.com - great site

Microsoft and Big Data ------------------------------

Hadoop and HDInsight: Big Data in Windows Azure:
- http://msdn.microsoft.com/en-us/magazine/dn385705.aspx

HDInsight is an Apache Hadoop implementation that runs in globally distributed Microsoft datacenters.
It’s a service that allows you to easily build a Hadoop cluster in minutes when you need it.

Windows Azure Storage - can include NoSQL stores, SQL databases, blobs, etc. (RESTful API over HTTP). Windows Azure also supports virtual machines for running Hadoop on Linux.

Cloud ------------------------------

Cloud Services:

http://talkincloud.com/talkin-cloud-top-100-cloud-services-providers-0 - top 100 Cloud service providers

Here are the top 10 (number = Rank, 1 - top rank)

1. Salesforce.com, San Francisco, CA
2. Amazon, Seattle, WA
3. Microsoft, Redmond, WA
4. Oracle, Redwood City, CA
5. Google, Mountain View, CA
6. SAP, Walldorf, Germany
7. SoftLayer (IBM), Dallas, TX
8. Terremark (a Verizon Company), Miami, FL
9. Rackspace, San Antonio, TX
10. NetSuite, San Mateo, CA

Misc ------------------------------

Some technologies to consider:

http://nginx.org - very fast web server / proxy for serving static content
http://www.getchef.com/chef/ - automate how you build, deploy, and manage your infrastructure
http://www.vagrantup.com - Create and configure lightweight, reproducible, and portable development environments
http://berkshelf.com/ - Manage a Cookbook or an Application's Cookbook dependencies
https://qpid.apache.org - Messaging built on AMQP

38 Seminal Articles Every Data Scientist Should Read (from datasciencecentral.com, August 2014)

External Papers

DSC Internal Papers

---

Video Tutorials

Cloudera Training

- Cloudera Live -
- MapReduce and HDFS -
- Introduction to Hadoop Administration -
- Install Hadoop (CDH4) on 5 nodes with VMWare, CDH4, Cloudera Manager 4 -
- Hadoop Tutorial: Intro To Hadoop Developer Training -
- Big Data HADOOP Training : Cloudera VM -
- Hadoop Distribution Comparison and Overview: Cloudera, MapR, and Hortonworks -
- Technical Overview of Cloudera Impala -

Big Data

Introducing Apache Hadoop: The Modern Data Operating System - https://www.youtube.com/watch?v=d2xeNpfzsYI
What is Big Data? - https://www.youtube.com/watch?v=PlaJsseTgk4
Big Data - General Introduction - https://www.youtube.com/watch?v=3twBv2v4Ip0
Big Data Architecture Patterns - https://www.youtube.com/watch?v=-N9i-YXoQBE

Hadoop - 15 video tutorials

Hadoop - Just the Basics for Big Data Rookies - https://www.youtube.com/watch?v=xYnS9PQRXTg
Hadoop Tutorial 1 - What is Hadoop? - https://www.youtube.com/watch?v=xWgdny19yQ4
Hadoop Tutorial 2 - Challenges Created by Big Data - https://www.youtube.com/watch?v=cA2btTHKPMY
Hadoop Tutorial 3 - History Behind Creation of Hadoop (Google, Yahoo, and Apache) - https://www.youtube.com/watch?v=jA7kYyHKeX8
Hadoop Tutorial 4 - Overview of Hadoop Projects - https://www.youtube.com/watch?v=5cle0xSdrtU
Hadoop Tutorial 5 - Steps to Install Hadoop on a Personal Computer (Windows/OS X) - https://www.youtube.com/watch?v=rO-V1mxhzcM
Hadoop Tutorial 6 - Downloading and Installing Oracle VirtualBox - https://www.youtube.com/watch?v=K70NyXNj9qc
Hadoop Tutorial 8 - Importing Hadoop Appliance into Oracle VirtualBox - https://www.youtube.com/watch?v=FZ8u73pHLu4
Hadoop Tutorial 9 - Configuring Hadoop Virtual Machine - https://www.youtube.com/watch?v=Nc7HngjMD58
Hadoop Tutorial 10 - Starting Hadoop Instance and Testing the Connection - https://www.youtube.com/watch?v=KHPFGRv7ZCM
Hadoop Tutorial 11 - Limitations of Network File System - https://www.youtube.com/watch?v=Fs5-lu3CRKM
Hadoop Tutorial 12 - Adressing Limitations of Distributed File System - https://www.youtube.com/watch?v=f35RwfhH024
Hadoop Tutorial 13 - Limitations of Hadoop File System - https://www.youtube.com/watch?v=w4fed_Vgo80
Hadoop Tutorial 14 - Block Structured File System - https://www.youtube.com/watch?v=uzGljM77elM
Hadoop Tutorial 15 - Replication in Hadoop File System - https://www.youtube.com/watch?v=CwoYV9EdCi0

NoSQL

What is NoSQL Database? - https://www.youtube.com/watch?v=pHAItWE7QMU
NoSQL or SQL: What Is Best For My Application (56:32) - https://www.youtube.com/watch?v=sDtnlPdqwWI
Introduction to NoSQL by Martin Fowler (54:51) - https://www.youtube.com/watch?v=qI_g07C_Q5I
NoSQL Distilled to an hour by Martin Fowler (1:03:19) - https://www.youtube.com/watch?v=ASiU89Gl0F0

MongoDB

- nosql mongo db tutorial (multiple videos)

Hadoop MapReduce - 5 videos

Hadoop MapReduce Fundamentals 1 of 5 - https://www.youtube.com/watch?v=7FcMhTTG1Cs
Hadoop MapReduce Fundamentals 2 of 5 - https://www.youtube.com/watch?v=pDGLe4CsrhY
Hadoop MapReduce Fundamentals 3 of 5 - https://www.youtube.com/watch?v=9h_WLsmRfFM
Hadoop MapReduce Fundamentals 4 of 5 - https://www.youtube.com/watch?v=iiIDZTpdcuU
Hadoop MapReduce Fundamentals 5 of 5 - https://www.youtube.com/watch?v=1aen3JsxkuM

HDFS Architecture (1:06:32) - https://www.youtube.com/watch?v=DLutRT6K2rM
MapReduce Flow Chart (55:23) - https://www.youtube.com/watch?v=6OemZEJdMp8
Top Interview questions in Hadoop (30:09) - https://www.youtube.com/watch?v=sQCNZoWLIVo
BigData-Apache-Hadoop free training video lesson 18-Cloudera CDH

Cassandra

Introduction To Apache Cassandra (1:15:05) - https://www.youtube.com/watch?v=B_HTdrTgGNs
Cassandra 1 | Cassandra Tutorial 1 | Cassandra Tutorial for Beginners -1 (2:46:16) - https://www.youtube.com/watch?v=N9QllqXI1sE
DataStax Cassandra Tutorials - Apache Cassandra (8 videos) - https://www.youtube.com/watch?v=5qEoEAfAer8&index=7&list=PLkm0HTZslJY2SD8PbxEgke9weNINb4a_q
Cassandra Training | Cassandra Online Training | Cassandra Tutorial | Youtub (2:01:23) - https://www.youtube.com/watch?v=qISzEVuSfQ4
Tech Talk: Cassandra Data Modeling (42:40) - https://www.youtube.com/watch?v=tg6eIht-00M
cassandra data modeling - Practical considerations @ netflix (1:35:56) - https://www.youtube.com/watch?v=-zyZ35YyT_8

Cloudera Video Tutorials

1. Hadoop Tutorial: Intro To Hadoop Developer Training | Cloudera (1:00:32)
2. Hadoop Tutorial: Introduction To Data Analyst Training | Cloudera (57:02)
3. Hadoop Tutorial: Intro To Hadoop Administrator Training | Cloudera (50:53)
4. Cloudera: Training A New Generation Of Data Scientists (33:03)
5. Teaching Hadoop To The Next Generation Of Data Professionals | Cloudera Academic Partnership (37:29)

Real-Time Analytics - videos

Real-time Analytics using Cassandra, Spark and Shark at Ooyala by Evan Chan (31:26)
Real Time Analytics with Open Source Technologies (33:35)
Beyond Hadoop MapReduce: Interactive Analytic Insights Using Spark (38:27)
