LevSelector.com New York
home > Hadoop, etc

Hadoop and other noSQL technologies
On This Page More Other Pages

- intro -
- list_of_technologies -
- good_websites -
- Microsoft and BigData -
- cloud -
- misc -
- seminal_articles -
- video_tutorials -
- x -
- x -



Intro ------------------------------

In 15 years since 1998 the number of servers in Google grew to approx 3 million servers in 2013.
Microsoft, Amazon, and others are not far behind. How do you manage millions of servers?
How do you distribute data between them?
How you query this data?
How Google returns results from millions of servers in less than a second?
What technologies power those huge distributed databases?

Google - how it works


List of some well known BigData technologies (as of Summer 2014):

List below contains both open source and commercial solutions.
Which ones to choose?

"Open source or open wallet.
Vibrant innovative engineering community or marketing.
It's not about a great product.
It is about unbridled thinking and freedom from single company control.
The choice is easy."
          -- David Garrison, Hortonworks


Open Source system to store data distributed over multiple (thousands) servers. Created in 2005 by Doug Cutting and Mike Cafarella for "Nutch" search engine project. Name "Hadoop" comes from the name of Cutting's little son's toy elephant. Cutting was working at Yahoo! at the time. Hadoop is written in Java.

Hadoop distributions

Main distributions (package, develop, support):
- Claudera -Palo Alto, CA
- MapR - San Hose, CA
- Hortonworks - Palo Alto, CA, funded by Yahoo, Teradata, etc.

HDFS Hadoop Distributed File System
YARN Yet Another Resource Negotiator - for better resource management and map reduce in Hadoop 2.x
Storm Distributed processing and streaming of data. Open source. Written in Clojure. Uses "spouts" and "bolts" to define topologies of moving data. Integrates well with many common messaging systems (RabbitMQ, Kestrel, Kafka, etc).
Samza Open Source distributed stream processing framework. Uses Apache Kafka for messaging. Written in Java and Scala.
Hive SQL-like language called HiveQL to query Hadoop HDFS data. DataWarehouse infrastructure. Can run on top of Spark for faster performance.

Open Source distributed database on top of HDFS (Hadoop Distributed File System). Non-relational. Modeled after Google's BigTable - storing large quantities of sparse data. Written in Java. Good for analyzing huge 2-dim data (billions of rows, millions of columns - searches, log processing, etc.). HTTP or Thrift interface.


Open source distributed high-performance database to store and query huge amounts of data. Originally from Facebook. Has its own data model (not Hadoop). Many users load data from Hadoop into Cassandra to do analytics. Uses SQL-like query language (CQL3). Written in Java. Great for web analytics, transaction logs, etc.

Spark Open source, written in Scala, runs MapReduce up to 100x faster than Hadoop on top of Hadoop Distributed File System (HDFS). Originally developed in the AMPLab at UC Berkeley. Company - Databricks.
SparkSQL Open source, allows to use SQL (or HiveQL or Scala) over Spark. It is recommended to move from Hive to SparkSQL.
Spark RDD Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.

Open source memory-caching for distributed file systems (HDFS, S3, GlusterFS, etc.). Hadoop compatible. Existing Spark and MapReduce programs can run on top of it without any code change. Used in Berkeley Data Analytics Stack (BDAS).

DataPad Data Analysis tools (Wes McKinney et al)
- PyData, June 2014, 40min
- GraphLab Conf, July 2014, 22min
Pig provides language (Pig Latin) for MapReduce over HDFS. Originally from Yahoo, 2006.
- -
Mahout machine learning algorythm on the Hadoop platform. Also provides math and statistics.
Oozie A workflow scheduler system to manage Hadoop jobs (web app, Java servlets)

distributed service for collecting, aggregating, and moving large amounts of log data, good for analytic applications

Sqoop tool for moving bulk data between Apache Hadoop and other storages (for example, relational databases)
Claudera Hue Cloudera web interface for using Hadoop and analyzing data.
Claudera CDH Cloudera Hadoop Distribution (free download)
Cloudera Express CDH + Cloudera Manager (free download) - best way to start with Hadoop
Cloudera Impala fast SQL queying of HDFS data (massive parrallel processing) - good for analytics, eliminates the need of moving data into some other datastore for analytics. Impala = medium-sized African antelope known for fast running and jumping.

Open Source NoSQL database for JSON-like documents. Horizontally scalable. Written in C++. Queries (dynamic) in Javascript. You can define indexes. Fairly good performance. (name "mongo" comes from "humongous", not from "Blazing Saddles" Mongo).

CouchDB Open source NoSQL database, uses JSON to store data, JavaScript as query language, HTTP for an API. Good for mobile devices.

Open source, VERY FAST (written in C) in-memory (disk-backed) key-value database. REDIS = REmote DIctionary Server. DB should fit in RAM (or span RAMs of several computers using clustering). Supports bits, sets, lists, hashes. Support transactions. Great for real time stock prices, analytics, etc.

Kyoto Tycoon - Kyoto Cabinet

Open Source fast and lightweight network DBM over HTTP, can do more than 1 million insert/select per second. Multiple storage options (hash, tree, dir, etc.). Lua on server side. Can be used with C, Java, Python, Ruby, Perl, Lua, etc. Great for real-time data (cache).


Open source distributed NoSQL key-value data store. Main benefit - high availability & fault tolerance. Simple to operate and to scale by adding more servers. Written in Erlang, C, and some Javascript. HTTP or custom binary interface. Map/reduce in JavaScript or Erlang.


Open Source distributed NoSQL document-oriented database to serve many concurrent users. Easy-to-scale key-value or document access with low latency and high sustained throughput. Designed to be clustered to very large scale deployments.


Open source graph database for connected data (nodes and relationships). Transactional. Advanced path finding. Good for road maps, topologies, etc. (graphs). Written in Java.


Open source database - basically a faster smaller HBase implemented in C++, uses ideas from Google's BigTable. Runs on HDFS (or ClusterFS or KFS (Kosmos File System). Uses its own, "SQL-like" language, HQL.


Open source distributed full-text search engine with HTTP interface and JSON documents (parent/children docs). Based on Java Lucene library. Great when you need advanced flexible fuzzy search.

Solr Open source popular blazing fast search platform. Based on Java Lucene library. Used by SalesForce.com over pure-java Jetty web server/servlet container.

Open source - similar to HBase, but provides cell-level security (access labels). Sorted, distributed key/value storage. Built on top of Hadoop, ZooKeeper, and Thrift. Written in Java and C++.

VoltDB The world’s fastest in-memory relational database. Fast ingestion and export, massive scalability, real-time analytics. SQL access from within pre-compiled Java stored procedures. Open Source and commercial versions.

Open source distributed memory caching system. Can be used to cache results of DB queries.


Open source scalable, distributed, transactional key-value store. Written in Erlang, accessible from Python, Ruby, Java, etc.

BigTable Google proprietary data storage system. Compressed, high performance. Uses Google File System, Chubby Lock Service, SSTable (log-structured storage like LevelDB).
LevelDB open source on-disk key-value store (by Google).
GoogleFS Google File System - proprietary
QFS Quantcast File System (QFS) is an open-source distributed file system software package for large-scale MapReduce or other batch-processing workloads. It was designed as an alternative to Apache Hadoop’s HDFS, intended to deliver better performance and cost-efficiency for large-scale processing clusters.
Oracle noSQL database Oracle NoSQL Database (ONDB) provides network-accessible multi-terabyte distributed key/value pair storage with predictable latency.
Oracle BigData SQL Oracle Big Data SQL extends Oracle SQL to Hadoop and NoSQL
Oracle Data Integrator Oracle Data Intergator (ODI) is an ETL (Extract-Transform-Load) platform for high volume/high performance batch loads, event-driven, trickle-feed integration, SOA-enabled data service, etc. BigData support, parallelism. Monitoring.
Talend Open source ETL / integration tools, supports BigData. Look at Talend Jumpstart Sandbox.
Ab Initio Applications and custom soultions for very high-volume data processing and data integration. Record-breaking.
kiji Open Source framework for collecting, analyzing, and serving entity data in real time.
HPCC HPCC (High Performance Computing Cluster) - open source platform for massive parallel-processing to solve Big Data problems
Pivotal Big Data solutions, Hadoop distribution
IBM Big Data Platform IBM's enterprise class big data and analytics platform – Watson Foundations.
HP Haven HP HAVEn Big Data platform = (Hadoop, Autonomy Corporation, Vertica, HP Enterprise Security Products).
Amazon Cloud Platform. Amazon Elastic Cloud Compute (Amazon EC2), Amazon Simple Storage Service (Amazon S3), Amazon Web Services (AWS), Amazon Redshift (Datawarehousing solutions), Amazon Elastic MapReduce.
Intel Partners with Claudera to provide server platforms for Big Data.
Signiant System to move large data sets into and out of the cloud (accross datacenters, parallel, encrypted). Signiant Media Shuttle, Signiant Media Exchange and Signiant Manager+Agents, Signiant SkyDrop.
BitYota Datawarehouse as a service
DataStax Delivers certified version of Apache Cassandra that is ready for heavy-duty production environments
Greenplum Bigdata analytics, in 2012 became a part of Pivotal.
Teradata Expandable relational datawarehouse system (since 1979), "shared nothing" architecture.
Splunk Captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualizations
Sumo Logic Cloud-based log management and analytics service. Uses advanced machine learning algorithms to whittle down mountains of log file data into common groupings.
qubole Creators of Facebook’s Big Data infrastructure and Apache Hive have leveraged their experience to deliver Qubole Data Service (QDS) – a cloud Big Data service offering the same advanced capabilities used by Big Data savvy organizations.
altiscale Altiscale Data Cloud is the first cloud service purpose-built to run Hadoop. We offer an on-demand, elastic solution on a pay-as-you-go basis


Good WebSites



Microsoft and Big Data ------------------------------

Hadoop and HDInsight: Big Data in Windows Azure:
  - http://msdn.microsoft.com/en-us/magazine/dn385705.aspx

HDInsight is an Apache Hadoop implementation that runs in globally distributed Microsoft datacenters.
It’s a service that allows you to easily build a Hadoop cluster in minutes when you need it.

Windows Azure Storage - can include NoSQL stores, SQL database, blobs, etc. REST-ful API http). Support virtual machines to run Hadoop on Linux.


Cloud ------------------------------

Cloud Services:

http://talkincloud.com/talkin-cloud-top-100-cloud-services-providers-0 - top 100 Cloud service providers

Here are the top 10 (number = Rank, 1 - top rank)

1. Salesforce.com, San Francisco, CA
2. Amazon, Seattle, WA
3. Microsoft, Redmond, WA
4. Oracle, Redwood City, CA
5. Google, Mountain View, CA
6. SAP, Walldorf, Germany
7. SoftLayer (IBM), Dallas, TX
8. Terremark (a Verizon Company), Miami, FL
9. Rackspace, San Antonio, TX
10. NetSuite, San Mateo, CA


Misc ------------------------------

Some technologes to consider:


38 Seminal Articles Every Data Scientist Should Read (from datasciencecentral.com, August 2014)

External Papers

  1. Bigtable: A Distributed Storage System for Structured Data
  2. A Few Useful Things to Know about Machine Learning
  3. Random Forests
  4. A Relational Model of Data for Large Shared Data Banks
  5. Map-Reduce for Machine Learning on Multicore
  6. Pasting Small Votes for Classification in Large Databases and On-Line
  7. Recommendations Item-to-Item Collaborative Filtering
  8. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
  9. Spanner: Google's Globally-Distributed Database
  10. Megastore: Providing Scalable, Highly Available Storage for Interactive Services
  11. F1: A Distributed SQL Database That Scales
  12. APACHE DRILL: Interactive Ad-Hoc Analysis at Scale
  13. A New Approach to Linear Filtering and Prediction Problems
  14. Top 10 algorithms on Data mining
  15. The PageRank Citation Ranking: Bringing Order to the Web
  16. MapReduce: Simplified Data Processing on Large Clusters
  17. The Google File System
  18. Amazon's Dynamo

DSC Internal Papers

  1. How to detect spurious correlations, and how to find the ...
  2. Automated Data Science: Confidence Intervals
  3. 16 analytic disciplines compared to data science
  4. From the trenches: 360-degree data science
  5. 10 types of regressions. Which one to use?
  6. Practical illustration of Map-Reduce (Hadoop-style), on real data
  7. Jackknife logistic and linear regression for clustering and predict...
  8. A synthetic variance designed for Hadoop and big data
  9. Fast Combinatorial Feature Selection with New Definition of Predict...
  10. Internet topology mapping
  11. 11 Features any database, SQL or NoSQL, should have
  12. 10 Features all Dashboards Should Have
  13. Clustering idea for very large datasets
  14. Hidden decision trees revisited
  15. Correlation and R-Squared for Big Data
  16. What Map Reduce can't do
  17. Excel for Big Data
  18. Fast clustering algorithms for massive datasets
  19. The curse of big data
  20. Interesting Data Science Application: Steganography


Video Tutorials


Cloudera Training

Big Data

Hadoop - 15 video tutorials



- nosql mongo db tutorial (multiple videos)

Hadoop MapReduce - 5 videos



Cloudera Video Tutorials

Real-Time Analytics - videos