The MongoDB engineering team has recently made a series of significant updates to the MongoDB Connector for Hadoop. This makes it easier for Hadoop users to integrate real-time data from MongoDB – the most popular database for big data systems – with Hadoop for deep, offline analytics. The Connector exposes the analytical power of Hadoop's MapReduce to live application data from MongoDB, driving value from big data faster and more efficiently.
The Connector presents MongoDB as a Hadoop-compatible file system, allowing a MapReduce job to read from MongoDB directly without first copying the data to HDFS and eliminating the need to move terabytes of data across the network. MapReduce jobs can pass queries as filters, avoiding the need to scan entire collections, and can also take advantage of MongoDB’s rich indexing capabilities, including geospatial, text-search, array, compound and sparse indexes.
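As a rough illustration of this query pushdown, here is a minimal sketch of a MapReduce job that reads directly from a live collection, filters it with a MongoDB query, and writes aggregated counts back to MongoDB. It assumes the 1.x Connector's MongoInputFormat/MongoOutputFormat classes, the mongo.input.uri, mongo.output.uri and mongo.input.query properties, and Hadoop's org.apache.hadoop.mapreduce API; the demo.events collection and its status and type fields are purely illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BSONObject;

import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class EventCountJob {

    // Maps each MongoDB document to (type, 1).
    public static class EventMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(Object key, BSONObject doc, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(String.valueOf(doc.get("type"))), ONE);
        }
    }

    // Sums the counts for each event type.
    public static class EventReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read live data straight from MongoDB instead of copying it to HDFS first.
        conf.set("mongo.input.uri", "mongodb://localhost:27017/demo.events");
        // Write the aggregated results back out to MongoDB.
        conf.set("mongo.output.uri", "mongodb://localhost:27017/demo.event_counts");
        // Push a query filter down to MongoDB so only matching documents reach the
        // mappers; the filter can take advantage of any existing index.
        conf.set("mongo.input.query", "{\"status\": \"active\"}");

        Job job = Job.getInstance(conf, "event-count");
        job.setJarByClass(EventCountJob.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        job.setMapperClass(EventMapper.class);
        job.setReducerClass(EventReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With the filter in place, only documents matching {status: "active"} ever leave MongoDB, and the lookup can be satisfied by an index on that field.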
As well as reading from MongoDB, the Connector can also write the results of Hadoop jobs back out to MongoDB, to support real-time operational processes and ad-hoc querying.
Version 1.1 of the Connector adds support for MongoDB’s native BSON (Binary JSON) backup files. These files can be processed directly by Hadoop, whether they are stored in HDFS (co-located with TaskTrackers) or on local or cloud-based file systems such as Amazon S3.
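To process a mongodump backup rather than a live collection, only the input side of a job needs to change. Here is a sketch under the same assumptions as the job above, swapping in the Connector's BSONFileInputFormat; the class reuse and the input path are illustrative, and the path could equally point at a local directory or a cloud store (e.g. an s3n:// URI).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import com.mongodb.hadoop.BSONFileInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class BsonBackupJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Results still go back to a live MongoDB collection.
        conf.set("mongo.output.uri", "mongodb://localhost:27017/demo.event_counts");

        Job job = Job.getInstance(conf, "event-count-from-bson");
        job.setJarByClass(BsonBackupJob.class);

        // Read mongodump output (.bson files) instead of a live collection.
        job.setInputFormatClass(BSONFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("hdfs:///backups/demo/events.bson"));

        job.setOutputFormatClass(MongoOutputFormat.class);
        job.setMapperClass(EventCountJob.EventMapper.class);   // reuse the mapper above
        job.setReducerClass(EventCountJob.EventReducer.class); // reuse the reducer above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```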
In addition to the existing MapReduce, Pig, Hadoop Streaming (with Node.js, Python or Ruby) and Flume support, the new version of the Connector enables SQL-like queries from Apache Hive to be run across MongoDB data sets. The latest version allows Hive to access BSON files, with full support for MongoDB collections scheduled for the next release of the Connector later this year.
MongoUpdateWritable is another new feature of the Connector. It allows Hadoop to modify an existing output collection in MongoDB, rather than only writing to new collections. As a result, users can run incremental MapReduce jobs, for example to aggregate trends or match patterns on a daily basis, with the results then efficiently queried from a single collection in MongoDB.
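For illustration, here is a sketch of a reducer that emits MongoUpdateWritable update operations instead of plain documents, so each daily run increments counters in an existing collection. The constructor arguments shown (query, update document, upsert flag, multi-update flag) follow the Connector's published examples and may differ slightly between versions; the field names are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BasicBSONObject;

import com.mongodb.hadoop.io.MongoUpdateWritable;

// Emits update operations rather than new documents, so each daily run folds its
// results into an existing MongoDB collection instead of rewriting it.
public class IncrementalCountReducer
        extends Reducer<Text, IntWritable, NullWritable, MongoUpdateWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Match the existing document for this event type...
        BasicBSONObject query = new BasicBSONObject("_id", key.toString());
        // ...and increment its running total with this run's count.
        BasicBSONObject update =
                new BasicBSONObject("$inc", new BasicBSONObject("count", sum));
        // upsert = true creates the document on the first run; multi = false
        // updates a single matching document.
        context.write(NullWritable.get(), new MongoUpdateWritable(query, update, true, false));
    }
}
```

The driver for such a job would set MongoUpdateWritable as the output value class (with NullWritable keys) and keep MongoOutputFormat as the output format.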
The MongoDB Connector for Hadoop works as follows:
- It examines the MongoDB collection and calculates a set of splits from the data
- Each split is assigned to a node in the Hadoop cluster
- In parallel, Hadoop nodes pull the data for their splits from MongoDB (or BSON files) and process it locally
- Hadoop merges the results and streams the output back to MongoDB or BSON files
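How finely the collection is carved into splits is tunable. A small sketch, assuming the 1.x Connector's mongo.input.split_size property (names can vary between versions); the values shown are illustrative:

```java
import org.apache.hadoop.conf.Configuration;

public class SplitTuning {
    /**
     * Builds a Configuration that controls how the Connector carves the input
     * collection into splits: mongo.input.split_size is the approximate size of
     * each split in MB, and one Hadoop map task processes one split.
     */
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        conf.set("mongo.input.uri", "mongodb://localhost:27017/demo.events");
        conf.set("mongo.input.split_size", "16");
        return conf;
    }
}
```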
Mike O’Brien, MongoDB software engineer and maintainer of the MongoDB Connector for Hadoop, demonstrated its new features in a recent webinar, which is now available for on-demand viewing.
Following on from Mike’s webinar, we will also host a new session on Wednesday 21st August exploring the big data use cases of MongoDB and Hadoop, and the value of integrating them to create a big data pipeline.
In summary, the MongoDB Connector for Hadoop adds to the broadest set of query and data analysis capabilities of any NoSQL database, including:
- The MongoDB API, which was recently adopted by IBM as the new standard for building mobile applications;
- The MongoDB aggregation framework, which provides functionality similar to SQL GROUP BY operations (see the sketch after this list);
- Multiple integrations with leading BI tool vendors, including QlikTech, Actuate, Informatica, JasperSoft, Pentaho and Talend, to perform BI on live data;
- Native MapReduce within MongoDB when integration with Hadoop isn’t needed;
- The MongoDB Connector for Hadoop itself, enabling integration with Hadoop MapReduce jobs, such as aggregating data from multiple input sources, or as part of Hadoop-based data warehousing or ETL workflows.
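To make the GROUP BY comparison above concrete, here is a minimal sketch using the 2.x Java driver's aggregate() helper, again assuming an illustrative demo.events collection with a type field; it is roughly equivalent to SELECT type, COUNT(*) FROM events GROUP BY type.

```java
import com.mongodb.AggregationOutput;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class GroupByExample {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        DBCollection events = client.getDB("demo").getCollection("events");

        // $group stage: count the documents for each distinct value of "type".
        DBObject group = new BasicDBObject("$group",
                new BasicDBObject("_id", "$type")
                        .append("count", new BasicDBObject("$sum", 1)));

        AggregationOutput out = events.aggregate(group);
        for (DBObject doc : out.results()) {
            System.out.println(doc);
        }
        client.close();
    }
}
```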
You can download the MongoDB Connector for Hadoop from GitHub.
Review the documentation, including details on how to get started and sample code.
If you have any questions, email the mongodb-user mailing list.
We’d also love to hear how you use the Connector to bring together MongoDB and Hadoop – feel free to comment below.