MongoDB Atlas Data Lake is a new member of the MongoDB Atlas family which has just been announced at MongoDB World and is available in public beta. It brings the technology that has made MongoDB the most popular document database in the world to the great data lakes of the cloud. As companies have accumulated more and more data in cloud storage such as Amazon S3, the need to process that data effectively has grown with it.
With MongoDB Atlas Data Lake, you use the MongoDB Query Language, which is built for rich, complex structures, to work with data stored in JSON, BSON, CSV, TSV, Avro, and Parquet formats. Data is analyzed on demand with no infrastructure setup and no time-consuming transformations, pre-processing, or metadata management. There's no schema to pre-define, so you can start working with your data faster.
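To make that concrete, here is a minimal sketch of querying S3-backed data with the MongoDB Query Language through the Python driver. The connection URI, database, and collection names are placeholders for illustration, not values from this announcement:

```python
# A minimal sketch, assuming a pymongo client pointed at a Data Lake
# connection string. The URI, database name, and collection name are
# placeholders, not real values.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://user:password@datalake0-example.query.mongodb.net/"
    "?ssl=true&authSource=admin"
)

# Query S3-backed data with ordinary MQL -- no schema definition and
# no ETL step beforehand.
sales = client["datalake"]["sales"]
pipeline = [
    {"$match": {"region": "EMEA"}},
    {"$group": {"_id": "$product", "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},
]
for doc in sales.aggregate(pipeline):
    print(doc)
```

The point is that the aggregation pipeline above is the same one you would run against an operational MongoDB cluster; only the connection string changes.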
As an on-demand service available in MongoDB's Atlas cloud data platform, there's no deployment process. All you need to begin your data exploration is to provide access to your S3 storage buckets. Users configure Atlas Data Lake from the same UI as MongoDB Atlas operational clusters, through a simple wizard that configures permissions, grants read-only access to their S3 buckets, and maps S3 directories to databases and collections, ready to run queries (a sketch of such a mapping follows below). Atlas Data Lake will also provide statistics on queries executed, data scanned and returned, and average execution time.
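The shape of that directory-to-collection mapping might look something like the following, expressed here as a Python dict. The bucket name, prefixes, store name, and collection names are all hypothetical, and the exact configuration schema belongs to Atlas; this is only meant to show the idea of pointing collections at S3 paths:

```python
# An illustrative mapping of S3 directories to databases and
# collections, written as a Python dict. Bucket, prefix, and
# collection names are hypothetical; in practice the mapping is
# managed through the Atlas UI wizard.
storage_mapping = {
    "stores": [
        {
            "name": "s3store",            # a label for the S3 store
            "provider": "s3",
            "bucket": "example-analytics-bucket",
            "region": "us-east-1",
        }
    ],
    "databases": {
        "datalake": {
            # Each collection reads the files matching a path pattern.
            "sales":  {"store": "s3store", "path": "/sales/*.json"},
            "events": {"store": "s3store", "path": "/events/*.parquet"},
        }
    },
}
```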
By leveraging the MongoDB Query Language, you can apply one skill set to both your data lake and your transactional databases. And it isn't just the query language that works with Data Lake: the service is compatible with MongoDB drivers, the MongoDB Shell, MongoDB Compass, and the MongoDB BI Connector. That means applications written in JavaScript, Perl, Python, C, C++, Java, Ruby, Go, Scala, R, and many other languages can access Data Lake using the drivers MongoDB users deploy every day. Data scientists will be able to use tools such as RStudio with our R driver, or Jupyter Notebooks with our Python driver, for statistics, machine learning, and data lake analytics.
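For instance, a data scientist in a Jupyter notebook could pull Data Lake query results straight into pandas for analysis. This is a sketch under the same assumptions as above; the URI, namespace, and field names are placeholders:

```python
# A minimal sketch of loading Data Lake results into pandas inside a
# Jupyter notebook. The URI, database, collection, and field names
# are placeholders.
import pandas as pd
from pymongo import MongoClient

client = MongoClient(
    "mongodb://user:password@datalake0-example.query.mongodb.net/"
    "?ssl=true&authSource=admin"
)
events = client["datalake"]["events"]

# Read a filtered slice of the S3-backed collection into a DataFrame,
# dropping the _id field via the projection.
df = pd.DataFrame(list(events.find({"type": "purchase"}, {"_id": 0})))
print(df.describe())
```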
Behind the scenes, MongoDB Atlas Data Lake currently deploys multiple compute nodes to analyze each S3 bucket and process queries against that bucket's data. These nodes work in parallel, and in the bucket's region, for fast processing and to minimize data transfer and its associated cost. When done, each node returns its results to a central node that sorts, filters, and aggregates the partial results into a final result as needed. For the Data Lake user, this process is entirely transparent, letting them get on with extracting value and insight from the data. It also means there is no fixed limit on the number of concurrent queries against the data. Future enhancements to the compute node architecture will likewise be transparent to the user.
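The pattern described above is a classic scatter-gather. Purely as a conceptual illustration, and not MongoDB's implementation, it looks like this: workers each process their partition of the data in parallel, and a central step merges the partial results.

```python
# A conceptual sketch of the scatter-gather pattern described above --
# not MongoDB's implementation. Each "node" processes its partition of
# the data in parallel; a central step combines the partial results.
from concurrent.futures import ThreadPoolExecutor

def process_partition(records):
    """Stand-in for one compute node scanning its share of S3 data."""
    return sum(r["amount"] for r in records)

partitions = [
    [{"amount": 3}, {"amount": 5}],   # node 1's share
    [{"amount": 7}],                  # node 2's share
    [{"amount": 2}, {"amount": 8}],   # node 3's share
]

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(process_partition, partitions))

# Central node: merge the partial results into the final answer.
print(sum(partials))                  # 25
```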
MongoDB Atlas Data Lake is designed to get the best from your data lake with the tools and platforms you already use, whether you want to analyze data, build data services, feed machine learning and AI or build active archives.
For more information, check out the Data Lake product page or learn more by reading our documentation.