MongoDB Sets Up Real-Time Analytics Muscle with Apache Spark Connector

by Ostatic Staff - Jul. 05, 2016

The MongoDB World meetup took place last week and brought a number of interesting announcements, including several about connecting the open source database to Apache Spark. From cloud developers integrating databases into their deployments to enterprises that want more flexibility from their data repositories, open source databases are flourishing, and MongoDB is a leader in this area.

At last week's event, the company announced the MongoDB Connector for Apache Spark. It is billed as "a powerful integration that enables developers and data scientists to create new insights and drive real-time action on live, operational, and streaming data."

MongoDB worked closely with Databricks, the company founded by the team that created the Apache Spark project, and the connector has received Databricks Certified Application status for Spark. According to MongoDB leaders, "the certification means that developers can focus on building modern, data driven applications, knowing that the connector provides seamless integration and complete API compatibility between Spark processes and MongoDB."

A blog post notes:

"The connector enables developers to build more functional applications faster and with less complexity, using a single integrated analytics and database technology stack. With industry estimates assessing that data integration consumes 80 percent of analytics development, the connector enables data engineers to eliminate the requirement for shuttling data between separate operational and analytics infrastructure. Each of these systems demands their unique configuration, maintenance and management requirements."

"Written in Scala, Apache Spark’s native language, the connector provides a more natural development experience for Spark users. The connector exposes all of Spark’s libraries, enabling MongoDB data to be materialized as DataFrames and Datasets for analysis with machine learning, graph, streaming and SQL APIs, further benefiting from automatic schema inference. The connector also takes advantage of MongoDB’s aggregation pipeline and rich secondary indexes to extract, filter, and process only the range of data it needs – for example, analyzing all customers located in a specific geography."

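To make the geography example concrete, here is a minimal Python sketch of the kind of aggregation pipeline such a filter implies. The helper name customers_in_region_pipeline and the field names (address.country, name, address.city) are hypothetical and do not come from MongoDB's announcement; only the $match and $project stage operators are standard MongoDB. The sketch just builds the pipeline and serializes it to JSON. How that pipeline is handed to the connector depends on the connector version and is not shown here.

```python
import json


def customers_in_region_pipeline(country, fields):
    """Build an aggregation pipeline that keeps customers from one country.

    Both the "address.country" field and the projected field names are
    hypothetical; substitute the names used in your own collection.
    """
    # $match runs first so the database can use an index on the filter field
    # and return only the matching documents.
    match_stage = {"$match": {"address.country": country}}
    # $project trims each document down to the fields the analysis needs.
    project_stage = {"$project": {name: 1 for name in fields}}
    return [match_stage, project_stage]


pipeline = customers_in_region_pipeline("DE", ["name", "address.city"])

# Serialize the pipeline so it can be passed around as a plain string.
pipeline_json = json.dumps(pipeline)
print(pipeline_json)
```

Placing the filter stage first is the point of the design: the database does the narrowing, so the analytics job receives only the range of data it needs.
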
“Users are already combining Apache Spark and MongoDB to build sophisticated analytics applications. The new native MongoDB Connector for Apache Spark provides higher performance, greater ease of use, and access to more advanced Apache Spark functionality than any MongoDB connector available today,” said Reynold Xin, co-founder and chief architect of Databricks.

“Combining Apache Spark, the leading open-source big data analytics processing engine in the Apache Software Foundation, with MongoDB, the industry’s fastest-growing database, enables organizations to fully realize the potential of real-time analytics,” said Eliot Horowitz, co-founder and CTO of MongoDB. “Spark jobs can be executed directly against operational data managed by MongoDB, without the time and expense of Extract Transform Load (ETL) processes. MongoDB can efficiently index and serve analytics results back into live, operational processes, making them smarter, more contextual and responsive to events as they happen.”

Some folks in the artificial intelligence arena are already taking an interest in the new connector. “Building an artificial intelligence (AI) application requires huge amounts of data to be processed at once, both reliably and efficiently,” said Jeff Smith, Data Engineering Team Lead, x.ai, and author of Reactive Machine Learning Systems. “To store all that data, we use MongoDB for its flexible data model and its scaling capabilities. And to process all of that data to build machine learning models, we build robust pipelines in Scala using the distributed data processing capabilities of Apache Spark. Now, with the new native MongoDB Connector for Apache Spark, we have an even better way of connecting up these two key pieces of our infrastructure. We're rapidly building out a personal assistant who schedules meetings nearly flawlessly, and our datasets are increasing at an exponential rate. We believe the new connector will help us move faster and build reliable machine learning systems that can operate at massive scale.”

You can download the MongoDB Connector for Apache Spark now, and review the documentation.