Guest Post: Apache in Space
The ApacheCon NA 2011 conference is rapidly approaching. The event takes place November 7 to 11 at the Westin Bayshore in Vancouver, Canada. Registration for the event is now open, with discounts available. In conjunction with ApacheCon NA 2011, OStatic is running a series of guest posts from movers and shakers in the Apache community. In this latest guest post, three officials from NASA’s Jet Propulsion Laboratory (JPL) introduce OODT (Object-Oriented Data Technology), an open source middleware suite for working with and managing data-intensive scientific applications. It’s used at JPL and overseen by the Apache Software Foundation.
Apache in Space
By Chris A. Mattmann, Andrew Hart, Emily Law
Preface: OODT (Object-Oriented Data Technology) is an open source middleware suite for managing, unifying, and archiving data used in data-intensive scientific applications at various organizations, including the NASA Jet Propulsion Laboratory, the U.S. National Cancer Institute, and the Whittier Virtual Pediatric Intensive Care Unit at Children's Hospital Los Angeles. Originally created at NASA in 1998, OODT was donated to the ASF in 2010 and is now a top-level Apache project. The "middleware for metadata" (and vice-versa) solution is used for computer processing workflow, file and resource management, information integration, and linking databases, allowing distributed computing and data resources to be searchable and utilized by any end user.
Introducing OODT: The Apache Object Oriented Data Technology (OODT) project provides everything that a data system developer needs to acquire, ingest, process, and disseminate gobs of information to interested science users, the general community, and decision and policy makers for the benefit of society.
Originally developed within the confines of NASA’s Jet Propulsion Laboratory (JPL), Apache OODT was donated by NASA to the Apache Software Foundation (ASF) in January 2010. The project graduated from the Apache Incubator in November 2010, and has been a thriving Apache community in residence ever since.
The core of OODT comprises two families of components.
The first is the family of Information Integration components. These components include the Profile server, responsible for unlocking metadata (“data about data”) from catalogs and registries that house the information in heterogeneous formats guided by similarly heterogeneous models, and the Product server, which similarly unlocks data (the actual bits, e.g., a science image or a scientific measurement such as temperature) from archives and repositories that house the data in heterogeneous ways. OODT provides out-of-the-box Profile and Product server implementations such as XMLPS, an XML-configurable Profile/Product server combination that connects to JDBC-accessible databases and unlocks their information, and OPeNDAPps, an XML-configurable Profile server that extracts metadata from scientific datasets served over OPeNDAP, a popular protocol and data model for scientific information. The OODT Web Grid substrate exposes Profile and Product servers using a REST-based interface.
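To make the Profile/Product split concrete, here is a minimal sketch in Python. It is not OODT's API: the class names, method names, and sample records are all hypothetical, and plain in-memory dictionaries stand in for the catalogs and archives a real deployment would connect to. It only illustrates the division of labor described above, where one service answers questions about metadata and the other hands back the actual bits.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Metadata describing one product ("data about data")."""
    product_id: str
    metadata: dict


class ProfileServer:
    """Hypothetical stand-in: answers metadata queries against a catalog."""

    def __init__(self, catalog):
        self._catalog = catalog  # maps product_id -> metadata dict

    def query(self, key, value):
        # Return a profile for every product whose metadata matches.
        return [
            Profile(product_id, dict(metadata))
            for product_id, metadata in sorted(self._catalog.items())
            if metadata.get(key) == value
        ]


class ProductServer:
    """Hypothetical stand-in: returns the actual bits from an archive."""

    def __init__(self, archive):
        self._archive = archive  # maps product_id -> raw bytes

    def retrieve(self, product_id):
        return self._archive[product_id]


# Made-up sample data, for illustration only.
catalog = {
    "img-001": {"instrument": "camera", "format": "image/png"},
    "tmp-001": {"instrument": "thermometer", "format": "text/csv"},
}
archive = {
    "img-001": b"\x89PNG...",
    "tmp-001": b"time,kelvin\n0,288.1\n",
}

profiles = ProfileServer(catalog)
products = ProductServer(archive)

# First find what exists (metadata), then fetch it (data).
matches = profiles.query("instrument", "thermometer")
data = products.retrieve(matches[0].product_id)
```

A client typically discovers products through the metadata side first and only then pulls the data itself, which is why the two lookups are kept separate in the sketch.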
The second set is the Catalog and Archive Server (CAS) components. These components include services for File Management, Workflow Management, and Resource Management, and client-side frameworks including remote file acquisition (Push/Pull), automatic file identification, crawling, and ingestion (Crawler), and a science algorithm wrapper (CAS-PGE). The File Management component tracks files and their associated metadata, and provides a means to classify and categorize the files. The Workflow Management component models data flow and control flow, and lets scientists and users specify sequences of algorithms to perform functions like data mining, event generation (email notification), and condition checking. The Resource Management component takes workflow tasks and sends them out to the cloud, to compute clusters, or even to your local laptop. The CAS Push/Pull component acquires remote content and makes it available to the CAS Crawler component. The Crawler takes the content, identifies the files within it that are important (using the Apache Tika framework), extracts metadata from those files (again using Apache Tika), and then ships the extracted metadata and files to the File Management component. Finally, CAS-PGE allows science algorithms to be fed their required nurturing information (switches, command-line options, start/end time ranges, spatial ranges, data files, environment variables, etc.) from the File Manager, Workflow Manager, and Resource Manager, then runs the underlying algorithm, making sure to catalog its output data files and metadata (using the Crawler) and get that information back into the system. Chris Mattmann will spend a bunch of time discussing these components during his Supercharging your OODT CAS deployments talk at ApacheCon NA 2011.
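The Crawler-to-File-Manager handoff described above can be sketched in a few lines of Python. Again, this is an illustration under stated assumptions rather than OODT code: `FileManager`, `extract_metadata`, and `crawl` are hypothetical names, the catalog is an in-memory dictionary, and Python's standard `mimetypes` module stands in for Apache Tika by guessing a type from the file name alone (Tika inspects file content as well).

```python
import mimetypes
import tempfile
from pathlib import Path


class FileManager:
    """Hypothetical in-memory catalog: tracks files and their metadata."""

    def __init__(self):
        self.catalog = {}

    def ingest(self, path, metadata):
        self.catalog[path.name] = metadata


def extract_metadata(path):
    # Stand-in for the Tika step: guess the type from the file name only.
    mime_type, _ = mimetypes.guess_type(path.name)
    return {"mime_type": mime_type, "size_bytes": path.stat().st_size}


def crawl(staging_dir, file_manager, wanted_types):
    """Ingest files whose detected type is wanted; return their names."""
    ingested = []
    for path in sorted(Path(staging_dir).iterdir()):
        if not path.is_file():
            continue
        metadata = extract_metadata(path)
        if metadata["mime_type"] in wanted_types:
            file_manager.ingest(path, metadata)
            ingested.append(path.name)
    return ingested


# The staging directory plays the role of content delivered by Push/Pull.
with tempfile.TemporaryDirectory() as staging:
    Path(staging, "readings.txt").write_bytes(b"time,kelvin\n0,288.1\n")
    Path(staging, "scratch.bin").write_bytes(b"\x00\x01\x02")

    fm = FileManager()
    ingested = crawl(staging, fm, wanted_types={"text/plain"})
```

Only the file whose detected type is on the wanted list reaches the catalog; the other is skipped, mirroring the "identify the files that are important" step.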
Apache OODT has recently added a number of user interfaces, primarily geared towards providing intuitive browser-based methods for interacting with OODT. For example, the OODT OpsUI, which leverages the Apache Wicket web application framework, provides comprehensive operational insight into a running OODT data system. Andrew Hart will discuss the OpsUI and OODT interface development in his talk at ApacheCon NA 2011.
The OODT framework has been adopted on mission-critical projects by organizations as diverse as NASA, the National Institutes of Health, and the Square Kilometre Array South Africa project. The growing user community benefits from code contributions from industry (Google, Vdio), academia (USC and others), and government agencies. Emily Law will highlight the various deployments of OODT in her ApacheCon NA 2011 talk.
Come join us in the special Apache in Space! (OODT) track to learn about the system, the community, and the software that are helping to capture data across a diverse set of disciplines, in one of the fastest-growing projects at Apache! Register at http://apachecon.com/
Bios:
Chris A. Mattmann is a senior computer scientist at the NASA Jet Propulsion Laboratory, working on instrument and science data systems on earth science missions and informatics tasks. He’s also an adjunct assistant professor in the Computer Science Department at the University of Southern California. His research interests are primarily software architecture and large-scale data-intensive systems. Mattmann received his PhD in computer science from the University of Southern California. He’s a senior member of IEEE. Contact him at mattmann@apache.org.
Andrew Hart is a staff software engineer at the NASA Jet Propulsion Laboratory (JPL), where he works in the Data Management Systems and Technologies group. He brings his computer science background and creative talents to the design and development of data management components and web interfaces for a wide range of projects, including an informatics infrastructure for the National Cancer Institute’s Early Detection Research Network (EDRN), data modeling and architectural support for the Laura P. and Leland K. Whittier Virtual Pediatric Intensive Care Unit (Whittier VPICU) at Children’s Hospital Los Angeles, and web services for JPL's Regional Climate Model Evaluation System. He is also an active contributor to the Object Oriented Data Technology (OODT) project, which last year became a TLP at Apache.
Emily Law has over twenty years of experience in the research, development, and management of complex information systems. Ms. Law has worked for both large and small organizations, having served on a diverse set of projects during her career. Since 1996, Ms. Law has been working at NASA’s Jet Propulsion Laboratory, where she has provided leadership and management in the architecture, development, and operations of highly distributed ground data systems for planetary exploration and earth science. She currently serves as a deputy program manager and development manager to two separate directorates covering data systems in solar system research and earth science. In 2005, she was the recipient of the NASA Exceptional Achievement Medal in recognition of outstanding contributions to the NASA Deep Space Network. She currently leads operations for NASA’s Planetary Data System, NASA’s official archive to manage results from solar system research, and leads development of the Lunar Mapping and Modeling Project data management infrastructure and portal in support of Lunar Exploration activities. Ms. Law has B.S. degrees in Math and Computer Science from Cal State Long Beach and is a graduate of the University of Southern California, where she holds an M.S. degree in Computer Engineering.