A FRAMEWORK FOR COLLABORATIVE REVIEW OF CANDIDATE EVENTS IN HIGH DATA RATE STREAMS: THE V-FASTR EXPERIMENT AS A CASE STUDY

Published 2014 December 16. © 2015 The American Astronomical Society. All rights reserved.

Citation: Andrew F. Hart et al. 2015 AJ 149 23, DOI 10.1088/0004-6256/149/1/23

ABSTRACT

"Fast radio transients" are defined here as bright millisecond pulses of radio-frequency energy. These short-duration pulses can be produced by known objects such as pulsars or potentially by more exotic objects such as evaporating black holes. The identification and verification of such an event would be of great scientific value. This is one major goal of the Very Long Baseline Array (VLBA) Fast Transient Experiment (V-FASTR), a software-based detection system installed at the VLBA. V-FASTR uses a "commensal" (piggy-back) approach, analyzing all array data continually during routine VLBA observations and identifying candidate fast transient events. Raw data can be stored from a buffer memory, which enables a comprehensive off-line analysis. This is invaluable for validating the astrophysical origin of any detection. Candidates discovered by the automatic system must be reviewed each day by analysts to identify any promising signals that warrant a more in-depth investigation. To support the timely analysis of fast transient detection candidates by V-FASTR scientists, we have developed a metadata-driven, collaborative candidate review framework. The framework consists of a software pipeline for metadata processing composed of both open source software components and project-specific code written expressly to extract and catalog metadata from the incoming V-FASTR data products, and a web-based data portal that facilitates browsing and inspection of the available metadata for candidate events extracted from the VLBA radio data.

1. INTRODUCTION

One of the current primary goals of radio astronomy is to explore and understand the "dynamic radio sky" (Cordes et al. 2004). In contrast to generating catalogs of known sources, this scientific thrust focuses on transient events, or transient signals generated by persistent yet time-varying sources. We do not yet fully understand the scope and distribution of different transient sources, which range from the known (e.g., active galactic nuclei, brown dwarfs, flare stars, X-ray binaries, supernovae, gamma-ray bursts) to the probable (e.g., exoplanets), to the possible (e.g., ET civilizations, annihilating black holes). As noted by Cordes et al. (2004, p.14), "most exciting would be the discovery of new classes of sources" (italics in original). Radio telescopes continue to increase their data collecting abilities, observing the sky with progressively finer time resolution. Of current particular interest is the detection and characterization of "fast radio transients," which last for only small fractions of a second.

The V-FASTR experiment (Wayth et al. 2011) is one of a new breed of radio astronomy experiments specifically targeting fast transient radio signals. The experiment is conducted in a fully commensal (passive) fashion, searching for signals in the data gathered during the regular processing activities of its host instrument. Unlike more traditional, single-telescope observations, however, the V-FASTR experiment simultaneously utilizes anywhere between 2 and 10 telescopes of the National Radio Astronomy Observatory's (NRAO) Very Long Baseline Array (VLBA) (Romney 2010). The VLBA consists of ten 25 m telescopes positioned geographically such that no two lie within each other's local horizon, and the V-FASTR experiment leverages this configuration to better discriminate between instances of terrestrial Radio Frequency Interference (RFI) and potentially genuine astronomical pulses (Thompson et al. 2011).

The huge volumes of raw time-series voltage data generated by the VLBA in the course of its operation make storing the full record of an entire observing session infeasible at the present time. As a consequence, considerable effort has been devoted to developing and fine-tuning algorithms for the real-time identification of potentially interesting signals in the noisy and often incomplete data (Thompson et al. 2011; Wayth et al. 2012). All data selected by the real-time processing step is subsequently inspected, on a daily basis, by members of the geographically distributed V-FASTR science team and either discarded as spurious or archived offline for full analysis at a later date.

The V-FASTR experiment must therefore operate within two important resource constraints: it cannot archive the full observational record because of limited storage, and there is a practical limit on the workload of the human analysts who review candidate detections. To address the latter, we have developed a metadata-driven, collaborative candidate review framework for the V-FASTR experiment. The framework comprises a set of software components dedicated to the automatic capture and organization of metadata describing the candidate events identified as interesting by the automated algorithms, and an online environment for the collaborative perusal and inspection of related imagery data by the V-FASTR analysis team.

The rest of this paper describes the system as follows. In Section 2 we describe our project in a more general context. Section 3 presents the methodology and an architectural description of the system. We follow with an evaluation of our experience deploying the framework in Section 4, and offer conclusions and future directions for the work in Section 5.

2. BACKGROUND

To better understand the context of the system implementation presented in Section 3, we first briefly introduce the V-FASTR experiment and describe the development of scientific data systems at the NASA Jet Propulsion Laboratory (JPL). We then describe the Object Oriented Data Technology (OODT) project, an open source information integration platform that plays a central role in our framework. Finally, we briefly touch upon several related efforts at developing online tools to collaboratively classify and validate scientific observations.

2.1. V-FASTR: The VLBA Fast TRansients Experiment

V-FASTR (VLBA Fast TRansients) is a data analysis system used by the VLBA to detect candidate fast transient events. Principal investigators submit observing proposals to the VLBA targeted at galaxies, supernovae, quasars, pulsars, and more. V-FASTR analyzes all data collected by the VLBA as part of routine processing and produces a nightly list of candidates identified within the data processed that day. The raw data for each candidate is temporarily saved in case it is needed to interpret or follow up on a particularly promising or unusual detection. However, the raw data consumes significant disk space and therefore the candidate list must be reviewed on a timely basis by experts. False positives can be deleted and their disk space reclaimed, while truly interesting events can be re-processed to enable the generation of a sky image to localize the source of the signal. Software tools that streamline and simplify this review process are therefore highly valued by candidate reviewers and can have a positive impact on other similar efforts throughout the world.

2.2. Data System Development at JPL

The Data Management Systems and Technologies group at JPL develops software ground data systems to support NASA science missions. These pipelines are specifically optimized to support the data-intensive and computationally intensive processing steps often needed to convert raw remote-sensing observations into higher level data products at scale so that they can be interpreted by scientists. The process almost always involves developing a close collaboration with project scientists to obtain an understanding of the processing algorithms involved, a sense of the scale and throughput requirements, and the other operational constraints of the expected production environment.

Over the years the group has developed a diverse portfolio of data system experience across a broad spectrum of domains including earth and climate science (Mattmann et al. 2009; Hart et al. 2011; Tran et al. 2011), planetary science, astrophysics, snow hydrology, radio astronomy, cancer research (Crichton et al. 2001), and pediatric intensive care (Crichton et al. 2011).

2.3. Open Source and OODT

One product of this long track record in scientific data processing systems is a suite of software components known as Object Oriented Data Technology (OODT). OODT originally arose out of a desire on the part of NASA's Office of Space Science to improve the return on investment of individual mission data systems by leveraging commonalities in their design to create a reusable platform of configurable components, on top of which mission-specific customizations could be made. OODT thus represents both an architecture and a reference implementation. Its components communicate with one another over standard, open protocols such as XML-RPC and can be used either individually or coupled together to form more complex data processing pipelines.

In 2009 OODT began the transition from a JPL-internal development project to a free and open source software project at the Apache Software Foundation (ASF). Graduating to a top-level project in 2011, OODT has since undergone several public releases at the ASF and is in use by a varied group of scientific and commercial endeavors. As we will describe further in Section 3, several OODT components form the core platform of our candidate validation framework. The ready availability of OODT components under a liberal license, combined with their substantial pedigree, made them appealing to our project for both schedule and budgetary reasons.

2.4. Related Work

In the following subsections we identify several ongoing efforts that also utilize online tools to assist in the collaborative review and classification of scientific observations.

2.4.1. Astropulse

Astropulse is part of a series of sky surveys for radio pulses conducted by the Search for Extraterrestrial Intelligence (SETI) group at the University of California, Berkeley (Siemion et al. 2010). The Astropulse project surveys the sky from the Arecibo Observatory in Puerto Rico, searching for short (microsecond) broadband radio-frequency pulses. While Astropulse's use of Arecibo's enormous single-dish telescope affords excellent sensitivity, V-FASTR's ability to perform continent-scale baseline interferometry yields much greater positional accuracy when attempting to localize the source of a signal.

As a variant of the SETI@home project, Astropulse utilizes the same distributed, collaborative volunteer computing infrastructure accumulated over the years by that effort to perform a number of computationally intensive transformations and calculations on the data in an attempt to better classify the origin of any signals detected. The use of volunteer computing to perform units of computational work is an appealing approach that obviates the need to directly acquire sufficient hardware for the processing demands. However, the fully automated nature of the approach is not a natural fit for V-FASTR's manual review requirement.

2.4.2. Galaxy Zoo

Galaxy Zoo is an Internet-based project that relies on the help of volunteers to classify a very large database of galaxy images recorded by either the Sloan Digital Sky Survey or the Hubble Space Telescope. Users are asked to classify galaxies based on shape, color, and direction of rotation, and to report possible unidentified features. The rationale behind human intervention is that manual classification is more accurate and insightful than anything that can currently be achieved by an automatic program. To date, the project has met with success that far exceeded expectations: more than 250,000 volunteers have helped classify millions of images, resulting in the confirmation of important scientific hypotheses, the formulation of new ones, and the discovery of interesting new objects.

While Galaxy Zooʼs tactic of appealing to volunteers to mitigate the challenge of image classification at scale is attractive, the paradigm does not translate well to the V-FASTR setting due to differences in the nature of the archives between the two projects. Whereas Galaxy Zoo permits its volunteer reviewers to leisurely peruse and mine a largely static image archive, the rapidly growing data volumes associated with ongoing V-FASTR observations dictate that reviews must be regularly scheduled to keep the project within its resource limits.

2.4.3. Foldit: The Protein Folding Game

Foldit (Cooper et al. 2010) is a collaborative online protein folding game developed by the Center for Game Science at the University of Washington, and it represents a "crowd-sourced" attempt to solve the computationally challenging task of predicting protein structure. Proteins, chains of amino acids, play a key role in a wide range of human diseases, but comparatively little is known about how they contort themselves into the specific shapes that determine their function. Because of the scale and complexity of the challenge, the researchers behind Foldit have turned to the puzzle-solving capabilities of human beings for assistance. After learning the rules on simple challenges, players compete against one another to design alternative protein structures, with the goal of arriving at an arrangement that minimizes the total energy needed to maintain the shape.

Foldit has created an environment in which the unknown and diverse strategies of its human participants become a core strength. Furthermore, by presenting the scientific activity as a competitive game, the project, which currently boasts over 400,000 players, has shown that it is possible to recruit and leverage human processing power at scale. This provides an interesting model for other projects, including V-FASTR, which at some point may rely upon a human element to augment or improve automated processes.

3. IMPLEMENTATION

In this section we provide details on the implementation of our metadata-driven framework for online review of V-FASTR candidate detection events. We describe our methodology and the considerations that informed our design, followed by a presentation of the system architecture.

3.1. Development Methodology

Several factors influenced the development process and have left their imprint on the final architecture. We feel that our implementation is uniquely suited to the needs of the V-FASTR project precisely because these factors were identified early on and were thus able to influence all aspects of the design process.

3.1.1. Collaboration

As described in Section 2, our group has developed substantial experience in the design and implementation of data systems for a broad range of scientific domains. In each case, a close working relationship with members of the project science team was an essential ingredient to the success of the project, and our experience developing an online candidate review framework for V-FASTR was no different. As software engineers familiar with the challenges inherent in scientific data management, our intuitions about the technical challenges of the system served us well in scoping out the project timeline. However, it was our early and regular communication with members of the V-FASTR science team that was critical to obtaining the domain knowledge necessary to make accurate assumptions and to the early identification of issues. The current system architecture, covering both the back and front end elements, is a direct result of an ongoing feedback loop between the science and software teams.

3.1.2. Constraints

As mentioned in Section 2, V-FASTR is a commensal experiment that scans for fast transients in data that is already being collected as part of the regular third-party use of the VLBA instrument. As such, the experiment maintains a "guest" status on the NRAO computing infrastructure. Consequently, care must consistently be taken not to overtax NRAO system resources, including disk storage, CPU time, and network bandwidth. These physical constraints motivated many of the architectural decisions described in the following sections.

Each V-FASTR data product may contain hundreds of files, rooted at a top-level job directory, and includes two types of products: filterbank data (up to $\sim 100$ GB per job) and baseband voltage data (up to $\sim 10$ GB per job). The total data storage capacity available to V-FASTR is just $\sim 8$ TB, enough to contain $\sim 800$ jobs of $\sim 10$ GB each (on average). Because products are produced at an average rate of $\sim 10$–20 per day (but sometimes in the hundreds), the storage would be exhausted within a few weeks unless products are periodically reviewed by the science team analysts. During review, each candidate is either flagged for higher-resolution processing (and saved) or discarded as a false positive and its disk space reclaimed (see Figure 1 for an overview of the average data volumes per job at different processing stages). The desire to provide analysts with a streamlined method for this review process is at the very core of our design.
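For illustration only, assuming a steady rate of $\sim 15$ jobs per day at $\sim 10$ GB each (round figures taken from the estimates above, not measured rates), the unattended time to exhaust the allocation is roughly $8\ {\rm TB}/(15\ {\rm jobs\ day}^{-1}\times 10\ {\rm GB\ job}^{-1})\approx 53$ days, and a day with several hundred jobs would shorten this to well under a month; timely review is therefore essential.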

Figure 1. Depiction of the full V-FASTR data flow with volume estimates (per job) at each stage. The candidate review framework (both the metadata pipeline and web portal components) interacts with the metadata and derived products repository at the intersection of A and B above.

Similarly, the network bandwidth constraints of the host led us to a data transfer configuration that focused on metadata rather than requiring the complete transfer of raw, unreviewed, and possibly spurious detection data over the Internet. Instead, metadata sufficient to describe the salient characteristics of a candidate event to a trained analyst was transferred into our candidate review framework. This careful selection process had the beneficial side effect of greatly limiting the size of the transferred products, allowing for a considerably longer retention period on the $\sim 10$ TB archive hosted at JPL.
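As a back-of-the-envelope check using the per-product figures given in Section 3.2.1 ($\sim 35$ GB of raw data versus $\sim 10$ MB of metadata and images, transferred at $\sim 2$ MB s$^{-1}$), mirroring the metadata for 20 products requires only about $20\times 10\ {\rm MB}/(2\ {\rm MB\ s}^{-1})\approx 100$ s of transfer time per day, whereas mirroring the corresponding raw products would take $20\times 35\ {\rm GB}/(2\ {\rm MB\ s}^{-1})\approx 4$ days.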

Finally, security constraints were also critically important to the design, particularly because the system spans two separate security domains: NRAO and JPL. To comply with the security requirements of the host system, data transfer was configured on the NRAO system to allow read-only operations and was made accessible only to clients originating from the JPL domain. Furthermore, on the front-end, the functionality exposed by the web portal component interacted only with the local metadata archive, eliminating the possibility of corruption or inappropriate access to the raw observational data.

3.2. Architecture

As previously mentioned, the candidate review framework is driven by metadata describing the candidate events to be reviewed by V-FASTR analysts. To communicate this data from the raw source repository at the NRAO to an analyst using a browser anywhere in the world, we developed a software framework consisting of two principal components: a metadata pipeline that manages the capture, transfer, and storage of metadata annotations, and a web portal which provides analysts with a convenient, context-rich environment for efficiently classifying candidate events.

3.2.1. Metadata Pipeline

On the JPL side, the V-FASTR data products are processed through a metadata extraction and data archiving pipeline that eventually leads to the event candidates being available for inspection on the web portal. The pipeline is composed of three major software components: rsync, the OODT CAS Crawler, and the OODT File Manager, depicted in Figure 2.

  • rsync. Data products are automatically transferred from the NRAO staging area to the JPL server using rsync, a popular application and data transfer protocol that synchronizes the contents of a directory tree between two servers with minimal human intervention. It was chosen for its simplicity, high performance, reliability, and wide range of configuration options. With rsync, files are transferred in compressed form using delta encoding, so that only file differences are sent on subsequent transfers. For this project, an rsync server daemon was set up on the NRAO side to expose the data staging area where the products are collected. For security reasons, the daemon was restricted to allow read-only operations to clients originating from a designated JPL IP address. On the JPL side, an rsync client was set up to run hourly as a system cron job, transferring products to the JPL archive area. To minimize bandwidth usage, the client transfers only a very small subset of the files comprising a product directory tree, namely the detection images and the output and calibration files containing the metadata needed by the web portal (an illustrative sketch of this transfer-and-ingest flow follows this list). On average, this reduces the data product size by a factor of $3.5\times 10^{3}$: from an average of $\sim 35$ GB on the NRAO server (for a product with several detections) to $\sim 10$ MB on the JPL server. The rsync data transfer rate between the two servers was measured to be around $\sim 2$ MB s$^{-1}$, more than enough to transfer between 10 and 20 data products per day.
  • CAS Crawler. Once the data products are transferred to the JPL server, they are automatically detected by the OODT CAS Crawler daemon, which runs at sub-hourly intervals to pick up new products as soon as they become available. The Crawler is responsible for notifying the OODT File Manager and thereby starting the product ingestion process. For this deployment, the Crawler was configured to send this notification only if two preconditions are satisfied: (1) a similarly named product does not already exist in the File Manager catalog, and (2) the product directory contains a special marker file indicating that the product has been processed by the mail program and is therefore in a complete state (i.e., no files are missing).
  • CAS File Manager. The OODT CAS File Manager is a customizable software component responsible for processing and archiving a data product, making it available for query and access by clients. For this project, the File Manager was deployed with the default Apache Lucene metadata back-end and configured to archive products in place, i.e., without moving them to a separate archive directory; otherwise the rsync process would transfer them again from the NRAO server. Additionally, we leveraged the extensibility of the OODT framework by configuring the File Manager with custom metadata extractors written expressly to parse the information contained in the V-FASTR output and calibration files. Information is extracted at the three levels that comprise the hierarchy of a V-FASTR data product: job, scan, and event. Finally, a numerical algorithm was written to assign each pair of available images (-det.jpg and -dedisp.jpg) to the event that generated them.
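To make the flow above concrete, the following Python sketch illustrates how an hourly metadata mirror and a crawler-style precondition check might look. The rsync module URL, archive path, include patterns, marker-file name, and File Manager method name are all illustrative assumptions rather than the project's actual configuration or code.

```python
# Illustrative sketch only: paths, patterns, and API names below are assumptions,
# not the V-FASTR project's actual configuration.
import os
import subprocess
import xmlrpc.client

NRAO_RSYNC_MODULE = "rsync://nrao.example.org/vfastr-staging/"  # assumed read-only rsync daemon
JPL_ARCHIVE_DIR = "/data/vfastr/archive/"                       # assumed local archive area
FILEMGR_URL = "http://localhost:9000"                           # assumed CAS File Manager endpoint


def mirror_metadata():
    """Hourly cron task: pull only images and metadata files, never the raw voltage data."""
    subprocess.run([
        "rsync", "-avzn",            # -n (dry run) kept for illustration; drop it for real transfers
        "--include=*/",              # preserve the job/scan directory structure
        "--include=*-det.jpg",       # detection plot
        "--include=*-dedisp.jpg",    # dedispersed time-series plot
        "--include=*.out",           # output/calibration files (assumed extensions)
        "--include=*.cal",
        "--exclude=*",               # skip everything else (filterbank and baseband data)
        NRAO_RSYNC_MODULE, JPL_ARCHIVE_DIR,
    ], check=True)


def ready_for_ingest(job_dir, filemgr):
    """Crawler-style preconditions: job complete and not already cataloged."""
    product_name = os.path.basename(job_dir.rstrip("/"))
    complete = os.path.exists(os.path.join(job_dir, "JOB_COMPLETE"))   # assumed marker-file name
    already_ingested = filemgr.filemgr.hasProduct(product_name)        # assumed XML-RPC method name
    return complete and not already_ingested


if __name__ == "__main__":
    mirror_metadata()
    fm = xmlrpc.client.ServerProxy(FILEMGR_URL)
    for entry in sorted(os.listdir(JPL_ARCHIVE_DIR)):
        job_dir = os.path.join(JPL_ARCHIVE_DIR, entry)
        if os.path.isdir(job_dir) and ready_for_ingest(job_dir, fm):
            print("Ready for File Manager ingestion:", entry)
```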

Figure 2. Component diagram for the metadata pipeline component of the V-FASTR candidate review framework.

In general, a File Manager can store metadata in its back-end catalog as different object types. Each object type is defined to contain multiple metadata fields, where each field consists of a named key associated with one or more string values. For this project, the decision was made to maintain a one-to-one correspondence between a data product and the corresponding metadata ingested into the catalog. Rather than defining three object types for jobs, scans, and events, a single object type holds all information for a data product in one container, with dynamically named keys that encode the scan and event numbers. This decision was motivated by the desire to simplify and optimize querying by the web portal client, since all metadata for a product can then be retrieved through a single request to the File Manager. As a consequence, the default Apache Lucene metadata catalog implementation had to be slightly modified to allow the ingestion of dynamically named metadata fields.
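As a concrete illustration of this flattened scheme, the short sketch below shows one hypothetical way dynamically named keys encoding scan and event numbers might look, and how a client could rebuild the job/scan/event hierarchy from a single metadata dictionary. The key-naming convention, field names, and query method named in the comments are assumptions chosen only to illustrate the idea, not the actual catalog schema or API.

```python
# Hypothetical flattened metadata for a single job-level product; the key-naming
# convention shown here is an illustrative assumption, not the actual schema.
product_metadata = {
    "JobId":                     ["vlba_job_2012-06-14"],
    "Scan_003_StartTime":        ["2012-06-14T08:41:37"],
    "Scan_003_Event_012_DM":     ["56.8"],    # dispersion measure
    "Scan_003_Event_012_SNR":    ["8.4"],     # signal-to-noise ratio
    "Scan_003_Event_012_DetImg": ["scan003_ev012-det.jpg"],
}


def events_in_scan(metadata, scan):
    """Recover the event numbers for a given scan by parsing the encoded key names."""
    prefix = f"Scan_{scan:03d}_Event_"
    return sorted({key[len(prefix):len(prefix) + 3]
                   for key in metadata if key.startswith(prefix)})


# In the deployed system all of these fields would arrive in one response from the
# File Manager (e.g., via xmlrpc.client.ServerProxy(...).filemgr.getMetadata(...),
# an assumed method name), after which the portal rebuilds the hierarchy client-side:
print(events_in_scan(product_metadata, 3))   # -> ['012']
```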

3.2.2. Web Portal

The second major component of the candidate review framework is an interactive web portal. The primary purpose of the portal is to provide a convenient online environment for the location-independent perusal and assessment of potential candidates in context. The portal provides V-FASTR analysts with the ability to quickly navigate through the available information to identify candidates worthy of further inspection on a familiar web platform.

The portal has been implemented as a PHP web application using the Apache OODT Balance web framework running on top of the Apache HTTPD Web Server. OODT Balance was chosen here for its ability to easily integrate with the OODT components in the back-end metadata pipeline, namely the OODT CAS File Manager described earlier. Furthermore, the flexible, modular approach of the framework allowed us to quickly connect the web portal to the metadata repository and rapidly begin constructing the necessary views specific to the V-FASTR candidate review and validation use cases.

As Figure 3 shows, the web portal offers a variety of views of the available metadata, which are hierarchically organized to match the conceptual relationships in the data. At the highest level, a job or run might consist of multiple scans, each of which may itself contain multiple detection event candidates. This hierarchy, expressed in the metadata, is preserved in the layout of the portal views, and breadcrumb navigation is provided to facilitate orientation within the nested structure.

Figure 3. Screen shots of the initial version of the web portal component. From left to right: the portal home page displaying recent jobs and associated event counts; image metadata associated with an individual event candidate; and the full metadata listing, including associated scans, for an observation job.

At the level of an individual event candidate (Figure 3, middle image), two graphical representations of the event are available to assist analysts in classifying the nature of the signal. These images are generated automatically as part of the initial candidate identification process (Wayth et al. 2011), and they provide a trained analyst with the structural clues needed to rapidly assess whether the received signal is genuinely extraterrestrial in origin or merely a product of RFI.

To support both in-context metadata browsing and an analyst's need to rapidly peruse the image representations of an entire job (many events across many scans) at once, a compromise was struck: for each job, a portal user may choose either traditional hierarchical navigation or a flattened view in which all of the (possibly numerous) event candidates are presented simultaneously on screen and can be reviewed serially simply by scrolling.

Together, the metadata pipeline and the web portal constitute an end-to-end framework for capturing, archiving, and presenting metadata about detected transient event candidates to V-FASTR scientists. Furthermore, by providing a reliable process and flexible interface, the system directly streamlines the analysis process, boosting the overall efficiency of the project.

4. EVALUATION

As we have described in the previous section, the candidate review framework embraces the model of online collaborative validation of fast transient candidates by a team of geographically dispersed analysts, and improves the efficiency with which analysts may classify observational data. In this section we describe the early results of our experience with the operational deployment of the framework, as well as highlight several areas for the evolution of the tool to further enhance its utility.

4.1. Experience

The initial deployment of the collaborative review framework for operational use by the V-FASTR science team took place in early summer 2012. The immediate feedback was largely positive: analysts praised the capabilities of the system, the improved accessibility afforded by a web-based user interface, and the new ability to navigate rapidly through all detections in a given job or to peruse the different levels (scans and events) within a job individually. The biggest initial complaint was that too many mouse clicks were required to complete an analysis of all of the candidates in an entire job.

A consequence of the iterative feedback loop that developed between the software and science teams (described in Section 3) was that suggestions for improvement were repeatedly made, tracked, and acted upon. This process resulted in an updated release approximately every two weeks during the first few months of the deployment. The suggestions included the addition of various metadata fields identified as critical to the classification task, updates to the visual organization of the web portal views, and a relentless focus on reducing the number of mouse clicks required of analyst users.

At the time of this writing, the V-FASTR portal has been running operationally for several weeks, and we can draw some early conclusions about the usefulness of the system. Overall, as reported by the science team, the project has accomplished its broad goal of facilitating the collaborative task of inspecting and screening radio transient events. Because the system extracts all relevant metadata from the latest data products and presents it concisely on the web portal, scientists can now execute their tasks more efficiently than before, when they had to log onto a terminal and analyze the raw data manually. Additionally, the online availability of all data and metadata through a browser interface (as opposed to an ssh terminal) has allowed for greater flexibility with regard to when and where evaluations can be performed, including for the first time on a mobile device.

4.2. Evolution

On the whole, the ability to interact with the totality of the candidate data and metadata through a browser interface has given analysts considerably more flexibility in when and where evaluations can be performed. This includes, for the first time, anecdotal accounts of an analyst reviewing candidates from a mobile device.

With this freedom, in turn, have come a number of feature requests that, taken together, form a roadmap of sorts for the evolution of the framework. Now that interaction with candidate metadata has moved to the browser, the science team has identified three key features that they feel would complete the transition and entirely replace the prior ad hoc methods for coordinating the analysts' activities:

  • Job assignment. As mentioned in Section 3, the timely review of detection candidates is critical to remaining within the resource constraints imposed upon the experiment. At the moment, review jobs are assigned to analysts via email. Augmenting the web portal with the ability to identify an individual analyst would enable the presentation of a prioritized list of that analystʼs outstanding review tasks.
  • Effort Tracking. Along the same lines, it is important to spread the analysis load evenly across the science team, since no one person is performing the analysis as his or her full-time job. Augmenting the framework with the ability to track the analysis contributions of individual users over time would assist in the equitable scheduling of future review jobs.
  • In-browser archiving. When an analyst determines that a candidate event merits high-resolution follow-up, the last step is to archive the associated raw data so that it can be evaluated at a later date. Currently, owing to the security restrictions that permit only read-only external access to the archive at the NRAO (described in Section 3), this step is handled out-of-band: the analyst logs into an NRAO machine and archives the appropriate data manually. It is possible that, with the identity management features discussed in the previous two items (and the auditing capabilities they could entail), the restrictions could be relaxed to the point that certain well-defined activities, such as archiving a single job directory, could be initiated from within the portal environment.

5. CONCLUSION

V-FASTR, and commensal operations more generally, are particularly challenging experiments because of extreme data volumes and real-time requirements. Processing occurs continually, and the data flow must be coordinated across multiple physical locations with transport mechanisms ranging from FedEx shipment of disks from the antennas, to high-bandwidth interconnects (the correlator and transient detection systems), to daily rsync over IP (the ska-dc mirror), to distributed WWW protocols (the manual review performed by analysts on three continents). Various components of the system operate on millisecond, hourly, and daily clocks, and all components must keep operating because there is very little margin in buffer resources. In addition, the data processing components are highly heterogeneous, with human experts playing their own role as scheduled pattern recognition engines in the overall architecture. By facilitating timely review, and by reducing the learning curve for new reviewers, the V-FASTR portal will play a critical role in keeping the data flowing and making the system sustainable in the long term.

This effort was supported by the Jet Propulsion Laboratory, managed by the California Institute of Technology under a contract with the National Aeronautics and Space Administration.
