Sergey Kirshner, Igor Cadez, Padhraic Smyth, Chandrika Kamath
We describe the application of probabilistic model-based learning to the problem of automatically identifying classes of galaxies, based on both morphological and pixel intensity characteristics. The EM algorithm can be used to learn how to spatially orient a set of galaxies so that they are geometrically aligned. We augment this “ordering-model” with a mixture model on objects, and demonstrate how classes of galaxies can be learned in an unsupervised manner using a two-level EM algorithm. The resulting models provide highly accurate classi£cation of galaxies in cross-validation experiments.
Introduction and Background
The £eld of astronomy is increasingly data-driven as new observing instruments permit the rapid collection of massive archives of sky image data. In this paper we investigate the problem of identifying bent-double radio galaxies in the FIRST (Faint Images of the Radio Sky at Twenty-cm) Survey data set . FIRST produces large numbers of radio images of the deep sky using the Very Large Array at the National Radio Astronomy Observatory. It is scheduled to cover more that 10,000 square degrees of the northern and southern caps (skies). Of particular scienti£c interest to astronomers is the identi£cation and cataloging of sky objects with a “bent-double” morphology, indicating clusters of galaxies (, see Figure 1). Due to the very large number of observed deep-sky radio sources, (on the order of 106 so far) it is infeasible for the astronomers to label all of them manually. The data from the FIRST Survey (http://sundog.stsci.edu/) is available in both raw image format and in the form of a catalog of features that have been automatically derived from the raw images by an image analysis program . Each entry corresponds to a single detectable “blob” of bright intensity relative to the sky background: these entries are called
Figure 1: 4 examples of radio-source galaxy images. The two on the left are labelled as “bent-doubles” and the two on the right are not. The con£gurations on the left have more “bend” and symmetry than the two non-bent-doubles on the right.
components. The “blob” of intensities for each component is £tted with an ellipse. The ellipses and intensities for each component are described by a set of estimated features such as sky position of the centers (RA (right ascension) and Dec (declination)), peak density ¤ux and integrated ¤ux, root mean square noise in pixel intensities, lengths of the major and minor axes, and the position angle of the major axis of the ellipse counterclockwise from the north. The goal is to £nd sets of components that are spatially close and that resemble a bent-double. In the results in this paper we focus on candidate sets of components that have been detected by an existing spatial clustering algorithm  where each set consists of three components from the catalog (three ellipses). As of the year 2000, the catalog contained over 15,000 three-component con£gurations and over 600,000 con£gurations total. The set which we use to build and evaluate our models consists of a total of 128 examples of bent-double galaxies and 22 examples of non-bent-double con£gurations. A con£guration is labelled as a bent-double if two out of three astronomers agree to label it as such. Note that the visual identi£cation process is the bottleneck in the process since it requires signi£cant time and effort from the scientists, and is subjective and error-prone, motivating the creation of automated methods for identifying bent-doubles.
Three-component bent-double con£gurations typically consist of a center or “core” com- ponent and two other side components called “lobes”. Previous work on automated classi£- cation of three-component candidate sets has focused on the use of decision-tree classi£ers using a variety of geometric and image intensity features . One of the limitations of the decision-tree approach is its relative in¤exibility in handling uncertainty about the object being classi£ed, e.g., the identi£cation of which of the three components should be treated as the core of a candidate object. A bigger limitation is the £xed size of the feature vec- tor. A primary motivation for the development of a probabilistic approach is to provide a framework that can handle uncertainties in a ¤exible coherent manner.
2 Learning to Match Orderings using the EM Algorithm
We denote a three-component con£guration by C = (c 1; c2; c3), where the ci’s are the components (or “blobs”) described in the previous section. Each component cx is repre- sented as a feature vector, where the speci£c features will be de£ned later. Our approach focuses on building a probabilistic model for bent-doubles: p (C) = p (c1; c2; c3), the like- lihood of the observed ci under a bent-double model where we implicitly condition (for now) on the class “bent-double.”
By looking at examples of bent-double galaxies and by talking to the scientists study- ing them, we have been able to establish a number of potentially useful characteristics of the components, the primary one being geometric symmetry. In bent-doubles, two of the components will look close to being mirror images of one another with respect to a line through the third component. We will call mirror-image components lobe compo-