Technical Highlights in Information/Knowledge Management
A Fully Automated Peak Picking and Integration Algorithm for Mass
Spectral Data
A numerical algorithm is described
that accurately locates and calculates the area beneath peaks from
real mass spectral data using only reproducible mathematical operations
and no user-selected parameters. Such a fully automated algorithm
was required for rapid and repeatable processing of mass spectral
data containing hundreds of peaks. By working without any user input
it both saves operator time and eliminates operator bias. The first
criterion is desirable when processing large amounts of data (for
example in proteomics research). The second criterion is necessary
for the Polymer Division's goal of creating a synthetic polymer Standard
Reference Material with an absolute molecular mass distribution, where
operator bias in the data analysis cannot be tolerated.
A unified collection of algorithms
has been developed that accurately locates peaks and calculates their
area using only reproducible mathematical operations and no user-selected
parameters. As shown in Figure 1, the method consists of three steps:
1) statistical characterization of the data set and an analyte-free
background spectrum; 2) data set segmentation to determine "strategic
points"; and 3) deflation of the number of strategic points guided
by the statistical properties of the data sets. The final deflated
set of strategic points consists of groups of three points that define
the beginning, center, and end of each peak in the data. For closely
spaced peaks the strategic point that defines the end of one peak
may also define the beginning of the next. Finally, a polygonal fitting
routine is used to calculate relative peak area.
The time-series segmentation algorithm
at the heart of the method consists of two steps. The first portion
(2a) requires the selection of the strategic points. These points
are selected based on an iterative procedure that identifies points
whose orthogonal distance from the end-point connecting line segment
is greatest. Once the point with greatest orthogonal distance from that
line segment has been identified, it joins the collection of strategic points
and, in turn, becomes an end-point for two new line segments from
which a point with greatest orthogonal distance is again found. This
numerical scheme is performed until the greatest orthogonal distance
to any end-point connecting line segment drops beneath a prescribed
threshold value. This threshold value is calculated from the statistical
properties of the data set. The selection of these points does not
require equally spaced data. The second phase of the algorithm (2b)
requires the solution of an optimization problem, specifically, locating
strategic point heights (that is, adjusting the y-axis values
associated with the strategic x-axis values) that minimize the sum of orthogonal
distances from the raw data. This problem is a nonlinear (and non-quadratic)
optimization problem that can be solved quickly using a modern
nonlinear programming algorithm. Parts 2a and 2b are collectively
called the Kearsley-Wallace method, which is an extension of the earlier
Douglas-Peucker method.
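The point-selection phase (2a) described above can be sketched as follows. This is the classic Douglas-Peucker selection only, not the full Kearsley-Wallace method: the height optimization of phase 2b is omitted, and the threshold is passed in as a hand-chosen number, whereas the method described here derives it from the statistical properties of the data. The function names are hypothetical.

```python
import math


def orthogonal_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:
        # Degenerate segment: fall back to the point-to-point distance.
        return math.hypot(px - ax, py - ay)
    return abs(dx * (py - ay) - dy * (px - ax)) / length


def strategic_points(points, threshold):
    """Return sorted indices of the selected points.

    points: list of (x, y) pairs ordered by x; equal spacing is not required.
    threshold: assumed to be supplied by the caller in this sketch.
    """
    keep = {0, len(points) - 1}
    stack = [(0, len(points) - 1)]  # segments still to be examined
    while stack:
        lo, hi = stack.pop()
        best_d, best_i = 0.0, None
        for i in range(lo + 1, hi):
            d = orthogonal_distance(points[i], points[lo], points[hi])
            if d > best_d:
                best_d, best_i = d, i
        if best_i is not None and best_d > threshold:
            # The farthest point becomes an end-point of two new segments.
            keep.add(best_i)
            stack.append((lo, best_i))
            stack.append((best_i, hi))
    return sorted(keep)


# One triangular peak on a flat baseline, sampled at unequal x spacing.
data = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (3.0, 5.0),
        (3.5, 0.0), (5.0, 0.0), (7.0, 0.0)]
selected = strategic_points(data, threshold=0.5)
```

In this toy data set the selected indices are the two end-points of the record plus the beginning, center, and end of the single peak. Note that an orthogonal distance mixes x and y units, so in practice the relative scaling of the two axes affects which points are selected; the sketch makes no attempt to address that.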
Figure 1. Method flow chart.
Consider the polystyrene matrix-assisted
laser desorption ionization time-of-flight mass spectrum shown in
Figure 2 (black) and its complementary matrix-only background spectrum
(red). The resultant strategic points (green) defining peak beginning,
center, and end, and the relative peak areas (blue) are also shown.
Note that ion intensity is on a logarithmic scale, thus the small
peaks are significantly smaller than the main series of peaks. The
analysis of this spectrum was performed without operator intervention of any sort.
The only inputs provided were the spectrum to be analyzed and an analyte-free
spectrum to determine inherent instrument noise. The noise has
both chemical (e.g., improperly time-focused ions) and electronic
(e.g., detector dark current) components. These noise elements span
a wide frequency range and cannot simply be smoothed out of the data
without distorting peak shape (and, therefore, peak area). Our experience
shows that the power spectrum of the noise cannot be predicted solely
from the experimental conditions; therefore, blind application of
smoothing and/or filtering algorithms will unintentionally remove
information from the data.
Figure 2. Sample polystyrene
MALDI TOF mass spectrum (black) and its complementary matrix-only
background spectrum (red).
Some of the additional strengths
of this method are that it requires no knowledge of peak
shape and, furthermore, no preprocessing of the data,
i.e., smoothing or baseline correction with their resultant distortion
of peak area. Lastly, the method does not require equal spacing of
data points (e.g., time-of-flight data can be processed in mass-space
where the points have a square root spacing). The one significant
weakness is that the method is more successful and efficient if a
blank (analyte-free) spectrum is used to calibrate instrument background
noise. (However, such a background spectrum is not strictly required.)
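The square-root spacing mentioned above can be made concrete with a short sketch. It assumes the idealized time-of-flight calibration m = A (t - T0)^2; the constants A and T0 below are hypothetical values chosen only for illustration, not instrument parameters from this work.

```python
import math

A = 1.0e-3   # hypothetical calibration constant (mass units per time unit squared)
T0 = 0.0     # hypothetical time offset


def time_to_mass(t):
    """Idealized time-of-flight calibration: mass grows as the square of time."""
    return A * (t - T0) ** 2


# Equally spaced samples on the time axis ...
times = [1000.0 + 10.0 * k for k in range(6)]
masses = [time_to_mass(t) for t in times]

# ... become unequally spaced on the mass axis.
spacings = [m1 - m0 for m0, m1 in zip(masses, masses[1:])]

# The spacing divided by sqrt(mass) stays nearly constant, i.e. the
# point spacing in mass-space grows roughly as the square root of the mass.
ratios = [dm / math.sqrt(m0) for dm, m0 in zip(spacings, masses)]
```

A method that tolerates unequal spacing can operate on the `masses` axis directly, without first resampling the spectrum onto a uniform grid.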
Future plans include the creation
of a publicly accessible, secure-Web-server application for online,
real-time application of the algorithm. We will also relay the method
to other standards-setting organizations for comment and to commercial
software vendors for implementation in their products. Lastly, we
have begun to tackle the much more subtle problem of automated,
operator-independent baseline compensation.