PageRenderTime 234ms CodeModel.GetById 51ms app.highlight 153ms RepoModel.GetById 18ms app.codeStats 0ms

/rcdk/inst/doc/rcdk.Rnw

http://github.com/rajarshi/cdkr
Unknown | 673 lines | 556 code | 117 blank | 0 comment | 0 complexity | c13514ba169fc4bac20b0275b06ebeac MD5 | raw file
  1% \VignetteIndexEntry{rcdk Tutorial}
  2% \VignettePackage{rcdk}
  3% \VignetteKeywords{}
  4% \VignetteDepends{xtable}
  5%% To generate the Latex code
  6%library(rcdk)
  7%Rnwfile<- file.path("rcdk.Rnw")
  8%Sweave(Rnwfile,pdf=TRUE,eps=TRUE,stylepath=TRUE,driver=RweaveLatex())
  9
 10
 11\documentclass[letterpaper, 11pt]{article}
 12
 13\usepackage{times}
 14\usepackage{url}
 15\usepackage[pdftex,bookmarks=true]{hyperref}
 16
 17\newcommand{\Rfunction}[1]{{\texttt{#1}}}
 18\newcommand{\Rpackage}[1]{{\textit{#1}}}
 19\newcommand{\funcarg}[1]{{\texttt{#1}}}
 20
 21\newcommand{\rclass}[1]{{\textit{#1}}}
 22
 23<<echo=FALSE>>=
 24options(width=74)
 25library(xtable)
 26@
 27\parindent 0in
 28\parskip 1em
 29
 30\begin{document}
 31
 32\title{rcdk: Integrating the CDK with R}
 33\author{Rajarshi Guha \\ Miguel Rojas Cherto}
 34\maketitle
 35\tableofcontents
 36\newpage
 37
 38\section{Introduction}
 39
 40Given that much of cheminformatics involves mathematical and
 41statistical modeling of chemical information, R is a natural platform
 42for such work. There are many cheminformatics applications that will
 43generate useful information such as descriptors, fingerprints and so
 44on. While one can always run these applications to generate data that
 45is then imported into R, it can be convenient to be able to manipulate
 46chemical structures and generate chemical information with the R environment.
 47
 48The CDK is a Java library for cheminformatics that supports a wide
 49variety of cheminformatics functionality ranging from reading
 50molecular file formats, performing ring perception and armaticity
 51detection to fingerprint generation and molecular descriptors. 
 52
 53The goal of the \Rpackage{rcdk} package is to allow an R user to
 54access the cheminformatics functionality of the CDK from within
 55R. While one can use the \Rpackage{rJava} package to make direct calls
 56to specific methods in the CDK, from R, such usage does not usually
 57follow common R idioms. The goal of the \Rpackage{rcdk} is to allow
 58users to use the CDK classes and methods in an R-like fashion.
 59
 60The library is loaded as follows
 61
 62<<<>>=
 63library(rcdk)
 64@ 
 65To list the documentation for all available packages try
 66<<eval=FALSE>>=
 67library(help=rcdk)
 68@ 
 69The package also provides an example data set, called \texttt{bpdata}
 70which contains 277 molecules, in SMILES format and their associated
 71boiling points (BP) in Kelvin. The data.frame has two columns, viz.,
 72the SMILES and the BP. Molecules names are used as row names.
 73
 74\section{Input / Output}
 75
 76Chemical structures come in a variety of formats and the CDK supports
 77many of them. Many such formats are disk based and these files can be
 78parsed and loaded by specifying their full paths
 79
 80<<eval=FALSE>>=
 81mols <- load.molecules( c('data1.sdf', '/some/path/data2.sdf') )
 82@ 
 83
 84Note that the above function will load any file format that is
 85supported by the CDK, so there's no need to specify formats. In
 86addition one can specify a URL (which should start with ``http://'')
 87to specify remote files as well.  The result of this function is a
 88\texttt{list} of molecule objects. The molecule objects are of class
 89\texttt{jobjRef} (provided by the \Rpackage{rJava} package). As a
 90result,they are pretty opaque to the user and are really meant to be
 91processed using methods from the \Rpackage{rcdk} or \Rpackage{rJava}
 92packages.
 93
 94However, since it loads all the molecules from the specified file into 
 95a list, large files can lead to out of memory errors. In such a situtation
 96it is preferable to iterate over the file, one structure at a time. Currently
 97this behavior is supported for SDF and SMI files. An example of such a 
 98usage for a large SD file would be:
 99<<eval=FALSE>>=
100iter <- iload.molecules('verybig.sdf', type='sdf')
101while(hasNext(iter)) {
102 mol <- nextElem(iter)
103 print(get.property(mol, "cdk:Title"))
104}
105@ 
106
107Another common way to obtain molecule objects is by parsing SMILES
108strings. The simplest way to do this is
109<<>>=
110smile <- 'c1ccccc1CC(=O)C(N)CC1CCCCOC1'
111mol <- parse.smiles(smile)[[1]]
112@ 
113Usage is more efficient when multiple SMILE are supplied, since then a single
114SMILES parser object is used to parse all the supplied SMILES.
115<<>>=
116smiles <- c('CCC', 'c1ccccc1', 'CCCC(C)(C)CC(=O)NC')
117mols <- parse.smiles(smiles)
118@ 
119If you plan on parsing a large number of SMILES, you may run into
120memory issues, due to the large size of \texttt{IAtomContainer}
121objects. In such a case, it can be useful to call the Java and R garbage
122collectors explicitly at the appropriate time. In addition it can be useful
123to explicitly allocate a large amount of memory for the JVM. For example,
124<<eval=FALSE>>=
125options("java.parameters"=c("-Xmx4000m"))
126library(rcdk)
127
128for (smile in smiles) {
129  m <- parse.smiles(smile)
130  ## perform operations on this molecule
131  
132  jcall("java/lang/System","V","gc")
133  gc()
134}
135@ 
136Given a list of molecule objects, it is possible to serialize them to
137a file in some specified format. Currently, the only output formats
138are SMILES or SDF. To write molecules to a disk file in SDF format:
139<<eval=FALSE>>=
140write.molecules(mols, filename='mymols.sdf')
141@ 
142By default, if \texttt{mols} is a list of multiple molecules, all of
143them will be written to a single SDF file. If this is not desired, you
144can write each on to individual files (which are prefixed by the
145value of \funcarg{filename}):
146
147<<eval=FALSE>>=
148write.molecules(mols, filename='mymols.sdf', together=FALSE)
149@ 
150
151To generate a SMILES representation of a molecule we can do
152<<>>=
153get.smiles(mols[[1]])
154unlist(lapply(mols, get.smiles))
155@ 
156
157\section{Visualization}
158
159Currently the \Rpackage{rcdk} package only supports 2D
160visualization. This can be used to view the structure of individual
161molecules or multiple molecules in a tabular format. It is also
162possible to view a molecular-data table, where one of the columns is
163the 2D image and the remainder can contain data associated with the molecules.
164
165Unfortunately, due to event handling issues, the depictions will
166display on OS X, but the Swing window will become unresponsive. As a
167result, it is not recommended to generate 2D depictions on OS X.
168
169Molecule visualization is performed using the
170\texttt{view.molecule.2d} function. For viewing a single molecule or a
171list of multiple molecules, it is simply
172
173<<eval=FALSE>>=
174smiles <- c('CCC', 'CCN', 'CCN(C)(C)',
175            'c1ccccc1Cc1ccccc1',
176            'C1CCC1CC(CN(C)(C))CC(=O)CC')
177mols <- parse.smiles(smiles)
178view.molecule.2d(mols[[1]])
179view.molecule.2d(mols)
180@ 
181If multiple molecules are provided, they are display in a matrix
182format, with a default of four columns. This can be changed via the
183\emph{ncol} argument. Furthermore, the size of the images are 200
184$\times$ 200 pixels, by default. But this can be easily changed via
185the \emph{cellx} and \emph{celly} arguments.
186
187In many cases, it is useful to view a ``molecular spreadsheet'', which
188is a table of molecules along with information related to the
189molecules being viewed. The data is arranged in a spreadsheet like
190manner, with one of the columns being molecules and the remainder
191being textual or numeric information. 
192
193This can be achieved using the \texttt{view.table} method which takes
194a list of molecule objects and a \texttt{data.frame} containing the
195associated data. As expected, the number of rows in the
196\texttt{data.frame} should equal the length of the molecule list.
197
198<<eval=FALSE>>=
199dframe <- data.frame(x = runif(4),
200                     toxicity = factor(c('Toxic', 'Toxic', 'Nontoxic', 'Nontoxic')),
201                     solubility = c('yes', 'yes', 'no', 'yes'))
202view.table(mols[1:4], dframe)
203@ 
204As shown, the \texttt{view.table} supports numeric, character and
205factor data types.
206
207
208\section{Manipulating Molecules}
209
210In general, given a \texttt{jobjRef} for a molecule object one can
211access all the class and methods of the CDK library via
212\Rpackage{rJava}. However this can be cumbersome. The \Rpackage{rcdk}
213package is in the process of exposing methods and classes that
214manipulate molecules. This section describes them - more will be
215implemented in future releases.
216
217\subsection{Adding Information to Molecules}
218
219In many scenarios it's useful to associate information with
220molecules. Within R, you could always create a \texttt{data.frame} and
221store the molecule objects along with relevant information in
222it. However, when serializing the molecules, you want to be able to
223store the associated information.
224
225Using the CDK it's possible to directly add information to a molecule
226object using properties. Note that adding such properties uses a
227key-value paradigm, where the key should be of class
228\texttt{character}. The value can be of class \texttt{integer},
229\texttt{double}, \texttt{character} or \texttt{jobjRef}. Obviously,
230after setting a property, you can get a property by its key.
231<<>>=
232mol <- parse.smiles('c1ccccc1')[[1]]
233set.property(mol, "title", "Molecule 1")
234set.property(mol, "hvyAtomCount", 6)
235get.property(mol, "title")
236@ 
237
238It is also possible to get all available properties at once in the
239from of a list. The property names are used as the list names.
240<<>>=
241get.properties(mol)
242@ 
243
244After adding such properties to the molecule, you can write it out to
245an SD file, so that the property values become SD tags.
246<<eval=FALSE>>=
247write.molecules(mol, 'tagged.sdf', write.props=TRUE)
248@ 
249
250\subsection{Atoms and Bonds}
251
252
253
254Probably the most important thing to do is to get the atoms and bonds
255of a molecule. The code below gets the atoms and bonds as lists of
256\texttt{jobjRef} objects, which can be manipulated using
257\Rpackage{rJava} or via other methods of this package.
258
259<<>>=
260mol <- parse.smiles('c1ccccc1C(Cl)(Br)c1ccccc1')[[1]]
261atoms <- get.atoms(mol)
262bonds <- get.bonds(mol)
263cat('No. of atoms =', length(atoms), '\n')
264cat('No. of bonds =', length(bonds), '\n')
265@ 
266
267Right now, given an atom the \Rpackage{rcdk} package does not offer a
268lot of methods to operate on it. One must access the CDK directly. In
269the future more manipulators will be added. Right now, you can get the
270symbol for each atom
271
272<<>>=
273unlist(lapply(atoms, get.symbol))
274@ 
275
276It's also possible to get the 3D (or 2D coordinates) for an atom. 
277
278<<>>=
279coords <- get.point3d(atoms[[1]])
280@ 
281
282Given this, it's quite easy to get the 3D coordinate matrix for a molecule
283
284<<eval=FALSE>>=
285coords <- do.call('rbind', lapply(atoms, get.point3d))
286@ 
287
288Once you have the coordinate matrix, a quick way to check whether the
289molecule is flat is to do
290<<eval=FALSE>>=
291if ( any(apply(coords, 2, function(x) length(unique(x))) == 1) ) {
292  print("molecule is flat")
293}
294@ 
295
296This is quite a simplistic check that just looks at whether the X, Y
297or Z coordinates are constant. To be more rigorous one could evaluate
298the moments of inertia about the axes.
299
300\subsection{Substructure Matching}
301The CDK library supports substructure searches using SMARTS (or
302SMILES) patterns. The implementation allows one to check whether a
303target molecule contains a substructure or not as well as to retrieve
304the atoms and bonds of the target molecule that match the query
305substructure. At this point, the \Rpackage{rcdk} only support the
306former operation - given a query pattern, does it occur or not in a
307list of target molecules. The \texttt{match} method of this package is
308modeled after the same method in the \Rpackage{base} package. An
309example of its usage would be to identify molecules that contain a
310carbon atom that has exactly two bonded neighbors.
311<<<eval=TRUE>>=
312mols <- parse.smiles(c('CC(C)(C)C','c1ccc(Cl)cc1C(=O)O', 'CCC(N)(N)CC'))
313query <- '[#6D2]'
314hits <- match(query, mols)
315print(hits)
316@ 
317
318\section{Molecular Descriptors}
319
320Probably the most desired feature when doing predictive modeling of
321molecular activities is molecular descriptors. The CDK implements a
322variety of molecular descriptors, categorized into topological,
323constitutional, geometric, electronic and hybrid. It is possible to
324evaluate all available descriptors at one go, or evaluate individual
325descriptors.
326
327First, we can take a look at the available descriptor
328categories. 
329
330<<>>=
331dc <- get.desc.categories()
332dc
333@ 
334
335Given the categories we can get the names of the descriptors for a
336single category. Of course, you can always provide the category name
337directly.
338
339<<>>=
340dn <- get.desc.names(dc[1])
341@ 
342
343Each descriptor name is actually a fully qualified Java class name for
344the corresponding descriptor. These names can be supplied to
345\texttt{eval.desc} to evaluate a single or multiple descriptors for
346one or more molecules.
347
348<<>>=
349aDesc <- eval.desc(mol, dn[1])
350allDescs <- eval.desc(mol, dn)
351@ 
352
353The return value of \texttt{eval.desc} is a data.frame with the
354descriptors in the columns and the molecules in the rows. For the
355above example we get a single row. But given a list of molecules, we
356can easily get a descriptor matrix. For example, lets build a linear
357regression model to predict boiling points for the BP dataset. First
358we need a set of descriptors and so we evaluate all available descriptors.
359Also note that since a descriptor might belong to more than one
360category, we should obtain a unique set of descriptor names
361<<>>=
362data(bpdata)
363mols <- parse.smiles(bpdata[,1])
364descNames <- unique(unlist(sapply(get.desc.categories(), get.desc.names)))
365descs <- eval.desc(mols, descNames) 
366class(descs)
367dim(descs)
368@ 
369%descs <- eval.desc(mols,c("org.openscience.cdk.qsar.descriptors.molecular.KierHallSmartsDescriptor")) 
370
371As you can see we get a \texttt{data.frame}. Many of the columns will
372be \texttt{NA}. This is because when a descriptor cannot be evaluated
373(due to some error) it returns NA. In our case, since our molecules
374have no 3D coordinates many geometric, electronic and hybrid
375descriptors cannot be evaluated.
376
377Given the ubiquity of certain descriptors, some of them are directly
378available via their own functions. Specifically, one can calculate
379TPSA (topological polar surface area), AlogP and XlogP without having
380to go through \texttt{eval.desc}.\footnote{Note that AlogP and XlogP
381  assume that hydrogens are explicitly specified in the molecule. This
382  may not be true if the molecules were obtained from SMILES}.
383
384<<>>=
385mol <- parse.smiles('CC(=O)CC(=O)NCN')[[1]]
386convert.implicit.to.explicit(mol)
387get.tpsa(mol)
388get.xlogp(mol)
389get.alogp(mol)
390@ 
391
392In any case, now that we have a descriptor matrix, we easily build a
393linear regression model. First, remove NA's, correlated and constant
394columns. The code is shown below, but since it involves a stochastic
395element, we will not run it for this example. If we were to perform
396feature selection, then this type of reduction would have to be performed.
397
398<<eval=FALSE>>=
399descs <- descs[, !apply(descs, 2, function(x) any(is.na(x)) )]
400descs <- descs[, !apply( descs, 2, function(x) length(unique(x)) == 1 )]
401r2 <- which(cor(descs)^2 > .6, arr.ind=TRUE)
402r2 <- r2[ r2[,1] > r2[,2] , ]
403descs <- descs[, -unique(r2[,2])]
404@ 
405
406Note that the above correlation reduction step is pretty crude and
407there are better ways to do it. Given the reduced descriptor matrix,
408we can perform feature selection (say using \Rpackage{leaps} or a GA)
409to identify a suitable subset of descriptors. For now, we'll select
410some descriptors that we know are correlated to BP. The fit is shown
411in Figure \ref{fig:ols} which plots the observed versus predicted BP's.
412
413<<>>=
414model <- lm(BP ~ khs.sCH3 + khs.sF + apol + nHBDon, data.frame(bpdata, descs))
415summary(model)
416@ 
417%model <- lm(BP ~ khs.sCH3 + khs.sF + apol + nHBDon, data.frame(bpdata, descs))
418%model <- lm(BP ~ khs.sCH3 + khs.sF, data.frame(bpdata, descs))
419
420\begin{figure}[h]
421  \centering
422<<fig=TRUE,echo=FALSE>>=
423par(mar=c(4.3,4.3,1,1),cex.lab=1.3, pty='s')
424plot(bpdata$BP, model$fitted, pch=19, 
425     ylab='Predicted BP', xlab='Observed BP',
426     xlim=range(bpdata$BP), ylim=range(bpdata$BP))
427abline(0,1, col='red')
428@ 
429  \caption{A plot of observed versus predicted boiling points,
430    obtained from a linear regression model using 277 molecules.}
431  \label{fig:ols}
432\end{figure}
433
434\section{Fingerprints}
435
436Fingerprints are a common representation used for a variety of
437purposes such as similarity searching and predictive modeling. The CDK
438provides four types of fingerprints, viz.,
439\begin{itemize}
440\item Standard - a path based, hashed fingerprint. The default size is
441  1024 bits, but this can be changed via an argument
442\item Extended - similar to the Standard form, but takes into account
443  ring systems. Default size is 1024 bits
444\item EState - a structural key type fingerprint that checks for the
445  presence or absence of 79 EState substructures. Length of the
446  fingerprint is 79 bits
447\item MACCS - the well known 166 bit structural keys
448\end{itemize}
449When using \Rpackage{rcdk} to evaluate fingerprints, you will need the
450\Rpackage{fingerprint} package. Since this is a dependency of the
451\Rpackage{rcdk} package, it should have been automatically installed.
452
453To generate the fingerprints, we must first obtain molecule
454objects. Thus for example,
455<<>>=
456smiles <- c('CCC', 'CCN', 'CCN(C)(C)', 
457            'c1ccccc1Cc1ccccc1',
458            'C1CCC1CC(CN(C)(C))CC(=O)CC')
459mols <- parse.smiles(smiles)
460fp <- get.fingerprint(mols[[1]], type='maccs')
461@ 
462
463The variable, \texttt{fp}, will be of class \rclass{fingerprint} and
464can be manipulated using the methods provided by the package of the
465same name. A simple example is to perform a hierarchical clustering of
466the first 50 structures in the BP dataset.  
467
468<<>>= mols <-
469mols <- parse.smiles(bpdata[1:50,1]) 
470fps <- lapply(mols, get.fingerprint, type='extended') 
471fp.sim <- fp.sim.matrix(fps, method='tanimoto') 
472fp.dist <- 1 - fp.sim 
473@
474
475Once we have the distance matrix (which we must derive from the
476Tanimoto similarity matrix), we can then perform the clustering and
477visualize it.
478\begin{figure}[t]
479  \centering
480<<fig=TRUE>>=
481clustering <- hclust(as.dist(fp.dist))
482plot(clustering, main='A Clustering of the BP dataset')
483@   
484  \caption{A clustering of the first 50 molecules of the BP dataset, using the CDK extended fingerprints.}
485  \label{fig:cluster}
486\end{figure}
487
488Another common task for fingerprints is similarity searching. In other
489words, given a collection of ``target'' molecules, find those
490molecules that are similar to a ``query'' molecule. This is achieved
491by evaluating a similarity metric between the query and each of the
492target molecules. Those target molecules exceeding a user defined
493cutoff will be returned. With the help of the \Rpackage{fingerprint}
494package this is easily accomplished.
495
496For example, we can identify all the molecules in the BP dataset that
497have a Tanimoto similarity of 0.3 or more with acetalehyde, and then
498create a summary in Table \ref{tab:sims}. Note that this could also be
499accomplished with molecular descriptors, in which case you'd probably
500evaluate the Euclidean distance between descriptor vectors.
501
502<<>>=
503query.mol <- parse.smiles('CC(=O)')[[1]]
504target.mols <- parse.smiles(bpdata[,1])
505query.fp <- get.fingerprint(query.mol, type='maccs')
506target.fps <- lapply(target.mols, get.fingerprint, type='maccs')
507sims <- unlist(lapply(target.fps, 
508                      distance, 
509                      fp2=query.fp, method='tanimoto'))
510hits <- which(sims > 0.3)
511@ 
512
513
514<<echo=FALSE,results=tex>>=
515d <- data.frame(SMILES=bpdata[hits,1], Similarity=sims[hits])
516row.names(d) <- NULL
517d <- d[sort.list(d[,2], dec=TRUE),]
518xtable(d, label="tab:sims", 
519       caption="Summary of molecules in the BP dataset that are greater than 0.3 similar to acetaldehyde")
520@ 
521
522
523\section{Handling Molecular Formulae}
524
525The molecular formula is the simplest way to characterize a molecular
526compound. It specifies the actual number of atoms of each element
527contained in the molecule. A molecular formula is represented by the
528chemical symbol of each constituent element. If a molecule contains
529more than one atom for a particular element, the quantity is shown as
530subscript after the chemical symbol. Otherwise, the number of neutrons
531(atomic mass) that an atom is composed can differ. This different type
532of atoms are known as isotopes. The number of nucleos is denoted as
533superscripted prefix previous to the chemical element. Generally it is
534not added when the isotope that characterizes the element is the most
535occurrence in nature. E.g.  $C_4H_{11}O^2D$.
536
537\subsection{Parsing a Molecule To a Molecular Formula}
538
539Front a molecule, defined as conjunct of atoms helding together by
540chemical bonds, we can simplify it taking only the information about
541the atoms. \Rpackage{rcdk} package provides a parser to translate
542molecules to molecular formlulas, the \texttt{get.mol2formula}
543function.
544
545<<>>=
546sp <- get.smiles.parser()
547molecule <- parse.smiles('N')[[1]]
548convert.implicit.to.explicit(molecule)
549formula <- get.mol2formula(molecule,charge=0)
550@ 
551
552Note that the above formula object is a CDKFormula-class. A cdkFormula-class contains some attributes that
553defines a molecular formula. For example, the mass, the charge, the isotopes, the character representation
554of the molecular formula and the IMolecularFormula \texttt{jobjRef} object.   
555
556
557The molecular mass for this formula.
558<<>>=
559formula@mass
560@ 
561
562The charge for this formula.
563<<>>=
564formula@charge
565@ 
566
567
568The isotopes for this formula. It is formed for three columns. 
569\texttt{isoto} (the symbol expression of the isotope), \texttt{number}
570(number of atoms for this isotope) and \texttt{mass} (exact mass of 
571this isotope).
572<<>>=
573formula@isotopes
574@ 
575
576The \texttt{jobjRef} object from the \texttt{IMolecularFormula}
577java class in CDK.
578<<>>=
579formula@objectJ
580@ 
581
582The symbol expression of the molecular formula.
583<<>>=
584formula@string
585@ 
586
587
588Depending of the circumstances, you may want to change the charge
589of the molecular formula. 
590<<eval=FALSE>>=
591formula <- set.charge.formula(formula, charge=1)
592@ 
593
594\subsection{Initializing a Formula from the Symbol Expression}
595
596Other way to create a \texttt{cdkFormula} is from the symbol
597expression. Thus, setting the characters of the elemental formula, the
598function \texttt{get.formula} parses it to an object of
599cdkFormula-class.
600
601<<>>=
602formula <- get.formula('NH4', charge = 1);
603formula
604@ 
605
606
607\subsection{Generating Molecular Formula}
608
609Mass spectrometry is an essential and reliable technique to determine
610the molecular mass of compounds. Conversely, one can use the measured
611mass to identify the compound via its elemental formula. One of the
612limitations of the method is the precision and accuracy of the
613instrumentation.  As a result, rather than specify exact masses, we
614specify tolerances or ranges of possible mass, resulting in multiple
615candidate formulae for a given \emph{mass window}.  The
616\texttt{generate.formula} function returns a list formulas which have
617a given mass (within an error window):
618
619<<>>=
620mfSet <- generate.formula(18.03383, window=1, 
621		elements=list(c("C",0,50),c("H",0,50),c("N",0,50)), 
622		validation=FALSE);
623for (i in 1:length(mfSet)) {
624  print(mfSet[i])
625}
626@ 
627
628Important to know is if an elemental formula is valid. The method \texttt{isvalid.formula}
629provides this function. Two constraints can be applied, the nitrogen rule and the
630(Ring Double Bond Equivalent) RDBE rule.
631
632<<>>=
633formula <- get.formula('NH4', charge = 0);
634isvalid.formula(formula,rule=c("nitrogen","RDBE"))
635@ 
636
637We can observe that the ammonium is only valid if it is defined with
638charge of $+1$.
639
640\subsection{Calculating Isotope Pattern}
641
642Due to the measurement errors in medium resolution spectrometry, a
643given error window can result in a massive number of candidate
644formulae. The isotope pattern of ions obtained experimentally can be
645compared with the theoretical ones. The best match is reflected as the
646most probable elemental formula. \Rpackage{rcdk} provides the function
647\texttt{get.isotopes.pattern} which predicts the theoretical isotope
648pattern given a formula.
649
650<<>>=
651formula <- get.formula('CHCl3', charge = 0) 
652isotopes <- get.isotopes.pattern(formula,minAbund=0.1)
653isotopes
654@ 
655
656In this example we generate a formula for a possible compound with a
657charge ($z = \approx 0$) containing the standard elements C, H, and
658Cl. The isotope pattern can be visually inspectd, as shown in
659Figure~\ref{fig:isotopes}.
660
661\begin{figure}[h]
662  \centering
663<<fig=TRUE>>=
664plot(isotopes, type="h", xlab="m/z", ylab="Intensity")
665@   
666\caption{Theoretical isotope pattern given a molecular formula.}
667\label{fig:isotopes}
668\end{figure}
669
670
671
672\end{document}
673