/rcdk/inst/doc/rcdk.Rnw

http://github.com/rajarshi/cdkr · Unknown · 673 lines · 556 code · 117 blank · 0 comment · 0 complexity · c13514ba169fc4bac20b0275b06ebeac MD5 · raw file

  1. % \VignetteIndexEntry{rcdk Tutorial}
  2. % \VignettePackage{rcdk}
  3. % \VignetteKeywords{}
  4. % \VignetteDepends{xtable}
  5. %% To generate the Latex code
  6. %library(rcdk)
  7. %Rnwfile<- file.path("rcdk.Rnw")
  8. %Sweave(Rnwfile,pdf=TRUE,eps=TRUE,stylepath=TRUE,driver=RweaveLatex())
  9. \documentclass[letterpaper, 11pt]{article}
  10. \usepackage{times}
  11. \usepackage{url}
  12. \usepackage[pdftex,bookmarks=true]{hyperref}
  13. \newcommand{\Rfunction}[1]{{\texttt{#1}}}
  14. \newcommand{\Rpackage}[1]{{\textit{#1}}}
  15. \newcommand{\funcarg}[1]{{\texttt{#1}}}
  16. \newcommand{\rclass}[1]{{\textit{#1}}}
  17. <<echo=FALSE>>=
  18. options(width=74)
  19. library(xtable)
  20. @
  21. \parindent 0in
  22. \parskip 1em
  23. \begin{document}
  24. \title{rcdk: Integrating the CDK with R}
  25. \author{Rajarshi Guha \\ Miguel Rojas Cherto}
  26. \maketitle
  27. \tableofcontents
  28. \newpage
  29. \section{Introduction}
  30. Given that much of cheminformatics involves mathematical and
  31. statistical modeling of chemical information, R is a natural platform
  32. for such work. There are many cheminformatics applications that will
  33. generate useful information such as descriptors, fingerprints and so
  34. on. While one can always run these applications to generate data that
  35. is then imported into R, it can be convenient to be able to manipulate
  36. chemical structures and generate chemical information with the R environment.
  37. The CDK is a Java library for cheminformatics that supports a wide
  38. variety of cheminformatics functionality ranging from reading
  39. molecular file formats, performing ring perception and armaticity
  40. detection to fingerprint generation and molecular descriptors.
  41. The goal of the \Rpackage{rcdk} package is to allow an R user to
  42. access the cheminformatics functionality of the CDK from within
  43. R. While one can use the \Rpackage{rJava} package to make direct calls
  44. to specific methods in the CDK, from R, such usage does not usually
  45. follow common R idioms. The goal of the \Rpackage{rcdk} is to allow
  46. users to use the CDK classes and methods in an R-like fashion.
  47. The library is loaded as follows
  48. <<<>>=
  49. library(rcdk)
  50. @
  51. To list the documentation for all available packages try
  52. <<eval=FALSE>>=
  53. library(help=rcdk)
  54. @
  55. The package also provides an example data set, called \texttt{bpdata}
  56. which contains 277 molecules, in SMILES format and their associated
  57. boiling points (BP) in Kelvin. The data.frame has two columns, viz.,
  58. the SMILES and the BP. Molecules names are used as row names.
  59. \section{Input / Output}
  60. Chemical structures come in a variety of formats and the CDK supports
  61. many of them. Many such formats are disk based and these files can be
  62. parsed and loaded by specifying their full paths
  63. <<eval=FALSE>>=
  64. mols <- load.molecules( c('data1.sdf', '/some/path/data2.sdf') )
  65. @
  66. Note that the above function will load any file format that is
  67. supported by the CDK, so there's no need to specify formats. In
  68. addition one can specify a URL (which should start with ``http://'')
  69. to specify remote files as well. The result of this function is a
  70. \texttt{list} of molecule objects. The molecule objects are of class
  71. \texttt{jobjRef} (provided by the \Rpackage{rJava} package). As a
  72. result,they are pretty opaque to the user and are really meant to be
  73. processed using methods from the \Rpackage{rcdk} or \Rpackage{rJava}
  74. packages.
  75. However, since it loads all the molecules from the specified file into
  76. a list, large files can lead to out of memory errors. In such a situtation
  77. it is preferable to iterate over the file, one structure at a time. Currently
  78. this behavior is supported for SDF and SMI files. An example of such a
  79. usage for a large SD file would be:
  80. <<eval=FALSE>>=
  81. iter <- iload.molecules('verybig.sdf', type='sdf')
  82. while(hasNext(iter)) {
  83. mol <- nextElem(iter)
  84. print(get.property(mol, "cdk:Title"))
  85. }
  86. @
  87. Another common way to obtain molecule objects is by parsing SMILES
  88. strings. The simplest way to do this is
  89. <<>>=
  90. smile <- 'c1ccccc1CC(=O)C(N)CC1CCCCOC1'
  91. mol <- parse.smiles(smile)[[1]]
  92. @
  93. Usage is more efficient when multiple SMILE are supplied, since then a single
  94. SMILES parser object is used to parse all the supplied SMILES.
  95. <<>>=
  96. smiles <- c('CCC', 'c1ccccc1', 'CCCC(C)(C)CC(=O)NC')
  97. mols <- parse.smiles(smiles)
  98. @
  99. If you plan on parsing a large number of SMILES, you may run into
  100. memory issues, due to the large size of \texttt{IAtomContainer}
  101. objects. In such a case, it can be useful to call the Java and R garbage
  102. collectors explicitly at the appropriate time. In addition it can be useful
  103. to explicitly allocate a large amount of memory for the JVM. For example,
  104. <<eval=FALSE>>=
  105. options("java.parameters"=c("-Xmx4000m"))
  106. library(rcdk)
  107. for (smile in smiles) {
  108. m <- parse.smiles(smile)
  109. ## perform operations on this molecule
  110. jcall("java/lang/System","V","gc")
  111. gc()
  112. }
  113. @
  114. Given a list of molecule objects, it is possible to serialize them to
  115. a file in some specified format. Currently, the only output formats
  116. are SMILES or SDF. To write molecules to a disk file in SDF format:
  117. <<eval=FALSE>>=
  118. write.molecules(mols, filename='mymols.sdf')
  119. @
  120. By default, if \texttt{mols} is a list of multiple molecules, all of
  121. them will be written to a single SDF file. If this is not desired, you
  122. can write each on to individual files (which are prefixed by the
  123. value of \funcarg{filename}):
  124. <<eval=FALSE>>=
  125. write.molecules(mols, filename='mymols.sdf', together=FALSE)
  126. @
  127. To generate a SMILES representation of a molecule we can do
  128. <<>>=
  129. get.smiles(mols[[1]])
  130. unlist(lapply(mols, get.smiles))
  131. @
  132. \section{Visualization}
  133. Currently the \Rpackage{rcdk} package only supports 2D
  134. visualization. This can be used to view the structure of individual
  135. molecules or multiple molecules in a tabular format. It is also
  136. possible to view a molecular-data table, where one of the columns is
  137. the 2D image and the remainder can contain data associated with the molecules.
  138. Unfortunately, due to event handling issues, the depictions will
  139. display on OS X, but the Swing window will become unresponsive. As a
  140. result, it is not recommended to generate 2D depictions on OS X.
  141. Molecule visualization is performed using the
  142. \texttt{view.molecule.2d} function. For viewing a single molecule or a
  143. list of multiple molecules, it is simply
  144. <<eval=FALSE>>=
  145. smiles <- c('CCC', 'CCN', 'CCN(C)(C)',
  146. 'c1ccccc1Cc1ccccc1',
  147. 'C1CCC1CC(CN(C)(C))CC(=O)CC')
  148. mols <- parse.smiles(smiles)
  149. view.molecule.2d(mols[[1]])
  150. view.molecule.2d(mols)
  151. @
  152. If multiple molecules are provided, they are display in a matrix
  153. format, with a default of four columns. This can be changed via the
  154. \emph{ncol} argument. Furthermore, the size of the images are 200
  155. $\times$ 200 pixels, by default. But this can be easily changed via
  156. the \emph{cellx} and \emph{celly} arguments.
  157. In many cases, it is useful to view a ``molecular spreadsheet'', which
  158. is a table of molecules along with information related to the
  159. molecules being viewed. The data is arranged in a spreadsheet like
  160. manner, with one of the columns being molecules and the remainder
  161. being textual or numeric information.
  162. This can be achieved using the \texttt{view.table} method which takes
  163. a list of molecule objects and a \texttt{data.frame} containing the
  164. associated data. As expected, the number of rows in the
  165. \texttt{data.frame} should equal the length of the molecule list.
  166. <<eval=FALSE>>=
  167. dframe <- data.frame(x = runif(4),
  168. toxicity = factor(c('Toxic', 'Toxic', 'Nontoxic', 'Nontoxic')),
  169. solubility = c('yes', 'yes', 'no', 'yes'))
  170. view.table(mols[1:4], dframe)
  171. @
  172. As shown, the \texttt{view.table} supports numeric, character and
  173. factor data types.
  174. \section{Manipulating Molecules}
  175. In general, given a \texttt{jobjRef} for a molecule object one can
  176. access all the class and methods of the CDK library via
  177. \Rpackage{rJava}. However this can be cumbersome. The \Rpackage{rcdk}
  178. package is in the process of exposing methods and classes that
  179. manipulate molecules. This section describes them - more will be
  180. implemented in future releases.
  181. \subsection{Adding Information to Molecules}
  182. In many scenarios it's useful to associate information with
  183. molecules. Within R, you could always create a \texttt{data.frame} and
  184. store the molecule objects along with relevant information in
  185. it. However, when serializing the molecules, you want to be able to
  186. store the associated information.
  187. Using the CDK it's possible to directly add information to a molecule
  188. object using properties. Note that adding such properties uses a
  189. key-value paradigm, where the key should be of class
  190. \texttt{character}. The value can be of class \texttt{integer},
  191. \texttt{double}, \texttt{character} or \texttt{jobjRef}. Obviously,
  192. after setting a property, you can get a property by its key.
  193. <<>>=
  194. mol <- parse.smiles('c1ccccc1')[[1]]
  195. set.property(mol, "title", "Molecule 1")
  196. set.property(mol, "hvyAtomCount", 6)
  197. get.property(mol, "title")
  198. @
  199. It is also possible to get all available properties at once in the
  200. from of a list. The property names are used as the list names.
  201. <<>>=
  202. get.properties(mol)
  203. @
  204. After adding such properties to the molecule, you can write it out to
  205. an SD file, so that the property values become SD tags.
  206. <<eval=FALSE>>=
  207. write.molecules(mol, 'tagged.sdf', write.props=TRUE)
  208. @
  209. \subsection{Atoms and Bonds}
  210. Probably the most important thing to do is to get the atoms and bonds
  211. of a molecule. The code below gets the atoms and bonds as lists of
  212. \texttt{jobjRef} objects, which can be manipulated using
  213. \Rpackage{rJava} or via other methods of this package.
  214. <<>>=
  215. mol <- parse.smiles('c1ccccc1C(Cl)(Br)c1ccccc1')[[1]]
  216. atoms <- get.atoms(mol)
  217. bonds <- get.bonds(mol)
  218. cat('No. of atoms =', length(atoms), '\n')
  219. cat('No. of bonds =', length(bonds), '\n')
  220. @
  221. Right now, given an atom the \Rpackage{rcdk} package does not offer a
  222. lot of methods to operate on it. One must access the CDK directly. In
  223. the future more manipulators will be added. Right now, you can get the
  224. symbol for each atom
  225. <<>>=
  226. unlist(lapply(atoms, get.symbol))
  227. @
  228. It's also possible to get the 3D (or 2D coordinates) for an atom.
  229. <<>>=
  230. coords <- get.point3d(atoms[[1]])
  231. @
  232. Given this, it's quite easy to get the 3D coordinate matrix for a molecule
  233. <<eval=FALSE>>=
  234. coords <- do.call('rbind', lapply(atoms, get.point3d))
  235. @
  236. Once you have the coordinate matrix, a quick way to check whether the
  237. molecule is flat is to do
  238. <<eval=FALSE>>=
  239. if ( any(apply(coords, 2, function(x) length(unique(x))) == 1) ) {
  240. print("molecule is flat")
  241. }
  242. @
  243. This is quite a simplistic check that just looks at whether the X, Y
  244. or Z coordinates are constant. To be more rigorous one could evaluate
  245. the moments of inertia about the axes.
  246. \subsection{Substructure Matching}
  247. The CDK library supports substructure searches using SMARTS (or
  248. SMILES) patterns. The implementation allows one to check whether a
  249. target molecule contains a substructure or not as well as to retrieve
  250. the atoms and bonds of the target molecule that match the query
  251. substructure. At this point, the \Rpackage{rcdk} only support the
  252. former operation - given a query pattern, does it occur or not in a
  253. list of target molecules. The \texttt{match} method of this package is
  254. modeled after the same method in the \Rpackage{base} package. An
  255. example of its usage would be to identify molecules that contain a
  256. carbon atom that has exactly two bonded neighbors.
  257. <<<eval=TRUE>>=
  258. mols <- parse.smiles(c('CC(C)(C)C','c1ccc(Cl)cc1C(=O)O', 'CCC(N)(N)CC'))
  259. query <- '[#6D2]'
  260. hits <- match(query, mols)
  261. print(hits)
  262. @
  263. \section{Molecular Descriptors}
  264. Probably the most desired feature when doing predictive modeling of
  265. molecular activities is molecular descriptors. The CDK implements a
  266. variety of molecular descriptors, categorized into topological,
  267. constitutional, geometric, electronic and hybrid. It is possible to
  268. evaluate all available descriptors at one go, or evaluate individual
  269. descriptors.
  270. First, we can take a look at the available descriptor
  271. categories.
  272. <<>>=
  273. dc <- get.desc.categories()
  274. dc
  275. @
  276. Given the categories we can get the names of the descriptors for a
  277. single category. Of course, you can always provide the category name
  278. directly.
  279. <<>>=
  280. dn <- get.desc.names(dc[1])
  281. @
  282. Each descriptor name is actually a fully qualified Java class name for
  283. the corresponding descriptor. These names can be supplied to
  284. \texttt{eval.desc} to evaluate a single or multiple descriptors for
  285. one or more molecules.
  286. <<>>=
  287. aDesc <- eval.desc(mol, dn[1])
  288. allDescs <- eval.desc(mol, dn)
  289. @
  290. The return value of \texttt{eval.desc} is a data.frame with the
  291. descriptors in the columns and the molecules in the rows. For the
  292. above example we get a single row. But given a list of molecules, we
  293. can easily get a descriptor matrix. For example, lets build a linear
  294. regression model to predict boiling points for the BP dataset. First
  295. we need a set of descriptors and so we evaluate all available descriptors.
  296. Also note that since a descriptor might belong to more than one
  297. category, we should obtain a unique set of descriptor names
  298. <<>>=
  299. data(bpdata)
  300. mols <- parse.smiles(bpdata[,1])
  301. descNames <- unique(unlist(sapply(get.desc.categories(), get.desc.names)))
  302. descs <- eval.desc(mols, descNames)
  303. class(descs)
  304. dim(descs)
  305. @
  306. %descs <- eval.desc(mols,c("org.openscience.cdk.qsar.descriptors.molecular.KierHallSmartsDescriptor"))
  307. As you can see we get a \texttt{data.frame}. Many of the columns will
  308. be \texttt{NA}. This is because when a descriptor cannot be evaluated
  309. (due to some error) it returns NA. In our case, since our molecules
  310. have no 3D coordinates many geometric, electronic and hybrid
  311. descriptors cannot be evaluated.
  312. Given the ubiquity of certain descriptors, some of them are directly
  313. available via their own functions. Specifically, one can calculate
  314. TPSA (topological polar surface area), AlogP and XlogP without having
  315. to go through \texttt{eval.desc}.\footnote{Note that AlogP and XlogP
  316. assume that hydrogens are explicitly specified in the molecule. This
  317. may not be true if the molecules were obtained from SMILES}.
  318. <<>>=
  319. mol <- parse.smiles('CC(=O)CC(=O)NCN')[[1]]
  320. convert.implicit.to.explicit(mol)
  321. get.tpsa(mol)
  322. get.xlogp(mol)
  323. get.alogp(mol)
  324. @
  325. In any case, now that we have a descriptor matrix, we easily build a
  326. linear regression model. First, remove NA's, correlated and constant
  327. columns. The code is shown below, but since it involves a stochastic
  328. element, we will not run it for this example. If we were to perform
  329. feature selection, then this type of reduction would have to be performed.
  330. <<eval=FALSE>>=
  331. descs <- descs[, !apply(descs, 2, function(x) any(is.na(x)) )]
  332. descs <- descs[, !apply( descs, 2, function(x) length(unique(x)) == 1 )]
  333. r2 <- which(cor(descs)^2 > .6, arr.ind=TRUE)
  334. r2 <- r2[ r2[,1] > r2[,2] , ]
  335. descs <- descs[, -unique(r2[,2])]
  336. @
  337. Note that the above correlation reduction step is pretty crude and
  338. there are better ways to do it. Given the reduced descriptor matrix,
  339. we can perform feature selection (say using \Rpackage{leaps} or a GA)
  340. to identify a suitable subset of descriptors. For now, we'll select
  341. some descriptors that we know are correlated to BP. The fit is shown
  342. in Figure \ref{fig:ols} which plots the observed versus predicted BP's.
  343. <<>>=
  344. model <- lm(BP ~ khs.sCH3 + khs.sF + apol + nHBDon, data.frame(bpdata, descs))
  345. summary(model)
  346. @
  347. %model <- lm(BP ~ khs.sCH3 + khs.sF + apol + nHBDon, data.frame(bpdata, descs))
  348. %model <- lm(BP ~ khs.sCH3 + khs.sF, data.frame(bpdata, descs))
  349. \begin{figure}[h]
  350. \centering
  351. <<fig=TRUE,echo=FALSE>>=
  352. par(mar=c(4.3,4.3,1,1),cex.lab=1.3, pty='s')
  353. plot(bpdata$BP, model$fitted, pch=19,
  354. ylab='Predicted BP', xlab='Observed BP',
  355. xlim=range(bpdata$BP), ylim=range(bpdata$BP))
  356. abline(0,1, col='red')
  357. @
  358. \caption{A plot of observed versus predicted boiling points,
  359. obtained from a linear regression model using 277 molecules.}
  360. \label{fig:ols}
  361. \end{figure}
  362. \section{Fingerprints}
  363. Fingerprints are a common representation used for a variety of
  364. purposes such as similarity searching and predictive modeling. The CDK
  365. provides four types of fingerprints, viz.,
  366. \begin{itemize}
  367. \item Standard - a path based, hashed fingerprint. The default size is
  368. 1024 bits, but this can be changed via an argument
  369. \item Extended - similar to the Standard form, but takes into account
  370. ring systems. Default size is 1024 bits
  371. \item EState - a structural key type fingerprint that checks for the
  372. presence or absence of 79 EState substructures. Length of the
  373. fingerprint is 79 bits
  374. \item MACCS - the well known 166 bit structural keys
  375. \end{itemize}
  376. When using \Rpackage{rcdk} to evaluate fingerprints, you will need the
  377. \Rpackage{fingerprint} package. Since this is a dependency of the
  378. \Rpackage{rcdk} package, it should have been automatically installed.
  379. To generate the fingerprints, we must first obtain molecule
  380. objects. Thus for example,
  381. <<>>=
  382. smiles <- c('CCC', 'CCN', 'CCN(C)(C)',
  383. 'c1ccccc1Cc1ccccc1',
  384. 'C1CCC1CC(CN(C)(C))CC(=O)CC')
  385. mols <- parse.smiles(smiles)
  386. fp <- get.fingerprint(mols[[1]], type='maccs')
  387. @
  388. The variable, \texttt{fp}, will be of class \rclass{fingerprint} and
  389. can be manipulated using the methods provided by the package of the
  390. same name. A simple example is to perform a hierarchical clustering of
  391. the first 50 structures in the BP dataset.
  392. <<>>= mols <-
  393. mols <- parse.smiles(bpdata[1:50,1])
  394. fps <- lapply(mols, get.fingerprint, type='extended')
  395. fp.sim <- fp.sim.matrix(fps, method='tanimoto')
  396. fp.dist <- 1 - fp.sim
  397. @
  398. Once we have the distance matrix (which we must derive from the
  399. Tanimoto similarity matrix), we can then perform the clustering and
  400. visualize it.
  401. \begin{figure}[t]
  402. \centering
  403. <<fig=TRUE>>=
  404. clustering <- hclust(as.dist(fp.dist))
  405. plot(clustering, main='A Clustering of the BP dataset')
  406. @
  407. \caption{A clustering of the first 50 molecules of the BP dataset, using the CDK extended fingerprints.}
  408. \label{fig:cluster}
  409. \end{figure}
  410. Another common task for fingerprints is similarity searching. In other
  411. words, given a collection of ``target'' molecules, find those
  412. molecules that are similar to a ``query'' molecule. This is achieved
  413. by evaluating a similarity metric between the query and each of the
  414. target molecules. Those target molecules exceeding a user defined
  415. cutoff will be returned. With the help of the \Rpackage{fingerprint}
  416. package this is easily accomplished.
  417. For example, we can identify all the molecules in the BP dataset that
  418. have a Tanimoto similarity of 0.3 or more with acetalehyde, and then
  419. create a summary in Table \ref{tab:sims}. Note that this could also be
  420. accomplished with molecular descriptors, in which case you'd probably
  421. evaluate the Euclidean distance between descriptor vectors.
  422. <<>>=
  423. query.mol <- parse.smiles('CC(=O)')[[1]]
  424. target.mols <- parse.smiles(bpdata[,1])
  425. query.fp <- get.fingerprint(query.mol, type='maccs')
  426. target.fps <- lapply(target.mols, get.fingerprint, type='maccs')
  427. sims <- unlist(lapply(target.fps,
  428. distance,
  429. fp2=query.fp, method='tanimoto'))
  430. hits <- which(sims > 0.3)
  431. @
  432. <<echo=FALSE,results=tex>>=
  433. d <- data.frame(SMILES=bpdata[hits,1], Similarity=sims[hits])
  434. row.names(d) <- NULL
  435. d <- d[sort.list(d[,2], dec=TRUE),]
  436. xtable(d, label="tab:sims",
  437. caption="Summary of molecules in the BP dataset that are greater than 0.3 similar to acetaldehyde")
  438. @
  439. \section{Handling Molecular Formulae}
  440. The molecular formula is the simplest way to characterize a molecular
  441. compound. It specifies the actual number of atoms of each element
  442. contained in the molecule. A molecular formula is represented by the
  443. chemical symbol of each constituent element. If a molecule contains
  444. more than one atom for a particular element, the quantity is shown as
  445. subscript after the chemical symbol. Otherwise, the number of neutrons
  446. (atomic mass) that an atom is composed can differ. This different type
  447. of atoms are known as isotopes. The number of nucleos is denoted as
  448. superscripted prefix previous to the chemical element. Generally it is
  449. not added when the isotope that characterizes the element is the most
  450. occurrence in nature. E.g. $C_4H_{11}O^2D$.
  451. \subsection{Parsing a Molecule To a Molecular Formula}
  452. Front a molecule, defined as conjunct of atoms helding together by
  453. chemical bonds, we can simplify it taking only the information about
  454. the atoms. \Rpackage{rcdk} package provides a parser to translate
  455. molecules to molecular formlulas, the \texttt{get.mol2formula}
  456. function.
  457. <<>>=
  458. sp <- get.smiles.parser()
  459. molecule <- parse.smiles('N')[[1]]
  460. convert.implicit.to.explicit(molecule)
  461. formula <- get.mol2formula(molecule,charge=0)
  462. @
  463. Note that the above formula object is a CDKFormula-class. A cdkFormula-class contains some attributes that
  464. defines a molecular formula. For example, the mass, the charge, the isotopes, the character representation
  465. of the molecular formula and the IMolecularFormula \texttt{jobjRef} object.
  466. The molecular mass for this formula.
  467. <<>>=
  468. formula@mass
  469. @
  470. The charge for this formula.
  471. <<>>=
  472. formula@charge
  473. @
  474. The isotopes for this formula. It is formed for three columns.
  475. \texttt{isoto} (the symbol expression of the isotope), \texttt{number}
  476. (number of atoms for this isotope) and \texttt{mass} (exact mass of
  477. this isotope).
  478. <<>>=
  479. formula@isotopes
  480. @
  481. The \texttt{jobjRef} object from the \texttt{IMolecularFormula}
  482. java class in CDK.
  483. <<>>=
  484. formula@objectJ
  485. @
  486. The symbol expression of the molecular formula.
  487. <<>>=
  488. formula@string
  489. @
  490. Depending of the circumstances, you may want to change the charge
  491. of the molecular formula.
  492. <<eval=FALSE>>=
  493. formula <- set.charge.formula(formula, charge=1)
  494. @
  495. \subsection{Initializing a Formula from the Symbol Expression}
  496. Other way to create a \texttt{cdkFormula} is from the symbol
  497. expression. Thus, setting the characters of the elemental formula, the
  498. function \texttt{get.formula} parses it to an object of
  499. cdkFormula-class.
  500. <<>>=
  501. formula <- get.formula('NH4', charge = 1);
  502. formula
  503. @
  504. \subsection{Generating Molecular Formula}
  505. Mass spectrometry is an essential and reliable technique to determine
  506. the molecular mass of compounds. Conversely, one can use the measured
  507. mass to identify the compound via its elemental formula. One of the
  508. limitations of the method is the precision and accuracy of the
  509. instrumentation. As a result, rather than specify exact masses, we
  510. specify tolerances or ranges of possible mass, resulting in multiple
  511. candidate formulae for a given \emph{mass window}. The
  512. \texttt{generate.formula} function returns a list formulas which have
  513. a given mass (within an error window):
  514. <<>>=
  515. mfSet <- generate.formula(18.03383, window=1,
  516. elements=list(c("C",0,50),c("H",0,50),c("N",0,50)),
  517. validation=FALSE);
  518. for (i in 1:length(mfSet)) {
  519. print(mfSet[i])
  520. }
  521. @
  522. Important to know is if an elemental formula is valid. The method \texttt{isvalid.formula}
  523. provides this function. Two constraints can be applied, the nitrogen rule and the
  524. (Ring Double Bond Equivalent) RDBE rule.
  525. <<>>=
  526. formula <- get.formula('NH4', charge = 0);
  527. isvalid.formula(formula,rule=c("nitrogen","RDBE"))
  528. @
  529. We can observe that the ammonium is only valid if it is defined with
  530. charge of $+1$.
  531. \subsection{Calculating Isotope Pattern}
  532. Due to the measurement errors in medium resolution spectrometry, a
  533. given error window can result in a massive number of candidate
  534. formulae. The isotope pattern of ions obtained experimentally can be
  535. compared with the theoretical ones. The best match is reflected as the
  536. most probable elemental formula. \Rpackage{rcdk} provides the function
  537. \texttt{get.isotopes.pattern} which predicts the theoretical isotope
  538. pattern given a formula.
  539. <<>>=
  540. formula <- get.formula('CHCl3', charge = 0)
  541. isotopes <- get.isotopes.pattern(formula,minAbund=0.1)
  542. isotopes
  543. @
  544. In this example we generate a formula for a possible compound with a
  545. charge ($z = \approx 0$) containing the standard elements C, H, and
  546. Cl. The isotope pattern can be visually inspectd, as shown in
  547. Figure~\ref{fig:isotopes}.
  548. \begin{figure}[h]
  549. \centering
  550. <<fig=TRUE>>=
  551. plot(isotopes, type="h", xlab="m/z", ylab="Intensity")
  552. @
  553. \caption{Theoretical isotope pattern given a molecular formula.}
  554. \label{fig:isotopes}
  555. \end{figure}
  556. \end{document}