rcdk.Rnw | searchcode

/rcdk/inst/doc/rcdk.Rnw

http://github.com/rajarshi/cdkr · Unknown · 673 lines · 556 code · 117 blank · 0 comment · 0 complexity · c13514ba169fc4bac20b0275b06ebeac MD5 · raw file

% \VignetteIndexEntry{rcdk Tutorial}
% \VignettePackage{rcdk}
% \VignetteKeywords{}
% \VignetteDepends{xtable}
%% To generate the Latex code
%library(rcdk)
%Rnwfile<- file.path("rcdk.Rnw")
%Sweave(Rnwfile,pdf=TRUE,eps=TRUE,stylepath=TRUE,driver=RweaveLatex())


\documentclass[letterpaper, 11pt]{article}

\usepackage{times}
\usepackage{url}
\usepackage[pdftex,bookmarks=true]{hyperref}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\funcarg}[1]{{\texttt{#1}}}

\newcommand{\rclass}[1]{{\textit{#1}}}

<<echo=FALSE>>=
options(width=74)
library(xtable)
@
\parindent 0in
\parskip 1em

\begin{document}

\title{rcdk: Integrating the CDK with R}
\author{Rajarshi Guha \\ Miguel Rojas Cherto}
\maketitle
\tableofcontents
\newpage

\section{Introduction}

Given that much of cheminformatics involves mathematical and
statistical modeling of chemical information, R is a natural platform
for such work. There are many cheminformatics applications that will
generate useful information such as descriptors, fingerprints and so
on. While one can always run these applications to generate data that
is then imported into R, it can be convenient to be able to manipulate
chemical structures and generate chemical information with the R environment.

The CDK is a Java library for cheminformatics that supports a wide
variety of cheminformatics functionality ranging from reading
molecular file formats, performing ring perception and armaticity
detection to fingerprint generation and molecular descriptors. 

The goal of the \Rpackage{rcdk} package is to allow an R user to
access the cheminformatics functionality of the CDK from within
R. While one can use the \Rpackage{rJava} package to make direct calls
to specific methods in the CDK, from R, such usage does not usually
follow common R idioms. The goal of the \Rpackage{rcdk} is to allow
users to use the CDK classes and methods in an R-like fashion.

The library is loaded as follows

<<<>>=
library(rcdk)
@ 
To list the documentation for all available packages try
<<eval=FALSE>>=
library(help=rcdk)
@ 
The package also provides an example data set, called \texttt{bpdata}
which contains 277 molecules, in SMILES format and their associated
boiling points (BP) in Kelvin. The data.frame has two columns, viz.,
the SMILES and the BP. Molecules names are used as row names.

\section{Input / Output}

Chemical structures come in a variety of formats and the CDK supports
many of them. Many such formats are disk based and these files can be
parsed and loaded by specifying their full paths

<<eval=FALSE>>=
mols <- load.molecules( c('data1.sdf', '/some/path/data2.sdf') )
@ 

Note that the above function will load any file format that is
supported by the CDK, so there's no need to specify formats. In
addition one can specify a URL (which should start with ``http://'')
to specify remote files as well.  The result of this function is a
\texttt{list} of molecule objects. The molecule objects are of class
\texttt{jobjRef} (provided by the \Rpackage{rJava} package). As a
result,they are pretty opaque to the user and are really meant to be
processed using methods from the \Rpackage{rcdk} or \Rpackage{rJava}
packages.

However, since it loads all the molecules from the specified file into 
a list, large files can lead to out of memory errors. In such a situtation
it is preferable to iterate over the file, one structure at a time. Currently
this behavior is supported for SDF and SMI files. An example of such a 
usage for a large SD file would be:
<<eval=FALSE>>=
iter <- iload.molecules('verybig.sdf', type='sdf')
while(hasNext(iter)) {
 mol <- nextElem(iter)
 print(get.property(mol, "cdk:Title"))
}
@ 

Another common way to obtain molecule objects is by parsing SMILES
strings. The simplest way to do this is
<<>>=
smile <- 'c1ccccc1CC(=O)C(N)CC1CCCCOC1'
mol <- parse.smiles(smile)[[1]]
@ 
Usage is more efficient when multiple SMILE are supplied, since then a single
SMILES parser object is used to parse all the supplied SMILES.
<<>>=
smiles <- c('CCC', 'c1ccccc1', 'CCCC(C)(C)CC(=O)NC')
mols <- parse.smiles(smiles)
@ 
If you plan on parsing a large number of SMILES, you may run into
memory issues, due to the large size of \texttt{IAtomContainer}
objects. In such a case, it can be useful to call the Java and R garbage
collectors explicitly at the appropriate time. In addition it can be useful
to explicitly allocate a large amount of memory for the JVM. For example,
<<eval=FALSE>>=
options("java.parameters"=c("-Xmx4000m"))
library(rcdk)

for (smile in smiles) {
  m <- parse.smiles(smile)
  ## perform operations on this molecule
  
  jcall("java/lang/System","V","gc")
  gc()
}
@ 
Given a list of molecule objects, it is possible to serialize them to
a file in some specified format. Currently, the only output formats
are SMILES or SDF. To write molecules to a disk file in SDF format:
<<eval=FALSE>>=
write.molecules(mols, filename='mymols.sdf')
@ 
By default, if \texttt{mols} is a list of multiple molecules, all of
them will be written to a single SDF file. If this is not desired, you
can write each on to individual files (which are prefixed by the
value of \funcarg{filename}):

<<eval=FALSE>>=
write.molecules(mols, filename='mymols.sdf', together=FALSE)
@ 

To generate a SMILES representation of a molecule we can do
<<>>=
get.smiles(mols[[1]])
unlist(lapply(mols, get.smiles))
@ 

\section{Visualization}

Currently the \Rpackage{rcdk} package only supports 2D
visualization. This can be used to view the structure of individual
molecules or multiple molecules in a tabular format. It is also
possible to view a molecular-data table, where one of the columns is
the 2D image and the remainder can contain data associated with the molecules.

Unfortunately, due to event handling issues, the depictions will
display on OS X, but the Swing window will become unresponsive. As a
result, it is not recommended to generate 2D depictions on OS X.

Molecule visualization is performed using the
\texttt{view.molecule.2d} function. For viewing a single molecule or a
list of multiple molecules, it is simply

<<eval=FALSE>>=
smiles <- c('CCC', 'CCN', 'CCN(C)(C)',
            'c1ccccc1Cc1ccccc1',
            'C1CCC1CC(CN(C)(C))CC(=O)CC')
mols <- parse.smiles(smiles)
view.molecule.2d(mols[[1]])
view.molecule.2d(mols)
@ 
If multiple molecules are provided, they are display in a matrix
format, with a default of four columns. This can be changed via the
\emph{ncol} argument. Furthermore, the size of the images are 200
$\times$ 200 pixels, by default. But this can be easily changed via
the \emph{cellx} and \emph{celly} arguments.

In many cases, it is useful to view a ``molecular spreadsheet'', which
is a table of molecules along with information related to the
molecules being viewed. The data is arranged in a spreadsheet like
manner, with one of the columns being molecules and the remainder
being textual or numeric information. 

This can be achieved using the \texttt{view.table} method which takes
a list of molecule objects and a \texttt{data.frame} containing the
associated data. As expected, the number of rows in the
\texttt{data.frame} should equal the length of the molecule list.

<<eval=FALSE>>=
dframe <- data.frame(x = runif(4),
                     toxicity = factor(c('Toxic', 'Toxic', 'Nontoxic', 'Nontoxic')),
                     solubility = c('yes', 'yes', 'no', 'yes'))
view.table(mols[1:4], dframe)
@ 
As shown, the \texttt{view.table} supports numeric, character and
factor data types.


\section{Manipulating Molecules}

In general, given a \texttt{jobjRef} for a molecule object one can
access all the class and methods of the CDK library via
\Rpackage{rJava}. However this can be cumbersome. The \Rpackage{rcdk}
package is in the process of exposing methods and classes that
manipulate molecules. This section describes them - more will be
implemented in future releases.

\subsection{Adding Information to Molecules}

In many scenarios it's useful to associate information with
molecules. Within R, you could always create a \texttt{data.frame} and
store the molecule objects along with relevant information in
it. However, when serializing the molecules, you want to be able to
store the associated information.

Using the CDK it's possible to directly add information to a molecule
object using properties. Note that adding such properties uses a
key-value paradigm, where the key should be of class
\texttt{character}. The value can be of class \texttt{integer},
\texttt{double}, \texttt{character} or \texttt{jobjRef}. Obviously,
after setting a property, you can get a property by its key.
<<>>=
mol <- parse.smiles('c1ccccc1')[[1]]
set.property(mol, "title", "Molecule 1")
set.property(mol, "hvyAtomCount", 6)
get.property(mol, "title")
@ 

It is also possible to get all available properties at once in the
from of a list. The property names are used as the list names.
<<>>=
get.properties(mol)
@ 

After adding such properties to the molecule, you can write it out to
an SD file, so that the property values become SD tags.
<<eval=FALSE>>=
write.molecules(mol, 'tagged.sdf', write.props=TRUE)
@ 

\subsection{Atoms and Bonds}



Probably the most important thing to do is to get the atoms and bonds
of a molecule. The code below gets the atoms and bonds as lists of
\texttt{jobjRef} objects, which can be manipulated using
\Rpackage{rJava} or via other methods of this package.

<<>>=
mol <- parse.smiles('c1ccccc1C(Cl)(Br)c1ccccc1')[[1]]
atoms <- get.atoms(mol)
bonds <- get.bonds(mol)
cat('No. of atoms =', length(atoms), '\n')
cat('No. of bonds =', length(bonds), '\n')
@ 

Right now, given an atom the \Rpackage{rcdk} package does not offer a
lot of methods to operate on it. One must access the CDK directly. In
the future more manipulators will be added. Right now, you can get the
symbol for each atom

<<>>=
unlist(lapply(atoms, get.symbol))
@ 

It's also possible to get the 3D (or 2D coordinates) for an atom. 

<<>>=
coords <- get.point3d(atoms[[1]])
@ 

Given this, it's quite easy to get the 3D coordinate matrix for a molecule

<<eval=FALSE>>=
coords <- do.call('rbind', lapply(atoms, get.point3d))
@ 

Once you have the coordinate matrix, a quick way to check whether the
molecule is flat is to do
<<eval=FALSE>>=
if ( any(apply(coords, 2, function(x) length(unique(x))) == 1) ) {
  print("molecule is flat")
}
@ 

This is quite a simplistic check that just looks at whether the X, Y
or Z coordinates are constant. To be more rigorous one could evaluate
the moments of inertia about the axes.

\subsection{Substructure Matching}
The CDK library supports substructure searches using SMARTS (or
SMILES) patterns. The implementation allows one to check whether a
target molecule contains a substructure or not as well as to retrieve
the atoms and bonds of the target molecule that match the query
substructure. At this point, the \Rpackage{rcdk} only support the
former operation - given a query pattern, does it occur or not in a
list of target molecules. The \texttt{match} method of this package is
modeled after the same method in the \Rpackage{base} package. An
example of its usage would be to identify molecules that contain a
carbon atom that has exactly two bonded neighbors.
<<<eval=TRUE>>=
mols <- parse.smiles(c('CC(C)(C)C','c1ccc(Cl)cc1C(=O)O', 'CCC(N)(N)CC'))
query <- '[#6D2]'
hits <- match(query, mols)
print(hits)
@ 

\section{Molecular Descriptors}

Probably the most desired feature when doing predictive modeling of
molecular activities is molecular descriptors. The CDK implements a
variety of molecular descriptors, categorized into topological,
constitutional, geometric, electronic and hybrid. It is possible to
evaluate all available descriptors at one go, or evaluate individual
descriptors.

First, we can take a look at the available descriptor
categories. 

<<>>=
dc <- get.desc.categories()
dc
@ 

Given the categories we can get the names of the descriptors for a
single category. Of course, you can always provide the category name
directly.

<<>>=
dn <- get.desc.names(dc[1])
@ 

Each descriptor name is actually a fully qualified Java class name for
the corresponding descriptor. These names can be supplied to
\texttt{eval.desc} to evaluate a single or multiple descriptors for
one or more molecules.

<<>>=
aDesc <- eval.desc(mol, dn[1])
allDescs <- eval.desc(mol, dn)
@ 

The return value of \texttt{eval.desc} is a data.frame with the
descriptors in the columns and the molecules in the rows. For the
above example we get a single row. But given a list of molecules, we
can easily get a descriptor matrix. For example, lets build a linear
regression model to predict boiling points for the BP dataset. First
we need a set of descriptors and so we evaluate all available descriptors.
Also note that since a descriptor might belong to more than one
category, we should obtain a unique set of descriptor names
<<>>=
data(bpdata)
mols <- parse.smiles(bpdata[,1])
descNames <- unique(unlist(sapply(get.desc.categories(), get.desc.names)))
descs <- eval.desc(mols, descNames) 
class(descs)
dim(descs)
@ 
%descs <- eval.desc(mols,c("org.openscience.cdk.qsar.descriptors.molecular.KierHallSmartsDescriptor")) 

As you can see we get a \texttt{data.frame}. Many of the columns will
be \texttt{NA}. This is because when a descriptor cannot be evaluated
(due to some error) it returns NA. In our case, since our molecules
have no 3D coordinates many geometric, electronic and hybrid
descriptors cannot be evaluated.

Given the ubiquity of certain descriptors, some of them are directly
available via their own functions. Specifically, one can calculate
TPSA (topological polar surface area), AlogP and XlogP without having
to go through \texttt{eval.desc}.\footnote{Note that AlogP and XlogP
  assume that hydrogens are explicitly specified in the molecule. This
  may not be true if the molecules were obtained from SMILES}.

<<>>=
mol <- parse.smiles('CC(=O)CC(=O)NCN')[[1]]
convert.implicit.to.explicit(mol)
get.tpsa(mol)
get.xlogp(mol)
get.alogp(mol)
@ 

In any case, now that we have a descriptor matrix, we easily build a
linear regression model. First, remove NA's, correlated and constant
columns. The code is shown below, but since it involves a stochastic
element, we will not run it for this example. If we were to perform
feature selection, then this type of reduction would have to be performed.

<<eval=FALSE>>=
descs <- descs[, !apply(descs, 2, function(x) any(is.na(x)) )]
descs <- descs[, !apply( descs, 2, function(x) length(unique(x)) == 1 )]
r2 <- which(cor(descs)^2 > .6, arr.ind=TRUE)
r2 <- r2[ r2[,1] > r2[,2] , ]
descs <- descs[, -unique(r2[,2])]
@ 

Note that the above correlation reduction step is pretty crude and
there are better ways to do it. Given the reduced descriptor matrix,
we can perform feature selection (say using \Rpackage{leaps} or a GA)
to identify a suitable subset of descriptors. For now, we'll select
some descriptors that we know are correlated to BP. The fit is shown
in Figure \ref{fig:ols} which plots the observed versus predicted BP's.

<<>>=
model <- lm(BP ~ khs.sCH3 + khs.sF + apol + nHBDon, data.frame(bpdata, descs))
summary(model)
@ 
%model <- lm(BP ~ khs.sCH3 + khs.sF + apol + nHBDon, data.frame(bpdata, descs))
%model <- lm(BP ~ khs.sCH3 + khs.sF, data.frame(bpdata, descs))

\begin{figure}[h]
  \centering
<<fig=TRUE,echo=FALSE>>=
par(mar=c(4.3,4.3,1,1),cex.lab=1.3, pty='s')
plot(bpdata$BP, model$fitted, pch=19, 
     ylab='Predicted BP', xlab='Observed BP',
     xlim=range(bpdata$BP), ylim=range(bpdata$BP))
abline(0,1, col='red')
@ 
  \caption{A plot of observed versus predicted boiling points,
    obtained from a linear regression model using 277 molecules.}
  \label{fig:ols}
\end{figure}

\section{Fingerprints}

Fingerprints are a common representation used for a variety of
purposes such as similarity searching and predictive modeling. The CDK
provides four types of fingerprints, viz.,
\begin{itemize}
\item Standard - a path based, hashed fingerprint. The default size is
  1024 bits, but this can be changed via an argument
\item Extended - similar to the Standard form, but takes into account
  ring systems. Default size is 1024 bits
\item EState - a structural key type fingerprint that checks for the
  presence or absence of 79 EState substructures. Length of the
  fingerprint is 79 bits
\item MACCS - the well known 166 bit structural keys
\end{itemize}
When using \Rpackage{rcdk} to evaluate fingerprints, you will need the
\Rpackage{fingerprint} package. Since this is a dependency of the
\Rpackage{rcdk} package, it should have been automatically installed.

To generate the fingerprints, we must first obtain molecule
objects. Thus for example,
<<>>=
smiles <- c('CCC', 'CCN', 'CCN(C)(C)', 
            'c1ccccc1Cc1ccccc1',
            'C1CCC1CC(CN(C)(C))CC(=O)CC')
mols <- parse.smiles(smiles)
fp <- get.fingerprint(mols[[1]], type='maccs')
@ 

The variable, \texttt{fp}, will be of class \rclass{fingerprint} and
can be manipulated using the methods provided by the package of the
same name. A simple example is to perform a hierarchical clustering of
the first 50 structures in the BP dataset.  

<<>>= mols <-
mols <- parse.smiles(bpdata[1:50,1]) 
fps <- lapply(mols, get.fingerprint, type='extended') 
fp.sim <- fp.sim.matrix(fps, method='tanimoto') 
fp.dist <- 1 - fp.sim 
@

Once we have the distance matrix (which we must derive from the
Tanimoto similarity matrix), we can then perform the clustering and
visualize it.
\begin{figure}[t]
  \centering
<<fig=TRUE>>=
clustering <- hclust(as.dist(fp.dist))
plot(clustering, main='A Clustering of the BP dataset')
@   
  \caption{A clustering of the first 50 molecules of the BP dataset, using the CDK extended fingerprints.}
  \label{fig:cluster}
\end{figure}

Another common task for fingerprints is similarity searching. In other
words, given a collection of ``target'' molecules, find those
molecules that are similar to a ``query'' molecule. This is achieved
by evaluating a similarity metric between the query and each of the
target molecules. Those target molecules exceeding a user defined
cutoff will be returned. With the help of the \Rpackage{fingerprint}
package this is easily accomplished.

For example, we can identify all the molecules in the BP dataset that
have a Tanimoto similarity of 0.3 or more with acetalehyde, and then
create a summary in Table \ref{tab:sims}. Note that this could also be
accomplished with molecular descriptors, in which case you'd probably
evaluate the Euclidean distance between descriptor vectors.

<<>>=
query.mol <- parse.smiles('CC(=O)')[[1]]
target.mols <- parse.smiles(bpdata[,1])
query.fp <- get.fingerprint(query.mol, type='maccs')
target.fps <- lapply(target.mols, get.fingerprint, type='maccs')
sims <- unlist(lapply(target.fps, 
                      distance, 
                      fp2=query.fp, method='tanimoto'))
hits <- which(sims > 0.3)
@ 


<<echo=FALSE,results=tex>>=
d <- data.frame(SMILES=bpdata[hits,1], Similarity=sims[hits])
row.names(d) <- NULL
d <- d[sort.list(d[,2], dec=TRUE),]
xtable(d, label="tab:sims", 
       caption="Summary of molecules in the BP dataset that are greater than 0.3 similar to acetaldehyde")
@ 


\section{Handling Molecular Formulae}

The molecular formula is the simplest way to characterize a molecular
compound. It specifies the actual number of atoms of each element
contained in the molecule. A molecular formula is represented by the
chemical symbol of each constituent element. If a molecule contains
more than one atom for a particular element, the quantity is shown as
subscript after the chemical symbol. Otherwise, the number of neutrons
(atomic mass) that an atom is composed can differ. This different type
of atoms are known as isotopes. The number of nucleos is denoted as
superscripted prefix previous to the chemical element. Generally it is
not added when the isotope that characterizes the element is the most
occurrence in nature. E.g.  $C_4H_{11}O^2D$.

\subsection{Parsing a Molecule To a Molecular Formula}

Front a molecule, defined as conjunct of atoms helding together by
chemical bonds, we can simplify it taking only the information about
the atoms. \Rpackage{rcdk} package provides a parser to translate
molecules to molecular formlulas, the \texttt{get.mol2formula}
function.

<<>>=
sp <- get.smiles.parser()
molecule <- parse.smiles('N')[[1]]
convert.implicit.to.explicit(molecule)
formula <- get.mol2formula(molecule,charge=0)
@ 

Note that the above formula object is a CDKFormula-class. A cdkFormula-class contains some attributes that
defines a molecular formula. For example, the mass, the charge, the isotopes, the character representation
of the molecular formula and the IMolecularFormula \texttt{jobjRef} object.   


The molecular mass for this formula.
<<>>=
formula@mass
@ 

The charge for this formula.
<<>>=
formula@charge
@ 


The isotopes for this formula. It is formed for three columns. 
\texttt{isoto} (the symbol expression of the isotope), \texttt{number}
(number of atoms for this isotope) and \texttt{mass} (exact mass of 
this isotope).
<<>>=
formula@isotopes
@ 

The \texttt{jobjRef} object from the \texttt{IMolecularFormula}
java class in CDK.
<<>>=
formula@objectJ
@ 

The symbol expression of the molecular formula.
<<>>=
formula@string
@ 


Depending of the circumstances, you may want to change the charge
of the molecular formula. 
<<eval=FALSE>>=
formula <- set.charge.formula(formula, charge=1)
@ 

\subsection{Initializing a Formula from the Symbol Expression}

Other way to create a \texttt{cdkFormula} is from the symbol
expression. Thus, setting the characters of the elemental formula, the
function \texttt{get.formula} parses it to an object of
cdkFormula-class.

<<>>=
formula <- get.formula('NH4', charge = 1);
formula
@ 


\subsection{Generating Molecular Formula}

Mass spectrometry is an essential and reliable technique to determine
the molecular mass of compounds. Conversely, one can use the measured
mass to identify the compound via its elemental formula. One of the
limitations of the method is the precision and accuracy of the
instrumentation.  As a result, rather than specify exact masses, we
specify tolerances or ranges of possible mass, resulting in multiple
candidate formulae for a given \emph{mass window}.  The
\texttt{generate.formula} function returns a list formulas which have
a given mass (within an error window):

<<>>=
mfSet <- generate.formula(18.03383, window=1, 
		elements=list(c("C",0,50),c("H",0,50),c("N",0,50)), 
		validation=FALSE);
for (i in 1:length(mfSet)) {
  print(mfSet[i])
}
@ 

Important to know is if an elemental formula is valid. The method \texttt{isvalid.formula}
provides this function. Two constraints can be applied, the nitrogen rule and the
(Ring Double Bond Equivalent) RDBE rule.

<<>>=
formula <- get.formula('NH4', charge = 0);
isvalid.formula(formula,rule=c("nitrogen","RDBE"))
@ 

We can observe that the ammonium is only valid if it is defined with
charge of $+1$.

\subsection{Calculating Isotope Pattern}

Due to the measurement errors in medium resolution spectrometry, a
given error window can result in a massive number of candidate
formulae. The isotope pattern of ions obtained experimentally can be
compared with the theoretical ones. The best match is reflected as the
most probable elemental formula. \Rpackage{rcdk} provides the function
\texttt{get.isotopes.pattern} which predicts the theoretical isotope
pattern given a formula.

<<>>=
formula <- get.formula('CHCl3', charge = 0) 
isotopes <- get.isotopes.pattern(formula,minAbund=0.1)
isotopes
@ 

In this example we generate a formula for a possible compound with a
charge ($z = \approx 0$) containing the standard elements C, H, and
Cl. The isotope pattern can be visually inspectd, as shown in
Figure~\ref{fig:isotopes}.

\begin{figure}[h]
  \centering
<<fig=TRUE>>=
plot(isotopes, type="h", xlab="m/z", ylab="Intensity")
@   
\caption{Theoretical isotope pattern given a molecular formula.}
\label{fig:isotopes}
\end{figure}



\end{document}