Simplified molecular input line entry specification
The simplified molecular input line entry specification or SMILES is a specification for unambiguously describing the structure of chemical molecules using short ASCII alpha-numeric strings. SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules.
The SMILES specification was developed by David Weininger in the late 1980s. It has since been modified and extended by others, most notably by Daylight Chemical Information Systems Inc. Other 'linear' notations include the Wiswesser Line Notation (WLN), ROSDAL and SLN (Tripos Inc). More recently, the InChI representation has been introduced.
Graph-based definition
In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree-traversal of a chemical graph. The chemical graph is first trimmed to remove hydrogen atoms, and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Parentheses are used to indicate points of branching on the tree.
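The procedure can be illustrated with a minimal Python sketch. It is not a reference implementation: the function name write_smiles and the input format (a list of element symbols plus a list of index pairs) are assumptions made for this example, every bond is treated as a single bond, hydrogens are assumed to have been removed already, and only single-digit ring labels are produced.

```python
def write_smiles(atoms, bonds, root=0):
    """Write a SMILES-like string for a hydrogen-free molecular graph.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs, all treated as single bonds
    (both the name and the input format are illustrative assumptions)
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    visited = set()
    ring_bonds = set()                            # bonds removed to break cycles
    children = {i: [] for i in neighbours}        # the spanning tree
    ring_labels = {i: [] for i in neighbours}     # numeric suffix labels
    next_label = 1

    def explore(atom, parent):
        # Depth-first search: a bond leading to an already visited atom
        # closes a cycle, so it is left out of the tree and labelled instead.
        nonlocal next_label
        visited.add(atom)
        for nb in neighbours[atom]:
            if nb == parent:
                continue
            if nb not in visited:
                children[atom].append(nb)
                explore(nb, atom)
            elif frozenset((atom, nb)) not in ring_bonds:
                ring_bonds.add(frozenset((atom, nb)))
                ring_labels[atom].append(next_label)
                ring_labels[nb].append(next_label)
                next_label += 1

    def emit(atom):
        # Print the atom, its ring labels, then its subtrees; every subtree
        # except the last is wrapped in parentheses to mark a branch.
        text = atoms[atom] + "".join(str(n) for n in ring_labels[atom])
        kids = children[atom]
        for kid in kids[:-1]:
            text += "(" + emit(kid) + ")"
        if kids:
            text += emit(kids[-1])
        return text

    explore(root, None)
    return emit(root)


ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(write_smiles(["C"] * 6, ring))                                   # cyclohexane
print(write_smiles(["F", "C", "F", "F"], [(0, 1), (1, 2), (1, 3)]))    # fluoroform
```

Starting the traversal of fluoroform from the carbon (root=1) instead of a fluorine gives a different but equally valid string, which is why the choice of root matters.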
Examples
Atoms are represented by the standard abbreviations of the chemical elements, in square brackets, such as [Au] for gold. The hydroxide anion is [OH-]. Brackets can be omitted for the "organic subset" of B, C, N, O, P, S, F, Cl, Br and I. All other elements must be enclosed in brackets. If the brackets are omitted, the proper number of implicit hydrogen atoms is assumed; for instance, the SMILES for water is simply O and that for ethanol is CCO. Double bonds are written with = and triple bonds with #, so carbon dioxide is represented as O=C=O and hydrogen cyanide as C#N. Ring closures are written with matching digits: cyclohexane is C1CCCCC1, where the two 1s mark the pair of atoms joined by the ring-closing bond, thus forming a ring of six carbons. Branches are described with parentheses, as in CCC(=O)O for propionic acid and FC(F)F, or alternatively C(F)(F)F, for fluoroform.
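The notation elements used in these examples can be separated mechanically. The following Python sketch splits a SMILES string into tokens; the function name tokenize and the token category names are assumptions made for this illustration, and the sketch covers only the features described above (bracket atoms, the organic subset, double and triple bonds, single-digit ring labels and branches), not the full SMILES grammar.

```python
import re

# One alternative per notation element described in the text.
TOKEN = re.compile(r"""
    (?P<bracket>\[[^\]]+\])          # bracket atom such as [Au] or [OH-]
  | (?P<organic>Br|Cl|[BCNOPSFI])    # organic-subset atom written without brackets
  | (?P<bond>[=\#])                  # double (=) or triple (#) bond
  | (?P<ring>\d)                     # ring-closure label
  | (?P<branch>[()])                 # start or end of a branch
""", re.VERBOSE)


def tokenize(smiles):
    """Return a list of (category, text) pairs for a simple SMILES string."""
    tokens, pos = [], 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unexpected character {smiles[pos]!r} at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for example in ["[OH-]", "O=C=O", "C#N", "C1CCCCC1", "CCC(=O)O"]:
    print(example, tokenize(example))
```

Two-letter symbols such as Cl and Br are listed before the single-letter alternatives so that, for example, Cl is not read as a carbon followed by an unknown character.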
Extensions
SMARTS is a modification of SMILES that allows, in addition to the SMILES elements, the specification of wildcard atoms and bonds. It is used to specify query structures and is widely used in chemical database search applications. This practice has led to a common misconception that chemical substructure search is achieved computationally by matching SMILES/SMARTS strings, when in fact it is achieved by the computationally more intensive search for subgraph isomorphism in the graphs reconstructed from the SMILES representations.
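The difference between string matching and graph matching can be made concrete with a small backtracking search. The Python sketch below is an illustration only, not the algorithm used by any particular toolkit: the function name find_substructure and the input format are assumptions, bond orders are ignored, and the only query feature supported is the wildcard atom, written * here.

```python
def find_substructure(q_atoms, q_bonds, t_atoms, t_bonds):
    """Search for the query graph inside the target graph.

    Atoms are lists of element symbols ("*" in the query matches any atom),
    bonds are lists of (i, j) index pairs. Returns one mapping from query
    index to target index, or None if the query does not occur.
    """
    def adjacency(n, bonds):
        adj = {i: set() for i in range(n)}
        for i, j in bonds:
            adj[i].add(j)
            adj[j].add(i)
        return adj

    q_adj = adjacency(len(q_atoms), q_bonds)
    t_adj = adjacency(len(t_atoms), t_bonds)

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            return dict(mapping)
        q = len(mapping)                      # query atoms are mapped in index order
        for t in range(len(t_atoms)):
            if t in mapping.values():
                continue                      # each target atom is used at most once
            if q_atoms[q] not in ("*", t_atoms[t]):
                continue                      # element mismatch
            # Every bond from q to an already mapped query atom must also
            # exist between the corresponding target atoms.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[q]                # backtrack and try the next target atom
        return None

    return extend({})


ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
print(find_substructure(["C", "O"], [(0, 1)], *ethanol))   # a C-O fragment
print(find_substructure(["N"], [], *ethanol))              # no nitrogen present
```

The first query matches even though the substring "CO" could be written in many other ways in a SMILES string for the same molecule, which is why the comparison is done on the reconstructed graphs.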
Since SMILES is generated by tree-traversal, the string can vary depending on the root node chosen as well as the order in which nodes are encountered. A unique or 'canonical' form of the SMILES representation can be generated by applying rules to preprocess the tree before tree-traversal. A common application of unique SMILES is for exact matching of two structures and also for ensuring uniqueness among molecules in a database.
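The idea can be shown with a toy canonicalisation in Python. This is not the algorithm used by Daylight or any other toolkit: it is a brute-force sketch, restricted to acyclic molecules with single bonds only, that orders the branches at every atom, tries every atom as root, and keeps one string according to a fixed rule (fewest branches, then alphabetical order). The function name canonical_tree_smiles and the input format are assumptions made for this example.

```python
def canonical_tree_smiles(atoms, bonds):
    """Return the same string for any atom numbering of an acyclic molecule.

    atoms: list of element symbols; bonds: list of (i, j) index pairs,
    all treated as single bonds. The graph must contain no rings.
    """
    adj = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)

    def emit(atom, parent):
        # Sorting the branch strings removes the dependence on the order
        # in which neighbours happen to be stored.
        branches = sorted(emit(nb, atom) for nb in adj[atom] if nb != parent)
        text = atoms[atom]
        for branch in branches[:-1]:
            text += "(" + branch + ")"
        if branches:
            text += branches[-1]
        return text

    # Trying every root removes the dependence on the starting atom.
    candidates = [emit(root, None) for root in range(len(atoms))]
    return min(candidates, key=lambda s: (s.count("("), s))


# The same molecule (ethanol) with two different atom numberings.
print(canonical_tree_smiles(["C", "C", "O"], [(0, 1), (1, 2)]))
print(canonical_tree_smiles(["O", "C", "C"], [(0, 1), (1, 2)]))
```

Because both numberings yield the same string, two structures can be compared for identity by comparing their canonical strings, which is the use described above.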
Important enhancements to SMILES include extensions to store information on stereochemistry.
External links
- SMILES tutorial, http://www.daylight.com/smiles/smiles-intro.html
- Web-based applications capable of converting SMILES strings to 2D structure images
- http://www.daylight.com/daycgi/depict
- http://cactus.nci.nih.gov/services/gifcreator/ converter with more controls
- Molecule editor applet that can create SMILES, http://www.molinspiration.com/jme/index.html
- SMILES parsing, http://www.dalkescientific.com/writings/diary/archive/
- SMILES conversion freeware, http://www.acdlabs.com/download/chemsk.html
- 3D molecule viewer for SMILES, http://jmol.sourceforge.net/
- Happy Atom (http://diesel.ins.cwi.nl:8080/happyatom/show/HomePage): In this project, we explore the idea of using Normalized Compression Distance with the SSMILES and SMILES molecule languages