Misplaced Pages

Simplified Molecular Input Line Entry System

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license.

The molecular configuration of a molecule is the permanent geometry that results from the spatial arrangement of its bonds. The ability of the same set of atoms to form two or more molecules with different configurations is stereoisomerism. This is distinct from constitutional isomerism, which arises from atoms being connected in a different order. Conformers, which arise from single bond rotations, do not count as distinct molecular configurations unless they are isolable as atropisomers, because the spatial connectivity of their bonds is identical.

41-406: The Simplified Molecular Input Line Entry System (SMILES) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings. SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules. The original SMILES specification was initiated in

82-486: A spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Parentheses are used to indicate points of branching on the tree. The resultant SMILES form depends on the choices: From the viewpoint of formal language theory, SMILES is a word. A SMILES is parsable with a context-free parser. This representation has been used in the prediction of biochemical properties (including toxicity and biodegradability) based on

123-430: A broader category. They usually differ in physical characteristics as well as chemical properties. If two molecules with more than one chiral centre differ in one or more (but not all) centres, they are diastereomers. All stereoisomers that are not enantiomers are diastereomers. Diastereomerism also exists in alkenes. Alkenes are designated Z or E depending on group priority on adjacent carbon atoms. E/Z notation describes

164-443: A graph canonically. There is currently no systematic comparison across commercial software to test if such flaws exist in those packages. SMILES notation allows the specification of configuration at tetrahedral centers , and double bond geometry. These are structural features that cannot be specified by connectivity alone, and therefore SMILES which encode this information are termed isomeric SMILES. A notable feature of these rules

205-481: A line notation for encoding molecular structures and specific instances should strictly be called SMILES strings. However, the term SMILES is also commonly used to refer to both a single SMILES string and a number of SMILES strings; the exact meaning is usually apparent from the context. The terms "canonical" and "isomeric" can lead to some confusion when applied to SMILES. The terms describe different attributes of SMILES strings and are not mutually exclusive. Typically,

246-507: A more legible SMILES than others) to make an acyclic structure and adding numerical ring closure labels to show connectivity between non-adjacent atoms. For example, cyclohexane and dioxane may be written as C1CCCCC1 and O1CCOCC1 respectively. For a second ring, the label will be 2. For example, decalin (decahydronaphthalene) may be written as C1CCCC2C1CCCC2 . SMILES does not require that ring numbers be used in any particular order, and permits ring number zero, although this

287-439: A number of equally valid SMILES strings can be written for a molecule. For example, CCO , OCC and C(O)C all specify the structure of ethanol . Algorithms have been developed to generate the same SMILES string for a given molecule; of the many possible strings, these algorithms choose only one of them. This SMILES is unique for each structure, although dependent on the canonicalization algorithm used to generate it, and

328-614: A single bond must be shown explicitly: c1ccccc1-c2ccccc2 . This is one of the few cases where the single bond symbol - is required. (In fact, most SMILES software can correctly infer that the bond between the two rings cannot be aromatic and so will accept the nonstandard form c1ccccc1c2ccccc2 .) The Daylight and OpenEye algorithms for generating canonical SMILES differ in their treatment of aromaticity. Branches are described with parentheses, as in CCC(=O)O for propionic acid and FC(F)F for fluoroform . The first atom within

369-787: A steroidic 13-ringed pyrazine with the empirical formula C54H74N2O10 isolated from the Indian Ocean hemichordate Cephalodiscus gilchristi: Starting with the left-most methyl group in the figure: Line notation

410-413: Is C1CCCC2CCCCC12 , where the final carbon participates in both ring-closing bonds 1 and 2. If two-digit ring numbers are required, the label is preceded by % , so C%12 is a single ring-closing bond of ring 12. Either or both of the digits may be preceded by a bond type to indicate the type of the ring-closing bond. For example, cyclopropene is usually written C1=CC1 , but if the double bond

451-1146: Is a typographical notation system using ASCII characters, most often used for chemical nomenclature. Chemistry: cell notation for representation of an electrochemical cell; Dyson/IUPAC (1944); Hayward (1961); International Chemical Identifier (InChI); Wiswesser Line Notation (WLN) (1952); Simplified molecular input line entry specification (SMILES); SMILES arbitrary target specification (SMARTS); SYBYL Line Notation (SLN). Mathematics: Mathematical markup language. Music: GUIDO music notation. Chess: Forsyth–Edwards Notation. Molecular configuration: Enantiomers are molecules having one or more chiral centres that are mirror images of each other. Chiral centres are designated R or S. If

492-555: Is a "non-bond", indicated with . , to indicate that two parts are not bonded together. For example, aqueous sodium chloride may be written as [Na+].[Cl-] to show the dissociation. An aromatic "one and a half" bond may be indicated with : ; see § Aromaticity below. Single bonds adjacent to double bonds may be represented using / or \ to indicate stereochemical configuration; see § Stereochemistry below. Ring structures are written by breaking each ring at an arbitrary point (although some choices will lead to

533-470: Is a peculiar but legal alternative way to write propane , more commonly written CCC . Choosing a ring-break point adjacent to attached groups can lead to a simpler SMILES form by avoiding branches. For example, cyclohexane-1,2-diol is most simply written as OC1CCCCC1O ; choosing a different ring-break location produces a branched structure that requires parentheses to write. Aromatic rings such as benzene may be written in one of three forms: In

574-421: Is chosen as the ring-closing bond, it may be written as C=1CC1 , C1CC=1 , or C=1CC=1 . (The first form is preferred.) C=1CC-1 is illegal, as it explicitly specifies conflicting types for the ring-closing bond. Ring-closing bonds may not be used to denote multiple bonds. For example, C1C1 is not a valid alternative to C=C for ethylene . However, they may be used with non-bonds; C1.C2.C12
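The ring-closure rules in this article (digit labels, the % prefix for two-digit labels, label reuse, bond types on either digit, and the ban on conflicting types) can be tried out with a small reader. What follows is only a sketch under stated assumptions: the function name parse_smiles and the token pattern are hypothetical, atoms are kept as raw text, and aromaticity, stereochemistry and most validation are ignored.

```python
import re

# Hypothetical token pattern: two-letter halogens first, then one-letter atoms,
# bracket atoms, ring labels, bond symbols and parentheses.
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]|\[[^\]]+\]|%\d\d|\d|[-=#$:/\\.]|[()]")
BOND_SYMBOLS = set("-=#$:/\\.")


def parse_smiles(smiles):
    """Return (atoms, bonds); each bond is (index, index, symbol or '')."""
    atoms, bonds = [], []
    previous = None      # atom that the next atom or ring label attaches to
    branch_stack = []    # atoms to return to when a ")" is read
    pending = None       # explicit bond symbol waiting for its second atom
    open_rings = {}      # ring label -> (atom index, bond symbol or None)
    position = 0
    while position < len(smiles):
        match = TOKEN.match(smiles, position)
        if match is None:
            raise ValueError(f"unexpected character at {position}")
        token = match.group()
        position = match.end()
        if token == "(":
            branch_stack.append(previous)
        elif token == ")":
            previous = branch_stack.pop()
        elif token in BOND_SYMBOLS:
            pending = token
        elif token[0] == "%" or token.isdigit():
            label = int(token.lstrip("%"))
            if label in open_rings:
                other, symbol = open_rings.pop(label)  # label is free again
                if symbol and pending and symbol != pending:
                    raise ValueError("conflicting ring-closure bond types")
                if any({a, b} == {other, previous} for a, b, _ in bonds):
                    raise ValueError("ring closure duplicates an existing bond")
                bonds.append((other, previous, pending or symbol or ""))
            else:
                open_rings[label] = (previous, pending)
            pending = None
        else:
            atoms.append(token)
            index = len(atoms) - 1
            if previous is not None and pending != ".":
                bonds.append((previous, index, pending or ""))
            previous, pending = index, None
    if open_rings:
        raise ValueError("unclosed ring label")
    return atoms, bonds
```

With this sketch, both decalin spellings give 10 atoms and 11 bonds, C0CCCCC0C0CCCCC0 is accepted because label 0 is reused after closing, and C=1CC-1 and C1C1 are rejected.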

615-420: Is non-chiral. In general, all L designated amino acids are enantiomers of their D counterparts except for isoleucine and threonine which contain two carbon stereocenters, making them diastereomers. Used as drugs, compounds with different configuration normally have different physiological activity, including the desired pharmacological effect, the toxicology and the metabolism. Enantiomeric ratios and purity

656-404: Is rarely used. Also, it is permitted to reuse ring numbers after the first ring has closed, although this usually makes formulae harder to read. For example, bicyclohexyl is usually written as C1CCCCC1C2CCCCC2 , but it may also be written as C0CCCCC0C0CCCCC0 . Multiple digits after a single atom indicate multiple ring-closing bonds. For example, an alternative SMILES notation for decalin

697-466: Is specified by @ or @@ . Consider the four bonds in the order in which they appear, left to right, in the SMILES form. Looking toward the central carbon from the perspective of the first bond, the other three are either clockwise or counter-clockwise. These cases are indicated with @@ and @ , respectively (because the @ symbol itself is a counter-clockwise spiral). For example, consider

738-430: Is termed the canonical SMILES. These algorithms first convert the SMILES to an internal representation of the molecular structure; an algorithm then examines that structure and produces a unique SMILES string. Various algorithms for generating canonical SMILES have been developed and include those by Daylight Chemical Information Systems, OpenEye Scientific Software , MEDIT , Chemical Computing Group , MolSoft LLC, and

779-429: Is that they allow rigorous partial specification of chirality. The term isomeric SMILES is also applied to SMILES in which isomers are specified. In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph . The chemical graph is first trimmed to remove hydrogen atoms and cycles are broken to turn it into
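The graph-based procedure described here (depth-first traversal, cycles broken into ring-closure labels, parentheses at branch points) can be sketched as follows. This is an illustrative sketch, not the algorithm of any particular toolkit: the name write_smiles is hypothetical, all bonds are assumed single, hydrogens are assumed already removed, and the output is not canonical.

```python
def write_smiles(atoms, bonds, start=0):
    """Write one SMILES string for a connected graph with single bonds only.

    atoms: list of element symbols; bonds: iterable of (i, j) index pairs.
    """
    adjacency = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    children = {i: [] for i in adjacency}     # spanning-tree edges
    ring_labels = {i: [] for i in adjacency}  # ring-closure labels per atom
    visited, seen_edges = set(), set()
    next_label = 1

    def explore(atom):
        # Pass 1: depth-first search that separates tree edges from
        # the edges that would close a cycle.
        nonlocal next_label
        visited.add(atom)
        for neighbour in adjacency[atom]:
            edge = frozenset((atom, neighbour))
            if edge in seen_edges:
                continue
            seen_edges.add(edge)
            if neighbour in visited:
                ring_labels[atom].append(next_label)
                ring_labels[neighbour].append(next_label)
                next_label += 1
            else:
                children[atom].append(neighbour)
                explore(neighbour)

    def label_text(number):
        # Two-digit labels take the % prefix (labels above 99 are not handled).
        return str(number) if number < 10 else f"%{number}"

    def emit(atom):
        # Pass 2: print the tree; every child except the last is a branch.
        text = atoms[atom] + "".join(label_text(n) for n in ring_labels[atom])
        kids = children[atom]
        for kid in kids[:-1]:
            text += "(" + emit(kid) + ")"
        if kids:
            text += emit(kids[-1])
        return text

    explore(start)
    return emit(start)


ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
print(write_smiles(*ethanol, start=0))  # CCO
print(write_smiles(*ethanol, start=1))  # C(C)O
print(write_smiles(*ethanol, start=2))  # OCC
```

Changing the start atom changes the string but not the molecule, which is the reason a canonicalization step is needed before SMILES strings are compared.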

820-481: Is usually omitted. For example, the SMILES for ethanol may be written as C-C-O , CC-O or C-CO , but is usually written CCO . Double, triple, and quadruple bonds are represented by the symbols = , # , and $ respectively, as illustrated by the SMILES O=C=O (carbon dioxide, CO2), C#N (hydrogen cyanide, HCN) and [Ga+]$[As-] (gallium arsenide). An additional type of bond

861-438: The @ symbol to indicate stereochemistry around more complex chiral centers, such as trigonal bipyramidal molecular geometry. Isotopes are specified with a number equal to the integer isotopic mass preceding the atomic symbol. Benzene in which one atom is carbon-14 is written as [14c]1ccccc1 and deuterochloroform is [2H]C(Cl)(Cl)Cl . To illustrate a molecule with more than 9 rings, consider cephalostatin-1,

902-713: The Blue Obelisk open-source chemistry community. Other 'linear' notations include the Wiswesser Line Notation (WLN), ROSDAL and SLN (Tripos Inc). In July 2006, the IUPAC introduced the InChI as a standard for formula representation. SMILES is generally considered to have the advantage of being more human-readable than InChI; it also has a wide base of software support with extensive theoretical backing (such as graph theory ). The term SMILES refers to

943-558: The Chemistry Development Kit . A common application of canonical SMILES is indexing and ensuring uniqueness of molecules in a database . The original paper that described the CANGEN algorithm claimed to generate unique SMILES strings for graphs representing molecules, but the algorithm fails for a number of simple cases (e.g. cuneane , 1,2-dicyclopropylethane) and cannot be considered a correct method for representing

984-427: The amino acid alanine. One of its SMILES forms is NC(C)C(=O)O , more fully written as N[CH](C)C(=O)O . L-Alanine, the more common enantiomer, is written as N[C@@H](C)C(=O)O . Looking from the nitrogen–carbon bond, the hydrogen ( H ), methyl ( C ), and carboxylate ( C(=O)O ) groups appear clockwise. D-Alanine can be written as N[C@H](C)C(=O)O . While

1025-455: The hydroxide anion (OH−) is represented by [OH-] , the hydronium cation (H3O+) is [OH3+] and the cobalt(III) cation (Co3+) is either [Co+3] or [Co+++] . A bond is represented using one of the symbols . - = # $ : / \ . Bonds between aliphatic atoms are assumed to be single unless specified otherwise and are implied by adjacency in the SMILES string. Although single bonds may be written as - , this

1066-683: The 1980s. It has since been modified and extended. In 2007, an open standard called OpenSMILES was developed in the open source chemistry community. The original SMILES specification was initiated by David Weininger at the USEPA Mid-Continent Ecology Division Laboratory in Duluth in the 1980s. Acknowledged for their parts in the early development were "Gilman Veith and Rose Russo (USEPA) and Albert Leo and Corwin Hansch ( Pomona College ) for supporting

1107-411: The 3 and 4-cyanoanisole isomers. Writing SMILES for substituted rings in this way can make them more human-readable. Branches may be written in any order. For example, bromochlorodifluoromethane may be written as FC(Br)(Cl)F , BrC(F)(F)Cl , C(F)(Cl)(F)Br , or the like. Generally, a SMILES form is easiest to read if the simpler branch comes first, with the final, unparenthesized portion being

1148-424: The 3 groups projecting towards you are arranged clockwise from highest priority to lowest priority, that centre is designated R. If counterclockwise, the centre is S. Priority is based on atomic number: atoms with higher atomic number are higher priority. If two molecules with one or more chiral centres differ in all of those centres, they are enantiomers. Diastereomers are distinct molecular configurations that are

1189-457: The absolute stereochemistry of the double bond. Cis/trans notation is also used to describe the relative orientations of groups. Amino acids are designated either L or D depending on relative group arrangements around the stereogenic carbon center. L/D designations are not related to S/R absolute configurations. Proteins in biological organisms are built almost exclusively from L-configured amino acids. All L-amino acids except for L-cysteine have an S configuration and glycine

1230-428: The common case of atoms which: All other elements must be enclosed in brackets, and have charges and hydrogens shown explicitly. For instance, the SMILES for water may be written as either O or [OH2] . Hydrogen may also be written as a separate atom; water may also be written as [H]O[H] . When brackets are used, the symbol H is added if the atom in brackets is bonded to one or more hydrogen, followed by
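The reason O and [OH2] both describe water is that an atom written without brackets has its hydrogens filled in from its usual valence. A minimal sketch of that rule follows. The valence table is an assumption taken from the commonly documented "organic subset" convention, not from the text of this snapshot, and the name implicit_hydrogens is hypothetical.

```python
# Assumed normal valences for atoms that may be written without brackets.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(symbol, bond_order_sum):
    """Hydrogens implied for an unbracketed atom with the given bond orders."""
    for valence in NORMAL_VALENCES[symbol]:
        # Use the lowest normal valence that accommodates the explicit bonds.
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # explicit bonds already exceed every normal valence


print(implicit_hydrogens("O", 0))  # 2, so "O" alone is water
print(implicit_hydrogens("C", 1))  # 3, a terminal carbon as in CCO
```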

1271-411: The first of the four bonds appears to the left of the carbon atom, but if the SMILES is written beginning with the chiral carbon, such as C(C)(N)C(=O)O , then all four are to the right, but the first to appear (the [CH] bond in this case) is used as the reference to order the following three: L -alanine may also be written [C@@H](C)(N)C(=O)O . The SMILES specification includes elaborations on

1312-449: The fluorine atoms are on opposite sides of the double bond (as shown in the figure), whereas F/C=C\F is one possible representation of cis-1,2-difluoroethylene, in which the fluorines are on the same side of the double bond. Bond direction symbols always come in groups of at least two, of which the first is arbitrary. That is, F\C=C\F is the same as F/C=C/F . When alternating single-double bonds are present,
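For the simplest case, one substituent written on each side of a C=C bond, the rule reduces to comparing the two direction symbols: equal symbols mean trans and different symbols mean cis. The sketch below covers only that linear X/C=C/Y shape; the function name and pattern are hypothetical, and conjugated systems or branched spellings would need a real parser.

```python
import re

# Matches only the linear form  X <dir> C=C <dir> Y  with plain atom symbols.
SIMPLE_ALKENE = re.compile(r"^[A-Za-z]+([/\\])C=C([/\\])[A-Za-z]+$")


def simple_double_bond_geometry(smiles):
    """Return 'trans' or 'cis' for strings such as F/C=C/F (sketch only)."""
    match = SIMPLE_ALKENE.match(smiles)
    if match is None:
        raise ValueError("only the simple X/C=C/Y form is handled")
    first, second = match.groups()
    # The first symbol is arbitrary; only agreement with the second matters.
    return "trans" if first == second else "cis"


print(simple_double_bond_geometry(r"F/C=C/F"))   # trans
print(simple_double_bond_geometry(r"F\C=C\F"))   # trans
print(simple_double_bond_geometry(r"F/C=C\F"))   # cis
```

Raw string literals are used because the backslash is an escape character in ordinary Python strings.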

1353-436: The groups are larger than two, with the middle directional symbols being adjacent to two double bonds. For example, the common form of (2,4)-hexadiene is written C/C=C/C=C/C . As a more complex example, beta-carotene has a very long backbone of alternating single and double bonds, which may be written CC1CCC/C(C)=C1/C=C/C(C)=C/C=C/C(C)=C/C=C/C=C(C)/C=C/C=C(C)/C=C/C2=C(C)/CCCC2(C)C . Configuration at tetrahedral carbon

1394-479: The latter case, bonds between two aromatic atoms are assumed (if not explicitly shown) to be aromatic bonds. Thus, benzene, pyridine and furan can be represented respectively by the SMILES c1ccccc1 , n1ccccc1 and o1cccc1 . Aromatic nitrogen bonded to hydrogen, as found in pyrrole, must be represented as [nH] ; thus imidazole is written in SMILES notation as n1c[nH]cc1 . When aromatic atoms are singly bonded to each other, such as in biphenyl,

1435-441: The main principle of chemoinformatics that similar molecules have similar properties. The predictive models implemented a syntactic pattern recognition approach (which involved defining a molecular distance) as well as a more robust scheme based on statistical pattern recognition. Atoms are represented by the standard abbreviation of the chemical elements , in square brackets, such as [Au] for gold . Brackets may be omitted in

1476-449: The most complex. The only caveats to such rearrangements are: The one form of branch which does not require parentheses are ring-closing bonds: the SMILES fragment C1N is equivalent to C(1)N , both denoting a bond between the C and the N . Choosing ring-closing bonds adjacent to branch points can reduce the number of parentheses required. For example, toluene is normally written as Cc1ccccc1 or c1ccccc1C , avoiding

1517-416: The number of hydrogen atoms if greater than 1, then by the sign + for a positive charge or by - for a negative charge. For example, [NH4+] for ammonium (NH4+). If there is more than one charge, it is normally written as a digit; however, it is also possible to repeat the sign as many times as the ion has charges: one may write either [Ti+4] or [Ti++++] for titanium(IV), Ti4+. Thus,
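The layout of a bracket atom described here (optional isotope, element symbol, optional chirality mark, optional hydrogen count, optional charge) can be captured with one regular expression. The pattern and the name parse_bracket_atom are illustrative assumptions; the sketch does not check that the symbol is a real element and ignores features not mentioned in this article.

```python
import re

BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|[a-z]{1,2})"   # aliphatic or aromatic symbol
    r"(?P<chirality>@@?)?"
    r"(?P<hydrogens>H\d*)?"
    r"(?P<charge>[+-]\d+|\++|-+)?\]"
)


def charge_value(text):
    """Convert '+', '-', '+3', '++++' and similar into a signed integer."""
    if not text:
        return 0
    sign = 1 if text[0] == "+" else -1
    digits = text.lstrip("+-")
    # A digit gives the magnitude; otherwise count the repeated signs.
    return sign * (int(digits) if digits else len(text))


def parse_bracket_atom(text):
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"not a bracket atom this sketch understands: {text}")
    hydrogens = match["hydrogens"]
    return {
        "isotope": int(match["isotope"]) if match["isotope"] else None,
        "symbol": match["symbol"],
        "chirality": match["chirality"],
        # A bare H means one hydrogen; H4 means four.
        "hydrogens": 0 if not hydrogens else int(hydrogens[1:] or 1),
        "charge": charge_value(match["charge"]),
    }


print(parse_bracket_atom("[NH4+]"))
print(parse_bracket_atom("[Ti+4]") == parse_bracket_atom("[Ti++++]"))  # True
```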

1558-459: The order in which branches are specified in SMILES is normally unimportant, in this case it matters; swapping any two groups requires reversing the chirality indicator. If the branches are reversed so alanine is written as NC(C(=O)O)C , then the configuration also reverses; L-alanine is written as N[C@H](C(=O)O)C . Other ways of writing it include C[C@H](N)C(=O)O , OC(=O)[C@@H](N)C and OC(=O)[C@H](C)N . Normally,
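The rule that swapping any two groups reverses the chirality indicator is a statement about permutation parity: an odd reordering of the four neighbours flips @ and @@, and an even one leaves the mark unchanged. The sketch below checks the alanine spellings from this article against that rule. The function names and the neighbour labels are hypothetical, and the neighbour order for each string is entered by hand rather than parsed.

```python
def permutation_is_odd(reference, reordered):
    """True if `reordered` is an odd permutation of `reference`."""
    position = {name: i for i, name in enumerate(reference)}
    sequence = [position[name] for name in reordered]
    inversions = sum(
        1
        for i in range(len(sequence))
        for j in range(i + 1, len(sequence))
        if sequence[i] > sequence[j]
    )
    return inversions % 2 == 1


def chirality_mark(reference_order, reference_mark, new_order):
    """Mark needed when the same centre is written with `new_order`."""
    flipped = {"@": "@@", "@@": "@"}
    if permutation_is_odd(reference_order, new_order):
        return flipped[reference_mark]
    return reference_mark


# L-alanine as N[C@@H](C)C(=O)O : neighbours appear as N, H, CH3, COOH.
reference = ("N", "H", "CH3", "COOH")

# N[C@H](C(=O)O)C swaps the two branches, so the mark flips.
print(chirality_mark(reference, "@@", ("N", "H", "COOH", "CH3")))   # @
# OC(=O)[C@@H](N)C is an even rearrangement, so the mark is kept.
print(chirality_mark(reference, "@@", ("COOH", "H", "N", "CH3")))   # @@
```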

1599-405: The parentheses required if written as c1cc(C)ccc1 or c1cc(ccc1)C . SMILES permits, but does not require, specification of stereoisomers. Configuration around double bonds is specified using the characters / and \ to show directional single bonds adjacent to a double bond. For example, F/C=C/F is one representation of trans-1,2-difluoroethylene, in which

1640-476: The parentheses, and the first atom after the parenthesized group, are both bonded to the same branch point atom. The bond symbol must appear inside the parentheses; placing it outside (e.g. CCC=(O)O ) is invalid. Substituted rings can be written with the branching point in the ring, as illustrated by the SMILES COc(c1)cccc1C#N and COc(cc1)ccc1C#N , which encode

1681-410: The work, and Arthur Weininger (Pomona; Daylight CIS) and Jeremy Scofield (Cedar River Software, Renton, WA) for assistance in programming the system." The Environmental Protection Agency funded the initial project to develop SMILES. It has since been modified and extended by others, most notably by Daylight Chemical Information Systems . In 2007, an open standard called "OpenSMILES" was developed by
