The International Chemical Identifier (InChI, pronounced /ˈɪntʃiː/ IN-chee) is a textual identifier for chemical substances, designed to provide a standard way to encode molecular information and to facilitate the search for such information in databases and on the web. Initially developed by the International Union of Pure and Applied Chemistry (IUPAC) and the National Institute of Standards and Technology (NIST) from 2000 to 2005, the format and algorithms are non-proprietary. Since May 2009, it has been developed by the InChI Trust, a nonprofit charity from the United Kingdom which works to implement and promote the use of InChI.
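As a rough illustration of the textual format, the following Python sketch splits an InChI into its version prefix, its chemical formula, and the "/"-delimited layers that follow. The function name `split_inchi` is hypothetical and not part of the official InChI software. The sketch assumes the formula sublayer is present and performs no chemical validation. The example string is the standard InChI for morphine quoted in this article.

```python
MORPHINE = (
    "InChI=1S/C17H19NO3/c1-18-7-6-17-10-3-5-13(20)16(17)21-15-12(19)"
    "4-2-9(14(15)17)8-11(10)18/h2-5,10-11,13,16,19-20H,6-8H2,1H3"
    "/t10-,11+,13-,16-,17-/m0/s1"
)


def split_inchi(inchi):
    """Split an InChI string into version, formula, and prefixed layers.

    Hypothetical helper: assumes the formula sublayer is present.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("an InChI must start with 'InChI='")
    parts = inchi[len(prefix):].split("/")
    version = parts[0]  # e.g. "1S": version 1, standard flavor
    layers = []
    for segment in parts[2:]:
        if segment:
            # The first character is the layer prefix letter,
            # the rest is that layer's content.
            layers.append((segment[0], segment[1:]))
    return {
        "version": version.rstrip("S"),
        "standard": version.endswith("S"),
        "formula": parts[1] if len(parts) > 1 else "",
        # A list rather than a dict, since a prefix letter may repeat
        # (for example inside the isotopic layer).
        "layers": layers,
    }


info = split_inchi(MORPHINE)
print(info["formula"], [p for p, _ in info["layers"]])
```

Because each layer carries its own prefix letter, a caller can compare two identifiers on selected layers only, for example connectivity (`c`) without stereochemistry (`t`, `m`, `s`).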
The identifiers describe chemical substances in terms of layers of information: the atoms and their bond connectivity, tautomeric information, isotope information, stereochemistry, and electronic charge information. Not all layers have to be provided; for instance, the tautomer layer can be omitted if that type of information is not relevant to the particular application. The InChI algorithm converts input structural information into
A wildcard character is a kind of placeholder represented by a single character, such as an asterisk (*), which can be interpreted as a number of literal characters or an empty string. It is often used in file searches so the full name need not be typed. In telecommunications, a wildcard is a character that may be substituted for any of a defined subset of all possible characters. In computer (software) technology,
A change in molecular geometry and should not be confused with canonical resonance structures or mesomers. In inorganic extended solids, valence tautomerism can manifest itself in a change of oxidation states and their spatial distribution upon a change of macroscopic thermodynamic conditions. Such effects have been called charge ordering or valence mixing to describe the behavior in inorganic oxides. The existence of multiple possible tautomers for individual chemical substances can lead to confusion. For example, samples of 2-pyridone and 2-hydroxypyridine do not exist as separate isolatable materials:
A date stamp, wildcards can be used to match date ranges, such as 202411*.mp4 to select video recordings from November 2024, to facilitate file operations such as copying and moving. In Unix-like and DOS operating systems, the question mark ? matches exactly one character. In DOS, if the question mark is placed at the end of the word, it will also match missing (zero) trailing characters; for example,
A leading caret ^ negates the set and matches only a character not within the list. In Microsoft Access, the asterisk sign * matches zero or more characters, the question mark ? matches a single character, the number sign # matches a single digit (0–9), and square brackets can be used for sets or ranges of characters to match. In regular expressions, the period (., also called "dot")
A nonstandard reconnected /r layer can be added, which effectively gives a new InChI generated without breaking bonds to metal atoms. This may contain various sublayers, including /f. Every InChI starts with the string "InChI=" followed by the version number, currently 1. If the InChI is standard, this is followed by the letter S for standard InChIs, which is a fully standardized InChI flavor maintaining
A resolver until July 2015 when it was decommissioned. The format was originally called IChI (IUPAC Chemical Identifier), then renamed in July 2004 to INChI (IUPAC-NIST Chemical Identifier), and renamed again in November 2004 to InChI (IUPAC International Chemical Identifier), a trademark of IUPAC. Scientific direction of the InChI standard is carried out by the IUPAC Division VIII Subcommittee, and funding of subgroups investigating and defining
A single chemical species, whose true structure is a quantum superposition, essentially the "average" of the idealized, hypothetical geometries implied by these resonance forms. Tautomerization is pervasive in organic chemistry. It is typically associated with polar molecules and ions containing functional groups that are at least weakly acidic. Most common tautomers exist in pairs, which means that
A unique InChI identifier in a three-step process: normalization (to remove redundant information), canonicalization (to generate a unique number label for each atom), and serialization (to give a string of characters). InChIs differ from the widely used CAS registry numbers in three respects: firstly, they are freely usable and non-proprietary; secondly, they can be computed from structural information and do not have to be assigned by some organization; and thirdly, most of
A wildcard is a symbol used to replace or represent zero or more characters. Algorithms for matching wildcards have been developed in a number of recursive and non-recursive varieties. When specifying file names (or paths) in CP/M, DOS, Microsoft Windows, and Unix-like operating systems, the asterisk character (*, also called "star") matches zero or more characters. For example, doc* matches doc and document but not dodo. If files are named with
Is normalized to reduce it to its so-called core parent structure. This may involve changing bond orders, rearranging formal charges and possibly adding and removing protons. Different input structures may give the same result; for example, acetic acid and acetate would both give the same core parent structure, that of acetic acid. A core parent structure may be disconnected, consisting of more than one component, in which case
Is not represented in InChI; for this purpose a format such as PDB can be used. The InChIKey, sometimes referred to as a hashed InChI, is a fixed-length (27-character) condensed digital representation of the InChI that is not human-understandable. The InChIKey specification was released in September 2007 in order to facilitate web searches for chemical compounds, since these were problematic with
Is now a "replaced" registry number so that look-up by either identifier reaches the same entry. The facility to automatically recognise such potential tautomerism and ensure that all tautomers are indexed together has been greatly facilitated by the creation of the International Chemical Identifier (InChI) and associated software. Thus the standard InChI for either tautomer is InChI=1S/C5H5NO/c7-5-3-1-2-4-6-5/h1-4H,(H,6,7). Wildcard character In software,
Is then the hashed version of the standard InChI string. The standard InChI will simplify comparison of InChI strings and keys generated by different groups, and subsequently accessed via diverse sources such as databases and web resources. The continuing development of the standard has been supported since 2010 by the not-for-profit InChI Trust, of which IUPAC is a member. Version 1.06 was released in December 2020. Prior to 1.04,
The /p sublayer of the charge layer (N for no protonation, O, P, ... if protons should be added and M, L, ... if they should be removed). Morphine has the structure shown on the right. The standard InChI for morphine is InChI=1S/C17H19NO3/c1-18-7-6-17-10-3-5-13(20)16(17)21-15-12(19)4-2-9(14(15)17)8-11(10)18/h2-5,10-11,13,16,19-20H,6-8H2,1H3/t10-,11+,13-,16-,17-/m0/s1 and
The InChI, InChIKey and other identifiers. The release history of this software follows. The InChI has been adopted by many larger and smaller databases, including ChemSpider, ChEMBL, Golm Metabolome Database, and PubChem. However, the adoption is not straightforward, and many databases show a discrepancy between the chemical structures and the InChI they contain, which is a problem for linking databases. Tautomer In chemistry, tautomers (/ˈtɔːtəmər/) are structural isomers (constitutional isomers) of chemical compounds that readily interconvert. The chemical reaction interconverting
The InChI. The second part consists of 8 characters resulting from a hash of the remaining layers of the InChI, a single character indicating the kind of InChIKey (S for standard and N for nonstandard), and a character indicating the version of InChI used (currently A for version 1). Finally, the single character at the end indicates the protonation of the core parent structure, corresponding to
The InChIKey was developed. There is a very small but nonzero chance of two different molecules having the same InChIKey, but the probability for duplication of only the first 14 characters has been estimated as only one duplication in 75 databases each containing one billion unique structures. With all databases currently having below 50 million structures, such duplication appears unlikely at present. A recent study more extensively studies
The advantage that a user can easily use a wildcard search to find identifiers that match only in certain layers. The condensed, 27-character InChIKey is a hashed version of the full InChI (using the SHA-256 algorithm), designed to allow for easy web searches of chemical compounds. The standard InChIKey is the hashed counterpart of standard InChI. Most chemical structures on the Web up to 2007 were represented as GIF files, which are not searchable for chemical content. The full InChI turned out to be too lengthy for easy searching, and therefore
The charge layer gives its charge, and the /p portion of the charge layer tells how many protons (hydrogen ions) must be added to or removed from it to regenerate the original structure. If present, the stereochemical layer, with sublayers /b, /t, /m and /s, gives stereochemical information, and the isotopic layer /i (which may contain sublayers /h, /b, /t, /m and /s) gives isotopic information. These are
The collision rate finding that the experimental collision rate is in agreement with the theoretical expectations. The InChIKey currently consists of three parts separated by hyphens, of 14, 10 and one character(s), respectively, like XXXXXXXXXXXXXX-YYYYYYYYFV-P. The first 14 characters result from a SHA-256 hash of the connectivity information (the main layer and /q sublayer of the charge layer) of
The expansion of the standard is carried out by both IUPAC and the InChI Trust. The InChI Trust funds the development, testing and documentation of the InChI. Current extensions are being defined to handle polymers and mixtures, Markush structures, reactions and organometallics, and once accepted by the Division VIII Subcommittee will be added to the algorithm. The InChI Trust has developed software to generate
The full-length InChI. Unlike the InChI, the InChIKey is not unique: though collisions are expected to be extremely rare, there are known collisions. In January 2009 the 1.02 version of the InChI software was released. This provided a means to generate so-called standard InChI, which does not allow for user-selectable options in dealing with the stereochemistry and tautomeric layers of the InChI string. The standard InChIKey
The hydrogen is located at one of two positions, and even more specifically the most common form involves a hydrogen changing places with a double bond: H−X−Y=Z ⇌ X=Y−Z−H. Common tautomeric pairs include: Prototropy is the most common form of tautomerism and refers to the relocation of a hydrogen atom. Prototropic tautomerism may be considered a subset of acid-base behavior. Prototropic tautomers are sets of isomeric protonation states with
The information in an InChI is human readable (with practice). InChIs can thus be seen as akin to a general and extremely formalized version of IUPAC names. They can express more information than the simpler SMILES notation and, in contrast to SMILES strings, every structure has a unique InChI string, which is important in database applications. Information about the 3-dimensional coordinates of atoms
The list. In shells that interpret ! as a history substitution, a leading caret ^ can be used instead. The operation of matching of wildcard patterns to multiple file or path names is referred to as globbing. In SQL, wildcard characters can be used in LIKE expressions; the percent sign % matches zero or more characters, and underscore _ a single character. Transact-SQL also supports square brackets ([ and ]) to list sets and ranges of characters to match,
The only layers which can occur in a standard InChI. If the user wants to specify an exact tautomer, a fixed hydrogen layer /f can be appended, which may contain various additional sublayers; this cannot be done in standard InChI though, so different tautomers will have the same standard InChI (for example, alanine will give the same standard InChI whether input in a neutral or a zwitterionic form). Finally,
The pattern 123? will match 123 and 1234, but not 12345. In Unix shells and Windows PowerShell, ranges of characters enclosed in square brackets ([ and ]) match a single character within the set; for example, [A-Za-z] matches any single uppercase or lowercase letter. In Unix shells, a leading exclamation mark ! negates the set and matches only a character not within
The same empirical formula and total charge. Tautomerizations are catalyzed by: Two specific further subcategories of tautomerizations: Valence tautomerism is a type of tautomerism in which single and/or double bonds are rapidly formed and ruptured, without migration of atoms or groups. It is distinct from prototropic tautomerism, and involves processes with rapid reorganisation of bonding electrons. A pair of valence tautomers with formula C6H6O are benzene oxide and oxepin. Other examples of this type of tautomerism can be found in bullvalene, and in open and closed forms of certain heterocycles, such as organic azides and tetrazoles, or mesoionic münchnone and acylamino ketene. Valence tautomerism requires
The same level of attention to structure details and the same conventions for drawing perception. The remaining information is structured as a sequence of layers and sub-layers, with each layer providing one specific type of information. The layers and sub-layers are separated by the delimiter "/" and start with a characteristic prefix letter (except for the chemical formula sub-layer of the main layer). The six layers with important sublayers are: The delimiter-prefix format has
The software was freely available under the open-source LGPL license. Versions 1.05 and 1.06 used a custom license called IUPAC-InChI Trust License. The current software version is 1.07.1 (August 2024), uses the MIT license, and may be downloaded from the InChI GitHub site. In order to avoid generating different InChIs for tautomeric structures, before generating the InChI, an input chemical structure
The standard InChIKey for morphine is BQJCRHHNABKAKU-KBQPJGBKSA-N. As the InChI cannot be reconstructed from the InChIKey, an InChIKey always needs to be linked to the original InChI to get back to the original structure. InChI Resolvers act as a lookup service to make these links, and prototype services are available from National Cancer Institute, the UniChem service at the European Bioinformatics Institute, and PubChem. ChemSpider has had
The sublayers in the InChI usually consist of sublayers for each component, separated by semicolons (periods for the chemical formula sublayer). One way this can happen is that all metal atoms are disconnected during normalization; so, for example, the InChI for tetraethyllead will have five components, one for lead and four for the ethyl groups. The first, main, layer of the InChI refers to this core parent structure, giving its chemical formula, non-hydrogen connectivity without bond order (/c sublayer) and hydrogen connectivity (/h sublayer). The /q portion of
The two is called tautomerization. This conversion commonly results from the relocation of a hydrogen atom within the compound. The phenomenon of tautomerization is called tautomerism, also called desmotropism. Tautomerism is for example relevant to the behavior of amino acids and nucleic acids, two of the fundamental building blocks of life. The term is derived from Ancient Greek ταὐτό (tautó) 'the same' and μέρος (méros) 'part'. Care should be taken not to confuse tautomers with depictions of "contributing structures" in chemical resonance. Tautomers are distinct chemical species that can be distinguished by their differing atomic connectivities, molecular geometries, and physicochemical and spectroscopic properties, whereas resonance forms are merely alternative Lewis structure (valence bond theory) depictions of
The two tautomeric forms are interconvertible and the proportion of each depends on factors such as temperature, solvent, and additional substituents attached to the main ring. Historically, each form of the substance was entered into databases such as those maintained by the Chemical Abstracts Service and given separate CAS Registry Numbers. 2-Pyridone was assigned [142-08-5] and 2-hydroxypyridine [109-10-4]. The latter
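The InChIKey layout described in this article (14 characters, a hyphen, 8 hash characters plus a standard/nonstandard flag and a version character, a hyphen, and one protonation character) can be checked with a short Python sketch. The names `parse_inchikey` and `_INCHIKEY_RE` are hypothetical and not part of the official InChI software. The sketch only validates the shape of the key, assuming all positions hold uppercase letters; it cannot recompute the underlying SHA-256 hashes or recover the InChI.

```python
import re

# Shape: 14 letters, '-', 8 letters + flag (S or N) + version letter, '-', 1 letter.
_INCHIKEY_RE = re.compile(r"([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])")


def parse_inchikey(key):
    """Split an InChIKey into the parts described in the text.

    Hypothetical helper: validates the format only, not the hash values.
    """
    match = _INCHIKEY_RE.fullmatch(key)
    if match is None:
        raise ValueError("not a well-formed InChIKey: %r" % key)
    connectivity, remaining, flag, version, protonation = match.groups()
    return {
        "connectivity_hash": connectivity,  # main layer and /q sublayer
        "layers_hash": remaining,           # remaining layers
        "standard": flag == "S",            # S = standard, N = nonstandard
        "version_char": version,            # currently "A" for version 1
        "protonation": protonation,         # "N" means no protonation
    }


parts = parse_inchikey("BQJCRHHNABKAKU-KBQPJGBKSA-N")
print(parts["connectivity_hash"], parts["standard"], parts["protonation"])
```

Two keys that share the same first 14 characters describe the same connectivity, so comparing `connectivity_hash` values is one way to group records that differ only in layers such as stereochemistry.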