The Gene Ontology (GO) is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. More specifically, the project aims to: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO, for example via enrichment analysis. GO is part of a larger classification effort, the Open Biomedical Ontologies, being one of the Initial Candidate Members of the OBO Foundry.
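As a rough illustration of what enrichment analysis computes, the sketch below applies a one-sided hypergeometric test to a single GO term. This is one common way to score enrichment, not necessarily the method used by any particular GO tool. The function name, parameter names, and toy numbers are assumptions made for illustration only.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """Probability of seeing at least `hits` annotated genes in a random sample.

    population: number of genes in the background set
    annotated:  background genes annotated to the GO term of interest
    sample:     number of genes in the study list (e.g. differentially expressed)
    hits:       study-list genes annotated to the term
    """
    if not (0 <= annotated <= population and 0 <= sample <= population):
        raise ValueError("annotated and sample must lie between 0 and population")
    if not (0 <= hits <= min(annotated, sample)):
        raise ValueError("hits cannot exceed annotated or sample")

    total = comb(population, sample)  # all possible study lists of this size
    upper = min(annotated, sample)    # largest possible number of hits
    # Sum the hypergeometric probabilities for hits, hits + 1, ..., upper.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical toy case: 20 background genes, 5 annotated to the term,
    # a study list of 4 genes, 3 of which carry the annotation.
    print(enrichment_p_value(20, 5, 4, 3))  # (10*15 + 5*1) / 4845
```

A small p-value suggests the term is over-represented in the study list. In practice many terms are tested at once, so a multiple-testing correction would normally follow; that step is omitted here.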
Whereas gene nomenclature focuses on genes and gene products, the Gene Ontology focuses on the function of the genes and gene products. The GO also extends the effort by using a markup language to make the data (not only of the genes and their products but also of curated attributes) machine readable, and to do so in a way that is unified across all species (whereas gene nomenclature conventions vary by biological taxon). The Gene Ontology
A gloss) and the functional requirement (helping the reader to know what the symbol refers to). The same guideline applies to shorthand names for sequence variations; AMA says, "In general medical publications, textual explanations should accompany the shorthand terms at first mention." Thus "188del11" is glossed as "an 11-bp deletion at nucleotide 188." This corollary rule (which forms an adjunct to
A human, the GO Consortium considers them to be marginally less reliable, and they are commonly made to higher-level, less detailed terms. Full annotation data sets can be downloaded from the GO website. To support the development of annotation, the GO Consortium provides workshops and mentors new groups of curators and developers. Many machine learning algorithms have been designed and implemented to predict Gene Ontology annotations. Data source: There are
A known general function: In a 1998 analysis of the E. coli genome, a large number of genes with unknown function were assigned names beginning with the letter y, followed by sequentially generated letters without a mnemonic meaning (e.g., ydiO and ydbK). Since being designated, some y-genes have been confirmed to have a function, and assigned a synonym (alternative) name in recognition of this. However, as y-genes are not always re-named after being further characterised, this designation
A large number of tools available, both online and for download, that use the data provided by the GO project. The vast majority of these come from third parties; the GO Consortium develops and supports two tools, AmiGO and OBO-Edit. AmiGO is a web-based application that allows users to query, browse, and visualize ontologies and gene product annotation data. It also has a BLAST tool, tools allowing analysis of larger data sets, and an interface to query
A particular gene family may work together to revise the nomenclature for the entire set of genes when new information becomes available. For many genes and their corresponding proteins, an assortment of alternate names is in use across the scientific literature and public biological databases, posing a challenge to effective organization and exchange of biological information. Standardization of nomenclature thus tries to achieve
A person including, but not limited to, name, honorific prefix, affiliation, email address, and homepage, or the Person vocabulary of Schema.org. Similarly, a book can be described using the Book vocabulary of Schema.org and general publication terms from the Dublin Core vocabulary, an event with the Event vocabulary of Schema.org, and so on. To use machine-readable terms from any controlled vocabulary, web designers can choose from
A public library. In large organizations, controlled vocabularies may be introduced to improve technical communication. The use of controlled vocabulary ensures that everyone is using the same word to mean the same thing. This consistency of terms is one of the most important concepts in technical writing and knowledge management, where effort is expended to use the same word throughout a document or organization instead of slightly different ones to refer to
A way to organize knowledge for subsequent retrieval. They are used in subject indexing schemes, subject headings, thesauri, taxonomies and other knowledge organization systems. Controlled vocabulary schemes mandate the use of predefined, preferred terms that have been preselected by the designers of the schemes, in contrast to natural language vocabularies, which have no such restriction. In library and information science, controlled vocabulary
Is a carefully selected list of words and phrases, which are used to tag units of information (document or work) so that they may be more easily retrieved by a search. Controlled vocabularies solve the problems of homographs, synonyms and polysemes by a bijection between concepts and preferred terms. In short, controlled vocabularies reduce unwanted ambiguity inherent in normal human languages where
Is designed on faceted classification principles. Controlled vocabularies of the Semantic Web define the concepts and relationships (terms) used to describe a field of interest or area of concern. For instance, to declare a person in a machine-readable format, a vocabulary is needed that has the formal definition of "Person", such as the Friend of a Friend (FOAF) vocabulary, which has a Person class that defines typical properties of
Is low. For example, an article might mention football as a secondary focus, and the indexer might decide not to tag it with "football" because it is not important enough compared to the main focus. But it turns out that for the searcher that article is relevant and hence recall fails. A free text search would automatically pick up that article regardless. On the other hand, free text searches have high exhaustivity (every word
Is not a reliable indicator of a gene's significance. Loss of gene activity leads to a nutritional requirement (auxotrophy) not exhibited by the wildtype (prototrophy). Amino acids: Some pathways produce metabolites that are precursors of more than one pathway. Hence, loss of one of these enzymes will lead to a requirement for more than one amino acid. For example: Nucleotides: Vitamins: Loss of gene activity leads to loss of
Is not neutral, and the indexer must carefully consider the ethics of their word choices. For example, traditionally colonialist terms have often been the preferred terms in chosen vocabularies when discussing First Nations issues, which has caused controversy. Controlled vocabularies, such as the Library of Congress Subject Headings, are an essential component of bibliography, the study and classification of books. They were initially developed in library and information science. In
Is not the relationship of an acronym to its expansion. In fact, many official gene symbol–gene name pairs do not even share their initial-letter sequences (although some do). Nevertheless, gene and protein symbols "look just like" abbreviations and acronyms, which presents the problem that "failing" to "expand" them (even though it is not actually a failure and there are no true expansions) creates
Is not the relationship of an acronym to its expansion. In this sense they are similar to the symbols for units of measurement in the SI system (such as km for the kilometre), in that they can be viewed as true logograms rather than just abbreviations. Sometimes the distinction is academic, but not always. Although it is not wrong to say that "VEGFA" is an acronym standing for "vascular endothelial growth factor A", just as it
Is not wrong that "km" is an abbreviation for "kilometre", there is more to the formality of symbols than those statements capture. The root portion of the symbols for a gene family (such as the "SERPIN" root in SERPIN1, SERPIN2, SERPIN3, and so on) is called a root symbol. The HUGO Gene Nomenclature Committee is responsible for providing human gene naming guidelines and approving new, unique human gene names and symbols (short identifiers typically created by abbreviating). All human gene names and symbols can be searched online at
Is often biologically irrelevant. Also owing to the nature of how scientific knowledge has unfolded, proteins and their corresponding genes often have several names and symbols that are synonymous. Some of the earlier ones may be deprecated in favor of newer ones, although such deprecation is voluntary. Some older names and symbols live on simply because they have been widely used in the scientific literature (including before
Is responsible for providing human gene naming guidelines and approving new, unique human gene names and symbols (short identifiers typically created by abbreviating). For some nonhuman species, model organism databases serve as central repositories of guidelines and help resources, including advice from curators and nomenclature committees. In addition to species-specific databases, approved gene names and symbols for many species can be located in
Is searched) so although it has much lower precision, it has potential for high recall as long as the searcher overcomes the problem of synonyms by entering every combination. Controlled vocabularies may become outdated rapidly in fast-developing fields of knowledge, unless the preferred terms are updated regularly. Even in an ideal scenario, a controlled vocabulary is often less specific than the words of
Is structured as a directed acyclic graph, and each term has defined relationships to one or more other terms in the same domain, and sometimes to other domains. The GO vocabulary is designed to be species-neutral and includes terms applicable to prokaryotes and eukaryotes, single and multicellular organisms. GO is not static, and additions, corrections, and alterations are suggested by and solicited from members of
Is that abbreviations and acronyms must be expanded at first use, to provide a glossing type of explanation. Typically no exceptions are permitted except for small lists of especially well-known terms (such as DNA or HIV). Although readers with high subject-matter expertise do not need most of these expansions, those with intermediate or (especially) low expertise are appropriately served by them. One complication that gene and protein symbols bring to this general rule
Is that some official gene names have the word "protein" within them, so the phrase "brain protein I3 (BRI3)" (referring to the gene) and "brain protein I3 (BRI3)" (referring to the protein) are both valid. The AMA Manual gives another example: both "the TH gene" and "the TH gene" can validly be parsed as correct ("the gene for tyrosine hydroxylase"), because the first mentions the alias (description) and
Is that they are not, accurately speaking, abbreviations or acronyms, despite the fact that many were originally coined via abbreviating or acronymic etymology. They are pseudoacronyms (as SAT and KFC also are) because they do not "stand for" any expansion. Rather, the relationship of a gene symbol to the gene name is functionally the relationship of a nickname to a formal name (both are complete identifiers); it
Is the name given to a number of different team sports. Worldwide the most popular of these team sports is association football, which also happens to be called soccer in several countries. The word football is also applied to rugby football (rugby union and rugby league), American football, Australian rules football, Gaelic football, and Canadian football. A search for football therefore will retrieve documents that are about several completely different sports. Controlled vocabulary solves this problem by tagging
Is the root symbol, and the family members are PRDX1, PRDX2, PRDX3, PRDX4, PRDX5, and PRDX6. Gene symbols generally are italicised, with only the first letter in uppercase and the remaining letters in lowercase (Shh). Italics are not required on web pages. Protein designations are the same as the gene symbol, but are not italicised and all are upper case (SHH). Nomenclature generally follows
Is the scientific naming of genes, the units of heredity in living organisms. It is also closely associated with protein nomenclature, as genes and the proteins they code for usually have similar nomenclature. An international committee published recommendations for genetic symbols and nomenclature in 1957. The need to develop formal guidelines for human gene names and symbols was recognized in
Is usable for indexing web pages is PSH. It is unlikely that a single metadata scheme will ever succeed in describing the content of the entire Web. To create a Semantic Web, it may be necessary to draw from two or more metadata systems to describe a Web page's contents. The eXchangeable Faceted Metadata Language (XFML) is designed to enable controlled vocabulary creators to publish and share metadata systems. XFML
The ERIC Thesaurus. When selecting terms for a controlled vocabulary, the designer has to consider the specificity of the term chosen, whether to use direct entry, inter consistency and stability of the language. Lastly, the amount of pre-coordination (in which case the degree of enumeration versus synthesis becomes an issue) and post-coordination in the system is another important issue. Controlled vocabulary elements (terms/phrases) employed as tags, to aid in
The National Center for Biotechnology Information's "Entrez Gene" database. There are generally accepted rules and conventions used for naming genes in bacteria. Standards were proposed in 1966 by Demerec et al. Each bacterial gene is denoted by a mnemonic of three lower case letters which indicate the pathway or process in which the gene-product is involved, followed by a capital letter signifying
The 1950s, government agencies began to develop controlled vocabularies for the burgeoning journal literature in specialized fields; an example is the Medical Subject Headings (MeSH) developed by the U.S. National Library of Medicine. Subsequently, for-profit firms (called Abstracting and indexing services) emerged to index the fast-growing literature in every field of knowledge. In the 1960s, an online bibliographic database industry developed based on dialup X.25 networking. These services were seldom made available to
Gene Ontology - Misplaced Pages Continue
The 1960s and full guidelines were issued in 1979 (Edinburgh Human Genome Meeting). Several other genus-specific research communities (e.g., Drosophila fruit flies, Mus mice) have adopted nomenclature standards, as well, and have published them on the relevant model organism websites and in scientific journals, including the Trends in Genetics Genetic Nomenclature Guide. Scientists familiar with
The 2007 article, "A Comparative Evaluation of Full-text, Concept-based, and Context-sensitive Search". Controlled vocabularies are often claimed to improve the accuracy of free text searching, such as to reduce irrelevant items in the retrieval list. These irrelevant items (false positives) are often caused by the inherent ambiguity of natural language. Take the English word football for example. Football
The GO database directly. AmiGO can be used online at the GO website to access the data provided by the GO Consortium or downloaded and installed for local use on any database employing the GO database schema (e.g.). It is free open source software and is available as part of the go-dev software distribution. OBO-Edit is an open source, platform-independent ontology editor developed and maintained by
The GO term and evidence used, and supplementary information, such as the conditions the function is observed under, may also be included in a GO annotation. The evidence code comes from a controlled vocabulary of codes, the Evidence Code Ontology, covering both manual and automated annotation methods. For example, Traceable Author Statement (TAS) means a curator has read a published scientific paper and
The GO website in a number of formats or can be accessed online using the GO browser AmiGO. The Gene Ontology project also provides downloadable mappings of its terms to other classification systems. Data source: Genome annotation encompasses the practice of capturing data about a gene product, and GO annotations use terms from the GO to do so. Annotations from GO curators are integrated and disseminated on
The GO website, where they can be downloaded directly or viewed online using AmiGO. In addition to the gene product identifier and the relevant GO term, GO annotations have at least the following data: the reference used to make the annotation (e.g. a journal article); an evidence code denoting the type of evidence upon which the annotation is based; and the date and the creator of the annotation. Supporting information, depending on
The Gene Ontology Consortium. It is implemented in Java and uses a graph-oriented approach to display and edit ontologies. OBO-Edit includes a comprehensive search and filter interface, with the option to render subsets of terms to make them visually distinct; the user interface can also be customized according to user preferences. OBO-Edit also has a reasoner that can infer links that have not been explicitly stated based on existing relationships and their properties. Although it
The HGNC website, and the guidelines for their formation are available there. The guidelines for humans fit logically into the larger scope of vertebrates in general, and the HGNC's remit has recently expanded to assigning symbols to all vertebrate species without an existing nomenclature committee, to ensure that vertebrate genes are named in line with their human orthologs/paralogs. Human gene symbols generally are italicised, with all letters in uppercase (e.g., SHH, for sonic hedgehog). Italics are not necessary in gene catalogs. Protein designations are
The ability to catabolise (use) the compound. If the gene in question is the wildtype a superscript '+' sign is used: If a gene is mutant, it is signified by a superscript '-': By convention, if neither is used, it is considered to be mutant. There are additional superscripts and subscripts which provide more information about the mutation: Other modifiers: When referring to the genotype (the gene)
The actual gene. In some cases, the gene letter may be followed by an allele number. All letters and numbers are underlined or italicised. For example, leuA is one of the genes of the leucine biosynthetic pathway, and leuA273 is a particular allele of this gene. Where the actual protein coded by the gene is known then it may become part of the basis of the mnemonic, thus: Some gene designations refer to
The appearance of violating the spell-out-all-acronyms rule. One common way of reconciling these two opposing forces is simply to exempt all gene and protein symbols from the glossing rule. This is certainly fast and easy to do, and in highly specialized journals, it is also justified because the entire target readership has high subject matter expertise. (Experts are not confused by the presence of symbols (whether known or novel) and they know where to look them up online for further details if needed.) But for journals with broader and more general target readerships, this action leaves
The author's responsibility. However, as pointed out earlier, many authors make little attempt to follow the letter case or italic guidelines; and regarding protein symbols, they often will not use the official symbol at all. For example, although the guidelines would call p53 protein "TP53" in humans or "Trp53" in mice, most authors call it "p53" in both (and even refuse to call it "TP53" if edits or queries try to), not least because of
The benefits of vocabulary control and bibliographic control, although adherence is voluntary. The advent of the information age has brought gene ontology, which in some ways is a next step of gene nomenclature, because it aims to unify the representation of gene and gene product attributes across all species. Gene nomenclature and protein nomenclature are not separate endeavors; they are aspects of
The biologic principle that many proteins are essentially or exactly the same molecules regardless of mammalian species. Regarding the gene, authors are usually willing to call it by its human-specific symbol and capitalization, TP53, and may even do so without being prompted by a query. But the end result of all these factors is that the published literature often does not follow the nomenclature guidelines completely. Controlled vocabulary Controlled vocabularies provide
The content identification process of documents, or other information system entities (e.g. DBMS, Web Services) qualifies as metadata. There are three main types of indexing languages. When indexing a document, the indexer also has to choose the level of indexing exhaustivity, the level of detail in which the document is described. For example, using low indexing exhaustivity, minor aspects of
The controlled vocabulary scheme to make best use of the system. But as already mentioned, the control of synonyms and homographs can help increase precision. Numerous methodologies have been developed to assist in the creation of controlled vocabularies, including faceted classification, which enables a given data record or document to be described in multiple ways. Word choice in chosen vocabularies
The conventions of human nomenclature. Gene symbols generally are italicised, with all letters in uppercase (e.g., NLGN1, for neuroligin1). Protein designations are the same as the gene symbol, but are not italicised; all letters are in uppercase (NLGN1). mRNAs and cDNAs use the same formatting conventions as the gene symbol. Gene symbols are italicised and all letters are in lowercase (shh). Protein designations are different from their gene symbol; they are not italicised, and all letters are in uppercase (SHH). Gene symbols are italicised and all letters are in lowercase (shh). Protein designations are
The correct preferred term is searched, there is no need to search for other terms that might be synonyms of that term. A controlled vocabulary search may lead to unsatisfactory recall, in that it will fail to retrieve some documents that are actually relevant to the search question. This is particularly problematic when the search question involves terms that are sufficiently tangential to
The data. Many major plant, animal, and microorganism databases make a contribution towards this project. As of July 2019, the GO contains 44,945 terms; there are 6,408,283 annotations to 4,467 different biological organisms. There is a significant body of literature on the development and use of the GO, and it has become a standard tool in the bioinformatics arsenal. Their objectives have three aspects: building gene ontology, assigning ontology to gene/gene products, and developing software and databases for
The documents in such a way that the ambiguities are eliminated. Compared to free text searching, the use of a controlled vocabulary can dramatically increase the performance of an information retrieval system, if performance is measured by precision (the percentage of documents in the retrieval list that are actually relevant to the search topic). In some cases controlled vocabulary can enhance recall as well, because unlike natural language schemes, once
The explanation clearer.) There is no way for a non-SME to know this is the case for any particular letter string without looking up every gene from the manuscript in a database such as NCBI Gene, reviewing its symbol, name, and alias list, and doing some mental cross-referencing and double-checking (plus it helps to have biochemical knowledge). Most medical journals do not (in some cases cannot) pay for that level of fact-checking as part of their copyediting service level; therefore, it remains
The first duality (same symbol and name for gene or protein), the context usually makes the sense clear to scientific readers, and the nomenclatural systems also provide for some specificity by using italic for a symbol when the gene is meant and plain (roman) for when the protein is meant. Regarding the second duality (a given protein is endogenous in many kinds of organisms), the nomenclatural systems also provide for at least human-versus-nonhuman specificity by using different capitalization, although scientists often ignore this distinction, given that it
The first letter is upper-case. E.g. the name of RNA polymerase is RpoB, and this protein is encoded by the rpoB gene. The research communities of vertebrate model organisms have adopted guidelines whereby genes in these species are given, whenever possible, the same names as their human orthologs. The use of prefixes on gene symbols to indicate species (e.g., "Z" for zebrafish) is discouraged. The recommended formatting of printed gene and protein symbols varies between species. Vertebrate genes and proteins have names (typically strings of words) and symbols, which are short identifiers (typically 3 to 8 characters). For example,
The first two objects. Several analyses of the Gene Ontology using formal, domain-independent properties of classes (the metaproperties) are also starting to appear. For instance, there is now an ontological analysis of biological ontologies. From a practical view, an ontology is a representation of something we know about. "Ontologies" consist of representations of things that are detectable or directly observable and
The game pool to ensure that each preferred term or heading refers to only one concept. There are two main kinds of controlled vocabulary tools used in libraries: subject headings and thesauri. While the differences between the two are diminishing, there are still some minor differences. The terms are chosen and organized by trained professionals (including librarians and information scientists) who possess expertise in
The gene cytotoxic T-lymphocyte-associated protein 4 has the HGNC symbol CTLA4. These symbols are usually, but not always, coined by contraction or acronymic abbreviation of the name. They are pseudo-acronyms, however, in the sense that they are complete identifiers by themselves: short names, essentially. They are synonymous with (rather than standing for) the gene/protein name (or any of its aliases), regardless of whether
The gene and protein nomenclature throughout a manuscript (except by rare express instructions on particular assignments), the middle ground in manuscripts using synonyms or older symbols is that the copyeditor will add a mention of the current official symbol at least as a parenthetical gloss at the first mention of the gene or protein, and query for confirmation. Some basic conventions, such as (1) that animal/human homolog (ortholog) pairs differ in letter case (title case and all caps, respectively) and (2) that
The initial letters "match". For example, the symbol for the gene v-akt murine thymoma viral oncogene homolog 1, which is AKT1, cannot be said to be an acronym for the name, and neither can any of its various synonyms, which include AKT, PKB, PRKBA, and RAC. Thus, the relationship of a gene symbol to the gene name is functionally the relationship of a nickname to a formal name (both are complete identifiers); it
The latest official symbol and name, but just as often they use synonyms and previous symbols and names, which are well established by earlier use in the literature. AMA style is that "authors should use the most up-to-date term" and that "in any discussion of a gene, it is recommended that the approved gene symbol be mentioned at some point, preferably in the title and abstract if relevant." Because copyeditors are not expected or allowed to rewrite
The latter mentions the symbol. This seems confusing on the surface, although it is easier to understand when explained as follows: in this gene's case, as in many others, the alias (description) "happens to use the same letter string" that the symbol uses. (The matching of the letters is of course acronymic in origin and thus the phrase "happens to" implies more coincidence than is actually present; but phrasing it that way helps to make
The metadata for that annotation bears a citation to that paper; Inferred from Sequence Similarity (ISS) means a human curator has reviewed the output from a sequence similarity search and verified that it is biologically meaningful. Annotations from automated processes (for example, remapping annotations created using another annotation vocabulary) are given the code Inferred from Electronic Annotation (IEA). In 2010, over 98% of all GO annotations were inferred computationally, not by curators, but as of July 2, 2019, only about 30% of all GO annotations were inferred computationally. As these annotations are not checked by
The mnemonic is italicized and not capitalised. When referring to the gene product or phenotype, the mnemonic is first-letter capitalised and not italicized (e.g. DnaA – the protein produced by the dnaA gene; LeuA – the phenotype of a leuA mutant; Amp – the ampicillin-resistance phenotype of the β-lactamase gene bla). Protein names are generally the same as the gene names, but the protein names are not italicized, and
The newer ones were coined) and are well established among users. For example, mentions of HER2 and ERBB2 are synonymous. Lastly, the correlation between genes and proteins is not always one-to-one (in either direction); in some cases it is several-to-one or one-to-several, and the names and symbols may then be gene-specific or protein-specific to some degree, or overlapping in usage: The HUGO Gene Nomenclature Committee
The ontology has a term name, which may be a word or string of words; a unique alphanumeric identifier; a definition with cited sources; and an ontology indicating the domain to which it belongs. Terms may also have synonyms, which are classed as being exactly equivalent to the term name, broader, narrower, or related; references to equivalent concepts in other databases; and comments on term meaning or usage. The GO ontology
The principles of user warrant (what terms users are likely to use), literary warrant (what terms are generally used in the literature and documents), and structural warrant (terms chosen by considering the structure, scope of the controlled vocabulary). Controlled vocabularies also typically handle the problem of homographs with qualifiers. For example, the term pool has to be qualified to refer to either swimming pool or
The protein and another for the gene. Another reason is that many of the mechanisms of life are the same or very similar across species, genera, orders, and phyla (through homology, analogy, or some of both), so that a given protein may be produced in many kinds of organisms; and thus scientists naturally often use the same symbol and name for a given protein in one species (for example, mice) as in another species (for example, humans). Regarding
5780-657: The public because they were difficult to use; specialist librarians called search intermediaries handled the searching job. In the 1980s, the first full text databases appeared; these databases contain the full text of the indexed articles as well as the bibliographic information. Online bibliographic databases have migrated to the Internet and are now publicly available; however, most are proprietary and can be expensive to use. Students enrolled in colleges and universities may be able to access some of these services without charge; some of these services may be accessible without charge at
5865-426: The readers without any explanatory annotation and can leave them wondering what the apparent-abbreviation stands for and why it was not explained. Therefore, a good alternative solution is simply to put either the official gene name or a suitable short description (gene alias/other designation) in parentheses after the first use of the official gene/protein symbol. This meets both the formal requirement (the presence of
5950-435: The relationships between those things. There is no universal standard terminology in biology and related domains, and term usage may be specific to a species, research area, or even a particular research group. This makes communication and sharing of data more difficult. The Gene Ontology project provides an ontology of defined terms representing gene product properties. The ontology covers three domains: cellular component, molecular function, and biological process. Each GO term within
6035-426: The research and annotation communities, as well as by those directly involved in the GO project. For example, an annotator may request a specific term to represent a metabolic pathway, or a section of the ontology may be revised with the help of community experts. Suggested edits are reviewed by the ontology editors, and implemented where appropriate. The GO ontology and annotation files are freely available from
6120-399: The same as the gene symbol except that they are not italicised. Like the gene symbol, they are in all caps because they designate human proteins (human-specific or human homolog). mRNAs and cDNAs use the same formatting conventions as the gene symbol. For naming families of genes, the HGNC recommends using a "root symbol" as the root for the various gene symbols. For example, for the peroxiredoxin family, PRDX
6205-482: The same as the gene symbol, but are not italicised; the first letter is in uppercase and the remaining letters are in lowercase (Shh). Gene symbols are italicised, with all letters in lowercase (shh). Protein designations are the same as the gene symbol, but are not italicised; the first letter is in uppercase and the remaining letters are in lowercase (Shh). A nearly universal rule in copyediting of articles for medical journals and other health science publications
6290-564: The same concept can be given different names and ensure consistency. For example, in the Library of Congress Subject Headings (a subject heading system that uses a controlled vocabulary), preferred terms (subject headings in this case) have to be chosen to handle choices between variant spellings of the same word (American versus British), choices between scientific and popular terms (cockroach versus Periplaneta americana), and choices between synonyms (automobile versus car), among other difficult issues. Choices of preferred terms are based on
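The idea of mapping variant forms to one preferred term can be sketched as a simple lookup. This is an illustrative assumption, not how any particular subject heading system is implemented, and the choice of which form counts as "preferred" in the table below is arbitrary rather than taken from an actual vocabulary.

```python
from typing import Optional

# Hypothetical table: every variant (and the preferred form itself)
# points to a single preferred term.
PREFERRED_TERMS = {
    "automobile": "automobile",
    "car": "automobile",
    "cockroach": "cockroach",
    "periplaneta americana": "cockroach",
    "color": "color",
    "colour": "color",
}


def to_preferred(term: str) -> Optional[str]:
    """Return the preferred term for a variant, or None if it is not in the vocabulary."""
    return PREFERRED_TERMS.get(term.strip().lower())


preferred_for_car = to_preferred("Car")
preferred_for_species = to_preferred(" Periplaneta americana ")
unknown = to_preferred("bicycle")
```

Returning None for an unknown term, rather than echoing the input, keeps the distinction between vocabulary terms and free text explicit.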
6375-487: The same thing. Web searching could be dramatically improved by the development of a controlled vocabulary for describing Web pages; the use of such a vocabulary could culminate in a Semantic Web, in which the content of Web pages is described using a machine-readable metadata scheme. One of the first proposals for such a scheme is the Dublin Core Initiative. An example of a controlled vocabulary which
6460-518: The same whole. Any name or symbol used for a protein can potentially also be used for the gene that encodes it, and vice versa. But owing to the nature of how science has developed (with knowledge being uncovered bit by bit over decades), proteins and their corresponding genes have not always been discovered simultaneously (and not always physiologically understood when discovered), which is the largest reason why protein and gene names do not always match, or why scientists tend to favor one symbol or name for
6545-637: The short form is more widely used and the expansion is merely parenthetical to the discussion at hand. The same is true of gene/protein symbols. The HUGO Gene Nomenclature Committee (HGNC) maintains an official symbol and name for each human gene, as well as a list of synonyms and previous symbols and names. For example, for AFF1 (AF4/FMR2 family, member 1), previous symbols and names are MLLT2 ("myeloid/lymphoid or mixed-lineage leukemia (trithorax (Drosophila) homolog); translocated to, 2") and PBM1 ("pre-B-cell monocytic leukemia partner 1"), and synonyms are AF-4 and AF4. Authors of journal articles often use
6630-481: The spell-everything-out rule) often also follows the "abbreviation-leading" style of expansion that has become more prevalent in recent years. Traditionally, the abbreviation always followed the fully expanded form in parentheses at first use. This is still the general rule. But for certain classes of abbreviations or acronyms (such as clinical trial acronyms [e.g., ECOG] or standardized polychemotherapy regimens [e.g., CHOP]), this pattern may be reversed, because
6715-400: The subject area such that the indexer might have decided to tag it using a different term (one that the searcher might consider equivalent). Essentially, this can be avoided only by an experienced user of controlled vocabulary whose understanding of the vocabulary coincides with that of the indexer. Another possibility is that the article is just not tagged by the indexer because indexing exhaustivity
6800-566: The subject area. Controlled vocabulary terms can accurately describe what a given document is actually about, even if the terms themselves do not occur within the document's text. Well known subject heading systems include the Library of Congress system, Medical Subject Headings (MeSH) created by the United States National Library of Medicine, and Sears. Well known thesauri include the Art and Architecture Thesaurus and
6885-432: The symbol is italicized when referring to the gene but nonitalic when referring to the protein, are often not followed by contributors to medical journals. Many journals have the copyeditors restyle the casing and formatting to the extent feasible, although in complex genetics discussions only subject-matter experts (SMEs) can effortlessly parse them all. One example that illustrates the potential for ambiguity among non-SMEs
6970-413: The text itself. Indexers trying to choose the appropriate index terms might misinterpret the author, while this precise problem is not a factor in a free text, as it uses the author's own words. The use of controlled vocabularies can be costly compared to free text searches because human experts or expensive automated systems are necessary to index each entry. Furthermore, the user has to be familiar with
7055-418: The work will not be described with index terms. In general the higher the indexing exhaustivity, the more terms indexed for each document. In recent years free text search as a means of access to documents has become popular. This involves using natural language indexing with indexing exhaustivity set to maximum (every word in the text is indexed). These methods have been compared in some studies, such as
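Free text indexing at maximum exhaustivity, as described above, can be sketched as an inverted index in which every word of every document becomes an index entry. This is a minimal sketch under stated assumptions: the function names, the lowercase alphanumeric tokenization, and the sample documents are all invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_full_text_index(documents):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return dict(index)


def search(index, query):
    """Return ids of documents containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Made-up documents for illustration.
documents = {
    "d1": "The swimming pool is closed",
    "d2": "A game of pool",
    "d3": "Football season",
}
index = build_full_text_index(documents)
pool_hits = search(index, "pool")
swimming_pool_hits = search(index, "swimming pool")
soccer_hits = search(index, "soccer")
```

The query "pool" matches both the swimming pool document and the game document, which is the homograph problem that controlled vocabularies address with qualifiers; the query "soccer" finds nothing because free text search matches only the author's own words.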
7140-477: Was developed for biomedical ontologies, OBO-Edit can be used to view, search, and edit any ontology. It is freely available to download. The Gene Ontology Consortium is the set of biological databases and research groups actively involved in the Gene Ontology project. This includes a number of model organism databases and multi-species protein databases, software development groups, and a dedicated editorial office. Gene nomenclature Gene nomenclature
7225-437: Was originally constructed in 1998 by a consortium of researchers studying the genomes of three model organisms: Drosophila melanogaster (fruit fly), Mus musculus (mouse), and Saccharomyces cerevisiae (brewer's or baker's yeast). Many other Model Organism Databases have joined the Gene Ontology Consortium, contributing not only to annotation data, but also to the development of ontologies and tools to view and apply