Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The latest version of Pfam, 37.0, was released in June 2024 and contains 21,979 families. It is currently provided through the InterPro website.
58-530: The general purpose of the Pfam database is to provide a complete and accurate classification of protein families and domains. Originally, the rationale behind creating the database was to have a semi-automated method of curating information on known protein families to improve the efficiency of annotating genomes. The Pfam classification of protein families has been widely adopted by biologists because of its wide coverage of proteins and sensible naming conventions. It
116-402: A 50S subunit joins the 30S subunit, forming an active 70S ribosome. Termination of the polypeptide occurs when the A site of the ribosome is occupied by a stop codon (UAA, UAG, or UGA) on the mRNA, creating the primary structure of a protein. tRNA usually cannot recognize or bind to stop codons. Instead, the stop codon induces the binding of a release factor protein (RF1 & RF2) that prompts
174-674: A cell. To delve deeper into this intricate process, scientists typically use a technique known as ribosome profiling. This method enables researchers to take a snapshot of the translatome, showing which parts of the mRNA are being translated into proteins by ribosomes at a given time. Ribosome profiling provides valuable insights into translation dynamics, revealing the complex interplay between gene sequence, mRNA structure, and translation regulation. For example, research utilizing this method has revealed that genetic differences and their subsequent expression as mRNAs can also impact translation rate in an RNA-specific manner. Expanding on this concept,
232-576: A clan. This portion has grown to around three-fourths by 2019 (version 32.0). To identify possible clan relationships, Pfam curators use the Simple Comparison Of Outputs Program (SCOOP) as well as information from the ECOD database. ECOD is a semi-automated hierarchical database of protein families with known structures, with families that map readily to Pfam entries and homology levels that usually map to Pfam clans. Pfam
290-444: A curated gathering threshold are classified as members of the protein family. The resulting collection of members is then aligned to the profile HMM to generate a full alignment. For each family, a manually curated gathering threshold is assigned that maximises the number of true matches to the family while excluding any false positive matches. False positives are estimated by observing overlaps between Pfam family hits that are not from
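The thresholding step can be illustrated with a minimal Python sketch. The `Hit` record, the family names, and the scores are hypothetical and are not taken from Pfam; the sketch assumes a single bit-score threshold per family and only shows the idea that a hit counts as a family member when its score reaches the family's curated gathering threshold.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One match of a family's profile HMM against a sequence (hypothetical record)."""
    sequence_id: str
    family: str
    bit_score: float


def family_members(hits, gathering_thresholds):
    """Keep only the hits whose score reaches their family's gathering threshold."""
    return [h for h in hits if h.bit_score >= gathering_thresholds[h.family]]


# Hypothetical thresholds and search results, for illustration only.
thresholds = {"FamA": 25.0, "FamB": 30.0}
hits = [
    Hit("seq1", "FamA", 40.2),
    Hit("seq2", "FamA", 24.9),   # just below the FamA threshold, so excluded
    Hit("seq3", "FamB", 30.0),   # exactly at the FamB threshold, so included
]
members = family_members(hits, thresholds)
print([h.sequence_id for h in members])
```

Using an inclusive comparison is a design choice of this sketch; the members kept here would then be aligned to the profile HMM to form the full alignment.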
348-443: A different structure from that of eukaryotic ribosomes, and thus antibiotics can specifically target bacterial infections without any harm to a eukaryotic host's cells. The basic process of protein production is the addition of one amino acid at a time to the end of a protein. This operation is performed by a ribosome. A ribosome is made up of two subunits, a small subunit and a large subunit. These subunits come together before
406-579: A domain or extended structure. Motifs are usually shorter sequence units found outside of globular domains. The descriptions of Pfam families are managed by the general public using Wikipedia (see Community curation). As of release 29.0, 76.1% of protein sequences in UniprotKB matched to at least one Pfam domain. New families come from a range of sources, primarily the PDB and analysis of complete proteomes to find genes with no Pfam hit. For each family,
464-536: A downstream hairpin (SElenoCysteine Insertion Sequence, or SECIS). There are many computer programs capable of translating a DNA/RNA sequence into a protein sequence. Normally this is performed using the Standard Genetic Code; however, few programs can handle all the "special" cases, such as the use of alternative initiation codons, which are biologically significant. For instance, the rare alternative start codon CTG codes for methionine when used as
522-483: A more nuanced understanding of how translation regulation can impact cell behavior, metabolic state, and responsiveness to various stimuli or conditions. Translational control is critical for the development and survival of cancer. Cancer cells must frequently regulate the translation phase of gene expression, though it is not fully understood why translation is targeted over steps like transcription. While cancer cells often have genetically altered translation factors, it
580-409: A more recent development is single-cell ribosome profiling, a technique that allows us to study the translation process at the resolution of individual cells. This is particularly significant as cells, even those of the same type, can exhibit considerable variability in their protein synthesis. Single-cell ribosome profiling has the potential to shed light on the heterogeneous nature of cells, leading to
638-484: A page with information and links to databases as well as available images; once an article has been reviewed by a curator, it is moved from the Sandbox to Wikipedia proper. In order to guard against vandalism of articles, each Wikipedia revision is reviewed by curators before it is displayed on the Pfam website. Almost all cases of vandalism have been corrected by the community before they reach curators, however. Pfam
696-409: A representative subset of sequences are aligned into a high-quality seed alignment. Sequences for the seed alignment are taken primarily from pfamseq (a non-redundant database of reference proteomes) with some supplementation from UniprotKB. This seed alignment is then used to build a profile hidden Markov model using HMMER. This HMM is then searched against sequence databases, and all hits that reach
754-480: A start codon, and for leucine in all other positions. Example: Condensed translation table for the Standard Genetic Code (from the NCBI Taxonomy webpage). The "Starts" row indicates three start codons: UUG, CUG, and the very common AUG. It also indicates the first amino acid residue when the codon is interpreted as a start: in this case it is methionine for all three. Even when working with ordinary eukaryotic sequences such as
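The position-dependent reading of a start codon can be sketched in a few lines of Python. This is an illustrative sketch only: `translate_orf` and `PARTIAL_TABLE` are hypothetical names, the table covers just the handful of codons used in the example rather than all 64, and the input is assumed to begin exactly at the initiation codon.

```python
# Partial codon table for illustration only; a real translator needs all 64 codons.
PARTIAL_TABLE = {"AUG": "M", "CUG": "L", "UUG": "L", "GCU": "A", "AAA": "K", "UAA": "*"}

# Start codons listed in the "Starts" row of the Standard Genetic Code.
START_CODONS = {"AUG", "CUG", "UUG"}


def translate_orf(rna):
    """Translate an RNA reading frame whose first codon is the initiation codon."""
    codons = [rna[i:i + 3] for i in range(0, len(rna) - 2, 3)]
    residues = []
    for position, codon in enumerate(codons):
        if position == 0 and codon in START_CODONS:
            # Any recognised start codon initiates the chain with methionine.
            residues.append("M")
        else:
            residues.append(PARTIAL_TABLE[codon])
    return "".join(residues)


# CUG is read as methionine at the start and as leucine further along.
print(translate_orf("CUGGCUCUGUAA"))  # MAL*
```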
812-493: A substantial reorganisation to further reduce manual effort involved in curation and allow for more frequent updates. Circa 2022, Pfam was integrated into InterPro at the European Bioinformatics Institute. Curation of such a large database presented issues in terms of keeping up with the volume of new families and updated information that needed to be added. To speed up releases of the database,
870-402: Is also possible to translate either by hand (for short sequences) or by computer (after first programming one appropriately, see section below); this allows biologists and chemists to draw out the chemical structure of the encoded protein on paper. First, convert each template DNA base to its RNA complement (note that the complement of A is now U), as shown below. Note that the template strand of
928-411: Is called the genetic code. The translation is performed by a large complex of functional RNA and proteins called ribosomes. The entire process is called gene expression. In translation, messenger RNA (mRNA) is decoded in a ribosome, outside the nucleus, to produce a specific amino acid chain, or polypeptide. The polypeptide later folds into an active protein and performs its functions in
986-550: Is expected that DUFs will eventually outnumber families of known function. Over time both sequence and residue coverage have increased, and as families have grown, more evolutionary relationships have been discovered, allowing the grouping of families into clans. Clans were first introduced to the Pfam database in 2005. They are groupings of related families that share a single evolutionary origin, as confirmed by structural, functional, sequence and HMM comparisons. As of release 29.0, approximately one third of protein families belonged to
1044-491: Is much more common for cancer cells to modify the levels of existing translation factors. Several major oncogenic signaling pathways, including the RAS–MAPK, PI3K/AKT/mTOR, MYC, and WNT–β-catenin pathways, ultimately reprogram the genome via translation. Cancer cells also control translation to adapt to cellular stress. During stress, the cell translates mRNAs that can mitigate the stress and promote survival. An example of this
1102-519: Is named in order of addition. Names of these entries are updated as their functions are identified. Normally when the function of at least one protein belonging to a DUF has been determined, the function of the entire DUF is updated and the family is renamed. Some named families are still domains of unknown function, that are named after a representative protein, e.g. YbbR. Numbers of DUFs are expected to continue increasing as conserved sequences of unknown function continue to be identified in sequence data. It
1160-483: Is run by an international consortium of three groups. In the earlier releases of Pfam, family entries could only be modified at the Cambridge, UK site, limiting the ability of consortium members to contribute to site curation. In release 26.0, developers moved to a new system that allowed registered users anywhere in the world to add or modify Pfam families. Translation (biology) In biology, translation
1218-400: Is the expression of AMPK in various cancers; its activation triggers a cascade that can ultimately allow the cancer to escape apoptosis (programmed cell death) triggered by nutrition deprivation. Future cancer therapies may involve disrupting the translation machinery of the cell to counter the downstream effects of cancer. The transcription-translation process description, mentioning only
1276-471: Is the process in living cells in which proteins are produced using RNA molecules as templates. The generated protein is a sequence of amino acids. This sequence is determined by the sequence of nucleotides in the RNA. The nucleotides are considered three at a time. Each such triple results in addition of one specific amino acid to the protein being generated. The matching from nucleotide triple to amino acid
1334-475: Is used by experimental biologists researching specific proteins, by structural biologists to identify new targets for structure determination, by computational biologists to organise sequences and by evolutionary biologists tracing the origins of proteins. Early genome projects, such as human and fly, used Pfam extensively for functional annotation of genomic data. The InterPro website allows users to submit protein or DNA sequences to search for matches to families in
1392-415: The 3' end. The energy required for translation of proteins is significant. For a protein containing n amino acids, the number of high-energy phosphate bonds required to translate it is 4n − 1. The rate of translation varies; it is significantly higher in prokaryotic cells (up to 17–21 amino acid residues per second) than in eukaryotic cells (up to 6–9 amino acid residues per second). Initiation involves
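The cost formula and the quoted elongation rates lend themselves to a quick back-of-the-envelope calculation. The Python sketch below is illustrative: the function names and the 300-residue example protein are assumptions, the chosen rates are simply values inside the ranges quoted in the text, and elongation is treated as proceeding at a constant rate.

```python
def phosphate_bonds(n_residues):
    """High-energy phosphate bonds needed for a protein of n residues (4n - 1)."""
    return 4 * n_residues - 1


def elongation_time_seconds(n_residues, residues_per_second):
    """Rough elongation time, assuming a constant rate of residue addition."""
    return n_residues / residues_per_second


n = 300  # hypothetical protein length
print(phosphate_bonds(n))               # 1199
print(elongation_time_seconds(n, 20))   # 15.0 (a rate within the prokaryotic range)
print(elongation_time_seconds(n, 6))    # 50.0 (a rate within the eukaryotic range)
```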
1450-777: The Yeast genome, it is often desired to be able to use alternative translation tables—namely for translation of the mitochondrial genes. Currently the following translation tables are defined by the NCBI Taxonomy Group for the translation of the sequences in GenBank:
1508-503: The paradigm that "useful models are simple and extendable". The simplest model M0 is represented by the reaction kinetic mechanism (Figure M0). It was generalised to include 40S, 60S and initiation factors (IF) binding (Figure M1'). It was extended further to include the effect of microRNA on protein synthesis. Most of the models in this hierarchy can be solved analytically. These solutions were used to extract 'kinetic signatures' of different specific mechanisms of synthesis regulation. It
1566-421: The primary structure of the protein. However, proteins tend to fold, depending in part on hydrophilic and hydrophobic segments along the chain. Secondary structure can often still be guessed at, but the proper tertiary structure is often very hard to determine. Whereas other aspects, such as the 3D (tertiary) structure of a protein, can only be predicted using sophisticated algorithms,
1624-466: The 30S ribosomal subunit. The binding of these complementary sequences ensures that the 30S ribosomal subunit is bound to the mRNA and is aligned such that the initiation codon is placed in the 30S portion of the P-site. Once the mRNA and 30S subunit are properly bound, an initiation factor brings the initiator tRNA–amino acid complex, f-Met-tRNA, to the 30S P site. The initiation phase is completed once
1682-492: The DNA is the one the RNA is polymerized against; the other DNA strand would be the same as the RNA, but with thymine instead of uracil. Then split the RNA into triplets (groups of three bases). Note that there are 3 translation "windows", or reading frames, depending on where you start reading the code. Finally, use the table at Genetic code to translate the above into a structural formula as used in chemistry. This will give
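The by-hand procedure (complement the template strand into RNA, split into triplets, look each triplet up in the genetic code) can also be written as a short program. The Python sketch below is a minimal illustration, not a reference implementation: the function names are hypothetical, the template strand is assumed to be written 3'→5' so that a base-by-base complement yields the mRNA 5'→3', and the output is a one-letter amino acid string rather than a structural formula. The codon table is the Standard Genetic Code laid out in the conventional U, C, A, G order.

```python
from itertools import product

# Standard Genetic Code, codons enumerated in U, C, A, G order ("*" marks a stop codon).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# DNA template base -> RNA complement (the complement of A is U in RNA).
RNA_COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe_template(template_dna):
    """Base-by-base RNA complement of a template strand written 3'->5'."""
    return "".join(RNA_COMPLEMENT[base] for base in template_dna.upper())


def split_codons(rna, frame=0):
    """Split RNA into complete triplets, starting at reading frame 0, 1 or 2."""
    return [rna[i:i + 3] for i in range(frame, len(rna) - 2, 3)]


def translate(rna, frame=0):
    """Translate every complete codon in the chosen reading frame."""
    return "".join(CODON_TABLE[codon] for codon in split_codons(rna, frame))


mrna = transcribe_template("TACGGCATT")
print(mrna)                                   # AUGCCGUAA
print([translate(mrna, f) for f in range(3)]) # ['MP*', 'CR', 'AV']
```

The three results differ because each reading frame groups the same bases into different triplets.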
1740-543: The ER; the newly created polypeptide can be stored inside the ER for future vesicle transport and secretion outside the cell, or immediately secreted. Many types of transcribed RNA, such as tRNA, ribosomal RNA, and small nuclear RNA, do not undergo a translation into proteins. Several antibiotics act by inhibiting translation. These include anisomycin, cycloheximide, chloramphenicol, tetracycline, streptomycin, erythromycin, and puromycin. Prokaryotic ribosomes have
1798-471: The P/E site and the uncharged tRNA leaves, and another aminoacyl-tRNA enters the A site to repeat the process. After the new amino acid is added to the chain, and after the tRNA is released out of the ribosome and into the cytosol, the energy provided by the hydrolysis of a GTP bound to the translocase EF-G (in bacteria) and a/eEF-2 (in eukaryotes and archaea) moves the ribosome down one codon towards
1856-432: The Pfam database. If DNA is submitted, a six-frame translation is performed, then each frame is searched. Rather than performing a typical BLAST search, Pfam uses profile hidden Markov models, which give greater weight to matches at conserved sites, allowing better remote homology detection, making them more suitable for annotating genomes of organisms with no well-annotated close relatives. Pfam has also been used in
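A six-frame translation reads the submitted DNA in three frames on the given strand and three frames on its reverse complement. The Python sketch below illustrates the idea only and is not the code used by InterPro or Pfam: the function names and frame labels are hypothetical, the Standard Genetic Code is assumed, and any triplet containing a non-ACGT character is reported as "X".

```python
from itertools import product

# Standard Genetic Code over DNA letters, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5'->3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def translate_frame(dna, offset):
    """Translate complete codons starting at the given offset (0, 1 or 2)."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(offset, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map frame labels (+1..+3, -1..-3) to the translated protein strings."""
    dna = dna.upper()
    frames = {}
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{label}{offset + 1}"] = translate_frame(strand, offset)
    return frames


print(six_frame_translation("ATGGCCTAA"))
# {'+1': 'MA*', '+2': 'WP', '+3': 'GL', '-1': 'LGH', '-2': '*A', '-3': 'RP'}
```

Each of the six protein strings would then be searched against the family models separately.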
1914-399: The amino acid sequence, called primary structure, can be determined solely from the nucleic acid sequence with the aid of a translation table. This approach may not give the correct amino acid composition of the protein, in particular if unconventional amino acids such as selenocysteine are incorporated into the protein, which is coded for by a conventional stop codon in combination with
1972-417: The aminoacyl site (abbreviated A), and the peptidyl site/exit site (abbreviated P/E). Concerning the mRNA, the three sites are oriented 5' to 3' E-P-A, because ribosomes move toward the 3' end of mRNA. The A-site binds the incoming tRNA with the complementary codon on the mRNA. The P/E-site holds the tRNA with the growing polypeptide chain. When an aminoacyl-tRNA initially binds to its corresponding codon on
2030-538: The bonding between specific tRNAs and the amino acids that their anticodon sequences call for. The product of this reaction is an aminoacyl-tRNA. The amino acid is joined by its carboxyl group to the 3' OH of the tRNA by an ester bond. When the tRNA has an amino acid linked to it, the tRNA is termed "charged". In bacteria, this aminoacyl-tRNA is carried to the ribosome by EF-Tu, where mRNA codons are matched through complementary base pairing to specific tRNA anticodons. Aminoacyl-tRNA synthetases that mispair tRNAs with
2088-412: The cell. The ribosome facilitates decoding by inducing the binding of complementary transfer RNA (tRNA) anticodon sequences to mRNA codons. The tRNAs carry specific amino acids that are chained together into a polypeptide as the mRNA passes through and is "read" by the ribosome. Translation proceeds in three phases: In prokaryotes (bacteria and archaea), translation occurs in the cytosol, where
2146-486: The chain are matched to successive nucleotide triplets in the mRNA. In this way, the sequence of nucleotides in the template mRNA chain determines the sequence of amino acids in the generated amino acid chain. The addition of an amino acid occurs at the C-terminus of the peptide; thus, translation is said to be amine-to-carboxyl directed. The mRNA carries genetic information encoded as a ribonucleotide sequence from
2204-553: The chromosomes to the ribosomes. The ribonucleotides are "read" by translational machinery in a sequence of nucleotide triplets called codons. Each of those triplets codes for a specific amino acid. The ribosome molecules translate this code to a specific sequence of amino acids. The ribosome is a multisubunit structure containing ribosomal RNA (rRNA) and proteins. It is the "factory" where amino acids are assembled into proteins. Transfer RNAs (tRNAs) are small noncoding RNA chains (74–93 nucleotides) that transport amino acids to
2262-637: The creation of other resources such as iPfam, which catalogs domain-domain interactions within and between proteins, based on information in structure databases and mapping of Pfam domains onto these structures. For each family in Pfam one can: Entries can be of several types: family, domain, repeat or motif. Family is the default class, which simply indicates that members are related. Domains are defined as an autonomous structural unit or reusable sequence unit that can be found in multiple protein contexts. Repeats are not usually stable in isolation, but rather are usually required to form tandem repeats in order to form
2320-606: The curators, in order for it to be linked in. It is anticipated that while community involvement will greatly improve the level of annotation of these families, some will remain insufficiently notable for inclusion in Wikipedia, in which case they will retain their original Pfam description. Some Wikipedia articles cover multiple families, such as the Zinc finger article. An automated procedure for generating articles based on InterPro and Pfam data has also been implemented, which populates
2378-450: The developers started a number of initiatives to allow greater community involvement in managing the database. A critical step in improving the pace of updating and improving entries was to open up the functional annotation of Pfam domains to the Wikipedia community in release 26.0. For entries that already had a Wikipedia entry, this was linked into the Pfam page, and for those that did not, the community were invited to create one and inform
2436-417: The disassembly of the entire ribosome/mRNA complex by the hydrolysis of the polypeptide chain from the peptidyl transferase center of the ribosome. Drugs or special sequence motifs on the mRNA can change the ribosomal structure so that near-cognate tRNAs are bound to the stop codon instead of the release factors. In such cases of 'translational readthrough', translation continues until the ribosome encounters
2494-412: The experimental conditions. The rate of premature translation abandonment, instead, has been estimated to be of the order of magnitude of 10⁻⁴ events per translated codon. The process of translation is highly regulated in both eukaryotic and prokaryotic organisms. Regulation of translation can impact the global rate of protein synthesis which is closely coupled to the metabolic and proliferative state of
2552-418: The large and small subunits of the ribosome bind to the mRNA. In eukaryotes, translation occurs in the cytoplasm or across the membrane of the endoplasmic reticulum in a process called co-translational translocation. In co-translational translocation, the entire ribosome/mRNA complex binds to the outer membrane of the rough endoplasmic reticulum (ER), and the new protein is synthesized and released into
2610-420: The last four decades. Beyond chemical kinetics, various modeling formalisms such as Totally Asymmetric Simple Exclusion Process, Probabilistic Boolean Networks, Petri Nets and max-plus algebra have been applied to model the detailed kinetics of protein synthesis or some of its stages. A basic model of protein synthesis that takes into account all eight 'elementary' processes has been developed, following
2668-401: The mRNA, it is in the A site. Then, a peptide bond forms between the amino acid of the tRNA in the A site and the amino acid of the charged tRNA in the P/E site. The growing polypeptide chain is transferred to the tRNA in the A site. Translocation occurs, moving the tRNA to the P/E site, now without an amino acid; the tRNA that was in the A site, now charged with the polypeptide chain, is moved to
2726-413: The majority of proteins fell into just 1000 of these. Counter to this assertion, the Pfam database currently contains 16,306 entries corresponding to unique protein domains and families. However, many of these families contain structural and functional similarities indicating a shared evolutionary origin (see Clans). A major point of difference between Pfam and other databases at the time of its inception
2784-421: The most basic "elementary" processes, consists of: The assembly of amino acids into proteins during translation has long been the subject of various physical models, starting from the first detailed kinetic models and continuing with others that take into account stochastic aspects of translation and use computer simulations. Many chemical kinetics-based models of protein synthesis have been developed and analyzed in
2842-503: The next stop codon. Even though the ribosomes are usually considered accurate and processive machines, the translation process is subject to errors that can lead either to the synthesis of erroneous proteins or to the premature abandonment of translation, either because a tRNA couples to a wrong codon or because a tRNA is coupled to the wrong amino acid. The rate of error in synthesizing proteins has been estimated to be between 1 in 10⁵ and 1 in 10³ misincorporated amino acids, depending on
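A per-residue error rate translates into a per-protein figure through simple probability. The Python sketch below is a rough illustration under stated assumptions: misincorporation events are treated as independent with the same probability at every position, and both the example rate of 1 in 10,000 and the 300-residue length are hypothetical values chosen for the arithmetic, not measurements.

```python
def error_free_probability(per_residue_error, n_residues):
    """Probability that all n residues are correct, assuming independent errors."""
    return (1.0 - per_residue_error) ** n_residues


def expected_errors(per_residue_error, n_residues):
    """Expected number of misincorporated residues in one protein copy."""
    return per_residue_error * n_residues


rate = 1e-4    # hypothetical per-residue misincorporation probability
length = 300   # hypothetical protein length in residues
print(round(error_free_probability(rate, length), 3))  # 0.97
print(round(expected_errors(rate, length), 3))         # 0.03
```

Under these assumptions the error-free fraction falls roughly exponentially with protein length, so long proteins are affected more strongly than short ones.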
2900-407: The ribosome. The repertoire of tRNA genes varies widely between species, with some bacteria having between 20 and 30 genes while complex eukaryotes could have thousands. tRNAs have a site for amino acid attachment, and a site called an anticodon. The anticodon is an RNA triplet complementary to the mRNA triplet that codes for their cargo amino acid. Aminoacyl tRNA synthetases (enzymes) catalyze
2958-465: The same clan. This threshold is used to assess whether a match to a family HMM should be included in the protein family. Upon each update of Pfam, gathering thresholds are reassessed to prevent overlaps between new and existing families. Domains of unknown function (DUFs) represent a growing fraction of the Pfam database. The families are so named because they have been found to be conserved across species, but perform an unknown role. Each newly added DUF
3016-463: The small subunit of the ribosome binding to the 5' end of mRNA with the help of initiation factors (IF). In bacteria and a minority of archaea, initiation of protein synthesis involves the recognition of a purine-rich initiation sequence on the mRNA called the Shine–Dalgarno sequence. The Shine–Dalgarno sequence binds to a complementary pyrimidine-rich sequence on the 3' end of the 16S rRNA part of
3074-558: The speed at which the database could be updated came in version 24.0, with the introduction of HMMER3, which is ~100 times faster than HMMER2 and more sensitive. Because the entries in Pfam-A do not cover all known proteins, an automatically generated supplement was provided called Pfam-B. Pfam-B contained a large number of small families derived from clusters produced by an algorithm called ADDA. Although of lower quality, Pfam-B families could be useful when no Pfam-A families were found. Pfam-B
3132-408: The translation of mRNA into a protein to provide a location for translation to be carried out and a polypeptide to be produced. The choice of amino acid type to add is determined by a messenger RNA (mRNA) molecule. Each amino acid added is matched to a three-nucleotide subsequence of the mRNA. For each such triplet possible, the corresponding amino acid is accepted. The successive amino acids added to
3190-418: The wrong amino acids can produce mischarged aminoacyl-tRNAs, which can result in inappropriate amino acids at the respective position in the protein. This "mistranslation" of the genetic code naturally occurs at low levels in most organisms, but certain cellular environments cause an increase in permissive mRNA decoding, sometimes to the benefit of the cell. The ribosome has two binding sites for tRNA. They are
3248-648: Was discontinued as of release 28.0, then reintroduced in release 33.1 using a new clustering algorithm, MMSeqs2. Pfam was originally hosted on three mirror sites around the world to preserve redundancy. However, between 2012 and 2014, the Pfam resource was moved to EMBL-EBI, which allowed for hosting of the website from one domain (xfam.org), using duplicate independent data centres. This allowed for better centralisation of updates, and grouping with other Xfam projects such as Rfam, TreeFam, iPfam and others, whilst retaining critical resilience provided by hosting from multiple centres. From circa 2014 to 2016, Pfam underwent
3306-487: Was founded in 1995 by Erik Sonnhammer, Sean Eddy and Richard Durbin as a collection of commonly occurring protein domains that could be used to annotate the protein coding genes of multicellular animals. One of its major aims at inception was to aid in the annotation of the C. elegans genome. The project was partly driven by the assertion in ‘One thousand families for the molecular biologist’ by Cyrus Chothia that there were around 1500 different families of proteins and that
3364-520: Was the use of two alignment types for entries: a smaller, manually checked seed alignment, as well as a full alignment built by aligning sequences to a profile hidden Markov model built from the seed alignment. This smaller seed alignment was easier to update as new releases of sequence databases came out, and thus represented a promising solution to the dilemma of how to keep the database up to date as genome sequencing became more efficient and more data needed to be processed over time. A further improvement to