Misplaced Pages

European Nucleotide Archive

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license.

A nucleic acid sequence is a succession of bases within the nucleotides forming alleles within a DNA (using GACT) or RNA (GACU) molecule. This succession is denoted by a series of letters, drawn from a set of five, that indicate the order of the nucleotides. By convention, sequences are usually presented from the 5' end to the 3' end. For DNA, with its double helix, there are two possible directions for the notated sequence; of these two, the sense strand is used. Because nucleic acids are normally linear (unbranched) polymers, specifying the sequence is equivalent to defining the covalent structure of the entire molecule. For this reason, the nucleic acid sequence is also termed the primary structure.


56-691: The European Nucleotide Archive (ENA) is a repository providing free and unrestricted access to annotated DNA and RNA sequences. It also stores complementary information such as experimental procedures, details of sequence assembly and other metadata related to sequencing projects. The archive is composed of three main databases: the Sequence Read Archive, the Trace Archive and the EMBL Nucleotide Sequence Database (also known as EMBL-Bank). The ENA

112-462: A phosphate group and a sugar (ribose in the case of RNA, deoxyribose in DNA) make up the backbone of the nucleic acid strand, and attached to the sugar is one of a set of nucleobases. The nucleobases are important in base pairing of strands to form higher-level secondary and tertiary structures such as the famed double helix. The possible letters are A, C, G, and T, representing

168-429: A DNA sequence may be useful in practically any biological research. For example, in medicine it can be used to identify, diagnose and potentially develop treatments for genetic diseases. Similarly, research into pathogens may lead to treatments for contagious diseases. Biotechnology is a burgeoning discipline, with the potential for many useful products and services. RNA is not sequenced directly. Instead, it

224-423: A genome, and what those genes do. There may also be related projects to sequence ESTs or mRNAs to help find out where the genes actually are. Historically, when sequencing eukaryotic genomes (such as the worm Caenorhabditis elegans) it was common to first map the genome to provide a series of landmarks across the genome. Rather than sequence a chromosome in one go, it would be sequenced piece by piece (with

280-514: A new genome sequence has steadily fallen (in terms of cost per base pair) and newer technology has also meant that genomes can be sequenced far more quickly. When research agencies decide what new genomes to sequence, the emphasis has been on species which are of high importance as model organisms, have relevance to human health (e.g. pathogenic bacteria or vectors of disease such as mosquitoes), or have commercial importance (e.g. livestock and crop plants). Secondary emphasis

336-478: A rough measure of how conserved a particular region or sequence motif is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose side chains have similar biochemical properties) in a particular region of the sequence, suggest that this region has structural or functional importance. Although DNA and RNA nucleotide bases are more similar to each other than are amino acids,

392-414: A sequence of amino acids making up a protein strand. Each group of three bases, called a codon, corresponds to a single amino acid, and there is a specific genetic code by which each possible combination of three bases corresponds to a specific amino acid. The central dogma of molecular biology outlines the mechanism by which proteins are constructed using information contained in nucleic acids. DNA
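The codon-to-amino-acid mapping described above can be illustrated with a small sketch. The table below is deliberately partial (six entries from the standard genetic code, chosen for illustration), and the function name `translate` and the placeholder `X` for codons missing from the table are assumptions of this example, not part of the article.

```python
# Partial codon table from the standard genetic code (illustrative subset only).
CODON_TABLE = {
    "ATG": "M",  # methionine (also the usual start codon)
    "TTT": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "AAA": "K",  # lysine
    "TGG": "W",  # tryptophan
    "TAA": "*",  # stop
}

def translate(dna: str) -> str:
    """Translate a coding-strand DNA sequence codon by codon.

    Translation stops at the first stop codon. Codons absent from the
    partial table are rendered as 'X' (an assumption of this sketch).
    """
    protein = []
    # Walk the sequence in steps of three; a trailing partial codon is ignored.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3].upper(), "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATGTTTGGCAAATAA"))  # MFGK
```

A complete implementation would carry all 64 codons and would usually work from the mRNA (with U in place of T).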

448-462: A significant storage challenge. As of 2012, the ENA's storage requirements continue to grow exponentially, with a doubling time of approximately 10 months. To manage this increase, the ENA selectively discards less-valuable sequencing platform data and implements advanced compression strategies. The CRAM reference-based compression toolkit was developed to help reduce ENA storage requirements. Currently
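To give a feel for what a 10-month doubling time implies, the following sketch computes the growth factor after a given number of months under idealised exponential growth. The function name and the assumption that the doubling time stays constant are illustrative only.

```python
def growth_factor(months: float, doubling_time_months: float = 10.0) -> float:
    """Factor by which storage grows after `months`, assuming a constant doubling time."""
    return 2.0 ** (months / doubling_time_months)

# After 10 months the requirement has doubled; after 30 months it is 8 times larger.
print(growth_factor(10))  # 2.0
print(growth_factor(30))  # 8.0
```

Under the same assumption, one year of growth corresponds to a factor of 2^(12/10), a little under 2.3.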

504-509: A single database entry. Following the uptake of Sanger sequencing, the Wellcome Trust Sanger Institute (then known as The Sanger Centre) had begun cataloguing sequence reads along with quality information in a database called The Trace Archive. The Trace Archive grew substantially with the commercialisation of high-throughput parallel sequencing technologies by companies such as Roche and Illumina. In 2008,

560-461: A variety of data derived from different sources including, but not limited to: The EMBL Nucleotide Sequence Database uses a flat file plaintext format to represent and store data which is typically referred to as EMBL-Bank format. EMBL-Bank format uses a different syntax to the records in DDBJ and GenBank, though each format uses certain standardised nomenclature, such as taxonomies as defined by

616-430: Is transcribed into mRNA molecules, which travel to the ribosome where the mRNA is used as a template for the construction of the protein strand. Since nucleic acids can bind to molecules with complementary sequences, there is a distinction between "sense" sequences which code for proteins, and the complementary "antisense" sequence, which is by itself nonfunctional, but can bind to the sense strand. DNA sequencing


672-443: Is a numerical sequence providing a quantitative measure of the local complexity of a DNA sequence, independently of the direction of processing. Manipulating these information profiles enables the analysis of sequences using alignment-free techniques, for example in motif and rearrangement detection. Genome project#Genome assembly Genome projects are scientific endeavours that ultimately aim to determine
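The article does not specify how the information profile is computed. As a rough illustration of a local complexity measure, the sketch below uses a sliding-window Shannon entropy of base composition, which is a simple stand-in and not necessarily the measure the article refers to. The function names and the default window size are assumptions of this example.

```python
import math
from collections import Counter

def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the base composition of `window`."""
    total = len(window)
    counts = Counter(window)
    # Sum p * log2(1/p) over the bases present in the window.
    return sum((n / total) * math.log2(total / n) for n in counts.values())

def entropy_profile(seq: str, window: int = 4) -> list[float]:
    """Entropy of every length-`window` slice of `seq`, left to right."""
    return [shannon_entropy(seq[i:i + window])
            for i in range(len(seq) - window + 1)]

profile = entropy_profile("AAAAACGT")
print(profile[0], profile[-1])  # 0.0 2.0
```

Because the measure depends only on the composition of each window, reading the sequence in the opposite direction yields the same values in reverse order.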

728-415: Is believed to contain around 20,000–25,000 genes. In addition to studying chromosomes to the level of individual genes, genetic testing in a broader sense includes biochemical tests for the possible presence of genetic diseases, or mutant forms of genes associated with increased risk of developing genetic disorders. Genetic testing identifies changes in chromosomes, genes, or proteins. Usually, testing

784-443: Is copied to a DNA by reverse transcriptase, and this DNA is then sequenced. Current sequencing methods rely on the discriminatory ability of DNA polymerases, and therefore can only distinguish four bases. An inosine (created from adenosine during RNA editing) is read as a G, and 5-methyl-cytosine (created from cytosine by DNA methylation) is read as a C. With current technology, it is difficult to sequence small amounts of DNA, as

840-865: Is produced and maintained by the European Bioinformatics Institute and is a member of the International Nucleotide Sequence Database Collaboration (INSDC) along with the DNA Data Bank of Japan and GenBank. The ENA has grown out of the EMBL Data Library which was released in 1982 as the first internationally supported resource for nucleotide sequence data. As of early 2012, the ENA and other INSDC member databases each contained complete genomes of 5,682 organisms and sequence data for almost 700,000. Moreover,

896-419: Is produced by combining the information from sequenced contigs and then employing linking information to create scaffolds. Scaffolds are positioned along the physical map of the chromosomes, creating a "golden path". Originally, most large-scale DNA sequencing centers developed their own software for assembling the sequences that they produced. However, this has changed as the software has grown more complex and as

952-437: Is qualitatively related to the sequences' evolutionary distance from one another. Roughly speaking, high sequence identity suggests that the sequences in question have a comparatively young most recent common ancestor, while low identity suggests that the divergence is more ancient. This approximation, which reflects the "molecular clock" hypothesis that a roughly constant rate of evolutionary change can be used to extrapolate

1008-577: Is the fastest-growing repository in the ENA. In 2010 the Sequence Read Archive made up approximately 95% of the base pair data available through the ENA, encompassing over 500,000,000,000 sequence reads made up of over 60 trillion (6×10¹³) base pairs. Almost half of this data was deposited in relation to the 1000 Genomes Project, wherein the researchers published their sequence data to the SRA in real time. In total, as of September 2010, 65% of

1064-454: Is the process of determining the nucleotide sequence of a given DNA fragment. The sequence of the DNA of a living thing encodes the necessary information for that living thing to survive and reproduce. Therefore, determining the sequence is useful in fundamental research into why and how organisms live, as well as in applied subjects. Because of the importance of DNA to living things, knowledge of

1120-467: Is the process of attaching biological information to sequences, and particularly of identifying the locations of genes and determining what those genes do. When sequencing a genome, there are usually regions that are difficult to sequence (often regions with highly repetitive DNA). Thus, 'completed' genome sequences are rarely ever complete, and terms such as 'working draft' or 'essentially complete' have been used to more accurately describe

1176-543: Is the section of the ENA which contains high-level genome assembly details, as well as assembled sequences and their functional annotation. EMBL-Bank is contributed to by direct submission from genome consortia and smaller research groups as well as by the retrieval of sequence data associated with patent applications. As of release 114 (December 2012), the EMBL Nucleotide Sequence Database contains approximately 5×10¹¹ nucleotides with an uncompressed filesize of 1.6 terabytes. The EMBL Nucleotide Sequence Database supports


1232-548: Is used to find changes that are associated with inherited disorders. The results of a genetic test can confirm or rule out a suspected genetic condition or help determine a person's chance of developing or passing on a genetic disorder. Several hundred genetic tests are currently in use, and more are being developed. In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be due to functional, structural, or evolutionary relationships between

1288-558: The EB-eye search engine. Additionally, sequence similarity-based searches implemented using De Bruijn graphs offer another method of retrieving records from the ENA. The ENA is accessible via the EBI SOAP and REST APIs, which also offer access to other databases hosted at the EBI, such as Ensembl and InterPro. The European Nucleotide Archive handles large volumes of data which pose

1344-461: The NCBI Taxon database. Each line of an EMBL-format file begins with a two-letter code, such as AC to label the accession number and KW for a list of keywords relevant to the record; each record ends with //. The ENA operates an instance of the Sequence Read Archive (SRA), an archival repository of sequence reads and analyses which are intended for public release. Originally called
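The line-oriented layout described above (a two-letter code at the start of each line, `//` closing each record) lends itself to a very small reader. The sketch below is a minimal illustration under those two rules only. The sample text is made up for the example and is not a real ENA record, and the function name `parse_embl_records` is hypothetical.

```python
from collections import defaultdict

def parse_embl_records(text: str) -> list[dict[str, list[str]]]:
    """Group the lines of each record by their two-letter line code.

    A record is closed by a line starting with '//'. Lines whose first two
    characters are blank are skipped, and a trailing record without a
    closing '//' is dropped (both are simplifications of this sketch).
    """
    records = []
    current = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            if current:
                records.append(dict(current))
            current = defaultdict(list)
            continue
        code, value = line[:2], line[2:].strip()
        if code.strip():
            current[code].append(value)
    return records

# Made-up sample text, for illustration only.
SAMPLE = """\
ID   EXAMPLE1
AC   X00001;
KW   example; keyword.
//
ID   EXAMPLE2
AC   X00002;
//
"""

for record in parse_embl_records(SAMPLE):
    print(record["AC"])
```

For real data, an established parser (for example the EMBL reader in Biopython) is a safer choice than a hand-rolled one, since actual records contain feature tables and sequence blocks that this sketch ignores.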

1400-403: The information which directs the functions of an organism. Nucleic acids also have a secondary structure and tertiary structure. Primary structure is sometimes mistakenly referred to as "primary sequence". However, there is no parallel concept of secondary or tertiary sequence. Nucleic acids consist of a chain of linked units called nucleotides. Each nucleotide consists of three subunits:

1456-489: The EBI combined the Trace Archive, EMBL Nucleotide Sequence Database (now also known as EMBL-Bank) and a newly developed Sequence (or Short) Read Archive (SRA) to make up the ENA, aimed at providing a comprehensive nucleotide sequence archive. As a member of the International Nucleotide Sequence Database Collaboration, the ENA exchanges data submissions each day with both the DNA Data Bank of Japan and GenBank. The EMBL Nucleotide Sequence Database (also known as EMBL-Bank)

1512-536: The EMBL Data Library, Kneale and Kennard remarked that "it was clear some years ago that a large computerized database of sequences would be essential for research in Molecular Biology". Despite the primary distribution method at the time being via magnetic tape, by 1987, the EMBL Data Library was being used by an estimated 10,000 scientists internationally. The same year, the EMBL File Server

1568-416: The ENA can be accessed manually or programmatically via REST URL through the ENA browser. Initially limited to the Sequence Read Archive, the ENA browser now also provides access to the Trace Archive and EMBL-Bank, allowing file retrieval in a range of formats including XML, HTML, FASTA and FASTQ. Individual records can be accessed using their accession numbers, and other text queries are enabled through
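Programmatic access by accession number can be sketched as follows. The base URL and path layout used here are an assumption about the ENA browser's REST interface and should be checked against the current ENA documentation. The function names and the accession in the usage line are placeholders for illustration.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL and path layout; verify against the current ENA documentation.
ENA_BROWSER_API = "https://www.ebi.ac.uk/ena/browser/api"

def record_url(accession: str, fmt: str = "fasta") -> str:
    """Build the URL for one record in the requested format (e.g. 'fasta', 'embl', 'xml')."""
    return f"{ENA_BROWSER_API}/{quote(fmt)}/{quote(accession)}"

def fetch_record(accession: str, fmt: str = "fasta") -> str:
    """Download one record as text. Requires network access."""
    with urlopen(record_url(accession, fmt), timeout=30) as response:
        return response.read().decode("utf-8")

# Placeholder accession, used only to show the URL shape.
print(record_url("A00145"))
```

Keeping URL construction separate from the download makes the former easy to check without touching the network.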

1624-700: The ENA is funded jointly by the European Molecular Biology Laboratory, the European Commission and the Wellcome Trust. The emerging ELIXIR framework, coordinated by EBI director Janet Thornton, aims to secure a sustainable European funding infrastructure to support the continued availability of life science databases such as the ENA. Nucleic acid sequence The sequence represents genetic information. Biological deoxyribonucleic acid represents

1680-596: The Sequence Read Archive was human genomic sequence, with another 16% relating to human metagenome sequence reads. The preferred data format for files submitted to the SRA is the BAM format, which is capable of storing both aligned and unaligned reads. Internally the SRA relies on the NCBI SRA Toolkit, used at all three INSDC member databases, to provide flexible data compression, API access and conversion to other formats such as FASTQ. The data contained in

1736-607: The Short Read Archive, the name was changed in anticipation of future sequencing technologies being able to produce longer sequence reads. Currently, the archive accepts sequence reads generated by next-generation sequencing platforms such as the Illumina Genome Analyzer and ABI SOLiD as well as some corresponding analyses and alignments. The SRA operates under the guidance of the International Nucleotide Sequence Database Collaboration (INSDC) and


1792-427: The complete genome sequence of an organism (be it an animal, a plant, a fungus, a bacterium, an archaean, a protist or a virus) and to annotate protein-coding genes and other important genome-encoded features. The genome sequence of an organism includes the collective DNA sequences of each chromosome in the organism. For a bacterium containing a single chromosome, a genome project will aim to map

1848-408: The conservation of base pairs can indicate a similar functional or structural role. Computational phylogenetics makes extensive use of sequence alignments in the construction and interpretation of phylogenetic trees, which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species. The degree to which sequences in a query set differ

1904-399: The elapsed time since two genes first diverged (that is, the coalescence time), assumes that the effects of mutation and selection are constant across sequence lineages. Therefore, it does not account for possible differences among organisms or species in the rates of DNA repair or the possible functional conservation of specific regions in a sequence. (In the case of nucleotide sequences,

1960-402: The four nucleotide bases of a DNA strand – adenine, cytosine, guanine, thymine – covalently linked to a phosphodiester backbone. In the typical case, the sequences are printed abutting one another without gaps, as in the sequence AAAGTCTGAC, read left to right in the 5' to 3' direction. With regard to transcription, a sequence is on the coding strand if it has the same order as

2016-461: The goal of sequencing a genome is to obtain information about the complete set of genes in that particular genome sequence. The proportion of a genome that encodes for genes may be very small (particularly in eukaryotes such as humans, where coding DNA may only account for a few percent of the entire sequence). However, it is not always possible (or desirable) to only sequence the coding regions separately. Also, as scientists understand more about

2072-419: The many bases created through mutagen presence, both of them through deamination (replacement of the amine group with a carbonyl group). Hypoxanthine is produced from adenine, and xanthine is produced from guanine. Similarly, deamination of cytosine results in uracil. Given the two 10-nucleotide sequences, line them up and compare the differences between them. Calculate the percent difference by taking

2128-467: The molecular clock hypothesis in its most basic form also discounts the difference in acceptance rates between silent mutations that do not alter the meaning of a given codon and other mutations that result in a different amino acid being incorporated into the protein.) More statistically accurate methods allow the evolutionary rate on each branch of the phylogenetic tree to vary, thus producing better estimates of coalescence times for genes. Frequently

2184-415: The number of differences between the DNA bases divided by the total number of nucleotides. In this case there are three differences in the 10-nucleotide sequence. Thus there is a 30% difference. In biological systems, nucleic acids contain information which is used by a living cell to construct specific proteins. The sequence of nucleobases on a nucleic acid strand is translated by cell machinery into
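The percent-difference calculation described above (three mismatches in ten positions giving 30%) can be sketched as follows. The two example sequences are not preserved in this snapshot, so the pair used below is hypothetical and was chosen only so that it differs at exactly three positions.

```python
def percent_difference(a: str, b: str) -> float:
    """Percentage of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return 100.0 * differences / len(a)

# Hypothetical 10-nucleotide sequences differing at positions 3, 7 and 10.
print(percent_difference("AAAGTCTGAC", "AACGTCAGAT"))  # 30.0
```

This position-by-position comparison assumes the sequences are already aligned and contain no insertions or deletions.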

2240-418: The number of sequencing centers has increased. An example of such an assembler is the Short Oligonucleotide Analysis Package, developed by BGI for de novo assembly of human-sized genomes, alignment, SNP detection, resequencing, indel finding, and structural variation analysis. Since the 1980s, molecular biology and bioinformatics have created the need for DNA annotation. DNA annotation or genome annotation

2296-437: The original chromosomes from which the DNA originated. In a shotgun sequencing project, all the DNA from a source (usually a single organism, anything from a bacterium to a mammal) is first fractured into millions of small pieces. These pieces are then "read" by automated sequencing machines. A genome assembly algorithm works by taking all the pieces and aligning them to one another, and detecting all places where two of


2352-666: The primary structure encodes motifs that are of functional importance. Some examples of sequence motifs are: the C/D and H/ACA boxes of snoRNAs, the Sm binding site found in spliceosomal RNAs such as U1, U2, U4, U5, U6, U12 and U3, the Shine-Dalgarno sequence, the Kozak consensus sequence and the RNA polymerase III terminator. In bioinformatics, a sequence entropy, also known as sequence complexity or information profile,

2408-416: The prior knowledge of approximately where that piece is located on the larger chromosome). Changes in technology, and in particular improvements to the processing power of computers, mean that genomes can now be 'shotgun sequenced' in one go (though there are caveats to this approach when compared to the traditional approach). Improvements in DNA sequencing technology have meant that the cost of sequencing

2464-405: The role of this noncoding DNA (often referred to as junk DNA), it will become more important to have a complete genome sequence as a background to understanding the genetics and biology of any given organism. In many ways genome projects do not confine themselves to only determining a DNA sequence of an organism. Such projects may also include gene prediction to find out where the genes are in

2520-493: The sense strand. While A, T, C, and G represent a particular nucleotide at a position, there are also letters that represent ambiguity which are used when more than one kind of nucleotide could occur at that position. The rules of the International Union of Pure and Applied Chemistry (IUPAC) are as follows: For example, W means that either an adenine or a thymine could occur in that position without impairing

2576-415: The sequence of that chromosome. For the human species, whose genome includes 22 pairs of autosomes and 2 sex chromosomes, a complete genome sequence will involve 46 separate chromosome sequences. The Human Genome Project is a well known example of a genome project. Genome assembly refers to the process of taking a large number of short DNA sequences and reassembling them to create a representation of

2632-567: The sequence's functionality. These symbols are also valid for RNA, except with U (uracil) replacing T (thymine). Apart from adenine (A), cytosine (C), guanine (G), thymine (T) and uracil (U), DNA and RNA also contain bases that have been modified after the nucleic acid chain has been formed. In DNA, the most common modified base is 5-methylcytidine (m5C). In RNA, there are many modified bases, including pseudouridine (Ψ), dihydrouridine (D), inosine (I), ribothymidine (rT) and 7-methylguanosine (m7G). Hypoxanthine and xanthine are two of
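The ambiguity letters mentioned above can be represented as a lookup from each symbol to the set of bases it permits. The mapping below follows the standard IUPAC nucleotide codes for DNA. The helper name `matches` is an assumption of this sketch.

```python
# Standard IUPAC nucleotide codes for DNA: symbol -> bases allowed at that position.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def matches(pattern: str, seq: str) -> bool:
    """True if every base of `seq` is permitted by the IUPAC symbol at the same position."""
    if len(pattern) != len(seq):
        return False
    return all(base in IUPAC[symbol] for symbol, base in zip(pattern, seq))

print(matches("AWG", "ATG"))  # True: W allows A or T
print(matches("AWG", "ACG"))  # False: W does not allow C
```

For RNA, the same table applies with U substituted for T.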

2688-421: The sequences. If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as insertion or deletion mutations (indels) introduced in one or both lineages in the time since they diverged from one another. In sequence alignments of proteins, the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as

2744-459: The short sequences, or reads, overlap. These overlapping reads can be merged, and the process continues. Genome assembly is a very difficult computational problem, made more difficult because many genomes contain large numbers of identical sequences, known as repeats. These repeats can be thousands of nucleotides long, and occur in different locations, especially in the large genomes of plants and animals. The resulting (draft) genome sequence
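The overlap-and-merge step described above can be illustrated for a single pair of reads. This is a toy sketch using exact suffix/prefix matching. Real assemblers must also handle sequencing errors, reverse-complemented reads and repeats, and the function names and the `min_overlap` default are assumptions of this example.

```python
from typing import Optional

def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0

def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Merge two reads on their longest exact overlap, or return None if there is none."""
    k = overlap_length(left, right, min_overlap)
    if k == 0:
        return None
    return left + right[k:]

# The reads share the four bases TCTG, so they merge into one longer sequence.
print(merge_reads("AAAGTCTG", "TCTGAC"))  # AAAGTCTGAC
```

Repeats are what make this hard at scale: when the same stretch occurs in several places, more than one merge is consistent with the reads.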

2800-485: The signal is too weak to measure. This is overcome by polymerase chain reaction (PCR) amplification. Once a nucleic acid sequence has been obtained from an organism, it is stored in silico in digital format. Digital genetic sequences may be stored in sequence databases, be analyzed (see Sequence analysis below), be digitally altered and be used as templates for creating new actual DNA using artificial gene synthesis. Digital genetic sequences may be analyzed using

2856-418: The status of such genome projects. Even when every base pair of a genome sequence has been determined, there are still likely to be errors present because DNA sequencing is not a completely accurate process. It could also be argued that a complete genome project should include the sequences of mitochondria and (for plants) chloroplasts as these organelles have their own genomes. It is often reported that


2912-424: The tools of bioinformatics to attempt to determine its function. The DNA in an organism's genome can be analyzed to diagnose vulnerabilities to inherited diseases, and can also be used to determine a child's paternity (genetic father) or a person's ancestry. Normally, every person carries two variations of every gene, one inherited from their mother, the other inherited from their father. The human genome

2968-424: The transcribed RNA. One sequence can be complementary to another sequence, meaning that the base at each position is paired with its complement (i.e., A with T, C with G) and the order is reversed. For example, the complementary sequence to TTAC is GTAA. If one strand of the double-stranded DNA is considered the sense strand, then the other strand, considered the antisense strand, will have the complementary sequence to
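The TTAC to GTAA example above corresponds to what is usually called the reverse complement. A minimal sketch for unambiguous DNA bases follows. The function name is an assumption of this example, and ambiguity codes are not handled.

```python
# Translation table pairing each base with its complement (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the order."""
    return seq.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement("TTAC"))  # GTAA
```

Applying the function twice returns the original sequence, which is a convenient sanity check.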

3024-591: The volume of data is increasing exponentially with a doubling time of approximately 10 months. The European Nucleotide Archive originated from separate databases, the earliest of which was the EMBL Data Library, established in October 1980 at the European Molecular Biology Laboratory (EMBL), Heidelberg. The first release of this database was made in April 1982 and contained a total of 568 separate entries consisting of around 500,000 base pairs. In 1984, referring to

3080-402: Was introduced to serve database records over BITNET, EARN and the early Internet. In May 1988 the journal Nucleic Acids Research introduced a policy stating that "manuscripts submitted to [Nucleic Acids Research] and containing or discussing sequence data must be accompanied by evidence that the data have been deposited with the EMBL Data Library." During the 1990s the EMBL Data Library

3136-710: Was renamed the EMBL Nucleotide Sequence Database and was formally relocated to the European Bioinformatics Institute (EBI) from Heidelberg. In 2003, the Nucleotide Sequence Database was extended with the addition of the Sequence Version Archive (SVA), which maintains records of all current and previous entries in the database. A year later, in June 2004, limits on the maximum sequence length for each record (then 350 kilobases) were removed, allowing entire genome sequences to be stored as
