Misplaced Pages

Text Encoding Initiative

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license.

A written language is the representation of a language by means of writing. This involves the use of visual symbols, known as graphemes, to represent linguistic units such as phonemes, syllables, morphemes, or words. However, written language is not merely spoken or signed language written down, though it can approximate that. Instead, it is a separate system with its own norms, structures, and stylistic conventions, and it often evolves differently than its corresponding spoken or signed language.

59-448: The Text Encoding Initiative (TEI) is a text-centric community of practice in the academic field of digital humanities, operating continuously since the 1980s. The community currently runs a mailing list, meetings and conference series, and maintains the TEI technical standard, a journal, a wiki, a GitHub repository and a toolchain. The TEI Guidelines collectively define

118-423: A collection of schema documents , which contain the source language definitions of these components. In popular usage, however, a schema document is often referred to as a schema. Schema documents are organized by namespace: all the named schema components belong to a target namespace, and the target namespace is a property of the schema document as a whole. A schema document may include other schema documents for

177-520: A complex type may be constrained by assertions— XPath 2.0 expressions evaluated against the content that must evaluate to true. After XML Schema-based validation, it is possible to express an XML document's structure and content in terms of the data model that was implicit during validation. The XML Schema data model includes: This collection of information is called the Post-Schema-Validation Infoset (PSVI). The PSVI gives

236-460: A customisation using a project-specific subset of the tags and attributes defined by the Guidelines. The TEI defines a sophisticated customization mechanism known as ODD for this purpose. In addition to documenting and describing each TEI tag, an ODD specification specifies its content model and other usage constraints, which may be expressed using schematron . TEI Lite is an example of such

295-406: A customization. It defines an XML-based file format for exchanging texts. It is a manageable selection from the extensive set of elements available in the full TEI Guidelines. As an XML-based format, TEI cannot directly deal with overlapping markup and non-hierarchical structures. A variety of options to represent this sort of data is suggested by the guidelines. The text of the TEI guidelines

354-575: A handful of different locations, namely Mesopotamia and Egypt ( c.  3200  – c.  3100 BCE ), China ( c.  1250 BCE ), and Mesoamerica ( c.  1 CE ). Scholars mark the difference between prehistory and history with the invention of the first written language. The first writing can be dated back to the Neolithic era, with clay tablets being used to keep track of livestock and commodities. The first example of written language can be dated to Uruk , at

413-809: A language community. Analogously, digraphia occurs when a language may be written in different scripts. For example, Serbian may be written using either the Cyrillic or Latin script, while Hindustani may be written in Devanagari or the Urdu alphabet. Writing systems can be broadly classified into several types based on the units of language they correspond with: namely logographic, syllabic, and alphabetic. They are distinct from phonetic transcriptions with technical applications, which are not used as writing as such. For example, notation systems for signed languages like SignWriting have been developed, but it

472-418: A profound impact on its social organization, cultural identity, and technological profile. Writing , speech , and signing are three distinct modalities of language ; each has unique characteristics and conventions. When discussing properties common to the modes of language, the individual speaking, signing, or writing will be referred to as the sender , and the individual listening, viewing, or reading as

531-606: A recommendation of the World Wide Web Consortium ( W3C ), specifies how to formally describe the elements in an Extensible Markup Language ( XML ) document. It can be used by programmers to verify each piece of item content in a document, to assure it adheres to the description of the element it is placed in. Like all XML schema languages , XSD can be used to express a set of rules to which an XML document must conform to be considered "valid" according to that schema. However, unlike most other schema languages, XSD

590-781: A result, the written form of a language may retain archaic features or spellings that no longer reflect contemporary speech. Over time, this divergence may contribute to a dynamic of diglossia. There are too many grammatical differences to address, but here is a sample. In terms of clause types, written language is predominantly declarative (e.g. It's red.) and typically contains fewer imperatives (e.g. Make it red.), interrogatives (e.g. Is it red?), and exclamatives (e.g. How red it is!) than spoken or signed language. Noun phrases are predominantly third person in general, and even more so in written language. Verb phrases in spoken English are more likely to be in simple aspect than in perfect or progressive aspect, and almost all of

649-429: A schema are: Other more specialized components include annotations, assertions, notations, and the schema component which contains information about the schema as a whole. Simple types (also called data types) constrain the textual values that may appear in an element or attribute. This is one of the more significant ways in which XML Schema differs from DTDs. For example, an attribute might be constrained to hold only

708-434: A single language community in different social contexts. The "high variety", often the written language, is used in formal contexts, such as literature, formal education, or official communications. This variety tends to be more standardized and conservative, and may incorporate older or more formal vocabulary and grammar. The "low variety", often the spoken language, is used in everyday conversation and informal contexts. It

767-451: A type of XML format, and are the defining output of the community of practice. The format differs from other well-known open formats for text (such as HTML and OpenDocument ) in that it is primarily semantic rather than presentational: the semantics and interpretation of every tag and attribute are specified. There are some 500 different textual components and concepts: word , sentence , character , glyph , person , etc. Each

826-468: A valid XML document its "type" and facilitates treating the document as an object, using object-oriented programming (OOP) paradigms. The primary reason for defining an XML schema is to formally describe an XML document; however the resulting schema has a number of other uses that go beyond simple validation. The schema can be used to generate code, referred to as XML Data Binding . This code allows contents of XML documents to be treated as objects within

885-465: A valid date or a decimal number. XSD provides a set of 19 primitive data types ( anyURI , base64Binary , boolean , date , dateTime , decimal , double , duration , float , hexBinary , gDay , gMonth , gMonthDay , gYear , gYearMonth , NOTATION , QName , string , and time ). It allows new data types to be constructed from these primitives by three mechanisms: Twenty-five derived types are defined within

944-620: Is a literate programming language for XML schemas . In literate-programming style, ODD documents combine human-readable documentation and machine-readable models using the Documentation Elements module of the Text Encoding Initiative. Tools generate localised and internationalised HTML , ePub , or PDF human-readable output and DTDs , W3C XML Schema , Relax NG Compact Syntax, or Relax NG XML Syntax machine-readable output. The Roma web application

1003-402: Is a key driver of social mobility . Firstly, it underpins success in formal education, where the ability to comprehend textbooks, write essays, and interact with written instructional materials is fundamental. High literacy skills can lead to better academic performance, opening doors to higher education and specialized training opportunities. In the job market, proficiency in written language

1062-493: Is a wider range of vocabulary used and individual words are less likely to be repeated. It also includes fewer first and second-person pronouns and fewer interjections. Written English has fewer verbs and more nouns than spoken English, but even accounting for that, verbs like think , say , know , and guess appear relatively less commonly with a content clause complement (e.g. I think that it's OK . ) in written English than in spoken English. Writing developed independently in

1121-589: Is built around the ODD format and can use it to generate schemas in DTD , W3C XML Schema , Relax NG Compact Syntax, or Relax NG XML Syntax formats, as used by many XML validation tools and services. ODD is the format used internally by the Text Encoding Initiative for the TEI technical standard . Although ODD files generally describe the difference between a customized XML format and the full TEI model, ODD also can be used to describe XML formats that are entirely separate from

1180-519: Is crucial for promoting social mobility and reducing inequality. The Canadian philosopher Marshall McLuhan (1911–1980) primarily presented his ideas about written language in The Gutenberg Galaxy (1962). Therein, McLuhan argued that the invention and spread of the printing press , and the shift from oral tradition to written culture that it spurred, fundamentally changed the nature of human society. This change, he suggested, led to

1239-404: Is grounded in one or more academic disciplines and examples are given. The standard is split into two parts: a discursive textual description with extended examples and discussion, and a set of tag-by-tag definitions. Schemata in most of the modern formats (DTD, RELAX NG and XML Schema (W3C)) are generated automatically from the tag-by-tag definitions. A number of tools support the production of

1298-574: Is not universally agreed that these constitute a written form of the sign language in themselves. Orthography comprises the rules and conventions for writing a given language, including how its graphemes are understood to correspond with speech. In some orthographies, there is a one-to-one correspondence between phonemes and graphemes, as in Serbian and Finnish . These are known as shallow orthographies . In contrast, orthographies like that of English and French are considered deep orthographies due to

1357-792: Is often a determinant of employment opportunities. Many professions require a high level of literacy, from drafting reports and proposals to interpreting technical manuals. The ability to effectively use written language can lead to higher paying jobs and upward career progression. Literacy enables additional ways for individuals to participate in civic life, from understanding news articles and political debates to navigating legal documents. However, disparities in literacy rates and proficiency with written language can contribute to social inequalities. Socio-economic status, race, gender, and geographic location can all influence an individual's access to quality literacy instruction. Addressing these disparities through inclusive and equitable education policies

1416-518: Is relatively much more common in written language than in spoken language. Another example is that a construction like it was difficult to follow him is relatively more common in written language than in spoken language, compared to the alternative packaging to follow him was difficult . A final example, again from English, is that the passive voice is relatively more common in writing than in speaking. Written language typically has higher lexical density than spoken or signed language, meaning there

1475-401: Is rich in examples. There is also a samples page on the TEI wiki, which gives examples of real-world projects that expose their underlying TEI. TEI allows texts to be marked up syntactically at any level of granularity, or mixture of granularities. For example, this paragraph (p) has been marked up into sentences (s) and clauses (cl). TEI has tags for marking up verse. This example (taken from

1534-441: Is successful in that it has been widely adopted and largely achieves what it set out to, it has been the subject of a great deal of severe criticism, perhaps more so than any other W3C Recommendation. Good summaries of the criticisms are provided by James Clark, Anders Møller and Michael Schwartzbach, Rick Jelliffe and David Webber. General problems: Practical limitations of expressibility: Technical problems: XSD 1.1 became

1593-507: Is the name used in this article. In its appendix of references, the XSD specification acknowledges the influence of DTDs and other early XML schema efforts such as DDML , SOX , XML-Data, and XDR . It has adopted features from each of these proposals but is also a compromise among them. Of those languages, XDR and SOX continued to be used and supported for a while after XML Schema was published. A number of Microsoft products supported XDR until

1652-504: Is typically more dynamic and innovative, and may incorporate regional dialects, slang, and other informal language features. Diglossic situations are common in many parts of the world, including the Arab world , where the high Modern Standard Arabic variety coexists with other, low varieties of Arabic local to specific regions. Diglossia can have significant implications for language education, literacy, and sociolinguistic dynamics within

1711-412: The receiver ; senders and receivers together will be collectively termed agents . The spoken, signed, and written modes of language mutually influence one another, with the boundaries between conventions for each being fluid—particularly in informal written contexts like taking quick notes or posting on social media. Spoken and signed language is typically more immediate, reflecting the local context of

1770-531: The French translation of the TEI Guidelines) shows a sonnet. The choice tag is used to represent sections of text that might be encoded or tagged in more than one possible way. In the following example, based on one in the standard, choice is used twice, once to indicate an original and a corrected number, and once to indicate an original and regularised spelling. One Document Does it all ("ODD")
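As a rough illustration of the choice mechanism described above, the following Python sketch parses a small TEI fragment and renders it twice: once with the original readings (sic, orig) and once with the editorial ones (corr, reg). The sample sentence and the helper name `render` are assumptions made for this example, not text from the Guidelines; only the element names and the TEI namespace URI come from the standard.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample text: one <choice> for a corrected number,
# one <choice> for a regularised spelling.
SAMPLE = (
    f'<p xmlns="{TEI_NS}">an allowance for '
    "<choice><sic>1724</sic><corr>1728</corr></choice>"
    " subjects, as a mark of our "
    "<choice><orig>favour</orig><reg>favor</reg></choice>.</p>"
)


def local_name(elem):
    """Return the tag name without its '{namespace}' prefix."""
    return elem.tag.split("}")[-1]


def render(elem, prefer):
    """Flatten elem to plain text.

    Inside each <choice>, only alternatives whose local name is in
    `prefer` are kept; all other elements are rendered in full.
    """
    parts = [elem.text or ""]
    for child in elem:
        if local_name(child) == "choice":
            for alt in child:
                if local_name(alt) in prefer:
                    parts.append(render(alt, prefer))
        else:
            parts.append(render(child, prefer))
        parts.append(child.tail or "")
    return "".join(parts)


paragraph = ET.fromstring(SAMPLE)
diplomatic = render(paragraph, {"sic", "orig"})   # text as found in the source
normalised = render(paragraph, {"corr", "reg"})   # text with editorial readings
print(diplomatic)
print(normalised)
```

Keeping both readings in the markup means a single encoded file can serve a diplomatic edition and a reading edition; which one is shown becomes a rendering decision rather than an encoding decision.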

1829-496: The TEI Guidelines are based on a TEI customization documented in a TEI ODD file. Even when users choose one of the off-the-shelf pre-generated schemas to validate against, these have been created from freely available customization files. The format is used by many projects worldwide. Practically all projects are associated with one or more universities. Some well-known projects that encode texts using TEI include: Prior to

1888-528: The TEI. One example of this is the W3C's Internationalization Tag Set which uses the ODD format to generate schemas and document its vocabulary. TEI customizations are specializations of the TEI XML specification for use in particular fields or by specific communities. Customization in the TEI is done through the ODD mechanism mentioned above. In truth since its P5 version, all so-called 'TEI Conformant' uses of

1947-463: The W3C. Because of confusion between XML Schema as a specific W3C specification, and the use of the same term to describe schema languages in general, some parts of the user community referred to this language as WXS , an initialism for W3C XML Schema, while others referred to it as XSD , an initialism for XML Schema Definition. In Version 1.1 the W3C has chosen to adopt XSD as the preferred name, and that

2006-501: The aid of tone of voice, facial expressions, or body language, which often results in more explicit and detailed descriptions. While a speaker can typically be identified by the quality of their voice, the author of a written text is often not obvious to a reader only analyzing the text itself. Writers may nevertheless indicate their identity via the graphical characteristics of their handwriting . Written languages generally change more slowly than their spoken or signed counterparts. As

2065-450: The client invoking validation to trust the document sufficiently to know that it is being validated against the correct schema. "xsi" is the conventional prefix for the namespace " http://www.w3.org/2001/XMLSchema-instance ".) XML Schema Documents usually have the filename extension ".xsd". A unique Internet Media Type is not yet registered for XSDs, so "application/xml" or "text/xml" should be used, as per RFC 3023. The main components of

2124-778: The complex relationships between sounds and symbols. For instance, in English, the phoneme / f / can be represented by the graphemes ⟨f⟩ as in ⟨fish⟩ , ⟨ph⟩ as in ⟨phone⟩ , or ⟨gh⟩ as in ⟨enough⟩ . Orthographies also include rules about punctuation, capitalization, word breaks, and emphasis. They may also include specific conventions for representing foreign words and names, and for handling spelling changes to reflect changes in pronunciation or meaning over time. XML Schema (W3C) 1.0, Part 2 Datatypes (Recommendation) , 1.1, Part 1 Structures (Recommendation) , XSD ( XML Schema Definition ),

2183-406: The conversation and the emotions of the agents, often via paralinguistic cues like body language. Utterances are typically less premeditated, and are more likely to feature informal vocabulary and shorter sentences. They are also primarily used in dialogue, and as such include elements that facilitate turn-taking; these include prosodic features such as trailing off and fillers that indicate

2242-410: The creation of TEI, humanities scholars had no common standards for encoding electronic texts in a manner that would serve their academic goals ( Hockey 1993, p. 41). In 1987, a group of scholars representing fields in humanities, linguistics, and computing convened at Vassar College to put forth a set of guidelines known as the “Poughkeepsie Principles”. These guidelines directed the development of

2301-400: The emergence of new written genres and conventions, such as interactions via social media . This has implications for social relationships, education, and professional communication. Literacy is the ability to read and write. From a graphemic perspective, this ability requires the capability of correctly recognizing or reproducing graphemes, the smallest units of written language. Literacy

2360-487: The end of the 4th millennium BCE. An ancient Mesopotamian poem tells a tale about the invention of writing: Because the messenger's mouth was heavy and he couldn't repeat, the Lord of Kulaba patted some clay and put words on it, like a tablet. Until then, there had been no putting words on clay. The origins of written language are tied to the development of human civilization. The earliest forms of writing were born out of

2419-429: The first TEI standard, "P1".

Written language

Written languages serve as crucial tools for communication, enabling the recording, preservation, and transmission of information, ideas, and culture across time and space. The orthography of a written language comprises the norms by which it is expected to function, including rules regarding spelling and typography. A society's use of written language generally has

2478-406: The guidelines and the application of the guidelines to specific projects. A number of special tags are used to circumvent restrictions imposed by the underlying Unicode: glyph to allow representation of characters that do not qualify for Unicode inclusion, and choice to overcome the strict linearity otherwise required. Most users of the format do not use the complete range of tags, but produce

2537-486: The modern age. Furthermore, he theorized about the effects of different media on human consciousness and society. He famously asserted that " the medium is the message ", meaning that the form of a medium embeds itself in any message it would transmit or convey, creating a symbiotic relationship by which the medium influences how the message is perceived. While McLuhan's ideas are influential, they have also been critiqued and debated. Some scholars argue that he overemphasized

2596-426: The necessity to record commerce, historical events, and cultural traditions. The first known true writing systems were developed during the early Bronze Age (late 4th millennium BCE) in ancient Sumer, present-day southern Iraq. This system, known as cuneiform, was pictographic at first, but later evolved into a largely syllabic script, a series of wedge-shaped signs used to represent the sounds of the language. At roughly

2655-401: The past perfect verbs appear in written fiction. Information packaging is the way that information is packaged within a sentence, that is the linear order in which information is presented. For example, On the hill, there was a tree has a different informational structure than There was a tree on the hill . While, in English, at least, the second structure is more common, the first example

2714-451: The permitted content of an element, including its element and text children and its attributes. A complex type definition consists of a set of attribute uses and a content model. Varieties of content model include: A complex type can be derived from another complex type by restriction (disallowing some elements, attributes, or values that the base type permits) or by extension (allowing additional attributes and elements to appear). In XSD 1.1,

2773-453: The preservation and transmission of culture, history, and knowledge across time and space, allowing societies to develop complex systems of law, administration, and education. For example, the invention of writing in ancient Mesopotamia enabled the creation of detailed legal codes, like the Code of Hammurabi . The advent of digital technology has revolutionized written communication, leading to

2832-469: The programming environment. The schema can be used to generate human-readable documentation of an XML file structure; this is especially useful where the authors have made use of the annotation elements. No formal standard exists for documentation generation, but a number of tools are available, such as the Xs3p stylesheet, that will produce high-quality readable HTML and printed material. Although XML Schema

2891-612: The release of MSXML 6.0 (which dropped XDR in favor of XML Schema) in December 2006. Commerce One , Inc. supported its SOX schema language until declaring bankruptcy in late 2004. The most obvious features offered in XSD that are not available in XML's native Document Type Definitions (DTDs) are namespace awareness and datatypes, that is, the ability to define element and attribute content as containing values such as integers and dates rather than arbitrary text. The XSD 1.0 specification

2950-471: The rise of individualism , nationalism , and other aspects of modernity. McLuhan proposed that written language, especially as reproduced in large quantities by the printing press, contributed to a linear and sequential mode of thinking, as opposed to the more holistic and contextual thinking fostered by oral cultures. He associated this linear mode of thought with a shift towards more detached and objective forms of reasoning, which he saw as characteristic of

3009-443: The role of the medium (in this case, written language) at the expense of the content of communication. It has also been suggested that his theories are overly deterministic, not sufficiently accounting for the ways in which people can use and interpret media in varied ways. Diglossia is a sociolinguistic phenomenon where two distinct varieties of a language – often one spoken and one written – are used by

3068-457: The same namespace, and may import schema documents for a different namespace. When an instance document is validated against a schema (a process known as assessment ), the schema to be used for validation can either be supplied as a parameter to the validation engine, or it can be referenced directly from the instance document using two special attributes, xsi:schemaLocation and xsi:noNamespaceSchemaLocation . (The latter mechanism requires
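To make the two hint attributes concrete, here is a minimal Python sketch that reads them from an instance document using only the standard library. The instance document, the `http://example.org/orders` namespace, the file name `orders.xsd` and the helper `schema_hints` are all hypothetical; the code only extracts the hints and does not perform any validation.

```python
import xml.etree.ElementTree as ET

# Namespace conventionally bound to the "xsi" prefix.
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical instance document carrying a schema location hint.
INSTANCE = """<order xmlns="http://example.org/orders"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://example.org/orders orders.xsd"/>"""


def schema_hints(xml_text):
    """Return (namespace -> location mapping, no-namespace location or None)."""
    root = ET.fromstring(xml_text)
    # xsi:schemaLocation is a whitespace-separated list of
    # (namespace URI, schema location) pairs.
    tokens = root.get(f"{{{XSI}}}schemaLocation", "").split()
    pairs = dict(zip(tokens[0::2], tokens[1::2]))
    no_namespace = root.get(f"{{{XSI}}}noNamespaceSchemaLocation")
    return pairs, no_namespace


locations, no_ns_location = schema_hints(INSTANCE)
print(locations)
print(no_ns_location)
```

A validator is free to ignore these hints and use a schema supplied by the caller instead, which is usually the safer choice when the document comes from an untrusted source.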

3127-512: The same time, the system of Egyptian hieroglyphs was developing in the Nile valley, also evolving from pictographic proto-writing to include phonemic elements. The Indus Valley civilization developed a form of writing known as the Indus script c.  2600 BCE , although its precise nature remains undeciphered. The Chinese script , one of the oldest continuously used writing systems in

3186-440: The sender has not yet finished their turn. Errors encountered in spoken and signed language include disfluencies and hesitation. By contrast, written language is typically more structured and formal. While speech and signing are transient, writing is permanent. It allows for planning, revision, and editing, which can lead to more complex sentences and a more extensive vocabulary. Written language also has to convey meaning without

3245-470: The specification itself, and further derived types can be defined by users in their own schemas. The mechanisms available for restricting data types include the ability to specify minimum and maximum values, regular expressions, constraints on the length of strings, and constraints on the number of digits in decimal values. XSD 1.1 again adds assertions, the ability to specify an arbitrary constraint by means of an XPath 2.0 expression. Complex types describe
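The restriction mechanisms listed above (minimum and maximum values, regular expressions, string length, digit counts) correspond to facet elements inside an xs:restriction. The sketch below embeds a small schema in Python and lists the facets of each user-defined simple type. The type names `Percentage` and `PartCode` and the helper `list_facets` are invented for illustration, and the code only inspects the schema document; it does not validate instances, which would need a schema-aware library outside the standard library.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema: two simple types derived by restriction.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="Percentage">
    <xs:restriction base="xs:decimal">
      <xs:minInclusive value="0"/>
      <xs:maxInclusive value="100"/>
      <xs:fractionDigits value="2"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="PartCode">
    <xs:restriction base="xs:string">
      <xs:pattern value="[A-Z]{2}-[0-9]{4}"/>
      <xs:maxLength value="7"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>"""


def list_facets(schema_text):
    """Map each named simple type to its base type and its facet list."""
    root = ET.fromstring(schema_text)
    result = {}
    for simple_type in root.findall(f"{{{XS}}}simpleType"):
        restriction = simple_type.find(f"{{{XS}}}restriction")
        result[simple_type.get("name")] = {
            "base": restriction.get("base"),
            # Each child of xs:restriction is one facet, e.g. minInclusive.
            "facets": [
                (facet.tag.split("}")[-1], facet.get("value"))
                for facet in restriction
            ],
        }
    return result


facets = list_facets(SCHEMA)
for name, info in facets.items():
    print(name, info["base"], info["facets"])
```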

3304-465: The voice of Socrates , expressed concerns in the dialogue " Phaedrus " that a reliance on writing would weaken one's ability to memorize and understand, as written words would "create forgetfulness in the learners' souls, because they will not use their memories". He further argued that written words, being unable to answer questions or clarify themselves, are inferior to the living, interactive discourse of oral communication. Written language facilitates

3363-410: The world, originated around the late 2nd millennium BCE, evolving from oracle bone script used for divination purposes. The development and use of written language has had profound impacts on human societies, influencing everything from social organization and cultural identity to technology and the dissemination of knowledge. Plato ( c.  427  – 348 BCE), through

3422-505: Was also designed with the intent that determination of a document's validity would produce a collection of information adhering to specific data types . Such a post-validation infoset can be useful in the development of XML document processing software. XML Schema , published as a W3C recommendation in May 2001, is one of several XML schema languages . It was the first separate schema language for XML to achieve Recommendation status by

3481-459: Was originally published in 2001, with a second edition following in 2004 to correct large numbers of errors. XSD 1.1 became a W3C Recommendation in April 2012 . Technically, a schema is an abstract collection of metadata, consisting of a set of schema components : chiefly element and attribute declarations and complex and simple type definitions. These components are usually created by processing
