In computer programming , an S-expression (or symbolic expression , abbreviated as sexpr or sexp ) is an expression in a like-named notation for nested list ( tree -structured) data. S-expressions were invented for and popularized by the programming language Lisp , which uses them for source code as well as data.
65-484: In the usual parenthesized syntax of Lisp, an S-expression is classically defined as This definition reflects LISP's representation of a list as a series of "cells", each one an ordered pair . In plain lists, y points to the next cell (if any), thus forming a list . The recursive clause of the definition means that both this representation and the S-expression notation can represent any binary tree . However,
130-411: A BEGIN statement, and Perl function prototypes may alter the syntactic interpretation, and possibly even the syntactic validity of the remaining code. Colloquially this is referred to as "only Perl can parse Perl" (because code must be executed during parsing, and can modify the grammar), or more strongly "even Perl cannot parse Perl" (because it is undecidable). Similarly, Lisp macros introduced by
195-499: A computer language is the rules that define the combinations of symbols that are considered to be correctly structured statements or expressions in that language. This applies both to programming languages , where the document represents source code , and to markup languages , where the document represents data. The syntax of a language defines its surface form. Text-based computer languages are based on sequences of characters , while visual programming languages are based on
260-439: A concrete syntax tree; the parser writer must then manually write code describing how this is converted to an abstract syntax tree. Contextual analysis is also generally implemented manually. Despite the existence of these automatic tools, parsing is often implemented manually, for various reasons – perhaps the phrase structure is not context-free, or an alternative implementation improves performance or error-reporting, or allows
325-470: A symbol table which stores names and types for each scope. Tools have been written that automatically generate a lexer from a lexical specification written in regular expressions and a parser from the phrase grammar written in BNF: this allows one to use declarative programming , rather than need to have procedural or functional programming. A notable example is the lex - yacc pair. These automatically produce
390-402: A Type-2 grammar, i.e., they are context-free grammars , though the overall syntax is context-sensitive (due to variable declarations and nested scopes), hence Type-1. However, there are exceptions, and for some languages the phrase grammar is Type-0 (Turing-complete). In some languages like Perl and Lisp the specification (or implementation) of the language allows constructs that execute during
455-400: A combination of symbols is handled by semantics (either formal or hard-coded in a reference implementation ). Valid syntax must be established before semantics can make meaning out of it. Not all syntactically correct programs are semantically correct. Many syntactically correct programs are nonetheless ill-formed, per the language's rules; and may (depending on the language specification and
520-401: A finite list of other S-expressions. It describes three interchange formats for expressing this structure. One is the "advanced transport", which is very flexible in terms of formatting, and is syntactically similar to Lisp-style expressions, but they are not identical. The advanced transport, for example, allows octet-strings to be represented verbatim (the string's length followed by a colon and
585-601: A function during runtime". We can also remove clauses from the database with abolish/1 , or retract/1 . * The number after the clause's name is the number of arguments it can take. It is also called arity . We can also query the database to get the body of a clause: call is analogous to Lisp's eval function. The concept of treating code as data and the manipulation and evaluation thereof can be demonstrated very neatly in Rebol . (Rebol, unlike Lisp, does not require parentheses to separate expressions). The following
650-417: A function may be composed and manipulated as a data structure in the meta layer, and then evaluated . It can be much easier to understand how to manipulate the code since it can be more easily understood as simple data (since the format of the language itself is as a data format). A typical demonstration of homoiconicity is the meta-circular evaluator . All Von Neumann architecture systems, which includes
715-495: A look at the list of "Hello, World!" program examples: Homoiconic In computer programming , homoiconicity (from the Greek words homo- meaning "the same" and icon meaning "representation") is an informal property of some programming languages . A language is homoiconic if a program written in it can be manipulated as data using the language. The program's internal representation can thus be inferred just by reading
SECTION 10
#1732787805154780-624: A notion of unique identifiers and references to them. For simple use cases, S-expressions are simpler than XML, but for more advanced use cases, XML has a query language so called XPath , many tools and third party libraries to simplify the handling of XML data. Standards for some Lisp-derived programming languages include a specification for their S-expression syntax. These include Common Lisp (ANSI standard document ANSI INCITS 226-1994 (R2004)), Scheme (R5RS and R6RS ), and ISLISP . In May 1997, Ron Rivest submitted an Internet Draft to be considered for publication as an RFC . The draft defined
845-494: A quote, while an unquoted identifier atom can typically contain anything but quotes, whitespace characters, parentheses, brackets, braces, backslashes, and semicolons. In either case, a prohibited character can typically be included by escaping it with a preceding backslash. Unicode support varies. The recursive case of the s-expr definition is traditionally implemented using cons cells . S-expressions were originally intended only for data to be manipulated by M-expressions , but
910-458: A string of characters exactly as the user typed them at the keyboard. If the TRAC procedures themselves evolve new procedures, these new procedures should also be stated in the same script. The TRAC processor in its action interprets this script as its program. In other words, the TRAC translator program (the processor) effectively converts the computer into a new computer with a new program language --
975-483: A syntax based on Lisp S-expressions but intended for general-purpose data storage and exchange (similar to XML ) rather than specifically for programming. It was never approved as an RFC, but it has since been cited and used by other RFCs (e.g. RFC 2693) and several other publications. It was originally intended for use in SPKI . Rivest's format defines an S-expression as being either an octet-string (a series of bytes ) or
1040-400: Is syntactic sugar for a quoted S-expression , in this case (quote x) . S-expressions are often compared to XML : one key difference is that S-expressions have just one form of containment, the dotted pair, while XML tags can contain simple attributes, other tags, or CDATA , each using different syntax. Another is that S-expressions do not define a reference mechanism, where XML provides
1105-508: Is a Type-3 grammar, generally given as regular expressions . Phrases are in a context-free language (CFL), generally a deterministic context-free language (DCFL), specified in a phrase structure grammar , which is a Type-2 grammar, generally given as production rules in Backus–Naur form (BNF). Phrase grammars are often specified in much more constrained grammars than full context-free grammars , in order to make them easier to parse; while
1170-802: Is a two-element S-expression whose elements are also two-element S-expressions. The whitespace-separated notation used in Lisp (and this article) is typical. Line breaks (newline characters) usually qualify as separators. This is a simple context-free grammar for a tiny subset of English written as an S-expression (Gazdar/Melish, Natural Language Processing in Lisp), where S=sentence, NP=Noun Phrase, VP=Verb Phrase, V=Verb: Program code can be written in S-expressions, usually using prefix notation. Example in Common Lisp : S-expressions can be read in Lisp using
1235-452: Is an example of code in Rebol (Note that >> represents the interpreter prompt; spaces between some elements have been added for readability): ( repeat is in fact a built-in function in Rebol and is not a language construct or keyword). By enclosing the code in square brackets, the interpreter does not evaluate it, but merely treats it as a block containing words: This block has
1300-482: Is more likely that the compiler will use a parsing rule that allows all expressions of the form "LiteralOrIdentifier + LiteralOrIdentifier" and then the error will be detected during contextual analysis (when type checking occurs). In some cases this validation is not done by the compiler, and these errors are only detected at runtime. In a dynamically typed language, where type can only be determined at runtime, many type errors can only be detected at runtime. For example,
1365-470: Is sometimes possible, but in many real-world languages an earlier step depends on a later step – for example, the lexer hack in C is because tokenization depends on context. Even in these cases, syntactical analysis is often seen as approximating this ideal model. The levels generally correspond to levels in the Chomsky hierarchy . Words are in a regular language , specified in the lexical grammar , which
SECTION 20
#17327878051541430-416: Is the "basic transport representation", which is either the canonical form or the same encoded as Base64 and surrounded by braces , the latter intended to safely transport a canonically encoded S-expression in a system which might change spacing (e.g. an email system which has 80-character-wide lines and wraps anything longer than that). This format has not been widely adapted for use outside of SPKI (some of
1495-570: Is the special end-of-list object (alternatively written () , which is the only representation in Scheme ). In the Lisp family of programming languages, S-expressions are used to represent both source code and data. Other uses of S-expressions are in Lisp-derived languages such as DSSSL , and as mark-up in communication protocols like IMAP and John McCarthy 's CBCL . It's also used as text representation of WebAssembly . The details of
1560-407: Is usually defined using a combination of regular expressions (for lexical structure) and Backus–Naur form (a metalanguage for grammatical structure) to inductively specify syntactic categories ( nonterminal ) and terminal symbols. Syntactic categories are defined by rules called productions , which specify the values that belong to a particular syntactic category. Terminal symbols are
1625-559: Is usually the case when compiling strongly-typed languages), though it is common to classify these kinds of error as semantic errors instead. As an example, the Python code contains a type error because it adds a string literal to an integer literal. Type errors of this kind can be detected at compile-time: They can be detected during parsing (phrase analysis) if the compiler uses separate rules that allow "integerLiteral + integerLiteral" but not "stringLiteral + integerLiteral", though it
1690-421: The defmacro syntax also execute during parsing, meaning that a Lisp compiler must have an entire Lisp run-time system present. In contrast, C macros are merely string replacements, and do not require code execution. The syntax of a language describes the form of a valid program, but does not provide any information about the meaning of the program or the results of executing that program. The meaning given to
1755-508: The LR parser can parse any DCFL in linear time, the simple LALR parser and even simpler LL parser are more efficient, but can only parse grammars whose production rules are constrained. In principle, contextual structure can be described by a context-sensitive grammar , and automatically analyzed by means such as attribute grammars , though, in general, this step is done manually, via name resolution rules and type checking , and implemented via
1820-449: The abstract syntax tree (AST), which simplifies this into a usable form. The AST and contextual analysis steps can be considered a form of semantic analysis, as they are adding meaning and interpretation to the syntax, or alternatively as informal, manual implementations of syntactical rules that would be difficult or awkward to describe or implement formally. Thirdly, the contextual analysis resolves names and checks types. This modularity
1885-423: The backend (and middle end, if this phase is distinguished). Computer language syntax is generally distinguished into three levels: Distinguishing in this way yields modularity, allowing each level to be described and processed separately and often independently. First, a lexer turns the linear sequence of characters into a linear sequence of tokens; this is known as " lexical analysis " or "lexing". Second,
1950-670: The reader or printer to detect and thus trigger evaluation or display of cycles without infinitely recursing The definition of an atom varies per context; in the original definition by John McCarthy , it was assumed that there existed "an infinite set of distinguishable atomic symbols " represented as "strings of capital Latin letters and digits with single embedded blanks" (a subset of character string and numeric literals ). Most modern sexpr notations allow more general quoted strings (for example including punctuation or full Unicode ), and use an abbreviated notation to represent lists with more than 2 members, so that stands for NIL
2015-452: The "Purest" forms of homoiconicity, as these languages use the same representation for both data and code. Other languages provide data structures for easily and efficiently manipulating code. Notable examples of this weaker form of homoiconicity include Julia , Nim , and Elixir . Languages often considered to be homoiconic include: Lisp uses S-expressions as an external representation for data and code. S-expressions can be read with
S-expression - Misplaced Pages Continue
2080-523: The Python code is syntactically valid at the phrase level, but the correctness of the types of a and b can only be determined at runtime, as variables do not have types in Python, only values do. Whereas there is disagreement about whether a type error detected by the compiler should be called a syntax error (rather than a static semantic error), type errors which can only be detected at program execution time are always regarded as semantic rather than syntax errors. The syntax of textual programming languages
2145-446: The TRAC language. At any time, it should be possible to display program or procedural information in the same form as the TRAC processor will act upon it during its execution. It is desirable that the internal character code representation be identical to, or very similar to, the external code representation. In the present TRAC implementation, the internal character representation is based upon ASCII . Because TRAC procedures and text have
2210-402: The code, and is contrasted with semantics – the meaning . In processing computer languages, semantic processing generally comes after syntactic processing; however, in some cases, semantic processing is necessary for complete syntactic analysis, and these are done together or concurrently . In a compiler , the syntactic analysis comprises the frontend , while the semantic analysis comprises
2275-423: The concrete characters or strings of characters (for example keywords such as define , if , let , or void ) from which syntactically valid programs are constructed. Syntax can be divided into context-free syntax and context-sensitive syntax. Context-free syntax are rules directed by the metalanguage of the programming language. These would not be constrained by the context surrounding or referring that part of
2340-418: The difference in the surname from the note) and philosopher, logician and mathematician Charles Sanders Peirce . Pierce indeed used the term "icon" in his Semiotic Theory. According to Peirce, there are three kinds of sign in communication: the icon, the index and the symbol. The icon is the simplest representation: an icon physically resembles that which it denotes. Alan Kay used and possibly popularized
2405-627: The entire raw string), a quoted form allowing escape characters, hexadecimal , Base64 , or placed directly as a "token" if it meets certain conditions. (Rivest's tokens differ from Lisp tokens in that the former are just for convenience and aesthetics, and treated exactly like other strings, while the latter have specific syntactical meaning.) Rivest's draft defines a canonical representation "for digital signature purposes". It's intended to be compact, easier to parse, and unique for any abstract S-expression. It only allows verbatim strings, and prohibits whitespace as formatting outside strings. Finally there
2470-529: The first element of an S-expression is commonly an operator or function name and any remaining elements are treated as arguments. This is called "prefix notation" or " Polish notation ". As an example, the Boolean expression written 4 == (2 + 2) in C , is represented as (= 4 (+ 2 2)) in Lisp's s-expr-based prefix notation. As noted above, the precise definition of "atom" varies across LISP-like languages. A quoted string can typically contain anything but
2535-510: The first error – all it knows is that, after producing the token LEFT_PAREN, '(' the remainder of the program is invalid, since no word rule begins with '_'. The second error is detected at the parsing stage: The parser has identified the "list" production rule due to the '(' token (as the only match), and thus can give an error message; in general it may be ambiguous . Type errors and undeclared variable errors are sometimes considered to be syntax errors when they are detected at compile-time (which
2600-425: The first implementation of Lisp was an interpreter of S-expression encodings of M-expressions, and Lisp programmers soon became accustomed to using S-expressions for both code and data. This means that Lisp is homoiconic ; that is, the primary representation of programs is also a data structure in a primitive type of the language itself. Nested lists can be written as S-expressions: ((milk juice) (honey marmalade))
2665-423: The following: Here the decimal digits, upper- and lower-case characters, and parentheses are terminal symbols. The following are examples of well-formed token sequences in this grammar: ' 12345 ', ' () ', ' (A B C232 (1)) ' The grammar needed to specify a programming language can be classified by its position in the Chomsky hierarchy . The phrase grammar of most programming languages can be specified using
S-expression - Misplaced Pages Continue
2730-437: The function PPRINT (note: with two Ps, short for pretty -print). Lisp programs are valid S-expressions, but not all S-expressions are valid Lisp programs. (1.0 + 3.1) is a valid S-expression, but not a valid Lisp program, since Lisp uses prefix notation and a floating point number (here 1.0) is not valid as an operation (the first element of the expression). An S-expression preceded by a single quotation mark, as in 'x ,
2795-437: The function READ. READ reads the textual representation of an S-expression and returns Lisp data. The function PRINT can be used to output an S-expression. The output then can be read with the function READ, when all printed data objects have a readable representation. Lisp has readable representations for numbers, strings, symbols, lists and many other data types. Program code can be formatted as pretty printed S-expressions using
2860-404: The grammar to be changed more easily. Parsers are often written in functional languages, such as Haskell , or in scripting languages, such as Python or Perl , or in C or C++ . As an example, (add 1 1) is a syntactically valid Lisp program (assuming the 'add' function exists, else name resolution fails), adding 1 and 1. However, the following are invalid: The lexer is unable to identify
2925-403: The language itself. This makes metaprogramming easier than in a language without this property: reflection in the language (examining the program's entities at runtime ) depends on a single, homogeneous structure, and it does not have to handle several different structures that would appear in a complex syntax. Homoiconic languages typically include full support of syntactic macros , allowing
2990-418: The parser turns the linear sequence of tokens into a hierarchical syntax tree; this is known as " parsing " narrowly speaking. This ensures that the line of tokens conform to the formal grammars of the programming language. The parsing stage itself can be divided into two parts: the parse tree , or "concrete syntax tree", which is determined by the grammar, but is generally far too detailed for practical use, and
3055-450: The parsing phase. Furthermore, these languages have constructs that allow the programmer to alter the behavior of the parser. This combination effectively blurs the distinction between parsing and execution, and makes syntax analysis an undecidable problem in these languages, meaning that the parsing phase may not finish. For example, in Perl it is possible to execute code during parsing using
3120-519: The primitive Lisp function LIST and set the variable EXPRESSION to the result Change the COS term to SIN Evaluate the expression Print the expression to a string Read the expression from a string On line 4 we create a new clause. The operator :- separates the head and the body of a clause. With assert/1 * we add it to the existing clauses (add it to the "database"), so we can call it later. In other languages we would call it "creating
3185-529: The primitive Lisp function READ . READ returns Lisp data: lists, symbols , numbers, strings. The primitive Lisp function EVAL uses Lisp code represented as Lisp data, computes side-effects and returns a result. The result will be printed by the primitive function PRINT , which creates an external S-expression from Lisp data. Lisp data, a list using different data types: (sub)lists, symbols, strings and integer numbers. Lisp code. The example uses lists, symbols and numbers. Create above expression with
3250-406: The program itself. This property is often summarized by saying that the language treats code as data . The informality of the property arises from the fact that, strictly, this applies to almost all programming languages. No consensus exists on a precise definition of the property. In a homoiconic language, the primary representation of programs is also a data structure in a primitive type of
3315-468: The programmer to express transformations of programs in a concise way. A commonly cited example is Lisp , which was created to allow for easy list manipulations and where the structure is given by S-expressions that take the form of nested lists, and can be manipulated by other Lisp code. Other examples are the programming languages Clojure (a contemporary dialect of Lisp), Rebol (also its successor Red ), Refal , Prolog , and possibly Julia (see
SECTION 50
#17327878051543380-534: The representation can in principle allow circular references , in which case the structure is not a tree at all, but a cyclic graph , and cannot be represented in classical S-expression notation unless a convention for cross-reference is provided (analogous to SQL foreign keys , SGML / XML IDREFs, etc.). Modern Lisp dialects such as Common Lisp and Scheme provide such syntax via datum labels , with which objects can be marked, which can then recur elsewhere, indicating shared rather than duplicated structure, enabling
3445-464: The same representation inside and outside the processor, the term homoiconic is applicable, from homo meaning the same, and icon meaning representation. The last sentence above is annotated with footnote 4, which gives credit for the origin of the term: Following suggestion of McCullough W. S., based upon terminology due to Peirce, C. S. The researchers implicated in this quote might be neurophysiologist and cybernetician Warren Sturgis McCulloch (note
3510-515: The section “Implementation methods” for more details). The term first appeared in connection with the TRAC programming language , developed by Calvin Mooers : One of the main design goals was that the input script of TRAC (what is typed in by the user) should be identical to the text which guides the internal action of the TRAC processor. In other words, TRAC procedures should be stored in memory as
3575-650: The sentence may be false: The following C language fragment is syntactically correct, but performs an operation that is not semantically defined (because p is a null pointer , the operations p -> real and p -> im have no meaning): As a simpler example, is syntactically valid, but not semantically defined, as it uses an uninitialized variable . Even though compilers for some programming languages (e.g., Java and C#) would detect uninitialized variable errors of this kind, they should be regarded as semantic errors rather than syntax errors. To quickly compare syntax of various programming languages, take
3640-403: The soundness of the implementation) result in an error on translation or execution. In some cases, such programs may exhibit undefined behavior . Even when a program is well-defined within a language, it may still have a meaning that is not intended by the person who wrote it. Using natural language as an example, it may not be possible to assign a meaning to a grammatically correct sentence or
3705-400: The spatial layout and connections between symbols (which may be textual or graphical). Documents that are syntactically invalid are said to have a syntax error . When designing the syntax of a language, a designer might start by writing down examples of both legal and illegal strings , before trying to figure out the general rules from these examples. Syntax therefore refers to the form of
3770-502: The syntax and supported data types vary in the different languages, but the most common feature among these languages is the use of S-expressions and prefix notation. There are many variants of the S-expression format, supporting a variety of different syntaxes for different datatypes. The most widely supported are: The character # is often used to prefix extensions to the syntax, e.g. #x10 for hexadecimal integers, or #\C for characters. When representing source code in Lisp,
3835-515: The syntax, whereas context-sensitive syntax would. A language can have different equivalent grammars, such as equivalent regular expressions (at the lexical levels), or different phrase rules which generate the same language. Using a broader category of grammars, such as LR grammars, can allow shorter or simpler grammars compared with more restricted categories, such as LL grammar, which may require longer grammars with more rules. Different but equivalent phrase grammars yield different parse trees, though
3900-472: The term "homoiconic" through his use of the term in his 1969 PhD thesis: A notable group of exceptions to all the previous systems are Interactive LISP [...] and TRAC. Both are functionally oriented (one list, the other string), both talk to the user with one language, and both are "homoiconic" in that their internal and external representations are essentially the same. They both have the ability to dynamically create new functions which may then be elaborated at
3965-402: The type block! and can furthermore be assigned as the value of a word by using what appears to be a syntax for assignment, but is actually understood by the interpreter as a special type ( set-word! ) and takes the form of a word followed by a colon: The block can still be interpreted by using the do function provided in Rebol (similar to eval in Lisp ). It is possible to interrogate
SECTION 60
#17327878051544030-401: The underlying language (set of valid documents) is the same. Below is a simple grammar, defined using the notation of regular expressions and Extended Backus–Naur form . It describes the syntax of S-expressions , a data syntax of the programming language Lisp , which defines productions for the syntactic categories expression , atom , number , symbol , and list : This grammar specifies
4095-450: The users being GnuPG , libgcrypt, Nettle , and GNU lsh). Rivest's S-expressions web page provides C source code for a parser and generator (available under the MIT license ), which could be adapted and embedded into other programs. In addition, there are no restrictions on independently implementing the format. Syntax (programming languages) In computer science , the syntax of
4160-517: The users's pleasure. Their only great drawback is that programs written in them look like King Burniburiach 's letter to the Sumerians done in Babylonian cuniform! [...] One advantage of homoiconicity is that extending the language with new concepts typically becomes simpler, as data representing code can be passed between the meta and base layer of the program. The abstract syntax tree of
4225-427: The vast majority of general purpose computers today, can implicitly be described as homoiconic due to the way that raw machine code executes in memory, the data type being bytes in memory. However, this feature can also be abstracted to the programming language level. Languages such as Lisp and its dialects, such as Scheme , Clojure , and Racket employ S-expressions to achieve homoiconicity, and are considered
#153846