OPTICS algorithm

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license. Give it a read and then ask your questions in the chat. We can research this topic together.

Ordering points to identify the clustering structure (OPTICS) is an algorithm for finding density-based clusters in spatial data. It was presented by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander. Its basic idea is similar to DBSCAN, but it addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful clusters in data of varying density. To do so, the points of the database are (linearly) ordered such that spatially closest points become neighbors in the ordering. Additionally, a special distance is stored for each point that represents the density that must be accepted for a cluster so that both points belong to the same cluster. This is represented as a dendrogram.

Like DBSCAN, OPTICS requires two parameters: ε, which describes the maximum distance (radius) to consider, and MinPts, describing the number of points required to form a cluster. A point p is a core point if at least MinPts points are found within its ε-neighborhood N_ε(p) (including point p itself). In contrast to DBSCAN, OPTICS also considers points that are part of
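The core-point test just described can be sketched in a few lines of Python. This is an illustrative fragment, not from any OPTICS implementation, using a naive linear scan for the ε-neighborhood query:

```python
import math

def region_query(points, p, eps):
    """Return indices of all points within eps of points[p] (including p itself)."""
    return [i for i, q in enumerate(points)
            if math.dist(points[p], q) <= eps]

def is_core_point(points, p, eps, min_pts):
    """p is a core point if its eps-neighborhood holds at least min_pts points."""
    return len(region_query(points, p, eps)) >= min_pts

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
assert is_core_point(pts, 0, eps=0.5, min_pts=3)       # three points nearby
assert not is_core_point(pts, 3, eps=0.5, min_pts=3)   # isolated point
```

A spatial index would replace the linear scan in practice; the linear version is O(n) per query.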

A family of sets, it is possible to construct an infinite graph such that every spanning tree of the graph corresponds to a choice function of the family of sets. Therefore, if every infinite connected graph has a spanning tree, then the axiom of choice is true. The idea of a spanning tree can be generalized to directed multigraphs. Given a vertex v on a directed multigraph G, an oriented spanning tree T rooted at v

A cluster is started. Otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized ε-environment of a different point and hence be made part of a cluster. If a point is found to be a dense part of a cluster, its ε-neighborhood is also part of that cluster. Hence, all points that are found within the ε-neighborhood are added, as is their own ε-neighborhood when they are also dense. This process continues until

A cluster, but they form its "edge", since they cannot be used to reach more points. Reachability is not a symmetric relation: by definition, only core points can reach non-core points. The opposite is not true, so a non-core point may be reachable, but nothing can be reached from it. Therefore, a further notion of connectedness is needed to formally define the extent of the clusters found by DBSCAN. Two points p and q are density-connected if there

A database index for better performance, or using a slow linear scan: The DBSCAN algorithm can be abstracted into the following steps: A naive implementation of this requires storing the neighborhoods in step 1, thus requiring substantial memory. The original DBSCAN algorithm does not require this by performing these steps for one point at a time. DBSCAN optimizes the following loss function: For any possible clustering C = {C1, …, Cl} out of

A flat result. In 1972, Robert F. Ling published a closely related algorithm in "The Theory and Construction of k-Clusters" in The Computer Journal with an estimated runtime complexity of O(n³). DBSCAN has a worst-case of O(n²), and the database-oriented range-query formulation of DBSCAN allows for index acceleration. The algorithms slightly differ in their handling of border points. Consider

A geometric space such as the Euclidean plane. For such an input, a spanning tree is again a tree that has as its vertices the given points. The quality of the tree is measured in the same way as in a graph, using the Euclidean distance between pairs of points as the weight for each edge. Thus, for instance, a Euclidean minimum spanning tree is the same as a graph minimum spanning tree in a complete graph with Euclidean edge weights. However, it

A hierarchical result. Extracting clusters from this plot can be done manually by selecting ranges on the x-axis after visual inspection, by selecting a threshold on the y-axis (the result is then similar to a DBSCAN clustering result with the same ε and minPts parameters; here a value of 0.1 may yield good results), or by different algorithms that try to detect
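A threshold cut of this kind is easy to sketch. The following Python fragment is a hypothetical helper, not from any library: it labels points in OPTICS order, treating reachability values above the threshold as cluster separators. This is a simplification; a full extraction could still start a new cluster at such a point if its core distance is small enough.

```python
def extract_clusters(reachability, threshold):
    """DBSCAN-like extraction: cut the reachability plot at a fixed height.
    Points whose reachability exceeds the threshold separate the valleys."""
    labels = [-1] * len(reachability)   # -1 = noise
    cluster = -1
    for i, r in enumerate(reachability):
        if r > threshold:
            cluster += 1                 # a peak: the next valley is a new cluster
        else:
            labels[i] = cluster
    return labels

# Two valleys separated by a peak (inf marks the first, unreachable point):
plot = [float("inf"), 0.05, 0.06, 0.9, 0.04, 0.05]
print(extract_clusters(plot, 0.1))  # → [-1, 0, 0, -1, 1, 1]
```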

A more densely packed cluster, so each point is assigned a core distance that describes the distance to the MinPts-th closest point: The reachability-distance of another point o from a point p is either the distance between o and p, or the core distance of p, whichever is bigger: If p and o are nearest neighbors, this is the ε′ < ε we need to assume to have p and o belong to
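These two definitions can be written down directly. The following Python fragment is an illustrative sketch with naive linear scans; the names are not from any OPTICS implementation:

```python
import math

def core_distance(points, p, eps, min_pts):
    """Distance from p to its min_pts-th closest point within eps
    (counting p itself at distance 0), or None if p is not a core point."""
    dists = sorted(math.dist(points[p], q) for q in points
                   if math.dist(points[p], q) <= eps)
    return dists[min_pts - 1] if len(dists) >= min_pts else None

def reachability_distance(points, o, p, eps, min_pts):
    """max(core-distance(p), dist(p, o)); undefined (None) if p is not core."""
    cd = core_distance(points, p, eps, min_pts)
    return None if cd is None else max(cd, math.dist(points[p], points[o]))

pts = [(0, 0), (1, 0), (2, 0), (10, 0)]
print(core_distance(pts, 0, eps=5, min_pts=2))           # → 1.0
print(reachability_distance(pts, 2, 0, eps=5, min_pts=2))  # → 2.0
```

Returning None mirrors the text: both distances are undefined when no sufficiently dense neighborhood exists within ε.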

A set of points in some space to be clustered. Let ε be a parameter specifying the radius of a neighborhood with respect to some point. For the purpose of DBSCAN clustering, the points are classified as core points, (directly-) reachable points and outliers, as follows: Now if p is a core point, then it forms a cluster together with all points (core or non-core) that are reachable from it. Each cluster contains at least one core point; non-core points can be part of

A single DBSCAN run. Like DBSCAN, OPTICS processes each point once, and performs one ε-neighborhood query during this processing. Given a spatial index that grants a neighborhood query in O(log n) runtime, an overall runtime of O(n log n)

A spanning tree (or many such trees) as intermediate steps in the process of finding the minimum spanning tree. The Internet and many other telecommunications networks have transmission links that connect nodes together in a mesh topology that includes some loops. In order to avoid bridge loops and routing loops, many routing protocols designed for such networks (including the Spanning Tree Protocol, Open Shortest Path First, Link-state routing protocol, Augmented tree-based routing, etc.) require each router to remember

A spanning tree will create a cycle; such a cycle is called a fundamental cycle with respect to that tree. There is a distinct fundamental cycle for each edge not in the spanning tree; thus, there is a one-to-one correspondence between fundamental cycles and edges not in the spanning tree. For a connected graph with V vertices, any spanning tree will have V − 1 edges, and thus, a graph of E edges and one of its spanning trees will have E − V + 1 fundamental cycles (the number of edges subtracted by the number of edges included in

A spanning tree. A special kind of spanning tree, the Xuong tree, is used in topological graph theory to find graph embeddings with maximum genus. A Xuong tree is a spanning tree such that, in the remaining graph, the number of connected components with an odd number of edges is as small as possible. A Xuong tree and an associated maximum-genus embedding can be found in polynomial time. A tree

A spanning tree; giving the number of edges not included in the spanning tree). For any given spanning tree the set of all E − V + 1 fundamental cycles forms a cycle basis, i.e., a basis for the cycle space. Dual to the notion of a fundamental cycle is the notion of a fundamental cutset with respect to a given spanning tree. By deleting just one edge of the spanning tree,
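The one-to-one correspondence between non-tree edges and fundamental cycles can be illustrated with a short sketch (hypothetical Python, using union-find to grow a spanning tree and collect the non-tree edges):

```python
def spanning_tree_edges(n, edges):
    """Grow a spanning tree greedily with union-find; each non-tree edge
    closes exactly one fundamental cycle, so a connected graph on n vertices
    with E edges yields E - n + 1 of them."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    tree, extra = [], []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            extra.append((u, v))      # closes a fundamental cycle
        else:
            parent[ru] = rv
            tree.append((u, v))
    return tree, extra

# A square with one diagonal: V = 4, E = 5, so E - V + 1 = 2 fundamental cycles.
tree, extra = spanning_tree_edges(4, [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)])
assert len(tree) == 3 and len(extra) == 2   # V-1 tree edges, E-V+1 cycle edges
```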

A special case of a class of spanning trees called Trémaux trees, named after the 19th-century discoverer of depth-first search. Spanning trees are important in parallel and distributed computing, as a way of maintaining communications between a set of processors; see for instance the Spanning Tree Protocol used by OSI link layer devices or the Shout (protocol) for distributed computing. However,

Is a stack (in the case of depth-first search) or a queue (in the case of breadth-first search). In either case, one can form a spanning tree by connecting each vertex, other than the root vertex v, to the vertex from which it was discovered. This tree is known as a depth-first search tree or a breadth-first search tree according to the graph exploration algorithm used to construct it. Depth-first search trees are

Is a connected undirected graph with no cycles. It is a spanning tree of a graph G if it spans G (that is, it includes every vertex of G) and is a subgraph of G (every edge in the tree belongs to G). A spanning tree of a connected graph G can also be defined as a maximal set of edges of G that contains no cycle, or as a minimal set of edges that connect all vertices. Adding just one edge to

Is a hierarchical version of DBSCAN which is also faster than OPTICS, and from whose hierarchy a flat partition consisting of the most prominent clusters can be extracted. Different implementations of the same algorithm were found to exhibit enormous performance differences, with the fastest on a test data set finishing in 1.4 seconds, the slowest taking 13803 seconds. The differences can be attributed to implementation quality, language and compiler differences, and

Is a point o such that both p and q are reachable from o. Density-connectedness is symmetric. A cluster then satisfies two properties: DBSCAN requires two parameters: ε (eps) and the minimum number of points required to form a dense region (minPts). It starts with an arbitrary starting point that has not been visited. This point's ε-neighborhood is retrieved, and if it contains sufficiently many points,

Is an improvement over HiSC that can find more complex hierarchies. FOPTICS is a faster implementation using random projections. HDBSCAN* is based on a refinement of DBSCAN, excluding border points from the clusters and thus following more strictly the basic definition of density levels by Hartigan. Java implementations of OPTICS, OPTICS-OF, DeLi-Clu, HiSC, HiCO and DiSH are available in the ELKI data mining framework (with index acceleration for several distance functions, and with automatic cluster extraction using

Is available in the hdbscan library. DBSCAN Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996. It is a density-based, non-parametric clustering algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors), and marks as outliers points that lie alone in low-density regions (those whose nearest neighbors are too far away). DBSCAN

Is available, this comes at additional cost in managing the heap. Therefore, ε should be chosen appropriately for the data set. OPTICS-OF is an outlier detection algorithm based on OPTICS. The main use is the extraction of outliers from an existing run of OPTICS at low cost compared to using a different outlier detection method. The better known version LOF

Is based on the same concepts. DeLi-Clu, Density-Link-Clustering combines ideas from single-linkage clustering and OPTICS, eliminating the ε parameter and offering performance improvements over OPTICS. HiSC is a hierarchical subspace clustering (axis-parallel) method based on OPTICS. HiCO is a hierarchical correlation clustering algorithm based on OPTICS. DiSH

Is easy to calculate t(G) directly: More generally, for any graph G, the number t(G) can be calculated in polynomial time as the determinant of a matrix derived from the graph, using Kirchhoff's matrix-tree theorem. Specifically, to compute t(G), one constructs the Laplacian matrix of the graph, a square matrix in which the rows and columns are both indexed by the vertices of G. The entry in row i and column j

Is not necessary to construct this graph in order to solve the optimization problem; the Euclidean minimum spanning tree problem, for instance, can be solved more efficiently in O(n log n) time by constructing the Delaunay triangulation and then applying a linear time planar graph minimum spanning tree algorithm to the resulting triangulation. A spanning tree chosen randomly from among all

Is obtained. The worst case however is O(n²), as with DBSCAN. The authors of the original OPTICS paper report an actual constant slowdown factor of 1.6 compared to DBSCAN. Note that the value of ε might heavily influence the cost of the algorithm, since a value too large might raise

Is one of the few exceptions. A single spanning tree of a graph can be found in linear time by either depth-first search or breadth-first search. Both of these algorithms explore the given graph, starting from an arbitrary vertex v, by looping through the neighbors of the vertices they discover and adding each unexplored neighbor to a data structure to be explored later. They differ in whether this data structure
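A breadth-first version of this construction can be sketched as follows (illustrative Python; swapping the queue for a stack would yield a depth-first search tree instead):

```python
from collections import deque

def bfs_spanning_tree(adj, root):
    """Breadth-first search tree: connect each discovered vertex to the
    vertex from which it was first reached."""
    parent = {root: None}
    queue = deque([root])
    tree_edges = []
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:          # unexplored neighbor
                parent[v] = u
                tree_edges.append((u, v))
                queue.append(v)
    return tree_edges

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(bfs_spanning_tree(adj, 0))  # → [(0, 1), (0, 2), (2, 3)]
```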

Is one of the most commonly used and cited clustering algorithms. In 2014, the algorithm was awarded the Test of Time Award (an award given to algorithms which have received substantial attention in theory and practice) at the leading data mining conference, ACM SIGKDD. As of July 2020, the follow-up paper "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN" appears in

Is one of three values: The resulting matrix is singular, so its determinant is zero. However, deleting the row and column for an arbitrarily chosen vertex leads to a smaller matrix whose determinant is exactly t(G). If G is a graph or multigraph and e is an arbitrary edge of G, then the number t(G) of spanning trees of G satisfies the deletion-contraction recurrence t(G) = t(G − e) + t(G/e), where G − e
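The cofactor computation described above can be sketched directly. This is a small illustrative Python function (not from any library), using exact rational arithmetic for the determinant:

```python
from fractions import Fraction

def count_spanning_trees(n, edges):
    """Kirchhoff's theorem: t(G) is any cofactor of the Laplacian.
    Exact Gaussian elimination over the rationals avoids rounding error."""
    # Build the Laplacian: degree on the diagonal, -1 per edge off-diagonal.
    L = [[Fraction(0)] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1
        L[v][v] += 1
        L[u][v] -= 1
        L[v][u] -= 1
    # Delete the last row and column, then take the determinant.
    M = [row[:-1] for row in L[:-1]]
    m = n - 1
    det = Fraction(1)
    for i in range(m):
        pivot = next((r for r in range(i, m) if M[r][i] != 0), None)
        if pivot is None:
            return 0                      # disconnected graph
        if pivot != i:
            M[i], M[pivot] = M[pivot], M[i]
            det = -det
        det *= M[i][i]
        for r in range(i + 1, m):
            factor = M[r][i] / M[i][i]
            for c in range(i, m):
                M[r][c] -= factor * M[i][c]
    return int(det)

# K4 (complete graph on 4 vertices) has 4^(4-2) = 16 spanning trees (Cayley).
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_spanning_trees(4, k4))  # → 16
```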

Is required to cut off the density of clusters that are no longer interesting, and to speed up the algorithm. The parameter ε is, strictly speaking, not necessary. It can simply be set to the maximum possible value. When a spatial index is available, however, it does play a practical role with regards to complexity. OPTICS abstracts from DBSCAN by removing this parameter, at least to the extent of only having to give

Is the multigraph obtained by deleting e and G/e is the contraction of G by e. The term t(G − e) in this formula counts the spanning trees of G that do not use edge e, and the term t(G/e) counts the spanning trees of G that use e. In this formula, if the given graph G is a multigraph, or if a contraction causes two vertices to be connected to each other by multiple edges, then

Is used that executes a neighborhood query in O(log n), an overall average runtime complexity of O(n log n) is obtained (if parameter ε is chosen in a meaningful way, i.e. such that on average only O(log n) points are returned). Without the use of an accelerating index structure, or on degenerated data (e.g. all points within a distance less than ε), the worst case run time complexity remains O(n²). The (n² − n)/2-sized upper triangle of

The edges of G are also edges of a spanning tree T of G, then G is a tree and is identical to T (that is, a tree has a unique spanning tree and it is itself). Several pathfinding algorithms, including Dijkstra's algorithm and the A* search algorithm, internally build a spanning tree as an intermediate step in solving the problem. In order to minimize the cost of power networks, wiring connections, piping, automatic speech recognition, etc., people often use algorithms that gradually build

The ξ extraction method). Other Java implementations include the Weka extension (no support for ξ cluster extraction). The R package "dbscan" includes a C++ implementation of OPTICS (with both traditional dbscan-like and ξ cluster extraction) using a k-d tree for index acceleration for Euclidean distance only. Python implementations of OPTICS are available in the PyClustering library and in scikit-learn. HDBSCAN*

The base, and fundamental cutsets are defined in the same way from the dual matroid. A collection of disjoint (unconnected) trees is described as a forest. A spanning forest in a graph is a subgraph that is a forest with an additional requirement. There are two incompatible requirements in use, of which one is relatively rare. To avoid confusion between these two definitions, Gross & Yellen (2005) suggest

The cost of a neighborhood query to linear complexity. In particular, choosing ε > max_{x,y} d(x, y) (larger than the maximum distance in the data set) is possible, but leads to quadratic complexity, since every neighborhood query returns the full data set. Even when no spatial index

The density predicate uses the polygon areas instead of just the object count. Various extensions to the DBSCAN algorithm have been proposed, including methods for parallelization, parameter estimation, and support for uncertain data. The basic idea has been extended to hierarchical clustering by the OPTICS algorithm. DBSCAN is also used as part of subspace clustering algorithms like PreDeCon and SUBCLU. HDBSCAN*

The density-connected cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise. DBSCAN can be used with any distance function (as well as similarity functions or other predicates). The distance function (dist) can therefore be seen as an additional parameter. The algorithm can be expressed in pseudocode as follows: where RangeQuery can be implemented using
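The procedure described above can be sketched in Python. This is a minimal, illustrative implementation (a linear-scan range query stands in for a database index; the label values are this sketch's own convention):

```python
import math

def dbscan(points, eps, min_pts):
    """Sketch of DBSCAN: returns per-point labels, cluster ids from 0
    upward, or -1 for noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def range_query(q):  # a spatial index would replace this linear scan
        return [i for i, r in enumerate(points)
                if math.dist(points[q], r) <= eps]

    cluster = -1
    for p in range(len(points)):
        if labels[p] is not UNVISITED:
            continue
        neighbors = range_query(p)
        if len(neighbors) < min_pts:
            labels[p] = NOISE            # may later be claimed as a border point
            continue
        cluster += 1
        labels[p] = cluster
        seeds = list(neighbors)
        while seeds:
            q = seeds.pop()
            if labels[q] == NOISE:
                labels[q] = cluster      # border point: joins, but is not expanded
            if labels[q] is not UNVISITED:
                continue
            labels[q] = cluster
            q_neighbors = range_query(q)
            if len(q_neighbors) >= min_pts:   # q is a core point: expand
                seeds.extend(q_neighbors)
    return labels

pts = [(0, 0), (0.1, 0), (0.2, 0), (5, 5), (5.1, 5), (5.2, 5), (20, 20)]
print(dbscan(pts, eps=0.5, min_pts=3))  # → [0, 0, 0, 1, 1, 1, -1]
```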

The depth-first and breadth-first methods for constructing spanning trees on sequential computers are not well suited for parallel and distributed computers. Instead, researchers have devised several more specialized algorithms for finding spanning trees in these models of computation. In certain fields of graph theory it is often useful to find a minimum spanning tree of a weighted graph. Other optimization problems on spanning trees have also been studied, including

The distance matrix can be materialized to avoid distance recomputations, but this needs O(n²) memory, whereas a non-matrix based implementation of DBSCAN only needs O(n) memory. See the section below on extensions for algorithmic modifications to handle these issues. Every data mining task has the problem of parameters. Every parameter influences the algorithm in specific ways. For DBSCAN,

The edges of the graph are assigned random weights and then the minimum spanning tree of the weighted graph is constructed. Because a graph may have exponentially many spanning trees, it is not possible to list them all in polynomial time. However, algorithms are known for listing all spanning trees in polynomial time per tree. Every finite connected graph has a spanning tree. However, for infinite connected graphs,

The existence of spanning trees is equivalent to the axiom of choice. An infinite graph is connected if each pair of its vertices forms the pair of endpoints of a finite path. As with finite graphs, a tree is a connected graph with no finite cycles, and a spanning tree can be defined either as a maximal acyclic set of edges or as a tree that contains every vertex. The trees within a graph may be partially ordered by their subgraph relation, and any infinite chain in this partial order has an upper bound (the union of

The list of the 8 most downloaded articles of the prestigious ACM Transactions on Database Systems (TODS) journal. Another follow-up, HDBSCAN*, was initially published by Ricardo J. G. Campello, David Moulavi, and Jörg Sander in 2013, then expanded upon with Arthur Zimek in 2015. It revises some of the original decisions such as the border points, and produces a hierarchical instead of

The lower part shows the reachability plot as computed by OPTICS. Colors in this plot are labels, and not computed by the algorithm; but it is clearly visible how the valleys in the plot correspond to the clusters in the above data set. The yellow points in this image are considered noise, and no valley is found in their reachability plot. They are usually not assigned to clusters, except the omnipresent "all data" cluster in

The maximum spanning tree, the minimum tree that spans at least k vertices, the spanning tree with the fewest edges per vertex, the spanning tree with the largest number of leaves, the spanning tree with the fewest leaves (closely related to the Hamiltonian path problem), the minimum diameter spanning tree, and the minimum dilation spanning tree. Optimal spanning tree problems have also been studied for finite sets of points in

The maximum value. The basic approach of OPTICS is similar to DBSCAN, but instead of maintaining known, but so far unprocessed cluster members in a set, they are maintained in a priority queue (e.g. using an indexed heap). In update(), the priority queue Seeds is updated with the ε-neighborhood of p and q, respectively: OPTICS hence outputs
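A minimal sketch of this main loop, assuming Euclidean points and using Python's heapq as the priority queue (stale heap entries are skipped rather than re-keyed, a common substitute for an indexed heap; all names here are illustrative):

```python
import heapq
import math

def optics(points, eps, min_pts):
    """Sketch of the OPTICS main loop: each point is emitted once, annotated
    with its reachability distance (inf for the first point of each cluster)."""
    INF = float("inf")
    n = len(points)

    def neighbors(p):
        return [q for q in range(n) if math.dist(points[p], points[q]) <= eps]

    def core_distance(p, nbrs):
        ds = sorted(math.dist(points[p], points[q]) for q in nbrs)
        return ds[min_pts - 1] if len(ds) >= min_pts else None

    processed = [False] * n
    reach = [INF] * n
    order = []
    for start in range(n):
        if processed[start]:
            continue
        seeds = [(INF, start)]            # priority queue keyed by reachability
        while seeds:
            _, p = heapq.heappop(seeds)
            if processed[p]:
                continue                  # stale entry: already re-keyed and emitted
            processed[p] = True
            order.append((p, reach[p]))
            nbrs = neighbors(p)
            cd = core_distance(p, nbrs)
            if cd is None:
                continue                  # p is not a core point: expand nothing
            for q in nbrs:                # update(): relax reachability of the seeds
                if not processed[q]:
                    new_r = max(cd, math.dist(points[p], points[q]))
                    if new_r < reach[q]:
                        reach[q] = new_r
                        heapq.heappush(seeds, (new_r, q))
    return order
```

Calling `optics([(0, 0), (0.1, 0), (0.2, 0), (5, 5)], eps=0.5, min_pts=2)` emits all four points, with an infinite reachability at each cluster start, which is exactly the annotated ordering the article describes.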

The minimum cluster size to find. While the algorithm is much easier to parameterize than DBSCAN, the results are a bit more difficult to use, as it will usually produce a hierarchical clustering instead of the simple data partitioning that DBSCAN produces. Recently, one of the original authors of DBSCAN has revisited DBSCAN and OPTICS, and published a refined version of hierarchical DBSCAN (HDBSCAN*), which no longer has

The notion of border points. Instead, only the core points form the cluster. A spectral implementation of DBSCAN is related to spectral clustering in the trivial case of determining connected graph components (the optimal clusters with no edges cut). However, it can be computationally intensive, up to O(n³). Additionally, one has to choose

The number of eigenvectors to compute. For performance reasons, the original DBSCAN algorithm remains preferable to its spectral implementation. Generalized DBSCAN (GDBSCAN) is a generalization by the same authors to arbitrary "neighborhood" and "dense" predicates. The ε and minPts parameters are removed from the original algorithm and moved to the predicates. For example, on polygon data, the "neighborhood" could be any intersecting polygon, whereas

The parameters ε and minPts are needed. The parameters must be specified by the user. Ideally, the value of ε is given by the problem to solve (e.g. a physical distance), and minPts is then the desired minimum cluster size. OPTICS can be seen as a generalization of DBSCAN that replaces the ε parameter with a maximum value that mostly affects performance. MinPts then essentially becomes

The points in a particular ordering, annotated with their smallest reachability distance (in the original algorithm, the core distance is also exported, but this is not required for further processing). Using a reachability plot (a special kind of dendrogram), the hierarchical structure of the clusters can be obtained easily. It is a 2D plot, with the ordering of the points as processed by OPTICS on

The redundant edges should not be removed, as that would lead to the wrong total. For instance a bond graph connecting two vertices by k edges has k different spanning trees, each consisting of a single one of these edges. The Tutte polynomial of a graph can be defined as a sum, over the spanning trees of the graph, of terms computed from the "internal activity" and "external activity" of

The same cluster. Both core-distance and reachability-distance are undefined if no sufficiently dense cluster (w.r.t. ε) is available. Given a sufficiently large ε, this never happens, but then every ε-neighborhood query returns the entire database, resulting in O(n²) runtime. Hence, the ε parameter

The set of all clusterings 𝒞, it minimizes the number of clusters under the condition that every pair of points in a cluster is density-reachable, which corresponds to the original two properties "maximality" and "connectivity" of a cluster:

min_{C ⊂ 𝒞, d_db(p,q) ≤ ε ∀ p,q ∈ C_i ∀ C_i ∈ C} |C|

where d_db(p, q) gives

The smallest ε such that two points p and q are density-connected. DBSCAN visits each point of the database, possibly multiple times (e.g., as candidates to different clusters). For practical considerations, however, the time complexity is mostly governed by the number of regionQuery invocations. DBSCAN executes exactly one such query for each point, and if an indexing structure

The spanning tree can only appear in the cutsets of the other edges in the cycle; and vice versa: edges in a cutset can only appear in those cycles containing the edge corresponding to the cutset. This duality can also be expressed using the theory of matroids, according to which a spanning tree is a base of the graphic matroid, a fundamental cycle is the unique circuit within the set formed by adding one element to

The spanning trees with equal probability is called a uniform spanning tree. Wilson's algorithm can be used to generate uniform spanning trees in polynomial time by a process of taking a random walk on the given graph and erasing the cycles created by this walk. An alternative model for generating spanning trees randomly but not uniformly is the random minimal spanning tree. In this model,

The term "full spanning forest" for a spanning forest with the same number of components as the given graph (i.e., a maximal forest), while Bondy & Murty (2008) instead call this kind of forest a "maximal spanning forest" (which is redundant, as a maximal forest necessarily contains every vertex). The number t(G) of spanning trees of a connected graph is a well-studied invariant. In some cases, it

The tree. Its value at the arguments (1,1) is the number of spanning trees or, in a disconnected graph, the number of maximal spanning forests. The Tutte polynomial can also be computed using a deletion-contraction recurrence, but its computational complexity is high: for many values of its arguments, computing it exactly is #P-complete, and it is also hard to approximate with a guaranteed approximation ratio. The point (1,1), at which it can be evaluated using Kirchhoff's theorem,

The trees in the chain). Zorn's lemma, one of many equivalent statements to the axiom of choice, requires that a partial order in which all chains are upper bounded have a maximal element; in the partial order on the trees of the graph, this maximal element must be a spanning tree. Therefore, if Zorn's lemma is assumed, every infinite connected graph has a spanning tree. In the other direction, given

The use of indexes for acceleration. Spanning tree In the mathematical field of graph theory, a spanning tree T of an undirected graph G is a subgraph that is a tree which includes all of the vertices of G. In general, a graph may have several spanning trees, but a graph that is not connected will not contain a spanning tree (see about spanning forests below). If all of

The valleys by steepness, knee detection, or local maxima. A range of the plot beginning with a steep descent and ending with a steep ascent is considered a valley, and corresponds to a contiguous area of high density. Additional care must be taken with the last points in a valley to assign them to the inner or outer cluster; this can be achieved by considering the predecessor. Clusterings obtained this way usually are hierarchical, and cannot be achieved by

The vertices are partitioned into two disjoint sets. The fundamental cutset is defined as the set of edges that must be removed from the graph G to accomplish the same partition. Thus, each spanning tree defines a set of V − 1 fundamental cutsets, one for each edge of the spanning tree. The duality between fundamental cutsets and fundamental cycles is established by noting that cycle edges not in

The x-axis and the reachability distance on the y-axis. Since points belonging to a cluster have a low reachability distance to their nearest neighbor, the clusters show up as valleys in the reachability plot. The deeper the valley, the denser the cluster. The image above illustrates this concept. In its upper left area, a synthetic example data set is shown. The upper right part visualizes the spanning tree produced by OPTICS, and
