Title: | Circular, Periodic, or Framed Data Clustering: Fast, Optimal, and Reproducible |
---|---|
Description: | Fast, optimal, and reproducible clustering algorithms for circular, periodic, or framed data. The algorithms introduced here are based on a core algorithm for optimal framed clustering the authors have developed (Debnath & Song 2021) <doi:10.1109/TCBB.2021.3077573>. The runtime of these algorithms is O(K N log^2 N), where K is the number of clusters and N is the number of circular data points. On a desktop computer using a single processor core, millions of data points can be grouped into a few clusters within seconds. One can apply the algorithms to characterize events along circular DNA molecules, circular RNA molecules, and circular genomes of bacteria, chloroplast, and mitochondria. One can also cluster climate data along any given longitude or latitude. Periodic data clustering can be formulated as circular clustering. The algorithms offer a general high-performance solution to circular, periodic, or framed data clustering. |
Authors: | Tathagata Debnath [aut]
|
Maintainer: | Joe Song <[email protected]> |
License: | LGPL (>= 3) |
Version: | 0.0.4 |
Built: | 2025-02-05 02:55:02 UTC |
Source: | https://github.com/cran/OptCirClust |
Perform clustering on circular data to minimize the within-cluster sum of squared distances.
CirClust(O, K, Circumference, method = c("FOCC", "HEUC", "BOCC"))
CirClust(O, K, Circumference, method = c("FOCC", "HEUC", "BOCC"))
O |
a vector of circular data points. They can be coordinates along the circle based on distance, or angles around the circle. |
K |
the number of clusters |
Circumference |
the circumference of the circle where data are located |
method |
the circular clustering method.
|
By circular data, we broadly refer to data points on any non-self-intersecting loop.
In clustering circular points into
clusters, the "FOCC" algorithm
is reproducible with runtime
(Debnath and Song 2021);
The "HEUC" algorithm, not always reproducible, calls the
kmeans
function repeatedly;
The "BOCC" algorithm with runtime , reproducible but slow, is done via
repeatedly calling the
Ckmeans.1d.dp
function.
An object of class "CirClust"
which has a plot
method. It is a list with the following components:
cluster |
a vector of clusters assigned to each element in |
centers |
a numeric vector of the means for each cluster in the circular data. |
withinss |
a numeric vector of the within-cluster sum of squares for each cluster. |
size |
a vector of the number of elements in each cluster. |
totss |
the total sum of squared distances between each element and the sample mean. This statistic is not dependent on the clustering result. |
tot.withinss |
the total sum of within-cluster squared distances between each element and its cluster mean. This statistic is minimized given the number of clusters. |
betweenss |
the sum of squared distances between each cluster mean and sample mean. This statistic is maximized given the number of clusters. |
ID |
the starting index of the frame with minimum SSQ |
Border |
the borders of |
Border.mid |
the middle point of the last and first points of two consequitive clusters. |
O_name |
a character string. The actual name of the |
Circumference |
the circumfarence of the circular or periodic data. |
Debnath T, Song M (2021). “Fast optimal circular clustering and applications on round genomes.” IEEE/ACM Transactions on Computational Biology and Bioinformatics. doi:10.1109/TCBB.2021.3077573.
O <- c(1,2, 10,11,12,13,14,15, 27,28,29,30,31,32, 40,41) K <- 3 Circumference <- 42 # Perform circular clustering: output <- CirClust(O, K, Circumference) # Visualize the circular clusters: opar <- par(mar=c(1,1,2,1)) plot(output) par(opar)
O <- c(1,2, 10,11,12,13,14,15, 27,28,29,30,31,32, 40,41) K <- 3 Circumference <- 42 # Perform circular clustering: output <- CirClust(O, K, Circumference) # Visualize the circular clusters: opar <- par(mar=c(1,1,2,1)) plot(output) par(opar)
Find a frame of given size, among all possible such frames on the input data, to minimize the minimum within-cluster sum of square distances.
FramedClust( X, K, frame.size, first.frame = 1, last.frame = length(X) - frame.size + 1, method = c("linear.polylog", "kmeans", "Ckmeans.1d.dp") )
FramedClust( X, K, frame.size, first.frame = 1, last.frame = length(X) - frame.size + 1, method = c("linear.polylog", "kmeans", "Ckmeans.1d.dp") )
X |
a vector of data points to perform framed clustering |
K |
the number of clusters in each frame |
frame.size |
the number of points from X to be included in each frame. It is not the width of the frame. |
first.frame |
starting index of the first frame to be clustered.
The first point in the first frame is |
last.frame |
starting index of the last frame to be clustered.
The first point in the first frame is |
method |
the framed clustering method. See Details. |
The method option "linear.polylog"
(default) performs
fast optimal framed clustering. The runtime is
(Debnath and Song 2021).
The "kmeans"
option repeatedly calling the heuristic
k-means algorithm in all frames without any guarantee of
cluster optimality.
The method option "Ckmeans.1d.dp"
performs optimal framed
clustering by repeatedly finding the best clustering within
each frame using the "Ckmeans.1d.dp"
method. At a runtime
of , the algorithm is slow but optimal.
It is included to provide a baseline.
An object of class "FramedClust"
which has a plot
method. It is a list with the following components:
cluster |
a vector of clusters assigned to each element in x. Each cluster is indexed by an integer from 1 to K. NA represents points from X that are outside the optimal frame, thus not part of any cluster. |
centers |
a numeric vector of the means for each cluster in the frame. |
withinss |
a numeric vector of the within-cluster sum of squared distances for each cluster. |
size |
a vector of the number of elements in each cluster. |
totss |
total sum of squared distances between each element and the sample mean. This statistic is not dependent on the clustering result. |
tot.withinss |
total sum of within-cluster squared distances between each element and its cluster mean. This statistic is minimized given the number of clusters. |
betweenss |
sum of squared distances between each cluster mean and sample mean. This statistic is maximized given the number of clusters. |
X_name |
a character string. The actual name of
the |
Debnath T, Song M (2021). “Fast optimal circular clustering and applications on round genomes.” IEEE/ACM Transactions on Computational Biology and Bioinformatics. doi:10.1109/TCBB.2021.3077573.
N <- 100 X <- rnorm(N) K <- 5 frame.size <- 60 result <- FramedClust(X, K, frame.size) plot(result, main="Example 1. Framed clustering on all frames") frame.size <- 40 first.frame <- 30 last.frame <- 50 method <- "linear.polylog" result <- FramedClust(X, K, frame.size, first.frame, last.frame, method) plot(result, main="Example 2. Framed clustering on a subset of frames")
N <- 100 X <- rnorm(N) K <- 5 frame.size <- 60 result <- FramedClust(X, K, frame.size) plot(result, main="Example 1. Framed clustering on all frames") frame.size <- 40 first.frame <- 30 last.frame <- 50 method <- "linear.polylog" result <- FramedClust(X, K, frame.size, first.frame, last.frame, method) plot(result, main="Example 2. Framed clustering on a subset of frames")
The plot
method for circular data clustering result
object of class CirClust
.
It visualizes circular clusters on the input data.
## S3 method for class 'CirClust' plot( x, xlab = "", ylab = "", main = NULL, sub = "", col.clusters = c("blue", "red3", "green3", "orange", "purple", "brown"), axes = FALSE, xlim = c(-1.75, 1.75), ylim = c(-1.75, 1.75), fill = "floralwhite", border = "gray36", border.lty = "dotted", ... )
## S3 method for class 'CirClust' plot( x, xlab = "", ylab = "", main = NULL, sub = "", col.clusters = c("blue", "red3", "green3", "orange", "purple", "brown"), axes = FALSE, xlim = c(-1.75, 1.75), ylim = c(-1.75, 1.75), fill = "floralwhite", border = "gray36", border.lty = "dotted", ... )
x |
an object of class as returned by |
xlab |
a character string. The x-axis label for the plot. Default is no string. |
ylab |
a character string. The y-axis label for the plot. Default is no string. |
main |
a character string. The title for the plot. |
sub |
a character string. The subtitle for the plot. |
col.clusters |
a vector of colors, defined either by integers or by color names. If the length is shorter than the number of clusters, the colors will be reused. By default the blue, red3, green3, orange, purple, brown colors are used in the plot. |
axes |
the axis will be ploted if set TRUE. Default is FALSE. |
xlim |
range of the x axis in the plot. Default is from -1.75 to 1.75. |
ylim |
range of the y axis in the plot. Default is from -1.75 to 1.75. |
fill |
the color to fill inside the ring as the background of data points. |
border |
the color to draw cluster borders. |
border.lty |
the line type to draw cluster borders. |
... |
other arguments associated with the plot function |
A copy of the input object of class CirClust
.
opar <- par(mar=c(1,1,2,1)) # Example 1. Circular data clustering n <- 100 Circumference <- 7 O <- runif(n, 0, Circumference) result <- CirClust(O, K=3, Circumference=Circumference) plot(result, fill="mintcream", main="Example 1. Circular clustering") # Example 2. Circular data clustering n <- 40 m <- 5 O <- c(rnorm(n,mean=5,sd=m), rnorm(n,mean=15,sd=m), rnorm(n,mean=26,sd=m)) K <- 3 Circumference <- 28 result <- CirClust(O, K, Circumference, method = "FOCC") color <- c("royalblue", "green3", "firebrick") # c("#0000CD","#808080", "#DC143C") par(mar=c(1,1,2,1)) plot(result, col.clusters = color, fill="floralwhite", main="Example 2. Circular clustering") # Example 3. Periodic data clustering n <- 100 period <- 5.2 O <- rnorm(n) result <- CirClust(O, K=5, Circumference=period) plot(result, fill="navy", border="gray", border.lty="dotted", main="Example 3. Periodic clustering") par(opar)
opar <- par(mar=c(1,1,2,1)) # Example 1. Circular data clustering n <- 100 Circumference <- 7 O <- runif(n, 0, Circumference) result <- CirClust(O, K=3, Circumference=Circumference) plot(result, fill="mintcream", main="Example 1. Circular clustering") # Example 2. Circular data clustering n <- 40 m <- 5 O <- c(rnorm(n,mean=5,sd=m), rnorm(n,mean=15,sd=m), rnorm(n,mean=26,sd=m)) K <- 3 Circumference <- 28 result <- CirClust(O, K, Circumference, method = "FOCC") color <- c("royalblue", "green3", "firebrick") # c("#0000CD","#808080", "#DC143C") par(mar=c(1,1,2,1)) plot(result, col.clusters = color, fill="floralwhite", main="Example 2. Circular clustering") # Example 3. Periodic data clustering n <- 100 period <- 5.2 O <- rnorm(n) result <- CirClust(O, K=5, Circumference=period) plot(result, fill="navy", border="gray", border.lty="dotted", main="Example 3. Periodic clustering") par(opar)
The plot
method for framed data clustering result object.
It visualizes clusters on the input data that are within a best
frame.
## S3 method for class 'FramedClust' plot( x, xlab = NULL, ylab = NULL, main = NULL, sub = NULL, col.clusters = c("blue", "red3", "green3", "orange", "purple", "brown"), ... )
## S3 method for class 'FramedClust' plot( x, xlab = NULL, ylab = NULL, main = NULL, sub = NULL, col.clusters = c("blue", "red3", "green3", "orange", "purple", "brown"), ... )
x |
an object of class |
xlab |
a character string. The x-axis label for the plot. Default is NULL. |
ylab |
a character string. The y-axis label for the plot. Default is NULL. |
main |
a character string. The title for the plot. Default is NULL. |
sub |
a character string. The subtitle for the plot. Default is NULL. |
col.clusters |
a vector of colors, defined either by integers or by color names. If the length is shorter than the number of clusters, the colors will be reused. By default the blue, red3, green3, orange, purple, brown colors are used in the plot. |
... |
other arguments associated with the plot function |
An object of class "FramedClust"
,
identical to the input x
N <- 100 X <- rchisq(N, 5) K <- 3 frame.size <- 40 result <- FramedClust(X, K, frame.size) plot(result)
N <- 100 X <- rchisq(N, 5) K <- 3 frame.size <- 40 result <- FramedClust(X, K, frame.size) plot(result)