The rPref Package

Database Preferences and Skyline Computation in R

Examples

On this page we show some typical applications of Skylines and Preferences and explain the usage of our package with some code snippets. Take a look at the examples in the documentation to get more ideas on how to work with this package. Some of these examples also require the ggplot2, dplyr and igraph packages. All packages needed for the following examples can be installed with

install.packages("rPref")
install.packages("dplyr")
install.packages("igraph")
install.packages("ggplot2")

All rPref functions are printed bold in the following.

Skyline plot

Consider goals which tend to be conflicting, e.g. horsepower and fuel consumption for cars. Let us take the mtcars data set from R, where mtcars$hp is the horsepower and mtcars$mpg is the inverse fuel consumption (miles per gallon, i.e. a high value indicates low fuel consumption).

In the following code snippet the optimal set of cars with respect to the preference "high horsepower and low fuel consumption" is calculated and the result is plotted:

library(rPref)
library(ggplot2)

# Calculate Skyline
sky1 <- psel(mtcars, high(mpg) * high(hp))

# Plot mpg and hp values of mtcars and highlight the skyline
ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point(shape = 21) +
  geom_point(data = sky1, size = 3)
The result of the visualization is:
Figure:
Plot of the mpg and hp values of mtcars. The Pareto-optimal points maximizing both dimensions are bold.
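
To make the notion of Pareto optimality more concrete, the following base R sketch computes the same kind of optimal set without rPref. It is only an illustration under the usual dominance definition (a tuple is dominated if another tuple is at least as good in both dimensions and strictly better in one), not the implementation used by psel. The helper names dominated and skyline_mask are hypothetical.

```r
# Base R sketch of Pareto-optimal selection (not rPref's implementation).
# Tuple i is dominated if some tuple is at least as good in both
# dimensions and strictly better in at least one of them.
dominated <- function(i, a, b) {
  any(a >= a[i] & b >= b[i] & (a > a[i] | b > b[i]))
}

# Hypothetical helper: TRUE for every tuple that is not dominated
skyline_mask <- function(a, b) {
  !vapply(seq_along(a), dominated, logical(1), a = a, b = b)
}

# Optimal cars for "high mpg and high hp"
sky_base <- mtcars[skyline_mask(mtcars$mpg, mtcars$hp), ]
sky_base[, c("mpg", "hp")]
```

The quadratic comparison of all pairs is fine for a small data set like mtcars; for larger data the package function is the better choice.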

Pareto frontier and level value

Consider again the mtcars data set from R together with the high(mpg) * high(hp) preference.

Next to the Skyline points we want to plot the following information:

  • The level value of each car. Level 1 means that the car is optimal for this preference. The optimal cars from the remainder (i.e., the data set without the optima) have level value 2. In general, the tuples of level n are retrieved by taking the maxima from the (n-1)-th remainder. Note that the layers of the Hasse diagram in the Better-Than-Graph example at the end of this page correspond to these level numbers.
  • The Pareto frontier, i.e. the line connecting all optimal points such that the dominance area of these tuples is bounded by the frontier.
In the following code snippet the level value of each car with respect to this preference is calculated and visualized:
# Consider again the preference from above
p <- high(mpg) * high(hp)

# Calculate the level-value w.r.t. p by using top-all
res <- psel(mtcars, p, top = nrow(mtcars))

# Visualize the level values by the color of the points
gp <- ggplot(res, aes(x = mpg, y = hp, color = factor(.level))) +
  geom_point(size = 3)
gp

The result of the visualization is:

Figure:
Plot of the mpg and hp values of mtcars. The color visualizes the level (proximity to optimum).
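
The level definition given above (take the maxima, remove them, repeat on the remainder) can be sketched in base R as follows. This is a minimal illustration of that definition, not the algorithm rPref uses internally, and the helper name pareto_levels is hypothetical.

```r
# Base R sketch: level values by repeatedly peeling off the maxima
# with respect to "high a and high b" (not rPref's implementation).
pareto_levels <- function(a, b) {
  level <- integer(length(a))        # 0 means "no level assigned yet"
  current <- 1L
  while (any(level == 0L)) {
    rest <- which(level == 0L)       # indices of the current remainder
    # A tuple is optimal if no other tuple of the remainder dominates it
    optimal <- vapply(rest, function(i) {
      !any(a[rest] >= a[i] & b[rest] >= b[i] &
             (a[rest] > a[i] | b[rest] > b[i]))
    }, logical(1))
    level[rest[optimal]] <- current
    current <- current + 1L
  }
  level
}

lev <- pareto_levels(mtcars$mpg, mtcars$hp)
table(lev)   # number of cars per level
```

Every remainder contains at least one undominated tuple, so each pass assigns at least one level and the loop terminates.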

Additionally we want to plot the Pareto front line for every level, indicating the area of all points of a lower level. To this end we use the geom_step function from ggplot2:

gp + geom_step(direction = "vh")

This results in the following graphic:

Figure:
Plot of the mpg and hp values of mtcars, where the Pareto front line for each level is shown.

Grouped Skyline

Sometimes one may be interested in the Skyline on a partitioned data set where the Skyline is calculated for each partition separately. The dplyr package provides a very convenient way to partition data sets. The rPref package respects the given grouping, i.e., the preference selection preserves these groups and operates on each group separately.

The following code partitions the mtcars data set by the number of cylinders (the mtcars$cyl variable). Next, the same Skyline as above (high mpg and high horsepower) is calculated for each group:

# Get grouped data set using dplyr
library(dplyr)
df <- group_by(mtcars, cyl)

# Calculate Grouped Skyline
sky2 <- psel(df, high(mpg) * high(hp))
Now we can get the size of each Skyline as follows, using summarise from dplyr:
> summarise(sky2, skyline_size = n())
Source: local data frame [3 x 2]

  cyl skyline_size
1   4            3
2   6            2
3   8            4
To visualize this result we plot the Skyline using the number of cylinders as color:
ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point(shape = 21) +
  geom_point(data = sky2, aes(color = factor(cyl)), size = 3)
This produces the following graphic:
Figure:
Grouped skyline. The red, green and blue points are the Skyline points of each cylinder group.
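
The idea of a grouped Skyline can also be sketched in base R by splitting the data and selecting the undominated tuples within each part. This is only an illustration of the concept and does not reflect how rPref handles grouped data frames internally. The helper name skyline_mask is hypothetical.

```r
# Base R sketch of a grouped Skyline (not rPref's implementation).
# Hypothetical helper: TRUE for tuples not dominated w.r.t. "high a and high b"
skyline_mask <- function(a, b) {
  vapply(seq_along(a), function(i) {
    !any(a >= a[i] & b >= b[i] & (a > a[i] | b > b[i]))
  }, logical(1))
}

# Split mtcars by number of cylinders and select the optima per group
groups <- split(mtcars, mtcars$cyl)
sky_by_cyl <- lapply(groups, function(g) g[skyline_mask(g$mpg, g$hp), ])

# Skyline size per cylinder group
sapply(sky_by_cyl, nrow)
```

Because dominance is only checked within a group, a car can be optimal in its cylinder class even if a car from another class dominates it.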

Better-Than-Graph for the preference order

To get a better understanding of the preference order we will visualize the Better-Than-Graph of a preference. Formally it is a Hasse diagram showing all the better-than-relationships of the preference on a given domain. As an example we consider the Pareto preference high(mpg) * low(wt) to search for cars with a low fuel consumption and a low weight.

The plotting itself can be done in three different ways:

  • With the plot_btg function using the Rgraphviz package (if available) based on the dot layouter.
  • With the plot_btg function with parameter use_dot = FALSE using the igraph package.
  • With the get_btg_dot function and an external Graphviz/dot interpreter (see ?get_btg_dot, not explained here).
In the following we show the resulting diagrams on the data set mtcars[1:8,] using plot_btg.
# Pick a small data set and create preference / BTG
df <- mtcars[1:8,]
pref <- high(mpg) * low(wt)
btg <- get_btg(df, pref)

# Create labels for the nodes containing relevant values
labels <- paste0(df$mpg, "\n", df$wt)
First we use Graphviz. If the Rgraphviz package (from Bioconductor) is available, the dot layouter is used to plot the graph, resulting in the left figure.
plot_btg(df, pref, labels)
If Rgraphviz is not available or we explicitly do not use it via
plot_btg(df, pref, labels, use_dot = FALSE)
then the igraph package is used to plot the graph. The result is shown on the right. Note that the dot layouter is more appropriate for strict orders. It ensures that all edges are pointing from top to bottom.

We get the following two diagrams:
Figure:
Better-Than-Graphs for mtcars[1:8,]. The upper number is the miles-per-gallon value, the lower number is the weight in 1000 lbs. An edge from A->B only exists if at least one value of A is strictly better (higher mpg value or lower weight) than the corresponding value of B and the other value is better or equal.
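
To illustrate which edges such a Hasse diagram contains, the following base R sketch first builds the full better-than relation for high(mpg) * low(wt) on mtcars[1:8,] and then drops every edge that is implied by transitivity. It is a minimal sketch of the concept and is not claimed to reproduce the output of get_btg; the variable names cars8, better and hasse are chosen here for illustration.

```r
# Base R sketch of the edges of a Hasse diagram (not rPref's implementation)
cars8 <- mtcars[1:8, ]
n <- nrow(cars8)

# better[i, j] is TRUE if car i is better than car j:
# at least as good in mpg and wt, strictly better in one of them
better <- outer(seq_len(n), seq_len(n), function(i, j) {
  cars8$mpg[i] >= cars8$mpg[j] & cars8$wt[i] <= cars8$wt[j] &
    (cars8$mpg[i] > cars8$mpg[j] | cars8$wt[i] < cars8$wt[j])
})

# Keep the edge i -> j only if no car k lies between them,
# i.e. there is no k with "i better than k" and "k better than j"
hasse <- better
for (i in seq_len(n)) for (j in seq_len(n)) {
  if (better[i, j] && any(better[i, ] & better[, j])) hasse[i, j] <- FALSE
}

# List the remaining edges by car name
edges <- which(hasse, arr.ind = TRUE)
data.frame(from = rownames(cars8)[edges[, 1]],
           to   = rownames(cars8)[edges[, 2]])
```

Cars without an incoming edge are the optima of the preference; they form the top layer of the diagram.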