Monday, December 5, 2016

Optimization matchup: R's glpkAPI vs Julia's JuMP

tl;dr: although I use R every day and love it, doing mathematical programming using Julia is much simpler and more flexible than anything I know that is currently available in R.

Recently I have learned that Iain Dunning and Joey Huchette and Miles Lubin have received 2016 INFORMS Computing Society prize for the development of JuMP, a Julia-based domain-specific modeling language. Congratulations!

Together with Wit Jakuczun we have decided to test drive Julia's JuMP against R's package glpkAPI.

The problem we decided to solve is a standard MIP model for finding clusters in data using k-medoids method (we have used this specification of the model without relaxation). The work breakdown was that Wit writes a solution in R and I developed Julia code.

Data preparation

The data set used for the analysis can be generated using this Julia code:

function gendata(N, K, scaling)
    x = randn(N, 2)
    for i in 1:N
        α = 2π * i / K
        x[i, :] += scaling * [cos(α); sin(α)] / (2 * sin(π/K))
    end
    return x
end

srand(1)
d = gendata(50, 4, 5)
writecsv("test.txt", d)

or similar R code respectively:

gen_data <- function(N, K, scaling) {
    alpha <- 2 * pi * (1:N) / K
    sin_pi_K <- 2 * sin(pi / K)

    X <- as.data.frame(matrix(data=rnorm(n=2*N), nrow=N, ncol=2) +
                       scaling*matrix(data = c(cos(alpha), sin(alpha)),
                                      nrow = N, ncol = 2) / sin_pi_K)
    return(data.frame(x = X$V1, y = X$V2))
}

set.seed(1)
write.table(x = gen_data(200, 4, 5), file = "test.txt",
            col.names = TRUE, row.names = FALSE,
            sep = ",", dec = ".")

(Julia codes below assume that data set was generated using Julia and R assumes data was generated in R: the difference is that file generated in R has a header)

Comparing the above codes (although they are not the main objective of the exercise) highlights two things about Julia: 1) you can use unicode literals in your code and 2) in Julia using explicit loop is OK (fast). Those two points are in this case of course only an issue of taste but I actually find that both of them make the code a bit more readable.

The end result is 200 points placed in 2D plane in four clusters.

Now we move to the actual challenge: write a code that executes k-medoids algorithm for number of clusters from 2 to 6 and compare the results.

Here is the solution in Julia:

using JuMP
using GLPKMathProgInterface
using Distances

function find_clusters(K, ds)
    N = size(ds, 1)
    rho = pairwise(Euclidean(), ds')

    m = Model(solver=GLPKSolverMIP())
    @variable(m, y[1:N], Bin)
    @variable(m, x[1:N,1:N], Bin)

    @objective(m, Min, sum{x[i,j]*rho[i,j], i=1:N, j=1:N})

    @constraint(m, sum(y) == K)
    for i in 1:N
        @constraint(m, sum(x[i,:]) == 1)
    end
    for i in 1:N, j in 1:N
        @constraint(m, x[i,j] <= y[j])
    end

    solve(m)
    return getvalue(x), getvalue(y), getobjectivevalue(m)
end

function exec()
    ds = readcsv("test.txt")
    N = size(ds, 1)
    for K in 6:-1:2
        println("--- K=$K ---")
        @time x, y, obj = find_clusters(K, ds)
        println("Objective: $obj")
        println("Centers:")
        for i in 1:N
            y[i] > 0.0 && println("\t$i:\t", ds[i,:])
        end
    end
end

exec()

and here is the R code:

library(glpkAPI)

find_clusters <- function(K, ds) {
    N <- nrow(ds)
    
    rho <- cbind(expand.grid(i = 1:N, j = 1:N),
        dist = as.numeric(as.matrix(dist(ds, upper=TRUE, diag=TRUE))))
    write.table(x = rho, file = "dist.csv",
                sep = ",", dec = ".",
                col.names = TRUE, row.names = FALSE)

    #dat file
    cat("data;\n", file = "kmedoids.dat", append = FALSE)
    cat(sprintf("param pN := %s;\n", N),file="kmedoids.dat", append=T)
    cat(sprintf("param pK := %s;\n", K),file="kmedoids.dat", append=T)
    cat("end;", file = "kmedoids.dat", append = TRUE)

    wk <- mplAllocWkspGLPK()
    termOutGLPK(GLP_OFF)
    model <-  mplReadModelGLPK(wk, "kmedoids.mod", TRUE)
    data <- mplReadDataGLPK(wk, "kmedoids.dat")
    mplGenerateGLPK(wk)
    lp <- initProbGLPK()
    mplBuildProbGLPK(wk, lp)
    scaleProbGLPK(lp, GLP_SF_AUTO)
    setDefaultMIPParmGLPK()
    setSimplexParmGLPK(parm = c(MSG_LEV), val = c(GLP_MSG_OFF))
    setMIPParmGLPK(parm = c(MSG_LEV), val = c(GLP_MSG_OFF))
    solveSimplexGLPK(lp)
    solveMIPGLPK(lp)    
    
    obj  <- mipObjValGLPK(lp)
    all_values <- mipColsValGLPK(lp)
    all_names <- rep(NA_character_, length(all_values))
    for (i in 1:length(all_values)) {
        all_names[i] <- getColNameGLPK(lp, i)
    }
    sol <- data.frame(var = all_names, val = all_values)
    
    return(list(
      obj = obj,
      y = sol[grep(pattern = "vSegmentCenter", x = all_names), ],
      x = sol[grep(pattern = "vPointToSegment", x = all_names), ]))
}


ds <- read.table(file = "test.txt", sep = ",",
                 dec = ".", header = TRUE)
N <- nrow(ds)

for (K in seq(6,2)) {
  print(sprintf("--- K = %s ---", K))
  start_time <- Sys.time()
  sol <- find_clusters(K, ds)
  calc_duration <- Sys.time() - start_time
  print(sprintf("    %s sec", calc_duration))
  print(sprintf("Objective: %s", sol$obj))
  print("Centers:")
  print(ds[sol$y$val > 0, ])
}

which requires auxiliary kmedoids.mod file:

param pN;
param pK;
set pPoints := 1..pN;
set pSegments := 1..pK;

param pDist{i in pPoints, j in pPoints} >= 0;

table T_dist IN "CSV" "dist.csv":
[i,j], pDist ~ dist;

var vPointToSegment{i in pPoints, j in pPoints}, binary;
var vSegmentCenter{i in pPoints}, binary;

minimize total_cost: sum{i in pPoints, j in pPoints} (vPointToSegment[i,j] * pDist[i,j]);

subject to

cPointToOnlyOneSegment{i in pPoints}:
sum{j in pPoints} vPointToSegment[i,j] = 1;

cSegmentsCnt:
sum{i in pPoints} vSegmentCenter[i] = pK;

cPoinOnlyToActiveSegment_1{i in pPoints, j in pPoints}:
vPointToSegment[i, j] <= vSegmentCenter[j];

cPoinOnlyToActiveSegment_2{j in pPoints}:
sum{i in pPoints} vPointToSegment[i, j] <= card(pPoints)*vSegmentCenter[j];

end;

I do not want to go through the code in detail - please judge it yourself. However, the two things are immediately apparent in my opinion:

  1. the ability to be able to specify the model within Julia beats hands down glpkAPI, where you have to use external file for specification of the model and external files to communicate data to the model;
  2. in Julia if we want to switch the solver from GLPK to any other (eg. CBC) the only line we would have to change is m = Model(solver=GLPKSolverMIP()). In R you have bindings to other solvers but they are much more low-level and most importantly you would have to rewrite most of the code.

6 comments:

  1. Great post! Nice to have the comparisons, pretty clear that Julia's syntax really wins the day here, is there much of a speed difference?

    Also, have you seen the ompr package (https://github.com/dirkschumacher/ompr) that Dirk Schumacher is working on? It's inspired by the JuMP project.

    ReplyDelete
    Replies
    1. Thank you for the comment about ompr package. This is exactly what I have been looking for! I will have to check it out in detail, but it looks very good.

      As for the speed - most of the time is spent in solver and the solver is the same in both cases.

      Delete
  2. I really understand that glpkAPI has poor designed interface in comparison with JuMP. But why R code is so unnecessary obfuscated and complicated in your example? Especially confusing is two type-castings to data.frame in 'gen_data'. In Julia you use matrix type for your data. Why don't you use matrix in R? Just for completeness - in real life R code for this tasks looks like this:

    gen_data = function(N, K, scaling) {
    alpha = 2 * pi * (1:N) / K
    sin_pi_K = 2 * sin(pi / K)
    x = matrix(rnorm(n = 2*N), nrow = N)
    x + cbind(cos(alpha), sin(alpha)) * scaling / sin_pi_K
    }

    set.seed(1)
    x = gen_data(200, 4, 5)

    library(cluster)
    lapply(2:6, function(k) pam(x, k = k))

    ReplyDelete
    Replies
    1. Thanks for a nicer version of gen_data func!
      Regarding pam. The post is about optimization in general. pam is limited in functionality and do not handle side constraints. For a real life problem you can check my presentation I gave at EARL 2014 (https://goo.gl/YY1d31).

      Delete
  3. An absolutely killer-feature of Julia JuMP is the ability to define lazy-constraints via callbacks in a solver-independent manner. Many of the best algorithms for problems make use of such techniques (such as subtour-elimination in Travelling Salesman) and the ability to easily program them is something that I have not seen in any library other than JuMP.

    ReplyDelete