Tuesday, January 6, 2015

Sequence generation in R, Python and Julia

Recently I was comparing implementation of sequence generation functions in R, Python (numpy) and Julia. Interestingly even such basic functions have slight differences in implementation. In my opinion Julia provides the best solution and Python the worst.

The test case will be a 6 element sequence from 0.1 to 1.1 namely [0.1, 0.3, 0.5, 0.7, 0.9, 1.1]. In all languages we can generate it in three ways: by direct specification, by sequence giving its length and by sequence giving its step (there actually three flavors here: with exact and inexact right bound).

The test codes follow but let me first give the conclusions:

  • seq will generate the same sequence in all cases (even if to parameter is a bit too small - see seq help for details about 'just beyond', but then the last value in the sequence is corrected);
  • the sequence is not guaranteed to be equal to manually entered literals that represent the sequence most exactly;
  • if you generate the same sequence in the reversed order the result does not have to be equal.
  • it matters if we generate sequence using linspace or arange;
  • arange excludes right end of the range specification; this actually can result in unexpected results; check numpy.arange(0.2, 0.6, 0.4) vs numpy.arange(0.2, 1.6, 1.4);
  • the sequence is not guaranteed to be equal to manually entered literals that represent the sequence most exactly;
  • if you generate the same sequence in the reversed order the result does not have to be equal.
  • it may be important if we supply exact value of the last element of the sequence or it is inexact when we specify step (equivalent of by in R), check out function colon in range.jl;
  • sequence generated by step and by number of elements can differ even if they have the same endpoints;
  • Julia actually is able to generate sequence equal to sequence that is equal to manually entered literals (sequence using step with exact right bound);
  • generating the sequence in the reversed order gives you the same result.

Below you can check the codes I used. Here are the results for R:

x <- c(0.1, 0.3, 0.5, 0.7, 0.9, 1.1)
y1 <- seq(from = 0.1, to = 1.1, length.out = 6)
y2 <- seq(from = 0.1, to = 1.1, by = 0.2)
y3 <- seq(from = 0.1, to = 1.2, by = 0.2)
y4 <- seq(from = 0.1, to = 1.099999999999, by = 0.2)

print(x == y1)

print(x == y2)

print(y1 == y2)

print(y2 == y3)

print(y2 == y4)

print(rev(seq(1.1,0.1, len=6)) == seq(0.1, 1.1, len=6))

Now comes Python:

import numpy

x = numpy.array([0.1, 0.3, 0.5, 0.7, 0.9, 1.1])
y1 = numpy.linspace(0.1, 1.1, 6)
y2 = numpy.arange(0.1, 1.1, 0.2)
y3 = numpy.arange(0.1, 1.11, 0.2)

print(x == y1)
# [ True False  True False  True  True]

# [ 0.1  0.3  0.5  0.7  0.9]

print(x == y3)
# [ True False False False False False]

print(y1 == y3)
# [ True  True False  True False False]

print(list(reversed(numpy.linspace(1.1, 0.1, 6))) == y1)
# [ True  True  True  True False  True]

And finally Julia:

x = [0.1, 0.3, 0.5, 0.7, 0.9, 1.1]
y1 = linspace(0.1, 1.1, 6)
y2 = [0.1:0.2:1.1]
y3 = [0.1:0.2:nextfloat(1.1)]
y4 = [0.1:0.2:prevfloat(1.1)]

println(x .== y1)
# Bool[true,false,true,false,false,true]

println(x .== y2)
# Bool[true,true,true,true,true,true]

println(y1 .== y3)
# Bool[true,true,true,true,false,true]

println(y2 .== y3)
# Bool[true,false,true,false,true,true]

# [0.1,0.30000000000000004,0.5,0.7000000000000001,0.9]

println(linspace(0.1, 1.1, 6) .== reverse(linspace(1.1, 0.1, 6)))
# Bool[true,true,true,true,true,true]


  1. Julia is designed by people who really cares about these kind of issues, so it uses a nice algorithm to make floating point ranges work more intuitively. You should still note that there are fundamental issues with binary floating point numbers and decimal literals in code, so it is probably possible to find edge cases in Julia too.

  2. If you are interested in details you can out this discussion https://github.com/JuliaLang/julia/issues/9637.

  3. Hi,

    well, maybe I miss the point, why you prefer Julia. Julia is giving 6 times "false" and R produces 6 times FALSE. Equal. The point is, however, that every programmer should consider using "==" or ".==" with floating point numbers to be bad und to be avoided. Python producing the most unexspected behaviour has the biggest chance, that the programmer notices a bug in his code before delivering or real-case using it. +1 for python.
    Once you learned to avoid "==" with floats, the topic discussed here is of no relevance anymore (or, is it?).


  4. Thx for the comment. However, notice that I use == only to show the results and it is clear that you would not use == in practice.

    But consider two things (as an example):
    1) should it matter if you generate from low to hi or from hi to low (I would say it should not matter and in Python it does)
    2) should you have your generated sequence as evenly spaced as possible (I would say yes and for example in R you have 'just beyond' that breaks this this assumption)