Groovy for data science, a killer application for Groovy?


Re: Groovy for data science, a killer application for Groovy?

Russel Winder-3

On Mon, 2015-01-12 at 14:20 +0100, Paolo Di Tommaso wrote:

> :)
>
>
> However, I disagree on this. For example, in the genomics field,
> though the most used programming languages are C and Python, there
> are really important pieces of software that are written in Java.
>
> In this context Groovy could really play a role as an easy-to-use
> scripting alternative to Java, with well-founded parallelisation
> libraries and good performance (not as fast as C code, but much
> better than Python).
>

Java performance often surpasses that of C, C++, D and Fortran. There
is no "always" on this any more. Indeed, Python + Numba is as fast as C
for many CPU-bound activities.

And PyPy runs CPU-bound Python code 5 to 30 times faster than CPython.

--
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:[hidden email]
41 Buckmaster Road    m: +44 7770 465 077   xmpp: [hidden email]
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder


---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email



Re: Groovy for data science, a killer application for Groovy?

Dylan Cali
On Mon, Jan 12, 2015 at 8:28 AM, Russel Winder <[hidden email]> wrote:
>
> Java performance often surpasses that of C, C++, D and Fortran. There
> is no "always" on this any more. Indeed, Python + Numba is as fast as C
> for many CPU-bound activities.

To follow up on the performance aspect of this conversation, I took
the time to read JEP 191, which Jochen mentioned. Unless I'm missing
something, this seems to be just a more streamlined, first-class libffi
interface. I'm confused about the performance claims they make, since
FFI calls have a definite overhead. Do they just mean it will perform
better than the 'bolt-on' libffi JNI interfaces currently being used
(JNA/JNR)?

Using libffi everywhere that JNI is currently being used seems
dubious. Sometimes you need 'real' native-level bindings, which is
why Python has both ctypes and native extensions, after all.

It seems like there should be two JEPs: one for a first-class JVM FFI,
sure, but also one focused on improving (or replacing) JNI itself.


Re: Groovy for data science, a killer application for Groovy?

Jochen Theodorou
In the end, what makes JNI slow is, IMHO, the conversion of objects, the
calling conventions and the conversion of primitives. And of course there
is the issue of multithreading.

libffi would probably be able to address the calling-convention part.
What remains is the problem of converting objects. Though if we
degrade the JVM to a mere frontend, then I think we can do a lot here
with invokedynamic and work "directly" on the object or struct generated
by the native code. Again, FFI helps here.

Any object or primitive from Java that you use to talk to the backend
might require conversion. I think primitives won't hurt too much,
though doing a complete conversion is something to think about, but
Strings, for example, can be a problem. I am also thinking about, for
example, an unsigned 32-bit value. Would you keep that in an integer and
live with the number appearing wrong, to avoid the overhead of
conversions? And let us not forget that there are numbers with more than
64 bits too.
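
The unsigned 32-bit trade-off can be sketched in plain Java. This is only an illustration, assuming Java 8 or later for the unsigned helper methods on Integer and Long; the class name UnsignedDemo and its two helper methods are made up for the example and are not part of any proposal discussed in this thread.

```java
import java.math.BigInteger;

final class UnsignedDemo {
    // Hypothetical helper: reinterpret the 32 bits of an int as an unsigned
    // value by widening to long (this widening is the conversion overhead).
    static long toUnsigned32(int bits) {
        return Integer.toUnsignedLong(bits);
    }

    // Hypothetical helper: an unsigned 64-bit value does not fit in any Java
    // primitive, so it is converted to a BigInteger here.
    static BigInteger toUnsigned64(long bits) {
        return new BigInteger(Long.toUnsignedString(bits));
    }

    public static void main(String[] args) {
        int raw = 0xFFFFFFFF; // bit pattern of the unsigned value 4294967295

        System.out.println(raw);                                  // prints -1, the "wrong" signed view
        System.out.println(toUnsigned32(raw));                    // prints 4294967295
        System.out.println(Integer.toUnsignedString(raw));        // prints 4294967295
        System.out.println(Integer.compareUnsigned(raw, 1) > 0);  // prints true
        System.out.println(toUnsigned64(-1L));                    // prints 18446744073709551615
    }
}
```

Keeping the raw bits in an int costs nothing at the call boundary, but every print or comparison then has to go through the unsigned helpers, so the overhead is moved rather than removed.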

And of course there is a lot more...

bye Jochen



--
Jochen "blackdrag" Theodorou - Groovy Project Tech Lead
blog: http://blackdragsview.blogspot.com/
german groovy discussion newsgroup: de.comp.lang.misc
For Groovy programming sources visit http://groovy-lang.org



Re: Groovy for data science, a killer application for Groovy?

Eric MacAdie-2
To take this in another direction:

Could use by sysadmins be a killer app/niche for Groovy?

Anecdotally, I have noticed there seem to be some shops that have their main app in Java, and their sysadmins/dev-ops people use Ruby or Python. It seems that Groovy would be a good fit for those Java shops.

= Eric MacAdie



Re: Groovy for data science, a killer application for Groovy?

Jochen Theodorou
In reply to this post by Russel Winder-3
On 06.01.2015 at 00:38, Russel Winder wrote:

> I hate to be a damper on this and all the other supportive posts, but
> wishing things to be true will achieve nothing. Data science is full of
> PhD statistics folks who like R or perhaps
> Python/SciPy/Matplotlib/Pandas. Trust me, I run workshops for these folk.
> Currently "data science" is an R, Python, Julia place, mostly because
> the infrastructure is already there and everyone already uses R, Python,
> Julia.
>
> The core issue is that R, Python, Julia already have the infrastructure
> for analysing and (more importantly) visualizing data and algorithms
> over it. I am sure the JVM and Groovy can do this, but they don't have
> the systems these folk use today.
>


Coming back to that again: did you ever look at projects such as
Renjin (R on the JVM)?

bye blackdrag




Re: Groovy for data science, a killer application for Groovy?

simonz
I am a bit late to this conversation, but I thought I would mention that I did make a kind of experimental framework for enabling some "groovy"-ish features for data science on top of commons-math:


It's pretty simple but gives a basic "data frames" style framework for Groovy.

As I said, I consider this an "experiment", but I do use it extensively myself as part of a broader bioinformatics package for Groovy. Having done so, I tend to agree with others that there are some basic challenges for Groovy, and really for any JVM-based package.

Firstly, there is not a wealth of established, peer-reviewed algorithms for data analysis. There are some, but not nearly what you find for Python and R.

The second problem, though, is deeper: the actual data types and computational operations employed by the JVM are not designed to be numerically robust and stable in the way that true numerical computing requires. I run the same algorithm in R and in Groovy and I get different results. Sometimes the results are highly divergent; more often than not the Groovy version has gone to infinity or NaN while R or Python sail through, producing a reasonable result from the same algorithm. Python achieves this by completely re-inventing data types and operations via NumPy. Essentially, we need the same kind of thing for the JVM if we are to be serious in the scientific world.
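
To make the infinity and NaN outcomes concrete, here is a small Java sketch of how one formula can overflow in a naive formulation and stay finite in a rearranged one. It is not the algorithm from my package: the log-sum-exp function, the class name LogSumExpDemo and the input values are assumptions chosen purely for illustration, and whether this kind of difference explains any particular divergence from R or Python would need checking case by case.

```java
final class LogSumExpDemo {
    // Naive form: Math.exp overflows to Infinity for large inputs,
    // so the logarithm of the sum is Infinity as well.
    static double naive(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += Math.exp(x);
        }
        return Math.log(sum);
    }

    // Shifted form: subtracting the maximum keeps every exponent <= 0,
    // so no intermediate value can overflow.
    static double stable(double[] xs) {
        double max = Double.NEGATIVE_INFINITY;
        for (double x : xs) {
            max = Math.max(max, x);
        }
        double sum = 0.0;
        for (double x : xs) {
            sum += Math.exp(x - max);
        }
        return max + Math.log(sum);
    }

    public static void main(String[] args) {
        double[] xs = {1000.0, 1000.0};
        System.out.println(naive(xs));   // prints Infinity
        System.out.println(stable(xs));  // about 1000.693, i.e. 1000 + ln 2
        // Dividing two overflowed values gives NaN rather than 1.
        System.out.println(Math.exp(1000.0) / Math.exp(1000.0)); // prints NaN
    }
}
```

Both methods use ordinary double arithmetic; only the order of operations differs.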

Cheers,

Simon

Re: Groovy for data science, a killer application for Groovy?

Dylan Cali
On Thu, Jan 22, 2015 at 7:06 PM, Simon <[hidden email]> wrote:

> The second problem though is
> deeper, and it is that the actual data types and computational operations
> employed by the JVM are not designed to be numerically robust and stable in
> the way that true numerical computing requires. I run the same algorithm in
> R and in Groovy and I get different results. Sometimes highly divergent
> results, more often than not it's either gone to infinity or NaN while R or
> Python are sailing through producing a reasonable result from the same
> algorithm. Python achieves this by completely re-inventing data types and
> operations via Numpy. Essentially we need the same kind of thing for the JVM
> if we are to be serious in the scientific world.

So I have zero knowledge of what goes into the Groovy internals, but
does Groovy have to _only_ be a JVM language?  It sounds like the
limitations people are raising are more limitations with the JVM, and
not Groovy itself.

Obviously this is a pretty out-there idea, but if you look at Python,
Ruby, etc... these languages have versions that can run on the JVM, on
the CLR, as well as independent implementations.  Is it impossible for
there to ever be a non-JVM version of Groovy?

Yes, being able to use Groovy to leverage the JVM ecosystem is
definitely a killer feature, but IMHO Groovy itself is also a killer
feature :). As part of my job I have to program in a wide variety of
languages: Perl, Python, Java, Groovy, C#, C, C++, Tcl. But Groovy
is my favorite (hint: it's the only language whose mailing list I'm
on!). The language just makes really sensible design decisions, and
to the credit of the core devs, Groovy just continues to get better and
better.

Seamless integration with Java is a bonus (a really, really big bonus),
but what makes Groovy a pleasure to program in is Groovy itself.


Re: Groovy for data science, a killer application for Groovy?

Jochen Theodorou
In reply to this post by simonz
On 23.01.2015 at 02:06, Simon wrote:
[...]
> The second problem though is deeper, and it is that the actual data
> types and computational operations employed by the JVM are not designed
> to be numerically robust and stable in the way that true numerical
> computing requires.

If the problem is understood, we can do something about it. Whether the
result is then fast or not is another story, of course. Normally "true"
numerical computing takes the inexactness of number representations into
account; at least it was like this when I did physics simulations at
university. If you need "arbitrary precision", then of course you
should not use data types like float or double. BigDecimal would be a
better choice then, to get a more stable result.
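
A minimal Java sketch of the difference, with an invented class name PrecisionDemo and a deliberately trivial sum as the workload. Note that in Groovy a literal such as 0.1 is already a BigDecimal, whereas in Java the type has to be chosen explicitly as below.

```java
import java.math.BigDecimal;
import java.math.MathContext;

final class PrecisionDemo {
    // Add 0.1 ten times in binary floating point; 0.1 has no exact
    // binary representation, so the rounding errors accumulate.
    static double sumDouble() {
        double total = 0.0;
        for (int i = 0; i < 10; i++) {
            total += 0.1;
        }
        return total;
    }

    // The same sum in decimal arithmetic, where 0.1 is represented exactly.
    static BigDecimal sumDecimal() {
        BigDecimal step = new BigDecimal("0.1");
        BigDecimal total = BigDecimal.ZERO;
        for (int i = 0; i < 10; i++) {
            total = total.add(step);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumDouble() == 1.0);                          // prints false
        System.out.println(sumDecimal().compareTo(BigDecimal.ONE) == 0); // prints true

        // Non-terminating results such as 1/3 still need an explicit
        // precision; DECIMAL128 rounds to 34 significant digits.
        System.out.println(BigDecimal.ONE.divide(new BigDecimal("3"), MathContext.DECIMAL128));
    }
}
```

The comparison uses compareTo rather than equals because BigDecimal.equals also compares scale, so 1.0 and 1 would not be equal.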

> I run the same algorithm in R and in Groovy and I
> get different results.  Sometimes highly divergent results, more often
> than not it's either gone to infinity or NaN while R or Python are
> sailing through producing a reasonable result from the same algorithm.

Maybe you can show an example?

> Python achieves this by completely re-inventing data types and
> operations via Numpy. Essentially we need the same kind of thing for the
> JVM if we are to be serious in the scientific world.

Letting Groovy understand a new number type should be the least of the
problems, as long as it is 100% clear how it is supposed to behave.

bye blackdrag
