[ANN] Sparksoniq 0.9.5 "Larch"

From: "Ghislain Fourny" <gfourny@inf.ethz.ch>
To: "sparksoniq-users@googlegroups.com" <sparksoniq-users@googlegroups.com>, "talk@x..." <talk@x...>, "xml-dev@l..." <xml-dev@l...>
Date: Mon, 4 Mar 2019 10:56:24 +0000

Dear all,

We are happy to announce the latest alpha release of Sparksoniq.

Sparksoniq runs JSONiq queries on top of Spark, taking as input JSON data sets stored on distributed file systems such as (but not only) HDFS. Its goal is to increase productivity when querying heterogeneous, nested datasets that are challenging to handle with DataFrames.

JSONiq is the JSON brother of XQuery (XQuery - XML + JSON) and shares 90% of its DNA.

Sparksoniq is open source (Apache 2.0) and can be downloaded for free. The jar and the documentation can be found at http://sparksoniq.org/.


Since the announcement of our initial prototype last year, we have made the following progress:

- Many bug fixes following user feedback. It is becoming stable enough that we are considering moving to beta soon, and it has already been used in large classrooms.

- All FLWOR clauses are supported both in parallel and (new) locally. Locally means without invoking Spark transformations, that is, without parallelize() or json-file() calls.

- FLWOR expressions can fully nest, with the only exception that those that run in parallel cannot nest with each other (because Spark jobs do not nest).

E.g.:

for $i in json-file("hdfs://path/to/orders.json") (: this will be executed in parallel on that large file, split after HDFS blocks :)
where $i.customer eq "John Smith"
return {
  "total": sum($i.items[].amount),
  "sorted-items" : [
    for $j in $i.items[]
    order by $j.amount
    return $j
  ]
}

- We improved the memory footprint: in particular, filtering queries are streamed through (within a task) rather than materialized.

- We worked on performance: it can handle files of 10,000,000+ objects on a regular laptop for count, filtering, grouping and ordering with a local Spark execution. Performance also improved noticeably when querying bigger datasets on clusters (tested with several billion objects on 64 machines).
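
To illustrate the local execution mode mentioned above, here is a minimal sketch (an illustrative query, not taken from the Sparksoniq documentation). Since it contains no parallelize() or json-file() call, it should be evaluated locally, without Spark transformations. The input sequence and the field names "n" and "square" are made up for the example.

```jsoniq
(: hypothetical example: no parallelize() or json-file(), so no Spark job is expected :)
for $i in 1 to 5
let $square := $i * $i
where $square gt 4
order by $square descending
return {
  "n": $i,
  "square": $square
}
```

Wrapping the input as parallelize(1 to 5) would instead be expected to run the same clauses in parallel.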

Feedback is, as always, appreciated.

Kind regards
Ghislain
