Big Data ST_Geometry Queries up to 20X Faster in Hive

The Big Data development team at Esri is excited to announce a major performance speedup in ST_Geometry for Hive, which is part of Esri’s open-source Spatial Framework for Hadoop.  The amount of performance gain depends on the type of spatial query run and on the size of the table in Hive.  The biggest gain comes with relational operations such as ST_Contains and ST_Overlaps.  In general, the performance gain will be greater with larger tables — exactly where it helps the most.


In testing by the developer, queries ran 3 to almost 20 times as fast on the new version of the Spatial Framework, as compared to the previous version.  Specifically, as shown in the chart, the query from the point-in-polygon aggregation sample, on 173 million points of New York City taxi cab data, ran about 20 times as fast in the updated version.

SELECT counties.name, count(*) cnt FROM counties
JOIN taxi_trips
WHERE ST_Contains(counties.boundaryshape, ST_Point(taxi_trips.pickup_longitude, taxi_trips.pickup_latitude))
GROUP BY counties.name
ORDER BY cnt desc;

This big performance gain in the Spatial Framework for Hadoop is available now on Github.

Programmers interested in the internals can see all the details on Github.

Kudos to Randall Whitman from the Big Data team for supplying the info for this post.

This entry was posted in Geodata and tagged , , , , , , . Bookmark the permalink.

Leave a Reply

One Comment

  1. randallwhitman says:

    Credits to Mike Park and Sarah Ambrose for the sample query, the performance data, and the chart.