Be successful overlaying large, complex datasets in Geoprocessing

Spatial datasets often contain very large numbers of features. They may also contain individual features, such as road casings, that are themselves very large and complex.

With such large datasets, massive amounts of feature overlap (both in the number of overlapping features and in their complexity) within feature classes can cause a geoprocessing tool to fail or perform poorly.  These failures are often due to system resource limitations, mismanagement of resources, “bad” data, or massive overlap in an area that cannot be processed within the available resources.  Failure messages such as “Invalid Topology [4GB_FILE_LIMIT]” or “Out of memory” can occur.  In rare cases, an application crash is possible.

At 10.1, significant work was done to improve the processing of large, complex data.  We removed internal limitations and overhauled the overlay engine’s memory management.  Some overlay processes that previously failed can now complete successfully, and with reasonable performance.  Following the guidelines and best practices below will further improve the odds that your large overlay processes complete successfully.

  1. System requirements. For very large processing jobs, the minimum system requirements specified for running ArcGIS may not be enough. The more RAM and virtual memory available, the more likely your process will succeed and perform adequately.
  2. Run large overlay operations without interference from other processes. Do not start other applications or processes while an overlay operation is running.  Before starting any overlay operation, we query the system to see how much memory is available (details here).  If the amount of available memory changes significantly after the overlay operation has begun (for example, because another application was started), the operation can fail.
  3. Input data requirements. Your input data should meet the data requirements recommended for the geodatabase.
  4. Check for bad geometry. Run the CheckGeometry tool to make sure the geometries contain no errors; the onus is on the data’s consumer to ensure that a feature class contains valid geometries before it is used in analysis.  Unexpected errors or results may be generated if bad geometries are used.  Run the RepairGeometry tool to fix any geometry issues reported by CheckGeometry.  See “Checking and repairing geometries” for more details.
  5. Data format limitations. Shapefiles and Personal Geodatabases have fairly small limits on the size of feature class they can store (roughly 2 GB in each case).  If the output of your overlay process exceeds those limits, the operation will fail.  A File Geodatabase has much larger feature class size limits (1 TB by default) and is often a better choice for your output data.
  6. Do not use in_memory feature classes for large overlay output.  Writing feature classes to the in_memory workspace consumes the same memory that overlay processing needs to complete successfully.  Only use the in_memory workspace for storing feature classes when the data being processed is fairly small.
  7. Check for huge features. Some individual features are so large that they will not fit into memory for processing.  Check for such features and, if necessary, break them into smaller features using the Dice tool (new at ArcGIS 10).  For more information on identifying very large individual features, see the Dicing Godzillas blog post.
  8. Analyze your feature vertex density. If your features have an extreme number of vertices (e.g., millions), analyze your workflow to determine whether that level of vertex detail is actually required.  If it is not, simplifying your data to reduce the number of vertices that must be processed can improve performance considerably.
  9. Analyze your Geodatabase design.  Rethinking how you store and use your data (review all topics in this section of the help documentation) can be very beneficial in avoiding performance and functional issues down the road.  Can the data be broken up into several feature classes rather than kept in one?  Can the data be simplified for this analysis without affecting the results?  Is all of the data required to answer this particular analysis question?
  10. Spatial Reference.  Consider the spatial reference and ask whether it is appropriate for your analysis.  See this whitepaper for how the geodatabase stores coordinate geometry and the impact your choices can have on storage and analysis.
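The data-quality steps above (items 4 and 7) can be sketched as a small preflight routine. This is a hedged sketch, not an official workflow: the feature class and table paths are placeholders, and the `CheckGeometry`, `RepairGeometry`, and `Dice` calls assume an ArcGIS 10.1 arcpy install.

```python
def preflight(fc, check_table, diced_fc=None, vertex_limit=50000):
    """Validate geometries before a large overlay, repairing any problems
    and (optionally) dicing very large features into smaller ones."""
    import arcpy  # deferred so the sketch can be read without ArcGIS installed

    # Report invalid geometries to a table, then repair them in place.
    arcpy.CheckGeometry_management(fc, check_table)
    if int(arcpy.GetCount_management(check_table)[0]) > 0:
        arcpy.RepairGeometry_management(fc)

    # Optionally break huge features ("Godzillas") into smaller pieces.
    if diced_fc is not None:
        arcpy.Dice_management(fc, diced_fc, vertex_limit)
        return diced_fc
    return fc
```

Calling something like `preflight("roads", "roads_check", "roads_diced")` before the overlay run means the overlay tools never see invalid or oversized geometries; the 50,000-vertex limit is only an illustrative default.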


Additional Recommendations

Use a 64bit operating system with plenty of physical memory

For very large processing jobs, we recommend running ArcGIS on a 64bit operating system with an ample amount of RAM, for two reasons.

First, using Windows 7 as an example, the 32bit version of Windows can access up to 3 GB of memory (with the user memory boot option increased; see below).  The 64bit version of the operating system can access up to 192 GB of physical memory, a very large difference in the operating system’s ability to handle very large processing loads.

Second, when 32bit applications are run on 64bit Windows they can access more memory (see here); 32bit applications such as ArcGIS Desktop can access almost twice as much memory when run on a 64bit operating system (because they are Large Address Aware; see below for more information).

Use a 64bit offering of ArcGIS

Using a 64bit application allows your large processes to take advantage of much more memory (when available) than a 32bit version of the application can.

ArcGIS for Server is a natively built 64bit application.  Running your overlay tools from Python scripts against the ArcGIS for Server binaries lets you take advantage of the 64bit application memory architecture.  You do not need to start the ArcGIS for Server process or run these scripts as services; you simply need to run your Python scripts in an environment where they have access to the ArcGIS for Server 64bit libraries and a 64bit Python install.
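In practice this just means invoking the 64bit Python interpreter directly. The install path below is an assumption for a default ArcGIS 10.1 for Server setup; adjust it to your machine.

```shell
REM Run an overlay script with the 64bit Python installed alongside
REM ArcGIS for Server (path is an assumption -- adjust to your install).
C:\Python27\ArcGISx6410.1\python.exe C:\scripts\big_overlay.py
```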

Starting at 10.1 SP1, 64bit background processing will be available in ArcGIS Desktop.  Instead of using an ArcGIS for Server 64bit install to run your process, you can install the 64bit background geoprocessing package for ArcGIS Desktop (provided you are on 64bit Windows) and run your process using 64bit background processing in ArcGIS Desktop.  Although background processing allows you to keep using ArcGIS Desktop while your process works in the background, you should not perform any other analysis or start any other processes while a large overlay process is running in the background; doing so may interfere with the overlay operation’s success.

32bit Python executable – make it Large Address Aware

Because the Python installed with ArcGIS Desktop (or any other 32bit ArcGIS application) is 32bit, running a script from the command line, the Python prompt, or a Python IDE limits the Python process to just 2 GB of RAM.  32bit ArcGIS applications such as ArcMap are built Large Address Aware (LAA), which allows them to use addresses beyond 2 GB (3 GB on 32bit Windows, 4 GB on 64bit Windows).  If you want your script to take advantage of this larger address space, to possibly improve performance and scalability, you can run it from the ArcMap Python window: because it then runs inside a parent process that is LAA, it too benefits from the larger address space.
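A quick way to tell which interpreter, and therefore which address-space regime, a script is running under is to check the pointer size. A minimal, self-contained check:

```python
import struct
import sys

# Pointer size in bits: 32 for a 32-bit interpreter (2 GB by default,
# up to 4 GB when the executable is Large Address Aware on 64-bit Windows),
# 64 for a 64-bit interpreter (no practical address-space cap).
bits = struct.calcsize("P") * 8
print("Running %d-bit Python %s" % (bits, sys.version.split()[0]))
```

Logging this at the top of a long overlay script makes it obvious afterwards whether the job ran in the constrained 32bit address space or not.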

Note: On 32bit Windows systems you must also modify a boot option to allow applications built LAA to take advantage of the larger memory addresses.  For information on setting up 32bit Windows for a Large Address Aware application, see here, and be sure to look up the possible side effects of doing so.  On 64bit Windows systems no additional setup is needed.

What if you want to continue running Python scripts containing large overlay processing from the command line?  At this time, the 32bit builds of Python for Windows are not built LAA, and python.org has no plans to change that.  There is, however, a solution: if you would like to run your Python scripts from the command line rather than from within an ArcGIS application, you can alter the installed Python executables, python.exe and pythonw.exe, to be Large Address Aware.  The Python process can then take advantage of the extra address space for large, complex processing.

Please see www.python.org for details on rebuilding Python.  If you would rather modify the existing executables than rebuild Python from scratch, the Microsoft COFF Binary File Editor, EDITBIN.EXE (see http://msdn.microsoft.com/en-us/library/xd3shwhf.aspx), is recommended; it comes with Visual Studio.  Apply it to both python.exe and pythonw.exe under the Python install folder, and see the appropriate Microsoft documentation for using EDITBIN.
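The EDITBIN step might look like the following, run from a Visual Studio command prompt. The Python install path is an assumption; the `/LARGEADDRESSAWARE` flag and the DUMPBIN verification are standard Visual Studio tooling.

```shell
REM Flag the installed Python executables as Large Address Aware
REM (paths are assumptions -- point these at your own Python install).
editbin /LARGEADDRESSAWARE C:\Python27\ArcGIS10.1\python.exe
editbin /LARGEADDRESSAWARE C:\Python27\ArcGIS10.1\pythonw.exe

REM Verify: the header should now report
REM "Application can handle large (>2GB) addresses".
dumpbin /headers C:\Python27\ArcGIS10.1\python.exe | findstr /i "large"
```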

Multiprocessing and Overlay tools

You should not use multiprocessing to break up your data and send an overlay process for each batch of data to a separate core on one machine.  The engine that performs the overlay operation checks how much memory is available and then works within 60% of that amount.  Since all the cores share the same RAM, a second overlay operation asking for the available memory will get an inaccurate estimate.  If process #1 has not yet used all the memory it will require, process #2 gets an inflated value for the amount of available RAM, which can drive both processes out of memory.  If process #1 is already using its maximum amount of memory, process #2 gets a very low estimate and will work within 60% of that low number, which can cause performance problems (it may have to tile the data far more extensively to fit in the small amount of memory) and perhaps even failure.
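To see why two concurrent overlay processes on one machine mis-budget memory, consider a toy model of this behavior. The 60% figure comes from the text above; everything else here is an illustrative simulation, not the actual overlay engine:

```python
def overlay_budget(available_mb):
    """Each overlay process plans to work within 60% of the memory
    that appears available at the moment it starts."""
    return 0.6 * available_mb

# Machine with 8 GB free. Process #1 starts first and budgets 60% of it.
free = 8192.0
p1 = overlay_budget(free)            # 4915.2 MB

# Case A: process #2 starts before #1 has claimed its memory.
# It also sees ~8 GB "available" and budgets another ~4.9 GB, so the
# two plans together exceed physical memory.
p2_early = overlay_budget(free)
assert p1 + p2_early > free          # overcommitted: out-of-memory risk

# Case B: process #2 starts after #1 is using its full budget.
# It sees only the remainder and gets a very small budget, forcing
# heavy tiling (poor performance) or outright failure.
p2_late = overlay_budget(free - p1)  # 60% of ~3.2 GB
assert p2_late < p1 / 2
```

Either timing loses: starting early overcommits the machine, starting late cripples the second process. A single overlay run avoids both cases.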

It may be possible to send batches of features to separate nodes for processing, but it is still not recommended that you break up your data across separate overlay operations. Consider the following:

  1. You must merge all the overlay outputs.  After each node has generated its overlay output, use the Merge tool to combine all the results.
  2. After merging, you will most likely have to run the overlay operation again on the merged results to get output as close as possible to what a single overlay on the entire dataset would have produced.  Depending on how the batches of features were determined, overlap within one of the input feature classes may span batch boundaries, so an additional overlay is required to discover it.

The benefits of breaking up the data to take advantage of multiprocessing are most often lost to these two required steps.  If you are interested in the conditions under which you can take advantage of multiprocessing with GP tools, you can find guidance here.
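If you do split an overlay across nodes anyway, the two required cleanup steps above might look like this in arcpy. This is a hedged sketch: the feature class names are placeholders, and Union stands in here for whichever overlay tool produced the per-batch outputs.

```python
def combine_batches(batch_outputs, merged_fc, final_fc):
    """Merge per-node overlay outputs, then overlay the merged result
    once more to resolve overlaps that crossed batch boundaries."""
    import arcpy  # deferred so the sketch can be read without ArcGIS installed

    # Step 1: merge all the per-batch overlay outputs into one feature class.
    arcpy.Merge_management(batch_outputs, merged_fc)

    # Step 2: re-run the overlay (Union here) on the merged result so
    # overlap between features from different batches is discovered.
    arcpy.Union_analysis([merged_fc], final_fc)
    return final_fc
```

Both steps read and write the full combined dataset, which is why they typically erase whatever time the parallel batches saved.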

This post was contributed by Ken Hartling, a product engineer on the Analysis and Geoprocessing team.

This entry was posted in Analysis & Geoprocessing, Python.


6 Comments

  1. nilsbabel says:

    Thanks for the great tips. One question: are python scripts run from an ArcToolbox script tool LAA?

  2. sjones says:

    Nice and instructional best practices for working with GP Server tasks.

  3. safraeli says:

    Hi Ken and thank you very much for the important posting.

    Can you kindly provide us with some numbers?
    What would you call a “large feature class”?
    How many vertices are in a “large feature”?

    Thanks,
    Eli Safra

    • KenH says:

      Hi Eli,
      We don’t provide any prediction of what makes an input feature class, or an individual feature, large enough to make the overlay process really large. It’s much more complicated than just the number of features or the size of the feature being processed. The resources available to the process also play a huge part in determining what is a small or large processing job, as does the complexity of the interactions between the features in the overlay operation within those resources. With all the variables involved there is a nearly infinite number of figures we could quote, so providing specific statistics wasn’t deemed helpful. It would be very misleading to do so, and attempting to run a prediction might take longer than actually running the job!

      We consider a feature class or an individual feature large for an overlay process when it presents a problem with processing the data within the available resources. For example, the creation or use of Godzillas: http://blogs.esri.com/esri/arcgis/2010/07/23/dicing-godzillas-features-with-too-many-vertices/ . If processing your data works efficiently on your current configuration, it’s not considered large.

      Ken