Python Multiprocessing – Approaches and Considerations

The Python multiprocessing module provides functionality for distributing work between multiple processes, taking advantage of multiple CPU cores and larger amounts of available system memory. When analyzing or working with large amounts of data in ArcGIS, there are scenarios where multiprocessing can improve performance and scalability. However, there are also many cases where multiprocessing can negatively affect performance, and even some instances where it should not be used.

There are two approaches to using multiprocessing for improving performance or scalability:

  • Processing many individual datasets
  • Processing datasets with many features

The goal of this article is to share simple coding patterns for performing multiprocessing effectively with geoprocessing. It also covers relevant considerations and limitations that are important when attempting to implement multiprocessing.

1. Processing large numbers of datasets

The first example performs a specific operation on a large number of datasets in a workspace or set of workspaces. When there are many datasets, taking advantage of multiprocessing can help get the job done faster. The following code uses the multiprocessing module to define a projection, add a field, and calculate the field for a large list of shapefiles. The script creates a pool of processes equal to the number of CPUs or CPU cores available, and that pool is then used to process the feature classes.

import os
import re
import multiprocessing
import arcpy

def update_shapefiles(shapefile):

    # Define the projection to WGS84; the factory code is 4326.
    arcpy.management.DefineProjection(shapefile, 4326)

    # Add a field named CITY of type TEXT.
    arcpy.management.AddField(shapefile, 'CITY', 'TEXT')

    # Calculate field 'CITY', stripping '_base' from the shapefile name.
    city_name = shapefile.split('_base')[0]
    city_name = re.sub('_', ' ', city_name)
    arcpy.management.CalculateField(shapefile, 'CITY',
                                    '"{0}"'.format(city_name.upper()),
                                    'PYTHON')

# End update_shapefiles

def main():

    # Create a pool class and run the jobs; the number of jobs
    # is equal to the number of shapefiles.
    workspace = r'C:\GISData\USA\usa'
    arcpy.env.workspace = workspace
    fcs = arcpy.ListFeatureClasses('*')
    fc_list = [os.path.join(workspace, fc) for fc in fcs]
    pool = multiprocessing.Pool()
    pool.map(update_shapefiles, fc_list)

    # Synchronize the main process with the job processes to ensure proper cleanup.
    pool.close()
    pool.join()

# End main

if __name__ == '__main__':
    main()

2. Processing an individual dataset with many features and records

This second example looks at geoprocessing tools that analyze an individual dataset containing many features and records. In this situation, multiprocessing can help by splitting the data into groups that are processed simultaneously. For example, finding identical features may be faster when a large feature class is split into groups based on spatial extents. The following code uses a pre-defined fishnet of polygons covering the extent of one million points (Figure 1).

Figure 1: A fishnet of polygons covering the extent of one million points.

import multiprocessing
import arcpy

def find_identical(oid):

    # Create a feature layer for the tile in the fishnet.
    tile = arcpy.management.MakeFeatureLayer(
        r'c:\testing\testing.gdb\fishnet', 'layer{0}'.format(oid[0]),
        """OID = {0}""".format(oid[0]))

    # Get the extent of the feature layer and set the extent environment.
    tile_row = arcpy.SearchCursor(tile)
    geometry = tile_row.next().shape
    arcpy.env.extent = geometry.extent

    # Execute Find Identical.
    identical_table = arcpy.management.FindIdentical(
        r'c:\testing\testing.gdb\random1mil',
        r'c:\cursortesting\identical{0}.dbf'.format(oid[0]),
        'Shape')
    return identical_table.getOutput(0)

# End find_identical

def main():

    # Create a list of OIDs used to chunk the inputs.
    fishnet_rows = arcpy.SearchCursor(
        r'c:\testing\testing.gdb\fishnet', '', '', 'OID')
    oids = [[row.getValue('OID')] for row in fishnet_rows]

    # Create a pool class and run the jobs; the number of jobs
    # is equal to the length of the oids list.
    pool = multiprocessing.Pool()
    result_tables = pool.map(find_identical, oids)

    # Merging all the temporary output tables is optional;
    # omitting this step can improve performance.
    arcpy.management.Merge(
        result_tables, r'c:\cursortesting\ctesting.gdb\find_identical')

    # Synchronize the main process with the job processes to ensure proper cleanup.
    pool.close()
    pool.join()

# End main

if __name__ == '__main__':
    main()

There are tools that do not require the data to be split spatially. The Generate Near Table example below shows the data processed in groups of 250,000 features by selecting them based on object ID ranges.

import multiprocessing
import arcpy

def generate_near_table(oid_range):

    i, j = oid_range

    # Create a feature layer holding only the features in this OID range.
    lyr = arcpy.management.MakeFeatureLayer(
        r'c:\testing\testing.gdb\random1mil', 'layer{0}'.format(i),
        """OID >= {0} AND OID <= {1}""".format(i, j))

    # Execute Generate Near Table on the selection.
    gn_table = arcpy.analysis.GenerateNearTable(
        lyr, r'c:\testing\testing.gdb\random10000',
        r'c:\testing\outnear{0}.dbf'.format(i))
    return gn_table.getOutput(0)

# End generate_near_table function

def main():

    oid_ranges = [[0, 250000], [250001, 500000],
                  [500001, 750000], [750001, 1000001]]
    arcpy.env.overwriteOutput = True

    # Create a pool class and run the jobs.
    pool = multiprocessing.Pool()
    result_tables = pool.map(generate_near_table, oid_ranges)

    # Merging the resulting tables is optional and can add overhead if not required.
    arcpy.management.Merge(
        result_tables, r'c:\cursortesting\ctesting.gdb\generate_near_table')

    # Synchronize the main process with the job processes to ensure proper cleanup.
    pool.close()
    pool.join()

# End main

if __name__ == '__main__':
    main()

Considerations

Here are some important considerations before deciding to use multiprocessing:

The scenario demonstrated in the first example will not work with feature classes in a file geodatabase because each update must acquire a schema lock on the workspace. A schema lock effectively prevents any other process from simultaneously updating the geodatabase. The example will, however, work with shapefiles and ArcSDE geodatabase data.
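If you are not sure whether a dataset can be updated at a given moment, the arcpy.TestSchemaLock function provides a point-in-time check. Here is a minimal sketch; the geodatabase path and field name are made up for illustration:

import arcpy

# Hypothetical dataset path. TestSchemaLock returns True if a schema lock
# could be acquired at the moment of the call. Note that another process can
# still take the lock between this check and the actual edit.
dataset = r'C:\data\usa.gdb\cities'

if arcpy.TestSchemaLock(dataset):
    arcpy.management.AddField(dataset, 'CITY', 'TEXT')
else:
    print('Unable to acquire a schema lock; {0} is in use.'.format(dataset))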

For each process, there is a start-up cost of roughly one to three seconds to load the arcpy library. Depending on the complexity and size of the data, this can cause a multiprocessing script to take longer to run than the same script without multiprocessing. In many cases, the final step in a multiprocessing workflow is to aggregate all the results together, which is an additional cost.
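One practical way to weigh this start-up cost is to time the same job with and without a pool on a subset of your data. The following is a minimal, arcpy-free sketch; the worker function is a stand-in for a real geoprocessing task:

import time
import multiprocessing

def worker(item):
    # Stand-in for a real geoprocessing task. In a real worker module,
    # importing arcpy at the top is what incurs the per-process start-up cost.
    return item * item

def main():
    items = range(1000)

    start = time.time()
    serial_results = [worker(item) for item in items]
    print('Serial: {0:.2f} seconds'.format(time.time() - start))

    start = time.time()
    pool = multiprocessing.Pool()
    pooled_results = pool.map(worker, items)
    pool.close()
    pool.join()
    print('Pooled: {0:.2f} seconds'.format(time.time() - start))

if __name__ == '__main__':
    main()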

Determining whether multiprocessing is appropriate for your workflow is often a trial-and-error process. Trial and error can wipe out the gains made by multiprocessing in a one-off operation; however, it may be very valuable if the final workflow will be run multiple times or applied to similar workflows with large data. For example, if you run the Find Identical tool on a weekly basis and it runs for hours with your data, multiprocessing may be worth the effort.

Whenever possible, take advantage of the in_memory workspace for creating temporary data to improve performance. However, depending on the size of the data being created in memory, it may be necessary to write temporary data to disk instead; temporary datasets cannot be created in a file geodatabase because of schema locking. Deleting each in-memory dataset when you are finished with it can prevent out-of-memory errors.
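As a sketch of this pattern, intermediate data can be written to the in_memory workspace and deleted as soon as it is no longer needed; the input path here is made up for illustration:

import arcpy

# Hypothetical input feature class.
input_fc = r'C:\data\usa.gdb\cities'

# Write intermediate output to the in_memory workspace rather than to disk.
temp_fc = 'in_memory/cities_temp'
arcpy.management.CopyFeatures(input_fc, temp_fc)

# ... further processing against temp_fc goes here ...

# Delete the temporary dataset to release memory and avoid out-of-memory errors.
arcpy.management.Delete(temp_fc)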

Summary

These are just a few examples showing how multiprocessing can be used to increase performance and scalability when doing geoprocessing. However, it is important to remember that multiprocessing does not always mean better performance.

The multiprocessing module was included in Python 2.6, and the examples above will work in ArcGIS 10.0. For more information about the multiprocessing module, refer to the Python documentation.

Please provide any feedback and comments on this blog post, and stay tuned for another post coming soon about "Being successful processing large complex data with the geoprocessing overlay tools".

 

This post was contributed by Jason Pardy, a product engineer on the Analysis and Geoprocessing team.


7 Comments

  1. charles.morton says:

    Will this work for spatial analyst tools as well? I tried this a while back and I found no speed increase when multiprocessing spatial analyst tools, but hopefully I was just doing it wrong.
Thanks for the great summary. I can't wait to try these methods.
    Charles

  2. jpardy84 says:

    Hi Charles,

Thanks for your comment and question. Yes, these approaches can also be applied to Spatial Analyst tools. However, be aware that multiprocessing may not provide better performance, and the same considerations listed in this blog still apply.

    Jason

  3. charles.morton says:

    Jason,
When I apply multiprocessing to spatial analyst functions, I get a noticeable speed improvement (maybe because everything is queued up in memory?), but it appears that only one CPU is being used and only one file is processed at a time. On the ideas site, someone posted that you can't have concurrent processing of spatial analyst functions (http://ideas.arcgis.com/ideaView?id=087300000008HwnAAE), which seems to agree with what I am seeing. Can spatial analyst functions be run concurrently, or is the idea post incorrect?
    Also, if using spatial analyst tools, I have found you have to check out the spatial analyst extension and set all necessary environment parameters each time within the target function.
    Sorry if this is the wrong place to have this discussion, but I thought it might interest other people who are looking into using multiprocessing.
    Charles

  4. Hornbydd says:

    Jason,

I found this article very interesting as I often do a lot of data crunching. I wanted to teach myself with your first sample, so I took the code and tweaked it to go through a folder of rasters, setting their projection (something I do regularly). The code works fine if run from within PyScripter, but if I load it into the Python window in ArcMap and execute the same code, a pop-up window appears with the message:

    Could not find file: from multiprocessing.forking import main; main().mxd

You have to click OK, and an ArcMap application fires up, the same error window pops up, and the whole multiprocessing run stalls. The only way out is to use Task Manager to bomb out. Any ideas why this happens?

    Duncan

  5. jpardy84 says:

    Hi Duncan,

    Yeah, this will not work from the Python window in ArcGIS or as a script tool that runs in process. You can create a script tool and run your script out of process (there’s a checkbox in the script tool settings dialog) to get it to work. As our developer explains, ArcMap sets up a runtime environment that is not compatible with the way that multiprocessing bootstraps its runtime.

    Here’s a forum post that also may be of interest to you.

    http://forums.arcgis.com/threads/33602-Arcpy-Multiprocessor-issues

    Jason

  6. bjebn says:

    Hi Jason,

I have been researching the possibility of implementing multiprocessing techniques with arcpy, but the only examples I have found (including those above) involve processing with vector or management toolboxes. Is it feasible to apply multiprocessing techniques to spatial analyst tools, or is that not available?

    Bjebn

  7. samsung_460 says:

Your code in "1. Processing large numbers of datasets" is only correct theoretically. I used your workflow to tile rasters (clip within a rectangular extent) in a pool of worker processes. There are 130 raster layers, which come from an .mxd file (arcpy.mapping.listRasters()). My work computer has 64 cores. Unless I set processes to 1 in "pool = Pool(processes = processors)", the tiling processing quits partway through. What is the reason?
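For readers following the Spatial Analyst discussion in the comments above, here is a minimal sketch of the pattern Charles describes: each worker process checks out the extension and sets its environments itself, because those settings are not inherited from the parent process. The tool choice, paths, and data are placeholders:

import multiprocessing
import arcpy
from arcpy.sa import Slope

def compute_slope(dem_path):
    # Each worker runs in its own Python interpreter, so the extension
    # checkout and environment settings must happen here, inside the worker.
    arcpy.CheckOutExtension('Spatial')
    arcpy.env.overwriteOutput = True
    out_raster = Slope(dem_path)
    out_path = dem_path + '_slope'
    out_raster.save(out_path)
    arcpy.CheckInExtension('Spatial')
    return out_path

def main():
    # Hypothetical list of input rasters.
    dems = [r'C:\data\dem1', r'C:\data\dem2', r'C:\data\dem3']
    pool = multiprocessing.Pool()
    results = pool.map(compute_slope, dems)
    pool.close()
    pool.join()
    print(results)

if __name__ == '__main__':
    main()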