This script demonstrates how to upload multiple PDF documents to DocumentCloud in parallel across your CPU cores using Python. It reads a list of PDF URLs from a CSV file, uploads each document to DocumentCloud, and saves the resulting document IDs back to the CSV.
It was created to demonstrate the parallelize function, which runs one function over many items concurrently and can significantly speed up work that splits into independent tasks. In practice, our data team at Reuters uses this function to spread tens of thousands of operations across dozens of CPU cores.
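To make the CSV round trip concrete, here is a minimal sketch of the read, upload, and write-back steps described above. It is not the script itself. Every name in it is hypothetical: `fake_upload` stands in for a real DocumentCloud upload call so the sketch runs without credentials or network access, `serial_map` stands in for parallelize with the same `(items, function)` call shape, and the `url` and `document_id` column names are assumptions.

```python
import csv


def fake_upload(url):
    """Hypothetical stand-in for a real DocumentCloud upload call.

    Returns a made-up ID so the sketch runs without credentials or network.
    """
    return f"id-for-{url.rsplit('/', 1)[-1]}"


def serial_map(items, func):
    """Placeholder with the same call shape as parallelize(items, func)."""
    return [func(item) for item in items]


def process_csv(csv_path, upload=fake_upload, run=serial_map,
                url_column="url", id_column="document_id"):
    # Read all rows up front; the column names are assumptions for this sketch.
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = list(reader.fieldnames or [url_column])
        rows = list(reader)

    # One upload per row. This assumes `run` returns results in input order.
    ids = run([row[url_column] for row in rows], upload)
    for row, doc_id in zip(rows, ids):
        row[id_column] = doc_id

    # Write the rows back to the same file with the ID column added.
    if id_column not in fieldnames:
        fieldnames.append(id_column)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return ids
```

Passing the upload function and the mapping function in as arguments keeps the CSV handling separate from the upload logic, so a real upload function and parallelize could presumably be swapped in for the two placeholders.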
A more basic example looks like this:
from main import parallelize

def your_function(item):
    # Process the item
    return item * 2

items = [1, 2, 3, 4, 5]
results = parallelize(items, your_function)
print(results)  # Output: [2, 4, 6, 8, 10]
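For readers curious how a helper with this call shape might work, the following is a rough sketch built on the standard library's `concurrent.futures`. It is an assumption about the general approach, not the actual parallelize implementation. The names `parallelize_sketch`, `max_workers`, `executor_cls`, and `double` are all invented for illustration.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def parallelize_sketch(items, func, max_workers=None,
                       executor_cls=ProcessPoolExecutor):
    """Apply func to every item using a pool of workers.

    Hypothetical sketch: defaults to one worker process per CPU core.
    Pass executor_cls=ThreadPoolExecutor to use threads instead.
    """
    workers = max_workers or os.cpu_count() or 1
    with executor_cls(max_workers=workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(func, items))


def double(item):
    # Defined at module level so worker processes can pickle it.
    return item * 2


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block on import.
    print(parallelize_sketch([1, 2, 3, 4, 5], double))  # [2, 4, 6, 8, 10]
```

With a process pool, the function and its arguments must be picklable, which is why `double` is a top-level function rather than a lambda. For network-bound work such as uploads, a thread pool may perform comparably, which is the reason the sketch leaves the executor class configurable.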