External Merge Sort in Python (Example)
This article shows how to implement the external merge sort algorithm in Python.
Table of contents:
The external merge sort algorithm is used for sorting large data sets that cannot fit into the main memory. It is a two-step process, where the first step involves dividing the data into smaller chunks and sorting them using a standard sorting algorithm (like quicksort, heapsort, etc.) in the main memory, while the second step involves merging the sorted chunks into a single sorted output file.
Let’s take a look at some Python codes in action!
Creation of Example Data & Module Importation
First, we need to import the relevant packages:
import heapq import random
As a next step, we’ll need to create some example data by executing the code below:
example_data = # insert example_data path here with open(example_data, 'w') as example: for x in range(1000): example.writelines(str(random.randint(0, 100000)) + '\n')
The code snippet will produce a text full of random integers in the range of [0, 10000]. You should enter a path for the example_data
file to generate the example data by yourself.
We’re done with the example generation, so let’s dive into the implementation!
Example: Implementing External Merge Sort
Our first function create_sorted_chunks()
takes an input_path
, and chunk_size
as input parameters. The input file is split into smaller txt files called chunks
, where each chunk contains chunk_size
amount of data.
Note: Since this algorithm is designed to handle vast amounts of data and lists have a limit they can handle, we will utilize the with open() clauses. These clauses are used to do read, write and append operations on files.
Let’s get straight into the first step of defining the algorithm’s function!
def create_sorted_chunks(input_path, chunk_size): chunks = [] with open(input_path, 'r') as file: chunk = [] for line in file: chunk.append(int(line.strip())) if len(chunk) == chunk_size: chunk.sort() chunks.append(chunk) chunk = [] # Handle the remaining elements in the last chunk if chunk: chunk.sort() chunks.append(chunk) return chunks
Now that we have separated the data into chunks and sorted them individually, we can continue with the second step, which merges the sorted temporary files into a single sorted output file!
The merge_sorted_chunks()
function uses the heapq module to manage the sorted chunks during the merge phase efficiently. Essentially, a heap is a data structure which makes accessing the minimum/maximum element of a set computationally cheaper.
In this code, we initialize a min-heap called heap
by taking the first element from each chunk (if it exists) and adding it as a tuple (value
, chunk_index
, element_index
) to the heap. We then use a while loop to repeatedly pop the smallest element from the heap and write it to the output file. If there are more elements remaining in the corresponding chunk, we push the next element from that chunk into the heap.
def merge_sorted_chunks(chunks, output_path): heap = [(chunk[0], i, 0) for i, chunk in enumerate(chunks) if chunk] heapq.heapify(heap) with open(output_path, 'w') as file: while heap: value, chunk_index, element_index = heapq.heappop(heap) file.write(str(value) + '\n') if element_index + 1 < len(chunks[chunk_index]): next_element = chunks[chunk_index][element_index + 1] heapq.heappush(heap, (next_element, chunk_index, element_index + 1))
Now that we have the functions to create sorted chunks and merge them, we can combine them into a complete External Merge Sort implementation. Here’s an example:
input_path = example_data output_path = # insert output path here chunk_size = 100 external_merge_sort(input_path, output_path, chunk_size)
To test the implementation yourself, don’t forget to set a valid input_path
, output_path
and adjust the chunk_size according to the data size you can handle in memory. Below you can see the template of how to set the parameters and execute the algorithm.
Here’s the complete code for your reference:
import heapq import random example_data = # insert example_data path here with open(example_data, 'w') as example: for x in range(1000): example.writelines(str(random.randint(0, 100000)) + '\n') def create_sorted_chunks(input_path, chunk_size): chunks = [] with open(input_path, 'r') as file: chunk = [] for line in file: chunk.append(int(line.strip())) if len(chunk) == chunk_size: chunk.sort() chunks.append(chunk) chunk = [] # Handle the remaining elements in the last chunk if chunk: chunk.sort() chunks.append(chunk) return chunks def merge_sorted_chunks(chunks, output_path): heap = [(chunk[0], i, 0) for i, chunk in enumerate(chunks) if chunk] heapq.heapify(heap) with open(output_path, 'w') as file: while heap: value, chunk_index, element_index = heapq.heappop(heap) file.write(str(value) + '\n') if element_index + 1 < len(chunks[chunk_index]): next_element = chunks[chunk_index][element_index + 1] heapq.heappush(heap, (next_element, chunk_index, element_index + 1)) def external_merge_sort(input_path, output_path, chunk_size): chunks = create_sorted_chunks(input_path, chunk_size) merge_sorted_chunks(chunks, output_path) input_path = example_data output_path = # insert output path here chunk_size = 100 external_merge_sort(input_path, output_path, chunk_size)
I hope this tutorial helps you understand how to implement External Merge Sort in Python! If you have any further questions, feel free to ask. See you in the next tutorial!
Video, Further Resources & Summary
I have recently published a video on the Statistics Globe YouTube channel, which demonstrates the examples of this article. You can find the video below.
The YouTube video will be added soon.
In addition, you might read the related tutorials on this website.
- Recursive Merge Sort in Python
- Merge Sort Counting Inversions in Python
- Visualization of Merge Sort in Python
- k-way Merge Sort in Python
- Function & Library for Merge Sort in Python
- Learn Python Programming
In this tutorial, I have shown how to sort large input files efficiently via external merge sort in Python programming. In case you have any further questions, please let me know in the comments.
This page was created in collaboration with Ömer Ekiz. Have a look at Ömer’s author page to get further information about his professional background, a list of all his tutorials, as well as an overview of his other tasks on Statistics Globe.