External Merge Sort in Python (Example)

 

This article shows how to implement the external merge sort algorithm in Python.

Table of contents:

The external merge sort algorithm is used for sorting large data sets that cannot fit into the main memory. It is a two-step process, where the first step involves dividing the data into smaller chunks and sorting them using a standard sorting algorithm (like quicksort, heapsort, etc.) in the main memory, while the second step involves merging the sorted chunks into a single sorted output file.

Let’s take a look at some Python codes in action!

 

Creation of Example Data & Module Importation

First, we need to import the relevant packages:

import heapq
import random

As a next step, we’ll need to create some example data by executing the code below:

example_data = # insert example_data path here
with open(example_data, 'w') as example:
    for x in range(1000):
        example.writelines(str(random.randint(0, 100000)) + '\n')

The code snippet will produce a text full of random integers in the range of [0, 10000]. You should enter a path for the example_data file to generate the example data by yourself.

We’re done with the example generation, so let’s dive into the implementation!

 

Example: Implementing External Merge Sort

Our first function create_sorted_chunks() takes an input_path, and chunk_size as input parameters. The input file is split into smaller txt files called chunks, where each chunk contains chunk_size amount of data.

Note: Since this algorithm is designed to handle vast amounts of data and lists have a limit they can handle, we will utilize the with open() clauses. These clauses are used to do read, write and append operations on files.

Let’s get straight into the first step of defining the algorithm’s function!

def create_sorted_chunks(input_path, chunk_size):
    chunks = []
    with open(input_path, 'r') as file:
        chunk = []
        for line in file:
            chunk.append(int(line.strip()))
            if len(chunk) == chunk_size:
                chunk.sort()
                chunks.append(chunk)
                chunk = []
 
        # Handle the remaining elements in the last chunk
        if chunk:
            chunk.sort()
            chunks.append(chunk)
 
    return chunks

Now that we have separated the data into chunks and sorted them individually, we can continue with the second step, which merges the sorted temporary files into a single sorted output file!

The merge_sorted_chunks() function uses the heapq module to manage the sorted chunks during the merge phase efficiently. Essentially, a heap is a data structure which makes accessing the minimum/maximum element of a set computationally cheaper.

In this code, we initialize a min-heap called heap by taking the first element from each chunk (if it exists) and adding it as a tuple (value, chunk_index, element_index) to the heap. We then use a while loop to repeatedly pop the smallest element from the heap and write it to the output file. If there are more elements remaining in the corresponding chunk, we push the next element from that chunk into the heap.

def merge_sorted_chunks(chunks, output_path):
    heap = [(chunk[0], i, 0) for i, chunk in enumerate(chunks) if chunk]
    heapq.heapify(heap)
 
    with open(output_path, 'w') as file:
        while heap:
            value, chunk_index, element_index = heapq.heappop(heap)
            file.write(str(value) + '\n')
 
            if element_index + 1 < len(chunks[chunk_index]):
                next_element = chunks[chunk_index][element_index + 1]
                heapq.heappush(heap, (next_element, chunk_index, element_index + 1))

Now that we have the functions to create sorted chunks and merge them, we can combine them into a complete External Merge Sort implementation. Here’s an example:

input_path = example_data
output_path = # insert output path here
chunk_size = 100
 
external_merge_sort(input_path, output_path, chunk_size)

To test the implementation yourself, don’t forget to set a valid input_path, output_path and adjust the chunk_size according to the data size you can handle in memory. Below you can see the template of how to set the parameters and execute the algorithm.

Here’s the complete code for your reference:

import heapq
import random
 
example_data = # insert example_data path here
with open(example_data, 'w') as example:
    for x in range(1000):
        example.writelines(str(random.randint(0, 100000)) + '\n')
 
 
def create_sorted_chunks(input_path, chunk_size):
    chunks = []
    with open(input_path, 'r') as file:
        chunk = []
        for line in file:
            chunk.append(int(line.strip()))
            if len(chunk) == chunk_size:
                chunk.sort()
                chunks.append(chunk)
                chunk = []
 
        # Handle the remaining elements in the last chunk
        if chunk:
            chunk.sort()
            chunks.append(chunk)
 
    return chunks
 
def merge_sorted_chunks(chunks, output_path):
    heap = [(chunk[0], i, 0) for i, chunk in enumerate(chunks) if chunk]
    heapq.heapify(heap)
 
    with open(output_path, 'w') as file:
        while heap:
            value, chunk_index, element_index = heapq.heappop(heap)
            file.write(str(value) + '\n')
 
            if element_index + 1 < len(chunks[chunk_index]):
                next_element = chunks[chunk_index][element_index + 1]
                heapq.heappush(heap, (next_element, chunk_index, element_index + 1))
 
def external_merge_sort(input_path, output_path, chunk_size):
    chunks = create_sorted_chunks(input_path, chunk_size)
    merge_sorted_chunks(chunks, output_path)
 
input_path = example_data
output_path = # insert output path here
chunk_size = 100
 
external_merge_sort(input_path, output_path, chunk_size)

I hope this tutorial helps you understand how to implement External Merge Sort in Python! If you have any further questions, feel free to ask. See you in the next tutorial!

 

Video, Further Resources & Summary

I have recently published a video on the Statistics Globe YouTube channel, which demonstrates the examples of this article. You can find the video below.

 

The YouTube video will be added soon.

 

In addition, you might read the related tutorials on this website.

 

In this tutorial, I have shown how to sort large input files efficiently via external merge sort in Python programming. In case you have any further questions, please let me know in the comments.

 

Ömer Ekiz Python Programming & Informatics

This page was created in collaboration with Ömer Ekiz. Have a look at Ömer’s author page to get further information about his professional background, a list of all his tutorials, as well as an overview of his other tasks on Statistics Globe.

 

Subscribe to the Statistics Globe Newsletter

Get regular updates on the latest tutorials, offers & news at Statistics Globe.
I hate spam & you may opt out anytime: Privacy Policy.


Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.

Top