# External Merge Sort in Python (Example)

This article shows how to implement the external merge sort algorithm in Python.

The external merge sort algorithm is used for sorting large data sets that cannot fit into the main memory. It is a two-step process, where the first step involves dividing the data into smaller chunks and sorting them using a standard sorting algorithm (like quicksort, heapsort, etc.) in the main memory, while the second step involves merging the sorted chunks into a single sorted output file.

Let’s take a look at some Python codes in action!

## Creation of Example Data & Module Importation

First, we need to import the relevant packages:

```import heapq
import random```

As a next step, we’ll need to create some example data by executing the code below:

```example_data = # insert example_data path here
with open(example_data, 'w') as example:
for x in range(1000):
example.writelines(str(random.randint(0, 100000)) + '\n')```

The code snippet will produce a text full of random integers in the range of [0, 10000]. You should enter a path for the `example_data` file to generate the example data by yourself.

We’re done with the example generation, so let’s dive into the implementation!

## Example: Implementing External Merge Sort

Our first function `create_sorted_chunks()` takes an `input_path`, and `chunk_size` as input parameters. The input file is split into smaller txt files called `chunks`, where each chunk contains `chunk_size` amount of data.

Note: Since this algorithm is designed to handle vast amounts of data and lists have a limit they can handle, we will utilize the with open() clauses. These clauses are used to do read, write and append operations on files.

Let’s get straight into the first step of defining the algorithm’s function!

```def create_sorted_chunks(input_path, chunk_size):
chunks = []
with open(input_path, 'r') as file:
chunk = []
for line in file:
chunk.append(int(line.strip()))
if len(chunk) == chunk_size:
chunk.sort()
chunks.append(chunk)
chunk = []

# Handle the remaining elements in the last chunk
if chunk:
chunk.sort()
chunks.append(chunk)

return chunks```

Now that we have separated the data into chunks and sorted them individually, we can continue with the second step, which merges the sorted temporary files into a single sorted output file!

The `merge_sorted_chunks()` function uses the heapq module to manage the sorted chunks during the merge phase efficiently. Essentially, a heap is a data structure which makes accessing the minimum/maximum element of a set computationally cheaper.

In this code, we initialize a min-heap called `heap` by taking the first element from each chunk (if it exists) and adding it as a tuple (`value`, `chunk_index`, `element_index`) to the heap. We then use a while loop to repeatedly pop the smallest element from the heap and write it to the output file. If there are more elements remaining in the corresponding chunk, we push the next element from that chunk into the heap.

```def merge_sorted_chunks(chunks, output_path):
heap = [(chunk, i, 0) for i, chunk in enumerate(chunks) if chunk]
heapq.heapify(heap)

with open(output_path, 'w') as file:
while heap:
value, chunk_index, element_index = heapq.heappop(heap)
file.write(str(value) + '\n')

if element_index + 1 < len(chunks[chunk_index]):
next_element = chunks[chunk_index][element_index + 1]
heapq.heappush(heap, (next_element, chunk_index, element_index + 1))```

Now that we have the functions to create sorted chunks and merge them, we can combine them into a complete External Merge Sort implementation. Here’s an example:

```input_path = example_data
output_path = # insert output path here
chunk_size = 100

external_merge_sort(input_path, output_path, chunk_size)```

To test the implementation yourself, don’t forget to set a valid `input_path`, `output_path` and adjust the chunk_size according to the data size you can handle in memory. Below you can see the template of how to set the parameters and execute the algorithm.

Here’s the complete code for your reference:

```import heapq
import random

example_data = # insert example_data path here
with open(example_data, 'w') as example:
for x in range(1000):
example.writelines(str(random.randint(0, 100000)) + '\n')

def create_sorted_chunks(input_path, chunk_size):
chunks = []
with open(input_path, 'r') as file:
chunk = []
for line in file:
chunk.append(int(line.strip()))
if len(chunk) == chunk_size:
chunk.sort()
chunks.append(chunk)
chunk = []

# Handle the remaining elements in the last chunk
if chunk:
chunk.sort()
chunks.append(chunk)

return chunks

def merge_sorted_chunks(chunks, output_path):
heap = [(chunk, i, 0) for i, chunk in enumerate(chunks) if chunk]
heapq.heapify(heap)

with open(output_path, 'w') as file:
while heap:
value, chunk_index, element_index = heapq.heappop(heap)
file.write(str(value) + '\n')

if element_index + 1 < len(chunks[chunk_index]):
next_element = chunks[chunk_index][element_index + 1]
heapq.heappush(heap, (next_element, chunk_index, element_index + 1))

def external_merge_sort(input_path, output_path, chunk_size):
chunks = create_sorted_chunks(input_path, chunk_size)
merge_sorted_chunks(chunks, output_path)

input_path = example_data
output_path = # insert output path here
chunk_size = 100

external_merge_sort(input_path, output_path, chunk_size)```

I hope this tutorial helps you understand how to implement External Merge Sort in Python! If you have any further questions, feel free to ask. See you in the next tutorial!

## Video, Further Resources & Summary

I have recently published a video on the Statistics Globe YouTube channel, which demonstrates the examples of this article. You can find the video below.

In this tutorial, I have shown how to sort large input files efficiently via external merge sort in Python programming. In case you have any further questions, please let me know in the comments.

This page was created in collaboration with Ömer Ekiz. Have a look at Ömer’s author page to get further information about his professional background, a list of all his tutorials, as well as an overview of his other tasks on Statistics Globe.

Subscribe to the Statistics Globe Newsletter