Porting a humongous Perl script to Python

The Problem

The first project I was tasked with at my new job involved porting a large (>18k lines long!) Perl script to Python. I knew from experience that trying to do this in one ‘big bang’ step was sure to result in the new version having a bunch of bugs that had been squashed out of the Perl script over years of development. Instead I sought a more cautious approach which is described in this post.

The Perl script in question is run as a console app. It takes a document id along with various optional arguments. The script locates/generates a number of urls, writes them to a database and exits. It is invoked by another Perl script which reads the results from the database on completion, all in the lifecycle of a FastCGI API request.

Now, many years on from this script being created it seems an obvious fit for a ‘microservice’. Thus the goal is to both port the code to Python (as part of a company-wide push to consolidate languages) and to change it from a console app to a Flask API.

Interfacing Code

Going back to the cautious approach I mentioned above; fortunately the structure of the existing Perl script lent it to being gradually ported over piece by piece. I looked for a way to interface it with the new Flask API and the Python subprocess module looked like it would work nicely.

In terms of data transfer, I made a minor modification to the Perl script to write its output to stdout as JSON, rather than to the existing database (which I did not want the Python API to be coupled to). Writing data to stdout sounds fragile but I rationalised that this is exactly what Linux utilities piped to each other have been doing for years. It just means you have to be careful not to have any stray print statements floating around.

The interfacing Python code looks something like this:

def get_links_from_perl_script(start_process, process_info):
	input_list = [start_process, process_info]
	input_json = json.dumps(input_list)

	p = Popen(perl_script_path, stdin=PIPE, stdout=PIPE, stderr=PIPE, cwd=perl_script_dir)
	output, err = p.communicate(input_json.encode())
	rc = p.returncode

	if rc == 0:
		logger.info('Perl script finished successfully for start process: %s' % (start_process))
		if err:
			logger.warn('But the following errors were reported: %s' % err)
		links = []
		json_output = json.loads(output)
		return json_output 
	else:
		logger.error('Perl script exited with non-zero error code: %d for start process: %s. Error: %s' % (rc, start_process, err))

And on the Perl side:

use strict;
use JSON qw(encode_json decode_json);
my $str = do { local $/; <STDIN> };
my $decoded_json = decode_json($str);
# do stuff....
print encode_json(\@some_results);
exit(0);

One extra quirk is that I work on a Windows machine. Whilst options exist to install Perl on Windows, it definitely doesn’t seem to be a first class citizen. However we now have the WSL (Windows Subsystem for Linux)! My Ubuntu WSL already has Perl installed, so I wondered if I could get my Python Flask API to spin up a Perl subprocess on the WSL and pipe data to and from it. It turns out this is fairly easy. In the Python code above, the perl_script_path variable is declared as follows:

perl_script_path = 'wsl perl /mnt/c/Users/nware/Dev/this_project/humongous_script.pl'.split()

Note: a trick for young players is that this won’t work if you have a 32-bit version of Python installed. The WSL is 64-bit so Python won’t know how to find it. Ideally just install 64-bit Python, but you can work around it with this magical incantation:

perl_script_path = os.path.join(os.environ['SystemRoot'], 'SysNative', 'wsl perl /mnt/c/Users/nware/Dev/this_project/humongous_script.pl'.split()

Perl package management

A quick note on Perl package management. I was frustrated at the seemingly manual process of installing Perl packages with cspan. Coming from a Python/C#/Javascript background which all have good(ish) package management solutions, this seemed archaic. I went looking for something similar and found exactly what I was after: Carton.

Containerization

This all worked nicely for dev/test but I wanted the Flask API in a Docker container for production. The tricky thing here is that containers are meant (for good reason) to run a single workload. Thus there are official Python base containers and official Perl base containers but obviously none that have both.

I ended up creating an intermediary container, which is essentially the contents of the official Perl Dockerfile but with the based changed from buildpack-deps:stretch to python:3.6-stretchhttps://hub.docker.com/r/nwareing/perl-python/.

Note: The long term goal of this project is to gradually port all of the Perl code into Python. When this is done, the interfacing code can be removed and we will just use the offical Python docker image as per normal.

I could then create my actual application Dockerfile as follows:

FROM nwareing/perl-python:latest

RUN cpanm Carton && mkdir -p /perl_src    
WORKDIR /perl_src

COPY ./perl_src/cpanfile /perl_src/cpanfile
RUN carton install

WORKDIR /app

RUN set -ex && pip install pipenv uwsgi

COPY ./Pipfile* /app/

RUN pipenv install --system --deploy

COPY ./perl_src/humongous_script.pl /perl_src/humongous_script.pl

COPY ./src /app

# Start uwsgi web server
ENTRYPOINT [ "uwsgi", "--ini", "uwsgi.ini" ]

Summary

In summary, the work of art / monstrosity I’ve created looks something like this:

Hosting Upsource with Docker – DNS Dilemmas

Currently at work we are using an open source source code management tool called Kallithea. Unfortunately it doesn’t seem to be under active development any longer and is in general a bit unstable and lacking the features we need in a growing development team. For me the biggest pain point was not having a nice web interface to browse and review code. We’re currently evaluating other options (BitBucket, GitHub, VSO/TFS etc.) and trying to decide whether to self-host or not. This process is taking a bit of time, so I went looking for something to tide us over until we came up with a more permanent solution. This lead me to Upsource, one of JetBrains’ latest incarnations.

Upsource is web-based tool for browsing and reviewing code. The handy thing with Upsource is that it tacks onto your source code hosting tool, rather than being an all-in-one like the systems we are looking at moving to. This allowed me to quietly install it without ruffling any feathers and let members of the team decide whether or not they wanted to use it. Luckily I had a spare Linux box running Ubuntu on which I was quickly able to get it installed and hooked up with LDAP.

The interesting part came a month or later when the next version of Upsource was released (February 2017). As well as a bunch of handy new features (full-text search FTW) they also announced that new versions were being published as Docker images. This sounded like a good idea and one which would make future updates easier, so I followed the instructions to migrate my Upsource instance to being hosted under Docker. Unfortunately I found that after starting up my new version of Upsource inside a Docker container, it could no longer resolve internal URLs; neither those pointing to the source code repositories or to the LDAP server.

A bit of Googling revealed that this was a known issue with Docker on recent versions of Ubuntu: https://github.com/docker/docker/issues/23910. It sounds like it’s resolved in the latest version of Docker, but I couldn’t work out whether that had been released yet.

Luckily someone had already written up a handy blogpost showing how to get around the issue: https://robinwinslow.uk/2016/06/23/fix-docker-networking-dns/#the-permanent-system-wide-fix.

I went with the ‘quick fix’ approach described there:

  1. I ran this command to find the IP address of the DNS server running inside my company’s network. This spat out two contiguous IPs for me, so I just choose the first one.
    $ nmcli dev show | grep 'IP4.DNS'
    IP4.DNS[1]:                             10.0.0.2
    IP4.DNS[1]:                             10.0.0.3
  2. Added a –dns 10.0.0.2 argument to the docker run command I used to start the Upsource container.

Problem solved!