dinogalactic

Fix Docker Errors When Building a CDK Construct Library with projen

Recently I got the following error from a GitHub Actions run while using the projen AWS CDK Construct Library project generator to contribute to the open-source Control Broker:

Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock

The CDK code that kicked off the denied Docker daemon call was the following:

constructor(scope: Construct, id: string) {
  super(scope, id);
  this.handler = new PythonFunction(this, `${id}CloudFormationInputHandler`, {
    entry: join(__dirname, 'lambda-function-code/cloudformation-input-handler'),
    runtime: Runtime.PYTHON_3_9,
    index: 'lambda_function.py',
    handler: 'lambda_handler',
    timeout: Duration.seconds(60),
  });
}

Note that the PythonFunction construct bundles all of a Python-based Lambda function's dependencies into the Lambda code zip without requiring the user to do much of anything to make this happen. It is very intuitive. It does, however, use Docker, which hadn't been a problem for me in other contexts.
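For reference, the entry directory given above only needs the handler module plus a dependency manifest; PythonFunction mounts it into a build container and installs the dependencies there. An illustrative layout (the requirements.txt name assumes pip-style dependencies):

lambda-function-code/cloudformation-input-handler/
├── lambda_function.py   # the index file, containing lambda_handler
└── requirements.txt     # installed into the bundle by pip, inside Docker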

Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock

But I still got the above error, so I spent a long while trying to figure this out. I first discovered that one major problem was that my build and release steps, defined in GitHub workflow files generated by projen, were running inside a container. This means the Docker error was coming from inside the container that was running my GitHub job. That was the first clue. In my case, the job container was jsii/superchain:

# ~~ Generated by projen. To modify, edit .projenrc.js and run "npx projen".

name: release
on:
  push:
    branches:
      - main
  workflow_dispatch: {}
jobs:
  release:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    outputs:
      latest_commit: ${{ steps.git_remote.outputs.latest_commit }}
    env:
      CI: "true"
    steps:
      # I've removed steps just to shorten this snippet
      - name: release
        run: npx projen release
    container:
      image: jsii/superchain:1-buster-slim-node14

npx projen release does a lot of things, and one of them is running Jest tests. That was the part raising the permissions error about the Docker daemon socket. I started troubleshooting. At first I thought it was impossible to access the Docker API from within a GitHub Action at all, but that was quickly disproven when I remembered I had done it before in a GitHub Action, albeit outside a container (i.e. on the GitHub job's host, in a job that did not have the container option set).
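For context, that workflow isn't hand-written; it comes from the projen project definition. For an AWS CDK Construct Library, the definition looks roughly like this (a sketch with placeholder values, written as a TypeScript projenrc; the real Control Broker .projenrc.js differs):

import { awscdk } from 'projen';

// Sketch of a projen AWS CDK Construct Library project; all option values are placeholders.
const project = new awscdk.AwsCdkConstructLibrary({
  author: 'Example Author',
  authorAddress: 'author@example.com',
  cdkVersion: '2.37.0',
  defaultReleaseBranch: 'main',
  name: 'control-broker',
  repositoryUrl: 'https://github.com/example/control-broker.git',
  deps: ['@aws-cdk/aws-lambda-python-alpha'],
});

// Generates .github/workflows/release.yml (among many other files) from this definition.
// `npx projen release` then builds, runs the Jest tests, and packages the library,
// all inside the jsii/superchain job container shown above.
project.synth();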

I then tried another angle. I realized that the container might not have access to the Docker socket because it hadn't been mapped into the container's filesystem. I decided to change the container configuration to the following:

container:
  image: jsii/superchain:1-buster-slim-node14
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock

Surely this would work! Now the socket file actually exists within my job container.

Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock

Same error.

I then stepped back for a moment and considered the error message. As is so often the case, the error message contained all the information I needed from the start, but I didn't realize that until digging around, wracking my brain, and taking a break. It's kind of like a good movie -- the first scene usually contains most of the movie's information, but you don't yet know how to interpret what you're seeing. But that's enough about the philosophy of error messages.

A real breakthrough came when I inspected the Dockerfile for the jsii/superchain container.

Critically, I found the following line within the Dockerfile, and I immediately knew it to be the culprit.

USER superchain:superchain

A user other than root wouldn't have permission to access the Docker daemon socket, so this container image is effectively incompatible with building Docker images in a GitHub Actions job.

But how could this be possible? Doesn't the CDK package all kinds of assets inside of Docker containers, including (as in my use case) the code for Lambda functions? Indeed, it does.

For a moment I thought I would have to rewrite my Lambda functions in TypeScript so they could be built with esbuild outside of a container, but then I got sad, because that would mean I couldn't use things like Lambda layers in my Construct, since those could also need (or at least benefit from using) Docker to build them.

I also found this projen project page that seemed to indicate that the preferred way to author Lambda functions in projen projects is to write them in TypeScript. Once again, I felt that the extreme dedication to TypeScript within the CDK community was at odds with its stated goal of supporting many runtimes. If interoperability only works when you write everything in TypeScript, then only the users of Constructs would be able to write in any other language. But then, why would they write in some other language if, ultimately, they could only share their code with users in still other languages by having written it in TypeScript to begin with? Why not just make everyone write CDK code in TypeScript, especially since you need Node for any JSII-based project? Most importantly for my immediate need, avoiding the Docker builds altogether would let me keep my Python Lambda function code in my Construct library.

I feel bad for this person, who kept going down the path I was on and concluded that they couldn't use Docker in GitHub Actions with their projen project either, coincidentally because they used the PythonFunction Construct as well:

An interesting problem I ran into when using the PythonFunction with GitHub Actions is the construct uses Docker under the hood to install dependencies. This caused issues because Docker was unable to be called within the Action. The solution is to use the L2 Construct SingletonFunction and the local bundle option. This is well described in this AWS blog post.
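For reference, the local bundling option that post mentions hands bundling to a callback that runs on the host instead of inside Docker. A rough sketch of the idea (not the code from that post; the paths, the requirements.txt file, and the pip invocation are illustrative):

import { execSync } from 'child_process';
import { Code, Runtime } from 'aws-cdk-lib/aws-lambda';

// Illustrative: bundle the Python handler on the host so no Docker daemon is needed.
const code = Code.fromAsset('lambda-function-code/cloudformation-input-handler', {
  bundling: {
    // The Docker image the CDK falls back to if local bundling fails.
    image: Runtime.PYTHON_3_9.bundlingImage,
    local: {
      tryBundle(outputDir: string): boolean {
        try {
          execSync(`cp -r lambda-function-code/cloudformation-input-handler/* ${outputDir}`);
          execSync(`pip install -r requirements.txt -t ${outputDir}`, {
            cwd: 'lambda-function-code/cloudformation-input-handler',
          });
          return true; // bundled locally; the CDK skips Docker entirely
        } catch {
          return false; // let the CDK fall back to Docker-based bundling
        }
      },
    },
  },
});

The resulting Code can then be handed to a SingletonFunction (or a plain Function), which is how the approach in that post avoids Docker during CI.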

But I digress. I finally saw the light and came up with a different approach that had many benefits.

Here is the test code that instantiated the PythonFunction construct, though indirectly, through the new CloudFormationInputHandler() call:

test('ControlBroker can be created and attached to a stack', () => {
  const stack = new Stack();
  const api = new Api(stack, 'ControlbrokerApi', {});

  const cfnInputHandler = new CloudFormationInputHandler(stack, 'CfnInputHandler');
  // ^^ The above instantiates the PythonLambda construct and kicks off the
  // ^^ Docker daemon calls during bundling (which happens BEFORE synth!)

  const cfnInputHandlerApiBinding = new HttpApiBinding('CloudFormation', api, cfnInputHandler);
  const evalEngine = new OpaEvalEngine(stack, 'EvalEngine');
  const evalEngineBinding = new HttpApiBinding('EvalEngine', api, evalEngine);
  api.setEvalEngine(evalEngine, evalEngineBinding);
  api.addInputHandler(cfnInputHandler, cfnInputHandlerApiBinding);
  new ControlBroker(stack, 'TestControlBroker', {
    api,
  });
});

Why do I have to instantiate constructs that need Docker for bundling within my unit tests? Why not just mock those (or at least the bundling part) out and thereby avoid the Docker builds? It would definitely cut down on test execution time, and arguably actual Lambda function bundling belongs to the consumers of the Constructs rather than to the Construct library itself. Perhaps a custom integration testing phase could do this bundling as part of a test deployment, but there was no need to do it during the packaging and release process of my Construct library. Mocking would also reduce the number of dependencies needed to run my tests and build my library, because it would remove Docker.

So I sought a way to mock out the bundling part of my Lambda functions in my unit tests.

I came up with the following, which mocks the PythonFunction class, skipping any bundling:

import { PythonFunction } from '@aws-cdk/aws-lambda-python-alpha';

jest.mock('@aws-cdk/aws-lambda-python-alpha');

// Cast the auto-mocked class so we can supply a custom implementation.
const mockedPythonFunction = <jest.Mock<typeof PythonFunction>>(PythonFunction as unknown);
mockedPythonFunction.mockImplementation(() => {
  const original = jest.requireActual('@aws-cdk/aws-lambda-python-alpha');
  // Return a minimal stub with only the members the surrounding code touches,
  // so no bundling (and therefore no Docker) ever runs.
  return {
    ...original.PythonFunction,
    functionArn: 'arn:aws:lambda:us-east-1:123456789012:function:mockfunction',
    addPermission: () => {},
  };
});

test('ControlBroker can be created and attached to a stack', () => {
  const stack = new Stack();
  const api = new Api(stack, 'ControlbrokerApi', {});
  const cfnInputHandler = new CloudFormationInputHandler(stack, 'CfnInputHandler');
  const cfnInputHandlerApiBinding = new HttpApiBinding('CloudFormation', api, cfnInputHandler);
  const evalEngine = new OpaEvalEngine(stack, 'EvalEngine');
  const evalEngineBinding = new HttpApiBinding('EvalEngine', api, evalEngine);
  expect(mockedPythonFunction).toHaveBeenCalled();
  api.setEvalEngine(evalEngine, evalEngineBinding);
  api.addInputHandler(cfnInputHandler, cfnInputHandlerApiBinding);
  new ControlBroker(stack, 'TestControlBroker', {
    api,
  });
});

The particular portions of interest are the Jest mock-related calls and mockedPythonFunction.mockImplementation(). That mockImplementation() call mocks out the functionArn property and the addPermission() method, which I found the surrounding code needed in order to still function. For instance, without a functionArn value, which of course the real, non-mocked code provides, I would get the following error:

FAIL  test/control-broker.test.ts
 ● ControlBroker can be created and attached to a stack

   Either `integrationSubtype` or `integrationUri` must be specified.

However, with the minimal necessary mock implementation, I get the following:

Test Suites: 2 passed, 2 total
Tests:       2 passed, 2 total
Snapshots:   0 total
Time:        4.26 s, estimated 5 s

Yay! Previously, test runs took significantly longer - at least 19 seconds if the container image already existed and was cached, and far longer (minutes) if not.

The most important thing is that no Docker containers are created during unit test runs any longer, and I think others can use this approach both to speed up their AWS CDK Construct Library unit test runs and to make them compatible with GitHub Actions.