Is this a good way to do JSON validation?

At work, where we use Clojure, we’ve been improving our error messages in the public API to

  1. return as many errors as possible in a response, and
  2. be written in human-readable English.

If one adopts spec, as we have, one gets the former for free, but the output of spec can hardly be called human-readable. For the latter we chose to use phrase.

Given that I’d like to see Haskell used more at work (currently there are 2 minor services written in Haskell and around a score in Clojure) I thought I’d take a look at JSON validation in Haskell. I ended up being less than impressed. We have at least one great library for parsing JSON, aeson, and there are probably a few more that I haven’t noticed. It’s of course possible to mix validation into the parsing, but parsers, aeson’s included, tend to be monads, which means that item 1 above, finding as many errors as possible, isn’t on the table.

A quick look at Hackage revealed that

  • there is a package called aeson-better-errors that looked promising but didn’t fit my needs (I explain at the end why it doesn’t pass muster)
  • the support for JSON Schema in Haskell is lacking: hjsonschema is deprecated, and aeson-schema only supports version 3 of the draft (the current version is 7), while its authors claim that hjsonschema is more modern and more actively maintained

So, a bit disappointed, I started playing with the problem myself and found that, just as the description of the validation library states, I want something that’s isomorphic to Either but accumulates on the error side. That is, something like
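this (a sketch mirroring the validation library’s Validation type; JSONValidationResult is my own instantiation of it):

    import Data.Text (Text)

    -- Isomorphic to Either, but the Applicative accumulates errors.
    data Validation e a = Failure e | Success a
      deriving (Show)

    instance Functor (Validation e) where
      fmap _ (Failure e) = Failure e
      fmap f (Success a) = Success (f a)

    instance Semigroup e => Applicative (Validation e) where
      pure = Success
      Failure e1 <*> Failure e2 = Failure (e1 <> e2)  -- keep errors from both sides
      Failure e1 <*> Success _  = Failure e1
      Success _  <*> Failure e2 = Failure e2
      Success f  <*> Success a  = Success (f a)

    -- Collect error messages in a list; a successful validation carries no data.
    type JSONValidationResult = Validation [Text] ()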

I decided it was all right to limit validation to proper JSON expressions, i.e. a validator could have the type Value -> JSONValidationResult. I want to combine validators, so I decided to wrap the type in a newtype and write a Semigroup instance for it as well:
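Something like this (a sketch; the Semigroup instance on Validation is what makes the errors accumulate):

    import Data.Aeson (Value)

    instance (Semigroup e, Semigroup a) => Semigroup (Validation e a) where
      Failure e1 <> Failure e2 = Failure (e1 <> e2)
      Failure e  <> Success _  = Failure e
      Success _  <> Failure e  = Failure e
      Success a1 <> Success a2 = Success (a1 <> a2)

    newtype JSONValidator = JSONValidator (Value -> JSONValidationResult)

    instance Semigroup JSONValidator where
      -- run both validators on the same Value and combine the results
      JSONValidator f <> JSONValidator g = JSONValidator $ \v -> f v <> g v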

The function to actually run the validation is rather straightforward:
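    runValidator :: JSONValidator -> Value -> JSONValidationResult
    runValidator (JSONValidator f) v = f v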

After writing a few validators I realised that a few patterns emerged, and the following functions simplified things a bit:
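Small combinators along these lines (the names here are illustrative):

    -- Unconditional success and single-message failure.
    valid :: JSONValidationResult
    valid = Success ()

    invalid :: Text -> JSONValidationResult
    invalid msg = Failure [msg]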

With this in place I started writing validators for the basic JSON types:
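A sketch of what these look like (vString and friends are my names; the constructors come from aeson’s Value):

    {-# LANGUAGE LambdaCase, OverloadedStrings #-}
    import Data.Aeson (Value (..))

    vString :: JSONValidator
    vString = JSONValidator $ \case
      String _ -> valid
      _        -> invalid "expected a string"

    vBool :: JSONValidator
    vBool = JSONValidator $ \case
      Bool _ -> valid
      _      -> invalid "expected a bool"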

The number type in JSON is a float (well, in aeson it’s a Scientific), so checking for an integer requires a bit more than the above:
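A sketch using isInteger from Data.Scientific:

    import Data.Scientific (isInteger)

    vInteger :: JSONValidator
    vInteger = JSONValidator $ \case
      Number n | isInteger n -> valid
      _                      -> invalid "expected an integer"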

as well as functions that check for the presence of a specific key:
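Something like this (a sketch for aeson < 2.0, where Object is a HashMap Text Value; vKey also runs a validator on the value found under the key):

    import qualified Data.HashMap.Strict as HM

    vKey :: Text -> JSONValidator -> JSONValidator
    vKey k (JSONValidator f) = JSONValidator $ \case
      Object o -> case HM.lookup k o of
        Just v  -> f v
        Nothing -> invalid $ "missing key: " <> k
      _ -> invalid "expected an object"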

With this in place I can now create a validator for a person with a name and an age:
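    vPerson :: JSONValidator
    vPerson = vKey "name" vString <> vKey "age" vInteger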

and run it on a Value:
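For instance, in a hypothetical GHCi session (using aeson’s decode to obtain a Value):

*> let Just v = decode "{\"name\": \"Alice\", \"age\": 32}" :: Maybe Value
*> runValidator vPerson v
Success ()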

and all failures are picked up:
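*> let Just v = decode "{\"nam\": \"Alice\", \"age\": 32.5}" :: Maybe Value
*> runValidator vPerson v
Failure ["missing key: name","expected an integer"]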

Reflections

  1. I quickly realised I wanted slightly more complex validation of course, so all the validators for basic JSON types above have a version taking a custom validator of type a -> JSONValidationResult (where a is the Haskell type contained in the particular Value).
  2. I started out thinking that I want an Applicative for my validations, but slowly I relaxed that to Semigroup. I’m still not sure about this decision, because I can see a real use for or, which I don’t really have now. Maybe that means I should switch back towards Applicative, just so I can implement an Alternative instance for validators.
  3. Well, I simply don’t know if this is even a good way to implement validators. I’d love to hear suggestions both for improvements and for completely different ways of tackling the problems.
  4. I would love to find out that there already is a library that does all this in a much better way. Please point me in its direction!

Appendix: A look at aeson-better-errors

The issue with aeson-better-errors is easiest to illustrate using the same example as in its announcement:
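As I remember it, the example looks like this (key, asString, and asIntegral come from Data.Aeson.BetterErrors):

    import Data.Aeson.BetterErrors (Parse, key, asString, asIntegral)

    data Person = Person String Int
      deriving (Show)

    asPerson :: Parse e Person
    asPerson = Person <$> key "name" asString <*> key "age" asIntegral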

and with this loaded in GHCi (making sure to either pass -XOverloadedStrings on the command line, or :set -XOverloadedStrings in GHCi itself):

*> parse asPerson "{\"name\": \"Alice\", \"age\": 32}"
Right (Person "Alice" 32)
*> parse asPerson "{\"name\": \"Alice\"}"
Left (BadSchema [] (KeyMissing "age"))
*> parse asPerson "{\"nam\": \"Alice\"}"
Left (BadSchema [] (KeyMissing "name"))

Clearly aeson-better-errors isn’t fulfilling the bit about reporting as many errors as possible. Something I would have realised right away had I read its API reference on Hackage a bit more carefully: the parser type ParseT is an instance of Monad!

Zipping streams

Writing the following is easy after glancing through the documentation for conduit:
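Something along these lines (a sketch using zipSources from conduit; it pairs an infinite stream of numbers with the elements of a second source and prints the pairs):

    import Conduit

    main :: IO ()
    main = runConduit $
      zipSources (yieldMany [1 :: Int ..]) (yieldMany "abc")
      .| mapM_C print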

Neither pipes nor streaming makes it as easy to figure out. I must be missing something! What functions should I be looking at?

Picking up new rows in a DB

I’ve been playing around a little with postgresql-simple in combination with streaming for picking up inserted rows in a DB table. What I came up with was
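something like this (a sketch; the table and column names are made up, and the stream can be consumed with e.g. S.print . S.take 5):

    {-# LANGUAGE OverloadedStrings #-}
    import Control.Concurrent (threadDelay)
    import Control.Monad.IO.Class (liftIO)
    import Database.PostgreSQL.Simple (Connection, Only (..), query)
    import Streaming (Of, Stream)
    import qualified Streaming.Prelude as S

    -- Poll for rows with an id higher than the last one seen, yield them,
    -- then sleep for a second and poll again.
    newRows :: Connection -> Stream (Of (Int, String)) IO r
    newRows conn = go 0
      where
        go lastId = do
          rows <- liftIO $ query conn
            "SELECT id, txt FROM events WHERE id > ? ORDER BY id" (Only lastId)
          S.each rows
          liftIO $ threadDelay 1000000
          go (foldr (max . fst) lastId rows)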

It seems almost too simple so I’m suspecting I missed something.

  • Am I missing something?
  • Should I be using streaming, or is pipes or conduit better (in case they are, in what way)?

Using a configuration in Scotty

At work we’re only now getting around to putting correlation IDs to use. We write most of our code in Clojure, but since I’d really like to use more Haskell at work I thought I’d dive into Scotty and see how to deal with logging, and especially how to get correlation IDs into the logs.

The types

For configuration I decided to use the reader monad inside Scotty’s ActionT. Enter Chell:
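Something like this (a sketch; Text is the lazy Text ScottyT expects for its error type):

    import Control.Monad.Trans.Reader (ReaderT)
    import Data.Text.Lazy (Text)
    import Web.Scotty.Trans (ActionT, ScottyT)

    type ChellM c = ScottyT Text (ReaderT c IO)
    type ChellActionM c = ActionT Text (ReaderT c IO)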

In order to run it I wrote a function corresponding to scotty:
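A sketch, using scottyT and running the reader with the given configuration:

    import Control.Monad.Trans.Reader (runReaderT)
    import Network.Wai.Handler.Warp (Port)
    import Web.Scotty.Trans (scottyT)

    chell :: c -> Port -> ChellM c () -> IO ()
    chell cfg port = scottyT port (`runReaderT` cfg)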

Correlation ID

To deal with the correlation ID each incoming request should be checked for the HTTP header X-Correlation-Id and if present it should be used during logging. If no such header is present then a new correlation ID should be created. Since it’s per request it feels natural to create a WAI middleware for this.

The easiest way I could come up with was to push the correlation ID into the request’s headers before it’s passed on:
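Something like this (a sketch, assuming the uuid package; depending on the wai version the record update on Request may require importing Network.Wai.Internal):

    {-# LANGUAGE OverloadedStrings #-}
    import Data.UUID (toASCIIBytes)
    import Data.UUID.V4 (nextRandom)
    import Network.Wai (Middleware, requestHeaders)

    -- Make sure there always is an X-Correlation-Id header, creating
    -- one if the client didn't send any.
    correlationIdMiddleware :: Middleware
    correlationIdMiddleware app req respond =
      case lookup "X-Correlation-Id" (requestHeaders req) of
        Just _ -> app req respond
        Nothing -> do
          cid <- toASCIIBytes <$> nextRandom
          let hdrs = ("X-Correlation-Id", cid) : requestHeaders req
          app req { requestHeaders = hdrs } respond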

It also turns out to be useful to have both a default correlation ID and a function for pulling it out of the headers:
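A sketch (the names are mine):

    import qualified Data.ByteString as BS
    import Data.Maybe (fromMaybe)
    import Network.Wai (Request, requestHeaders)

    defaultCorrelationId :: BS.ByteString
    defaultCorrelationId = "no-correlation-id"

    correlationIdFromRequest :: Request -> BS.ByteString
    correlationIdFromRequest req =
      fromMaybe defaultCorrelationId (lookup "X-Correlation-Id" (requestHeaders req))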

Getting the correlation ID into the configuration

Since the correlation ID should be picked out of the request on the handling of every request, it’s useful to have it in the configuration when running the ChellActionM actions. However, since the correlation ID isn’t available when running the reader (the call to runReaderT in chell), something else is called for. Looking around I found local (and later I was pointed to the more general withReaderT), but it doesn’t have a suitable type. After some help on Twitter I arrived at withConfig, which allows me to run an action in a modified configuration:
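A sketch against scotty 0.11, where ActionT is a newtype over a stack of ExceptT, ReaderT, and StateT; the three hoists push the withReaderT through that stack:

    import Control.Monad.Morph (hoist)
    import Control.Monad.Trans.Reader (withReaderT)
    import Web.Scotty.Internal.Types (ActionT (..))

    withConfig :: (c -> c') -> ChellActionM c' a -> ChellActionM c a
    withConfig f (ActionT a) = ActionT $ hoist (hoist (hoist (withReaderT f))) a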

Making it handy to use

Armed with this I can put together some functions to replace Scotty’s get, post, etc. With a configuration type like this:
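Something like (the logging function is hypothetical):

    import qualified Data.ByteString as BS
    import Data.Text (Text)

    data Config = Config
      { logFunc       :: Text -> IO ()  -- hypothetical logging action
      , correlationId :: BS.ByteString
      }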

The modified get looks like this (Scotty’s original is S.get)
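A sketch: pull the correlation ID out of the request and run the action with it baked into the configuration:

    import qualified Web.Scotty.Trans as S

    get :: S.RoutePattern -> ChellActionM Config () -> ChellM Config ()
    get p action = S.get p $ do
      cid <- correlationIdFromRequest <$> S.request
      withConfig (\cfg -> cfg { correlationId = cid }) action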

With this in place I can use the simpler ReaderT Config IO for inner functions that need to log.

QuickCheck on a REST API

Since I’m working with web stuff nowadays I thought I’d play a little with translating my old post on using QuickCheck to test C APIs to the web.

The goal and how to reach it

I want to use QuickCheck to test a REST API, just like in the case of the C API the idea is to

  1. generate a sequence of API calls (a program), then
  2. run the sequence against a model, as well as
  3. run the sequence against the web service, and finally
  4. compare the resulting model against reality.

The REST API

I’ll use a small web service I’m working on, and then concentrate on only a small part of the API to begin with.

The parts of the API I’ll use for the programs at this stage are

Method  Route       Example in                                Example out
POST    /users      {"userId": 0, "userName": "Yogi Berra"}   {"userId": 42, "userName": "Yogi Berra"}
DELETE  /users/:id

The following API calls will also be used, but not in the programs

Method  Route       Example in  Example out
GET     /users                  [0,3,7]
GET     /users/:id              {"userId": 42, "userName": "Yogi Berra"}
POST    /reset

Representing API calls

Given the information about the API above it seems the following is enough to represent the two calls of interest together with a constructor representing the end of a program
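Something like:

    import Data.Text (Text)

    data ApiCall = AddUser Text
                 | DeleteUser Int
                 | EndProgram
                 deriving (Show, Eq)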

and a program is just a sequence of calls, so list of ApiCall will do. However, since I want to generate sequences of calls, i.e. implement Arbitrary, I’ll wrap it in a newtype
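    newtype Program = Prog [ApiCall]
      deriving (Show)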

Running against a model (simulation)

First of all I need to decide what model to use. Based on the part of the API I’m using I’ll use an ordinary dictionary of Int and Text
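In code:

    import Data.Map (Map)
    import qualified Data.Map as M

    type Model = Map Int Text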

Simulating execution of a program is simulating each call against a model that’s updated with each step. I expect the final model to correspond to the state of the real service after the program is run for real. The simulation begins with an empty dictionary.

The simulation of the API calls must then be a function taking a model and a call, returning an updated model

Here I have to make a few assumptions. First, I assume the indices for the users start at 1. Second, that the next index used is always the successor of the highest index currently in use. We’ll see how well this holds up to reality later on.
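Under those assumptions the simulation might look like this:

    simulateCall :: Model -> ApiCall -> Model
    simulateCall m (AddUser name) = M.insert uid name m
      where uid = if M.null m then 1 else fst (M.findMax m) + 1
    simulateCall m (DeleteUser uid) = M.delete uid m
    simulateCall m EndProgram = m

    simulateProgram :: Program -> Model
    simulateProgram (Prog calls) = foldl simulateCall M.empty calls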

Running against the web service

Running the program against the actual web service follows the same pattern, but here I’m dealing with the real world, so it’s a little more messy, i.e. IO is involved. First the running of a single call
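A sketch using http-client (the service is assumed to listen on localhost:3000, and the responses are ignored):

    {-# LANGUAGE OverloadedStrings #-}
    import Data.Aeson (encode, object, (.=))
    import Network.HTTP.Client

    runCall :: Manager -> ApiCall -> IO ()
    runCall mgr (AddUser name) = do
      req <- parseRequest "POST http://localhost:3000/users"
      let body = encode $ object ["userId" .= (0 :: Int), "userName" .= name]
      _ <- httpLbs req { requestBody = RequestBodyLBS body } mgr
      pure ()
    runCall mgr (DeleteUser uid) = do
      req <- parseRequest ("DELETE http://localhost:3000/users/" ++ show uid)
      _ <- httpLbs req mgr
      pure ()
    runCall _ EndProgram = pure ()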

The running of a program is slightly more involved. Of course I have to set up the Manager needed for the HTTP calls, but I also need to

  1. ensure that the web service is in a well-known state before starting, and
  2. extract the state of the web service after running the program, so I can compare it to the model

The call to POST /reset resets the web service. I would have liked to simply restart the service completely, but I failed to automate that. I think I’ll have to take a closer look at the implementation of scotty to find a way.
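Putting it together (a sketch; extractModel is defined below):

    runProgram :: Program -> IO Model
    runProgram (Prog calls) = do
      mgr <- newManager defaultManagerSettings
      _ <- parseRequest "POST http://localhost:3000/reset" >>= flip httpLbs mgr
      mapM_ (runCall mgr) calls
      extractModel mgr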

Extracting the web service state and packaging it in a Model is a matter of calling GET /users and then repeatedly calling GET /users/:id with each id gotten from the first call
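Something like this (a sketch; the User type and its generically derived FromJSON instance are assumptions matching the example JSON above):

    {-# LANGUAGE DeriveGeneric #-}
    import Data.Aeson (FromJSON, decode)
    import Data.Maybe (fromMaybe)
    import GHC.Generics (Generic)

    data User = User { userId :: Int, userName :: Text }
      deriving (Show, Generic)

    instance FromJSON User

    extractModel :: Manager -> IO Model
    extractModel mgr = do
      r <- parseRequest "GET http://localhost:3000/users" >>= flip httpLbs mgr
      let ids = fromMaybe [] (decode (responseBody r) :: Maybe [Int])
      users <- mapM fetchUser ids
      pure $ M.fromList [(userId u, userName u) | Just u <- users]
      where
        fetchUser i = do
          r <- parseRequest ("GET http://localhost:3000/users/" ++ show i) >>= flip httpLbs mgr
          pure (decode (responseBody r) :: Maybe User)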

Generating programs

My approach to generating a program is based on the idea that given a certain state there is only a limited number of possible calls that make sense. Given a model m it makes sense to make one of the following calls:

  • add a new user
  • delete an existing user
  • end the program

Based on this, writing genProgram is rather straightforward:
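A sketch (the weights and the pool of names are made up):

    {-# LANGUAGE OverloadedStrings #-}
    import Test.QuickCheck

    genProgram :: Gen Program
    genProgram = Prog <$> go M.empty
      where
        go m = frequency $
          [ (4, genAdd m) ]
          ++ [ (2, genDelete m) | not (M.null m) ]
          ++ [ (1, pure [EndProgram]) ]

        genAdd m = do
          name <- elements ["Alice", "Bob", "Carol", "Dave"]
          (AddUser name :) <$> go (simulateCall m (AddUser name))

        genDelete m = do
          uid <- elements (M.keys m)
          (DeleteUser uid :) <$> go (simulateCall m (DeleteUser uid))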

Armed with that, the Arbitrary instance for Program can be implemented as1
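    instance Arbitrary Program where
      arbitrary = genProgram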

The property of an API

The steps in the first section can be used as a recipe for writing the property
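Something like this, using QuickCheck’s monadic interface:

    import Test.QuickCheck (Property)
    import Test.QuickCheck.Monadic (assert, monadicIO, run)

    prop_simulationMatchesReality :: Program -> Property
    prop_simulationMatchesReality p = monadicIO $ do
      let expected = simulateProgram p   -- run the sequence against the model
      actual <- run (runProgram p)       -- run it against the web service
      assert (expected == actual)        -- compare model and reality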

What next?

There are some improvements that I’d like to make:

  • Make the generation of Program better in the sense that the programs become longer. I think this is important as I start tackling larger APIs.
  • Write an implementation of shrink for Program. With longer programs it’s of course more important to actually implement shrink.

I’d love to hear if others are using QuickCheck to test REST APIs in some way, if anyone has suggestions for improvements, and of course ideas for how to implement shrink in a nice way.


  1. Yes, I completely skip the issue of shrinking programs at this point. That’s OK though, because the generated Programs end up being very short indeed.

A simple zaw widget for jumping to git projects

A colleague at work showed me a script he put together to quickly jump to the numerous git projects we work with. It’s based on dmenu and looks rather nice. However, I’d rather have something based on zsh, but when looking around I didn’t find anything that really fit. So, instead I ended up writing a simple widget for zaw.
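The widget ended up looking something like this (a sketch from memory: a zaw source sets the candidates, actions, and act_descriptions arrays and is registered with zaw-register-src; $PROJECTS as the root of all checkouts is my own convention):

    function zaw-src-git-projects() {
        local root=${PROJECTS:-${HOME}/projects}
        # every directory containing a .git is a candidate
        candidates=($(find ${root} -maxdepth 3 -type d -name .git | xargs -n1 dirname))
        actions=(zaw-src-git-projects-cd)
        act_descriptions=("cd to the project")
    }

    function zaw-src-git-projects-cd() {
        BUFFER="cd $1"
        zle accept-line
    }

    zaw-register-src -n git-projects zaw-src-git-projects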

I then bound it like this:
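(zaw creates a widget named after the source; the key itself is of course a matter of taste)

    bindkey '^Xg' zaw-git-projects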

On mocks and stubs in python (free monad or interpreter pattern)

A few weeks ago I watched a video where Ken Scambler talks about mocks and stubs. In particular he talks about how to get rid of them.

One part is about coding IO operations as data and using the GoF interpreter pattern to execute them.

What he’s talking about is of course free monads, but I feel he’s glossing over a lot of details. Based on some of the questions asked during the talk I think I share that feeling with some people in the audience. Specifically I feel he skipped over the following:

  • How does one actually write such code in a mainstream OO/imperative language?
  • What’s required of the language in order to allow using the techniques he’s talking about?
  • Errors tend to break abstractions, so how does one deal with errors (i.e. exceptions)?

Every time I’ve used mocks and stubs for unit testing I’ve had a feeling that “this can’t be how it’s supposed to be done!” So to me, Ken’s talk offered some hope, and I really want to know how applicable the ideas are in mainstream OO/imperative languages.

The example

To play around with this I picked the following function (in Python):
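A reconstruction, based on the operations discussed below (os.open/os.read/os.close and the len call):

    import os

    def count_chars_of_file(fn):
        fd = os.open(fn, os.O_RDONLY)
        text = os.read(fd, 10000)
        n = len(text)
        os.close(fd)
        return n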

It’s small and simple, but I think it suffices to highlight a few important points. So, the goal is to rewrite this function such that calls to IO operations (actions, e.g. os.read) are replaced by data (an instance of some data type) conveying the intent of the operation. This data can later be passed to an interpreter of actions.

Thoughts on the execution of actions and the interpreter pattern

When reading the examples in the description of the interpreter pattern what stands out to me is that they are either

  1. a list of expressions, or
  2. a tree of expressions

that is passed to an interpreter. Will this do for us when trying to rewrite count_chars_of_file?

No, it won’t! Here’s why:

  • A tree of actions doesn’t really make sense. Our actions are small and simple, they encode the intent of a single IO operation.
  • A list of actions can’t deal with interspersed non-actions, in this case it’s the line n = len(text) that causes a problem.

The interpreter pattern misses something that is crucial in this case: the running of the interpreter must be intermingled with running non-interpreted code. The way I think of it is that not only does the action need to be present and dealt with, but also the rest of the program; that latter thing is commonly called a continuation.

So, can we introduce actions and rewrite count_chars_of_file such that we pause the program when interpretation of an action is required, interpret it, and then resume where we left off?

Sure, but it’s not really idiomatic Python code!

Actions and continuations

The IO operations (actions) are represented as a named tuple:
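    from collections import namedtuple

    # op: the name of the IO operation, args: its arguments,
    # cont: the continuation, i.e. the rest of the program
    Action = namedtuple('Action', ['op', 'args', 'cont'])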

and the functions returning actions can then be written as
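    def open_file(fn, cont):
        return Action('open', [fn], cont)

    def read_file(fd, size, cont):
        return Action('read', [fd, size], cont)

    def close_file(fd, cont):
        return Action('close', [fd], cont)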

The interpreter is then an if statement checking the value of op.op with each branch executing the IO operation and passing the result to the rest of the program. I decided to wrap it directly in the program runner:
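A sketch (Done is my hypothetical end-of-program marker carrying the final result):

    import os

    Done = namedtuple('Done', ['result'])

    def run_program(action):
        # interpret each action and feed its result to the continuation,
        # until the program signals that it is done
        while not isinstance(action, Done):
            if action.op == 'open':
                res = os.open(action.args[0], os.O_RDONLY)
            elif action.op == 'read':
                res = os.read(action.args[0], action.args[1])
            elif action.op == 'close':
                res = os.close(action.args[0])
            else:
                raise ValueError('unknown operation: ' + action.op)
            action = action.cont(res)
        return action.result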

So far so good, but what will all of this do to count_chars_of_file?

Well, it’s not quite as easy to read any more (basically it’s rewritten in CPS):
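    def count_chars_of_file(fn):
        return open_file(fn, lambda fd:
            read_file(fd, 10000, lambda text:
                close_file(fd, lambda _:
                    Done(len(text)))))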

Generators to the rescue

Python does have a notion of continuations in the form of generators.1 By making count_chars_of_file into a generator it’s possible to remove the explicit continuations, and the program actually resembles the original one again.

The type for the actions loses one member, and the functions creating them lose an argument:
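    Action = namedtuple('Action', ['op', 'args'])

    def open_file(fn):
        return Action('open', [fn])

    def read_file(fd, size):
        return Action('read', [fd, size])

    def close_file(fd):
        return Action('close', [fd])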

The interpreter and program runner must be modified to step the generator until its end:
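A sketch for Python 3, where a generator's return value arrives in StopIteration.value:

    def run_program(program):
        # prime the generator, then feed each operation's result back in
        # via send() until the generator terminates
        try:
            action = next(program)
            while True:
                if action.op == 'open':
                    res = os.open(action.args[0], os.O_RDONLY)
                elif action.op == 'read':
                    res = os.read(action.args[0], action.args[1])
                elif action.op == 'close':
                    res = os.close(action.args[0])
                else:
                    raise ValueError('unknown operation: ' + action.op)
                action = program.send(res)
        except StopIteration as e:
            return e.value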

Finally, the generator-version of count_chars_of_file goes back to being a bit more readable:
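    def count_chars_of_file(fn):
        fd = yield open_file(fn)
        text = yield read_file(fd, 10000)
        n = len(text)
        yield close_file(fd)
        return n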

Generators all the way

Limitations of Python generators mean that we either have to push the interpreter (run_program) down to where count_chars_of_file is used, or make all intermediate layers into generators and rewrite the interpreter to deal with this. It could look something like this:
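A sketch (the file name is just an example; run_program above works unchanged because yield from forwards send() to the inner generator):

    def print_chars_of_file(fn):
        # an intermediate layer must itself be a generator, delegating
        # with 'yield from' so results reach the inner generator
        n = yield from count_chars_of_file(fn)
        print(n)

    run_program(print_chars_of_file('/etc/passwd'))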

Final thoughts

I think I’ve shown one way to achieve, at least parts of, what Ken talks about. The resulting code looks almost like “normal Python”. There are some things to note:

  1. Exception handling is missing. I know of no way to inject an exception into a generator in Python so I’m guessing that exceptions from running the IO operations would have to be passed in via generator.send as a normal value, which means that exception handling code would have to look decidedly non-Pythonic.
  2. Using this approach means the language must have support for generators (or some other way to represent the rest of the program). I think this rules out Java, but probably it can be done in C#.
  3. I’ve only used a single interpreter here, but I see no issues with combining interpreters (to combine domains of operations like file operations and network operations). I also think it’d be possible to use it to realize what De Goes calls Onion Architecture.

Would I ever advocate this approach for a larger Python project, or even any project in an OO/imperative language?

I’m not sure! I think that testing using mocks and stubs has led to a smelly code base each and every time I’ve used it, but this approach feels a little too far from how OO/imperative code usually is written. I would love to try it out and see what the implications really are.


  1. I know, I know, coroutines! I’m simplifying and brushing over details here, but I don’t think I’m brushing over any details that are important for this example.

That's a large project

From LWN October 6, 2016:

Over the course of the last year, the project accepted about eight changes per hour — every hour — from over 4,000 developers sponsored by over 400 companies.

Free, take 2

The other day I read a blog post on monads and stuff after which I felt rather silly about my earlier posts on Free.

I think this is probably the post I should have written instead :)

I’ll use three pieces of code, each one builds on the one before:

  • Free1.hs - Uses a free monad for a single algebra/API (full code here).
  • Free2.hs - Uses a free monad for two algebras/APIs, where one decorates the other (full code here).
  • Free3.hs - Uses a free monad for three algebras/APIs, where two are used in the program and the remaining one decorates the other two (full code here).

The first - one algebra

I’m re-using the algebras from my previous posts, but I believe it makes it easier to follow along if the amount of jumping between posts is minimized, so here is the first one once again:
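    {-# LANGUAGE DeriveFunctor #-}
    import Control.Monad.Free (Free, liftF)

    data SimpleFileF a
      = LoadFile FilePath (String -> a)
      | SaveFile FilePath String a
      deriving (Functor)

    type SimpleFileAPI = Free SimpleFileF

    loadFile :: FilePath -> SimpleFileAPI String
    loadFile fp = liftF $ LoadFile fp id

    saveFile :: FilePath -> String -> SimpleFileAPI ()
    saveFile fp s = liftF $ SaveFile fp s ()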

It’s a ridiculously small one, but I believe it’s good enough to work with. In the previous posts I implemented the Functor instances manually. I couldn’t be bothered this time around; I think I pretty much know how to do that for this kind of type now.

Having a type for the algebra, SimpleFileAPI, is convenient already now, even more so in the other examples.

The two convenience functions at the end make it straightforward to write functions using the algebra:
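For instance (saving the result to a new file is my guess at the behaviour):

    withSimpleFile :: (String -> String) -> FilePath -> SimpleFileAPI ()
    withSimpleFile f fp = do
      d <- loadFile fp
      saveFile (fp ++ "_new") (f d)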

This is simple, straightforward monadic code. Easy to read and work with. Of course it doesn’t actually do anything at all yet. For that I need an interpreter, something that translates (reduces) the algebra, the API, the commands, call them what you will, into the (side-)effects I want. For Free that is foldFree together with a suitable function f :: SimpleFileF a -> IO a.

I want LoadFile to translate to a file being read and SaveFile to some data being saved to a file. That makes it pretty obvious how that f needs to be written:
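    import Control.Monad.Free (foldFree)

    runSimpleFile :: SimpleFileAPI a -> IO a
    runSimpleFile = foldFree f
      where
        f :: SimpleFileF a -> IO a
        f (LoadFile fp g)   = g <$> readFile fp
        f (SaveFile fp d r) = writeFile fp d >> pure r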

At this point it might be good to explain the constructors of SimpleFileF a bit more. At first I thought they looked a bit funny. I mean, why does SaveFile have an a at all, since it obviously always should result in ()? And what’s up with that function in LoadFile?

It did become a little clearer to me after some thought and having a look at Free:
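    data Free f a = Pure a | Free (f (Free f a))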

I personally find the latter constructor a bit mind-bending. I can handle recursive functions fairly well, but recursive types have a tendency to confuse me. From what I understand one can think of Free as similar to a list: Pure ends the list (Nil) and Free prepends one instance of f to the rest of the list (Cons). Since Free f a is a monad, one can think of a as the result of a command.

If I were to write saveFile explicitly it’d look like this
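    saveFile :: FilePath -> String -> SimpleFileAPI ()
    saveFile fp s = Free $ SaveFile fp s (Pure ())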

and for loadFile:
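    loadFile :: FilePath -> SimpleFileAPI String
    loadFile fp = Free $ LoadFile fp Pure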

But let’s get back to the type and why a occurs like it does in the two constructors. As Gabriel Gonzalez shows in his post Why free monads matter, a constructor without an a would result in termination. In other words, if SaveFile didn’t hold an a I’d not be able to write, in a natural way, a function that saves two files.

Another limiting factor is that foldFree in the Free implementation I’m using has the type Monad m => (forall x. f x -> m x) -> Free f a -> m a. This sets a requirement on what the function translating my API into effects may look like, i.e. what f in runSimpleFile may look like. If SaveFile had no a to return, what would f (SaveFile {}) return? How could it ever satisfy the required type?

The reason for LoadFile having a function String -> a is simply that there is no data yet, but I still want to be able to manipulate it. Using a function and function composition is the ticket then.

I think that’s all there is to say about the first piece of code. To run it, take a look at the comment at the end of the file and then play with it. If you want to turn all characters of a file foo into upper case you can use
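(with Data.Char.toUpper in scope)

    runSimpleFile $ withSimpleFile (map toUpper) "foo"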

The second - two algebras, one decorating the other

The second piece of code almost only adds to the first one. There is one exception though: the function runSimpleFile is removed. Instead I’ve taken the transformation function, which used to be called f and was internal to runSimpleFile, and moved it out. It’s called stepSimpleFile:
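    stepSimpleFile :: SimpleFileF a -> IO a
    stepSimpleFile (LoadFile fp g)   = g <$> readFile fp
    stepSimpleFile (SaveFile fp d r) = writeFile fp d >> pure r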

The logging API, LogAPI, follows the same pattern as SimpleFileAPI and I’m counting on the description above being clear enough to not have to repeat myself. For completeness I include the code:
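A sketch (the constructor and function names are my guesses):

    data LogF a = Log String a
      deriving (Functor)

    type LogAPI = Free LogF

    logLine :: String -> LogAPI ()
    logLine s = liftF $ Log s ()

    stepLog :: LogF a -> IO a
    stepLog (Log s r) = putStrLn s >> pure r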

I intend LogAPI to be used to embellish SimpleFileAPI; in other words, I somehow have to turn an operation of SimpleFileAPI into an operation of LogAPI, i.e. I need a transformation. I called it logSimpleFileT and let it turn operations in SimpleFileF (i.e. not exactly SimpleFileAPI) into LogAPI (if you are wondering about my choice of type, I hope it’ll become clear below; just trust me for now that this is a good choice):
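    logSimpleFileT :: SimpleFileF a -> LogAPI ()
    logSimpleFileT (LoadFile fp _)   = logLine ("LoadFile " ++ fp)
    logSimpleFileT (SaveFile fp _ _) = logLine ("SaveFile " ++ fp)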

So far everything is hopefully very straightforward and unsurprising. Now I need to combine the two APIs. I need to add them; in other words, I need a sum type:
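Something like (the constructor names are my guesses):

    data SumF a = A (SimpleFileF a) | L (LogF a)
      deriving (Functor)

    type SumAPI = Free SumF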

The next question is how to turn my two original APIs, SimpleFileAPI and LogAPI, into SumAPI. Luckily that’s already solved by the function hoistFree:
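    hoistFree :: Functor g => (forall a. f a -> g a) -> Free f b -> Free g b

For instance, hoistFree A lifts a SimpleFileAPI program into SumAPI, and hoistFree L does the same for LogAPI.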

With this and logSimpleFileT from above I can use foldFree to decorate each operation with a logging operation like this:
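    logSimpleFile :: SimpleFileAPI a -> SumAPI a
    logSimpleFile = foldFree $ \op ->
      hoistFree L (logSimpleFileT op) *> hoistFree A (liftF op)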

This is where the type of logSimpleFileT hopefully makes sense!

Just as in the first section of this post, I need an interpreter for my API (SumAPI this time). Once again it’s written using foldFree, but this time I provide the interpreters for the sub-algebras (what I’ve chosen to call step functions):
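    runSum :: SumAPI a -> IO a
    runSum = foldFree f
      where
        f (A op) = stepSimpleFile op
        f (L op) = stepLog op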

The file has a comment at the end for how to run it. The same example as in the previous section, but now with logging, looks like this
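    runSum $ logSimpleFile $ withSimpleFile (map toUpper) "foo"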

The third - three algebras, one decorating the other two

To combine three algebras I simply take what’s in the previous section and extend it, i.e. a sum type with three constructors:
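Something like this (the type name S and the constructor A2 are referenced below; A1 and A3 are my guesses for the other two):

    data S a = A1 (StdIoF a) | A2 (SimpleFileF a) | A3 (LogF a)
      deriving (Functor)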

With this I’ve already revealed that my three APIs are the two from previous sections, LogAPI (for decorating the other APIs), SimpleFileAPI and a new one, StdIoAPI.

I want to combine them in such a way that I can write functions using both APIs at the same time. To that end I modify withSimpleFile into
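    withSimpleFile :: (String -> String) -> FilePath -> Free S ()
    withSimpleFile f fp = do
      d <- loadFile fp
      saveFile (fp ++ "_new") (f d)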

and I can add another function that uses it with StdIoAPI:
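A sketch (the function and the message are made up):

    verboseWithSimpleFile :: (String -> String) -> FilePath -> Free S ()
    verboseWithSimpleFile f fp = do
      stdIoPut ("Processing " ++ fp)
      withSimpleFile f fp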

The way to allow the APIs to be combined this way is to bake in S already in the convenience functions. This means the code for SimpleFileAPI has to change slightly (note the use of A2 in loadFile and saveFile):
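    loadFile :: FilePath -> Free S String
    loadFile fp = liftF $ A2 $ LoadFile fp id

    saveFile :: FilePath -> String -> Free S ()
    saveFile fp s = liftF $ A2 $ SaveFile fp s ()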

The new API, StdIoAPI, has only one operation:
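Something like (names are my guesses):

    data StdIoF a = PutStrLn String a
      deriving (Functor)

    stdIoPut :: String -> Free S ()
    stdIoPut s = liftF $ A1 $ PutStrLn s ()

    stepStdIo :: StdIoF a -> IO a
    stepStdIo (PutStrLn s r) = putStrLn s >> pure r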

The logging API, LogAPI, looks exactly the same but I now need two transformation functions, one for SimpleFileAPI and one for StdIoAPI.
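    logSimpleFileT :: SimpleFileF a -> Free LogF ()
    logSimpleFileT (LoadFile fp _)   = logLine ("LoadFile " ++ fp)
    logSimpleFileT (SaveFile fp _ _) = logLine ("SaveFile " ++ fp)

    logStdIoT :: StdIoF a -> Free LogF ()
    logStdIoT (PutStrLn s _) = logLine ("PutStrLn " ++ s)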

The new version of logT needs to operate on S in order to decorate both APIs.
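    logT :: Free S a -> Free S a
    logT = foldFree f
      where
        f op@(A1 o) = hoistFree A3 (logStdIoT o) *> liftF op
        f op@(A2 o) = hoistFree A3 (logSimpleFileT o) *> liftF op
        f op@(A3 _) = liftF op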

This file also has comments on how to run it at the end. This time there are two examples, one on how to run it without logging
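With an interpreter for S in place (assuming the step functions above), running without logging looks like

    runS :: Free S a -> IO a
    runS = foldFree f
      where
        f (A1 op) = stepStdIo op
        f (A2 op) = stepSimpleFile op
        f (A3 op) = stepLog op

    runS $ verboseWithSimpleFile (map toUpper) "foo"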

and one with logging
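    runS $ logT $ verboseWithSimpleFile (map toUpper) "foo"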

CMake, ExternalData, and custom fetch script

I failed to find a concrete example of how to use the CMake module ExternalData with a custom fetch script. Since I finally managed to work out how to use it, I thought I’d try to help out the next person who needs to go down this route.

Why ExternalData?

I thought I’d start with a short justification of why I was looking at the module at all.

At work I work with a product that processes images and video. When writing tests we often need some rather large files (from MiB to GiB) as input. The two obvious options are:

  1. Check the files into our Git repo, or
  2. Put them on shared storage

Neither of these are very appealing. The former just doesn’t feel quite right, these are large binary files that rarely, if ever, change, why place them under version control at all? And if they do change the Git repo is likely to balloon in size and impact cloning times negatively. The latter makes it difficult to run our tests on a machine that isn’t on the office network and any changes to the files will break older tests, unless we always only add files, never modify any in place. On the other hand, the former guarantees that the files needed for testing are always available and it is possible to modify the files without breaking older tests. The pro of the latter is that we only download the files needed for the current tests.

ExternalData is one option to address this. On some level it feels like it offers a combination of both options above:

  • It’s possible to use the shared storage
  • When the shared storage isn’t available it’s possible to fall back on downloading the files via other means
  • The layout of the storage is such that modifying in place is much less likely
  • Only the files needed for the current tests will be downloaded when building off-site

The object store

We do our building in docker images that do have our shared storage mapped in, so I’d like them to take advantage of that. At the same time I want the builds performed off-site to download the files. To get this behaviour I defined two object stores:
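Something like this (the local store location is an example):

    set(ExternalData_OBJECT_STORES
      "${CMAKE_BINARY_DIR}/ExternalData/Objects"  # local store; downloads end up here
      "/mnt/share/over/nfs/Objects"               # the shared storage
      )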

The module will search each of these for the required files and download only if they aren’t found. Downloaded files will be put into the first of the stores. Oh, and it’s very important that the first store is given with an absolute path!

The store on the shared storage looks something like this:

/mnt/share/over/nfs/Objects
└── MD5
    ├── 94ed17f9b6c74a732fba7b243ab945ff
    └── a2036177b190fbee6e9e038b718f1c20

I can then drop a file MyInput.avi.md5 in my source tree with the md5 of the real file (e.g. a2036177b190fbee6e9e038b718f1c20) as the content. Once that is done I can follow the example found in the introduction of the reference documentation.
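That example boils down to something like this (MyTest and MyExe are placeholders):

    include(ExternalData)
    ExternalData_Add_Test(MyData
      NAME MyTest
      COMMAND MyExe DATA{MyInput.avi}
      )
    ExternalData_Add_Target(MyData)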

curl vs sftp

So far so good. This now works on-site, but for off-site use I need to fetch the needed files. The last section of the reference documentation is called Custom Fetch Scripts. It mentions that files are normally downloaded using file(DOWNLOAD). Neither there, nor in the documentation for file, is there a mention of what is used under the hood to fetch the files. After asking in #cmake I found out that it’s curl. While curl does handle SFTP, I didn’t get it to work with my known_hosts file, nor with my SSH agent (both from OpenSSH). On the other hand it was rather easy to configure sftp to fetch a file from the internet-facing SSH server we have. Now I just had to hook it into CMake somehow.

Custom fetch script

As the section on “Custom Fetch Scripts” mentions, three things are needed:

  1. Specify the script via the ExternalDataCustomScript:// protocol.
  2. Tell CMake where it can find the fetch script.
  3. The fetch script itself.

The first two steps are done by providing a URL template and pointing to the script via a special variable:
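Something like this (the key sftp and the script location are my choices):

    set(ExternalData_URL_TEMPLATES
      "ExternalDataCustomScript://sftp/%(algo)/%(hash)"
      )
    set(ExternalData_CUSTOM_SCRIPT_sftp "${CMAKE_SOURCE_DIR}/cmake/FetchFromSftp.cmake")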

It took me a ridiculous amount of time to work out how to write a script that turns out to be rather short. This is an experience that seems to repeat itself when using CMake; it could say something about me, or something about CMake.
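The script ended up along these lines (a sketch; the server name and remote path are made up, and ExternalData_CUSTOM_LOCATION/ExternalData_CUSTOM_FILE/ExternalData_CUSTOM_ERROR are the variables the module documents for custom scripts):

    # FetchFromSftp.cmake -- invoked by ExternalData with cmake -P.
    # ExternalData_CUSTOM_LOCATION holds the %(algo)/%(hash) part of the URL,
    # ExternalData_CUSTOM_FILE the file to create; on failure the script
    # must set ExternalData_CUSTOM_ERROR.
    execute_process(
      COMMAND sftp backup@ssh.example.com:/Objects/${ExternalData_CUSTOM_LOCATION}
              ${ExternalData_CUSTOM_FILE}.tmp
      RESULT_VARIABLE fetch_result
      )
    if(fetch_result EQUAL 0)
      file(RENAME ${ExternalData_CUSTOM_FILE}.tmp ${ExternalData_CUSTOM_FILE})
    else()
      set(ExternalData_CUSTOM_ERROR "sftp failed for ${ExternalData_CUSTOM_LOCATION}")
    endif()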

This script is run with cmake -P in the binary dir of the CMakeLists.txt where the test is defined, which means it’s oblivious about the project it’s part of. PROJECT_BINARY_DIR is empty and CMAKE_BINARY_DIR is the same as CMAKE_CURRENT_BINARY_DIR. This is the reason why the first store in ExternalData_OBJECT_STORES has to be an absolute path – it’s very difficult, if not impossible, to find the correct placement of the object store otherwise.