Wednesday, October 27, 2021

On Haskell layout rule

Layout rule in Haskell allows programmers to omit writing braces and semicolons in let, where and do by a lexer and a parser who insert them automatically. Here are some references for Haskell layout rule:

When I read through the sections, I feel that the layout rule is quite intuitive but its formal description is a bit difficult to understand. So, I have decided to try to explain the formal description by example so that I can understand it better, which is the purpose of this article.

Suppose a simple Haskell program as follows.

1:module Main where

2:

3:main = do

4:    putStr hello

5:    putStrLn world

6:  where

7:    hello = "Hello"

8: 

9:world = " World!!"


GHC will automatically translate it into one in the following. 

1:module Main where

2: 

3:{main = do

4:    {putStr hello

5:    ;putStrLn world

6:  }where

7:    {hello = "Hello"

8: 

9:};world = " World!!"

0:}


This is a very useful feature for programmers. However, the compiler writers will be more burdensome because the implementation of the layout rule is quite tricky. The layout rule demands a lexer and a parser to work together. You can never develop a standalone Haskell lexer correctly without support by any parsers. 


In this article, let me try to explain this layout rule by showing a running example of the layout translation function in the Haskell 2010 language report. This function explains the layout rule by translating one using the layout rule into another with explicit braces and semicolons (so without using the layout rule). 


L (tokens, indentations) = tokens' where

  • tokens : indentation sensitive tokens
  • indentations : nested indentation context (e.g., [5, 1])
  • tokens' : indentation insensitive tokens using explicit braces and semicolons

To recognize spaces in Haskell lexer, special tokens <n> and {n} are introduced. 
  • Firstly, {n} indicates that there is no { right after let, where, do, or of. The integer, n, is the column (indentation) of the next lexeme. It is defined as 0 when there is no lexeme and the end of file is reached. 
  • Secondly, <n> indicates that there are only white spaces in front of a lexeme in a line. The integer, n, is the column (indentation) of the lexeme. 

Note that the first column is 1, not 0. The zero as an indentation is used specially, which will be explained later.

The definition of L function in Section 10.3 is as follows. 

(Group 1)
L (<n>: ts) (m : ms) 
  = ; : (L ts (m : ms))    if m = n
  = } : (L (< n >: ts) ms) if n < m

L (<n>: ts) ms = L ts ms


(Group 2)
L ({n} : ts) (m : ms) = { : (L ts (n : m : ms))  if n > m 
L ({n} : ts) []       = { : (L ts [n])           if n > 0
L ({n} : ts) ms       = { : } : (L (<n>: ts) ms)


(Group 3)
L (} : ts) (0 : ms) = } : (L ts ms)
L (} : ts) ms       = parse-error
L ({ : ts) ms       = { : (L ts (0 : ms))


(Group 4)
L (t : ts) (m : ms) = } : (L (t : ts) ms)  if m /= 0 and parse-error(t)

L (t : ts) ms = t : (L ts ms)


(Group 5)
L [ ] [] = []
L [ ] (m : ms) = } : L [ ] ms if m /= 0

For readability, I suggest to group similar rules as above. 

With the Haskell example, the L function will be explained. We start with

L (module : Main : where : ..., []) 
 = module : L (Main : where :..., [])
 = module : Main : L (where :..., [])
 = module : Main : where : L (..., [])

By 2nd rule of Group 2 since there is no { after where and before main and also main is in the first column i.e., n=1, 

 = module : Main : where : L ({1} : main : = : do : ..., [])
 = module : Main : where : { : L (main : = : do : ..., [1])
 = module : Main : where : { : main : L (= : do : ..., [1])
 = module : Main : where : { : main : = : L (do : ..., [1])
 = module : Main : where : { : main : = : do : L (..., [1])

By the same reason as above but with 1st rule of Group 2 since the indentation context [1] is not empty,

 = module : Main : where : { : main : = : do : L ({5} : putStr : hello : ... , [1])
 = module : Main : where : { : main : = : do : { : L (putStr : hello : ... , [5,1])
 = module : Main : where : { : main : = : do : { : putStr : L (hello : ... , [5,1])
 = module : Main : where : { : main : = : do : { : putStr : hello : L (... , [5,1])

This time it is <5>, not {5} since the lexeme processed just before is an identifier hello, not a keyword, such as do or where. So, we insert ; by applying 1st rule of Group 1.  

 = module : Main : where : { : main : = : do : { : putStr : hello : L (<5> : putStrLn : world : ... , [5,1])
 = module : Main : where : { : main : = : do : { : putStr : hello : ; : L(putStrLn : world : ... , [5,1])
 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : L(world : ... , [5,1])
 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : world : L(... , [5,1])

Since n=3<5=m, we insert } by applying 2nd rule of Group 1. 

 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : world : L(<3> : where : ... , [5,1])
 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : world : } : L(<3> : where : ... , [1])

Now the 3rd rule of Group 1 is applied to remove <3>. Note that the side conditions of the 1st and 2nd rules are not satisfied. 

 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : world : } : L(where : ... , [1])
 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : world : } : L(where : {5} : hello : = : "world" : ... , [1])
 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : world : } : where : L({5} : hello : = : "world" : ... , [1])

By applying 1st rule of Group 1, 

 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : world : } : where : { : L(hello : = : "world" : ... , [5,1])
 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : world : } : where : { : hello : L(= : "world" : ... , [5,1])
 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : world : } : where : { : hello : = : L("world" : ... , [5,1])
 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : world : } : where : { : hello : = : "world" : L(... , [5,1])

By applying 2nd rule of Group 1,

 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : world : } : where : { : hello : = : "hello" : L(<1> : world : = : " world!!" : ... , [5,1])
 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : world : } : where : { : hello : = : "hello" : } :  L(<1> : world : = : " world!!" : ... , [1])

By applying 1st rule of Group ,

 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : world : } : where : { : hello : = : "hello" : } :  ; : L(world : = : " world!!" : ... , [1])
 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : world : } : where : { : hello : = : "hello" : } :  ; : world : L(= : " world!!" : ... , [1])
 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : world : } : where : { : hello : = : "hello" : } :  ; : world : = : L(" world!!" : ... , [1])
 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : world : } : where : { : hello : = : "hello" : } :  ; : world : = : " world!!" : L(... , [1])

Now we arrive at the end of file. By applying 2nd rule of Group 5, 

 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : world : } : where : { : hello : = : "hello" : } :  ; : world : = : " world!!" : L([] , [1])
 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : world : } : where : { : hello : = : "hello" : } :  ; : world : = : " world!!" : } : L([] , [])

We finish by applying 1st rule of Group 5:

 = module : Main : where : { : main : = : do : { : putStr : hello : ; : putStrLn : world : } : where : { : hello : = : "hello" : } :  ; : world : = : " world!!" : } : []

This is how we've got the layout insensitive version of the program!


We have several rules that have not been used in the running example. 
  • 3rd rule of Group 2
  • Three rules of Group 3
  • 1st rule of Group 4
Wait, did we use 2nd rule of Group 4? Yes, we did where we consume lexemes that are not actually relevant to the layout rule, for example, as in:

L (module : Main : where : ..., []) 
 = module : L (Main : where :..., [])

The 3rd rule of Group 2 is used in a corner case that there is a use of where but is followed by nothing (i.e., EOF). In the case, we just use write { and } after the keyword where. 

The three rules of Group 3 handle programs that use braces explicitly. Let us start with the 3rd rule of Group 3. When there is an explicit opening brace, this brace is produced as it is. Importantly, the indentation context now has zero in the head position indicating the presence of an explicit opening brace.

Note that explicit braces written by programmers cannot be matched against implicit braces that are inserted by the layout rule. When there were such a case, a parse error would be issued. That is done by 1st and 2nd rules of Group 3.

In 1st rule of Group 3, an explicit closing brace is matched against the head indentation, which must be zero. Otherwise, in 2nd rule of Group 3, there is a parse error. 

The 1st rule of Group 4 is most interesting (and also problematic since this rule demands a lexer to be quite tightly coupled with a parser). Let us consider another Haskell example:


let x = 1 in x

which is expected to be translated into

let {x = 1 }in x


Let us start with 

L( let :  {5} : x : = : 1 : in : x : [], [])
 = let : L({5} : x : = : 1 : in : x : [], [])
 = let : { : L(x : = : 1 : in : x : [], [5])  
 = let : { : x : L(= : 1 : in : x : [], [5])  
 = let : { : x : = : L(1 : in : x : [], [5])  
 = let : { : x : = : 1 : L(in : x : [], [5])  

At this moment, the Haskell parser will give rise to a parse error because of the absence of a matching closing brace against the opening one inserted by the Haskell lexer. Also, note that m is 5, which is not zero. 

By applying 1st rule of Group 4, we can managed to overcome this parse error as:

 = let : { : x : = : 1 : } : L(in : x : [], [5])  
 = let : { : x : = : 1 : } : in : L(x : [], [])  
 = let : { : x : = : 1 : } : in : x : L([], [])  
 = let : { : x : = : 1 : } : in : x : []

This is what Haskell layout rule is! 

This exercise have helped me better understand the Haskell layout rule. I hope it will help you too if you are not familiar with the formal description of Haskell layout rule.  And, I will welcome any comments or corrections on this article. 

  







No comments: