Ferret: An Experimental Clojure Compiler

Ferret is no longer experimental see http://dropbox.nakkaya.com/builds/ferret-manual.html for more information.

Ferret is an experimental Lisp to C++ compiler, the idea was to compile code that is written in a very small subset of Clojure to be automatically translated to C++ so that I can program stuff in Clojure where JVM or any other Lisp dialect is not available.

This is a literate program, the code in this document is the executable source, in order to extract it, open this raw file with emacs and run,

M-x org-babel-tangle

It will build the necessary directory structure and export the files and tests contained.

Disclaimer: This all started because I was bored, there was no planning and I had no idea what I was doing plus I wrote most of it after 1 am, so it does need some cleanup. Also please don't complain because I did not use boost this or boost that, my original intention was to use this on a microcontroller which means there is no boost or standard C++ library.

Compiler

Compiler has two major parts, transformation and code generation. During transformation we make passes over the code, with each pass code becomes more and more like C++ basically after the final pass it is C++ written with s-expressions. Then during code generation we iterate over the code and spit valid C++.

Transformation

(defn morph-form [tree pred f]
  (loop [loc (zip/seq-zip tree)]
    (if (zip/end? loc)
      (zip/root loc)
      (recur
       (zip/next
        (if (pred (zip/node loc))
          (zip/replace loc (f (zip/node loc)))
          loc))))))

(defn remove-form [tree pred]
  (loop [loc (zip/seq-zip tree)]
    (if (zip/end? loc)
      (zip/root loc)
      (recur
       (zip/next
        (if (pred (zip/node loc))
          (zip/remove loc)
          loc))))))

(defn is-form? [& s]
  (fn [f]
    (and (seq? f)
         (some true? (map #(= % (first f)) s)))))

During each pass we iterate over the nodes in the form using morph-form and remove-form, they both take a s-expression and a predicate if the predicate returns true, morph-form will call f passing the current node as an argument and replace that node with f's return value, remove-form on the other hand does what its name suggests and removes the node when predicate returns true.

(defn dispatch-reader-macro [ch fun]
  (let [dm (.get
            (doto (.getDeclaredField clojure.lang.LispReader "dispatchMacros")
              (.setAccessible true))
            nil)]
    (aset dm (int ch) fun)))

(defn native-string [rdr letter-u]
  (loop [s (str )
         p \space
         c (char (.read rdr))]
    (if (and (= c \#) (= p \>))
      s
      (recur (str s p) c (char (.read rdr))))))

(dispatch-reader-macro \< native-string)

We install a custom reader macro, what it does is turn everything between #< and ># into a string, this makes life so much easier when you need to embed native code into a function, otherwise it is a nightmare to indent native code in a string.

(defn process [form]
  (->> (expand-macros form)
       (add-built-in)
       (expand-macros)
       (vector->list)
       (let->fn)
       (do->fn)
       (closure-conversion)
       (symbol-conversion)
       (vector->list)))

Forms go through eight transformations before they are passed to the code generation phase.

(defn expand-macros [form]
  (let [macros (->> (read-string (str \( (read-from-url "runtime.clj") \)))
                    ;;get built in macros
                    (filter (is-form? 'defmacro))
                    ;;merge user defined macros
                    (concat (filter (is-form? 'defmacro) form)))
        form (remove-form form (is-form? 'defmacro))
        temp-ns (gensym)]

    (create-ns temp-ns)
    (binding [*ns* (the-ns temp-ns)]
      (refer 'clojure.core :exclude (concat (map second macros) ['fn 'let 'def]))
      (use 'clojure.contrib.macro-utils)
      (doseq [m macros]
        (eval m)))

    (let [form (morph-form form
                           (apply is-form? (map second macros))
                           (fn [f]
                             (binding [*ns* (the-ns temp-ns)]
                               (macroexpand-all f))))]
      (remove-ns temp-ns)
      form)))

First we read all the macros present in runtime.clj then add to that user defined macros, they are evaluated in a temporary namespace, using morph-form we iterate all the macros used in the code that we are compiling and expand them in the temporary namespace then the node is replaced with its expanded form.

(defn add-built-in
  ([form]
     (let [built-in (->> (read-string (str \( (read-from-url "runtime.clj") \)))
                         (filter (is-form? 'defn))
                         (reduce (fn[h v] (assoc h (second v) v)) {}))
           fns (ref {'list (built-in 'list)})
           form (add-built-in form built-in fns)]
       (concat (vals @fns) form)))
  ([form built-in fns]
     (morph-form form symbol?
                 #(do (if-let [f (built-in %)]
                        (when (not (@fns %))
                          (do (dosync (alter fns assoc % f))
                              (add-built-in
                               (expand-macros (drop 3 f))
                               built-in fns)))) %))))

In order to keep the generated C++ code compact only the functions used will be present in the generated source file. Which means if you don't use println anywhere in the code it won't be defined in the final C++ file, but if you use it, it and everything it uses will be defined, in the case of println it will pull apply, print and newline with it.

(defn vector->list [form]
  (morph-form form vector? #(reverse (into '() %))))

Since there is no support for vectors, they are converted to lists.

(defn let->fn [form]
  (morph-form form
              (is-form? 'let)
              (fn [[_ bindings & body]]
                (let [bindings (partition 2 bindings)
                      vars (flatten (map first bindings))
                      defs (map #(cons 'define-var %) bindings)
                      body-fn (cons (concat ['fn vars] body) vars)]
                  (list (concat ['fn []] defs [body-fn]))))))

let forms are transformed into nested functions which are then called immediately, bindings are setup in the outer function, expressions are placed in the inner function which takes the bindings as arguments.

So following form,

(let->fn '(let [a 1
                b 2]
            (+ a b)))

after transformation becomes,

((fn []
   (define-var a 1)
   (define-var b 2)

   ((fn (a b)
      (+ a b)) a b)))
(defn do->fn [form]
  (morph-form form
              (is-form? 'do)
              #(list (concat ['fn []] (rest %)))))

A similar method is used for the do form, expressions are wrapped in a fn that takes no parameters and executed in place.

(do->fn '(do (+ 1 1)))
((fn [] (+ 1 1)))
(defn lambda-defined? [fns env args body]
  (let [f (concat [env args] body)
        name (reduce (fn[h v]
                       (let [[_ n & r] v]
                         (if (= r f) n))) nil @fns)]
    (when name
      (apply list 'lambda-object name env))))

(defn define-lambda [fns env args body]
  (let [n (gensym)]
    (dosync (alter fns conj (concat ['define-lambda n env args] body)))
    (apply list 'lambda-object n env)))

(defn closure-conversion
  ([form]
     (let [fns (ref [])
           form (closure-conversion form fns)]
       (vector->list (concat @fns form))))
  ([form fns & env]
     (morph-form form
                 (is-form? 'fn)
                 (fn [[_ args & body]]
                   (let [env (if (nil? env) '() (first env))
                         body (closure-conversion body fns (concat args env))]
                     (if-let [n (lambda-defined? fns env args body)]
                       n
                       (define-lambda fns env args body)))))))

closure-conversion handles the problem of free variables,

(defn make-adder [x]
  (fn [n] (+ x n)))

in the above snippet x is a free variable, the function make-adder returns, has to have a way of referencing that variable when it is used. The way we do this is that, every function will pass its arguments to inner functions (if any) it contains.

(closure-conversion '(fn [x]
                        (fn [n] (+ x n))))

Above form will be converted to,

(define-lambda G__265 (x) (n) (+ x n))
(define-lambda G__266 () (x) (lambda-object G__265 x))

What this means is, define a functor named G_265 that holds a reference to x, and another functor G_266 that has no state. When we create an instance of G_265 we pass x to its constructor. Since every thing is already converted to fns this mechanism allows variables to be referenced down the line and solves the free variable problem.

(defn symbol-conversion [form]
  (let [c (comp #(symbol (escape {\- \_ \* "_star_" \+ "_plus_" \/ "_slash_"
                                  \< "_lt_" \> "_gt_" \= "_eq_" \? "_QMARK_"}
                                 (str %)))
                #(cond (= 'not %) '_not_
                       :default %))]
    (morph-form form symbol? c)))

Final step converts all symbols that are not legal C++ identifiers into valid ones.

Code Generation

At this point all we need is a multi method that will emit correct string based on the form.

(defmulti emit (fn [form _]
                 (cond (is-special-form? 'define_lambda form) 'define_lambda
                       (is-special-form? 'lambda_object form) 'lambda_object
                       (is-special-form? 'define_var form) 'define_var
                       (is-special-form? 'native_declare form) 'native_declare
                       (is-special-form? 'if form) 'if
                       (is-special-form? 'def form) 'def
                       (is-special-form? 'reduce form) 'reduce
                       (to-str? form) :to-str
                       (keyword? form) :keyword
                       (number? form) :number
                       (nil? form) :nil
                       (char? form) :char
                       (string? form) :string
                       (seq? form) :sequence)))

Without preprocessing following forms,

(emit '(list 1 2 3) (ref {}))

(emit '(+ 1 2) (ref {}))

(emit '(if (< a b)
         b a)
      (ref {}))

would evaluate to,

"INVOKE(VAR(list), VAR(3),VAR(2),VAR(1))"
"INVOKE(VAR(+), VAR(2),VAR(1))"
"(BOOLEAN(INVOKE(VAR(<), VAR(b),VAR(a)))->asBool() ? (VAR)VAR(b) : (VAR)VAR(a))"

So the actual compilation will just map emit to all forms passed and string-template will handle the job of putting them into an empty C++ skeleton.

(defn emit-source [form]
  (let [state (ref {:lambdas [] :symbol-table #{} :native-declarations []})
        body (doall (map #(emit % state) (process form)))]
    (assoc @state :body body)))

Runtime

On the C++ side we define our own object system, which includes the following types,

  • Sequence
  • Lambda
  • Boolean
  • Keyword
  • Pointer
  • Integer
  • Float
  • Character (There is no string type, strings are converted to lists of characters.)
class Object{
    public:
      Object() : refCount(0) {}
      virtual ~Object() {};

      virtual int getType() = 0;
      virtual var toOutputStream() = 0;
      virtual var equals(var o) = 0;

      void addRef() { refCount++; }
      bool subRef() { return (--refCount <= 0); }


      void* operator new(size_t size){ 
        return malloc(size); 
      } 

      void  operator delete(void * ptr){ 
        free(ptr); 
      }

      void* operator new[](size_t size){ 
        return malloc(size); 
      }

      void  operator delete[](void * ptr){ 
        free(ptr); 
      }

    private:
      int refCount;
    };

All our types are derived from the base Object type,(defining new/delete is needed because in avr-gcc they are not defined.)

class Boolean : public Object { 
public:
  Boolean(bool b){value = b;}
  int getType(){ return BOOLEAN_TYPE;}

  bool asBool() { return value; }

  var equals(var o){
    if (OBJECT(o)->getType() != BOOLEAN_TYPE)
      return false;

    return (value == BOOLEAN(o)->asBool());
  }

  var toOutputStream(){ 
    if (value)
      fprintf(OUTPUT_STREAM, "true"); 
    else
      fprintf(OUTPUT_STREAM, "false"); 

    return var();
  }
private:
  bool value;
};

except functors, they derive from the class Lambda, which has a single invoke method that takes a sequence of vars as argument, this allows us to execute them in a uniform fashion.

class Lambda : public Object{ 
public:
  virtual var invoke(var args) = 0;
};

Garbage collection is handled by reference counting, a var holds a pointer to an Object, everything is passed around as vars it is responsible for incrementing/decrementing the reference count, when it reaches zero it will automatically free the Object.

class var{
public:
  var(Object* ptr=0) : m_ptr(ptr) { addRef(); }

  var(const var& p) : m_ptr(p.m_ptr) { addRef(); }

  ~var() { subRef(); }

  var& operator= (const var& p){
    return *this = p.m_ptr;
  }

  var& operator= (Object* ptr){
    if (m_ptr != ptr){
      subRef();
      m_ptr=ptr;
      addRef();
    }
    return *this;
  }

  var(int i);
  var(float f);
  var(bool b);
  var(char b);

  var& operator, (const var& m);
  var toOutputStream() {
    if (m_ptr != NULL )
      m_ptr->toOutputStream();
    else
      fprintf(OUTPUT_STREAM, "nil");
  }

  Object* get() { return m_ptr; }

private:
  void addRef(){
    // Only change if non-null
    if (m_ptr) m_ptr->addRef();
  }

  void subRef(){
    // Only change if non-null
    if (m_ptr){
      // Subtract and test if this was the last pointer.
      if (m_ptr->subRef()){
        delete m_ptr;
        m_ptr=0;
      }
    }
  }

  Object* m_ptr;
};

Once our object system is in place we can define rest of the runtime (functions/macros) using our Clojure subset,

(defn first [x]
  #<
  if(x.get() == NULL)
    __result = VAR();
  else
    __result = SEQUENCE(x)->first();
  >#)

We can embed C++ code into our functions, which is how most of the primitive functions are defined such as the first function above, once primitives are in place rest can be defined in pure Clojure,

(defn println [& more]
  (apply print more)
  (newline))

As for macros, normal Clojure rules apply since they are expended using Clojure, the only exception is that stuff should not expand to fully qualified Clojure symbols, so the symbol fn should not expand to clojure.core/fn,

(defmacro defn [name args & body]
  (list 'def name (cons 'fn `( ~args ~@body))))

List of all functions and macros defined,

defn not= when while forever
and or cond not nil?
empty? list rest cons while
dotimes apply integer? float? char?
list? print newline println +
\* - / \= <
> >= <= conj inc
dec pos? neg? zero? count
reverse pin-mode digital-write digital-read sleep

Example Code

In order to compile the samples,

lein run -in sample.clj

output will be placed in a directory called solution/,

Arduino LED

(pin-mode 13 :output)

(forever
 (digital-write 13 :high)
 (sleep 500)
 (digital-write 13 :low)
 (sleep 500))

FFI

g++ solution.cpp -I/opt/local/include/ \
                 -L/opt/local/lib \
                 -lopencv_core -lopencv_highgui
(native-declare #<
                #include "opencv/cv.h"
                #include "opencv/highgui.h"
                >#)

(defn wait-key [i] "__result = var((char)cvWaitKey(NUMBER(i)->intValue()));")

(defn video-capture [i]
  #<
  cv::VideoCapture *cap = new cv::VideoCapture(NUMBER(i)->intValue());
  if (cap->isOpened())
   __result = var(new Pointer(cap));
  >#)

(defn named-window [n] "cv::namedWindow(toCppString(n),1);")

(defn query-frame [c]
  #<
  cv::VideoCapture *cap = static_cast<cv::VideoCapture*>(POINTER(c)->ptr);
  cap->grab();
  cv::Mat *image = new cv::Mat;
  cap->retrieve(*image, 0);
  __result = var(new Pointer(image));
  >#)

(defn show-image [f img]
  #<
  cv::Mat *i = static_cast<cv::Mat*>(POINTER(img)->ptr);
  imshow(toCppString(f), *i);
  >#)

(def cam (video-capture 0))

(named-window "cam")

(while (not= (wait-key 1) \q)
  (let [f (query-frame cam)]
    (show-image "cam" f)))