Haskell: Functional Programming, Solid Code, Big Data

Bryan O'Sullivan

2011-09-18

Welcome!

main = putStrLn "hello!"

Copy these slides (if you want)

git clone https://github.com/bos/strange-loop-2011

My Haskell background

What to expect 1

What to expect 2

Your tools

What else is needed?

Problem definition

Given a web site, we want to scrape it and find important web pages.

This involves a lot of figuring stuff out!

  1. Learn Haskell

  2. Download one web page

  3. Extract links from a page, so we can find more pages to download

  4. Once we're done, compute which ones are important

  5. Make it all fast?

Let's get started!

Create a file named Hello.hs and give it the following contents:

main = putStrLn "hello, world!"

The suffix .hs is the standard for Haskell source files.

File names start with a capital letter, and everyone uses CamelCase.

Building it

This command will look for Hello.hs in the current directory, and compile it:

ghc --make Hello

The generated executable will be named Hello (Hello.exe on Windows).

Checking in

Is everyone able to build and run their Hello executable?

Something a little more convenient

It's nice to have fast, native code at our fingertips.

But when I'm working, I expect a few things:

For these circumstances, a full compiler is a bit slow.

Instead, I often use the interactive interpreter, ghci.

Let's start GHCi

Easily done:

ghci

It will display a startup banner, followed by a prompt:

Prelude>

This default prompt tells us which modules are available to play with.

Play around

The ghci interpreter evaluates expressions interactively.

Try it out:

2 + 2
123456781234567812345678 * 87654321876543
"foo" ++ "bar"

(That ++ is the "append" operator.)

Directives

All interpreter directives start with a ":" character.

Let's load our source file into ghci:

:load Hello.hs

Now the ghci prompt changes:

*Main>

Running our code in ghci

We defined a function named main, so let's invoke it:

main

Did that work for you?

What about this?

putStrLn "hi mom!"

A few more useful directives

Remember, all ghci directives start with a ":".

Final ghci efficiency tips

We can abbreviate directives:

We also have command line history and editing.

Getting used to the cycle

Use :edit or your text editor to change the "hello" text.

Use :reload to reload the source file.

Test out your redefinition of main.

Lists and strings

[1,2,3,4]
['h','e','l','l','o']

Double quotes are just syntactic sugar for the longer form:

"hello"

What does this print?

"foo" == ['f','o','o']

Calling functions: 1

We use white space to separate a function from its argument:

head "foo"
head [1,2,3]
tail [1,2,3]

Calling functions: 2

If a function takes multiple arguments, we separate them with white space:

min 3 4

If an argument is a compound expression, wrap it in parentheses:

compare (3+5) (2+7)
max (min 3 4) 5

Quick exercises: 1

Use ghci as a calculator.

The ** operator performs exponentiation.

Quick exercises: 2

The notation ['a'..'z'] generates a list from start to end, inclusive.

The sum function adds the elements of a list.

Quick exercises: 3

The show function renders a value as a string. Try it!

show (1 == 2)

The length function tells us how many elements are in a list.

length [1,2,3]

Defining a function

It is pretty simple to define a new function.

Open up your text editor, create a new file with a .hs extension, and get writing!

isOdd x  =  (rem x 2) == 1

Load your source file into ghci and give myOdd a try.

Making life more interesting

Now we can define very simple functions, but we're missing some important building blocks for more fun.

So let's get to it!

Conditional execution

Q: What does the familiar if look like in Haskell?

A: Familiar!

gcd a b = if b == 0
then a
else gcd b (rem a b)

We have the following elements:

Finally! A tiny bit about types

The two possible results of an if expression must have the same type.

If then evaluates to a String, well else must too!

For instance, this makes no sense:

if True
then 3.14
else "wombat"

We are forbidden from writing ill-typed expressions like this.

What about else?

In imperative languages, we can usually leave out the else clause after an if.

Not so in Haskell.

Why does this make sense for imperative languages, but not Haskell?

A nearly trivial exercise

Write a function that appends ", world" to its argument if the argument is "hello", or just returns its argument unmodified otherwise.

Lists in Haskell

We already know what a list looks like in Haskell:

[1,2,3]

And of course there's the syntactic sugar for strings:

"foo" == ['f','o','o']

But is this everything there is to know?

List constructors

Supposing we want to construct a list from first principles.

Type this into ghci

Add an element to an empty list:

1 : []

From single-element lists onwards

What about extending that list?

2 : (1 : [])

You're probably guessing now that [2,1] is syntactic sugar for 2:(1:[]). And you're right!

What is the result of this expression?

5 : 8 : [] == [5,8]

Constructors

We refer to [] and : as constructors, because we use them to construct lists.

When you create a list, the Haskell runtime has to remember which constructors you used, and where.

So the value [5,8] is represented as:

What did we see?

Depending on your background, I bet you're thinking something like this:

Right on.

Why do we care about constructors?

Of course Haskell has to remember what a list is constructed of.

It also lets us inspect a list, to see which constructors were used.

How do we do this?

import Data.Char

isCapitalized name
= case name of
(first:rest) -> isUpper first
[] -> False

Welcome to the case expression

A case expression allows us to inspect a structure to see how it was constructed.

isCapitalized name
= case name of
[] -> False
(first:rest) -> isUpper first

If the constructor used was the "add to the front" : operator, then things get more interesting.

Pattern matching

The case expression performs what we call pattern matching.

A worked example

Let's step through the machinery of what happens if we evaluate this expression.

isCapitalized "Ann"

Whew! A few exercises!

Finally! We can write slightly more complex functions.

Now that you can inspect the front of a list, you should be able to process an entire list recursively.

First, please write a function named myLength that computes the number of elements in a list.

Next, write a function named countCaps that calculates the number of capital letters in a string.

countCaps "Monkey Butter" == 2

Counting capital letters

Wow, that countCaps function was a pain, right?

Here's my definition that uses only the machinery we've learned so far:

countCaps string =
case string of
[] -> 0
(x:xs) -> if isUpper x
then 1 + countCaps xs
else countCaps xs

Huh.

I thought Haskell was all about concision!?

Conciseness 1: top-level pattern matching

countCaps []     = 0
countCaps (x:xs) =
if isUpper x
then 1 + countCaps xs
else countCaps xs

We can define a function as a series of equations, each containing a pattern match.

This is nice syntactic sugar for case.

Conciseness 2: guards

countCaps []     = 0
countCaps (x:xs)
| isUpper x = 1 + countCaps xs
| otherwise = countCaps xs

After each | is a guard.

(Yes, patterns in a case can have guards too.)

Before

Like the original version, but with use of case stripped out:

countCaps xs =
if null xs
then 0
else if isUpper (head xs)
then 1 + countCaps (tail xs)
else countCaps (tail xs)

After

Both shorter and easier to follow:

countCaps []     = 0
countCaps (x:xs)
| isUpper x = 1 + countCaps xs
| otherwise = countCaps xs

Another approach

Write a new version of countCaps as follows:

This should give the same result as your first function. Right?

A change of specification

Suppose we want to count the number of lowercase letters in a string.

This seems almost the same as our function for counting uppercase letters.

What can we do with this observation?

Higher order functions

Higher order function: a function that accepts another function as a parameter.

filter pred [] = []
filter pred (x:xs)
| pred x = x : filter pred xs
| otherwise = filter pred xs

How can we use this to define countLowerCase?

Data in, data out

By now, we've seen several definitions like this:

countLowerCase string =
length (filter isLower string)

This is a recurring pattern:

Function composition

Haskell doesn't limit us to giving functions alphanumeric names.

Here, we define a function named simply ".", which we can use as an operator:

(f . g) x = f (g x)

How to use this?

countLowerCase = length . filter isLower

Understanding composition

If that seemed hard to follow, let's make it clearer.

We'll plug the arguments into the RHS of our function definition:

(f . g) x = f (g x)

We had length as the first argument to ".", and filter isLower as the second:

(length . filter isLower) x 
= length (filter isLower x)

Local variables

Inside an expression, we can introduce new variables using let.

let x = 2
y = 4
in x + y

White space

Haskell is sensitive to white space!

White space and local variables

If you're defining local variables, they must start in the same column.

This is good:

let x = 2
y = 4
in x + y

But this will lead to a compiler error:

let x = 2
y = 4
in x + y

Composition exercise

Using function composition wherever you can, write a function that accepts a string and returns a new string containing only the words that begin with vowels.

Example:

disemvowel "I think, therefore I am."
== "I I am."

My solution

Here's how I wrote disemvowel:

disemvowel = 
let isVowel c = toLower c `elem` "aeiou"
in unwords . filter (isVowel . head) . words

Does this remind you of a Unix shell pipeline, only right-to-left?

Problem definition, once again

Given a web site, we want to scrape it and find important web pages.

We're now Haskell experts, right?

Let's download a web page!

We'd really like to rely on a library to download a web page for us.

At times like this, there's a very handy central repository of open source Haskell software:

Go there now!

Click on the Packages link at the top of the page to browse packages.

Alas, the list is overwhelmingly long, but we can find libraries for all kinds of tasks if we're patient.

Are we patient?

Ugh!

Scrolling through thousands of libraries is hard - surely there's a better way?

Enter the cabal command!

Run this command in a terminal window:

cabal update

This downloads the latest index of all software on Hackage.

With the index updated, we can search it:

cabal list http

That still gives us 20+ packages to comb through, but at least it's better than the 3,400 on the Packages web page.

Short-cutting the search

The best HTTP client library is named http-enumerator.

We can read about it online:

That landing page for a package is intimidating, but look towards the bottom, at the section labeled "Modules".

What do you see?

Installing a package

Before we can use http-enumerator, we must install it.

To install the http-enumerator package, we just issue a single command:

cabal install http-enumerator

This command figures out all the other libraries that http-enumerator depends on, and downloads, compiles, and installs the whole lot.

Expect it to take a few minutes and print a lot of output.

Reading docs: packages and modules

While we're waiting for the http-enumerator package and all of its dependencies to install, let's try to figure out how we should use it.

Remember the link to API documentation at the end of the package's web page? Click through to the API docs.

An API page begins with a title that looks something like this:

Network.HTTP.Enumerator

This is the name of a module.

A module is a collection of related code.

A package is a collection of related modules.

(This will sound familiar if you know Python.)

Reading docs: the rest

After the initial blurb, a module's docs consists of type signatures and descriptions.

Here is a really simple type signature:

foo :: String

How the heck do we read this?

The name of the thing being defined comes before the :: characters.

Its type follows after the ::.

This means "the value named foo has the type String".

Haskell's type system

Up until now, we have not bothered talking about types or type signatures.

Every expression and value in Haskell has a single type.

Those types can almost always be inferred automatically by the compiler or interpreter.

The most common basic types

A function signature

Here's another type signature:

words :: String -> [String]

Here we see a new symbol, ->, which means "this is a function".

The type after the last -> is the return type of the function.

All of its predecessors are argument types.

So this is a function that takes one String argument, and returns... what?

List notation

The notation [a] means "a list of values, all of some type a".

So [String] means "a list of values, all of type String".

Type synonyms

What's a String?

We can introduce new synonyms of our own.

type Dollars = Int

A type synonym can be handy for documenting an intended use for an existing type.

Words

words :: String -> [String]

We can now read that this function accepts a string as argument, and returns a list of strings.

From reading its name and type signature, can you guess what words might do?

Another signature

Tell me about this signature:

mystery :: [String] -> String

What are some reasonable possible behaviours for this function?

Reading real-world docs

Here is the very first signature from http-enumerator:

simpleHttp 
  :: (MonadIO m, Failure HttpException m) => 
     String -> m ByteString

This is more complex! How the heck do we read it?

The bits between :: and '=>' are constraints on where we can use simpleHttp - but let's ignore constraints for now.

We'll also ignore that mysterious lowercase m for a bit.

What can we tell about this function?

ByteString

A ByteString is a blob of binary data.

Unlike String, it is not represented as a list, but as a packed array.

However, it contains binary bytes, not text!

Let's play in ghci!

Does everyone have http-enumerator installed now?

Fire up ghci, and let's play with the module:

import Network.HTTP.Enumerator

Notice that after we type this, the prompt changes:

Prelude Network.HTTP.Enumerator>

This tells us that the module has loaded and is available.

Wait! Are you on Windows?

On Windows, we have to set up Winsock before any networking will work.

First, let's load the lowest-level networking module:

import Network.Socket

And here's how we initialize Winsock:

withSocketsDo (return ())

(It's harmless to do this on Unix.)

With that out of the way ...

Finally - let's load a web page!

simpleHttp "http://example.com/"

Did that just print a ton of HTML in the terminal window? All right!

From binary to text

Now we have a ByteString, which we need to turn into text for manipulating.

Let's cheat, and assume that all web pages are encoded in UTF-8.

Pure code

So far, all of the code we have written has been "pure".

Impure code

And yet ... somehow we downloaded a web page!

So can we write code like this?

length (simpleHttp "http://x.org/")

NO.

The type system segregates code that must be pure from code that may have side effects ("impure" code).

Are we stuck?

Well, let's look at a simpler example than simpleHttp.

Type this in ghci:

:type readFile

This will tell us what the type of readFile is.

IO

The :type directive should print something like this:

readFile :: FilePath -> IO String

Notice that IO on the result type?

It means "this function may have side effects".

We often refer to impure functions, with IO in the result type, as actions.

Mixing IO and other stuff

The type system keeps track of which functions have IO in their types, and keeps us honest.

We can still mix pure and impure code in a natural way:

charCount fileName = do
contents <- readFile fileName
return (length contents)

"do" notation

Critical to what we just saw was the do keyword at the beginning of the function definition.

This introduces a series of IO actions, one per line.

Capturing the results of impure code

To capture the result of an IO action, we use <- instead of =.

contents <- readFile fileName

The result (contents) is pure - it does not have the IO type.

This is how we supply pure code with data returned from impure code.

The "return" action

This is not the return type you're used to!

It takes a pure value (without IO in its type), and wraps it with the IO type.

Pure code can't call impure code, but it can thread data back into the impure world using return.

Haskell programs and IO

When you write a Haskell program, its entry point must be named main.

The type of main must be:

main :: IO ()

() is named "unit", and means more or less the same thing as void in C or Java.

What this means is that all Haskell programs are impure!

Binary to text

Remember we were planning to cheat earlier?

We had this:

simpleHttp :: String -> IO ByteString

We need something whose result is an IO String instead.

How should that look?

UTF-8 conversion

To do the conversion, let's grab a package named utf8-string.

cabal install utf8-string

That contains a package named Data.ByteString.Lazy.UTF8.

import Data.ByteString.Lazy.UTF8

It defines a function named toString:

toString :: ByteString -> String

UTF-8 conversion exercise

Write an action that downloads a URL and converts it from a ByteString to a String using toString.

Write a type signature for the action.

Downloading and saving a web page

Use your download function to save a local copy of the page you just wrote.

saveAs :: String -> Int -> IO ()

For simplicity, let's save the local files as names containing numbers:

makeFileName :: Int -> FilePath
makeFileName k = "download-" ++ show k ++ ".html"

To save a local copy of a file, you'll need the writeFile action.

Shoveling through HTML

Two truisms:

So! Let's use another library.

cabal install tagsoup

The tagsoup package can parse arbitrarily messy HTML.

It will feed us a list of events, like a SAX parser.

Dealing with problems

Try this:

head [1]

Now try this:

head []

Oops

If we pass an empty list, the head function throws an exception.

Suppose we need a version of head that will not throw an exception.

safeHead :: [a] -> ????

What should the ???? be?

Let's invent something.

safeHead (x:xs) = Some x
safeHead [] = None

Some? None?

To bring these constructors into existence, we need to declare a new type.

data Perhaps a = Some a
| None

The | character separates the constructors. We can read it as:

Maybe

Actually, Haskell already has a Perhaps type.

data Maybe a = Just a
| Nothing

The a is a type parameter, meaning that when we write this type, we have to supply another type as a parameter:

Using constructors

If we want to construct a Maybe Int using the Just constructor, we must pass it an Int.

Just 1  :: Maybe Int
Nothing :: Maybe Int

This will not work, because the types don't match:

Just [1] :: Maybe String

Pattern matching over constructors

We can pattern match over the constructors for Maybe just as we did for lists.

case foo of
Just x -> x
Nothing -> bar

Tags

The tagsoup package defines the following type:

data Tag = TagOpen String [Attribute]
| TagClose String
| TagText String
| TagComment String
| TagWarning String
| TagPosition Row Column

What do you think the constructors mean?

Pattern matching on a Tag

Suppose we want to write a predicate that will tell is if a Tag is an opening tag.

Don't care!

Our first body looked like this:

isOpenTag (TagOpen x y)     = True
isOpenTag (TagClose x) = False
isOpenTag (TagText x) = False
isOpenTag (TagComment x) = False
isOpenTag (TagWarning x) = False
isOpenTag (TagPosition x y) = False

Concise, but ugly.

The wild card pattern

We can write "I don't care what this pattern or variable is" using the "_" character.

isOpenTag (TagOpen _ _) = True
isOpenTag _ = False

The wild card pattern always matches.

Just a quick question

Why don't we write the function like this?

isOpenTag  _            = False
isOpenTag (TagOpen _ _) = True

Extracting links from a web page

Suppose we have a page in memory already.

Let's use it!

processPage url = do
page <- download url
return (parseTags page)

Tidying tags up

Parsed tags can contain a mixture of tag names.

<A HREF="...">
<a hrEF="...">

Canonical tags

Let's use our function to clean up the result of parseTags.

processPage url = do
page <- download url
return
(canonicalizeTags
(parseTags page))

Extracting links

We only care about open tags that are links, so <a> tags.

Whee!

This cascade is getting a bit ridiculous.

processPage url = do
page <- download url
return
(filter (isTagOpenName "a")
(canonicalizeTags
(parseTags page)))

Two observations:

A rewriting exercise

Take this function and split it into pure and impure parts.

Write the pure part using function composition.

processPage url = do
page <- download url
return
(filter (isTagOpenName "a")
(canonicalizeTags
(parseTags page)))

My solution

processPage url = do
page <- download url
return (process page)

process =
filter (isTagOpenName "a") .
canonicalizeTags .
parseTags

More stuff to filter out

Let's skip nofollow links.

We want to get the "rel" attribute of a tag.

No following

nofollow tag = fromAttrib "rel" tag == "nofollow"
process =
filter (not . nofollow) .
filter (isTagOpenName "a") .
canonicalizeTags .
parseTags

We have a list of <a> tags

How would we extract the "href" attribute from every element of the list?

Only non-empty <a href> tags

process =
filter (not . null) .
map (fromAttrib "href") .
filter (not . nofollow) .
filter (isTagOpenName "a") .
canonicalizeTags .
parseTags

Canonical URLs

Links can be absolute, relative, or invalid garbage, and we only want valid-looking absolute links.

To properly create an absolute link, we need to know the absolute URL of the page we're looking at.

canonicalizeLink :: String -> String -> Maybe String

Working with URIs

The Network.URI package contains some functions we might find handy.

parseURI :: String -> Maybe URI
parseURIReference :: String -> Maybe URI
uriToString id "" :: URI -> String
nonStrictRelativeTo :: URI -> URI -> Maybe URI

A monster of indentation

This is really hard to read!

import Network.URI

canon :: String -> String -> Maybe String
canon referer path =
case parseURI referer of
Nothing -> Nothing
Just r ->
case parseURIReference path of
Nothing -> Nothing
Just p ->
case nonStrictRelativeTo p r of
Nothing -> Nothing
Just u ->
Just (uriToString id u "")

Surely there's a better way.

Stair stepping

Notice that that function was a series of case inspections of Maybe values?

Suppose we had a function that accepted a normal value, and returned a Maybe value.

a -> Maybe b

And suppose we had a concise syntax for writing an anonymous function.

\a -> "hi mom! " ++ a

The \ is pronounced "lambda".

Observation

The case analysis is quite verbose. Suppose we had a function that performed it, and called another function if our value was Just.

bind :: Maybe a -> (a -> Maybe b) -> Maybe b
bind Nothing _ = Nothing
bind (Just value) action = action value

Using bind

How could we use this?

canon1 referer path =
parseURI referer `bind`
\r -> parseURIReference path `bind`
\p -> nonStrictRelativeTo p r `bind`
\u -> Just (uriToString id u "")

If we enclose a function name in backticks, we can use the function as an infix operator.

Reformatting the code

canon referer path =
parseURI referer `bind` \r ->
parseURIReference path `bind` \p ->
nonStrictRelativeTo p r `bind` \u ->
Just (uriToString id u "")

A built-in name for bind

The >>= operator is a more general version of our bind function.

canon referer path =
parseURI referer >>= \r ->
parseURIReference path >>= \p ->
nonStrictRelativeTo p r >>= \u ->
Just (uriToString id u "")

Using syntactic sugar

Here's some tidier syntax that should look familiar.

canonicalize :: String -> String -> Maybe String

canonicalize referer path = do
r <- parseURI referer
p <- parseURIReference path
u <- nonStrictRelativeTo p r
return (uriToString id u "")

Nearly there

process url =
map (canonicalize url) .
filter (not . null) .
map (fromAttrib "href") .
filter (\t -> fromAttrib "rel" t /= "nofollow") .
filter (isTagOpenName "a") .
canonicalizeTags .
parseTags

One awkward thing: what is the type of this function?

From [Maybe a] to [a]

Go to this web site:

Type this into the search box:

[Maybe a] -> [a]

What does the first result say?

We're there!

import Data.Maybe
import Network.URI

links url =
catMaybes .
map (canonicalize url) .
filter (not . null) .
map (fromAttrib "href") .
filter (\t -> fromAttrib "rel" t /= "nofollow") .
filter (isTagOpenName "a") .
canonicalizeTags .
parseTags

From links to spidering

If we can download the links from one page, we can easily write a spider to follow those links.

To keep things simple, let's set a limit on the number of pages we'll download.

What information do we want to generate?

What do we need to track along the way?

What we need to track

Here's the state we need to maintain:

Tracking what we've seen

For any given page, we need to remember both it and all the pages it links to.

One possibility for associating the two is a tuple:

("http://x.org/", ["http://microsoft.com/"])

Tuples are useful any time we want mixed-type data without the hassle of creating a new type.

Speaking of a new type, here's how we'd define one:

data Link = Link String [String]

-- Let's define some accessors, too.
linkFrom (Link url _) = url
linkTo (Link _ links) = links

Avoiding duplication

We don't want to visit any URL twice.

How do we avoid this?

visited url = elem url . map linkTo

This function has a problem - what is that problem?

Better performance

We really want a structure with a fast lookup operation.

What would you use in your language?

Maps and importing

In Haskell, we have mutable hash tables, but we don't use them.

Instead, we use immutable key-value maps.

We must perform fancy module importing tricks because the Data.Map module defines a lot of names that would otherwise overlap with built-in names.

This means "only import the name Map from Data.Map":

import Data.Map (Map)

And this means "import everything from Data.Map, but all those names must be prefixed with Map.":

import qualified Data.Map as Map

What use is an immutable data structure?

Everyone knows how to add a key and value to a hash table, right?

And that seems like a fundamental operation.

What do we do with maps?

How can this possibly work? Is it efficient?

A fistful of dollars

Here's a surprisingly handy built-in operator:

f $ x = f x

Why is this useful? Because it lets us eliminate parentheses.

Before:

explode k = error ("failed on " ++ show k)

After:

explode k = error $ "failed on " ++ show k

Partial application

This is annoying to write:

increment k = 1 + k

Almost as bad:

\k -> 1 + k

Much handier, and identical:

(1+)

In fact, this is valid:

increment = (1+)

Spidering, in all its glory

spider :: Int -> URL -> IO (Map URL [URL])
spider count url0 = go 0 Map.empty (Set.singleton url0)
where
go k seen queue0
| k >= count = return seen
| otherwise =
case Set.minView queue0 of
Nothing -> return seen
Just (url, queue) -> do
page <- download url
let ls = links url page
newSeen = Map.insert url ls seen
notSeen = Set.fromList .
filter (`Map.notMember` newSeen) $ ls
newQueue = queue `Set.union` notSeen
go (k+1) newSeen newQueue

Where do we stand?

We can now:

What remains?

Fin

At this point, if we have miraculously not run out of time, we're going on a choose-your-own-adventure session in Emacs.

Thanks for sticking with the slideshow so far!