Bryan O'Sullivan
2011-09-18
main = putStrLn "hello!"
My name is Bryan O'Sullivan
I started using Haskell in 1993
I wrote a book about it, "Real World Haskell"
I write lots of open source code
My company invests heavily and openly in Haskell
git clone https://github.com/bos/strange-loop-2011
Began using Haskell at university
When I graduated in 1995, I switched to more mainstream languages
My interest reawakened around 2005
I've found Haskell to be a great general-purpose language
The community of people is amazing
Haskell is a fairly big language
Since so much is unfamiliar to newcomers, expect to wander far from your comfort zone
I'm going to teach you interesting things, but not everything
This is a hands-on workshop: you'll be writing code!
There will be a short break every 45 minutes
Don't be afraid to ask questions!
You've already installed the Haskell Platform, right?
This gives us a great toolchain
The GHC compiler (ghc
)
The GHCi interpreter (ghci
)
The Cabal package manager (cabal
)
Some handy libraries and tools
A text editor
A terminal window
Given a web site, we want to scrape it and find important web pages.
This involves a lot of figuring stuff out!
Learn Haskell
Download one web page
Extract links from a page, so we can find more pages to download
Once we're done, compute which ones are important
Make it all fast?
Create a file named Hello.hs
and give it the following contents:
main = putStrLn "hello, world!"
The suffix .hs
is the standard for Haskell source files.
File names start with a capital letter, and everyone uses CamelCase
.
This command will look for Hello.hs
in the current directory, and compile it:
ghc --make Hello
The generated executable will be named Hello
(Hello.exe
on Windows).
--make
option up there tells GHC to automatically deal with dependencies on source files and packages.Is everyone able to build and run their Hello
executable?
It's nice to have fast, native code at our fingertips.
But when I'm working, I expect a few things:
I do lots of exploration.
I make tons of mistakes.
For these circumstances, a full compiler is a bit slow.
Instead, I often use the interactive interpreter, ghci
.
Easily done:
ghci
It will display a startup banner, followed by a prompt:
Prelude>
This default prompt tells us which modules are available to play with.
The ghci
interpreter evaluates expressions interactively.
Try it out:
2 + 2
123456781234567812345678 * 87654321876543
"foo" ++ "bar"
(That ++
is the "append" operator.)
All interpreter directives start with a ":
" character.
Let's load our source file into ghci
:
:load Hello.hs
Now the ghci
prompt changes:
*Main>
We defined a function named main
, so let's invoke it:
main
Did that work for you?
What about this?
putStrLn "hi mom!"
Remember, all ghci
directives start with a ":
".
:help
tells us what directives are available.
:reload
reloads the file we last :load
ed.
:edit
launches our text editor on the file you most recently :load
ed. (Does not automatically :reload
.)
:quit
exits back to the shell.
We can abbreviate directives:
:e
is treated as :edit
:r
is :reload
We also have command line history and editing.
On Unix, compatible with readline
.
On Windows, the same as cmd.exe
.
Use :edit
or your text editor to change the "hello" text.
Use :reload
to reload the source file.
Test out your redefinition of main
.
main
.[1,2,3,4]
['h','e','l','l','o']
Double quotes are just syntactic sugar for the longer form:
"hello"
What does this print?
"foo" == ['f','o','o']
We use white space to separate a function from its argument:
head "foo"
head [1,2,3]
tail [1,2,3]
If a function takes multiple arguments, we separate them with white space:
min 3 4
If an argument is a compound expression, wrap it in parentheses:
compare (3+5) (2+7)
max (min 3 4) 5
Use ghci
as a calculator.
The **
operator performs exponentiation.
The notation ['a'..'z']
generates a list from start to end, inclusive.
The sum
function adds the elements of a list.
The show
function renders a value as a string. Try it!
show (1 == 2)
The length
function tells us how many elements are in a list.
length [1,2,3]
It is pretty simple to define a new function.
Open up your text editor, create a new file with a .hs
extension, and get writing!
isOdd x = (rem x 2) == 1
We start with the name of the function.
Next come the names we want to give its parameter(s), separated by white space.
After those come a single =
character, with the body of the function following.
Load your source file into ghci
and give myOdd
a try.
Now we can define very simple functions, but we're missing some important building blocks for more fun.
So let's get to it!
Q: What does the familiar if
look like in Haskell?
A: Familiar!
gcd a b = if b == 0
then a
else gcd b (rem a b)
We have the following elements:
A Boolean expression
then
an expression that will be the result if the Boolean is True
else
an expression that will be the result if the Boolean is False
The two possible results of an if
expression must have the same type.
If then
evaluates to a String
, well else
must too!
For instance, this makes no sense:
if True
then 3.14
else "wombat"
We are forbidden from writing ill-typed expressions like this.
In imperative languages, we can usually leave out the else
clause after an if
.
Not so in Haskell.
Why does this make sense for imperative languages, but not Haskell?
Write a function that appends ", world"
to its argument if the argument is "hello"
, or just returns its argument unmodified otherwise.
++
.We already know what a list looks like in Haskell:
[1,2,3]
And of course there's the syntactic sugar for strings:
"foo" == ['f','o','o']
But is this everything there is to know?
Supposing we want to construct a list from first principles.
We write the empty list as []
.
Given an existing list, we can add another element to the front of the list using the :
operator.
Add an element to an empty list:
1 : []
What about extending that list?
2 : (1 : [])
You're probably guessing now that [2,1]
is syntactic sugar for 2:(1:[])
. And you're right!
What is the result of this expression?
5 : 8 : [] == [5,8]
We refer to []
and :
as constructors, because we use them to construct lists.
When you create a list, the Haskell runtime has to remember which constructors you used, and where.
So the value [5,8]
is represented as:
A :
constructor, with 5
as its first parameter, and as its second ...
Another :
constructor, this time with 8
as its first parameter, and now as its second ...
A []
constructor
Depending on your background, I bet you're thinking something like this:
"Hey! Haskell lists look like singly linked lists!"
"Hey! That looks just like lists built out of cons
cells in Lisp!"
Right on.
Of course Haskell has to remember what a list is constructed of.
It also lets us inspect a list, to see which constructors were used.
How do we do this?
import Data.Char
isCapitalized name
= case name of
(first:rest) -> isUpper first
[] -> False
A case
expression allows us to inspect a structure to see how it was constructed.
isCapitalized name
= case name of
[] -> False
(first:rest) -> isUpper first
In between case
and of
is the expression we are inspecting.
If the constructor used was the empty-list constructor []
, then clearly the name
we're inspecting is empty, hence not capitalized.
If the constructor used was the "add to the front" :
operator, then things get more interesting.
Whatever was the first parameter of the :
constructor is bound to the name first
.
The second parameter of the :
constructor (i.e. everything in the list after the first element) is bound to the name rest
.
The expression following the ->
is evaluated with these values.
The case
expression performs what we call pattern matching.
Patterns are checked from top to bottom.
As soon as a match is found, its right hand side (after the ->
) is used as the result of the entire case
expression.
If no match succeeds, an exception is thrown.
Let's step through the machinery of what happens if we evaluate this expression.
isCapitalized "Ann"
Finally! We can write slightly more complex functions.
Now that you can inspect the front of a list, you should be able to process an entire list recursively.
First, please write a function named myLength
that computes the number of elements in a list.
Next, write a function named countCaps
that calculates the number of capital letters in a string.
countCaps "Monkey Butter" == 2
Wow, that countCaps function was a pain, right?
Here's my definition that uses only the machinery we've learned so far:
countCaps string =
case string of
[] -> 0
(x:xs) -> if isUpper x
then 1 + countCaps xs
else countCaps xs
I thought Haskell was all about concision!?
countCaps [] = 0
countCaps (x:xs) =
if isUpper x
then 1 + countCaps xs
else countCaps xs
We can define a function as a series of equations, each containing a pattern match.
This is nice syntactic sugar for case
.
countCaps [] = 0
countCaps (x:xs)
| isUpper x = 1 + countCaps xs
| otherwise = countCaps xs
After each |
is a guard.
If a pattern matches, we evaluate each Boolean guard expression from top to bottom.
When one succeeds, we evaluate the RHS as the body of the function.
(Yes, patterns in a case
can have guards too.)
Like the original version, but with use of case
stripped out:
countCaps xs =
if null xs
then 0
else if isUpper (head xs)
then 1 + countCaps (tail xs)
else countCaps (tail xs)
Both shorter and easier to follow:
countCaps [] = 0
countCaps (x:xs)
| isUpper x = 1 + countCaps xs
| otherwise = countCaps xs
Write a new version of countCaps
as follows:
Write a function that goes through a list, and which generates a new list that contains only its capital letters.
Use length
to count the number of elements.
This should give the same result as your first function. Right?
Suppose we want to count the number of lowercase letters in a string.
This seems almost the same as our function for counting uppercase letters.
What can we do with this observation?
Higher order function: a function that accepts another function as a parameter.
filter pred [] = []
filter pred (x:xs)
| pred x = x : filter pred xs
| otherwise = filter pred xs
How can we use this to define countLowerCase
?
By now, we've seen several definitions like this:
countLowerCase string =
length (filter isLower string)
This is a recurring pattern:
A function of one argument
It's being fed the result of ...
... another function of one argument
Haskell doesn't limit us to giving functions alphanumeric names.
Here, we define a function named simply ".
", which we can use as an operator:
(f . g) x = f (g x)
How to use this?
countLowerCase = length . filter isLower
If that seemed hard to follow, let's make it clearer.
We'll plug the arguments into the RHS of our function definition:
(f . g) x = f (g x)
We had length
as the first argument to ".
", and filter isLower
as the second:
(length . filter isLower) x
= length (filter isLower x)
Inside an expression, we can introduce new variables using let
.
let x = 2
y = 4
in x + y
Local definitions come after the let
.
The expression where we use them comes after the in
.
Haskell is sensitive to white space!
A top-level definition starts in the leftmost column.
After the beginning of a definition, if the next non-empty line is indented further, it is treated as a continuation of that definition.
Never use tabs in your source files.
If you're defining local variables, they must start in the same column.
This is good:
let x = 2
y = 4
in x + y
But this will lead to a compiler error:
let x = 2
y = 4
in x + y
Using function composition wherever you can, write a function that accepts a string and returns a new string containing only the words that begin with vowels.
words
and unwords
functions before you start.Example:
disemvowel "I think, therefore I am."
== "I I am."
Here's how I wrote disemvowel
:
disemvowel =
let isVowel c = toLower c `elem` "aeiou"
in unwords . filter (isVowel . head) . words
Does this remind you of a Unix shell pipeline, only right-to-left?
Given a web site, we want to scrape it and find important web pages.
We're now Haskell experts, right?
We'd really like to rely on a library to download a web page for us.
At times like this, there's a very handy central repository of open source Haskell software:
(Everyone just calls it "Hackage")
Go there now!
Click on the Packages link at the top of the page to browse packages.
Alas, the list is overwhelmingly long, but we can find libraries for all kinds of tasks if we're patient.
Are we patient?
Scrolling through thousands of libraries is hard - surely there's a better way?
Enter the cabal
command!
Run this command in a terminal window:
cabal update
This downloads the latest index of all software on Hackage.
With the index updated, we can search it:
cabal list http
That still gives us 20+ packages to comb through, but at least it's better than the 3,400 on the Packages web page.
The best HTTP client library is named http-enumerator
.
We can read about it online:
That landing page for a package is intimidating, but look towards the bottom, at the section labeled "Modules".
What do you see?
Before we can use http-enumerator
, we must install it.
To install the http-enumerator
package, we just issue a single command:
cabal install http-enumerator
This command figures out all the other libraries that http-enumerator
depends on, and downloads, compiles, and installs the whole lot.
Expect it to take a few minutes and print a lot of output.
While we're waiting for the http-enumerator
package and all of its dependencies to install, let's try to figure out how we should use it.
Remember the link to API documentation at the end of the package's web page? Click through to the API docs.
An API page begins with a title that looks something like this:
Network.HTTP.Enumerator
This is the name of a module.
A module is a collection of related code.
A package is a collection of related modules.
(This will sound familiar if you know Python.)
After the initial blurb, a module's docs consists of type signatures and descriptions.
Here is a really simple type signature:
foo :: String
How the heck do we read this?
The name of the thing being defined comes before the ::
characters.
Its type follows after the ::
.
This means "the value named foo
has the type String
".
Up until now, we have not bothered talking about types or type signatures.
Every expression and value in Haskell has a single type.
Those types can almost always be inferred automatically by the compiler or interpreter.
Bool
Int
Char
Double
Here's another type signature:
words :: String -> [String]
Here we see a new symbol, ->
, which means "this is a function".
The type after the last ->
is the return type of the function.
All of its predecessors are argument types.
So this is a function that takes one String
argument, and returns... what?
The notation [a]
means "a list of values, all of some type a
".
So [String]
means "a list of values, all of type String
".
What's a String
?
[Char]
, i.e. "a list of Char
".We can introduce new synonyms of our own.
type Dollars = Int
A type synonym can be handy for documenting an intended use for an existing type.
words :: String -> [String]
We can now read that this function accepts a string as argument, and returns a list of strings.
From reading its name and type signature, can you guess what words
might do?
Tell me about this signature:
mystery :: [String] -> String
What are some reasonable possible behaviours for this function?
Here is the very first signature from http-enumerator
:
simpleHttp
:: (MonadIO m, Failure HttpException m) =>
String -> m ByteString
This is more complex! How the heck do we read it?
The bits between ::
and '=>' are constraints on where we can use simpleHttp
- but let's ignore constraints for now.
We'll also ignore that mysterious lowercase m
for a bit.
What can we tell about this function?
A ByteString
is a blob of binary data.
Unlike String
, it is not represented as a list, but as a packed array.
However, it contains binary bytes, not text!
ByteString
for working with data that you have to manipulate as text.Does everyone have http-enumerator
installed now?
Fire up ghci
, and let's play with the module:
import Network.HTTP.Enumerator
Notice that after we type this, the prompt changes:
Prelude Network.HTTP.Enumerator>
This tells us that the module has loaded and is available.
On Windows, we have to set up Winsock before any networking will work.
First, let's load the lowest-level networking module:
import Network.Socket
And here's how we initialize Winsock:
withSocketsDo (return ())
(It's harmless to do this on Unix.)
Finally - let's load a web page!
simpleHttp "http://example.com/"
Did that just print a ton of HTML in the terminal window? All right!
Now we have a ByteString
, which we need to turn into text for manipulating.
Let's cheat, and assume that all web pages are encoded in UTF-8.
So far, all of the code we have written has been "pure".
The behaviour of all of our functions has depended only on their inputs.
All of our data is immutable.
There's thus no way to change a global variable and modify the behaviour of a function.
And yet ... somehow we downloaded a web page!
So can we write code like this?
length (simpleHttp "http://x.org/")
NO.
The type system segregates code that must be pure from code that may have side effects ("impure" code).
Well, let's look at a simpler example than simpleHttp
.
Type this in ghci
:
:type readFile
This will tell us what the type of readFile
is.
The :type
directive should print something like this:
readFile :: FilePath -> IO String
Notice that IO
on the result type?
It means "this function may have side effects".
We often refer to impure functions, with IO
in the result type, as actions.
The type system keeps track of which functions have IO
in their types, and keeps us honest.
We can still mix pure and impure code in a natural way:
charCount fileName = do
contents <- readFile fileName
return (length contents)
Critical to what we just saw was the do
keyword at the beginning of the function definition.
This introduces a series of IO
actions, one per line.
To capture the result of an IO
action, we use <-
instead of =
.
contents <- readFile fileName
The result (contents
) is pure - it does not have the IO
type.
This is how we supply pure code with data returned from impure code.
This is not the return
type you're used to!
It takes a pure value (without IO
in its type), and wraps it with the IO
type.
Pure code can't call impure code, but it can thread data back into the impure world using return
.
When you write a Haskell program, its entry point must be named main
.
The type of main
must be:
main :: IO ()
()
is named "unit", and means more or less the same thing as void
in C or Java.
What this means is that all Haskell programs are impure!
Remember we were planning to cheat earlier?
We had this:
simpleHttp :: String -> IO ByteString
We need something whose result is an IO String
instead.
How should that look?
To do the conversion, let's grab a package named utf8-string
.
cabal install utf8-string
That contains a package named Data.ByteString.Lazy.UTF8
.
import Data.ByteString.Lazy.UTF8
It defines a function named toString
:
toString :: ByteString -> String
Write an action that downloads a URL and converts it from a ByteString
to a String
using toString
.
Write a type signature for the action.
Haskell definitions usually don't require type signatures.
Nevertheless, we write them for documentation on almost all top-level definitions.
Use your download
function to save a local copy of the page you just wrote.
saveAs :: String -> Int -> IO ()
For simplicity, let's save the local files as names containing numbers:
makeFileName :: Int -> FilePath
makeFileName k = "download-" ++ show k ++ ".html"
To save a local copy of a file, you'll need the writeFile
action.
Two truisms:
Most HTML in the wild is a mess.
Even parsing well formed HTML is complicated.
So! Let's use another library.
cabal install tagsoup
The tagsoup
package can parse arbitrarily messy HTML.
It will feed us a list of events, like a SAX parser.
Try this:
head [1]
Now try this:
head []
If we pass an empty list, the head
function throws an exception.
Suppose we need a version of head
that will not throw an exception.
safeHead :: [a] -> ????
What should the ????
be?
Let's invent something.
safeHead (x:xs) = Some x
safeHead [] = None
We're using a constructor named Some
to capture the idea "we have a result".
The constructor None
indicates "we don't have a result here".
To bring these constructors into existence, we need to declare a new type.
data Perhaps a = Some a
| None
The |
character separates the constructors. We can read it as:
The Perhaps
type has two constructors:
Some
followed by a single argument
or None
with no arguments
Actually, Haskell already has a Perhaps
type.
data Maybe a = Just a
| Nothing
The a
is a type parameter, meaning that when we write this type, we have to supply another type as a parameter:
Maybe Int
Maybe String
If we want to construct a Maybe Int
using the Just
constructor, we must pass it an Int
.
Just 1 :: Maybe Int
Nothing :: Maybe Int
This will not work, because the types don't match:
Just [1] :: Maybe String
We can pattern match over the constructors for Maybe
just as we did for lists.
case foo of
Just x -> x
Nothing -> bar
The tagsoup
package defines the following type:
data Tag = TagOpen String [Attribute]
| TagClose String
| TagText String
| TagComment String
| TagWarning String
| TagPosition Row Column
What do you think the constructors mean?
Suppose we want to write a predicate that will tell is if a Tag
is an opening tag.
What should the type of this function be?
What should its body look like?
Our first body looked like this:
isOpenTag (TagOpen x y) = True
isOpenTag (TagClose x) = False
isOpenTag (TagText x) = False
isOpenTag (TagComment x) = False
isOpenTag (TagWarning x) = False
isOpenTag (TagPosition x y) = False
Concise, but ugly.
We really only care about one constructor.
We never use the variables x
or y
that we declare.
We can write "I don't care what this pattern or variable is" using the "_
" character.
isOpenTag (TagOpen _ _) = True
isOpenTag _ = False
The wild card pattern always matches.
Since we don't care about x
or y
, we can state that explicitly using _
.
Since we don't care about any constructor except TagOpen
, we can match all the others using _
.
Why don't we write the function like this?
isOpenTag _ = False
isOpenTag (TagOpen _ _) = True
Suppose we have a page in memory already.
Browse the tagsoup
docs, in the Text.HTML.TagSoup
module.
Find a function that will parse a web page into a series of tags.
processPage url = do
page <- download url
return (parseTags page)
Parsed tags can contain a mixture of tag names.
<A HREF="...">
<a hrEF="...">
tagsoup
function that will turn tag names and attributes to lower case.Let's use our function to clean up the result of parseTags
.
processPage url = do
page <- download url
return
(canonicalizeTags
(parseTags page))
We only care about open tags that are links, so <a>
tags.
How would we write the type of a function that will indicate whether a Tag
is an open tag with the correct name?
How would we use this function to extract only the open tags from a list of parsed tags?
This cascade is getting a bit ridiculous.
processPage url = do
page <- download url
return
(filter (isTagOpenName "a")
(canonicalizeTags
(parseTags page)))
Two observations:
Our action is now mostly pure code.
It sure looks like a pipeline.
Take this function and split it into pure and impure parts.
Write the pure part using function composition.
processPage url = do
page <- download url
return
(filter (isTagOpenName "a")
(canonicalizeTags
(parseTags page)))
processPage url = do
page <- download url
return (process page)
process =
filter (isTagOpenName "a") .
canonicalizeTags .
parseTags
Let's skip nofollow
links.
We want to get the "rel"
attribute of a tag.
nofollow tag = fromAttrib "rel" tag == "nofollow"
process =
filter (not . nofollow) .
filter (isTagOpenName "a") .
canonicalizeTags .
parseTags
How would we extract the "href"
attribute from every element of the list?
process =
filter (not . null) .
map (fromAttrib "href") .
filter (not . nofollow) .
filter (isTagOpenName "a") .
canonicalizeTags .
parseTags
Links can be absolute, relative, or invalid garbage, and we only want valid-looking absolute links.
To properly create an absolute link, we need to know the absolute URL of the page we're looking at.
canonicalizeLink :: String -> String -> Maybe String
The Network.URI
package contains some functions we might find handy.
parseURI :: String -> Maybe URI
parseURIReference :: String -> Maybe URI
uriToString id "" :: URI -> String
nonStrictRelativeTo :: URI -> URI -> Maybe URI
This is really hard to read!
import Network.URI
canon :: String -> String -> Maybe String
canon referer path =
case parseURI referer of
Nothing -> Nothing
Just r ->
case parseURIReference path of
Nothing -> Nothing
Just p ->
case nonStrictRelativeTo p r of
Nothing -> Nothing
Just u ->
Just (uriToString id u "")
Surely there's a better way.
Notice that that function was a series of case
inspections of Maybe
values?
Suppose we had a function that accepted a normal value, and returned a Maybe
value.
a -> Maybe b
And suppose we had a concise syntax for writing an anonymous function.
\a -> "hi mom! " ++ a
The \
is pronounced "lambda".
The case
analysis is quite verbose. Suppose we had a function that performed it, and called another function if our value was Just
.
bind :: Maybe a -> (a -> Maybe b) -> Maybe b
bind Nothing _ = Nothing
bind (Just value) action = action value
How could we use this?
canon1 referer path =
parseURI referer `bind`
\r -> parseURIReference path `bind`
\p -> nonStrictRelativeTo p r `bind`
\u -> Just (uriToString id u "")
If we enclose a function name in backticks, we can use the function as an infix operator.
canon referer path =
parseURI referer `bind` \r ->
parseURIReference path `bind` \p ->
nonStrictRelativeTo p r `bind` \u ->
Just (uriToString id u "")
The >>=
operator is a more general version of our bind
function.
canon referer path =
parseURI referer >>= \r ->
parseURIReference path >>= \p ->
nonStrictRelativeTo p r >>= \u ->
Just (uriToString id u "")
Here's some tidier syntax that should look familiar.
canonicalize :: String -> String -> Maybe String
canonicalize referer path = do
r <- parseURI referer
p <- parseURIReference path
u <- nonStrictRelativeTo p r
return (uriToString id u "")
process url =
map (canonicalize url) .
filter (not . null) .
map (fromAttrib "href") .
filter (\t -> fromAttrib "rel" t /= "nofollow") .
filter (isTagOpenName "a") .
canonicalizeTags .
parseTags
One awkward thing: what is the type of this function?
Go to this web site:
Type this into the search box:
[Maybe a] -> [a]
What does the first result say?
import Data.Maybe
import Network.URI
links url =
catMaybes .
map (canonicalize url) .
filter (not . null) .
map (fromAttrib "href") .
filter (\t -> fromAttrib "rel" t /= "nofollow") .
filter (isTagOpenName "a") .
canonicalizeTags .
parseTags
If we can download the links from one page, we can easily write a spider to follow those links.
To keep things simple, let's set a limit on the number of pages we'll download.
What information do we want to generate?
What do we need to track along the way?
Here's the state we need to maintain:
The number of pages we have downloaded
A collection of pages we have seen links to, but haven't downloaded
A collection of pages and their outbound links
For any given page, we need to remember both it and all the pages it links to.
One possibility for associating the two is a tuple:
("http://x.org/", ["http://microsoft.com/"])
Tuples are useful any time we want mixed-type data without the hassle of creating a new type.
Speaking of a new type, here's how we'd define one:
data Link = Link String [String]
-- Let's define some accessors, too.
linkFrom (Link url _) = url
linkTo (Link _ links) = links
We don't want to visit any URL twice.
How do we avoid this?
visited url = elem url . map linkTo
This function has a problem - what is that problem?
We really want a structure with a fast lookup operation.
What would you use in your language?
In Haskell, we have mutable hash tables, but we don't use them.
Instead, we use immutable key-value maps.
We must perform fancy module importing tricks because the Data.Map
module defines a lot of names that would otherwise overlap with built-in names.
This means "only import the name Map
from Data.Map
":
import Data.Map (Map)
And this means "import everything from Data.Map
, but all those names must be prefixed with Map.
":
import qualified Data.Map as Map
Everyone knows how to add a key and value to a hash table, right?
And that seems like a fundamental operation.
What do we do with maps?
How can this possibly work? Is it efficient?
Here's a surprisingly handy built-in operator:
f $ x = f x
Why is this useful? Because it lets us eliminate parentheses.
Before:
explode k = error ("failed on " ++ show k)
After:
explode k = error $ "failed on " ++ show k
This is annoying to write:
increment k = 1 + k
Almost as bad:
\k -> 1 + k
Much handier, and identical:
(1+)
In fact, this is valid:
increment = (1+)
spider :: Int -> URL -> IO (Map URL [URL])
spider count url0 = go 0 Map.empty (Set.singleton url0)
where
go k seen queue0
| k >= count = return seen
| otherwise =
case Set.minView queue0 of
Nothing -> return seen
Just (url, queue) -> do
page <- download url
let ls = links url page
newSeen = Map.insert url ls seen
notSeen = Set.fromList .
filter (`Map.notMember` newSeen) $ ls
newQueue = queue `Set.union` notSeen
go (k+1) newSeen newQueue
We can now:
Download a web page
Extract its links
Spider out from there, without repeat visits
What remains?
We could spider multiple pages concurrently
Or we could compute which pages are "important"
At this point, if we have miraculously not run out of time, we're going on a choose-your-own-adventure session in Emacs.
Thanks for sticking with the slideshow so far!