#$Id$
package WWW::Crawler;

use strict;
use vars qw($VERSION @ISA);
use Carp;

$VERSION = '0.02';

sub DEBUG () {0}

########################################################
sub new
{
    my($package)=@_;
    return bless {
            TODO=>[],
            ALREADY=>{},
        }, $package;
}


########################################################
sub run
{
    my($self)=@_;
    while($self->one_loop) {}
}

########################################################
sub one_loop
{
    my($self)=@_;
    my $link=$self->next_link;
    $self->fetch($link) if $link;
    return defined $link;
}


########################################################
sub schedule_link
{
    my($self, $page)=@_;
    $page={uri=>$page} if ref $page ne 'HASH';

    $page=$self->cannonical($page);
    return unless $self->include($page);
    DEBUG and warn "Added $page->{uri} to TODO list\n";
    push @{$self->{TODO}}, $page;
    return $page;
}

########################################################
sub next_link
{
    my($self)=@_;
    my $page=shift @{$self->{TODO}};
    DEBUG and warn $page ? "Removed $page->{uri} from TODO list\n" : "TODO list is empty\n";
    return $page;
}

########################################################
sub cannonical
{
    my($self, $page)=@_;

    $page->{uri}=~s/#.*$//;         # http://foo.com/yadda.html#biff
#    $page->{uri}=~s/\?.+$//;        # http://foo.com/yadda.html?bill=bibble
    return $page;                   # schedule_link() needs the hashref back
}



########################################################
sub seen
{
    my($self, $page)=@_;
    $self->{ALREADY}->{$page->{uri}}=1;
}

########################################################
sub include
{
    my($self, $page)=@_;
    DEBUG and warn "Have we already visited $page->{uri} ?\n";
    return if exists $self->{ALREADY}{$page->{uri}};
    DEBUG and warn "No...\n";
    return 1;
}





########################################################
sub error
{
    my($self, $page, $error)=@_;
}
########################################################
sub fetch
{
    my($self, $page)=@_;
    croak "Please overload WWW::Crawler::fetch\n";
}

########################################################
sub process
{
    my($self, $page)=@_;
    croak "Please overload WWW::Crawler::process\n";
}

########################################################
sub parse
{
    my($self, $page)=@_;
    my %data;
    return \%data;
}

########################################################
sub extract_links
{
    my($self, $page)=@_;
    croak "Please overload WWW::Crawler::extract_links\n";
}



1;
__END__

=head1 NAME

WWW::Crawler - Unified framework for web crawlers

=head1 SYNOPSIS

    package My::Crawler;
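    use base 'WWW::Crawler';        # loads WWW::Crawler and sets @ISA
    use LWP::Simple qw(get);        # provides get() used in fetch()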

    sub fetch
    {
        my($self, $page)=@_;
        $page->{document}=get($page->{uri});
        $self->fetched($page);
    }
    
    sub parse
    {
        my($self, $page)=@_;
        my %data;
        $data{links}=[$page->{document} =~ m(href="(.+?)")ig];
        $data{title}=$1 if $page->{document} =~ m(<title>(.+?)</title>)i;
        return \%data;
    }
    
    sub extract_links
    {
        my($self, $page)=@_;
        return @{$page->{parsed}{links}};
    }

    sub process
    {
        my($self, $page)=@_;
        print "Doing something to $page->{parsed}{title}\n";
    }

    package main;
    
    my $crawler=My::Crawler->new();

    $crawler->schedule_link("http://www.yahoo.com/");
    $crawler->run;

Obviously, this example is very bad.  It doesn't respect robots.txt, nor
does it check that you are only crawling one host, or anything else.
Running it would be a very bad idea.


=head1 DESCRIPTION

WWW::Crawler is intended as a unified framework for web crawlers.  It should
be subclassed for each application.


=head1 METHODS

=head2 cannonical($self, $page)

Turns a URI into its canonical form.  Known host equivalents (localhost
is the same as localhost.localdomain, or www.slashdot.org and slashdot.org
are the same) should be dealt with here.

The default method simply removes internal anchors (page.html#foo is in
fact page.html).  Code to remove URI parameters (page.html?foo=bar) is
present in the source but commented out, so query strings are kept.
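
As a rough sketch of handling host equivalents, a subclass could override
cannonical() along these lines.  The package name My::Crawler::Canon and the
%HOST_ALIAS table are made up for illustration, and the regex assumes plain
scheme://host URIs with no user:password@ part.

```perl
package My::Crawler::Canon;
use strict;
use warnings;
use vars qw(@ISA);
@ISA = ('WWW::Crawler');

# Hypothetical alias table: map each known alias to the host name we keep.
my %HOST_ALIAS = (
    'localhost.localdomain' => 'localhost',
    'www.slashdot.org'      => 'slashdot.org',
);

sub cannonical
{
    my($self, $page)=@_;

    $page->{uri} =~ s/#.*$//;           # drop the internal anchor
    # lower-case the host part and replace known aliases
    $page->{uri} =~ s{^(\w+://)([^/:?]+)}{
        my $host = lc $2;
        $1 . ($HOST_ALIAS{$host} || $host);
    }e;
    return $page;                       # hand the $page hashref back
}
```

With this in place, http://WWW.Slashdot.org/index.html#top and
http://slashdot.org/index.html end up as the same entry in ALREADY.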

=head2 error($self, $page, $error)

Called when an error occurs while fetching a URI.  $error is whatever
fetch() sets it to.  The default does nothing.  You should overload this if
you want to report errors somewhere.  Having a generalised error mechanism
like this allows things like WWW::Crawler::RobotsRules to cooperate with
various fetch() routines cleanly.


=head2 extract_links($self, $page)

Returns a list of all the links contained in a given page, either as
absolute URIs or as $page hashrefs.  URIs should be in full form (i.e.
http://server.com/yadda/yadda/yadda.html) or URI objects.  Use $page->{uri}
as a base URI for relative links.  We can't do this in cannonical(), because
it doesn't know the base URI a link was extracted from.

B<Must be overloaded.>

=head2 fetch($self, $page)

Should fetch the requested URI ($page->{uri}), set $page->{header} (if
applicable and needed) and $page->{document} then call
$self->fetched($page).  If there was an error you should call
$self->fetched($page, {...something to do with the error}).

B<Must be overloaded.>

=head2 fetched($self, $page)

This is where the document is processed, links are extracted and so on.
$page must contain the following members: document and uri.
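
This file does not define fetched() itself, so here is one possible sketch,
following the order given in the pseudo-code further down.  The package name
My::Crawler::Fetched is invented for the example, and treating the optional
second argument as the error that gets passed on to error() is an assumption
based on the description of fetch().

```perl
package My::Crawler::Fetched;
use strict;
use warnings;
use vars qw(@ISA);
@ISA = ('WWW::Crawler');

sub fetched
{
    my($self, $page, $error)=@_;

    if($error) {                        # fetch() reported a problem
        $self->error($page, $error);
        return;
    }
    $page->{parsed}=$self->parse($page);    # used by process/extract_links
    $self->process($page);                  # the application's own work
    $self->seen($page);                     # so include() rejects it later
    foreach my $link ($self->extract_links($page)) {
        $self->schedule_link($link);        # queue newly found links
    }
    return $page;
}
```

Whether a page that failed to fetch should also be marked with seen() is a
design choice; this sketch leaves it unmarked so it could be retried.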

=head2 include($self, $page)

Returns true if the $page should be scheduled.
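
For instance, a crawler that must stay on a single host could override
include() roughly as follows.  The HOST member and the package name
My::Crawler::OneHost are hypothetical; the duplicate check simply repeats
what the default include() does with ALREADY.

```perl
package My::Crawler::OneHost;
use strict;
use warnings;
use vars qw(@ISA);
@ISA = ('WWW::Crawler');

sub include
{
    my($self, $page)=@_;

    # HOST is a hypothetical member; set it in your own new()
    my $host=$self->{HOST};
    # the host must be followed by a port, a path, a query or nothing at all
    return unless $page->{uri} =~ m{^https?://\Q$host\E(?:[:/?]|\z)}i;

    # same duplicate check as the default include()
    return if exists $self->{ALREADY}{$page->{uri}};
    return 1;
}
```

Anchoring the host match this way keeps look-alike names such as
example.com.evil.org from slipping through.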

=head2 new($package)

Constructor.  Overload as needed.  Please call SUPER::new() as well if you
are using the default schedule_link/next_link/include, because they need
package members.

Default constructor requires no parameters.

Creates the following members:

=over 4

=item ALREADY 
    
Hashref of URIs that have already been visited.

=item TODO

Arrayref used as a FIFO queue of the pages that still need to be processed.

=back

=head2 next_link($self)

Returns the next $page that should be fetched and processed.  Returning an
empty string means that no URI is available right now, but we still want to
keep going.  Returning undef() means all the work has been done and now we
go home.

=head2 parse($self, $page)

Parses an HTML document ($page->{document}) and sets various members of
$page to be used later by process() and/or extract_links().

=head2 process($self, $page)

This is where an application does its own work.  All members of $page
should be set.

B<Must be overloaded.>

=head2 run($self)

Main processing loop.  Does not exit until next_link() returns undef().

Overload this method to fit it into your own event loop.
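
One reason to overload run() is that the default loop spins without pausing
while next_link() keeps returning an empty string.  A minimal sketch of a
gentler loop, assuming a quarter-second pause is acceptable (the package
name My::Crawler::Polite and the pause length are arbitrary choices):

```perl
package My::Crawler::Polite;
use strict;
use warnings;
use vars qw(@ISA);
@ISA = ('WWW::Crawler');

sub run
{
    my($self)=@_;
    # undef from next_link() means all the work is done
    while(defined(my $page=$self->next_link)) {
        if($page) {
            $self->fetch($page);
        } else {
            # '' from next_link(): nothing is ready yet, so wait a little
            select(undef, undef, undef, 0.25);
        }
    }
}
```

This calls next_link() and fetch() directly instead of going through
one_loop(), because one_loop() does not tell the caller whether anything was
actually fetched.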

=head2 schedule_link($self, $page)

Adds $page to the todo list.  Must cooperate with next_link() to get the job
done.  It turns the URI into its canonical form with cannonical() and makes
sure the URI should be fetched by calling include().

Returns the canonical URI (as a $page) if it was put on the TODO list, or
undef() otherwise.

If you wanted to go easy on a server's bandwidth, this is where you'd put
the logic.  Something like:

    use URI;

    sub schedule_link
    {
        my($self, $page)=@_;
        $page={uri=>$page} if ref $page ne 'HASH';

        $page=$self->cannonical($page);
        return unless $self->include($page);
        my $host=URI->new($page->{uri})->host();

        # a host we have not seen before may be fetched right away
        $self->{SERVERS_TIME}{$host}||=time;
        push @{$self->{SERVERS}{$host}}, $page;
        return $page;
    }

    sub next_link
    {
        my($self)=@_;
        my $now=time;
        # release one page for each host whose waiting time is over
        foreach my $host (grep {$self->{SERVERS_TIME}{$_} <= $now}
                                keys %{$self->{SERVERS_TIME}}) {

            if(@{$self->{SERVERS}{$host}}) {
                push @{$self->{TODO}}, shift @{$self->{SERVERS}{$host}};
                $self->{SERVERS_TIME}{$host}=$now+1;
            } else {
                # nothing left for this host, so forget about it
                delete $self->{SERVERS}{$host};
                delete $self->{SERVERS_TIME}{$host};
            }
        }
        my $next=shift @{$self->{TODO}};

        # '' means nothing is ready yet, but some hosts are still waiting
        return '' if not $next and keys %{$self->{SERVERS_TIME}};
        return $next;
    }



=head2 seen($self, $page)

seen() is called for each URI that is being processed.  This method should
cooperate with include() to avoid fetching the same URI twice.

=head2 $page

$page is a hashref that is passed to many routines.  It contains various
information about a given page.

=over 4

=item uri

URI of the page.  Set by schedule_link().

=item header

HTTP header.  Set by fetch() and/or parse().

=item document

Document contents.  Set by fetch().

=item parsed

Contains the data returned by parse().

=back


=head1 OVERLOADING

The following methods must be overloaded: fetch(), process(),
extract_links().

Object members should be created in new() and documented in the POD.


=head1 Woah, i'm confused.

So am I! 

Anyway, here is a pseudo-code version of what is going on:

    schedule_link(with a given URI) # prime the pump


    run {
        while(next_link() returns defined) {
            fetch($page) {
                fetched($page now has document (and maybe header))
                parse($page)
                process($page)
                seen($page)
                foreach (extract_links($page)) {
                    schedule_link($new_page) {
                        cannonical($new_page)
                        if(include($new_page)) {
                            # add to the todo list, so next_link() sees it
                        }     
                    }
                }
            }
        }
    }

=head1 AUTHOR

Philip Gwyn <perl@pied.nu>

=head1 SEE ALSO


L<WWW::Crawler::RobotsRules>,
WWW::Crawler::Slower,
L<WWW::Crawler::LWP>, 
WWW::Crawler::POE, 
perl(1).

=cut

$Log$