Thread: Scraper

  1. #1 rosland (Senior Member, joined Jul 2003, Norway)
    Scraper/lift complex information from other sites

    In the November contest, dotcommakers (who won) were told that their stats were off.

    Transio suggested database access or a "scraper" to extract the correct (and always current) figures from webdesignforums' own page and display them dynamically in the intro.
    A scraper, as opposed to database access, does not process any data to produce results. It merely "scrapes" them off an existing page where the data has already been processed for display, and extracts the information you're after into its own array.

    Here's an example of a scraper that does that:
    PHP Code:
    <html>
    <head>
    <title>Scraper</title>
    </head>

    <body>
    <?php
        // Page to scrape
        $url = "http://www.webdesignforums.net/";
        $file = "";

        // Open the remote page as a stream (needs allow_url_fopen enabled)
        $fp = fopen($url, "r");
        if ($fp) {
            // Read the page piece by piece and append each piece to $file
            while (!feof($fp)) {
                $buffer = fgets($fp, 600);
                $file .= $buffer;
            }
            fclose($fp);
        } else {
            die("Could not create a connection to Webdesignforums.net");
        }
    ?>
    <?php
        // Capture the member and thread counts, then the post count
        preg_match("/<td><font\sclass=\"sf\"\s>(.*)\smembers,\s(.*)\sthreads/i", $file, $result);
        preg_match("/(.*)\sposts/i", $file, $res);
        echo "<b>Members...</b> $result[1]<br/><b>Threads...</b> $result[2]<br/><b>Posts...</b> $res[1]";
    ?>

    </body>
    </html>
    In short, the script loads the index page of webdesignforums and streams it into a buffer.
    The call fgets($fp, 600) reads from the file pointer until it reaches a newline, the end of the file, or 599 bytes (one less than the length given), whichever comes first. PHP versions before 4.2 require the length parameter. From 4.2 it defaults to 1024 if nothing is specified, and from PHP 4.3 onward fgets reads to the end of the line if nothing is specified. If the page has very long lines, a length cap keeps each read small.
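To see how the length argument behaves without hitting a live site, here is a minimal sketch that reads from an in-memory stream. The php://memory stream and the sample text are assumptions made up for illustration, not part of the original script.

```php
<?php
// Hypothetical sample data standing in for a downloaded page.
$fp = fopen("php://memory", "w+");
fwrite($fp, "first line\nsecond line that is rather long\n");
rewind($fp);

// fgets() stops at a newline, at end of file, or after length - 1 bytes.
$a = fgets($fp, 600); // the newline is reached first
$b = fgets($fp, 8);   // the length limit is reached first (7 bytes)
$c = fgets($fp);      // no length given: reads the rest of the line

fclose($fp);

var_dump($a, $b, $c);
```

The second call returns only part of the line, and the third call picks up where it left off, which is why the scraper's loop can simply keep appending each piece to $file.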

    The preg_match function looks for a "regular expression". That is, it looks for a match based on a pattern we define.

    On the webdesignforums page, viewing the page source reveals the following:
    Code:
    <img src="customimages/welcomepanel/quickstats.gif" width="85" height="25" /></td>
                        <td><font class="sf" >4,163 members, 9,282 threads, 
                            89,581 posts</font></td>
    We want to tell preg_match to look for the section containing the numbers, and store the result in an array.
    We extract the interesting part of the string and place it between two forward slashes, like so:
    Code:
    preg_match("/<font class="sf" >4,163 members, 9,282 threads/i")
    The "i" at the end tells the parser to match case-insensitively. Not an issue here, but it could matter if the search string contained capital letters.
    We then replace each whitespace character in the string with \s, which tells PHP to expect a single whitespace character (a space, tab, or line break) at that position in the original text.
    We also have to escape any characters that could confuse the parser. If, for instance, there were a "</b>" tag in the string, we would have to escape the forward slash like this, "<\/b>", so the parser isn't confused about where the regular expression ends. The same goes for double quote marks.
    Next we replace the numbers with (.*), a capture group that matches any run of characters. preg_match stores whatever each group matched in a numerically indexed array, in this case called $result.
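A small sketch of the escaping point, for anyone who wants to try it. The sample string is made up for illustration; preg_quote() is a standard PHP function that does the escaping for you when you pass the delimiter as its second argument.

```php
<?php
// Sample text containing a closing tag (hypothetical, not from the forum page).
$html = 'Total: <b>4,163</b> members';

// Manual escaping: an unescaped "/" in "</b>" would end the pattern early.
preg_match("/<b>(.*)<\/b>/i", $html, $manual);

// preg_quote() escapes regex metacharacters, plus the delimiter given as
// its second argument, so literal text can be dropped into a pattern.
$closing = preg_quote("</b>", "/");
preg_match("/<b>(.*)" . $closing . "/i", $html, $quoted);

echo $manual[1], "\n";
echo $quoted[1], "\n";
```

Both patterns capture the same text; preg_quote() is just less error-prone when the literal part is long.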

    The end result looks like this:
    Code:
    preg_match("/<td><font\sclass=\"sf\"\s>(.*)\smembers,\s(.*)\sthreads/i",$file,$result);
    $file is the argument holding the string to search, and $result is the name we chose for the array that receives the extracted numbers.

    Since the original source code contains a line break and a lot of whitespace before the number of posts, we just add another preg_match for that second line and extract the post count.
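As an alternative to the second preg_match, \s+ matches one or more whitespace characters, line breaks included, so a single pattern could span both lines. This is only a sketch, assuming the markup is exactly as quoted above. The [\d,]+ groups are my substitution for (.*) so that each capture holds only digits and commas, and $stats is a name chosen for illustration.

```php
<?php
// The two source lines quoted earlier, including the line break between them.
$file = "<td><font class=\"sf\" >4,163 members, 9,282 threads, \n"
      . "                        89,581 posts</font></td>";

// \s+ swallows the space, the line break and the indentation before "posts".
$pattern = "/([\d,]+)\s+members,\s+([\d,]+)\s+threads,\s+([\d,]+)\s+posts/i";

if (preg_match($pattern, $file, $stats)) {
    echo "Members: $stats[1], Threads: $stats[2], Posts: $stats[3]\n";
}
```

The trade-off is that one pattern fails as a whole if any part of the layout changes, whereas two separate calls can fail independently.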

    We then just echo the result in any HTML wrapping of our choice.

    Note that the first extracted number is in $result[1], not $result[0].
    $result[0] contains the full text that matched the whole pattern, and $result[1] is the first key holding extracted data.
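To make the indexing concrete, here is a short sketch run against the line quoted from the page source. The shortened pattern (starting at ">") is an assumption to keep the example small. It also checks the return value of preg_match, which is 1 on a match and 0 when nothing matches.

```php
<?php
// First line of the quoted page source, used as stand-in input.
$file = '<font class="sf" >4,163 members, 9,282 threads, ';

if (preg_match("/>(.*)\smembers,\s(.*)\sthreads/i", $file, $result)) {
    echo $result[0], "\n"; // everything the whole pattern matched
    echo $result[1], "\n"; // first (.*) group: members
    echo $result[2], "\n"; // second (.*) group: threads
} else {
    echo "Pattern not found; the page layout may have changed.\n";
}
```

Checking the return value is worth the extra lines in a scraper, since the pattern silently stops matching as soon as the scraped site changes its markup.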

    (Didn't know if this qualifies as a tutorial or not, so you mods feel free to move it or delete it as you find appropriate.)
    S. Rosland
