  1. #1
    Senior Member rosland's Avatar
    Join Date
    Jul 2003
    Location
    Norway
    Posts
    1,944
    Member #
    2096
    (moved from PHP & MySQL forum, when I saw the turtle dollar winning option)

    In the November contest, dotcommakers (who won) was told that his 'stats' (statistical information) were off.

    Transio and Spluffdaddy (WDF mods) suggested database access or a "scraper" to extract the correct (and always current) figures from webdesignforums' own page, and display them dynamically in the intro.

    A "scraper", as opposed to database access, does not compute the figures itself. It merely "scrapes" them off an existing page where the data has already been processed for display, and extracts the information you're after into its own array.

    To do this, we need to use "regular expressions".
    Regular expressions are a language in themselves, with their own syntax. They are not a standalone language, however, but are designed specifically for string matching. They predate Perl (Unix tools such as ed and grep already used them), but Perl popularized the extended syntax that most languages now incorporate. Not all languages support every syntactic option, and the syntax varies somewhat from language to language. PHP's preg_ functions use the PCRE (Perl Compatible Regular Expressions) library, which follows Perl's syntax.

    Here's an example of a scraper that uses a regular expression to extract the 'stats' from WDF's index page:

    (The code below has been modified to fit the new layout of WDF4 (to allow a working code example), and therefore the following explanatory text does not match the "preg_match" lines in the posted code block. However, the principles should still be clear.)
    PHP Code:
    <html>
    <head>
    <title>Scraper</title>
    </head>

    <body>
    <?php
        // Page to scrape
        $url = "http://www.webdesignforums.net/";
        $file = "";

        // Open the remote page as a read-only stream
        $fp = fopen($url, "r");
        if ($fp) {
            // Read the page line by line (at most 599 bytes per call) and collect it in $file
            while (!feof($fp)) {
                $buffer = fgets($fp, 600);
                $file .= $buffer;
            }
            fclose($fp);
        } else {
            die("Could not create a connection to Webdesignforums.net");
        }
    ?>
    <?php
        // Capture the figure that follows each bold label
        preg_match("/<b>Members:<\/b>\s+(.*)\s+\(<a\s+href=/i", $file, $resA);
        preg_match("/<b>Threads:<\/b>\s+(.*);/i", $file, $resB);
        preg_match("/<b>Posts:<\/b>\s+(.*)/i", $file, $resC);
        echo "<b>Members...</b> $resA[1]<br/><b>Threads...</b> $resB[1]<br/><b>Posts...</b> $resC[1]";
    ?>
    </body>
    </html>

    In short, the script loads the index page of webdesignforums and reads it into a buffer, piece by piece.
    The function fgets($fp, 600) uses the file pointer to read one line, up to at most 599 bytes (the specified length minus one). PHP versions before 4.2 require the length parameter. From version 4.2 it is optional and defaults to 1024. From PHP 4.3 onward, fgets() reads to the end of the line if no length is specified. If the page contains very long lines, a length limit caps how much is read per call.
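To see the effect of the length parameter without touching the network, here is a small sketch. It assumes an in-memory stream (php://memory) as a stand-in for the remote page, and the sample text is just the figures quoted further down in this post.

```php
<?php
// Stand-in for the remote page (an assumption for this sketch): an in-memory stream.
$fp = fopen('php://memory', 'r+');
fwrite($fp, "4,163 members, 9,282 threads,\n89,581 posts\n");
rewind($fp);

// With a length of 6, fgets() returns at most 5 bytes (length minus one).
$chunk = fgets($fp, 6);

// Without a length, fgets() reads up to and including the next line break.
$rest = fgets($fp);
fclose($fp);

var_dump($chunk); // string(5) "4,163"
var_dump($rest);  // the remainder of the first line, newline included
```

The loop in the scraper does the same thing repeatedly until feof() reports the end of the stream.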

    The preg_match function uses a "regular expression" to find a match in the text of WDF's index page. That is, it looks for text that fits the pattern we define.

    Looking at the source of the webdesignforums page reveals the following:
    Code:
    <img src="customimages/welcomepanel/quickstats.gif" width="85" height="25" /></td>
                        <td><font class="sf" >4,163 members, 9,282 threads, 
                            89,581 posts</font></td>
    We want to tell preg_match to look for the section containing the numbers, and store the result in an array.
    We take the interesting part of the string and place it between two forward slashes, like so:
    Code:
    preg_match("/<font class="sf" >4,163 members, 9,282 threads/i")
    The "i" after the closing slash makes the match case-insensitive. That makes no difference here, but it would if the capitalization of the pattern and the page differed (for example "Members" versus "members").
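A quick sketch of what the "i" modifier changes. The sample string with capital letters is made up for the illustration and is not taken from the WDF page.

```php
<?php
// Made-up sample with different capitalization than the pattern.
$html = '<FONT CLASS="sf" >4,163 Members';

// Without "i" the lowercase pattern does not match "Members".
$caseSensitive = preg_match('/members/', $html);

// With "i" the same pattern matches regardless of case.
$caseInsensitive = preg_match('/members/i', $html);

var_dump($caseSensitive);   // int(0)
var_dump($caseInsensitive); // int(1)
```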
    We then replace each whitespace in the string with \s, which matches exactly one whitespace character (a space, tab, or line break) in the original text. We could also use \s+, which means "one or more whitespace characters".
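The difference between \s and \s+ can be sketched like this. The two sample strings are invented for the illustration: one with a single space, one with a line break and extra indentation.

```php
<?php
$single = "4,163 members";        // exactly one space
$padded = "4,163 \n   members";   // space, line break, more spaces

// \s matches exactly one whitespace character.
$a = preg_match('/4,163\smembers/', $single);
$b = preg_match('/4,163\smembers/', $padded);

// \s+ matches one or more, so it also spans the line break and indentation.
$c = preg_match('/4,163\s+members/', $padded);

var_dump($a, $b, $c); // int(1), int(0), int(1)
```

When the page layout may change, \s+ is usually the more forgiving choice.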

    We also have to escape any characters that could confuse the parser. If, for instance, there was a "</b>" tag in the string, we would have to escape the forward slash like this, "<\/b>", so that it is not mistaken for the end of the regular expression. Double quote marks must be escaped too, because the pattern sits inside a double-quoted PHP string. Other characters have a special meaning inside the pattern itself. The dot (period) means "any single character" (except a line break, by default). If we were looking for a literal dot, we would have to escape it like so, \., to find a match. These characters are called "metacharacters", as they have a special meaning in regular expressions.
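Here is a small sketch of the escaping rules. The test strings are invented, and preg_quote() is not used in the scraper itself; it is shown only as one way to let PHP do the escaping for you.

```php
<?php
// An unescaped dot matches any single character, so "1.5" also matches "125".
$dotAny = preg_match('/1.5/', '125');

// An escaped dot only matches a literal dot.
$dotLiteral = preg_match('/1\.5/', '125');

// The forward slash is escaped so it is not read as the closing delimiter.
$tag = preg_match("/<\/b>/", 'text</b>');

// preg_quote() escapes metacharacters (and the given delimiter) in a literal string.
$quoted = preg_quote('</b> 1.5', '/');
$viaQuote = preg_match('/' . $quoted . '/', 'total</b> 1.5');

var_dump($dotAny, $dotLiteral, $tag, $viaQuote); // int(1), int(0), int(1), int(1)
```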
    Next we replace the numbers with (.*). The .* matches any run of characters, and the parentheses tell PHP to store whatever was matched there in a numerically indexed array, in this case called $result.

    The end result looks like this:
    Code:
    preg_match("/<td><font\sclass=\"sf\"\s>(.*)\smembers,\s(.*)\sthreads/i",$file,$result);
    $file is the argument that holds the string to search, and $result is the name we chose for the array that receives the extracted numbers.
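To try the pattern without fetching the page, it can be run against the first line of the page source quoted above. In this sketch $file is filled by hand, which is an assumption made only for the test.

```php
<?php
// First line of the quoted page source, pasted in by hand for this sketch.
$file = '<td><font class="sf" >4,163 members, 9,282 threads, ';

preg_match("/<td><font\sclass=\"sf\"\s>(.*)\smembers,\s(.*)\sthreads/i", $file, $result);

echo "Members: $result[1], Threads: $result[2]\n"; // Members: 4,163, Threads: 9,282
```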

    Since the original source code contains a line break and a lot of whitespace before the number of posts (fgets() reads line by line, and the dot in (.*) does not cross a line break), we just add another preg_match for the second line and extract the posts figure.
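The second preg_match is not shown in the post, so the following is only one possible sketch. It assumes the two source lines quoted above have been joined into $file, and it uses \s+ to step over the line break and the indentation.

```php
<?php
// The two quoted source lines, joined the way the fgets() loop would join them.
$file = "<td><font class=\"sf\" >4,163 members, 9,282 threads, \n"
      . "                        89,581 posts</font></td>";

// \s+ swallows the space, the line break and the indentation before the number.
$found = preg_match('/threads,\s+(.*)\sposts/i', $file, $posts);

var_dump($found);    // int(1)
var_dump($posts[1]); // string(6) "89,581"
```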

    We then just echo the result in any HTML wrapping of our choice.

    Note that the first extracted number is in $result[1] and not $result[0].
    $result[0] contains the full text that matched the whole pattern, and $result[1] is the first key holding extracted data.
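A short sketch of what ends up in the array. The sample text is invented, and the pattern uses ([\d,]+) instead of (.*) so that only digits and commas are captured; that is a variation for this example, not the pattern from the scraper.

```php
<?php
$file = 'quick stats: 4,163 members, 9,282 threads';

// preg_match() returns 1 on a match and 0 when nothing matches.
if (preg_match('/([\d,]+)\smembers,\s([\d,]+)\sthreads/', $file, $result) === 1) {
    echo $result[0], "\n"; // 4,163 members, 9,282 threads (the whole match)
    echo $result[1], "\n"; // 4,163 (first pair of parentheses)
    echo $result[2], "\n"; // 9,282 (second pair of parentheses)
}

// When nothing matches, the array is left empty, so check before using it.
$none = preg_match('/no such text/', $file, $empty);
var_dump($none, $empty); // int(0) and an empty array
```

Checking the return value before echoing avoids notices about undefined indexes if the page layout changes.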
    S. Rosland

  3. #2
    Scraper updated with WDF4's html output.
    S. Rosland

