  1. #1
    WDF Staff mlseim
    Join Date
    Apr 2004
    Location
    Cottage Grove, Minnesota
    Posts
    7,716
    Member #
    5580
    Liked
    718 times
    A friend of mine goes to Printable-Puzzles each day and picks the latest crossword puzzles
    to print as PDFs. It takes a few clicks to get to the final PDF download, so he asked
    me to create a PHP script that links directly to the latest PDF files without all the website clicking.
    The site sends you through a series of download links that are really unnecessary and annoying.

    Below is the script.

    I suppose it will work until they alter their web page HTML. It's an example of doing HTML scraping.
    They don't offer any RSS feeds (XML files) to do what I'm doing ... that would be the best method.

    So, in effect, I'm "scraping" their site and hot-linking to their PDF files. I normally don't
    even discuss things that are considered unethical, but for some reason I think this is something
    developers might like to see. And besides that, their PDF files do have links on them, so their
    name is represented in the resulting files. If they decide to offer an RSS feed (which they
    should), fetching the XML file with PHP cURL and parsing it would be a valid method.
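
    For anyone wondering what the RSS route might look like, here is a minimal sketch. The site has no feed, so the XML layout, the example.com links, and the helper names fetch_feed() and parse_feed() are all assumptions made up for illustration. It fetches with cURL and parses with SimpleXML; the demo runs against an inline sample so it works offline.

```php
<?php
// Hypothetical feed layout: Printable-Puzzles does not publish a feed,
// so this XML is made up purely for illustration.
$sample_feed = <<<XML
<rss version="2.0">
  <channel>
    <title>Latest puzzles (hypothetical)</title>
    <item>
      <title>Crossword Puzzle</title>
      <link>http://www.example.com/puzzles/crossword.pdf</link>
    </item>
    <item>
      <title>Clueless Crossword</title>
      <link>http://www.example.com/puzzles/clueless.pdf</link>
    </item>
  </channel>
</rss>
XML;

// Fetch a URL with cURL and return the body, or false on failure.
function fetch_feed($url) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
    $body = curl_exec($curl);
    curl_close($curl);
    return $body;
}

// Turn RSS text into an array of title => link pairs.
function parse_feed($xml_text) {
    $links = array();
    $xml = simplexml_load_string($xml_text);
    if ($xml === false) {
        return $links; // not well-formed XML
    }
    foreach ($xml->channel->item as $item) {
        $links[(string) $item->title] = (string) $item->link;
    }
    return $links;
}

// With a real feed you would call parse_feed(fetch_feed($some_url)).
$links = parse_feed($sample_feed);
foreach ($links as $title => $link) {
    echo "$title --> $link\n";
}
```

    Keeping the fetch and the parse in separate functions means the parsing can be tried out on a saved copy of the feed without hitting the site each time.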

    Use this at your own risk, and your conscience, and know that I don't condone web-scraping or hot-linking.

    If Printable-Puzzles.com ever wishes to offer this feature, or to email a PDF file to subscribers,
    I hope they see this and ask one of us for help. I would be glad to assist with the scripting.

    PHP Code:
    <?php
    // Printable Crossword Puzzles - Get the latest PDF files directly, without all the links they have.

    $feed_url = "http://www.printable-puzzles.com/printable-crossword-puzzles.php";

    // An example of what we'll be searching for ... what the HTML text looks like.
    // Clueless Crossword  #DC193TH
    // Crossword Puzzle #V755OC

    # INITIATE CURL.
    $curl = curl_init();

    # CURL SETTINGS.
    curl_setopt($curl, CURLOPT_URL, $feed_url);
    curl_setopt($curl, CURLOPT_HEADER, 0);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);

    # GRAB THE FILE.
    //$data = strip_tags(curl_exec($curl));
    $data = curl_exec($curl);

    curl_close($curl);

    // GRAB THE NEWEST CLUELESS CROSSWORD PUZZLE
    $clueless_pos = strpos($data, "Crossword  #"); // they put two spaces in between
    $clueless_code = substr($data, $clueless_pos + 12, 7); // skip the 12-character marker, take the 7-character code
    $clueless_url = "http://www.printable-puzzles.com/dl.php?" . $clueless_code . ".pdf";

    // GRAB THE NEWEST NORMAL CROSSWORD PUZZLE
    $cross_pos = strpos($data, "Puzzle #");
    $cross_code = substr($data, $cross_pos + 8, 6); // skip the 8-character marker, take the 6-character code
    $cross_url = "http://www.printable-puzzles.com/dl.php?" . $cross_code . ".pdf";

    echo
    "
    <center><br /><br />
    Clueless Puzzle --> <a href='$clueless_url'>Download Latest Puzzle - PDF</a><br /><br />
    Normal Puzzle --> <a href='$cross_url'>Download Latest Puzzle - PDF</a><br /><br />
    </center>
    "
    ;

    ?>
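
    The script assumes both markers are always on the page. If the site changes its HTML, strpos() returns false and the script quietly builds a bogus link. Here is a rough sketch of a guard, assuming the codes stay letters and digits as in the two examples in the comments. The helper name code_after_marker() and the sample page fragment are made up for illustration.

```php
<?php
// Return the $length characters that follow $marker in $data,
// or null if the marker is missing or the text is cut short.
function code_after_marker($data, $marker, $length) {
    $pos = strpos($data, $marker);
    if ($pos === false) {
        return null; // page layout changed, or the download failed
    }
    $code = substr($data, $pos + strlen($marker), $length);
    // Puzzle codes in the examples above are letters and digits only.
    if (strlen($code) !== $length || !ctype_alnum($code)) {
        return null;
    }
    return $code;
}

// Made-up page fragment using the two example labels from the script.
$data = "<td>Clueless Crossword  #DC193TH</td><td>Crossword Puzzle #V755OC</td>";

$clueless_code = code_after_marker($data, "Crossword  #", 7);
$cross_code    = code_after_marker($data, "Puzzle #", 6);
$missing_code  = code_after_marker($data, "Sudoku #", 6);

var_dump($clueless_code, $cross_code, $missing_code);
```

    Using strlen($marker) instead of a hand-counted offset also means the marker text can change without the numbers needing to be recounted.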


  3. #2
    Senior Member Ganners
    Join Date
    Feb 2011
    Location
    United Kingdom
    Posts
    415
    Member #
    27007
    Liked
    92 times
    Pretty cool. The way I'd recommend doing it, using an HTML parser, can be seen here:
    http://ganners.co.uk/site_scraper/

    The basic code is:
    PHP Code:
    <?php
    include('simple_html_dom.php');

    class Puzzle_Scraper {

        public $html;

        public $puzzleCategories = array();

        /**
        * Sets up the scraper
        * @param string $url (optional)
        */
        public function __construct($url = 'http://www.printable-puzzles.com/printable-crossword-puzzles.php') {
            $this->html = file_get_html($url);
            $this->scrapePuzzles();
        }

        /**
        * Returns puzzles array
        * @return array  The array of puzzles
        */
        public function getPuzzles() {
            return (array) $this->puzzleCategories;
        }

        /**
        * Scrapes the puzzles from the site
        */
        private function scrapePuzzles() {

            // Start with an empty list so the assignment at the end is always defined
            $puzzleCategories = array();

            foreach($this->html->find('div.item') as $item) {

                $puzzleCategory = new stdClass;
                $puzzleCategory->name = $item->find('h2', 0)->plaintext;
                $puzzleCategory->puzzles = array();

                foreach($item->find('table', 0)->find('tr') as $row) {

                    $i = 0;

                    $puzzle = new stdClass;

                    foreach($row->find('td') as $key => $cell) {

                        switch($i) {
                            case 0:
                                //Sets the date to a unix timestamp
                                $puzzle->date = strtotime(str_replace("&nbsp;", "", $cell->plaintext));
                            break;
                            case 1:
                                //Grabs the hash code
                                $puzzle->code = preg_replace("/[a-zA-Z]{5,12} [a-zA-Z]{5,12} #([[:alnum:]])/", "$1", str_replace("&nbsp;", "", $cell->plaintext));
                            break;
                        }

                        $i++;

                    }
                    //Sets the puzzles
                    $puzzleCategory->puzzles[] = $puzzle;
                }
                //Adds to the categories
                $puzzleCategories[] = $puzzleCategory;

            }
            //Adds to the category array
            $this->puzzleCategories = $puzzleCategories;
        }

    }

    ?>

    <?php $Puzzle_Scraper = new Puzzle_Scraper; foreach($Puzzle_Scraper->getPuzzles() as $puzzleCategory) { ?>

        <h2><?php echo $puzzleCategory->name; ?></h2>

        <?php foreach($puzzleCategory->puzzles as $puzzle) { ?>
            <p><a href="http://www.printable-puzzles.com/dl.php?<?php echo $puzzle->code; ?>.pdf">Download the latest (<?php echo date("d/m/Y", $puzzle->date); ?>)</a></p>
        <?php break; /*Breaking here will only get the first*/ } ?>

    <?php } ?>

    <hr />

    <p><a href="site_scraper.zip">Download source (.zip)</a></p>
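
    One thing to watch with the preg_replace() call in the class: when the pattern doesn't match, preg_replace() hands back the cell text unchanged, so a non-code string could end up in the download link. A possible alternative is sketched below with preg_match(), assuming the code is always the run of letters and digits right after the "#", as in the examples from the first post. The function name extract_puzzle_code() is made up for illustration.

```php
<?php
// Pull the puzzle code out of a table cell's text, e.g.
// "Crossword Puzzle #V755OC" gives "V755OC". Returns null if no code is found.
function extract_puzzle_code($cell_text) {
    // Treat non-breaking-space entities as ordinary spaces.
    $text = str_replace("&nbsp;", " ", $cell_text);
    if (preg_match('/#([[:alnum:]]+)/', $text, $matches)) {
        return $matches[1];
    }
    return null;
}

$codes = array(
    extract_puzzle_code("Crossword Puzzle #V755OC"),
    extract_puzzle_code("Clueless&nbsp;Crossword  #DC193TH"),
    extract_puzzle_code("No code in this cell"),
);
var_dump($codes);
```

    A null result gives the calling code something explicit to check before it prints a link.
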
    Mark Gannaway Software Developer

    Recent Experiments
    - Backpropogation Neural Network language solving (http://ann.ganners.co.uk/)
    - Animated image to ASCII (http://google.ganners.co.uk/)
    - 3D Paper Characters (http://cybergame.ganners.co.uk/)
    - Anagram solving (http://roflol.co.uk/)

  4. #3
    WDF Staff mlseim
    Thanks for showing an alternate solution.

    What would be the advantage to using your method?
    Somehow it seems more complicated and uses more scripting.

    I'll have to look closer at your HTML parser.


  5. #4
    Senior Member Ganners
    It's not my parser, just one I found. It's all commented at the top of it anyway.

    The advantage is that it's safer and less likely to fail: I can automatically scan all puzzles and retrieve them without having to specify what they are. I actually grab every single puzzle of every puzzle category, but for demonstration purposes I break out of the loop to only print the first. If the site changes, mine is more likely to keep working (questionable). It's just tidier really!

    I've worked with a company whose entire business model is about scraping sites, and anything other than an HTML parser is out of the question for that!

    A few advantages, then, but mainly I think it's just tidier, more automated, more OO.
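
    To show the "scan everything without naming it" idea in a form that runs without any download, here is a rough sketch using PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom. The sample markup only mirrors the structure the class above searches for (div.item, h2, table rows); the dates, the extra code V754OB, and the function name scrape_codes() are made up for illustration.

```php
<?php
// Made-up markup that mirrors the structure the scraper class looks for:
// div.item > h2, plus a table with a date cell and a name/code cell.
$html = '
<div class="item"><h2>Crossword Puzzles</h2>
  <table>
    <tr><td>1 Mar 2011</td><td>Crossword Puzzle #V755OC</td></tr>
    <tr><td>28 Feb 2011</td><td>Crossword Puzzle #V754OB</td></tr>
  </table>
</div>
<div class="item"><h2>Clueless Crosswords</h2>
  <table>
    <tr><td>1 Mar 2011</td><td>Clueless Crossword  #DC193TH</td></tr>
  </table>
</div>';

// Return category name => list of puzzle codes, for every category found.
function scrape_codes($html) {
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so silence the parser warnings.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $result = array();
    foreach ($xpath->query('//div[@class="item"]') as $item) {
        $heading = $xpath->query('.//h2', $item)->item(0);
        if ($heading === null) {
            continue; // no heading, so not a puzzle category
        }
        $codes = array();
        // The second cell of each row holds "Name #CODE".
        foreach ($xpath->query('.//tr/td[2]', $item) as $cell) {
            if (preg_match('/#([[:alnum:]]+)/', $cell->textContent, $m)) {
                $codes[] = $m[1];
            }
        }
        $result[trim($heading->textContent)] = $codes;
    }
    return $result;
}

$puzzles = scrape_codes($html);
print_r($puzzles);
```

    No category name or marker string is hard-coded, so a new puzzle category on the page would be picked up as long as it uses the same markup.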

  6. #5
    WDF Staff mlseim
    I like the idea of being able to specify any of the puzzles.
    I'll have to go play in my sandbox.

    Even though I'm not a fan of scraping or hot-linking, I am finding many
    developers are asking about this all the time. I realize there are reasons
    for copyrights, and reasons to make people pay for information. But my
    intentions are to do some mashups and scripting just to give some insight
    into how this is done ... hopefully answering some of the questions.


  7. #6
    Senior Member Ganners
    Well, I think it's a grey area in terms of copyright. I suppose the issue is that you're bypassing the advertising and going straight to the download. But then Google gets away with doing a similar thing! So yeah, it's a bit different.

    It's a popular subject, and this stuff comes in very useful! Some companies pay for it, which is part of the business model of a company I've worked for before. So even if it's only for personal use, it's really useful to know how to do. A good exercise would be to scrape an e-commerce site to collate a product list, following the pagination to build up the full list.

    I'd say parsers are the way to go. You could use SimpleXML up to a point, but ones that mimic CSS selectors seem to be the most popular (like the one I used).
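
    The product-list exercise could be sketched roughly like this. Everything about the shop is hypothetical: the shop.example URLs, the li.product markup, and the rel="next" link are assumptions, and real sites will differ. It uses PHP's built-in DOMDocument and DOMXPath rather than a CSS-selector parser, and the fetcher is passed in as a callable so the demo can run offline against canned pages.

```php
<?php
// Hypothetical two-page shop listing, keyed by URL, standing in for a real site.
$pages = array(
    'http://shop.example/products?page=1' =>
        '<ul><li class="product">Red Mug</li><li class="product">Blue Mug</li></ul>
         <a rel="next" href="http://shop.example/products?page=2">Next</a>',
    'http://shop.example/products?page=2' =>
        '<ul><li class="product">Green Mug</li></ul>',
);

// Collect product names from every page, following rel="next" links.
// $fetch is any callable that takes a URL and returns HTML (or false).
function collect_products($start_url, $fetch, $max_pages = 50) {
    $products = array();
    $url = $start_url;
    $seen = array();
    while ($url !== null && !isset($seen[$url]) && count($seen) < $max_pages) {
        $seen[$url] = true; // guards against pagination loops
        $html = call_user_func($fetch, $url);
        if ($html === false || $html === '') {
            break;
        }
        $doc = new DOMDocument();
        @$doc->loadHTML($html); // silence warnings from untidy HTML
        $xpath = new DOMXPath($doc);
        foreach ($xpath->query('//li[@class="product"]') as $node) {
            $products[] = trim($node->textContent);
        }
        $next = $xpath->query('//a[@rel="next"]/@href')->item(0);
        $url = ($next !== null) ? $next->nodeValue : null;
    }
    return $products;
}

// Offline fetcher for the demo; swap in a cURL-based one for a live site.
$fetch = function ($url) use ($pages) {
    return isset($pages[$url]) ? $pages[$url] : false;
};

$products = collect_products('http://shop.example/products?page=1', $fetch);
print_r($products);
```

    The $seen list and the $max_pages cap are there so a site whose "next" link points back at an earlier page can't keep the loop running forever.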

