  1. #1
    Senior Member Brak's Avatar
    Join Date
    Apr 2003
    Location
    San Francisco, CA
    Posts
    3,413
    Member #
    1217
    Liked
    2 times
    So, I've made this script which reads a large text file (~5-6MB) and goes line by line, building an SQL file from tab-separated values and a few custom functions I run on each line.

    No problem.

    It was a bit slow (I can understand this - taking about 1-2min to process) - but then I realized I needed to escape this data for mysql insertion. No problem, I remembered - I'll just escape the needed data.

    PHP Code:
    $temp[0] = mysql_escape_string($temp[0]);
    $shortdesc = mysql_escape_string($shortdesc);
    $temp[1] = mysql_escape_string($temp[1]);
    This increased processing time to around 45min - 2hr. Not acceptable. What gives? I tried mysql_real_escape_string first, which produced the same result. The strings being escaped aren't very long (the longest is maybe 1,000 characters). I feel this is pretty unacceptable... but I don't want to insert bad data either. Is there a better solution out there?

    Also, I've noticed that this script gets progressively slower while it runs. Are there any tips you guys can give me on optimizing PHP for string handling / file handling? I haven't been able to find any.

    If you want the specific functions that I need to perform, I can post them too...
    Kyle Neath: Rockstar extraordinare
    The blog | The poetry site | The Spore site

  3. #2
    Senior Member Brak's Avatar
    Join Date
    Apr 2003
    Location
    San Francisco, CA
    Posts
    3,413
    Member #
    1217
    Liked
    2 times
    OK, perhaps it was a bug or something, but the addslashes and mysql_escape_string functions insert 95 MEGABYTES of slashes alone. Going to look into this further... although suggestions on speed improvements are still welcome!

  4. #3
    Senior Member
    Join Date
    Aug 2003
    Posts
    444
    Member #
    2801
    How are you reading the file? One line at a time, as an array from file(), or with file_get_contents()?

    The last one is preferred as it can use memory mapping for speed... apparently: http://uk2.php.net/manual/en/functio...t-contents.php .

    The other thing is that going through arrays is really slow from my experience. You can do loop optimisations. The do{}while set up is fastest usually; for example, see http://www.weberblog.com/article.php/20040419170030157 .

    Also, when are you adding the slashes? I wouldn't do it to the whole file in one go. I would add slashes just before the MySQL calls. This at least has the advantage of lower memory usage overall (instead of adding 95MB in one go, you add 1K or so in live memory in a variable that will get re-used).

    One other important loop optimisation is to decrement the counter instead of incrementing it (i.e., something like $x-- as opposed to $x++). The reason is that if you increment, PHP will check if $x is going over the maximum value, and that adds a performance hit.
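
    How much loop direction matters is likely to depend on the PHP version and on what the loop body does, so it is worth measuring on your own setup. The following is only a sketch of such a measurement: time_loop, $countUp, $countDown and the loop size are all made-up names and values, not anything from the script under discussion.

```php
<?php
// Hypothetical micro-benchmark: count up vs. count down over the same range.

// Runs $loop($n) once and returns [iterations performed, elapsed seconds].
function time_loop(callable $loop, int $n): array
{
    $start = microtime(true);
    $iterations = $loop($n);
    return [$iterations, microtime(true) - $start];
}

// Incrementing counter.
$countUp = function (int $n): int {
    $done = 0;
    for ($x = 0; $x < $n; $x++) {
        $done++;
    }
    return $done;
};

// Decrementing counter.
$countDown = function (int $n): int {
    $done = 0;
    for ($x = $n; $x > 0; $x--) {
        $done++;
    }
    return $done;
};

$n = 1000000; // arbitrary size chosen for illustration
[$upIterations, $upSeconds] = time_loop($countUp, $n);
[$downIterations, $downSeconds] = time_loop($countDown, $n);

printf("count up:   %d iterations in %.4fs\n", $upIterations, $upSeconds);
printf("count down: %d iterations in %.4fs\n", $downIterations, $downSeconds);
```

    Run it several times and compare; a single run is easily skewed by whatever else the machine is doing.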

    That's a start, post back please. I love this kind of stuff! I was also researching it recently for my PHPCounter and got 6X faster functionality in some cases! You just have to think about how you're doing something and if there is a better way to do it.
    eKstreme
    eKstreme.com - Free website tools!
    fontfox - free fonts Hand-picked quality fonts.

  5. #4
    Senior Member Brak's Avatar
    Join Date
    Apr 2003
    Location
    San Francisco, CA
    Posts
    3,413
    Member #
    1217
    Liked
    2 times
    Okay, well I solved my (major) problem. The issue was I had a function that, in theory, shortened a description... only thing is, where I c/p this function from (another file) I never stopped to realize I never zeroed out the variable.. so I was getting every short description before ( .= ). This was causing a particularly heavy slashed part (Someone commented to the likes of "//////////") to be escaped every single loop, drastically reducing performance.
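
    To make the bug concrete, here is a minimal sketch with made-up rows (the real script is not shown in the thread, so $rows and the length arrays are purely illustrative). Without a reset, the string grows on every iteration, so each escape call has more and more work to do.

```php
<?php
// Hypothetical rows standing in for the tab-separated descriptions.
$rows = ["first description", "second description", "third description"];

// Buggy: $shortdesc is never reset, so each row carries every earlier one.
$buggyLengths = [];
$shortdesc = "";
foreach ($rows as $row) {
    $shortdesc .= $row;
    $buggyLengths[] = strlen($shortdesc);
}

// Fixed: zero out the variable at the top of every iteration.
$fixedLengths = [];
foreach ($rows as $row) {
    $shortdesc = "";
    $shortdesc .= $row;
    $fixedLengths[] = strlen($shortdesc);
}

print_r($buggyLengths); // lengths keep growing: 17, 35, 52
print_r($fixedLengths); // lengths stay per-row: 17, 18, 17
```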

    As for my method... Right now I'm using fgets. However, I've run into something strange. In an effort to optimize my script, I started timing everything. The thing is, I'm missing time somewhere. I swear I've split it up right - but when I add together the parts, they don't equal the whole (I'm only timing the loop).

    I have a sneaking suspicion it's this line:
    PHP Code:
    while (!feof($fp)) {

    So, I'm wondering - will it still be faster to use file_get_contents, split into an array of lines, then split each line into an array on tabs? I'll try this method and see what happens... but I still feel I'm missing a good part of my performance somewhere!
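
    Whether feof() is really where the time goes is not established here. One way to take it out of the picture is to let fgets() signal the end of the file itself, since it returns false when there is nothing left to read. A minimal sketch, where read_tsv_rows and the sample data are hypothetical:

```php
<?php
/**
 * Hypothetical helper: read a tab-separated file one line at a time
 * and yield each line as an array of fields.
 */
function read_tsv_rows(string $path): Generator
{
    $fp = fopen($path, 'r');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        // fgets() returns false at end of file, so no separate feof() test is needed.
        while (($line = fgets($fp)) !== false) {
            yield explode("\t", rtrim($line, "\r\n"));
        }
    } finally {
        fclose($fp);
    }
}

// Usage with a throwaway file; the data is made up for illustration.
$path = tempnam(sys_get_temp_dir(), 'tsv');
file_put_contents($path, "Widget\tA small widget\nGadget\tA large gadget\n");
$rows = iterator_to_array(read_tsv_rows($path), false);
unlink($path);

print_r($rows);
```

    Only one line is held in memory at a time, which matters more as the input file grows.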

  6. #5
    Senior Member
    Join Date
    Aug 2003
    Posts
    444
    Member #
    2801
    I would try to avoid arrays altogether. Loop through the string that file_get_contents() gives you as follows (roughly, in PHP):

    $realend = strlen($TheString); // $TheString is what file_get_contents() returns
    $start = 0;

    do {
        // Find the next \n using strpos(), starting at $start. Save it as position $end.
        $end = strpos($TheString, "\n", $start);
        if ($end === false) {
            $end = $realend; // last line has no trailing newline
        }

        // Use substr() to extract from $start to $end. That's your line. Do your magic.
        $line = substr($TheString, $start, $end - $start);

        $start = $end + 1; // step past the \n so the next strpos() does not find it again
    } while ($start < $realend);

    That should keep things nice and tight.

    As for your timing issues, I've seen some really weird things while timing. In one instance, a quick parse was reported as taking over a million seconds (!), but it clearly didn't, as the page showed up instantaneously in my browser. Fret not.
    eKstreme

  7. #6
    Senior Member Brak's Avatar
    Join Date
    Apr 2003
    Location
    San Francisco, CA
    Posts
    3,413
    Member #
    1217
    Liked
    2 times
    Cool...

    In the meantime, while I experiment with some other methods for reading each line: since I know you have a lot of experience with this subject, could you tell me if these three functions are wrong? I feel like they're just hacked together, and they account for a significant part of the processing time.

    PHP Code:
    // This creates a slug for clean urls: "Hello Dolly" to "hello-dolly"
    function sanitize_name($title){
        global $sanitizetime_elapsed;
        $sanitizetime_start = microtime(true);
        $title = strip_tags($title);
        $title = strtolower($title);
        $title = preg_replace('/&.+?;/', '', $title); // kill entities
        $title = preg_replace('/[^%a-z0-9 _-]/', '', $title);
        $title = preg_replace('/\s+/', '-', $title);
        $title = preg_replace('|-+|', '-', $title);
        $title = trim($title, '-');
        $sanitizetime_end = microtime(true);
        $sanitizetime_elapsed += $sanitizetime_end - $sanitizetime_start;
        return $title;
    }


    // because htmlentities() doesn't really do everything I need.
    function encodehtml($input){
        // The replacement entities on this line and the apostrophe line were decoded
        // by the forum; '&amp;$1' and '&#039;' are assumed reconstructions.
        $input = preg_replace('/&([^#])(?![a-z12]{1,8};)/', '&amp;$1', $input);
        $input = str_replace('<', '&lt;', $input);
        $input = str_replace('>', '&gt;', $input);
        $input = str_replace('"', '&quot;', $input);
        $input = str_replace("'", '&#039;', $input);
        $input = nl2br($input);
        return $input;
    }

    // shortens a description, but keeps whole words intact. I feel this one is really bad.
    $str_array = split(" ", $temp[1], 46);            
                if( count($str_array) > 45 ){
                    $shortdesc = "";      
                    for($y=0; $y<45; $y++){            
                        $shortdesc .= " " . $str_array[$y]; 
                    }             
                    $shortdesc .= "&hellip;";
                }else{
                    $shortdesc = $product[0];
                } 

  8. #7
    Senior Member
    Join Date
    Aug 2003
    Posts
    444
    Member #
    2801
    That's a lot of regular expressions! Regexes are known to heavily tax the system. Given how much data you have, the difference in speed will be very noticeable.

    So some suggestions:

    1. Can you merge multiple regexes together? For example:

    PHP Code:
    $title = preg_replace('/&.+?;/', '', $title); // kill entities
    $title = preg_replace('/[^%a-z0-9 _-]/', '', $title);
    can become (I haven't checked if it works, and it's late):

    PHP Code:
    $title = preg_replace('/(&.+?;)|[^%a-z0-9 _-]/', '', $title);
    The idea is to use one regex to do as many things as possible.

    2. You say htmlentities() doesn't do everything you need. Can you use it and then do more stuff? That may save you a couple of calls.

    3. Take a look at the wordwrap() function: http://uk2.php.net/manual/en/function.wordwrap.php . This allows you to cut a string after a certain number of characters, and it respects word boundaries (the spaces!). So assuming that a word is 5 characters long, 45 words is 225 characters. The call then becomes:

    PHP Code:
    $shortdescarr = explode("\t", wordwrap($temp[1], 225, "\t")); // choose another delimiter if you need to.

    $shortdesc = $shortdescarr[0] . "&hellip;";
    Give those a shot and see what comes out. Primarily, see if you can avoid the regular expressions. Off to bed now. Will check back in the morning.
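
    Suggestions 1 and 3 can be tried in isolation before touching the real script. The sketch below is only that: strip_two_pass, strip_one_pass and shorten_by_width are hypothetical names, the sample strings are made up, and the wordwrap() version assumes the description contains no newlines of its own. Note that it cuts by character width, so the result is only roughly 45 words.

```php
<?php
// Suggestion 1: two regex passes vs. one merged pattern.
function strip_two_pass(string $title): string
{
    $title = preg_replace('/&.+?;/', '', $title);       // kill entities
    return preg_replace('/[^%a-z0-9 _-]/', '', $title); // kill disallowed characters
}

function strip_one_pass(string $title): string
{
    // Same two alternatives tried at each position in a single pass.
    return preg_replace('/&.+?;|[^%a-z0-9 _-]/', '', $title);
}

// Suggestion 3: cut at a word boundary near $width characters.
function shorten_by_width(string $text, int $width = 225): string
{
    if (strlen($text) <= $width) {
        return $text; // already short enough, no ellipsis needed
    }
    // wordwrap() inserts "\n" at word boundaries; keep only the first chunk.
    $chunks = explode("\n", wordwrap($text, $width, "\n"));
    return $chunks[0] . "&hellip;";
}

$sample = "tom &amp; jerry's show!";
var_dump(strip_two_pass($sample) === strip_one_pass($sample));
echo shorten_by_width("one two three four", 9), "\n";
```

    Comparing the two regex functions on a slice of the real data is a cheap way to confirm the merged pattern behaves the same before timing it.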
    eKstreme

  9. #8
    Senior Member
    Join Date
    Aug 2003
    Posts
    444
    Member #
    2801
    OK, so this bugged me a lot, and in the end, I found this code to be fastest:

    PHP Code:
    <?php
    function getmicrotime()
    {
       list($usec, $sec) = explode(" ", microtime());
       return ((float)$usec + (float)$sec);
    }

    $count = 2000000;
    $data = file("speed-data.txt");
    $count = count($data);

    $startDo = getmicrotime();
    $i=0;
    do
    {
        $arr = explode("\t", $data[$i]);
        $i++;
    }
    while ($i < $count);
    $stopDo = getmicrotime();

    echo "\nTime for do: ",$stopDo-$startDo;

    $data = explode("\n", file_get_contents("speed-data2.txt"));
    $count = count($data);
    $startDo = getmicrotime();
    $i=0;
    do{
        $arr = explode("\t", $data[$i]);
        $i++;
    }
    while ($i < $count);
    $stopDo = getmicrotime();
    echo "\nTime for fgc: ",$stopDo-$startDo;
    ?>
    The second loop always beat (ever so slightly sometimes) the code that used file() instead of file_get_contents(). I ran it on 5MB and 20MB test text files. I had two copies of the file (as you can see in the code) to minimise any caching effects that may occur.
    eKstreme

  10. #9
    Senior Member
    Join Date
    Aug 2003
    Posts
    444
    Member #
    2801
    Hi

    Another thing that just occurred to me: to make the short description, how about using array_pop() and implode()? http://uk.php.net/manual/en/function.implode.php and http://uk.php.net/manual/en/function.array-pop.php .

    The code will be:

    PHP Code:
    $str_array = split(" ", $temp[1], 46);
    if( count($str_array) > 45 ){
        array_pop($str_array);
        $shortdesc = implode(' ', $str_array) . "&hellip;";
    }else{
        $shortdesc = $product[0];
    }
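
    The same idea can be wrapped in a small function so it is easy to test on its own. This is a sketch under a few assumptions: shorten_by_words is a hypothetical name, it uses explode() because split() was removed in PHP 7, and it returns the input text unchanged when it is already short enough (the snippet above assigns $product[0] in that branch instead).

```php
<?php
// Hypothetical helper: keep at most $maxWords space-separated words.
function shorten_by_words(string $text, int $maxWords = 45): string
{
    // A limit of $maxWords + 1 leaves the unwanted remainder in the final element.
    $words = explode(' ', $text, $maxWords + 1);
    if (count($words) <= $maxWords) {
        return $text; // nothing to cut
    }
    array_pop($words); // discard the remainder
    return implode(' ', $words) . "&hellip;";
}

echo shorten_by_words("one two three four five", 3), "\n"; // one two three&hellip;
echo shorten_by_words("one two", 3), "\n";                 // one two
```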
    eKstreme