
Thread: I Need A Way To Scrape A Site's Content

  1. #11
    WebProWorld MVP williamc's Avatar
    Join Date
    Jul 2003
    Location
    On a really big hill in Kentucky
    Posts
    4,538
    I agree with julien that HTTrack is probably going to do what you want best. I have used it a number of times when writing my own spiders was not warranted.

    As for your external links problem, you would set up rules (filters) to allow only files from the domain you are copying.

    For example:

    Code:
    -*
    +www.example.com/*.html
    +www.example.com/*.php
    +www.example.com/*.asp
    +www.example.com/*.gif 
    +www.example.com/*.jpg 
    +www.example.com/*.png
    -mime:*/* +mime:text/html +mime:image/*
    Notice that this excludes ALL files first, then allows only files from your own domain; add any other allowances as needed.
    Last edited by williamc; 12-30-2010 at 12:48 PM.

  2. #12
    WebProWorld MVP kgun's Avatar
    Join Date
    May 2005
    Location
    Norway
    Posts
    7,713
    Quote Originally Posted by williamc View Post
    I agree with julien that HTTrack is probably going to do what you want best. I have used it a number of times when writing my own spiders was not warranted.
    But screen scraping is much simpler, and I think faster, via your own PHP class / function. It is described in the first real chapter of the book, chapter 3, page 21: "Downloading Web Pages". The other chapters describe how to parse and manipulate the content.
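
    For comparison, here is a minimal sketch of that first step, downloading a web page, using the Python 3 standard library (the URL is a placeholder, and this is a sketch rather than the book's code):

    Code:
    # Minimal "download a web page" sketch (Python 3).
    # The URL is a placeholder; swap in the page you want to scrape.
    import urllib.request

    response = urllib.request.urlopen("http://www.example.com/")
    html = response.read().decode("utf8", "replace")  # raw page markup as text
    print(html[:200])  # print the first 200 characters to show it worked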
    Last edited by kgun; 12-30-2010 at 12:43 PM.

  3. #13
    WebProWorld MVP williamc's Avatar
    Join Date
    Jul 2003
    Location
    On a really big hill in Kentucky
    Posts
    4,538
    Quote Originally Posted by kgun View Post
    But screen scraping is much simpler, and I think faster, via your own PHP class / function.
    I agree it would be faster, and Perl would be faster yet. But for his purpose, generically downloading an entire site's contents, HTTrack already does that pretty damn well, and it is fairly resource-friendly and fast.

    BTW, you should check out wget for scraping images, rather than cURL.

  4. #14
    WebProWorld MVP kgun's Avatar
    Join Date
    May 2005
    Location
    Norway
    Posts
    7,713
    Quote Originally Posted by williamc View Post
    I agree it would be faster, and Perl would be faster yet. But for his purpose, generically downloading an entire site's contents, HTTrack already does that pretty damn well, and it is fairly resource-friendly and fast.
    Perl and regular expressions: not my love, and I have not needed them so far. I prefer Python (created by a mathematician now employed by Google) to Perl (created by a linguistics-educated person).

    Quote Originally Posted by williamc View Post
    BTW, you should check out wget for scraping images, rather than cURL.
    I have got the impression that wget is generally not as good as cURL. There is a lot of such software, and I think more and more of it uses the Swedish cURL tool, whose developers

    http://www.haxx.se/

    also have a strong background in assembly.

    What is so difficult about scraping images if they are contained within an image tag?
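
    For instance, here is a minimal sketch that pulls the src attribute out of every image tag using Python 3's standard-library HTMLParser (the URL is a placeholder):

    Code:
    # Collect the src attribute of every <img> tag on a page (Python 3).
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class ImgSrcParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.sources = []  # collected src attribute values

        def handle_starttag(self, tag, attrs):
            if tag == "img":
                for name, value in attrs:
                    if name == "src":
                        self.sources.append(value)

    html = urlopen("http://www.example.com/").read().decode("utf8", "replace")
    parser = ImgSrcParser()
    parser.feed(html)
    print(parser.sources)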
    Last edited by kgun; 12-30-2010 at 12:58 PM.

  5. #15
    WebProWorld MVP williamc's Avatar
    Join Date
    Jul 2003
    Location
    On a really big hill in Kentucky
    Posts
    4,538
    Quote Originally Posted by kgun View Post
    I prefer Python (created by a mathematician now employed by Google) to Perl (created by a linguistics-educated person).
    Perl is still documented as faster than PHP or Python for file operations.

    Quote Originally Posted by kgun View Post
    I have got the impression that wget is generally not as good as cURL.
    When downloading entire files such as images, wget is more reliable, and often faster: cURL merely grabs the data into memory, so a file handle still has to be opened in PHP, written to, and then closed. wget does not have that issue, as it downloads the image(s) directly to file(s) on the server.
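
    In Python terms, the same distinction looks roughly like this (URLs and filenames are placeholders):

    Code:
    # Two ways to save a remote image (Python 3).
    import urllib.request

    # cURL-style: fetch the bytes into memory, then open, write, and close
    # a local file yourself.
    data = urllib.request.urlopen("http://www.example.com/logo.png").read()
    with open("logo.png", "wb") as f:
        f.write(data)

    # wget-style: stream the resource straight to a local file.
    urllib.request.urlretrieve("http://www.example.com/logo.png", "logo.png")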

  6. #16
    WebProWorld MVP kgun's Avatar
    Join Date
    May 2005
    Location
    Norway
    Posts
    7,713
    The most efficient would of course be to use C / C++:

    http://www.diskusjon.no/index.php?showtopic=1272340 (Use Google translate).

    http://curl.haxx.se/libcurl/ (libcurl also has bindings for Perl; see the list along the right edge of that page).
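
    libcurl can also be driven from a scripting language; here is a minimal sketch using the pycurl binding (this assumes pycurl is installed, and the URL is a placeholder):

    Code:
    # Fetch a page through libcurl via the pycurl binding.
    from io import BytesIO
    import pycurl

    buffer = BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, "http://www.example.com/")
    curl.setopt(pycurl.WRITEFUNCTION, buffer.write)  # collect the response body
    curl.perform()
    curl.close()
    print(buffer.getvalue().decode("utf8", "replace"))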
    Last edited by kgun; 12-30-2010 at 01:39 PM.

  7. #17
    WebProWorld MVP kgun's Avatar
    Join Date
    May 2005
    Location
    Norway
    Posts
    7,713
    Here, post #101:

    http://www.webproworld.com/webmaster...l=1#post533167

    is an example of how easy it is to scrape content from online text files with Python 2.5.

    For a similar example for Python 3.*, see post #2:

    http://www.diskusjon.no/index.php?showtopic=1269337 (Use Google Translate).

  8. #18
    WebProWorld MVP williamc's Avatar
    Join Date
    Jul 2003
    Location
    On a really big hill in Kentucky
    Posts
    4,538
    Why does your example use urllib when urllib2 has been shown to have a smaller footprint?

  9. #19
    WebProWorld MVP kgun's Avatar
    Join Date
    May 2005
    Location
    Norway
    Posts
    7,713
    Quote Originally Posted by kgun View Post
    Here, post #101:

    For a similar example for Python 3.*, see post #2:

    http://www.diskusjon.no/index.php?showtopic=1269337 (Use Google Translate).
    Did you use Google Translate to read that thread?

    Quote Originally Posted by williamc View Post
    Why does your example use urllib when urllib2 has been shown to have a smaller footprint?
    The following code works with Python 3.*.

    Code:
    # Prim5.py
    #
    # Kjell Bleivik 2010: www.kjellbleivik.com - www.oopschool.com - www.digitalpunkt.no
    #
    # ---------------------------------------------------------------------------------------------------
    #
    # Opens the web document with "The First 10,000 Primes", reads it, and prints
    # the primes as one list.  For version 3.1.2.  The version 2.5 code is commented out.
    # Reference:  http://diveintopython3.org/porting-c...with-2to3.html
    #
    # ---------------------------------------------------------------------------------------------------
    #
    #import urllib
    #response = urllib.urlopen('http://primes.utm.edu/lists/small/10000.txt')
    #primes = response.read()
    #print (primes)
    import urllib.request

    response = urllib.request.urlopen('http://primes.utm.edu/lists/small/10000.txt')
    primes = response.read().decode("utf8")
    my_primes = []
    # Lines 4 .. 1003 of the file hold the primes, ten per line;
    # everything outside that range is header and footer text.
    for i, line in enumerate(primes.split('\n')):
        if 3 < i < 1004:
            my_primes.extend(int(item) for item in line.split())
    print(my_primes)
    # You can take out the numbers you want, e.g. the first 10 primes:
    #print (my_primes[0:10])
    So how would you write the code with urllib2?

    It is not as simple as replacing urllib with urllib2 in the above code; if I do that, I get the following error:

    Traceback (most recent call last):
    File "E:\PyKildekode\Prim5.py", line 18, in <module>
    import urllib2.request
    ImportError: No module named urllib2.request

    (urllib2 only exists in Python 2; in Python 3 it was merged into urllib.request.)
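
    Under Python 2, the urllib2 version of the fetch step would be roughly this sketch (untested here):

    Code:
    # Python 2 sketch: urllib2 replaces only the download step.
    import urllib2

    response = urllib2.urlopen('http://primes.utm.edu/lists/small/10000.txt')
    primes = response.read()
    print primes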

    The fastest solution is of course to use cURL from C or C++. As a rule of thumb, compiled C / C++ code runs about 20 times faster than Python code.

  10. #20
    Moderator mjtaylor's Avatar
    Join Date
    Dec 2003
    Location
    The Moon
    Posts
    7,051
    FrontPage does it well, IMO.
