WebProWorld Part of WebProNews.com
  #1 (permalink)  
Old 07-31-2005, 10:43 PM
WebProWorld Pro
 

Join Date: Oct 2004
Location: NYC, USA
Posts: 148
web-content-king RepRank 0
Default Program or script to strip links from html

I need a program or script that will take an html file or a section of html and remove all the links--i.e., all the <a href....... you know what I mean.

Problem is that I have a public-domain document I want to add to my site (US-gov-created), but they did that annoying thing of hyperlinking every term of any importance to other web pages, and I just want to take them all out.

I do not, however, want to remove their HTML tags since the HTML is perfectly standards-compliant with nothing weird.

I would prefer a solution that leaves the anchor text but only removes the anchor tags.

Any ideas?
  #2 (permalink)  
Old 08-03-2005, 05:03 PM
WebProWorld New Member
 

Join Date: Nov 2003
Posts: 3
gambler RepRank 0
Default

You might try pasting the page into an MS Word document and removing the hyperlinks there.

Correction: select the entire document (Ctrl+A), then press Ctrl+Shift+F9. This removes all the hyperlinks and should leave the rest of the text alone.
__________________
Visit The Las Vegas Gambler on line at http://www.LasVegasGambler.net for info on how to play our games and how to win with casino stocks.
  #3 (permalink)  
Old 08-04-2005, 04:53 AM
WebProWorld Veteran
 

Join Date: Aug 2003
Location: Cornwall, UK
Posts: 862
speed RepRank 1
Default

This isn't pretty but does strip tags (requires PHP 4.3.0 or later):
Code:
<?php

$data = file_get_contents('http://www.tolranet.co.uk');
$data = str_replace('</a>', '', $data);
$data = preg_replace('/<a[^>]+href[^>]+>/', '', $data);

// Write to browser
echo $data;

// Write to file
$f = fopen('noatags.html', 'w');
fwrite($f, $data);
fclose($f);

?>
Put that into a file, e.g. strip.php, and change the URL of the page to strip from www.tolranet.co.uk to whatever you want. You don't have to use a URL; you can use the name of a file on the same machine. (Fetching a URL this way only works if allow_url_fopen is enabled in php.ini.)

When you access the script from a browser it will load the page at the URL without the <a> tags, and save it to a file called noatags.html (assuming you have write permissions for that file).

Any problems with it let me know.
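
For reference, here is a stricter variant of the same idea, as a minimal sketch rather than a tested replacement. It assumes PHP 7 or later, and strip_links is a hypothetical helper name that is not part of the script above. The pattern matches each opening tag together with its closing tag, so anchors without an href (such as <a name="top">) stay intact, upper-case tags like <A HREF=...> are handled, and tags that merely start with the letter a (<abbr>, <area>) are not touched.

```php
<?php
// Hypothetical helper (not from the thread): unwrap <a href=...>text</a>,
// keeping the link text and any markup inside the link.
function strip_links(string $html): string
{
    // <a\b       : the tag name must be exactly "a" (not <abbr> or <area>)
    // \shref\s*= : the tag must carry an href attribute
    // (.*?)      : the link contents, captured non-greedily
    // i = ignore case, s = let "." match line breaks
    $pattern = '/<a\b[^>]*\shref\s*=[^>]*>(.*?)<\/a\s*>/is';
    $result = preg_replace($pattern, '$1', $html);

    // preg_replace() returns null on error; fall back to the input in that case
    return $result === null ? $html : $result;
}

$sample = '<p>See the <A HREF="http://example.gov/term">term</A> and <a name="top">top</a>.</p>';
echo strip_links($sample), "\n";
// prints: <p>See the term and <a name="top">top</a>.</p>
```

A regular expression is still only an approximation of an HTML parser: a > inside an attribute value, or a link that is never closed, would confuse this pattern.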
__________________
US & UK Web Hosting with hourly backups | Hosting Affiliate Scheme | Web Directory 2 for 1 Offer
  #4 (permalink)  
Old 08-04-2005, 10:12 AM
WebProWorld 1,000+ Club
 

Join Date: May 2005
Location: Norway
Posts: 5,402
kgun RepRank 3
Default Should be simple / effective in C++, C# or Java.

A simple, effective solution if you have the tools and expertise in your organization:

It is simple string manipulation (on a pointer / array).

More advanced:
A simple database such as MS Access has a "Hyperlink" data type.

1. Parse the code and put the URLs into Access
(URL in one field and anchor text in a text field
- or was it record?)
2. Now it is easy to sort etc.
3. By combining it with Visual Basic for Applications and (embedded) SQL you should be able to write programs that do the operations you need on the database.
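
Step 1 above (collecting each URL together with its anchor text) might look roughly like this. It is sketched in PHP only because the rest of the thread uses PHP, not because the post suggests it. extract_links is a hypothetical helper name, PHP 7 or later is assumed, and only quoted href values are recognised.

```php
<?php
// Hypothetical helper: return one ['url' => ..., 'text' => ...] row per link.
function extract_links(string $html): array
{
    // Group 1 = the quote character, group 2 = the URL, group 3 = the link contents
    $pattern = '/<a\b[^>]*\shref\s*=\s*(["\'])(.*?)\1[^>]*>(.*?)<\/a\s*>/is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $rows = [];
    foreach ($matches as $m) {
        $rows[] = [
            'url'  => html_entity_decode($m[2]),  // turn &amp; back into &
            'text' => trim(strip_tags($m[3])),    // anchor text without inner markup
        ];
    }
    return $rows;
}

$sample = '<p><a href="http://example.gov/a">First term</a>, <a href=\'b.html\'>second <em>term</em></a></p>';
foreach (extract_links($sample) as $row) {
    // Tab-separated output, one link per line
    echo $row['url'], "\t", $row['text'], "\n";
}
```

The tab-separated lines could then be saved as a text file and imported into a database table with one URL column and one text column.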

http://www.techonthenet.com/access/f...reate_date.php

C# is perhaps the most effective language. The advanced IntelliSense functionality in Visual Studio makes it extremely productive.

Some helpful links:

http://www.c-sharpcorner.com/code/20...ngLanguage.asp
http://msdn.microsoft.com/vstudio/
http://groups.msn.com/
http://www.microsoft.com/communities...ortalHome.mspx
http://msdn.microsoft.com/chats/
http://forums.microsoft.com/msdn/

Perhaps not a simple solution, but your question may be the tip of an iceberg?

P.S. As gambler said, open the document in Word. Then you can save it in different formats and, because of the DDE in MS Office, you may
1. Save it as plain text.
2. Import that text into Excel in different columns (depends on the separator for columns).
3. Save it as an Excel file.
4. Import it from Excel to Access as a database where you select fields and records.
5. Perhaps import it directly into Access from Word. Use Help to check for possibilities.
6. Import it from Access to Oracle, Sybase or MySQL etc. etc.
7. Combine it with a crawler and you have an auto-generated directory à la http://www.craigslist.com/

It is mostly (some will say only) a programming (and embedding) task.

Kjell Gunnar Bleivik
http://www.multifinanceit.com/
http://www.blognorway.com/
  #5 (permalink)  
Old 08-04-2005, 01:01 PM
WebProWorld Pro
 

Join Date: Jul 2005
Location: Eielson AFB, AK
Posts: 174
Evic RepRank 0
Default

Don't use Word: you'll get a ton of IE proprietary code and lose your W3C compliance.

I'd go with the PHP solution that was previously mentioned.
__________________
Michael Wales

My Blog: GibThis: Video Game Blog
  #6 (permalink)  
Old 08-04-2005, 02:00 PM
WebProWorld 1,000+ Club
 

Join Date: May 2005
Location: Norway
Posts: 5,402
kgun RepRank 3
Default IE proprietary code ?

It is not EBCDIC

http://www.natural-innovations.com/c...ciiebcdic.html

even if that could be handled too.

If he is only interested in operating on strings (an ad hoc project), an applet or servlet is good enough.

"Good enough is best."

Kjell Bleivik
http://www.multifinanceit.com/
http://www.blognorway.com/
  #7 (permalink)  
Old 08-07-2005, 03:23 PM
WebProWorld Pro
 

Join Date: Oct 2004
Location: NYC, USA
Posts: 148
web-content-king RepRank 0
Default

Hi Speed,

I'm a complete php idiot, I saved it to striplinks.php, didn't make any changes to it (to test it out with your url), and opening it up all that came back was ', '', $data); $data = preg_replace('/]+href[^>]+>/', '', $data); // Write to browser echo $data; // Write to file $f = fopen('noatags.html', 'w'); fwrite($f, $data); fclose($f); ?>

No html file was created. What went wrong?

Thanks
  #8 (permalink)  
Old 08-07-2005, 03:56 PM
WebProWorld Veteran
 

Join Date: Aug 2003
Location: Cornwall, UK
Posts: 862
speed RepRank 1
Default

It sounds like you don't have the ability to run PHP on that web server.

However, it's worth checking that the <?php tag is at the start of the file. Assuming it is, you need to check whether your web host supports PHP and whether you need to do anything special to run PHP scripts. I know of one host where PHP scripts have to be uploaded to a different area from normal HTML pages.
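
One quick way to test whether the host runs PHP at all is a tiny check script, sketched below. The helper name php_is_new_enough is an assumption for illustration, as is any file name you save it under. If the browser shows the source code instead of a version number, PHP is not being executed in that folder.

```php
<?php
// Hypothetical check script. No type declarations are used,
// so it also parses on very old PHP versions.
function php_is_new_enough($minimum)
{
    // version_compare() with '>=' returns true when the running
    // PHP version is at least $minimum
    return version_compare(PHP_VERSION, $minimum, '>=');
}

echo 'PHP is running, version ' . PHP_VERSION . "\n";
echo php_is_new_enough('4.3.0')
    ? "file_get_contents() should be available\n"
    : "This PHP is too old for file_get_contents()\n";
```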

If the host doesn't support PHP then we'll have to have another think, unless you can borrow some PHP enabled web space to strip the documents.

Let me know how you get on.
  #9 (permalink)  
Old 08-07-2005, 07:00 PM
WebProWorld Pro
 

Join Date: Oct 2004
Location: NYC, USA
Posts: 148
web-content-king RepRank 0
Default

Hi--thanks a lot. Is it possible to get the script to work with a file located on the local hard drive and not just one uploaded to the server? That would be a big time-saver: just set a default filename, strip.htm, and save any file I want to strip under that name and voila!

Found my server's scripts directory, thanks.
  #10 (permalink)  
Old 08-07-2005, 08:05 PM
WebProWorld Veteran
 

Join Date: Aug 2003
Location: Cornwall, UK
Posts: 862
speed RepRank 1
Default

Yes, it's possible to run PHP on a local machine: http://www.php.net/downloads.php is the main PHP download page, and http://www.firepages.com.au/devindex.htm offers a complete PHP and Apache setup for Windows.

Changing the script to:
Code:
<?php

$data = file_get_contents('strip.html');
$data = str_replace('</a>', '', $data);
$data = preg_replace('/<a[^>]+href[^>]+>/', '', $data);

// Write to browser
echo $data;

// Write to file
$f = fopen('noatags.html', 'w');
fwrite($f, $data);
fclose($f);

?>
This lets you upload the file to convert as strip.html into the same folder as the PHP script above, access the script from the browser, and then download noatags.html.

If you've got command line access you can invoke the above script with something like "php strip.php", depending on your installation, rather than accessing it with a browser.
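
For command-line use, the file names could be passed as arguments instead of being hard-coded. The following is only a sketch under assumptions: PHP 7 or later, a hypothetical helper named strip_links_in_file, and a stricter pattern than the one in the script above (it pairs each opening tag with its closing tag and ignores case).

```php
<?php
// Hypothetical helper: read $inPath, drop the <a href> tags, write $outPath.
// Returns the number of bytes written, or false on failure.
function strip_links_in_file(string $inPath, string $outPath)
{
    $html = @file_get_contents($inPath);
    if ($html === false) {
        return false;  // input file missing or unreadable
    }
    $clean = preg_replace('/<a\b[^>]*\shref\s*=[^>]*>(.*?)<\/a\s*>/is', '$1', $html);
    if ($clean === null) {
        return false;  // the regex engine reported an error
    }
    return file_put_contents($outPath, $clean);
}

// Only runs when called as: php strip.php strip.html noatags.html
$args = $_SERVER['argv'] ?? [];
if (PHP_SAPI === 'cli' && count($args) === 3) {
    $written = strip_links_in_file($args[1], $args[2]);
    if ($written === false) {
        fwrite(STDERR, "Could not convert {$args[1]}\n");
        exit(1);
    }
    echo "Wrote $written bytes to {$args[2]}\n";
}
```

Run without the two arguments, the script defines the function and does nothing else, so it is safe to open by accident.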

I don't know how many files you have to strip. If it's only a few, it's not worth installing PHP locally; if you've got hundreds, it would probably be better to update the script to convert all the HTML files in a folder so you can bulk upload, convert, and download.
  #11 (permalink)  
Old 08-20-2005, 06:55 PM
WebProWorld Pro
 

Join Date: Oct 2004
Location: NYC, USA
Posts: 148
web-content-king RepRank 0
Default

How do I do it so I can do it in bulk?

How bad is this? Just remove the .html to go from a file to a directory?

Code:
<?php

$data = file_get_contents('strip');
$data = str_replace('</a>', '', $data);
$data = preg_replace('/<a[^>]+href[^>]+>/', '', $data);

// Write to file
$f = fopen('noatags', 'w');
fwrite($f, $data);
fclose($f);

?>
  #12 (permalink)  
Old 08-21-2005, 06:03 AM
WebProWorld Veteran
 

Join Date: Aug 2003
Location: Cornwall, UK
Posts: 862
speed RepRank 1
Default

Put the following in a .php file:
Code:
<?php

$d = opendir('in');
while (($file = readdir($d)) !== false) {
    // Skip '.', '..' and any sub-folders; only regular files are converted
    if (is_file('in/' . $file)) {
        $data = file_get_contents('in/' . $file);
        $data = str_replace('</a>', '', $data);
        $data = preg_replace('/<a[^>]+href[^>]+>/', '', $data);

        // Write to browser
        echo "file $file<br>\n";

        // Write to file
        $f = fopen('out/' . $file, 'w');
        fwrite($f, $data);
        fclose($f);
    }
}
closedir($d);

echo "Done...<br>\n";
?>
Create a folder called 'in' and a folder called 'out' in the same folder as the script. NOTE: 'out' must be writable by the script.

Upload all the .html files to the 'in' folder, access the script from a web browser, then download all the converted ones from 'out'.

The script will overwrite files in the 'out' folder that have the same name.
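
If it helps to see what was changed, the same loop can be wrapped in a function that only touches .html and .htm files and reports how many links it removed from each. This is a sketch under assumptions: PHP 7 or later, a hypothetical function name strip_links_in_dir, and a stricter pattern than the one in the script above.

```php
<?php
// Hypothetical helper: convert every .html/.htm file in $inDir into $outDir.
// Returns an array of file name => number of links removed.
function strip_links_in_dir(string $inDir, string $outDir): array
{
    $report = [];
    foreach (scandir($inDir) ?: [] as $name) {
        $path = $inDir . '/' . $name;
        $ext = strtolower(pathinfo($name, PATHINFO_EXTENSION));
        if (!is_file($path) || !in_array($ext, ['html', 'htm'], true)) {
            continue;  // skip '.', '..', sub-folders and other file types
        }
        $html = file_get_contents($path);
        if ($html === false) {
            continue;  // unreadable file
        }
        // The fifth argument receives the number of replacements made
        $clean = preg_replace('/<a\b[^>]*\shref\s*=[^>]*>(.*?)<\/a\s*>/is', '$1', $html, -1, $count);
        if ($clean === null) {
            continue;  // regex error, leave this file out of the report
        }
        file_put_contents($outDir . '/' . $name, $clean);
        $report[$name] = $count;
    }
    return $report;
}

// Same folder layout as above: 'in' and 'out' next to the script
if (is_dir('in') && is_dir('out')) {
    foreach (strip_links_in_dir('in', 'out') as $name => $count) {
        echo "$name: $count links removed\n";
    }
}
```

A file that reports 0 links removed is worth a second look, since it may use markup the pattern does not recognise.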