
Web parsing - HTML segmentation



icb01co2
10-17-2004, 03:05 PM
Hi Experts,

This is more of a group brainstorm than a question, and any ideas would be appreciated. I am about to develop a program in Java that parses an HTML document into a tree structure based on its tags. I then want to segment that tree into smaller trees that represent sections of the web page based on layout styles, e.g. one tree to represent the menu elements, one to represent a section of text, etc. I want to know if any of you have experience with any of the following:

- Theoretical ideas on how humans instinctively segment a web page document into sections.

- What can constitute vertical splits in HTML other than table and div tags?

- Horizontal splits in pages are sometimes done using long, thin images. Is it possible (using Java) to evaluate the height and width of an image (if not specified in the HTML) using its URL as an argument?

- Can anyone think of any other obvious signs of table/div segmentation besides background colour, background images and borders?

- Some web pages use JavaScript to change the layout of a page depending on the browser, screen resolution, etc. Is it possible to make (using Java) a program that forces the JavaScript in a web document to be performed server-side, hence returning only HTML code? If not, can you think of any other way around this?
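
On the image-size bullet above: the JDK's javax.imageio package can usually report an image's dimensions from its header alone. The following is only a sketch under assumptions: the class name ImageSize, the helper isThinRule and its thresholds are made up for illustration, and which formats are recognised depends on the ImageIO plugins installed.

```java
import java.awt.Dimension;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

public class ImageSize {

    /** Asks an ImageIO reader for width and height; pixel data is not decoded here. */
    public static Dimension read(InputStream in) throws IOException {
        try (ImageInputStream stream = ImageIO.createImageInputStream(in)) {
            if (stream == null) {
                throw new IOException("No image stream provider available");
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(stream);
            if (!readers.hasNext()) {
                throw new IOException("Unrecognised image format");
            }
            ImageReader reader = readers.next();
            try {
                // seekForwardOnly = true, ignoreMetadata = true: we only want the size
                reader.setInput(stream, true, true);
                return new Dimension(reader.getWidth(0), reader.getHeight(0));
            } finally {
                reader.dispose();
            }
        }
    }

    /** Convenience overload that opens the image URL itself. */
    public static Dimension read(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return read(in);
        }
    }

    /** Hypothetical heuristic: a very flat, wide image is probably a horizontal divider. */
    public static boolean isThinRule(Dimension size, int maxHeight, int minWidth) {
        return size.height <= maxHeight && size.width >= minWidth;
    }
}
```

A call such as ImageSize.read(new URL(src)) would need the img src resolved against the page's base URL first, and the thresholds passed to isThinRule would have to be tuned against real pages.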

I know it's a lot, but any help would be appreciated.
Thanks, Chris.

Maximilian
10-19-2004, 03:56 AM
Howdy Chris,

Here are 2 Java programs that will fit your needs perfectly:

1.) BlueShoes Java TreeView:
Works like the tree in Windows Explorer. Features: - Use your own images/colors or choose from predefined ones. - Built-in checkbox system (like a Windows installer to select components). - Built-in radio button to use the tree as a form field. - Lots of options and API functions. - Can be used directly as a form field. See the Bs_Form package. Just go to:
http://www.blueshoes.org/en/javascript/tree/


2.) JTree
Just take your data from database or file, and JTree displays it in the best way in Web-browser or Java application, providing comfortable and powerful functionality to navigate your data and to manage it! Just go to:
http://scbr.com/docs/products/jtree/jtree.shtml

Cheers!
Max

mikmik
10-19-2004, 07:29 AM
- Some web pages use JavaScript to change the layout of a page depending on the browser, screen resolution, etc. Is it possible to make (using Java) a program that forces the JavaScript in a web document to be performed server-side, hence returning only HTML code? If not, can you think of any other way around this?
You can use PHP to generate CSS on the fly, based on browser information.

Also, ASP is basically a server-side technology that can use JavaScript.



- Theoretical ideas on how humans instinctively segment a web page document into sections.
Whitespace? (besides the obvious ones - colors and borders and fonts)
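
To make those cues a bit more concrete, here is a rough sketch of how a Java segmenter might score one table or div as a likely section boundary from its attributes. Treat all of it as an assumption: the class SegmentHints, the list of cues and the one-point-per-cue scoring are invented for illustration, and the style check is a crude substring match rather than a real CSS parser.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class SegmentHints {

    // Presentational HTML attributes that often mark a visually distinct block.
    private static final Set<String> VISUAL_ATTRIBUTES =
            new HashSet<>(Arrays.asList("bgcolor", "background", "border"));

    // Inline-style properties treated as cues; margin and padding stand in for whitespace.
    private static final String[] VISUAL_STYLE_PROPERTIES =
            {"background", "border", "margin", "padding"};

    /** Returns one point per visual cue found on a single element's attributes. */
    public static int score(Map<String, String> attributes) {
        int score = 0;
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            String name = entry.getKey().toLowerCase(Locale.ROOT);
            String value = entry.getValue() == null
                    ? "" : entry.getValue().trim().toLowerCase(Locale.ROOT);
            if (VISUAL_ATTRIBUTES.contains(name)) {
                // border="0" or an empty value draws nothing, so it is not a cue
                if (!value.isEmpty() && !value.equals("0")) {
                    score++;
                }
            } else if (name.equals("style")) {
                for (String property : VISUAL_STYLE_PROPERTIES) {
                    if (value.contains(property)) {
                        score++;
                    }
                }
            }
        }
        return score;
    }
}
```

An element with bgcolor="#cccccc" and border="0" scores 1 here, since the zero border is ignored. Styles that come from external stylesheets are not seen at all by this approach.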


Also
Server-Side JavaScript (http://docs.sun.com/source/816-6411-10/intro.htm)
On the server, you also embed JavaScript in HTML pages. The server-side statements can connect to relational databases from different vendors, share information across users of an application, access the file system on the server, or communicate with other applications through LiveConnect and Java. HTML pages with server-side JavaScript can also include client-side JavaScript.

In contrast to pure client-side JavaScript pages, HTML pages that use server-side JavaScript are compiled into bytecode executable files. These application executables are run by a web server that contains the JavaScript runtime engine. For this reason, creating JavaScript applications is a two-stage process.
This chapter (http://docs.sun.com/source/816-6411-10/partbase.htm) provides an overview of what a typical server-side JavaScript application looks like, and it shows you how to set up your system for server-side development.

------------------------------

If your HTML is well formed according to XHTML (tags properly nested and closed), it is possible to use the DOM and namespaces to organize and present information from a document.
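
As a minimal sketch of that idea in Java (the language the original question asks about), the JDK's own XML parser can turn well-formed XHTML into a DOM tree. The class name XhtmlOutline and the indented-outline format are assumptions for illustration. Ordinary tag-soup HTML will be rejected by this parser, and a page with a DOCTYPE or entities such as &nbsp; needs extra setup that is not shown.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XhtmlOutline {

    /** Parses well-formed XHTML into a DOM document; throws if tags are not nested and closed. */
    public static Document parse(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    /** Returns one line per element, indented two spaces per nesting level. */
    public static String outline(Node node) {
        StringBuilder out = new StringBuilder();
        append(node, 0, out);
        return out.toString();
    }

    private static void append(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(node.getNodeName()).append('\n');
            depth++; // children of an element are one level deeper
        }
        // Text, comment and document nodes are skipped, but their children are still visited.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            append(child, depth, out);
        }
    }
}
```

Passing the parsed document to outline gives a quick picture of the tag tree, which is a handy starting point before deciding where to cut it into segments.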

Dang, I can't remember where I was just looking at it. DevShed??
Here is some similar info on JavaScript and XML:
Working with XML and JavaScript (http://www.wdvl.com/Authoring/JavaScript/JSDesign/)

I'm not sure this helps much, but Max's links are lookin' good for me! Thanks, Max, that first one is a beauty for versatility and range.

paulhiles
10-19-2004, 08:43 AM
There are many ways of breaking down a standard HTML page; some of the simplest use client-side JavaScript. There are many useful little JavaScript functions available that will highlight block elements, remove images, and apply all manner of presentation formatting. Just Google for favelets or bookmarklets.

Many of these are incorporated into Firefox's Web Developer extension. If you don't have it already I suggest you download and start playing around with it straight away! :o)

Of course, you may prefer to analyze your page's structure. I would suggest you take a look at how XML files are constructed and how they can be formatted with XSL (http://www.webproworld.com/viewtopic.php?p=152870#152870) or CSS.

One powerful means of changing the structure of a document is via the Document Object Model (DOM). You can certainly use Java to manipulate the DOM. This tutorial (http://java.sun.com/xml/jaxp/dist/1.1/docs/tutorial/dom/1_read.html) from Sun walks you through the necessary steps.
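
For the segmentation goal in the original question, a rough sketch with the standard org.w3c.dom API might copy each outermost element of a given tag into its own small document. The class DomSegmenter and the "split on one tag name" rule are assumptions for illustration; a real segmenter would pick its cut points from layout cues rather than a single tag, and the input must already be well-formed XHTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomSegmenter {

    /** Returns one new document per outermost element named tagName, in document order. */
    public static List<Document> split(String xhtml, String tagName) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document page = builder.parse(new InputSource(new StringReader(xhtml)));

        List<Document> segments = new ArrayList<>();
        NodeList matches = page.getElementsByTagName(tagName);
        for (int i = 0; i < matches.getLength(); i++) {
            Element element = (Element) matches.item(i);
            if (hasAncestor(element, tagName)) {
                continue; // nested match: it is already part of an outer segment
            }
            Document segment = builder.newDocument();
            // importNode makes a deep copy owned by the new document; the page is left untouched
            segment.appendChild(segment.importNode(element, true));
            segments.add(segment);
        }
        return segments;
    }

    private static boolean hasAncestor(Element element, String tagName) {
        for (Node parent = element.getParentNode(); parent != null; parent = parent.getParentNode()) {
            if (parent.getNodeType() == Node.ELEMENT_NODE && tagName.equals(parent.getNodeName())) {
                return true;
            }
        }
        return false;
    }
}
```

Calling DomSegmenter.split(page, "div") on a page with a menu div and a content div would give two small trees that can then be processed separately.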

Take a look at these links and read the rest of that thread, there's some interesting stuff to be found!

Paul

Maximilian
10-19-2004, 12:48 PM
I'm not sure this helps much, but Max's links are lookin' good for me! Thanks, Max, that first one is a beauty for versatility and range.

Thanks Mik Mik, for the kind compliments - You Rock!

Cheers!
Max

mikmik
10-19-2004, 04:27 PM
Ha! You are more than welcome, my helping friend!

I read this post, then I went checking email, got this:

If you're respectful by habit,
constantly honoring the worthy,
four things increase:
long life, beauty,
happiness, strength.

-Dhammapada, 8, translated by Thanissaro Bhikkhu.

Sorry for off topic post, it is a tribute to all who help, and give others compliments :O)