ldarrambide · 07-14-2009, 09:19 AM

Hello,

I'm now fighting with HTML text extraction.

I've got several different types of data to extract. I tried hard with the HtmlDoc class and the provided examples, but it's not enough.

NB: red line is wanted text

1)
<div class='text'>
<b>Covers</b><br/>
http://xxxyyyyzzzz.com/somefile.html <- i want that
</div>

2)
<a href="http://xxxyyyyzzzz.com/somefile.html" target="_blank">http://xxxyyyyzzzz.com/somefile.html</a></div>

3)
<div class="image">
<a href="http://xxxyyyyzzzz.com/somefile" target="_blank"><img src="http://xxxyyyyzzzz.com/somefile.jpeg"

4)
<a href="http://xxxyyyyzzzz.com/somefile" target="_top">Download</a><br>

5)
dd.d3.getElementById("lgpd").outerHTML : why d3, and what is it?

6) Where can I find a reference for containerTag & containerNameOrIndex?

Long post, but I've been searching for a long time :/

Kind regards,
Laurent.

***Gintaras*** · 07-14-2009, 10:44 AM

HtmlDoc can get the HTML or text of a specified tag. To get what is inside, use string functions, e.g. findrx. HTML element functions can also be used.

Macro Macro1090

str s=
;<body>
;<div class='text'>
;<b>Covers</b><br/>
;http://xxxyyyyzzzz.com/somefile.html <- i want that
;</div>
;</body>
HtmlDoc d.InitFromText(s)
;str s2=d.GetHtml("div" 0)
str s2=d.GetText("div" 0)
;out s2
str s3
if(findrx(s2 "\bhttp:\S+" 0 1 s3)<0) ret
out s3

Often you can easily extract the required strings from the whole page HTML using findrx. Use HtmlDoc only when that is too difficult. HtmlDoc uses the IE HTML parsing engine to parse the page HTML into smaller elements. Then you find the required elements, and work with their text or HTML using string functions.

containerTag is an HTML tag name, like div. To find the first div, use d.GetText("div" 0); to find the next div, use d.GetText("div" 1), and so on.

HtmlDoc.d and d3 are variables of type IHTMLDocument2 and IHTMLDocument3. Both can be used to access the MSHTML DOM. They are documented in the MSDN Library.
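
The "regex over the whole page" approach described above can be illustrated outside QM. The following is only a minimal sketch in Python rather than QM code: the sample markup is copied from cases 1 and 2 of the question, while `URL_RE` and `find_urls` are hypothetical names chosen for illustration. The pattern is also slightly stricter than the `\bhttp:\S+` used in the macro, an assumption made so that a match stops at a closing quote or angle bracket when it runs over raw HTML instead of extracted text.

```python
import re

# Sample markup taken from cases 1 and 2 in the question.
page = """<div class='text'>
<b>Covers</b><br/>
http://xxxyyyyzzzz.com/somefile.html
</div>
<a href="http://xxxyyyyzzzz.com/somefile.html" target="_blank">http://xxxyyyyzzzz.com/somefile.html</a>"""

# Match "http://" followed by anything that is not whitespace,
# a double quote, or an angle bracket.
URL_RE = re.compile(r'\bhttp://[^\s"<>]+')

def find_urls(html):
    """Return the distinct URLs found in html, in order of first appearance."""
    seen = []
    for url in URL_RE.findall(html):
        if url not in seen:
            seen.append(url)
    return seen

urls = find_urls(page)
print(urls)
```

Because the same address appears as plain text, as an attribute value, and as link text, the helper removes duplicates; whether that is desirable depends on the page being scraped.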

ldarrambide · 07-14-2009, 11:26 AM

The problem with example code 1 is that I can't know, before running the macro, which item number to use.

HtmlDoc d.InitFromText(s)
;str s2=d.GetHtml("div" 0)
str s2=d.GetText("div" 0) <- this can be ("div" 25 or 3458 or 1250)
;out s2
str s3
if(findrx(s2 "\bhttp:\S+" 0 1 s3)<0) ret
out s3

Is there a way to get the number of a "div" tag?
In that example, I *DO* search for text in a <div class='text'> tag.

How to find it?

Quote:HtmlDoc.d and d3 are variables of type IHTMLDocument2 and IHTMLDocument3. Both can be used to access MSHTML DOM. Documented in MSDN library.

Sorry, but it's cryptic to me; I did not even get the difference between IHTMLDocument2 and IHTMLDocument3. It's far too much for my skills.

***Gintaras*** · 07-14-2009, 11:40 AM

Macro Macro1090

str s=
;<body>
;<div>a</div>
;<div>b</div>
;<div class='text'>
;<b>Covers</b><br/>
;http://xxxyyyyzzzz.com/somefile.html <- i want that
;</div>
;<div>x</div>
;<div>y</div>
;</body>
HtmlDoc d.InitFromText(s)
ARRAY(MSHTML.IHTMLElement) a
d.GetHtmlElements(a "div")
int i
for i 0 a.len
,out "----------"
,str s2=a[i].innerText
,out s2
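
The loop above prints every div. Keeping only the divs with a given class is a matter of checking the class attribute while walking the elements. As a rough illustration of that idea, here is a sketch in Python using the standard-library `html.parser` module, not QM or MSHTML; the class name `DivTextByClass` and its attributes are hypothetical, and the comparison assumes the class attribute holds exactly one value.

```python
from html.parser import HTMLParser

class DivTextByClass(HTMLParser):
    """Collect the text inside every <div> whose class equals `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.depth = 0      # > 0 while inside a matching div (counts nested divs)
        self.parts = []     # text chunks of the div currently being read
        self.results = []   # finished text of each matching div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1             # nested div inside a matching one
        elif dict(attrs).get("class") == self.wanted:
            self.depth = 1              # entering a matching div
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:         # left the matching div
                self.results.append("".join(self.parts).strip())

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

page = """<body>
<div>a</div>
<div>b</div>
<div class='text'>
<b>Covers</b><br/>
http://xxxyyyyzzzz.com/somefile.html
</div>
<div>x</div>
</body>"""

parser = DivTextByClass("text")
parser.feed(page)
parser.close()
print(parser.results)
```

No index is needed with this approach: the div is selected by its class, wherever it appears in the page.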

ldarrambide · 07-14-2009, 11:47 AM

Yes, I tried that.

But it's too time-consuming, and I know which tags I want (<div class='text'> or <div class="image"> or <a href=).

So I'd like to search for those specific tags I need.

If that's not possible, I'll go the findrx way, which I'd like to avoid. I thought HTML classes could make the job easier.

Sorry for that.

ldarrambide · 07-14-2009, 12:47 PM

BTW,

how do I test for tags containing " with findrx? I can't find the trick.

findrx(text "href"=" 0 16 found).

That would help a lot.

***Gintaras*** · 07-14-2009, 12:57 PM

Read QM help topic "Constants".
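
The quoting problem is about how the host language writes a string constant, not about the regular expression itself. As an illustration in Python only (the QM syntax is covered in the help topic mentioned above), a pattern containing double quotes can be written inside a single-quoted raw string. The sample line is case 4 from the question; `HREF_RE` is a hypothetical name.

```python
import re

# Case 4 from the question.
snippet = '<a href="http://xxxyyyyzzzz.com/somefile" target="_top">Download</a><br>'

# Single quotes delimit the Python string, so the double quotes inside
# the pattern need no escaping. The group captures the attribute value.
HREF_RE = re.compile(r'href="([^"]*)"')

hrefs = HREF_RE.findall(snippet)
print(hrefs)
```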

ldarrambide · 07-14-2009, 02:32 PM

OK, sometimes a good idea turns out not to be one.

In fact, I switched back to my previous way of doing it, by grepping the HTML text with findrx.

I thought a dedicated class could help, but not in my case.

Sometimes, old classic ways are the way to go.

Thanks.