gin · 01-12-2014, 07:53 AM

I'm trying to write a screen scraper for various sites using Firefox. I don't use Internet Explorer because it's constantly under attack and I'm very worried about getting viruses etc.

Where can I learn more about the best practices (methods, functions, setups, etc) to use? Thanks

***Gintaras*** · 01-12-2014, 08:37 AM

Get text, or all HTML elements?
You can get page source HTML, and parse it, for example with HtmlDoc class.

gin · 01-12-2014, 09:33 AM

For now, I'm not trying to get text.

I'm trying to navigate a site to get certain links that follow a pattern, e.g. latest posts.

I've been doing it by finding a keyword, opening the context menu, pressing A (for Copy Link), pasting into Notepad, then pressing F3 to find the next match. It doesn't work in many cases because the keyword and the link are in different positions (this is why I asked about getting the caret/mouse position earlier).

Ideally, I'd use regex to grab certain URLs.

I'm unsure if there's an easier way, so I'm asking now.

Thank you.

***Gintaras*** · 01-12-2014, 09:49 AM

This gets links whose name contains "post":
Macro Macro2218

Code:
out
int w=wait(3 WV win(" - Mozilla Firefox" "Mozilla*WindowClass" "" 0x4))
Acc a.FindFF(w "#document" "" "" 0x1000 3)
ARRAY(Acc) aa
a.GetChildObjects(aa -1 "LINK" "*post*" "" 1)
int i
for i 0 aa.len
,Acc& r=aa[i]
,str name=r.Name
,str url=r.WebAttribute("href")
,out F"<>{name} <c 0x8000>{url}</c>"

gin · 01-12-2014, 11:19 AM

I couldn't get the code to work.

I modified the help example for a.GetChildObjects and got it to work, but I don't understand how the search works. What parts does it search?

Here's my code:

Code:
get all links in web page in Firefox
int w=win("Mozilla Firefox" "Mozilla*WindowClass" "" 0x804)
Acc a.Find(w "DOCUMENT" "" "" 0x3010 2)
ARRAY(Acc) c; int i
a.GetChildObjects(c -1 "LINK" "*gintaras*" "" 1)
for i 0 c.len
    out c[i].Value

Now, how would I find links that are formatted with h2 tags, or a certain color or style, etc? Is that possible? Thanks.

***Gintaras*** · 01-12-2014, 12:38 PM

Macro Macro2219

Code:
int w=win("Mozilla Firefox" "Mozilla*WindowClass" "" 0x804)
Acc a.Find(w "DOCUMENT" "" "" 0x3010 2)
str html
a.WebPageProp(0 0 html)
out html
;now findrx

gin · 01-12-2014, 09:50 PM

Sorry, why did you give that code? It seems to grab the source code for the page.

I was playing with "find accessible object, wait" dialog just now.

1. I understand how to pick certain div containers.
2. How do I use regex to grab certain text (eg links) from step 1 above?

***Gintaras*** · 01-13-2014, 07:57 PM

I gave that code because you asked "how would I find links that are formatted with h2 tags, or a certain color or style, etc?". Accessible objects cannot get these tags and styles, but you can get them from the HTML. The GetChildObjects call in the example uses wildcard characters on the link name; it can also use a regular expression on the link name or value (URL). This example uses a regex on the value to find links whose URL contains "reply":
Macro Macro2220

Code:
int w=win("Mozilla Firefox" "Mozilla*WindowClass" "" 0x804)
Acc a.Find(w "DOCUMENT" "" "" 0x3010 2)
ARRAY(Acc) c; int i
a.GetChildObjects(c -1 "LINK" "" "value=.+reply.*" 0x2000|16|8)
for i 0 c.len
,out c[i].Value

gin · 01-17-2014, 06:56 AM

I wrote some regex code to grab all the matching links from the web page's source code, but it gives me an "unknown member" problem with the value piece of the last line. Please help

Code:
;;grab source code - this part works
int w=win("Mozilla Firefox" "Mozilla*WindowClass" "" 0x804)
Acc a.Find(w "DOCUMENT" "" "" 0x3010 2)
str html
a.WebPageProp(0 0 html)

;;findrx - find links matching the regex pattern and show it
ARRAY(str) l
int i
str pattern=
;(?<=<a\ class="comment-count"\ href=").*(?="\ title="Comments\ for:\ )
findrx(html pattern 0 4 l)
for i 0 l.len
    out l[i].value

***Gintaras*** · 01-17-2014, 03:00 PM

Why ".value"? str does not have such member.

gin · 01-17-2014, 03:39 PM

I tried without value, but that does nothing.

***Gintaras*** · 01-17-2014, 03:44 PM

With findrx, always check the return value, so you know whether it found anything. With flag 4 it returns 0 if no matches are found.
Macro Macro2124

Code:
;;grab source code - this part works
int w=win("Mozilla Firefox" "Mozilla*WindowClass" "" 0x804)
Acc a.Find(w "DOCUMENT" "" "" 0x3010 2)
str html
a.WebPageProp(0 0 html)

;;findrx - find links matching the regex pattern and show it
ARRAY(str) l
int i
str pattern=
;(?<=<a\ class="comment-count"\ href=").*(?="\ title="Comments\ for:\ )
if(0=findrx(html pattern 0 4 l)) end "not found"
for i 0 l.len
,out l[0 i]

gin · 01-17-2014, 04:34 PM

Thanks. It says not found. I have a regex builder from some other software. Is my regex wrong? I made a new one and tested it: it grabbed the URL fine there, but it doesn't work in QM.

START
<a class="comment-count" href="

END
" title="Comments for

REGEX
(?<=<a\ class="comment-count"\ href=")http://thechive\.com.*/\#comments(?="\ title="Comments\ for)

***Gintaras*** · 01-17-2014, 04:57 PM

Don't know, need html string to test.

gin · 01-17-2014, 05:14 PM

Here you go,

Code:

                </ul>

                        </div>

                    <div class="clear"></div>

                </div>

              </section><!-- #carousel -->

                        <article class="post-box clearfix" id="post-678089">



            <a class="comment-count" href="http://thechive.com/2014/01/17/so-its-a-bit-windy-in-nebraska-today-video/#comments" title="Comments for: So it’s a bit windy in Nebraska today&nbsp;(Video)"><span class="dsq-postid" rel="678089 http://thechive.com/?p=678089">View Comments</span></a>



            <h2 class="post-title"><a href="http://thechive.com/2014/01/17/so-its-a-bit-windy-in-nebraska-today-video/" title="Continue Reading: So it’s a bit windy in Nebraska today&nbsp;(Video)">So it’s a bit windy in Nebraska today&nbsp;(Video)</a></h2>



            <p class="post-meta">

                        January 17, 2014 <span class="typ-spacer">|</span>

            In: <a href="http://thechive.com/category/funny_hilarious_photos_pictures/" title="View all posts in Funny" rel="category tag">Funny</a>, <a href="http://thechive.com/category/video/" title="View all posts in Video" rel="category tag">Video</a>                        </p>



                        <p class="post-author clearfix">

                <span class="avatar-sm"><img alt='' src='http://0.gravatar.com/avatar/9819cc2e8d1b7bbc8c692de980698fcc?s=50&d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D50&r=X' class='avatar avatar-50' height='50' width='50' /></span>

                Follow <a href="http://thechive.com/author/macfaulkner/" title="Posts by Mac" rel="author">Mac</a> on <a href="https://twitter.com/macfaulkner" target="_blank" class="author-link">Twitter</a>

            </p>

***Gintaras*** · 01-17-2014, 05:31 PM

It seems this construct does not work in the regular expression:
(?<=something).*

The second regex works.

You don't need (?<= etc.

Code:
str pattern=
;<a class="comment-count" href="(.*)" title="Comments for: 
if(0=findrx(html pattern 0 4 l)) end "not found"
for i 0 l.len
,out l[1 i]
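
As a rough cross-check outside QM, here is a minimal Python sketch of the two pattern styles. It assumes Python's `re` module rather than QM's findrx, so it only says something about the patterns themselves, not about how findrx handles them. The sample string is one link from the HTML posted above, with the title shortened.

```python
import re

# One link from the HTML sample posted in the thread (title shortened).
html = (
    '<a class="comment-count" '
    'href="http://thechive.com/2014/01/17/so-its-a-bit-windy-in-nebraska-today-video/#comments" '
    'title="Comments for: So it is a bit windy">View Comments</a>'
)

# Style 1: lookbehind/lookahead, so the whole match is the URL.
lookaround = r'(?<=<a class="comment-count" href=").*(?=" title="Comments for: )'

# Style 2: plain text around a capture group, so group 1 is the URL.
capture = r'<a class="comment-count" href="(.*)" title="Comments for: '

url_a = re.search(lookaround, html).group(0)
url_b = re.search(capture, html).group(1)

print(url_a)
print(url_b)
```

In this sketch both styles extract the same URL from the pasted line. If a pattern matches the pasted HTML but still finds nothing in the macro, it may be worth checking whether the string being searched is really identical to the text that was pasted.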

gin · 01-17-2014, 06:29 PM

I tried the code but it still gives an error: "Error (RT) in get all links in web page in Firefox: not found. ? "

I can't test the regex in the regex builder because it belongs to a proprietary program. That's why it produced the extra escape codes, which only work there and not in QM.

Code:
;;grab source code - this part works
int w=win("Mozilla Firefox" "Mozilla*WindowClass" "" 0x804)
Acc a.Find(w "DOCUMENT" "" "" 0x3010 2)
str html
a.WebPageProp(0 0 html)

;;findrx - find links matching the regex pattern and show it
ARRAY(str) l
int i
str pattern=
;<a class="comment-count" href="(.*)" title="Comments for:
if(0=findrx(html pattern 0 4 l)) end "not found"
for i 0 l.len
    out l[1 i]

Kevin · 01-19-2014, 07:50 PM

The problem is not in the regex; it is in how the HTML is returned by a.WebPageProp(0 0 html).
The HTML returned by a.WebPageProp(0 0 html) looks like this:

<a sl-processed="1" data-disqus-identifier="678089 http://thechive.com/?p=678089" class="comment-count" href="http://thechive.com/2014/01/17/so-its-a-bit-windy-in-nebraska-today-video/#disqus_thread" title="Comments for: So it’s a bit windy in Nebraska today (Video)">31</a>

whereas Firefox shows this:
<a class="comment-count" href="http://thechive.com/2014/01/17/so-its-a-bit-windy-in-nebraska-today-video/#comments" title="Comments for: So it’s a bit windy in Nebraska today (Video)"><span class="dsq-postid" rel="678089 http://thechive.com/?p=678089">View Comments</span></a>

So the pattern fails, because it expects class to come right after "<a ". If you use

Code:
str pattern=
;class="comment-count" href="(.*)" title="Comments for:

it works
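
The difference can be sketched outside QM with a few lines of Python. This assumes Python's `re` module instead of findrx, and the helper name `first_url` is made up for the illustration; the two strings are the two versions of the link quoted above.

```python
import re

# The link as Firefox shows it in the page source.
view_source = (
    '<a class="comment-count" '
    'href="http://thechive.com/2014/01/17/so-its-a-bit-windy-in-nebraska-today-video/#comments" '
    'title="Comments for: So it’s a bit windy in Nebraska today (Video)">'
    '<span class="dsq-postid" rel="678089 http://thechive.com/?p=678089">View Comments</span></a>'
)

# The same link as returned by a.WebPageProp(0 0 html): extra attributes come first.
live_dom = (
    '<a sl-processed="1" data-disqus-identifier="678089 http://thechive.com/?p=678089" '
    'class="comment-count" '
    'href="http://thechive.com/2014/01/17/so-its-a-bit-windy-in-nebraska-today-video/#disqus_thread" '
    'title="Comments for: So it’s a bit windy in Nebraska today (Video)">31</a>'
)

strict = r'<a class="comment-count" href="(.*)" title="Comments for:'
loose = r'class="comment-count" href="(.*)" title="Comments for:'


def first_url(pattern, html):
    """Return the first captured URL, or None when the pattern does not match."""
    match = re.search(pattern, html)
    return match.group(1) if match else None


for name, pattern in (("strict", strict), ("loose", loose)):
    print(name, "view source:", first_url(pattern, view_source))
    print(name, "live DOM:   ", first_url(pattern, live_dom))
```

In this sketch the strict pattern only matches the page-source version, because the live version has other attributes between `<a` and `class`. The loose pattern matches both, although the live version carries a different URL fragment (#disqus_thread instead of #comments).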

gin · 01-19-2014, 10:06 PM

Ahh, thanks Kevin.

Firas · 05-23-2017, 12:09 PM

Gintaras wrote: Get text, or all HTML elements?
You can get page source HTML, and parse it, for example with HtmlDoc class.

When I tried to run this macro, which I found in the help:

Code:
HtmlDoc d.InitFromWeb("http://www.quickmacros.com/index.html")
str s=d.GetText
out s
out d.GetText("title")
out d.GetText("table" 3)

I ran into this error:
Error in HtmlDoc.Delete: type mismatch.

***Gintaras*** · 05-24-2017, 08:05 AM

Try running this macro in an empty QM file. Also try restarting QM.

To create an empty file: menu File -> Open/New File, select a folder, type a file name, OK.