[solved] screen scraping with firefox - best practices?
#1
I'm trying to write a screen scraper for various sites using Firefox. I don't use Internet Explorer because it's constantly under attack and I'm worried about getting viruses, etc.

Where can I learn more about the best practices (methods, functions, setups, etc) to use? Thanks
#2
Get text, or all HTML elements?
You can get page source HTML, and parse it, for example with HtmlDoc class.
#3
For now, I'm not trying to get text.

I'm trying to navigate a site to get certain links that follow a pattern, e.g. latest posts.

I've been doing it by finding a keyword, opening the context menu, pressing a (for copy link), pasting into Notepad, then pressing F3 to find the next match. It doesn't work in many cases because the keyword and the link are in different positions on the page (this is why I asked about getting the caret/mouse position earlier).

Ideally, I'd use regex to grab certain urls.

I'm unsure if there's an easier way, so I'm asking now.

Thank you.
#4
This gets links whose name contains "post":
Macro Macro2218
Code:
out
int w=wait(3 WV win(" - Mozilla Firefox" "Mozilla*WindowClass" "" 0x4))
Acc a.FindFF(w "#document" "" "" 0x1000 3)
ARRAY(Acc) aa
a.GetChildObjects(aa -1 "LINK" "*post*" "" 1)
int i
for i 0 aa.len
,Acc& r=aa[i]
,str name=r.Name
,str url=r.WebAttribute("href")
,out F"<>{name} <c 0x8000>{url}</c>"
#5
I couldn't get the code to work.

I modified the help example for a.GetChildObjects and got it to work, but I don't understand how the search works. Which parts does it search?

Here's my code:

Code:
;get all links in web page in Firefox
int w=win("Mozilla Firefox" "Mozilla*WindowClass" "" 0x804)
Acc a.Find(w "DOCUMENT" "" "" 0x3010 2)
ARRAY(Acc) c; int i
a.GetChildObjects(c -1 "LINK" "*gintaras*" "" 1)
for i 0 c.len
    out c[i].Value

Now, how would I find links that are formatted with h2 tags, or a certain color or style, etc? Is that possible? Thanks.
#6
Macro Macro2219
Code:
int w=win("Mozilla Firefox" "Mozilla*WindowClass" "" 0x804)
Acc a.Find(w "DOCUMENT" "" "" 0x3010 2)
str html
a.WebPageProp(0 0 html)
out html
;now findrx
#7
Sorry, why did you give that code? It seems to grab the source code for the page.

I was playing with "find accessible object, wait" dialog just now.

1. I understand how to pick certain div containers.
2. How do I use regex to grab certain text (eg links) from step 1 above?
#8
I gave the code because you asked "how would I find links that are formatted with h2 tags, or a certain color or style, etc?". Accessible objects cannot get these tags and styles, but you can get them from the HTML. The GetChildObjects call in the example uses wildcard characters on the link name; it can also use a regular expression on the link name or value (URL). This example uses a regex on the value to find links where the URL contains "reply":
Macro Macro2220
Code:
int w=win("Mozilla Firefox" "Mozilla*WindowClass" "" 0x804)
Acc a.Find(w "DOCUMENT" "" "" 0x3010 2)
ARRAY(Acc) c; int i
a.GetChildObjects(c -1 "LINK" "" "value=.+reply.*" 0x2000|16|8)
for i 0 c.len
,out c[i].Value
#9
I wrote some regex code to grab all the matching links from the web page's source code, but it gives me an "unknown member" problem with the value piece of the last line. Please help


Code:
;;grab source code - this part works

int w=win("Mozilla Firefox" "Mozilla*WindowClass" "" 0x804)
Acc a.Find(w "DOCUMENT" "" "" 0x3010 2)
str html
a.WebPageProp(0 0 html)

;;findrx - find links matching the regex pattern and show it
ARRAY(str) l
int i
str pattern=
;(?<=<a\ class="comment-count"\ href=").*(?="\ title="Comments\ for:\ )

findrx(html pattern 0 4 l)

for i 0 l.len
out l[i].value
#10
Why ".value"? str does not have such member.
#11
I tried without value, but that does nothing.
#12
With findrx, always check the return value to know whether it found anything. With flag 4 it returns 0 if no matches are found.
Macro Macro2124
Code:
;;grab source code - this part works

int w=win("Mozilla Firefox" "Mozilla*WindowClass" "" 0x804)
Acc a.Find(w "DOCUMENT" "" "" 0x3010 2)
str html
a.WebPageProp(0 0 html)

;;findrx - find links matching the regex pattern and show it
ARRAY(str) l
int i
str pattern=
;(?<=<a\ class="comment-count"\ href=").*(?="\ title="Comments\ for:\ )

if(0=findrx(html pattern 0 4 l)) end "not found"

for i 0 l.len
,out l[0 i]
#13
Thanks. It says not found. I have a regex builder from some other software. Is my regex wrong? I made a new one and tested it there: it grabbed the URL fine, but it doesn't work in QM.

START
<a class="comment-count" href="

END
" title="Comments for

REGEX
(?<=<a\ class="comment-count"\ href=")http://thechive\.com.*/\#comments(?="\ title="Comments\ for)
#14
I don't know; I need the HTML string to test.
#15
Here you go,

Code:
                </ul>
                        </div>
                    <div class="clear"></div>
                </div>
              </section><!-- #carousel -->
                        <article class="post-box clearfix" id="post-678089">

            <a class="comment-count" href="http://thechive.com/2014/01/17/so-its-a-bit-windy-in-nebraska-today-video/#comments" title="Comments for: So it’s a bit windy in Nebraska today&nbsp;(Video)"><span class="dsq-postid" rel="678089 http://thechive.com/?p=678089">View Comments</span></a>

            <h2 class="post-title"><a href="http://thechive.com/2014/01/17/so-its-a-bit-windy-in-nebraska-today-video/" title="Continue Reading: So it’s a bit windy in Nebraska today&nbsp;(Video)">So it’s a bit windy in Nebraska today&nbsp;(Video)</a></h2>

            <p class="post-meta">
                        January 17, 2014 <span class="typ-spacer">|</span>
            In: <a href="http://thechive.com/category/funny_hilarious_photos_pictures/" title="View all posts in Funny" rel="category tag">Funny</a>, <a href="http://thechive.com/category/video/" title="View all posts in Video" rel="category tag">Video</a>                        </p>

                        <p class="post-author clearfix">
                <span class="avatar-sm"><img alt='' src='http://0.gravatar.com/avatar/9819cc2e8d1b7bbc8c692de980698fcc?s=50&d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D50&r=X' class='avatar avatar-50' height='50' width='50' /></span>
                Follow <a href="http://thechive.com/author/macfaulkner/" title="Posts by Mac" rel="author">Mac</a> on <a href="https://twitter.com/macfaulkner" target="_blank" class="author-link">Twitter</a>
            </p>
#16
It seems that this form of regular expression does not work:
(?<=something).*

The second regex works.

You don't need (?<= etc.; use a capturing group instead:
Code:
str pattern=
;<a class="comment-count" href="(.*)" title="Comments for:

if(0=findrx(html pattern 0 4 l)) end "not found"

for i 0 l.len
,out l[1 i]
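
The capture-group approach can also be checked outside QM. The sketch below uses Python's `re` module rather than QM's `findrx`, so it only illustrates the pattern, not QM's behavior. The `html` string is the comment-link line from the sample posted in #15, and `find_comment_urls` is a hypothetical helper name.

```python
import re

# Comment-link line copied from the sample HTML in post #15.
html = (
    '<a class="comment-count" '
    'href="http://thechive.com/2014/01/17/so-its-a-bit-windy-in-nebraska-today-video/#comments" '
    'title="Comments for: So it\u2019s a bit windy in Nebraska today&nbsp;(Video)">'
    '<span class="dsq-postid" rel="678089 http://thechive.com/?p=678089">View Comments</span></a>'
)

# Same pattern as above: group 1 captures the URL, so no lookbehind is needed.
PATTERN = re.compile(r'<a class="comment-count" href="(.*)" title="Comments for:')


def find_comment_urls(text):
    """Return the captured URL (group 1) of every match; hypothetical helper."""
    return [m.group(1) for m in PATTERN.finditer(text)]


urls = find_comment_urls(html)
if not urls:
    print("not found")  # mirrors the return-value check done with findrx
for url in urls:
    print(url)
```

Here `m.group(0)` would be the whole match and `m.group(1)` only the part in parentheses, which appears to correspond to `l[0 i]` and `l[1 i]` in the QM code above.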
#17
I tried the code but it still gives an error: "Error (RT) in get all links in web page in Firefox: not found. ? "

I can't test the regex in the regex builder because it belongs to proprietary software. That's why it had the extra odd escapes, which only work there and not in QM.

Code:
;;grab source code - this part works

int w=win("Mozilla Firefox" "Mozilla*WindowClass" "" 0x804)
Acc a.Find(w "DOCUMENT" "" "" 0x3010 2)
str html
a.WebPageProp(0 0 html)

;;findrx - find links matching the regex pattern and show it
ARRAY(str) l
int i
str pattern=
;<a class="comment-count" href="(.*)" title="Comments for:

if(0=findrx(html pattern 0 4 l)) end "not found"

for i 0 l.len
    out l[1 i]
#18
The problem is not in the regex; it is in how the HTML is returned by a.WebPageProp(0 0 html).
The HTML returned by a.WebPageProp(0 0 html) looks like this:

<a sl-processed="1" data-disqus-identifier="678089 http://thechive.com/?p=678089" class="comment-count" href="http://thechive.com/2014/01/17/so-its-a-bit-windy-in-nebraska-today-video/#disqus_thread" title="Comments for: So it’s a bit windy in Nebraska today&nbsp;(Video)">31</a>

whereas Firefox shows this:
<a class="comment-count" href="http://thechive.com/2014/01/17/so-its-a-bit-windy-in-nebraska-today-video/#comments" title="Comments for: So it’s a bit windy in Nebraska today&nbsp;(Video)"><span class="dsq-postid" rel="678089 http://thechive.com/?p=678089">View Comments</span></a>

So the pattern fails.
If you use
Code:
str pattern=
;class="comment-count" href="(.*)" title="Comments for:
then it works.
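
To see why the looser pattern matters, here is a minimal sketch that runs both patterns against the two versions of the link shown above. It uses Python's `re` module as a stand-in for `findrx`, which is an assumption: the two engines may differ in details. The names `dom_html`, `source_html`, `strict`, `loose` and `first_url` are made up for this illustration.

```python
import re

# The link as returned by a.WebPageProp(0 0 html), copied from above.
dom_html = (
    '<a sl-processed="1" data-disqus-identifier="678089 http://thechive.com/?p=678089" '
    'class="comment-count" '
    'href="http://thechive.com/2014/01/17/so-its-a-bit-windy-in-nebraska-today-video/#disqus_thread" '
    'title="Comments for: So it\u2019s a bit windy in Nebraska today&nbsp;(Video)">31</a>'
)

# The same link as Firefox shows it, copied from above.
source_html = (
    '<a class="comment-count" '
    'href="http://thechive.com/2014/01/17/so-its-a-bit-windy-in-nebraska-today-video/#comments" '
    'title="Comments for: So it\u2019s a bit windy in Nebraska today&nbsp;(Video)">'
    '<span class="dsq-postid" rel="678089 http://thechive.com/?p=678089">View Comments</span></a>'
)

# strict requires class= to follow "<a " directly; loose does not.
strict = re.compile(r'<a class="comment-count" href="(.*)" title="Comments for:')
loose = re.compile(r'class="comment-count" href="(.*)" title="Comments for:')


def first_url(pattern, text):
    """Return group 1 of the first match, or None if there is no match."""
    m = pattern.search(text)
    return m.group(1) if m else None


for label, text in (("WebPageProp", dom_html), ("Firefox", source_html)):
    print(label, "strict:", first_url(strict, text))
    print(label, "loose: ", first_url(loose, text))
```

The strict pattern finds nothing in the WebPageProp version because other attributes sit between `<a` and `class=`. Note that the URL itself also differs there (`#disqus_thread` instead of `#comments`), so a pattern that hard-codes `#comments` would fail on that HTML as well.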
#19
Ahh, thanks Kevin.
#20
Gintaras Wrote: Get text, or all HTML elements?
You can get page source HTML, and parse it, for example with HtmlDoc class.

When I tried to run this macro, which I found in the help:
Code:
HtmlDoc d.InitFromWeb("http://www.quickmacros.com/index.html")
str s=d.GetText
out s
out d.GetText("title")
out d.GetText("table" 3)

I ran into this error:
Error in HtmlDoc.Delete: type mismatch.
#21
Try running this macro in an empty QM file. Also try restarting QM.

To create empty file: menu File -> Open/New File, select a folder, type a file name, OK.

