Posts: 197
Threads: 60
Joined: Dec 2013
I'm trying to write a screen scraper for various sites using Firefox. I don't use Internet Explorer because it's constantly under attack and I'm very worried about getting viruses, etc.
Where can I learn more about the best practices (methods, functions, setups, etc.) to use? Thanks.
Posts: 12,071
Threads: 140
Joined: Dec 2002
Get text, or all HTML elements?
You can get page source HTML, and parse it, for example with HtmlDoc class.
Posts: 197
Threads: 60
Joined: Dec 2013
For now, I'm not trying to get text.
I'm trying to navigate a site to get certain links that follow a pattern, e.g. latest posts.
I've been doing it by finding a keyword, opening the context menu, pressing a (for copy link), pasting into Notepad, then pressing F3 to find the next match. It doesn't work in many cases because the keyword and the link are not always in the same position relative to each other (this is why I asked about getting the caret/mouse position earlier).
Ideally, I'd use regex to grab certain URLs.
I'm unsure if there's an easier way, so I'm asking now.
Thank you.
Posts: 12,071
Threads: 140
Joined: Dec 2002
Gets links whose name contains "post":
Macro Macro2218
out
int w=wait(3 WV win(" - Mozilla Firefox" "Mozilla*WindowClass" "" 0x4))
Acc a.FindFF(w "#document" "" "" 0x1000 3)
ARRAY(Acc) aa
a.GetChildObjects(aa -1 "LINK" "*post*" "" 1)
int i
for i 0 aa.len
,Acc& r=aa[i]
,str name=r.Name
,str url=r.WebAttribute("href")
,out F"<>{name} <c 0x8000>{url}</c>"
Posts: 197
Threads: 60
Joined: Dec 2013
I couldn't get the code to work.
I modded the help example for a.GetChildObjects and got it to work, but I don't understand how the search works. What parts does it search?
Here's my code:
get all links in web page in Firefox
int w=win("Mozilla Firefox" "Mozilla*WindowClass" "" 0x804)
Acc a.Find(w "DOCUMENT" "" "" 0x3010 2)
ARRAY(Acc) c; int i
a.GetChildObjects(c -1 "LINK" "*gintaras*" "" 1)
for i 0 c.len
,out c[i].Value
Now, how would I find links that are formatted with h2 tags, or a certain color or style, etc? Is that possible? Thanks.
Posts: 12,071
Threads: 140
Joined: Dec 2002
Macro Macro2219
int w=win("Mozilla Firefox" "Mozilla*WindowClass" "" 0x804)
Acc a.Find(w "DOCUMENT" "" "" 0x3010 2)
str html
a.WebPageProp(0 0 html)
out html
;now findrx
Posts: 197
Threads: 60
Joined: Dec 2013
Sorry, why did you give that code? It seems to grab the source code for the page.
I was playing with "find accessible object, wait" dialog just now.
1. I understand how to pick certain div containers.
2. How do I use regex to grab certain text (e.g. links) from step 1 above?
Posts: 12,071
Threads: 140
Joined: Dec 2002
I gave that code because you asked "how would I find links that are formatted with h2 tags, or a certain color or style, etc?". Accessible objects cannot give you these tags and styles, but you can get them from the HTML. The GetChildObjects call in the example uses wildcard characters on the link name; it can also use a regular expression on the link name or value (URL). This example uses a regex on the value to find links whose URL contains "reply":
Macro Macro2220
int w=win("Mozilla Firefox" "Mozilla*WindowClass" "" 0x804)
Acc a.Find(w "DOCUMENT" "" "" 0x3010 2)
ARRAY(Acc) c; int i
a.GetChildObjects(c -1 "LINK" "" "value=.+reply.*" 0x2000|16|8)
for i 0 c.len
,out c[i].Value
Posts: 197
Threads: 60
Joined: Dec 2013
I wrote some regex code to grab all the matching links from the web page's source code, but it gives me an "unknown member" problem with the value piece of the last line. Please help
;;grab source code - this part works
int w=win("Mozilla Firefox" "Mozilla*WindowClass" "" 0x804)
Acc a.Find(w "DOCUMENT" "" "" 0x3010 2)
str html
a.WebPageProp(0 0 html)
;;findrx - find links matching the regex pattern and show it
ARRAY(str) l
int i
str pattern=
;(?<=<a\ class="comment-count"\ href=").*(?="\ title="Comments\ for:\ )
findrx(html pattern 0 4 l)
for i 0 l.len
out l[i].value
Posts: 12,071
Threads: 140
Joined: Dec 2002
Why ".value"? str does not have such a member.
Posts: 197
Threads: 60
Joined: Dec 2013
I tried without value, but that does nothing.
Posts: 12,071
Threads: 140
Joined: Dec 2002
With findrx, always check the return value to know whether it found anything. With flag 4 it returns 0 if no matches are found.
Macro Macro2124
;;grab source code - this part works
int w=win("Mozilla Firefox" "Mozilla*WindowClass" "" 0x804)
Acc a.Find(w "DOCUMENT" "" "" 0x3010 2)
str html
a.WebPageProp(0 0 html)
;;findrx - find links matching the regex pattern and show it
ARRAY(str) l
int i
str pattern=
;(?<=<a\ class="comment-count"\ href=").*(?="\ title="Comments\ for:\ )
if(0=findrx(html pattern 0 4 l)) end "not found"
for i 0 l.len
,out l[0 i]
Posts: 197
Threads: 60
Joined: Dec 2013
Thanks. It says not found. I have a regex builder from some other software. Is my regex wrong? I made a new one and tested it there: it grabbed the URL fine, but it doesn't work in QM.
START
<a class="comment-count" href="
END
" title="Comments for
REGEX
(?<=<a\ class="comment-count"\ href=") http://thechive\.com.*/\#comments(?="\ title="Comments\ for)
Posts: 12,071
Threads: 140
Joined: Dec 2002
I don't know; I need the HTML string to test.
Posts: 197
Threads: 60
Joined: Dec 2013
Here you go,
</ul>
</div>
<div class="clear"></div>
</div>
</section><!-- #carousel -->
<article class="post-box clearfix" id="post-678089">
<a class="comment-count" href="http://thechive.com/2014/01/17/so-its-a-bit-windy-in-nebraska-today-video/#comments" title="Comments for: So it’s a bit windy in Nebraska today (Video)"><span class="dsq-postid" rel="678089 http://thechive.com/?p=678089">View Comments</span></a>
<h2 class="post-title"><a href="http://thechive.com/2014/01/17/so-its-a-bit-windy-in-nebraska-today-video/" title="Continue Reading: So it’s a bit windy in Nebraska today (Video)">So it’s a bit windy in Nebraska today (Video)</a></h2>
<p class="post-meta">
January 17, 2014 <span class="typ-spacer">|</span>
In: <a href="http://thechive.com/category/funny_hilarious_photos_pictures/" title="View all posts in Funny" rel="category tag">Funny</a>, <a href="http://thechive.com/category/video/" title="View all posts in Video" rel="category tag">Video</a> </p>
<p class="post-author clearfix">
<span class="avatar-sm"><img alt='' src='http://0.gravatar.com/avatar/9819cc2e8d1b7bbc8c692de980698fcc?s=50&d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D50&r=X' class='avatar avatar-50' height='50' width='50' /></span>
Follow <a href="http://thechive.com/author/macfaulkner/" title="Posts by Mac" rel="author">Mac</a> on <a href="https://twitter.com/macfaulkner" target="_blank" class="author-link">Twitter</a>
</p>
Posts: 12,071
Threads: 140
Joined: Dec 2002
It seems this does not work in a regular expression:
(?<=something).*
The second regex works.
You don't need (?<= etc.
str pattern=
;<a class="comment-count" href="(.*)" title="Comments for:
if(0=findrx(html pattern 0 4 l)) end "not found"
for i 0 l.len
,out l[1 i]
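For reference, a small self-contained sketch of how the result array is indexed with flag 4, as far as can be inferred from the two macros above: l[0 i] holds the whole match and l[1 i] holds the first submatch. The html string here is hypothetical sample data (an example.com URL), not the real page.

```
;;hypothetical sample data: one link in the same format as the page source
str html=
;<a class="comment-count" href="http://example.com/post-1/#comments" title="Comments for: Post 1">1</a>
ARRAY(str) l
int i
;;one capture group around the URL
str pattern=
;<a class="comment-count" href="(.*)" title="Comments for:
if(0=findrx(html pattern 0 4 l)) end "not found"
;;for each match, show the whole match and then only the captured URL
for i 0 l.len
,out l[0 i]
,out l[1 i]
```

With this sample, the second line of output should be just the URL, which is the part wanted here.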
Posts: 197
Threads: 60
Joined: Dec 2013
I tried the code but it still gives an error: "Error (RT) in get all links in web page in Firefox: not found. ? "
I can't test the regex in the regex builder because the builder belongs to a proprietary program. That is why it had the extra weird codes, which only work for that program and not for QM.
;;grab source code - this part works
int w=win("Mozilla Firefox" "Mozilla*WindowClass" "" 0x804)
Acc a.Find(w "DOCUMENT" "" "" 0x3010 2)
str html
a.WebPageProp(0 0 html)
;;findrx - find links matching the regex pattern and show it
ARRAY(str) l
int i
str pattern=
;<a class="comment-count" href="(.*)" title="Comments for:
if(0=findrx(html pattern 0 4 l)) end "not found"
for i 0 l.len
,out l[1 i]
Posts: 1,336
Threads: 61
Joined: Jul 2006
The problem is not in the regex; it is in how the HTML is returned by a.WebPageProp(0 0 html).
The HTML from a.WebPageProp(0 0 html) looks like this:
<a sl-processed="1" data-disqus-identifier="678089 http://thechive.com/?p=678089" class="comment-count" href="http://thechive.com/2014/01/17/so-its-a-bit-windy-in-nebraska-today-video/#disqus_thread" title="Comments for: So it’s a bit windy in Nebraska today (Video)">31</a>
whereas Firefox shows this:
<a class="comment-count" href="http://thechive.com/2014/01/17/so-its-a-bit-windy-in-nebraska-today-video/#comments" title="Comments for: So it’s a bit windy in Nebraska today (Video)"><span class="dsq-postid" rel="678089 http://thechive.com/?p=678089">View Comments</span></a>
The returned HTML has extra attributes between <a and class=, so a pattern that starts with <a class= fails. The href also ends in #disqus_thread instead of #comments.
If you use
str pattern=
;class="comment-count" href="(.*)" title="Comments for:
it works.
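A possible refinement, sketched under assumptions: if two such links ever share one line, the greedy (.*) could run on to the last title= on that line and capture both links at once. Matching only non-quote characters inside href avoids that. The html string below is made-up sample data (example.com URLs, a hypothetical data-x attribute, shortened titles), not the real page; the findrx call and the array indexing are copied from the macros above.

```
;;hypothetical sample data: two links on one line, with an extra attribute before class=
str html=
;<a data-x="1" class="comment-count" href="http://example.com/post-1/#disqus_thread" title="Comments for: Post 1">3</a> <a data-x="2" class="comment-count" href="http://example.com/post-2/#disqus_thread" title="Comments for: Post 2">5</a>
ARRAY(str) l
int i
;;[^"]* cannot cross the closing quote of href, so each link is matched separately
str pattern=
;class="comment-count" href="([^"]*)" title="Comments for:
if(0=findrx(html pattern 0 4 l)) end "not found"
for i 0 l.len
,out l[1 i]
```

With the sample data this should output the two href values on separate lines. Because the pattern no longer starts with <a, it also tolerates whatever attributes the browser adds before class=.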
Posts: 28
Threads: 7
Joined: Apr 2016
Gintaras wrote: Get text, or all HTML elements?
You can get page source HTML, and parse it, for example with HtmlDoc class.
When I tried to execute this macro, which I found in the help:
HtmlDoc d.InitFromWeb("http://www.quickmacros.com/index.html")
str s=d.GetText
out s
out d.GetText("title")
out d.GetText("table" 3)
I ran into this error:
Error in HtmlDoc.Delete: type mismatch.
Posts: 12,071
Threads: 140
Joined: Dec 2002
Try running this macro in an empty QM file. Also try restarting QM.
To create an empty file: menu File -> Open/New File, select a folder, type a file name, OK.