html Extraction
#1
Hello QMers,

I have searched the forum for a topic on setting up a web scrape with Quick Macros. The page I need is behind a login page, so none of the typical scrapers out there work. In other words, I need to build a scraper that works through the browser GUI.

The data I want to scrape comes as multiple rows in the same format. I would like to write one scraper for a complete row, put it in a loop to grab the remaining rows on the page, and then move on to the next page.

I have already used the html element actions to create the scraper, but the elements change from one page to the next. I am not sure how to approach this one, and I would greatly appreciate some help.

Let me know what other information I can provide to help you understand what I am trying to do.

Thanks,

Paul
#2
Alright, apparently I am not speaking the right language to get some ideas flowing here. After further searching the forum, I realized that the proper term for what I am trying to do is html extraction. I have put together the code below from some of the other posts I have read, and I am now able to get just the text out of the webpage. However, I need to write that text out in csv format, and I am at a loss as to how. The main problem is knowing which piece of text is which, so that I can put it in the correct column of the csv file.

Code:
int w=wait(3 WV win("List Details - Windows Internet Explorer" "IEFrame"))
Acc a1.Find(w "PANE" "List Details" "" 0x3001 3)
str html
a1.WebPageProp(0 0 html)


HtmlDoc d.InitFromText(html)
ARRAY(MSHTML.IHTMLElement) a
d.GetHtmlElements(a "")
int i
for i 0 a.len
    out "----------"
    str s2=a[i].innerText
    out s2

I have also posted an excerpt of the html I am trying to extract from. This excerpt corresponds to one row of the csv file, and there are 20 more blocks of html just like it on the page that I would like to extract. Any help capturing these individual pieces of information would be huge. Below the excerpt you can also see how I would like to format the csv file.

Code:
<div class="search-result-container contact-result row-fluid"><div class="span12">
    <div class="item-actions-container">
        <div class="actions-row long-line">
<div class="actions-container inline-block" style="width: 80px;">
    <div class="touch-button-container inline-block pull-right">
        <div title="Pin" class="pin-this"></div>
    </div>
    <div class="touch-button-container inline-block pull-right">
        
    </div>
    <div class="touch-button-container touch-right-divider inline-block pull-right">
        <div title="Quick View" class="quick-view"></div>
        <div class="right-divider"></div>
    </div>
</div><div class="social-row">

    <div class="search-result-google search-result-social  pull-right">
        
        <a href="https://plus.google.com/s/Alex%20Abadi" target="_blank"></a>
    </div>

    <div class="search-result-facebook search-result-social  pull-right">
        
        <a href="https://www.facebook.com/search/more/?q=Alex%20Abadi" target="_blank"></a>
    </div>

    <div class="search-result-twitter search-result-social  pull-right">
        
        <a href="https://twitter.com/search?q=Alex%20Abadi&amp;mode=users" target="_blank"></a>
    </div>

    <div class="search-result-linkedin search-result-social  pull-right">
        
        <a href="http://www.linkedin.com/vsearch/f?keywords=Alex+Abadi" target="_blank"></a>
    </div>

    <!--<div class="search-result-companyURL inline-block">-->
    <div class="search-result-url search-result-social  pull-right">
        <a href="http://www.imagemicrosystems.com" target="_blank"></a>
    </div>

</div><div class="connection-meter list-only pull-left">
  <!--<div class="left-side"></div>-->
  <!--<div class="middle"></div>-->
  <!--<div class="right-side"></div>-->
</div>
        </div>
    </div>
    <div class="logo-container">
        <div class="selected-status  pull-left"></div>
        <input class="pull-left" type="checkbox" name="searchResults-10611e14-c5b5-3cac-9679-7b69997eb75d" id="10611e14-c5b5-3cac-9679-7b69997eb75d" data-primitive-type="contact">
        <div class="image-wrapper">
            <!--<div class="p-meter-wrapper"><i class="icon p-meter list-only" ></i></div>-->
            <div class="search-result-icon contact-icon"></div>
            <div class="favicon-container">
            </div>
        </div>
        <i class="icon ideal-prospect-img list-only"></i>

        <div class="ideal-prospect-val list-only">
            0
        </div>
    </div>
    <div class="detail-container">
        <div class="name-row">
            <a href="/contact/10611e14-c5b5-3cac-9679-7b69997eb75d">Alex  Abadi</a>
        </div>
        <div class="search-result-subheadline">
            <span class="large-black-text">Chief Executive Officer at </span>
            <span class="contact-company-name"><a href="/company/d0a95324-611b-36b7-8a5b-b753ab957e36" class="clickable">Image Microsystems, Inc.</a></span>
        </div>
        <div class="compact-section">
            <div class="location">Austin,
                Texas,
                United States
                <div class="contact-industry">Computer and Peripheral Equipment Manufacturing</div>
            </div>

            <div class="compact-section">
                  <div class="small-data-label">Main:</div>
                  <div class="inline-block black-text"><span id="gc-number-24" class="gc-cs-link" title="Call with Google Voice">512-623-5621</span></div>
                  <div>
                      <div class="small-data-label">Direct:</div>
                      <div class="inline-block black-text"><span id="gc-number-25" class="gc-cs-link" title="Call with Google Voice">512-623-5642</span></div>
                  </div>
                <div>
                    <div class="small-data-label">Email:</div>
                    <a class="black-text" href="mailto:alex_abadi@imagemicrosystems.com">alex_abadi@imagemicrosystems.com</a>
                </div>
            </div>

            <div class="">
            </div>
        </div>
    </div>
<div class="right-wrapper">
    <div class="stick-bottom pull-right">
        <div class="notification-container list-only">
            <a class="trigger-wrapper pull-right hidden" href="/contact/10611e14-c5b5-3cac-9679-7b69997eb75d?report=company_triggers">
                <span class="trigger-count pull-right"></span>
                <div class="trigger-icon-color pull-right"></div>
            </a>
  <div class="notes-wrapper dropdown text-right">
      <a class="notes dropdown-toggle" data-toggle="dropdown" role="button" data-target="dropdown" data-item-id="10611e14-c5b5-3cac-9679-7b69997eb75d">
      Notes <span class="note-count"></span></a>
    <div class="dropdown-menu text-left">
      <form class="noteEditForm">
        <div class="helptext noteActionLabel">Add a New Note:</div>
        <input type="text" name="label" class="noteLabel" placeholder="Title">
        <textarea name="messageBody" class="noteBody" placeholder="Body"></textarea>
        <input type="hidden" name="entityId" class="entityId" value="10611e14-c5b5-3cac-9679-7b69997eb75d">
        <input type="hidden" name="entityType" class="entityType" value="contact">
        <input type="hidden" name="id" class="noteId">
        <div class="button-wrapper pull-right">
          <a class="cancelNoteButton cancel-link" data-dismiss="dropdown" aria-hidden="true">Cancel</a>
          <input type="submit" class="saveNoteButton btn btn-blue-small" value="Save">
        </div>
        <div class="clearfix"></div>
      </form>

      <div class="existing-notes hide">
        <div class="helptext">Open an Existing Note:</div>
        <ul>
        </ul>
      </div>

    </div>
  </div>
        </div>
        <div class="crm-status" data-id="10611e14-c5b5-3cac-9679-7b69997eb75d">
        </div>
        <div class="list-add-date text-right pull-right list-only">Added 6-Jan-2016</div>
    </div>
</div></div></div>

CSV file example. The file needs to be tab-separated because the data coming out of the html contains commas that I don't want treated as separators.

Code:
Name    Title    Company    City/State    Industry    Main Phone    Direct Phone    Email    Added
Alex Abadi    Chief Executive Officer at    Image Microsystems, Inc.    Austin, Texas, United States    Computer and Peripheral Equipment Manufacturing    512-623-5621    512-623-5642    alex_abadi@imagemicrosystems.com    Added 6-Jan-2016
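
Editor's sketch: the tab-separated layout above can be illustrated as follows. This is Python, not QM code, and is only meant to show the idea. The column names are copied from the example header, while `rows_to_tsv` and the dictionary layout are hypothetical names chosen for this illustration.

```python
import csv
import io

# Column order copied from the example header in the post.
COLUMNS = ["Name", "Title", "Company", "City/State", "Industry",
           "Main Phone", "Direct Phone", "Email", "Added"]


def rows_to_tsv(rows):
    """Turn a list of dicts (hypothetical layout: column name -> text) into tab-separated text."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=COLUMNS,
        delimiter="\t",        # tabs separate the columns, so commas in the data are harmless
        restval="",            # a contact with no "Direct Phone" gets an empty cell
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Values taken from the example row in the post.
example = {
    "Name": "Alex Abadi",
    "Title": "Chief Executive Officer at",
    "Company": "Image Microsystems, Inc.",
    "City/State": "Austin, Texas, United States",
    "Industry": "Computer and Peripheral Equipment Manufacturing",
    "Main Phone": "512-623-5621",
    "Direct Phone": "512-623-5642",
    "Email": "alex_abadi@imagemicrosystems.com",
    "Added": "Added 6-Jan-2016",
}

print(rows_to_tsv([example]))
```

Keeping each row as a mapping from column name to text means a missing field (for example a contact without a direct phone number) still lands in the right column as an empty cell.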

Any help identifying the particular html elements to pull out would be great. Currently I can only dump all of the text into a text file, which isn't helpful for the project I am working on.

Really appreciate any help you give me.

Best Regards,

Paul
#3
I have made it further after finding the following code on the forum. However, it still doesn't fire on all cylinders for me because I'm missing the industry, the city and state, and the date the contact was added, all of which I need from my extraction.

Gintaras, I sure would appreciate it if you could help me figure out the last piece of the puzzle here. I was trying to use .className to identify the "location" and "industry" classes, but for some reason the for loop doesn't let me use a sel case to capture this data separately. The final piece of the puzzle is getting the data into the columns and rows of a tab-delimited csv file. Any help you could provide with this would be great too.

Code:
str s=
<BODY>
<div class="detail-container">
        <div class="name-row">
            <a href="/contact/b070f5e9-30d7-3da5-bc39-780c3455b71e">Mitch  Acker</a>
        </div>
        <div class="search-result-subheadline">
            <span class="large-black-text">President, Sales Executive at </span>
            <span class="contact-company-name"><a href="/company/66819229-e58e-36e8-a282-c11f68eb2453" class="clickable">Martinaire Inc</a></span>
        </div>
        <div class="compact-section">
            <div class="location">Addison,
                Texas,
                United States
                <div class="contact-industry">Airlines</div>
            </div>
            <div class="compact-section">
                  <div class="small-data-label">Main:</div>
                  <div class="inline-block black-text"><span id="gc-number-20" class="gc-cs-link" title="Call with Google Voice">972-349-5700</span></div>
                <div>
                    <div class="small-data-label">Email:</div>
                    <a class="black-text" href="mailto:macker@martinaire.com">macker@martinaire.com</a>
                </div>
            </div>
            <div class="">
            </div>
        </div>
    </div>
<div class="detail-container">
        <div class="name-row">
            <a href="/contact/10611e14-c5b5-3cac-9679-7b69997eb75d">Alex  Abadi</a>
        </div>
        <div class="search-result-subheadline">
            <span class="large-black-text">Chief Executive Officer at </span>
            <span class="contact-company-name"><a href="/company/d0a95324-611b-36b7-8a5b-b753ab957e36" class="clickable">Image Microsystems, Inc.</a></span>
        </div>
        <div class="compact-section">
            <div class="location">Austin,
                Texas,
                United States
                <div class="contact-industry">Computer and Peripheral Equipment Manufacturing</div>
            </div>
            <div class="compact-section">
                  <div class="small-data-label">Main:</div>
                  <div class="inline-block black-text"><span id="gc-number-24" class="gc-cs-link" title="Call with Google Voice">512-623-5621</span></div>
                  <div>
                      <div class="small-data-label">Direct:</div>
                      <div class="inline-block black-text"><span id="gc-number-25" class="gc-cs-link" title="Call with Google Voice">512-623-5642</span></div>
                  </div>
                <div>
                    <div class="small-data-label">Email:</div>
                    <a class="black-text" href="mailto:alex_abadi@imagemicrosystems.com">alex_abadi@imagemicrosystems.com</a>
                </div>
            </div>
            <div class="">
            </div>
        </div>
    </div>
</BODY>

out
s.findreplace("span" "a")
HtmlDoc d.InitFromText(s)
ARRAY(MSHTML.IHTMLElement) h2 div
int i j
d.GetHtmlElements(div "div")
for i 0 div.len
    str cn=div[i].className
    if cn="detail-container"
        d.GetHtmlElements(h2 "a" "" div[i].sourceIndex)
        for j 0 h2.len
            out h2[j].innerText
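
Editor's sketch: one possible way to separate the "location" text from the "contact-industry" div nested inside it is to credit each piece of text to the innermost enclosing element whose class is of interest. The following is Python, not QM code, and only illustrates that idea on a shortened copy of the HTML above. `DetailParser`, `extract_rows` and the column names are hypothetical, it assumes every non-void tag is properly closed, and it does not handle the phone numbers, the email, or the "Added" date (in the earlier excerpt that date sits in a "list-add-date" div outside "detail-container").

```python
from html.parser import HTMLParser

# Class names come from the HTML in this thread; the column names are illustrative.
FIELD_CLASSES = {
    "name-row": "Name",
    "large-black-text": "Title",
    "contact-company-name": "Company",
    "location": "City/State",
    "contact-industry": "Industry",
}
# Elements that never have a closing tag, so they are kept off the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DetailParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []   # one dict per "detail-container" div
        self.stack = []  # one entry per open element: a column name or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if "detail-container" in classes:
            self.rows.append({})  # a new contact starts here
        field = next((FIELD_CLASSES[c] for c in classes if c in FIELD_CLASSES), None)
        self.stack.append(field)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # The innermost enclosing element with a known class wins, so text
        # inside "contact-industry" is not also added to "location".
        field = next((f for f in reversed(self.stack) if f), None)
        if field and self.rows:
            row = self.rows[-1]
            row[field] = row.get(field, "") + " " + data


def extract_rows(html):
    parser = DetailParser()
    parser.feed(html)
    parser.close()
    # Collapse line breaks and repeated spaces inside each value.
    return [{k: " ".join(v.split()) for k, v in row.items()} for row in parser.rows]


SAMPLE = """
<div class="detail-container">
  <div class="name-row"><a href="/contact/b070f5e9-30d7-3da5-bc39-780c3455b71e">Mitch  Acker</a></div>
  <div class="search-result-subheadline">
    <span class="large-black-text">President, Sales Executive at </span>
    <span class="contact-company-name"><a class="clickable">Martinaire Inc</a></span>
  </div>
  <div class="compact-section">
    <div class="location">Addison,
        Texas,
        United States
        <div class="contact-industry">Airlines</div>
    </div>
  </div>
</div>
"""

for row in extract_rows(SAMPLE):
    print(row)
```

The same "innermost class wins" rule could in principle be applied in QM by walking up from each text-bearing element to its nearest parent with a known className, but that is untested here.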


Thanks Again,

Paul

