Tutorial 2
In our second tutorial we will describe the process of extracting information from the Demo Store web site. It will be a more difficult extraction, a so-called deep-level extraction, where the data lies "in the depth" of the web site. We will configure the agent to visit all the product pages, scrape the data including the product name, price, description, and SKU, and save it to an Excel file.
1 Creating a New Project
We create a new project for each new web site we want to extract data from. It is also possible to create several extraction agents inside one project, but this is less convenient. Click the New Project button in the Tool Bar or use the File > New Project menu.
Enter a project name in the dialog window and click the Finish button. The new project will appear in the Workspace view in the upper right corner of the window.
1.1 Starting Page Navigation
Navigate to the starting page from which the agent will start working. To do this, enter the URL of the starting page into the Navigation Bar. The target web site URL is http://www.websundew.com/demo/. Type this address and press Enter (or click the corresponding button).
Wait until the navigation process is over. Now we are ready to start creating the Agent that will capture product information from the Demo Store.
1.2 Creating a New Agent
Click the Agent button in the Tool Bar or use the File > New Agent menu.
You will see the Agent Configuration Wizard. Select the start-up mode. In our case we have only one URL, so choose the first option, Single URL.
Click Next. Type the Agent's name and click the Finish button. You will see the Agent Editor.
1.3 Configuring the Agent
In the left-hand corner of the Agent Editor there is an Agent Diagram. This diagram shows the Agent's states. Init State is the initial state from which the Agent starts working; this state loads the initial web page. The Page 1 state reflects the loaded page. To the right there is a Browser Window linked to the state selected in the Diagram Editor.
1.4 Second Level Navigation
To collect the data from all of the detail pages we need to visit each of them. We will create a loop that iterates over the product links and clicks each of them.
Click the Deep Crawl button in the Tool Bar.
Select Data Iterator Pattern in the dialog window that appears and click the Finish button. The Iterator Pattern Wizard will appear on the left-hand side of the window. Click the first product link in the browser; it will be highlighted in light blue. Click the Add button in the pattern wizard.
Click the Find button and wait until the program completes looking for the patterns. Select the proper result. All the links will be highlighted in blue. Click Next at the bottom of the wizard.
Enter a pattern name and click the Finish button. A Loop statement (that iterates over all of the links) will be added to the current state.
Inside the Loop you can find a Click statement that leads to a new state. Click the new state; the browser window will show the product details web page.
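If you are curious what the generated Loop and Click statements amount to, here is a minimal Python sketch of the same idea using the requests and BeautifulSoup libraries (which are not part of WebSundew); the a.product-link selector is a placeholder assumption, not the Demo Store's actual markup.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

START_URL = "http://www.websundew.com/demo/"

listing = requests.get(START_URL)
soup = BeautifulSoup(listing.text, "html.parser")

# Loop over every product link on the listing page (selector is a placeholder).
for link in soup.select("a.product-link"):
    detail_url = urljoin(START_URL, link["href"])
    detail_page = requests.get(detail_url)  # the "Click": load the detail page
    # ... capture the product fields here (see the next section)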
1.5 Capturing Data from the Product Detail Page
Click Capture in the Tool Bar. Select Simple Data Pattern and click the Finish button. A Simple Details Wizard will appear on the left-hand side.
Click the product name. It will be highlighted in light blue. Click Add in the pattern wizard and a new field will be added. You can change the field name by clicking on it. Repeat the action for the other fields: price, model, and product SKU.
Click the Next button at the bottom of the pattern wizard. Type the pattern name and click Finish.
The Capture Block statement will be added to the current state. You can see the captured data in the Preview view.
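Conceptually, the Simple Data Pattern behaves like the small extraction function sketched below; all four selectors are hypothetical stand-ins, since WebSundew derives the real patterns from the elements you clicked.

from bs4 import BeautifulSoup

def capture_product(html):
    """Extract the fields named in the capture pattern from one detail page."""
    soup = BeautifulSoup(html, "html.parser")
    # All four selectors are illustrative assumptions about the markup.
    return {
        "name": soup.select_one("h1.product-name").get_text(strip=True),
        "price": soup.select_one("span.price").get_text(strip=True),
        "model": soup.select_one("span.model").get_text(strip=True),
        "sku": soup.select_one("span.sku").get_text(strip=True),
    }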
1.6 Capturing Data from Linked Pages
The web page associated with the Page 1 state contains linked pages, and we need to visit all of them. For that purpose we can use a Paginator.
Click the Paginator button in the Tool Bar to create a Paginator, which will enable the Agent to visit all the linked pages and extract data from each of them. Select Simple Next Page Pattern in the paginator dialog window.
Click the Finish button. The Simple Next Page Wizard will appear on the left-hand side. Click the Next page link in the browser window.
Click the Next button in the Paginator Wizard. Enter the name of the pattern and click the Finish button. The Paginator statement will be added to the current state.
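In code terms, a Paginator is just a loop that keeps following the Next link until there is none, roughly as in this sketch (the link text "Next" is an assumption about the page's markup):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "http://www.websundew.com/demo/"
while url:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    # ... iterate over the product links on this page, as sketched earlier ...
    next_link = soup.find("a", string="Next")  # the "Next page" link pattern
    url = urljoin(url, next_link["href"]) if next_link else None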
1.7 Saving Data
Click Datasource in the Tool Bar to create a Datasource. The Datasource Wizard will appear.
Select the format you want; in our case it will be Excel, so select Excel.
Click the Next button. Select the Agent and mark the fields you want to save. Click Next if you want to use the default settings. Enter a Datasource name and click the Finish button.
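The Datasource step corresponds to writing the captured rows into a spreadsheet. Here is a minimal sketch using the openpyxl library; the field names reuse the ones captured earlier, the sample row is made up, and the output path is arbitrary.

from openpyxl import Workbook

# One dict per product, as returned by capture_product() above; the sample
# row is invented for illustration.
rows = [
    {"name": "Sample Product", "price": "$9.99", "model": "SP-1", "sku": "0001"},
]

wb = Workbook()
ws = wb.active
ws.append(["name", "price", "model", "sku"])  # header row
for row in rows:
    ws.append([row["name"], row["price"], row["model"], row["sku"]])
wb.save("demo_store.xlsx")  # the output path is arbitrary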
We have created the Data Extraction Agent. Now we can use it to extract and save the data.
1.8 Running the Agent
Click the Run button in the Tool Bar. Wait until the Agent completes its work. A dialog window will appear.
You can see the results of the agent's work and the path to the saved file. Select the file name and click Open to view the result.