Tutorial 2

In this second tutorial we describe how to extract information from the Demo Store web site. This is a more difficult, so-called deep-level extraction, where the data sits "in the depth" of the web site, on pages linked from the starting page. We will configure the agent to visit all the product pages, scrape data such as the product name, price, description and SKU, and save it to an Excel file.

1 Creating New Project

We create a new project for each web site we want to extract data from. It is also possible to create several extraction agents inside one project, but this is less convenient. Click the New Project button in the Tool Bar or choose File > New Project from the menu.

Enter a project name in the dialog window and click the Finish button. The new project will appear in the Workspace view in the upper right corner of the window.

1.1 Starting Page Navigation

Navigate to the page from which the agent will start working. To do this, type the URL of the starting page into the Navigation Bar. The target web site URL is http://www.websundew.com/demo/.

Press Enter or click the button. Wait until the navigation process is over. Now we are ready to start creating the Agent that will capture product information from the Demo Store.

1.2 Creating New Agent

Click the Agent button in the Tool Bar or choose File > New Agent from the menu.

You will see the Agent Configuration Wizard. Select the Start Up mode. In our case we have only one URL, so choose the first option, Single URL.

Click Next. Type the Agent's name and click the Finish button. You will see the Agent Editor.

1.3 Configuring the Agent

In the left-hand corner of the Agent Editor there is the Agent Diagram. This diagram shows the Agent's states. Init State is the initial state from which the Agent starts working; it loads the initial web page. Page 1 State reflects the loaded page. To the right there is a Browser Window linked to the state selected in the Diagram Editor.

1.4 Second Level Navigation

To collect the data from all of the detail pages we need to visit each of them. To do so, we create a loop that iterates over the links and clicks each one.

Click the Deep Crawl button in the Tool Bar.

Select Data Iterator Pattern in the dialog window that appears and click the Finish button. The Iterator Pattern Wizard will appear on the left-hand side of the window. Click the first link in the browser; it will be highlighted in light blue. Click the Add button in the pattern wizard.

Click the Find button and wait until the program finishes looking for patterns. Select the proper result; all the links will be highlighted in blue. Click Next at the bottom of the wizard.

Enter a pattern name and click the Finish button. A Loop statement (which iterates over all of the links) will be added to the current state. Inside the Loop you can find a Click statement that leads to a new state. Click the new state. The browser window will show the product details web page.
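To make the idea behind the Loop and Click statements more concrete, the following is a minimal Python sketch of what "iterate over all matching links" amounts to. It is not how WebSundew is implemented. The sample markup, the `product-link` class name and the helper names are assumptions made for illustration; the real Demo Store HTML may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags that carry a given CSS class."""

    def __init__(self, base_url, css_class):
        super().__init__()
        self.base_url = base_url
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if href and self.css_class in classes:
            # Resolve relative links against the listing page URL.
            self.links.append(urljoin(self.base_url, href))


def collect_product_links(html, base_url, css_class="product-link"):
    """Return absolute URLs of all links matching the (assumed) pattern."""
    parser = LinkCollector(base_url, css_class)
    parser.feed(html)
    return parser.links


# Hypothetical listing markup, used only so the sketch runs offline.
SAMPLE_LISTING = """
<ul>
  <li><a class="product-link" href="product/1">Item A</a></li>
  <li><a class="product-link" href="product/2">Item B</a></li>
  <li><a class="nav" href="about">About</a></li>
</ul>
"""

links = collect_product_links(SAMPLE_LISTING, "http://www.websundew.com/demo/")
for url in links:
    print(url)  # a "Click" step would load each of these pages in turn
```

The pattern found by the wizard plays the role of the `css_class` filter here: it decides which links belong to the loop and which (such as navigation links) are skipped.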

1.5 Capturing Data from the Product Detail Page

Click Capture in the Tool Bar. Select Simple Data Pattern and click the Finish button. A Simple Details Wizard will appear in the left-hand corner. Click the product name; it will be highlighted in light blue. Click Add in the pattern wizard and a new field will be added. You can change the field name by clicking on it. Repeat the action for the other fields: price, model and product SKU.

Click the Next button at the bottom of the pattern wizard. Type the pattern name and click Finish. The Capture Block statement will be added to the current state. You can see the captured data in the Preview view.
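Conceptually, a capture pattern maps places on the page to named fields. The sketch below shows that idea in Python under simplifying assumptions: the detail page markup and the class names (`name`, `price`, `model`, `sku`) are hypothetical, and each field element is assumed to contain plain text only. It illustrates the concept and does not reflect WebSundew's internals.

```python
from html.parser import HTMLParser


class FieldCapture(HTMLParser):
    """Captures the text of elements whose CSS class maps to a field name."""

    def __init__(self, field_classes):
        super().__init__()
        self.field_classes = field_classes  # {css_class: field_name}
        self.current_field = None
        self.record = {}

    def handle_starttag(self, tag, attrs):
        for css_class in (dict(attrs).get("class") or "").split():
            if css_class in self.field_classes:
                self.current_field = self.field_classes[css_class]
                self.record[self.current_field] = ""

    def handle_data(self, data):
        if self.current_field is not None:
            self.record[self.current_field] += data

    def handle_endtag(self, tag):
        # Assumption: field elements hold plain text, so any closing tag
        # ends the current capture.
        self.current_field = None


def capture_product(html, field_classes):
    """Return one record (field name -> trimmed text) for a detail page."""
    parser = FieldCapture(field_classes)
    parser.feed(html)
    return {name: value.strip() for name, value in parser.record.items()}


# Hypothetical detail page markup, used only so the sketch runs offline.
SAMPLE_DETAIL = """
<div>
  <h1 class="name">Demo Phone</h1>
  <span class="price">$199.00</span>
  <span class="model">DP-100</span>
  <span class="sku">SKU-0001</span>
</div>
"""

FIELDS = {"name": "name", "price": "price", "model": "model", "sku": "sku"}
product = capture_product(SAMPLE_DETAIL, FIELDS)
print(product)
```

Renaming a field in the wizard corresponds to changing a value in the `FIELDS` mapping: the place on the page stays the same, only the column name in the output changes.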

1.6 Capturing Data from Linked Pages

The web page associated with the Page 1 state links to further pages. We need to visit all of them. For that purpose we can use a Paginator.

Click the Paginator button in the Tool Bar to create a Paginator, which enables the Agent to visit all the linked pages and extract data from each of them. Select Simple Next Page Pattern in the paginator dialog window and click the Finish button. The Simple Next Page Wizard will appear on the left-hand side of the window. Click the Next page link in the browser window.

Click the Next button in the Paginator Wizard. Enter the name of the pattern. Click the Finish button. The Paginator statement will be added to the current state.
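The behaviour of a "next page" pattern can be pictured as a loop that keeps following the Next link until there is none. The Python sketch below illustrates this under stated assumptions: the pages live in an in-memory dictionary instead of being fetched over HTTP, the URLs and the `rel="next"` marker are hypothetical, and a regular expression stands in for a proper HTML pattern. It is not WebSundew's own mechanism.

```python
import re

# Hypothetical in-memory "site" (URL -> HTML) so the sketch runs offline.
PAGES = {
    "/demo/?page=1": '<a class="item" href="/p/1">A</a> '
                     '<a rel="next" href="/demo/?page=2">Next</a>',
    "/demo/?page=2": '<a class="item" href="/p/2">B</a> '
                     '<a rel="next" href="/demo/?page=3">Next</a>',
    "/demo/?page=3": '<a class="item" href="/p/3">C</a>',
}

# Simplistic stand-in for the "Next page" pattern chosen in the wizard.
NEXT_LINK = re.compile(r'<a rel="next" href="([^"]+)"')


def walk_pages(start_url, fetch):
    """Yield (url, html) for the start page and every page reached via Next."""
    url, seen = start_url, set()
    while url is not None and url not in seen:
        seen.add(url)  # guard against a Next link that loops back
        html = fetch(url)
        yield url, html
        match = NEXT_LINK.search(html)
        url = match.group(1) if match else None


visited = [url for url, _ in walk_pages("/demo/?page=1", PAGES.__getitem__)]
print(visited)
```

Everything attached to the current state (the link loop and the capture block) would run once per page yielded by `walk_pages`, which is why a single paginator is enough to cover the whole catalogue.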

1.7 Saving Data

Click Datasource in the Tool Bar to create a Datasource. The Datasource Wizard will appear. Select the format you want; in our case it is Excel.

Click the Next button. Select the Agent and mark the fields you want to save. Click Next if you want to use the default settings. Enter a Datasource name and click the Finish button.
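Marking fields in the Datasource Wizard amounts to choosing which columns of each captured record end up in the output. As a rough illustration, the Python sketch below writes only the selected fields of some made-up records. Note the substitution: it produces CSV, which Excel can open, rather than a native Excel workbook, because the Python standard library has no Excel writer. The sample records and the function name are assumptions.

```python
import csv
import io

# Made-up records standing in for data captured by the agent.
records = [
    {"name": "Demo Phone", "price": "$199.00", "model": "DP-100", "sku": "SKU-0001"},
    {"name": "Demo Tablet", "price": "$299.00", "model": "DT-200", "sku": "SKU-0002"},
]


def write_records(records, fields, out):
    """Write only the selected fields of each record as CSV rows."""
    # extrasaction="ignore" drops fields that were not selected.
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)


buffer = io.StringIO()
write_records(records, ["name", "price", "sku"], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of a string buffer, pass a file opened with `open("products.csv", "w", newline="")` as `out`; the file name here is only an example.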


We have created the Data Extraction Agent. Now we can use it to extract and save the data.

1.8 Running the Agent

Click the Run button in the Tool Bar. Wait until the Agent finishes working. A dialog window will appear.

You can see the results of the agent's work and the path to the saved file. Select the file name and click Open to view the result.

Page Modified 03.02.12 12:50