Web Content Extractor Demo 1
Shows how to extract data from detail pages
Closed Caption:
This demo shows you how to extract data from detail pages using Web Content Extractor.
First, click the "New Project" button to create a new project.
During the first step of the "New Project" wizard, enter the web address from which the program will start the crawling process.
Then identify the links that the program should follow. Basic rules identify links by their position on the page, while advanced rules identify links by URL patterns.
If the links stay in the same position on every page of the web site, try the basic rules. If the links change position from page to page, use the advanced rules instead.
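The difference between the two kinds of rule can be sketched in a few lines of Python. This is only an illustration of the idea, not how Web Content Extractor works internally; the href lists, the function names, and the `/property/` URL scheme are all made up for the example.

```python
import re

# Hypothetical hrefs, in document order, from two listing pages of one site.
PAGE_A = ["/home", "/property/1", "/property/2", "/contact"]
PAGE_B = ["/home", "/ads/7", "/property/3", "/property/4", "/contact"]


def select_by_position(hrefs, positions):
    """Basic-rule analogue: keep the links found at fixed indexes."""
    return [hrefs[i] for i in positions if i < len(hrefs)]


def select_by_url_pattern(hrefs, pattern):
    """Advanced-rule analogue: keep the links whose URL matches a regex."""
    rx = re.compile(pattern)
    return [h for h in hrefs if rx.search(h)]


DETAIL = r"^/property/\d+$"  # assumed URL shape of a detail page
print(select_by_position(PAGE_A, [1, 2]))     # the two detail links
print(select_by_position(PAGE_B, [1, 2]))     # picks up the ad link by mistake
print(select_by_url_pattern(PAGE_B, DETAIL))  # only the detail links
```

On the second page an extra link shifts everything down by one, so the positional rule selects the wrong link while the URL pattern still finds the detail pages.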
In our sample, the links to the detail pages do not change position on the listing pages. We can identify them using the basic rules. Click the "+" button.
Wait until the page is loaded and click on the links to the detail pages.
After you select two links, the program analyzes the page, and if it finds similar links, it automatically prompts you to select them.
The "Next Page" link can change its position on subsequent listing pages, but because the text of this link does not change, you can identify it by the link text pattern.
To specify the link text pattern, click on the "Follow links if link text equals" option and then click the "Edit" button.
Wait until the page is loaded and click on the "Next" link.
The program extracts the link text and saves it in the patterns window.
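As a rough sketch of what a "link text equals" rule amounts to, the Python below collects the href of every link whose visible text matches a saved pattern. It uses the standard library `html.parser` module; the class name, the function name, and the sample markup are assumptions made for this illustration, not part of the product.

```python
from html.parser import HTMLParser


class LinkTextMatcher(HTMLParser):
    """Records the href of every <a> whose visible text equals `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.matches = []
        self._href = None   # href of the <a> currently open, if any
        self._text = ""     # text collected inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = ""

    def handle_data(self, data):
        if self._href is not None:
            self._text += data

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if self._text.strip() == self.wanted:
                self.matches.append(self._href)
            self._href = None


def follow_if_text_equals(html, wanted):
    matcher = LinkTextMatcher(wanted)
    matcher.feed(html)
    return matcher.matches


# Made-up listing pages: "Next" sits at a different position on each one.
FIRST = '<a href="/p/1">Villa</a><a href="/list?page=2">Next</a>'
SECOND = ('<a href="/list?page=1">Previous</a><a href="/p/9">Loft</a>'
          '<a href="/list?page=3">Next</a>')
print(follow_if_text_equals(FIRST, "Next"))
print(follow_if_text_equals(SECOND, "Next"))
```

Because the match is on the text rather than the position, the same rule finds the link on both pages.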
Then you need to create an extraction pattern. Click the "Define..." button.
An extraction pattern is a set of data fields that define the positions of text and images on the web page.
First, enter the address of the web page with the target data. You can navigate to the target page via the built-in browser.
To add new data fields, click the "+" button.
Wait until the page is loaded and click on the text or image you need to extract. In our sample, we need to extract the property name.
The program defines the HTML path of the element that contains the title text and displays the "New Data Field" window, which allows you to specify the other parameters of the new data field.
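One way to picture an extraction pattern is as a mapping from field names to element paths. The sketch below assumes well-formed markup so that the standard library `xml.etree.ElementTree` module can parse it; real pages usually need a tolerant HTML parser. The field names, paths, and sample page are invented for the example and do not reflect the program's own path format.

```python
import xml.etree.ElementTree as ET

# Hypothetical extraction pattern: field name -> path from the <html> root.
PATTERN = {
    "title": "./body/div[1]/h1",
    "description": "./body/div[2]/p",
}

# Made-up detail page; the title holds both the name and the price.
PAGE = """<html><body>
<div><h1>Sea View Villa, $450,000</h1></div>
<div><p>Three bedrooms, two bathrooms.</p></div>
</body></html>"""


def apply_pattern(page, pattern):
    """Return one record holding the text found at each field's path."""
    root = ET.fromstring(page)
    record = {}
    for name, path in pattern.items():
        node = root.find(path)
        # A path that matches nothing yields None instead of raising.
        record[name] = "".join(node.itertext()).strip() if node is not None else None
    return record


print(apply_pattern(PAGE, PATTERN))
```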
In the preview box you can see that this field contains both the property name and the price. To extract the property name on its own, you need to generate a text processing script. Select the text before the comma and click the "Edit Script" button.
In the script wizard window, click the "Generate" button.
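The generated script itself is not shown in the demo, but its effect, keeping only the text before the first comma, can be approximated in Python as follows. The function name and the sample value are assumptions for illustration.

```python
def text_before(value, delimiter=","):
    """Keep the text before the first delimiter, without surrounding spaces."""
    head, _, _ = value.partition(delimiter)
    return head.strip()


# Made-up field value: property name, then a price that itself contains a comma.
print(text_before("Sea View Villa, $450,000"))
```

Splitting on the first comma only matters here, because the price may contain commas of its own.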
You can create other data fields in a similar fashion.
If you click text that has a label, the program will automatically define the HTML path of the parent element and generate a script that extracts the text that comes after the label.
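The label case can be sketched the same way: take the full text of the parent element and keep what follows the label. Again this is only an approximation under assumptions; the markup is made up, it must be well-formed for `xml.etree.ElementTree`, and `text_after_label` is a hypothetical helper, not a function of the program.

```python
import xml.etree.ElementTree as ET


def text_after_label(parent_markup, label):
    """Return the text that follows `label` inside the parent element."""
    parent = ET.fromstring(parent_markup)
    full_text = "".join(parent.itertext())
    _, found, after = full_text.partition(label)
    # None signals that the label does not occur in this element.
    return after.strip() if found else None


# Made-up parent element: a label in <b> followed by the value.
print(text_after_label("<li><b>Bedrooms:</b> 3</li>", "Bedrooms:"))
```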
Once all the data fields are created, click OK.
In the preview window you can see the selected data fields on the web page.
And you can see the extracted text.
Once the extraction pattern is created, click OK.
Enter the name of the project and click "Finish".
To start the extraction process, click "Start". The program starts crawling from the starting page, follows the specified links and extracts data using the extraction pattern.
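The overall flow described here (start page, follow the selected links, apply the extraction pattern) resembles a simple breadth-first crawl. The sketch below runs against a small in-memory site instead of the network so that it stays self-contained; the `SITE` dictionary, the URL shapes, and all function names are hypothetical.

```python
from collections import deque

# Hypothetical in-memory site: listing pages carry links, detail pages a title.
SITE = {
    "/list?page=1": {"links": ["/about", "/p/1", "/p/2", "/list?page=2"]},
    "/list?page=2": {"links": ["/p/2", "/p/3"]},
    "/p/1": {"title": "Sea View Villa, $450,000"},
    "/p/2": {"title": "City Flat, $210,000"},
    "/p/3": {"title": "Harbour Loft, $320,000"},
    "/about": {"links": []},
}


def is_followed(url):
    """Link rule: follow detail pages and further listing pages only."""
    return url.startswith(("/p/", "/list"))


def extract(page):
    """Extraction pattern: the property name is the title text before the comma."""
    if "title" not in page:
        return None
    return {"name": page["title"].partition(",")[0].strip()}


def crawl(start):
    queue, seen, records = deque([start]), {start}, []
    while queue:
        page = SITE[queue.popleft()]
        record = extract(page)
        if record is not None:
            records.append(record)
        for link in page.get("links", []):
            if is_followed(link) and link not in seen:  # visit each URL once
                seen.add(link)
                queue.append(link)
    return records


print(crawl("/list?page=1"))
```

The `seen` set keeps a page that is linked from several listings from being extracted twice.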
Thank you for your attention.
Video Length: 06:17
Uploaded By: Newprosoft