# Scrape Feeds
PHPScraper can identify and process feeds (RSS feeds, sitemaps, etc.) for you. The following feed-specific features are implemented:
# Identify RSS Feed URLs
Websites can declare RSS feeds in the head section of their markup. PHPScraper can identify any RSS feeds referenced on the current page using `rssUrls`:
$web = new \Spekulatius\PHPScraper\PHPScraper;
/**
 * Navigate to the test page. It contains:
 *
 * <link
 *   rel="alternate"
 *   type="application/rss+xml"
 *   href="https://test-pages.phpscraper.de/absolute.xml"
 * />
 * <link
 *   rel="alternate"
 *   type="application/rss+xml"
 *   href="/relative.xml"
 * />
 */
print_r(
    $web
        ->go('https://test-pages.phpscraper.de/meta/feeds.html')
        ->rssUrls
);
/**
 * [
 *     'https://test-pages.phpscraper.de/absolute.xml',
 *     'https://test-pages.phpscraper.de/relative.xml'
 * ]
 */
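To make the lookup above more concrete, here is a minimal sketch of how feed links could be collected from a page's markup with PHP's DOM extension. It is not PHPScraper's actual implementation: the helper `findRssUrls()` is a hypothetical name, and URL resolution is simplified to handle only absolute URLs and root-relative paths such as `/relative.xml`.

```php
<?php

// Hypothetical helper for illustration only; not part of PHPScraper.
function findRssUrls(string $html, string $pageUrl): array
{
    // Collect libxml warnings internally instead of printing them;
    // real-world HTML is rarely perfectly valid.
    libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Select every <link> element that declares an RSS feed type.
    $xpath = new DOMXPath($dom);
    $links = $xpath->query('//link[@type="application/rss+xml"]');

    // Origin (scheme + host) of the page, used for root-relative paths.
    $parts = parse_url($pageUrl);
    $origin = $parts['scheme'] . '://' . $parts['host'];

    $urls = [];
    foreach ($links as $link) {
        $href = $link->getAttribute('href');
        // Simplification: only absolute URLs and paths starting with "/" are handled.
        $urls[] = strncmp($href, '/', 1) === 0 ? $origin . $href : $href;
    }

    return $urls;
}

// Usage with markup similar to the test page described above.
$html = '<html><head>'
    . '<link rel="alternate" type="application/rss+xml" href="https://test-pages.phpscraper.de/absolute.xml" />'
    . '<link rel="alternate" type="application/rss+xml" href="/relative.xml" />'
    . '</head><body></body></html>';

print_r(findRssUrls($html, 'https://test-pages.phpscraper.de/meta/feeds.html'));
```

In practice `rssUrls` does this work for you; the sketch only shows why the relative `href` comes back as an absolute URL.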
# Parse RSS feeds
The `rss()` method parses RSS feeds. If it is called without any parameters, the URLs from `rssUrls` are used:
// Init and go to any page of the domain. This sets the base URL.
$web = new \Spekulatius\PHPScraper\PHPScraper;
$web->go('https://test-pages.phpscraper.de/meta/feeds.html');
// Same as `$web->rss(...$web->rssUrls)`
$rss = $web->rss();
You can also parse specific RSS feeds by passing in one or more URLs:
// Single URL.
$rss = $web->rss($web->rssUrls[0]);
// Multiple URLs
$rss = $web->rss(
    'https://example.com/feed_1.xml',
    'https://example.com/feed_2.xml'
);
The result is an array structure that contains only selected properties. Its entries are instances of `DataTransferObjects\FeedEntry` with the properties `link` and `title`.
Complete Details
If you need all details, fall back on `$web->rssRaw(...)`. It is called like `$web->rss(...)` and returns an array structure.
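To illustrate what "selected properties" means for a feed, the following sketch reduces an RSS 2.0 document to its `title` and `link` values using PHP's SimpleXML extension. This is an assumption-laden illustration, not PHPScraper's implementation: `parseRssItems()` is a hypothetical helper, it returns plain arrays rather than `FeedEntry` objects, and it expects the usual `<channel>`/`<item>` layout.

```php
<?php

// Hypothetical helper for illustration only; not part of PHPScraper.
function parseRssItems(string $xml): array
{
    $feed = simplexml_load_string($xml);
    if ($feed === false) {
        throw new RuntimeException('Could not parse the feed XML.');
    }

    $entries = [];
    // Assumes an RSS 2.0 layout: <rss><channel><item>...</item></channel></rss>
    foreach ($feed->channel->item as $item) {
        // Keep only the two properties of interest; everything else is dropped.
        $entries[] = [
            'title' => (string) $item->title,
            'link'  => (string) $item->link,
        ];
    }

    return $entries;
}

// Usage with a small inline feed (sample data).
$xml = '<rss version="2.0"><channel><title>Example</title>'
    . '<item><title>First post</title><link>https://example.com/first</link><pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate></item>'
    . '<item><title>Second post</title><link>https://example.com/second</link></item>'
    . '</channel></rss>';

print_r(parseRssItems($xml));
```

Fields such as `pubDate` are discarded in this reduced view, which is the kind of detail the raw variant is meant to preserve.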
# Parse XML Sitemaps
You can parse XML sitemaps using `sitemap()`:
$web = new \Spekulatius\PHPScraper\PHPScraper;
/**
 * Get the sitemap for the current website (if it exists).
 * This assumes the default URL `/sitemap.xml` is used.
 *
 * @throws \Exception (e.g. network).
 */
$sitemap = $web
    ->go('https://example.com')
    ->sitemap();
// You can pass in the desired URL:
$sitemap = $web->sitemap('https://example.com/custom_sitemap.xml');
The result contains only selected properties: `sitemap()` returns an array of `DataTransferObjects\FeedEntry` instances, each with the `link` property.
Complete Details
If you need all details, fall back on `$web->sitemapRaw(...)`. It is called like `$web->sitemap()` and returns an array structure.
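As a rough picture of what reducing a sitemap to its links involves, the sketch below extracts the `<loc>` value of every `<url>` entry with SimpleXML. It is an illustration under stated assumptions rather than PHPScraper's own code: `parseSitemapLinks()` is a hypothetical helper, and only plain `<urlset>` sitemaps are handled (no sitemap index files).

```php
<?php

// Hypothetical helper for illustration only; not part of PHPScraper.
function parseSitemapLinks(string $xml): array
{
    $sitemap = simplexml_load_string($xml);
    if ($sitemap === false) {
        throw new RuntimeException('Could not parse the sitemap XML.');
    }

    $links = [];
    // Each <url> element holds one page; <loc> is its address.
    foreach ($sitemap->url as $url) {
        $links[] = (string) $url->loc;
    }

    return $links;
}

// Usage with a small inline sitemap (sample data).
$xml = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    . '<url><loc>https://example.com/</loc><lastmod>2024-01-01</lastmod></url>'
    . '<url><loc>https://example.com/about</loc></url>'
    . '</urlset>';

print_r(parseSitemapLinks($xml));
```

Optional fields such as `<lastmod>` are ignored here, mirroring the link-only view described above.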
# Parse Static Search Indexes
You can parse static search indexes using `searchIndex()`:
$web = new \Spekulatius\PHPScraper\PHPScraper;
// Get the search index for the current website (if it exists).
// This assumes the default URL `/index.json` is used.
$searchIndex = $web
    ->go('https://example.com')
    ->searchIndex();
// You can pass in the desired URL:
$searchIndex = $web->searchIndex('https://example.com/custom_index.json');
The result contains only selected properties: `searchIndex()` returns an array of `DataTransferObjects\FeedEntry` instances with the properties `link`, `title`, and `description`.
Complete Details
If you need all details, fall back on `$web->searchIndexRaw(...)`. It is called like `$web->searchIndex()` and returns an array structure.
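The reduction to `link`, `title`, and `description` can be pictured with a short sketch based on `json_decode()`. The index layout is an assumption made for this example (a JSON array of objects that use exactly those three keys), since static site generators differ in how they structure `index.json`; `parseSearchIndex()` is a hypothetical helper and not part of PHPScraper.

```php
<?php

// Hypothetical helper for illustration only; not part of PHPScraper.
// Assumes the index is a JSON array of objects with "link", "title",
// and "description" keys; real indexes may use other key names.
function parseSearchIndex(string $json): array
{
    // Decode to associative arrays and throw a JsonException on invalid JSON.
    $records = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

    $entries = [];
    foreach ($records as $record) {
        // Keep the three selected properties; missing keys become null.
        $entries[] = [
            'link'        => $record['link'] ?? null,
            'title'       => $record['title'] ?? null,
            'description' => $record['description'] ?? null,
        ];
    }

    return $entries;
}

// Usage with a small inline index (sample data).
$json = '[{"link": "https://example.com/a", "title": "Page A", "description": "About A", "tags": ["x"]},'
    . ' {"link": "https://example.com/b", "title": "Page B"}]';

print_r(parseSearchIndex($json));
```

Extra keys such as `tags` are dropped in this view, which is the kind of detail `searchIndexRaw()` is described as returning.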