# Scrape CSV, XML and JSON
PHPScraper can process common plain-text file types such as `csv`, `json`, and `xml` from strings or URLs. Most functionality described below works for all three types; special cases are noted.
# Parsing of CSV/XML/JSON Strings
If you have a string containing CSV, XML, or JSON, PHPScraper can assist in validating and parsing it:
$web = new \Spekulatius\PHPScraper\PHPScraper;
// Parse a JSON string
$json = $web->parseJson($jsonString);
// Parse an XML string
$xml = $web->parseXml($xmlString);
// Parse a CSV string
$csv = $web->parseCsv($csvString);
This can be useful when chaining steps or accessing embedded elements such as schema data.
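As a rough illustration of what validating and parsing a string involves, the following plain-PHP sketch decodes a schema-style JSON snippet and fails loudly on malformed input. It does not use PHPScraper and is not its implementation; the helper name `decodeJsonOrFail` and the sample data are assumptions made for this example.

```php
<?php
// Illustrative only: a plain-PHP take on "validate, then parse".
// `decodeJsonOrFail` is a hypothetical helper, not part of PHPScraper.
function decodeJsonOrFail(string $jsonString): array
{
    // JSON_THROW_ON_ERROR turns malformed input into a JsonException
    // instead of a silent `null` return value (PHP 7.3+).
    $data = json_decode($jsonString, true, 512, JSON_THROW_ON_ERROR);

    // Scalars such as `42` or `"text"` are valid JSON, but not useful here.
    if (!is_array($data)) {
        throw new InvalidArgumentException('Expected a JSON object or array.');
    }

    return $data;
}

// Made-up sample resembling embedded schema data.
$parsed = decodeJsonOrFail('{"@type": "Article", "tags": ["a", "b"]}');

echo $parsed['tags'][1], PHP_EOL; // prints "b"
```

With PHPScraper, the same string would be passed to `parseJson` as in the example above.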
# Fetching and Parsing of CSV/XML/JSON URLs
PHPScraper can assist with fetching and parsing the contents of remote resources (URLs) containing JSON, CSV, or XML data:
$web = new \Spekulatius\PHPScraper\PHPScraper;
// Fetches the URL and parses the contents as JSON.
$json = $web
    ->parseJson('https://test-pages.phpscraper.de/index.json');
// Fetches the URL and parses the contents as XML.
$xml = $web
    ->parseXml('https://test-pages.phpscraper.de/sitemap.xml');
// Fetches the URL and parses the contents into a simple array.
$csv = $web
    ->parseCsv('https://test-pages.phpscraper.de/test.csv');
// Fetches the URL and generates an associative array (map) with the first line as keys.
$csv = $web
    ->parseCsvWithHeader('https://test-pages.phpscraper.de/test.csv');
Each of the methods above can be called in several ways. Using `parseCsv` as an example, the following options are available:
$web = new \Spekulatius\PHPScraper\PHPScraper;
// Option 1: Pass in the absolute URL
$csv = $web
    ->parseCsv('https://test-pages.phpscraper.de/test.csv');
// Option 2: Navigate to a relative URL for parsing.
$csv = $web
    ->go('https://test-pages.phpscraper.de/meta/feeds.html')
    ->parseCsv('/test.csv');
// Option 3: Navigate with `go` or `clickLink` and call the parser.
$csv = $web
    ->go('https://test-pages.phpscraper.de/test.csv')
    ->parseCsv();
Multiple Methods
The examples above apply to the following methods:
- `parseJson`
- `parseXml`
- `parseCsv`
- `parseCsvWithHeader` (resolves into an associative array)
# Parsing a CSV String with Headers
CSV can be parsed into various data structures. PHPScraper comes with two options built-in to parse CSV. Given the following example file:
$ curl https://test-pages.phpscraper.de/test.csv
date,value
1945-02-06,4.20
1952-03-11,42
The standard parser `parseCsv` returns a simple array with cast values:
$web = new \Spekulatius\PHPScraper\PHPScraper;
print_r(
    $web->parseCsv('https://test-pages.phpscraper.de/test.csv')
);
/**
* [
* ['date', 'value'],
* ['1945-02-06', 4.20],
* ['1952-03-11', 42],
* ]
*/
`parseCsvWithHeader` parses the content, uses the first line as headers, and returns an associative array (map):
$web = new \Spekulatius\PHPScraper\PHPScraper;
print_r(
    $web->parseCsvWithHeader('https://test-pages.phpscraper.de/test.csv')
);
/**
* [
* ['date' => '1945-02-06', 'value' => 4.20],
* ['date' => '1952-03-11', 'value' => 42],
* ]
*/
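To make the header mapping concrete, here is a minimal plain-PHP sketch that produces the same shape of result from a CSV string using built-in functions. It is not PHPScraper's implementation: `csvToMaps` is a hypothetical helper, it leaves every value as a string (no casting), and its line splitting is naive, so quoted fields containing line breaks are not handled.

```php
<?php
// Sketch only: mimics the *shape* of the header-keyed result using PHP
// built-ins. `csvToMaps` is a hypothetical helper, not PHPScraper's code.
function csvToMaps(string $csvString): array
{
    // Split on any line-break sequence; trim() drops a trailing empty line.
    $lines = preg_split('/\R/', trim($csvString));

    // Parse each line into a list of fields. Separator, enclosure and
    // escape are passed explicitly so the call behaves the same everywhere.
    $rows = array_map(
        fn (string $line): array => str_getcsv($line, ',', '"', '\\'),
        $lines
    );

    // The first row supplies the keys for every following row.
    $header = array_shift($rows);

    return array_map(
        fn (array $row): array => array_combine($header, $row),
        $rows
    );
}

$maps = csvToMaps("date,value\n1945-02-06,4.20\n1952-03-11,42");

echo $maps[0]['date'], PHP_EOL; // prints "1945-02-06"
```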
Type Casting
Values that represent native types such as `int` and `float` are automatically cast to the corresponding PHP-native types.
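The kind of casting described here could look roughly like the following plain-PHP sketch. The helper `castCsvValue` and its matching rules are assumptions chosen for illustration; PHPScraper's actual rules may differ, for example for exponents, leading zeros, or very large numbers.

```php
<?php
// Sketch only: one possible casting rule. `castCsvValue` is hypothetical
// and is not taken from PHPScraper.
function castCsvValue(string $value)
{
    // Whole numbers such as "42" or "-7" become int.
    if (preg_match('/^-?\d+$/', $value) === 1) {
        return (int) $value;
    }

    // Decimal numbers such as "4.20" become float.
    if (preg_match('/^-?\d+\.\d+$/', $value) === 1) {
        return (float) $value;
    }

    // Everything else, including dates, stays a string.
    return $value;
}

var_dump(castCsvValue('42'));         // int(42)
var_dump(castCsvValue('4.20'));       // float(4.2)
var_dump(castCsvValue('1945-02-06')); // string(10) "1945-02-06"
```

Note that a date such as `1945-02-06` is left untouched because it matches neither numeric pattern.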
# Providing CSV Parsing Parameters
You might want to define which separator, enclosure, and escape character to use. You can do so by passing them as additional parameters:
$web = new \Spekulatius\PHPScraper\PHPScraper;
// Direct access:
$csv = $web
    ->parseCsv('https://test-pages.phpscraper.de/test-custom.csv', '|', '"');
// Alternative syntax using `go` first:
$csv = $web
    ->go('https://test-pages.phpscraper.de/test.csv')
    ->parseCsv(null, '|', '"');
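To show what these parameters change, the sketch below parses a single pipe-separated line with PHP's built-in `str_getcsv`, which accepts a separator, an enclosure, and an escape character. The sample line is made up for this example, and the sketch makes no claim about how PHPScraper applies the parameters internally.

```php
<?php
// Sketch only: demonstrates separator and enclosure with a PHP built-in.
// The sample line is invented for illustration.
$line = '1945-02-06|"4.20|approx"|note';

// Separator '|', enclosure '"', escape '\'.
// The enclosure protects the '|' inside the second field from splitting.
$fields = str_getcsv($line, '|', '"', '\\');

print_r($fields);
// Array
// (
//     [0] => 1945-02-06
//     [1] => 4.20|approx
//     [2] => note
// )
```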