Use Python to read an HTML element by XPath

Ubuntu 16.04.4 LTS

Use Python 3 and lxml to read the data-endpoint attribute of an HTML element by XPath

HTML file

   
...<div data-listing="article" data-endpoint="https://www.sample.com/article-list.json" ...
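
The attribute lookup can be tried offline before touching the network. This is a minimal sketch, assuming a complete version of the elided markup above: the surrounding tags, the second div, and the video-list.json URL are made up for illustration, and find_endpoints is a hypothetical helper name.

```python
from lxml import html

# Hypothetical, complete stand-in for the elided page markup.
# The "video" div is invented to show that the predicate filters it out.
SAMPLE_HTML = """
<html><body>
  <div data-listing="article"
       data-endpoint="https://www.sample.com/article-list.json"></div>
  <div data-listing="video"
       data-endpoint="https://www.sample.com/video-list.json"></div>
</body></html>
"""


def find_endpoints(markup):
    """Return the data-endpoint value of every article-listing div."""
    tree = html.fromstring(markup)
    # The trailing /@data-endpoint selects the attribute value, not the element
    return tree.xpath('//div[@data-listing="article"]/@data-endpoint')


print(find_endpoints(SAMPLE_HTML))
```

The XPath result is a list of strings, one per matching div, so a page with several article listings yields several endpoints.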

Use the data-endpoint URL to fetch and parse the JSON data of the article list

JSON of an article list

[
    {
        "chunks":
        [
            {
                "title": "Test Title",
                "url": "https://www.sample.com/article1.html"
            },
            },
            {...},
            {...}
        ]
    }
]
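
The shape of this payload can be walked with the standard json module alone. The sketch below assumes the list is keyed by chunks, the name the script reads, and fills the elided entries with one made-up article; list_articles is a hypothetical helper name.

```python
import json

# Hypothetical payload shaped like the article list above.
# The second article is invented; the real entries are elided in the source.
SAMPLE_JSON = """
[
    {
        "chunks": [
            {"title": "Test Title", "url": "https://www.sample.com/article1.html"},
            {"title": "Another Title", "url": "https://www.sample.com/article2.html"}
        ]
    }
]
"""


def list_articles(text):
    """Return (url, title) pairs from the first element's chunk list."""
    payload = json.loads(text)
    first = payload[0]  # the top-level JSON value is a list
    return [(article["url"], article["title"]) for article in first["chunks"]]


for url, title in list_articles(SAMPLE_JSON):
    print(url, "\t", title)
```

Note that json.loads only accepts strict JSON, so keys must be double-quoted in the real response.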

Install lxml
sudo apt-get install python3-lxml
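
One way to confirm that the package is visible to Python 3 is to import it and print its version. This is only a quick check, not part of the original steps.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release
print("lxml version:", etree.LXML_VERSION)
```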

Python script

test.py

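from lxml import html
import requests

urls = ["https://www.sample.com/sample.html"]

for url in urls:
    # Download the page and parse it into an lxml element tree
    page = requests.get(url)
    content = html.fromstring(page.content)
    # Select the data-endpoint attribute of every article-listing div
    endpoints = content.xpath('//div[@data-listing="article"]/@data-endpoint')
    for endpoint in endpoints:
        # Fetch the article list; the top-level JSON value is a list
        r = requests.get(endpoint)
        data = r.json()[0]
        for article in data['chunks']:
            print(url, "\t", article['url'], "\t", article['title'])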

Run the script
python3 test.py
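
The script assumes every request succeeds and every response has the expected shape. A more defensive variant might look like the sketch below. The function name fetch_articles, the session parameter, and the 10-second timeout are assumptions for illustration, not part of the original script.

```python
import requests
from lxml import html


def fetch_articles(url, session=None, timeout=10):
    """Yield (page_url, article_url, title) for each article listed on url."""
    # Accept an existing session so connections can be reused across pages
    http = session if session is not None else requests.Session()

    page = http.get(url, timeout=timeout)
    page.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
    tree = html.fromstring(page.content)

    for endpoint in tree.xpath('//div[@data-listing="article"]/@data-endpoint'):
        response = http.get(endpoint, timeout=timeout)
        response.raise_for_status()
        payload = response.json()
        if not payload:
            continue  # empty list: nothing to report for this endpoint
        # .get() tolerates a first element without a "chunks" key
        for article in payload[0].get("chunks", []):
            yield url, article["url"], article["title"]
```

Nothing runs at import time because www.sample.com is a placeholder. With a real page, looping over fetch_articles(url) and printing each tuple with a tab separator would reproduce the output of test.py.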