Ubuntu 16.04.4 LTS

  • Use Python 3 with lxml and XPath to read the data-endpoint attribute of an HTML element
    HTML file
       
    ...<div data-listing="article" data-endpoint="https://www.sample.com/article-list.json" ...
  • Use the data-endpoint URL to fetch and parse the JSON data of the article list
    JSON of an article list
    [
      {
        "chunks": [
          {
            "title": "Test Title",
            "url": "https://www.sample.com/article1.html"
          },
          {...},
          {...}
        ]
      }
    ]
  • Install lxml
    sudo apt-get install python3-lxml
  • Python script
    test.py
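    from lxml import html
    import requests

    urls = ["https://www.sample.com/sample.html"]

    for url in urls:
        # Download the page and parse it into an lxml element tree.
        page = requests.get(url)
        content = html.fromstring(page.content)
        # Collect the data-endpoint attribute of every matching div.
        endpoints = content.xpath('//div[@data-listing="article"]/@data-endpoint')
        for endpoint in endpoints:
            # The endpoint returns a JSON array; the article list is in its first element.
            r = requests.get(endpoint)
            data = r.json()[0]
            for article in data['chunks']:
                print(url, "\t", article['url'], "\t", article['title'])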
  • Run the script
    python3 test.py
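
To check the XPath expression and the JSON traversal without touching the network, the two steps can be tried on inline strings first. The following is only a sketch: SAMPLE_HTML, SAMPLE_JSON, extract_endpoints and list_articles are hypothetical names made up for this illustration, the second div and its video-list.json URL are invented to show that the XPath filter skips non-article listings, and the JSON assumes the quoted-key form of the structure shown above.

```python
import json
from lxml import html

# Hypothetical inline stand-ins for the downloaded page and the endpoint response.
SAMPLE_HTML = """
<html><body>
  <div data-listing="article"
       data-endpoint="https://www.sample.com/article-list.json"></div>
  <div data-listing="video"
       data-endpoint="https://www.sample.com/video-list.json"></div>
</body></html>
"""

SAMPLE_JSON = """
[
  {
    "chunks": [
      {"title": "Test Title", "url": "https://www.sample.com/article1.html"}
    ]
  }
]
"""


def extract_endpoints(markup):
    """Return the data-endpoint value of every div marked data-listing="article"."""
    tree = html.fromstring(markup)
    return tree.xpath('//div[@data-listing="article"]/@data-endpoint')


def list_articles(payload):
    """Return (url, title) pairs from the "chunks" list of the first array element."""
    data = json.loads(payload)[0]
    return [(article["url"], article["title"]) for article in data["chunks"]]


endpoints = extract_endpoints(SAMPLE_HTML)
articles = list_articles(SAMPLE_JSON)
print(endpoints)
print(articles)
```

Once both helpers behave as expected on the inline samples, the same XPath and dictionary keys can be used against the real page and endpoint as in test.py.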
2021-08-09