Web scraping with Python

Web scraping is a powerful technique for extracting and tracking “things of interest” from websites.

Let us see how it works with an example. Here I search for items on an online retailer and fetch the price of each item found.

To do this I wrote a module containing one class with two functions. The “Shopping.FlipKart” module has a “flipkart” class, and the “flipkart” class has “finditem” and “getDetails” functions, both of which are called directly on the class. Here is the file structure.


.
├── flptest.py
└── Shopping
    ├── FlipKart.py

Let's look at the module code first.


cat Shopping/FlipKart.py
#!/usr/bin/python3
import requests
import bs4 as bs
import sys
import json

class flipkart:
   "Flipkart class"
   Agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

   @staticmethod
   def finditem(searchfor):
       searchbase = "https://www.flipkart.com/search?q="
       sort="&sort=popularity"
       searchobj = searchbase+searchfor+sort

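       try:
          # A timeout keeps the script from hanging on a stalled connection.
          res = requests.get(searchobj, headers={"User-Agent": flipkart.Agent}, timeout=10)
       except requests.RequestException:
          return -1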

       soup = bs.BeautifulSoup(res.text,'html.parser')

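       for tag in soup.find_all('script'):
           text = str(tag)
           # Both markers must be present in the tag.
           if "jsonLD" in text and "ItemList" in text:
               # The JSON-LD payload sits on the second line of the tag.
               # json.loads parses it as data instead of executing page content.
               itemlist = json.loads(text.split('\n')[1].lstrip())
               return list(itemlist["itemListElement"])

       # No matching script tag: return an empty list so callers can still iterate.
       return []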

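   @staticmethod
   def getDetails(itemurl):
       try:
           res = requests.get(itemurl, headers={"User-Agent": flipkart.Agent}, timeout=10)
       except requests.RequestException:
           return -1

       soup = bs.BeautifulSoup(res.text,'html.parser')
       for tag in soup.find_all('script'):
           if "__INITIAL_STATE__" in str(tag):
               # The state object is assigned on the second line of the tag:
               # keep the text after "__ = " and drop the trailing semicolon.
               itemDetailraw = str(tag).split('\n')
               itemDetails = itemDetailraw[1].lstrip().split("__ = ", 1)
               return itemDetails[1].rstrip(';')

       # No script tag carried the page state.
       return None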
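
The search step above boils down to "find the script tag that carries the item list and parse it as JSON". Here is a minimal, offline sketch of that idea. It makes several assumptions that are not in the module: the sample HTML and its example.com URLs are made up, the item list is assumed to live in a script tag of type "application/ld+json", and the standard library's html.parser stands in for BeautifulSoup so the sketch runs without extra packages. The names ScriptCollector and extract_item_list are hypothetical.

```python
import json
from html.parser import HTMLParser

# Made-up sample page; real pages are far larger and may be laid out differently.
SAMPLE_HTML = """
<html><head>
<script>var unrelated = 1;</script>
<script type="application/ld+json" id="jsonLD">
{"@type": "ItemList", "itemListElement": [
  {"@type": "ListItem", "position": 1, "url": "https://example.com/p/mug-1"},
  {"@type": "ListItem", "position": 2, "url": "https://example.com/p/mug-2"}
]}
</script>
</head><body></body></html>
"""


class ScriptCollector(HTMLParser):
    """Collect (attributes, text) for every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._attrs = None   # attributes of the script tag being read, if any
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._attrs = dict(attrs)
            self._chunks = []

    def handle_data(self, data):
        if self._attrs is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._attrs is not None:
            self.scripts.append((self._attrs, "".join(self._chunks)))
            self._attrs = None


def extract_item_list(html):
    """Return the itemListElement entries of the first JSON-LD ItemList, or []."""
    collector = ScriptCollector()
    collector.feed(html)
    for attrs, text in collector.scripts:
        if attrs.get("type") != "application/ld+json":
            continue
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue  # not valid JSON, try the next script tag
        if isinstance(data, dict) and data.get("@type") == "ItemList":
            return data.get("itemListElement", [])
    return []


items = extract_item_list(SAMPLE_HTML)
urls = [item["url"] for item in items]
print(urls)
```

Parsing the whole tag body as JSON avoids depending on the payload sitting on one particular line, at the cost of assuming the tag contains nothing but JSON.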

Here is the main code, which calls these functions to run the search and get the price of each item found.


cat flptest.py 
#!/usr/bin/python3

from Shopping.FlipKart import flipkart
import sys
import json

if len(sys.argv) > 1:
   searchfor = sys.argv[1]
else:
   print("No search object provided")
   sys.exit(1)

myitems = flipkart.finditem(searchfor)
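# Skip the loop when the search failed (-1) or produced no item list (None).
if myitems is not None and myitems != -1: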
   for myitem in myitems:
      itemurl = myitem["url"]
#      print(itemurl)
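      try:
          itemDetails = json.loads(flipkart.getDetails(itemurl))
      except (TypeError, ValueError):
          # getDetails returned -1 or None, or text that is not valid JSON.
          print("Flipkart cannot get item details of", itemurl)
          continue  # no details to read a price from, so move on to the next item

      try:
           itemPrice = itemDetails["pageDataV4"]["page"]["data"]["10002"][2]["widget"]["data"]["emiDetails"]["metadata"]["price"]
      except (KeyError, IndexError, TypeError):
           # Fall back to the pricing widget when no EMI details are present.
           try:
               itemPrice = itemDetails["pageDataV4"]["page"]["data"]["10002"][2]["widget"]["data"]["pricing"]["value"]["finalPrice"]["value"]
           except (KeyError, IndexError, TypeError):
               itemPrice = "Cannot get it !!"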

      print(itemurl, itemPrice)
#      break

else:
    print("Flipkart search down, cannot proceed finding", searchfor)
    sys.exit(1)
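
The price lookup above tries one deeply nested path and falls back to a second one. The same pattern can be sketched with a small helper that walks a list of candidate paths. This is only an illustration under stated assumptions: the sample script text, its "window." prefix, and the short key paths are made up and much simpler than the real page state, and parse_initial_state, dig and first_found are hypothetical helper names.

```python
import json

# Made-up script text shaped like the assignment that getDetails() searches for.
SAMPLE_SCRIPT = (
    'window.__INITIAL_STATE__ = '
    '{"page": {"pricing": {"finalPrice": {"value": 199}}}};'
)


def parse_initial_state(script_text):
    """Return the object assigned to __INITIAL_STATE__, or None if absent or invalid."""
    _, found, rest = script_text.partition("__INITIAL_STATE__")
    if not found:
        return None
    _, equals, payload = rest.partition("=")
    if not equals:
        return None
    try:
        # Drop surrounding whitespace and the trailing semicolon before parsing.
        return json.loads(payload.strip().rstrip(";"))
    except json.JSONDecodeError:
        return None


def dig(data, path, default=None):
    """Follow dict keys and list indexes in order, returning default on any miss."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


def first_found(data, paths, default=None):
    """Return the value at the first path that resolves to something other than None."""
    for path in paths:
        value = dig(data, path)
        if value is not None:
            return value
    return default


# Candidate locations, tried in order (hypothetical, shortened paths).
PRICE_PATHS = [
    ("page", "emiDetails", "metadata", "price"),  # absent in the sample
    ("page", "pricing", "finalPrice", "value"),
]

state = parse_initial_state(SAMPLE_SCRIPT)
price = first_found(state, PRICE_PATHS, default="Cannot get it !!")
print(price)
```

Keeping the candidate paths in a list means a third location can be added later without another level of nested try blocks.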

Execute and verify results


./flptest.py "Beer mugs"
https://www.flipkart.com/ftafat-inductive-rainbow-color-cup-led-flashing-7-changing-light-pour-water-tea-lighting-cup-easy-battery-replace-glass-mug/p/itmf9dqs93sptkbv?pid=MUGF9DN2HRQ8F9PZ&lid=LSTMUGF9DN2HRQ8F9PZ3LFRUF&marketplace=FLIPKART 199

https://www.flipkart.com/ftafat-pair-inductive-rainbow-color-cup-led-flashing-7-changing-light-lighting-cup-easy-battery-replace-2-cups-glass-mug/p/itmf9dr5pcbfjzpb?pid=MUGF9DN4V9ZHKJQG&lid=LSTMUGF9DN4V9ZHKJQGZUFBKA&marketplace=FLIPKART 254

https://www.flipkart.com/alemkip-inductive-rainbow-color-cup-led-flashing-7-changing-light-pour-water-tea-lighting-cup-easy-battery-replace-glass-250-ml-mug/p/itmff7kfevfd8ntz?pid=MUGFF76WRGTGVX3D&lid=LSTMUGFF76WRGTGVX3DWU3PVI&marketplace=FLIPKART 196

https://www.flipkart.com/red-renge-enterprises-rrbm1-brass-mug/p/itmfhtnn8x8bhw93?pid=MUGFHSQZCW2ZYBKA&lid=LSTMUGFHSQZCW2ZYBKARTPWYG&marketplace=FLIPKART 1850

https://www.flipkart.com/sachdeva-s-traders-color-changing-led-light-glass-mug/p/itmf6z4zpgsgg28z?pid=MUGF6YY6UBP9JZGS&lid=LSTMUGF6YY6UBP9JZGS86FOTS&marketplace=FLIPKART 185

https://www.flipkart.com/lucky-thailand-lg115-glass-set/p/itme9dx8bxtq2eqp?pid=GLSE9DX8ZKGCFAWP&lid=LSTGLSE9DX8ZKGCFAWPRURGMS&marketplace=FLIPKART 246

https://www.flipkart.com/somil-trandy-new-design-stylish-glass-beer-mug-handle-set-2/p/itmeqg8abrwhhgne?pid=GLSEQG8ASNC8P7MM&lid=LSTGLSEQG8ASNC8P7MMH1KNBE&marketplace=FLIPKART 375

https://www.flipkart.com/mega-shine-4-pcs-colour-changing-liquid-activated-lights-multi-purpose-use-cup-300-ml-plastic-mug/p/itmf4s5symnhg499?pid=MUGF4RZ44G3G3ZHK&lid=LSTMUGF4RZ44G3G3ZHKLX4EA5&marketplace=FLIPKART 446

https://www.flipkart.com/caryn-beer-party-led-light-plastic-mug/p/itmee79a6jnhggmk?pid=MUGEE79AQ3CNVECW&lid=LSTMUGEE79AQ3CNVECWYYX5P7&marketplace=FLIPKART&spotlightTagId=BestvalueId_upp 326

https://www.flipkart.com/nvcollections-3d-led-magic-cup-cool-stylish-colorful-design-300ml-plastic-mug/p/itmfa39vh3nhyzex?pid=MUGFA2V2B7E3Y7ZD&lid=LSTMUGFA2V2B7E3Y7ZDKMGWHI&marketplace=FLIPKART 160

Verify the result for the first URL in a web browser as well.

Looks good: the price matches what the Python code printed.

Remember, a scraper has to be crafted after carefully examining the content and source of the target web pages, and it can stop working whenever that markup changes. This article is only meant to give an idea of web scraping.
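
One way to start that examination is to survey the script tags of a saved page and see which ones contain the markers you care about. The sketch below is a rough aid, not part of the article's module: the sample HTML is made up, the marker strings are the two searched for earlier, ScriptSurvey is a hypothetical name, and the standard library's html.parser is used so it runs without extra packages.

```python
from html.parser import HTMLParser

# Made-up sample; in practice this would be the saved source of a real page.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "ItemList"}</script>
<script>window.__INITIAL_STATE__ = {"ok": true};</script>
<script src="/static/app.js"></script>
</head></html>
"""

MARKERS = ("ItemList", "__INITIAL_STATE__")


class ScriptSurvey(HTMLParser):
    """Record the type, text length and matching markers of each <script> tag."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current = None  # the script tag being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._current = {"type": dict(attrs).get("type") or "", "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "script" and self._current is not None:
            text = self._current["text"]
            self.rows.append({
                "type": self._current["type"],
                "length": len(text),
                "markers": [m for m in MARKERS if m in text],
            })
            self._current = None


survey = ScriptSurvey()
survey.feed(SAMPLE_HTML)
for index, row in enumerate(survey.rows):
    print(index, row["type"] or "(no type)", row["length"], row["markers"])
```

A tag that reports a marker is a candidate to open in the browser's page source and inspect by hand before writing any parsing code against it.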