Project Duration/Compensation - Less than a week - TBD (less than $1K)
I'm posting this here in the hope that short-term projects are OK and welcome. (If not, I apologize.)
A crawling project is running into consistency issues with the returned data/content. The crawler targets a dynamic site, so curl/wget won't work and a headless browser solution is required. The apparent problem: the crawler intermittently hits errors and, as a result, returns inconsistent content. However, if the process loops and retries, it eventually gets the correct content. The test URLs work in a live browser (Firefox, Chrome, etc.) and return the result in a few seconds. The test crawler often takes minutes!
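For anyone who wants a concrete picture of the loop-until-correct behavior, here is a minimal sketch. It is not the crawler's actual code. The names `fetch_until_valid`, `fetch`, and `is_valid` are hypothetical, and the attempt limit and backoff delays are assumptions chosen only for illustration.

```python
import time
from typing import Callable, Optional


def fetch_until_valid(
    fetch: Callable[[], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() until is_valid(content) is true or attempts run out.

    Returns the first valid content, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        try:
            content = fetch()
        except Exception:
            # Treat a failed fetch (e.g. a browser timeout) like bad content.
            content = None
        if content is not None and is_valid(content):
            return content
        if attempt < max_attempts - 1:
            # Exponential backoff between attempts: 1s, 2s, 4s, ...
            sleep(base_delay * (2 ** attempt))
    return None
```

With Selenium, `fetch` could be a small function that calls `driver.get(url)` and returns `driver.page_source`, while `is_valid` checks for some text or element that only the correct page contains. Bounding the attempts and delays keeps the worst-case crawl time predictable, which matters when the target is seconds rather than minutes.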
The current stack for the crawler: CentOS 7, Python, Selenium, Chrome (headless)
I'm looking for someone who has serious skills in the domain of headless browser crawling, with a deep/thorough understanding of possible issues with crawling. The goal is to have the crawler return the correct results in a minimum amount of time.
Current possible issues to investigate/solve/handle:
- Gateway Timeout issues
- Page Not Found issues
- Other incorrect/weird content!
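One way to tell these cases apart is to classify the returned page source, since Selenium's WebDriver API does not directly expose the HTTP status code. The sketch below is only an illustration under stated assumptions: `classify_page` and `expected_marker` are hypothetical names, and the text patterns are guesses at what typical error pages contain, not what the target site actually returns.

```python
import re

# Hypothetical marker patterns; real error pages vary by site and proxy.
_ERROR_PATTERNS = [
    ("gateway_timeout", re.compile(r"\b504\b|gateway\s+time-?out", re.I)),
    ("not_found", re.compile(r"\b404\b|page\s+not\s+found", re.I)),
]


def classify_page(html: str, expected_marker: str) -> str:
    """Label a page as 'ok', 'gateway_timeout', 'not_found', or 'unexpected'.

    expected_marker is a string that only the correct page should contain.
    """
    if expected_marker in html:
        return "ok"
    for label, pattern in _ERROR_PATTERNS:
        if pattern.search(html):
            return label
    # Neither the expected content nor a known error page.
    return "unexpected"
```

Logging the label for each attempt would show whether the slow crawls are dominated by timeouts, missing pages, or something else entirely.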
I'm also willing to contemplate that a consistent crawl can't be achieved, but I'm fairly certain the goal can be accomplished.
If anyone wants to reply for more information, or to discuss, feel free to ping me and let's see what happens.
Thanks
-bruce badouglas@gmail.com