425 points | whoishiring | 1 comment

Please state the job location and include the keywords REMOTE, INTERNS and/or VISA when the corresponding sort of candidate is welcome. When remote work is not an option, include ONSITE.

Please only post if you personally are part of the hiring company—no recruiting firms or job boards. Only one post per month, please. If it isn't a household name, explain what your company does.

Commenters: please don't reply to job posts to complain about something. It's off topic here.

Readers: please only email submitters if you personally are interested in the job—no recruiters or sales calls.

To search the thread, try kennytilton's WhoIsHiring browser at https://kennytilton.github.io/whoishiring/ or kristopolous' console script at https://news.ycombinator.com/item?id=10313519.

1. mdouglas_1 No.17507696
Personal Project - Self-funded - California/Florida - REMOTE is fine as long as results are obtained

Project Duration/Compensation - Less than a week - TBD (less than $1K)

I'm posting this here in the hope that short-term projects are valid and welcome. (If not, I apologize.)

A crawling project is running into consistency issues with the returned content. The crawler targets a dynamic site (plain curl/wget won't work), so it requires a headless browser. The apparent problem: the crawler intermittently hits errors and, as a result, returns inconsistent content. However, if the process retries in a loop, it eventually gets the correct content. The test URLs work in a live browser (Firefox, Chrome, etc.) and return the result in a few seconds. The test crawler often takes minutes!
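To make the "retry until the content is correct" behavior concrete, here is a minimal sketch in Python. It is not the existing crawler: fetch_with_retry, its parameters, and the backoff values are hypothetical names chosen for illustration. The idea is to cap the number of attempts and the wait between them, so a bad URL fails fast instead of looping for minutes.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(
    fetch: Callable[[], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() until is_valid(content) is true or attempts run out.

    All names and defaults here are illustrative assumptions.
    Returns the first valid content, or None if every attempt failed.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            content = fetch()
            if is_valid(content):
                return content
        except Exception:
            # Treat any fetch error (timeout, crashed tab, ...) as a failed attempt.
            pass
        if attempt < max_attempts:
            sleep(delay)
            # Double the wait each time, but never exceed max_delay.
            delay = min(delay * 2, max_delay)
    return None
```

Here fetch would wrap whatever loads the page in the headless browser, and is_valid would check for something only the correct page contains. Passing sleep as a parameter keeps the helper easy to test without real waiting.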

The current stack for the crawler -- CentOS 7 / Python / Selenium / Chrome (headless)
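For reference, a page load on this kind of stack might look roughly like the sketch below. It assumes a recent Selenium release with a matching chromedriver installed. The Chrome flags, the timeouts, and the function names chrome_arguments and fetch_rendered are my assumptions, not the actual crawler code. It waits explicitly for one known element instead of sleeping for a fixed time, which is often where minutes get lost.

```python
from typing import List


def chrome_arguments() -> List[str]:
    """Chrome flags for a headless run (assumed values, adjust as needed)."""
    return ["--headless", "--no-sandbox", "--disable-gpu", "--window-size=1280,1024"]


def fetch_rendered(url: str, wait_css: str,
                   page_timeout: float = 30.0, wait_timeout: float = 15.0) -> str:
    """Load url in headless Chrome and return the HTML once wait_css is present.

    Raises selenium's TimeoutException if the page or the element takes too long.
    """
    # Imported here so the helper above is usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    for arg in chrome_arguments():
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        # Fail instead of hanging when the server never finishes responding.
        driver.set_page_load_timeout(page_timeout)
        driver.get(url)
        # Block only until the element that marks a correct page shows up.
        WebDriverWait(driver, wait_timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        # Always release the browser process, even on errors.
        driver.quit()
```

Starting a fresh browser per URL is the simplest option but also the slowest. Reusing one driver across URLs would likely be faster, at the cost of carrying state between pages.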

I'm looking for someone who has serious skills in the domain of headless browser crawling, with a deep/thorough understanding of possible issues with crawling. The goal is to have the crawler return the correct results in a minimum amount of time.

Current possible issues to investigate/solve/handle:

- Gateway Timeout issues
- Page Not Found issues
- Other incorrect/weird content
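One way to handle these cases separately is to classify each returned page before deciding whether to retry. The sketch below is only an illustration: PageStatus, classify_page, and the text patterns are hypothetical, and the real error pages may look different. It assumes a correct page always contains some known marker string.

```python
import re
from enum import Enum


class PageStatus(Enum):
    OK = "ok"
    GATEWAY_TIMEOUT = "gateway_timeout"
    NOT_FOUND = "not_found"
    UNEXPECTED = "unexpected"


# Assumed wording of the error pages; adjust to what the site actually returns.
_TIMEOUT_RE = re.compile(r"\b504\b|gateway time-?out", re.IGNORECASE)
_NOT_FOUND_RE = re.compile(r"\b404\b|page not found", re.IGNORECASE)


def classify_page(html: str, expected_marker: str) -> PageStatus:
    """Sort returned HTML into one of the problem categories listed above."""
    # Check the marker first, so a valid page that merely mentions "404" is still OK.
    if expected_marker in html:
        return PageStatus.OK
    if _TIMEOUT_RE.search(html):
        return PageStatus.GATEWAY_TIMEOUT
    if _NOT_FOUND_RE.search(html):
        return PageStatus.NOT_FOUND
    # Empty, truncated, or otherwise weird content.
    return PageStatus.UNEXPECTED
```

With this, a gateway timeout could be retried after a short wait, while a repeated "not found" might be logged and skipped rather than retried indefinitely.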

I'm also willing to contemplate that a consistent crawl can't be achieved, but I'm fairly certain the goal can be accomplished.

If anyone wants to reply for more information, or to discuss, feel free to ping me and let's see what happens.

Thanks

-bruce badouglas@gmail.com