Yes, the planet (hypothetically, a very Earth-like planet with a 1-year, 1 AU orbit, 24-hour rotation, and 23-degree tilt, at 4 light years' distance like Proxima Centauri b) moves at most 2 AU in a 6-month integration, and the telescope ~700 AU behind the center of the lens would have to move as well to stay on the opposite side of the Sun from it. But that 4 light year distance means the planet is about 250,000 AU from the Sun, so by similar triangles the telescope only has to translate laterally by on the order of 2 * 700 / 250,000 = 0.0056 AU (roughly 840,000 km). You're right that this is far larger than an image sensor would be, and larger than the solar sails that would push this craft would be, but it's inconsequential for a vehicle that's just flown 700 AU.
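To make that arithmetic easy to poke at, here is a minimal Python sketch of the similar-triangles estimate. The function name and the constants are my own, and it assumes the planet's full 2 AU swing is perpendicular to the line of sight, which is the worst case:

```python
# Rough similar-triangles estimate; all names here are illustrative.
AU_PER_LIGHT_YEAR = 63_241.077   # astronomical units in one light year
KM_PER_AU = 149_597_870.7        # kilometres in one astronomical unit


def telescope_shift_au(planet_shift_au, focal_distance_au, target_distance_au):
    """Lateral distance the telescope must move to stay on the line
    through the planet and the center of the Sun (the lens)."""
    return planet_shift_au * focal_distance_au / target_distance_au


target_distance_au = 4 * AU_PER_LIGHT_YEAR      # about 250,000 AU
shift_au = telescope_shift_au(2.0, 700.0, target_distance_au)
shift_km = shift_au * KM_PER_AU

print(f"target distance: {target_distance_au:,.0f} AU")
print(f"telescope shift: {shift_au:.4f} AU ({shift_km:,.0f} km)")
```

Using the exact light-year conversion instead of the rounded 250,000 AU changes the answer only in the third significant digit.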
Planets not only move relative to their star, but they also rotate and tilt. I see a number of artists' depictions of the planet (e.g. at [1]) that look like the satellite just flew into space, illuminated a circular planet with a giant flash bulb, and returned a pixellated photo. I've only thought about this for a minute, but I don't think it would look anything like that.
Trying to integrate an image over the course of a 6-month exposure means not only tracking where the planet is in its orbit but also discerning the longitude on the planet from which a given photon was emitted at a particular time. Plus, if it's tilted at all, we might get many images of the north pole and none of the south pole, or an underexposed image of some polar regions that were only aligned with us for a small part of the exposure. Also, even though this gravitational lens is enormous and can focus many more of the light rays that happen to be aimed at the Sun onto the image sensor than a physical lens or mirror could, the light still has to come from somewhere, specifically the host star. So only half of the sphere can receive photons that might bounce in our direction at any given time, and that half may or may not be aligned with us. Finally, over the course of 6 months, the planet might experience seasons, with changes in the atmosphere and surface ice!
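As a rough illustration of the uneven coverage, here is a toy Python sketch that adds up how much each latitude band is both lit by the star and facing us over the exposure. Everything in it is an assumption of mine: a circular orbit in the x-y plane, a spin axis tilted about the y axis, a Lambertian weight of max(0, n.s) * max(0, n.o), and an observer direction that you pick by hand. It is not a model of the actual mission concept:

```python
import math


def latitude_exposure(lat_deg, tilt_deg=23.0, observer=(1.0, 0.0, 0.0),
                      days=182, steps_per_day=24, year_days=365.25):
    """Relative exposure of one latitude band over the integration.

    Follows a single surface point at this latitude as the planet spins;
    over many rotations it samples the whole band. Each time step adds
    max(0, n.s) * max(0, n.o), with n the surface normal, s the unit
    vector toward the star, and o the unit vector toward the observer.
    """
    lat = math.radians(lat_deg)
    tilt = math.radians(tilt_deg)
    ox, oy, oz = observer
    n_steps = days * steps_per_day
    total = 0.0
    for k in range(n_steps):
        t = k / steps_per_day                  # time in days
        spin = 2 * math.pi * t                 # one rotation per day
        orbit = 2 * math.pi * t / year_days    # orbital phase angle
        # Surface normal in the planet's own frame (spin axis along z).
        bx = math.cos(lat) * math.cos(spin)
        by = math.cos(lat) * math.sin(spin)
        bz = math.sin(lat)
        # Tilt the spin axis by rotating about the y axis.
        nx = bx * math.cos(tilt) + bz * math.sin(tilt)
        ny = by
        nz = -bx * math.sin(tilt) + bz * math.cos(tilt)
        # Planet is at (cos orbit, sin orbit, 0); the star is at the origin.
        sx, sy, sz = -math.cos(orbit), -math.sin(orbit), 0.0
        lit = max(0.0, nx * sx + ny * sy + nz * sz)
        seen = max(0.0, nx * ox + ny * oy + nz * oz)
        total += lit * seen
    return total / n_steps


if __name__ == "__main__":
    for lat_deg in range(-80, 81, 20):
        print(f"latitude {lat_deg:+4d}: {latitude_exposure(lat_deg):.4f}")
```

Changing `observer` to something like `(0.0, 0.0, 1.0)` (looking down on the orbital plane) shows the extreme case where, with no tilt, one hemisphere gets no exposure at all.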
Assembling the raw data into a sharp image would be far more challenging than just opening and closing a shutter then grabbing a serial stream of X by Y pixel data from an image sensor, but the output might be much more than a single image.
[1] https://www.nasa.gov/general/direct-multipixel-imaging-and-s...