
Apple SHARP: When Any Photograph Becomes 3D
I’ve been experimenting with 3D reconstruction from archival photographs for the past year at Spatial Lab, and when we started in early 2025, I wasn’t sure high-quality single-image-to-3D was even possible. Gradually I developed processes with diffusion models that created convincing 3D scenes from 2D images, but they were painstaking and manual, and took hours to complete. Now, with Apple’s SHARP, there is a fast, open-source image-to-3D process freely available.
SHARP converted a single 1930s Dorothea Lange photograph I fed it into a three-dimensional object in seconds. A flat image from the Library of Congress became something I could manipulate and view from different angles.
From a single photograph to 3D. No multi-angle capture or orbiting video required. For someone like me who is used to doing 3D capture with multi-camera rigs, capturing hundreds and sometimes tens of thousands of images to reconstruct scenes, this felt astonishing.
Apple SHARP output
In December, Apple released SHARP, which they describe as “photorealistic view synthesis from a single image.” A more practical description: it’s a 2D-to-3D depth extraction process that turns any photograph you feed it into a 3D scene. It works on a mobile phone photo shot today as well as it does on a hundred-year-old archival image.
The code is open, it runs on Mac, Linux, and Windows, and you can install it yourself or use the prepackaged version in Pinokio. If you’ve ever struggled to get a breakthrough research paper’s code running on your own machine, only to discover it requires an 8-GPU cluster and a terabyte of RAM, you’ll appreciate what Apple has done here. SHARP is genuinely accessible, and that matters because accessibility is what turns a research curiosity into an infrastructure problem. When anyone can do this, everyone will.
And everyone did. Shortly after its release, my LinkedIn feed and 3D Gaussian splat hosting sites like SuperSplat were flooded with SHARP outputs: photos of all kinds turned into 3D.
To be honest, at first I had mixed feelings about SHARP. While it does feel magical in the way it adds depth to 2D photos, the results are fairly limited. The resulting 3D scene is only really viewable from a few degrees off center. Beyond that, the limits of monocular depth extraction become obvious. Faces and objects are curiously squashed when viewed from the side, and the further you move from the photo’s original viewpoint, the more the reconstruction deteriorates. But within those limits the results are truly sharp, retaining the photorealistic and crisp character of source images where diffusion model processes often turn mushy.
Part of me was also frustrated that my painstaking processes were now obsolete, and that a unique capability I’d once enjoyed was now available to everyone and their dog with no effort. But that’s just life in a fast-moving discipline, I suppose.
Apple SHARP Limitations
SHARP doesn’t generate worlds. It doesn’t hallucinate scenery beyond the frame or fill in what the camera didn’t see. What it does is take the flat image you give it and infer depth: deciding what’s in the foreground, what’s in the background, and the general shape of objects in the scene. The result is a 3D relief of the original photograph, not an environment you can walk through.
That sounds modest. But it sits at one end of a spectrum that extends to tools like World Labs’ Marble, Google’s Project Genie, and Tencent’s HunyuanWorld, which produce full 360° 3D environments from text and image prompts. At that end of the spectrum, the provenance question is blunt: how much of what I’m seeing exists in any source material at all? The answer, often, is very little.
SHARP’s provenance question is subtler, and in some ways harder. Every pixel you see in a SHARP output comes from the original photograph. Nothing has been added. But the spatial relationships between those pixels — what’s in front of what, how far apart objects are, the curvature of surfaces — are entirely inferred by the model. The image is real. The depth is a guess. And the guess is informed not by the photograph itself, but by the model’s understanding of how the world generally looks.
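To make that split concrete, here is a minimal sketch of how a simple pinhole camera model lifts a pixel into 3D. This is not SHARP’s code or architecture; the function name `backproject`, the focal length, the image center, and the depth values are all hypothetical and chosen only for illustration. The point is that the color is copied from the photograph, while the position depends entirely on whatever depth is supplied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point3D:
    x: float
    y: float
    z: float
    rgb: tuple  # copied unchanged from the source pixel


def backproject(u, v, depth, rgb, focal, cx, cy):
    """Lift pixel (u, v) to 3D with a pinhole camera model.

    The color comes from the photograph. The position depends entirely
    on `depth`, which in a single-image pipeline is a model estimate.
    """
    x = (u - cx) * depth / focal
    y = (v - cy) * depth / focal
    return Point3D(x, y, depth, rgb)


# Hypothetical values: one pixel, two different depth guesses.
pixel = dict(u=800, v=300, rgb=(182, 170, 151), focal=1000.0, cx=500.0, cy=300.0)
near = backproject(depth=2.0, **pixel)
far = backproject(depth=4.0, **pixel)

print(near.rgb == far.rgb)  # the observed color is identical
print(near.x, far.x)        # the inferred position is not
```

The same pixel lands in two different places depending on the depth guess, and nothing in the resulting point records which guess was used or how confident it was.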
For cultural memory and education, this distinction might not matter much. A 3D rendering of a historical photograph that adds a sense of depth without fabricating content is a powerful tool for engagement. But for evidence and documentation, the distinction matters enormously. If a legal team presents a 3D reconstruction of a crime scene photograph, the jury needs to know that the spatial relationships they’re perceiving (which object was closer to which, whether a doorway was within reach) are inferences, not measurements. The image says “this is what was there.” The depth says “this is roughly where we think it was.” Those are different claims with different evidentiary weight.
What we observed. In our experiments converting archival photographs with SHARP, a few things stood out.
The depth estimation is more convincing than you’d expect, and that’s potentially a problem. When we converted FSA photographs from the 1930s, the results felt spatially plausible. Objects appeared to sit at reasonable distances from each other. Rooms had a sense of volume. But nothing in the output distinguishes estimated depth from measured depth. A viewer has no way of knowing whether the spatial arrangement they’re seeing reflects the actual scene or the model’s best guess about scenes that generally look like this.
Right now, SHARP’s own limitations provide a kind of accidental honesty: when you move a few degrees off center, the illusion breaks, signaling the boundary between what the model knows and what it’s guessing. But it won’t always be this way. As monocular depth estimation improves, the artifacts will shrink, and the line between measured and inferred will become invisible. The time to build trust infrastructure is now, while the seams are still showing.
And regardless of how good the depth estimation gets, metadata doesn’t survive the conversion. A photograph with C2PA content credentials enters the SHARP pipeline; what exits is a 3D object with no connection to the original provenance chain.
SHARP monocular depth detail
The spectrum from monocular depth estimation to full world generation is filling in fast, and the provenance challenges are different at each point. For SHARP, the question is: are the spatial relationships documented or inferred? For world models, the question is: how much of this environment exists in any source material at all? Both need answers, and both need different labels.
At Spatial Lab we’re working on a provenance layer that persists through reconstruction: for any given 3D creation, it spells out what the inputs were and which processes and transformations were applied. We’ve been prototyping what we call a “nutrition label” for synthetic media, and the need for it just became a lot more urgent.
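As a rough illustration of what such a label might record, here is a minimal sketch using only Python’s standard library. It is not Starling Lab’s prototype and not the C2PA schema; the field names, the `build_label` and `verify_source` helpers, and the placeholder byte strings are all hypothetical. The idea is simply to bind the 3D output to a hash of its source image and to state which properties were observed and which were inferred.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_label(source_bytes: bytes, output_bytes: bytes, steps: list) -> dict:
    """Build a hypothetical provenance record for a 3D reconstruction."""
    return {
        "source_sha256": sha256_hex(source_bytes),
        "output_sha256": sha256_hex(output_bytes),
        "steps": steps,                     # ordered list of transformations
        "observed": ["pixel_color"],        # taken directly from the photograph
        "inferred": ["depth", "geometry"],  # estimated by a model
    }


def verify_source(label: dict, candidate_bytes: bytes) -> bool:
    """Check whether `candidate_bytes` is the image the label was built from."""
    return label["source_sha256"] == sha256_hex(candidate_bytes)


# Placeholder bytes stand in for a real photograph and a real 3D file.
photo = b"placeholder bytes for an archival photograph"
scene = b"placeholder bytes for the reconstructed 3D scene"
label = build_label(photo, scene, steps=["single-image 3D reconstruction"])

print(json.dumps(label, indent=2))
print(verify_source(label, photo))
```

A record like this travels alongside the 3D file rather than inside the original image’s metadata, which is one way it could survive a conversion that discards embedded credentials. A real system would also need to sign the record so it cannot be silently edited.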
If you’re working with archival photographs, spatial documentation, or digital evidence and grappling with these questions, we’d like to hear from you — especially if you’ve tried to maintain provenance through a 3D reconstruction pipeline. Reach out at info@starlinglab.org.
Spatial Lab is a publication of Starling Lab, a joint initiative of Stanford University and USC focused on data integrity. We cover spatial intelligence technologies for journalism, law, and historical documentation.