A Generative MultiModal Augmented Reality Art app for the ODSC West MultiModal Hackathon (sponsored and facilitated by Weaviate)
- Artists can generate 3D objects using Meshy's text-to-3D generative AI technology. The alpha version of its API was released at the end of October 2023, a week before the hackathon.
- Artists can also generate music or audio using Meta's MusicGen, part of the AudioCraft technology family, which became available only recently (August 2023).
- Users can search for artworks in a multimodal manner, powered by Weaviate, more specifically by Weaviate's multimodal ImageBind embedding capabilities. For example, searching for "salsa" should return both artworks with salsa music and artworks depicting a bowl of salsa sauce; searching for "metal" should match both metal genre music and art installations made of iron and other metal components.
- Users can examine a chosen artwork in Augmented Reality, powered by 8thwall.
- Users may choose to mint an artwork as an NFT to become part of the related NFT collection, powered by NFTPort.
Everyone brought many key competencies that benefited the hackathon; I'll list only each person's main role in this particular project.
| Name | GitHub | Ref | Hackathon role |
|---|---|---|---|
| Csaba Toth | @MrCsabaToth | https://csaba.page | back-end |
| Avi Rao | | https://www.linkedin.com/in/avi-nav-rao-39a567205/ | front-end |
| Yvonne Fang | @y1vonnef | https://yvonnef.net/ | front-end |
| Andrew Savala | @redswimmer | https://www.linkedin.com/in/redswimmer/ | back-end |
| Quinton Mills | @quintonmills | https://www.linkedin.com/in/quinton-mills/ | front-end |
Due to the extremely tight time constraints of the hackathon, we primarily looked for readily available solutions and easily accessible APIs.
Weaviate provided our multimodal generative search engine. Because the multimodal support is a cutting-edge development, it was not available on the managed WCS (Weaviate Cloud Services), so we had to host it ourselves. The ImageBind component requires 6 GB of memory (we configured 8 GB for safety). Csaba first tried to spin up an EC2 instance, but he realized that he only had a CPU quota increase for SageMaker.
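As an illustration of how the multimodal search could be wired up, here is a minimal sketch of a Weaviate class definition and a `nearText` GraphQL query. The class name `Artwork` and all property names are hypothetical, not the project's actual schema. The module name `multi2vec-bind` and its `textFields` / `imageFields` / `audioFields` keys are what we understand Weaviate's ImageBind module to expect; treat them as assumptions and check them against the Weaviate version you deploy.

```python
import json

# Hypothetical class and property names, for illustration only.
# "multi2vec-bind" is assumed to be the name of Weaviate's ImageBind module.
ARTWORK_CLASS = {
    "class": "Artwork",
    "vectorizer": "multi2vec-bind",
    "moduleConfig": {
        "multi2vec-bind": {
            # Each list names the properties embedded for that modality.
            "textFields": ["title", "description"],
            "imageFields": ["preview"],
            "audioFields": ["audio"],
        }
    },
    "properties": [
        {"name": "title", "dataType": ["text"]},
        {"name": "description", "dataType": ["text"]},
        # Binary media is stored base64-encoded in blob properties.
        {"name": "preview", "dataType": ["blob"]},
        {"name": "audio", "dataType": ["blob"]},
    ],
}


def near_text_query(concept: str, limit: int = 5) -> str:
    """Build a GraphQL Get query ranking artworks by similarity to a text concept."""
    # json.dumps quotes and escapes the search term for use inside the query.
    concepts = json.dumps([concept])
    return (
        "{ Get { Artwork("
        f"nearText: {{concepts: {concepts}}}, limit: {int(limit)}"
        ") { title description _additional { distance } } } }"
    )


if __name__ == "__main__":
    print(json.dumps(ARTWORK_CLASS, indent=2))
    print(near_text_query("salsa"))
```

Because text, images, and audio are embedded into the same vector space, a single text query such as `near_text_query("salsa")` can surface both a music piece and a 3D object preview. The query string can be posted to the instance's GraphQL endpoint or passed to a client library's raw query method.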
Text-to-3D and image-to-3D are rapidly emerging areas within generative AI. One of the biggest challenges is the last mile, when a mesh has to be generated. Stable diffusion models can inherently generate a point cloud or a NeRF, but that's not usable for traditional 3D systems; we need a mesh model. This step, just like many problems such as tessellation, is NP-hard. I tried Shap-E (available on HuggingFace and Replicate as well), but these models seemingly generate the mesh from voxels, and my rubber ducky example resulted in a 13 MB obj file without a texture. We couldn't find DreamFusion accessible on HuggingFace or Replicate. In the end, the quality of the generated model and its assets matters for real-world usage. After testing many endpoints we stumbled across Meshy, which was able to generate sub-megabyte glb models that even contain the texture! According to our tests it takes less than 5 seconds to receive a model. In the future we may give DreamGaussian/Craft another try, as these are newer techniques than Shap-E.
Google introduced MusicLM to the wider public in May, and tests via the GUI and the samples showed satisfying results, but there's no API yet. An unofficial "API" uses web scraping to drive the web GUI, which is too ugly a solution for a back-end. After going through some more models and looking for readily available solutions, we settled on Meta's MusicGen, part of their AudioCraft technology family. It is also available on Replicate in multiple variations. The official Meta model only offers the large and the melody flavors. According to our tests, creating our own replica and serving it from a beefy box didn't yield faster generation. It seems that a 10-20 second long music sample takes about 15-20 seconds to generate. The melody flavor respected a chillout + ambient + no beat request better, whereas the large model still generated beats into it. For speedier generation we may still resort to the small model, which could be faster.
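A minimal sketch of how such a generation request could be issued through Replicate's Python client. The input keys `prompt`, `duration`, and `model_version` are our understanding of the MusicGen model's inputs on Replicate and may differ between model versions; the helper names are hypothetical, and the caller has to supply the full model reference (owner, name, and version) copied from the model's Replicate page.

```python
from typing import Any, Dict


def musicgen_input(prompt: str, seconds: int = 15, flavor: str = "melody") -> Dict[str, Any]:
    """Build the input payload for a MusicGen run (key names are assumptions)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return {
        "prompt": prompt,
        "duration": seconds,      # requested clip length in seconds
        "model_version": flavor,  # e.g. the melody or large flavor
    }


def generate_music(model_ref: str, prompt: str, seconds: int = 15, flavor: str = "melody"):
    """Run the model on Replicate and return whatever the model outputs.

    model_ref is the full Replicate model reference, supplied by the caller.
    Requires the third-party `replicate` package and the REPLICATE_API_TOKEN
    environment variable; imported lazily so the payload helper works without it.
    """
    import replicate

    return replicate.run(model_ref, input=musicgen_input(prompt, seconds, flavor))


if __name__ == "__main__":
    # Only builds and prints the payload; no network call is made here.
    print(musicgen_input("chillout ambient, no beat", seconds=15))
```

Keeping the payload builder separate from the network call makes it easy to switch flavors (for example to a smaller, faster model) without touching the calling code.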
Csaba has experience with ARCore and SceneView (see BWSS ARMap or AR Physics Experiments), but we wanted a multi-platform solution. PWAs (Progressive Web Apps) offer a seamless experience both from the developer's perspective (no need to build for the native platform and struggle with store submissions and updates) and from the user's perspective (no need to search for the app in an app store, then download and install it). Our concept doesn't need any native sensors or features that a web browser doesn't provide, and we may even get better integration with the NFT minting feature later (think of web3 and its synergy with cryptocurrencies and blockchains). Avi recommended 8thwall as a great platform that fulfills all of our needs. Its main focus is WebAR, and the tech stack is React, which is a dominant web front-end technology these days; all of our front-end team members had React (along with AR) among their competencies. As a bonus, 8thwall applications can also connect to users' wallets, which makes crypto and blockchain features easier to implement.
NFTPort is a platform made for developers to add NFT-related features to applications. It is centered around an API and shields developers from many technical details of crypto that are otherwise hard to tackle.
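To give a feel for how much the API hides, here is a minimal sketch that prepares a mint request using only the Python standard library. It assumes NFTPort's "easy minting" endpoint (`/v0/mints/easy/urls`) and its `chain`, `name`, `description`, `file_url`, and `mint_to_address` fields as we remember them from the NFTPort documentation; the project itself may use a different endpoint, for example one that mints into its own collection contract. The function name and all argument values are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint for NFTPort's URL-based easy minting.
NFTPORT_EASY_MINT_URL = "https://api.nftport.xyz/v0/mints/easy/urls"


def build_mint_request(
    api_key: str,
    name: str,
    description: str,
    file_url: str,
    mint_to_address: str,
    chain: str = "polygon",
) -> urllib.request.Request:
    """Prepare (but do not send) a POST request asking NFTPort to mint an NFT."""
    payload = {
        "chain": chain,
        "name": name,
        "description": description,
        "file_url": file_url,                # publicly reachable artwork asset
        "mint_to_address": mint_to_address,  # wallet that receives the NFT
    }
    return urllib.request.Request(
        NFTPORT_EASY_MINT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_mint_request(
        api_key="YOUR_API_KEY",
        name="Rubber ducky",
        description="Generated 3D artwork",
        file_url="https://example.com/ducky.glb",
        mint_to_address="0x0000000000000000000000000000000000000000",
    )
    # Sending would be: urllib.request.urlopen(req); here we only inspect it.
    print(req.get_method(), req.full_url)
```

The caller never deals with wallets, gas, or contract ABIs directly: a single authenticated HTTP request carries the artwork URL and the recipient address.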