
“A PR stunt,” says X Square Robot’s CEO: humanoid robots are not suited to factories, and such deployments should be avoided.


If you only aim to follow others, then you’re already behind. This is a weak-minded approach to building tech.

Wang Qian exudes the calm, measured, and composed presence of a scholar. But when the conversation turns to embodied intelligence, another side of him emerges: sharp, adamant, and unflinching.

“Starting a business takes real resolve,” Wang said. “If you have a back-up plan on day one, your mindset is flawed.” He completed his bachelor’s, master’s, and doctorate at Tsinghua University, then briefly ran a hedge fund in the US but found it difficult to stay away from robotics. “I couldn’t get to sleep for days,” he said, regretting that he had not pursued robots full-time.

He shut down the fund in 2023, returned to China, and founded X Square Robot in Shenzhen.

In less than 18 months, the startup raised more than RMB 1 billion (USD 140 million) across seven funding rounds. On May 12, Meituan led a new investment in the company described as a nine-figure RMB sum.

China’s embodied artificial intelligence sector began to take shape around that time. Nvidia CEO Jensen Huang declared it the future of technology, and companies such as Galbot and Agibot were also founded in 2023.

X Square attracted little attention at first, but each funding round has brought more.

An unnamed investor told 36Kr that humanoid robot companies in China now fall into distinct tiers. Unitree Robotics sits at the top, followed by Agibot and Galbot, each of which has raised over RMB 1.5 billion. X Square, with its current funding total, sits at the edge of this group.

The embodied AI landscape in China is also polarized, as the foundational model field was before it. Some investors, such as venture capitalist Zhu Xiaohu, question the field’s prospects for commercial success despite flashy robot demonstrations. Others place large bets, backing companies in the race to mass production.

Wang has been a strong believer.

From the outset, X Square has pursued a specific technical path: an end-to-end vision-language-action (VLA) model, with new versions released every two to three months.

When US-based Physical Intelligence (PI) launched its own VLA model, the approach X Square had already chosen became an industry standard.

While most companies focused on simple pick-and-place tasks, X Square’s Wall-A robot performed more complex functions, including handling clothing, organizing homes, and routing cables.

Some critics argue that embodied AI for general-purpose applications is still premature. Wang disagrees. He believes progress is exceeding expectations and that models with capabilities comparable to GPT-3 will be available within a year, with commercial use following one to two years later.

Current deployments are limited to research, education, and concierge-style roles, a scope Wang considers too narrow. “Putting humanoids into factories to perform repetitive tasks? That’s just a publicity stunt,” he said.

Wang argues that meaningful commercialization depends on improving generalization: the model’s ability to adapt to different tasks and environments.

For now, monetization does not appear to be a priority for X Square. Around two-thirds of its budget is dedicated to model development and related work.

“To put it bluntly, we are leading the domestic field in embodied AI model development,” Wang said. “Investors prefer to back frontrunners. They believe we’ll deliver outsized gains and want us to focus on building a general-purpose model.”

Photo of Wang Qian. He is the founder and CEO of X Square Robot. Photo source: X Square Robot.

The following transcript has been edited for clarity and brevity.

36Kr: What technical progress has X Square achieved over the past six months?

Wang Qian (WQ): Our progress has been rapid, with new model versions released every two to three months. Earlier models produced only actions: multimodal input, unimodal output. Since October and November last year, we have switched to “any-to-any” models: multimodal input and multimodal output. Our models now produce not only actions but also language and visual outputs.

In addition, we’ve developed long chain-of-thought (CoT) reasoning. Around the time of our last two funding rounds, CoT was working in our full-modality framework.

Google’s Gemini Robotics published similar results in March: any-to-any and CoT. PI’s Pi-0.5 also followed this structure. We were ahead of the curve and have kept pace with global leaders. We’re technically on par with PI, Google, and Microsoft.

36Kr: Has the VLA architecture become the industry standard?

WQ: Yes, especially after PI launched its model last October. Everyone understood that end-to-end (E2E) was the way to go.

Everyone is waving the flag now, but execution varies widely. Some companies redefine E2E to suit their own needs.

Two main approaches exist. The first is a two-layer system: a high-level vision-language model (VLM) for reasoning and planning, and a low-level VLA model for generating actions. The other uses a single unified model. We tested both and found that the single-layer version offered a higher performance ceiling.

36Kr: What is the alternative to E2E?

WQ: Some people still use traditional setups, such as 3D vision for perception and rule-based control. That’s fine for basic pick-and-place tasks, such as those in legacy industrial automation, but it’s not where we’re going. Even Figure AI, Boston Dynamics, and other companies have moved past that.

36Kr: If we compare embodied AI to large language models, where is the field today?

WQ: We are at the GPT-2 stage. GPT-3 had scale-dependent capabilities that we haven’t yet reached, nor have others like PI or Google. All of it is governed by scaling laws.

36Kr: When will commercialization in China really take off?

WQ: Within a year if things go well, two if the pace is slower. I’m talking about customers paying for solutions. Household robots will take longer: probably three to five years.

People tend to overestimate short-term possibilities and underestimate long-term ones. I think embodied artificial intelligence will arrive sooner than most people expect.

36Kr: Everyone says that data is the bottleneck. Do you have enough data?

WQ: The issue is the timeline more than the data. Collecting more data won’t help if you don’t understand embodied AI models; it might even slow you down, since much of it will be low quality or irrelevant.

Having a lot of data is not enough; knowing what kind of data matters is key. We have focused on quality and targeting, which is more efficient.

The public datasets are often inadequate. We rely primarily on our own data.

36Kr: Startups have been spending less recently. Are you preparing for an upcoming cooldown?

WQ: We’re frugal and won’t spend money where it isn’t needed, but building long-term value requires investment. Copying open-source robots and models is not only unambitious; it will also keep you from ever achieving general-purpose robots.

Lack of confidence usually reflects a lack of capability. If you believe in yourself, you act accordingly. Why wait for a boom when you can be the one to lead it?

36Kr: How do investors evaluate your technical progress? Videos or live demos?

WQ: Always live. We have insisted on real-time demonstrations from the beginning. Videos can be faked. Hands-on interaction is the only way to see true performance, particularly when investors try to throw the robots off balance or introduce stress conditions.

36Kr: Are investors pressing you to commercialize at this valuation?

WQ: It depends. Some care more about the model’s long-term potential; others are more concerned with near-term commercialization. We’ve earned more flexibility because we’re leading in technology. Investors expect us to go beyond superficial milestones and pursue meaningful commercialization.

36Kr: But you still haven’t released any robot hardware?

WQ: It just hasn’t been widely released. Some units have already been deployed in service roles, and we’ll be releasing additional models soon.

36Kr: Is the technology ready for use in the service sector?

WQ: We’re still conducting proof-of-concept pilots with seed clients. We plan to deploy the full system by the end of this year or early next. And we’re not limiting ourselves to simple pick-and-place jobs.

Simple jobs don’t test a model’s capabilities. Legacy tech could handle them. We’re aiming for complex, open-ended, and varied scenarios.

36Kr: What margins can you expect when real deployments begin?

WQ: Traditional service robots are task-specific. Ours are general-purpose, so their value depends on what they can do. Early profitability is not the goal; we are refining our product based on actual usage.

36Kr: Your peers are focusing on education and retail concierge. Are these mature markets?

WQ: These are marketable, but their value is questionable. They mainly serve to reassure investors and are too small to be the endgame.

These are good as byproducts, but if they take up too much time, you lose focus on the main goal.

36Kr: If general-purpose AI is too difficult, why not settle for niche commercialization?

WQ: Then why enter this field at all? There’s no sense in starting if that’s not the goal.

36Kr: Some say Figure AI’s BMW factory deployment was overhyped. What are your thoughts on factory use cases?

WQ: Humanoids in factories doing repetitive tasks? It’s a PR stunt. Given current demands for speed and accuracy, older tech is often more reliable.

Factory environments are closed and structured, which is not ideal for training generalist models. Embodied AI requires complexity, randomness, and open-ended interaction; that is where models grow. In economics, people debate whether supply creates demand or vice versa. In embodied AI, supply clearly creates demand.

36Kr: US counterparts have higher valuations and deeper pockets. How large is the gap between China and the US?

WQ: The gap is still significant. We keep a close eye on PI, Google, and Tesla.

But we have a good chance of catching up this year or next. China’s follower mentality is a habit from the past; it isn’t necessary in embodied AI. We’ve already matched and even surpassed PI on many metrics.

36Kr: PI has open-sourced its Pi-0 model. Will this level the playing field?

WQ: In the six months since, several Chinese companies have tried to fine-tune it, but the results haven’t matched PI’s proprietary setup. Cross-platform adaptation remains a major challenge.

36Kr: What can Pi-0 do commercially?

WQ: Performance drops sharply on new hardware, which makes it difficult to commercialize. PI probably open-sourced the model because it couldn’t deploy independently; it doesn’t make hardware and relies on partners for integration.

36Kr: Is it a good strategy to wait for open-source models and then follow along?

WQ: It may sound practical, but it’s misguided. Embodied AI doesn’t work like LLMs; you can’t fine-tune your way to success. You’ll still run into the same roadblocks.

Even worse, your team’s morale will suffer. How can a team believe in building from scratch if the leadership doesn’t?

Innovation requires conviction and creativity. Copying is not enough.

36Kr: Could embodied AI split into open and closed ecosystems, like LLMs?

WQ: No, it’s not the same. Open source is a flawed approach for integrated systems, especially when it’s time to commercialize; drones, self-driving vehicles, and other fields have shown this. People expect open source to succeed in embodied AI because it did for LLMs, but LLMs are software. Embodied AI requires physical hardware and real-world interaction, which reduces public transparency and engagement.

PI’s Pi-0 was world-class but didn’t make a big splash. Without real-world interaction, such models remain academic.

You can’t reproduce open-source models exactly; no lab can replicate another’s environment, and hardware-based data cannot be distilled. That’s a big difference from software.

Open-source models won’t drive the market for embodied AI. Not in this domain.

KrASIA Connection features translated and adapted content that was originally published by 36Kr. This article was written by Wang Fangyu for 36Kr.
